
Online Retail Dataset - Classification, Clustering and Regression

Updated: Nov 3, 2021



Description:


The Online Retail II data set contains all transactions recorded by a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. It can be used for regression, clustering and classification, for example to predict item sales, or to predict which previously purchased products a customer is most likely to buy again in their next order.


Recommended Models:


Suggested algorithms include Apriori, FP-Growth, random forest regression, linear regression, ridge regression and lasso regression.
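
To give a feel for what Apriori-style mining does with this kind of data, here is a minimal sketch of its first two passes in plain Python. It is not a full Apriori implementation: it only counts single items and pairs. The baskets, the stock codes in them and the `min_support` threshold are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set holds the stock codes found on one invoice.
baskets = [
    {"85123A", "71053", "84406B"},
    {"85123A", "71053"},
    {"85123A", "84406B"},
    {"71053", "22752"},
]
min_support = 0.5  # assumed threshold: itemset must occur in at least 50% of baskets

n = len(baskets)

# Pass 1: support of single items.
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {item for item, c in item_counts.items() if c / n >= min_support}

# Pass 2: a pair can only be frequent if both of its items are frequent,
# so infrequent items are dropped before pairs are generated.
pair_counts = Counter()
for basket in baskets:
    kept = sorted(basket & frequent_items)
    pair_counts.update(combinations(kept, 2))

frequent_pairs = {
    pair: c / n for pair, c in pair_counts.items() if c / n >= min_support
}
print(frequent_pairs)
```

On real data, the baskets would be built by grouping invoice lines by invoice number, and a library implementation of Apriori or FP-Growth would normally replace the hand-written loops.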


Recommended Projects:


Predict item sales, or predict which previously purchased products a customer is most likely to buy again in their next order.
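
A simple baseline for the repeat-purchase project is to recommend, for each customer, the products they have bought most often. The sketch below assumes the purchase history has already been reduced to (customer ID, stock code) pairs; the rows and the helper name `top_repeat_products` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (customer_id, stock_code) rows, one per invoice line.
history = [
    (13085, "85048"), (13085, "22041"), (13085, "85048"),
    (13085, "21232"), (13085, "85048"), (13085, "22041"),
    (13078, "22752"), (13078, "22752"),
]


def top_repeat_products(rows, k=2):
    """Return each customer's k most frequently purchased stock codes."""
    counts = defaultdict(Counter)
    for customer_id, stock_code in rows:
        counts[customer_id][stock_code] += 1
    return {
        customer_id: [code for code, _ in counter.most_common(k)]
        for customer_id, counter in counts.items()
    }


recommendations = top_repeat_products(history)
print(recommendations)
```

Any trained classifier for this task should be compared against a frequency baseline like this one before it is considered useful.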




Overview of data


Detailed overview of dataset

  • Records in the dataset: 525,461 rows

  • Columns in the dataset: 8 columns

  1. InvoiceNo: A 6-digit integral number uniquely assigned to each transaction. If the code starts with the letter 'C', the transaction is a cancellation (Nominal)

  2. StockCode: A 5-digit integral number uniquely assigned to each distinct product (Nominal)

  3. Description: Product (item) name (Nominal)

  4. Quantity: The quantity of each product (item) per transaction (Numeric)

  5. InvoiceDate: The date and time when each transaction was generated (Date-time)

  6. UnitPrice: Product price per unit in sterling (Numeric)

  7. CustomerID: A 5-digit integral number uniquely assigned to each customer (Nominal)

  8. Country: Name of the country where each customer resides (Nominal)

Note: the EDA code below reads the invoice and price columns as Invoice and Price, so check online_reatail_data.columns after loading in case the names in the file differ from this list.
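
The cancellation rule described for the invoice code can be applied with a string prefix check. This is a minimal sketch on made-up rows; the column names `Invoice`, `Quantity` and `Price` follow the EDA code on this page and are assumptions about the actual file.

```python
import pandas as pd

# Hypothetical sample rows; values are invented for illustration.
sample = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435"],
    "Quantity": [12, -12, 6],
    "Price": [6.95, 2.95, 1.25],
})

# Cast to string first: purely numeric invoice codes may be loaded as integers.
sample["is_cancelled"] = sample["Invoice"].astype(str).str.startswith("C")

# Keep only the lines that were not cancelled.
completed = sample[~sample["is_cancelled"]]
print(completed)
```

Whether cancellations should be dropped or kept depends on the task: for sales forecasting they are usually treated separately, while for revenue totals they may need to stay in.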

EDA[Code]


Dataset


import pandas as pd

# Load data (read_excel reads only the first sheet of the workbook by default)
file_loc = "data\\online_retail_II.xlsx"

online_reatail_data = pd.read_excel(file_loc)
online_reatail_data.head()


Total Number of Rows and Columns


rows_col = online_reatail_data.shape
print("Total number of records in the dataset : ", rows_col[0])
print("Total number of columns in the dataset : ", rows_col[1])


Check Details


# data information
online_reatail_data.info()


Check the number of missing values


# check missing values
online_reatail_data.isna().sum()
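
Raw counts are easier to judge as a share of all rows. The sketch below shows one way to do that, and to drop rows that cannot be attributed to a customer, on a small made-up frame. The column name `Customer ID` is an assumption; use whichever name `info()` reports for the file.

```python
import pandas as pd

# Hypothetical sample with gaps; values are invented for illustration.
sample = pd.DataFrame({
    "Description": ["WHITE MUG", None, "RED BAG"],
    "Customer ID": [13085.0, None, None],
    "Quantity": [6, 1, 3],
})

# Share of missing values per column, as a percentage.
missing_pct = sample.isna().mean().mul(100).round(1)
print(missing_pct)

# Keep only the rows that have a customer identifier.
known_customers = sample.dropna(subset=["Customer ID"])
print(len(known_customers))
```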


Statistical Information


# statistical information
online_reatail_data.describe()


Data Visualization


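Number of records per country


import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
plt.figure(figsize = (15,8))
sns.countplot(x='Country', data=online_reatail_data)
# Rotate the country labels after plotting so the setting applies to the final ticks
plt.xticks(rotation=65,size=10)
plt.show()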

Country count

Orders per month


import matplotlib.pyplot as plt
import seaborn as sns

# Derive calendar fields from the invoice timestamp
online_reatail_data['month'] = online_reatail_data['InvoiceDate'].dt.month
online_reatail_data['year'] = online_reatail_data['InvoiceDate'].dt.year
# First day of each month, used as the x-axis for the monthly plots
online_reatail_data['month_year'] = pd.to_datetime(online_reatail_data[['year', 'month']].assign(day=1))
# Line-level revenue; rows with a negative Quantity reduce the total
online_reatail_data['revenue'] = online_reatail_data['Price'] * online_reatail_data['Quantity']

sns.set_theme(style="whitegrid")
plt.figure(figsize=(10,5))
# count() counts invoice lines per month, not distinct invoices
plot = pd.DataFrame(online_reatail_data.groupby(['month_year'])['Invoice'].count()).reset_index()
ax = sns.lineplot(x="month_year", y="Invoice", data = plot)
plt.show()
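
Because one invoice usually spans several rows, counting rows and counting distinct invoices give different monthly figures. The sketch below contrasts the two on made-up data; the column names follow the code above, and `lines` and `orders` are labels chosen here for illustration.

```python
import pandas as pd

# Hypothetical invoice lines; invoice 489434 appears on two rows.
sample = pd.DataFrame({
    "Invoice": ["489434", "489434", "489435", "C489449", "489500"],
    "InvoiceDate": pd.to_datetime([
        "2009-12-01 07:45", "2009-12-01 07:45", "2009-12-01 07:46",
        "2009-12-02 10:00", "2010-01-04 09:00",
    ]),
})

# Truncate each timestamp to the first day of its month.
sample["month_year"] = sample["InvoiceDate"].dt.to_period("M").dt.to_timestamp()

# count -> invoice lines per month, nunique -> distinct invoices per month.
monthly = (
    sample.groupby("month_year")["Invoice"]
    .agg(lines="count", orders="nunique")
    .reset_index()
)
print(monthly)
```

If the plot is meant to show orders rather than line items, the `orders` column is the one to pass to `sns.lineplot`.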

Orders per month

Revenue per month


data2 = pd.DataFrame(online_reatail_data.groupby(['month_year'])['revenue'].sum()).reset_index()
plt.figure(figsize=(10,5))
ax = sns.lineplot(x = 'month_year', y='revenue', data = data2)


Revenue per month

Revenue by country


data3 = pd.DataFrame(online_reatail_data.groupby(['Country'])['revenue'].sum()).reset_index()
plt.figure(figsize=(15,5))
ax=sns.barplot(x='Country', y='revenue',data=data3)
plt.xticks(rotation=65,size=10)
plt.show()
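
With many countries on the x-axis, the bar chart is easier to read when the totals are sorted and limited to the largest few. A minimal sketch on made-up numbers; `top_n` is an assumed cut-off and the frame stands in for `online_reatail_data`.

```python
import pandas as pd

# Hypothetical revenue rows; values are invented for illustration.
sample = pd.DataFrame({
    "Country": ["United Kingdom", "France", "United Kingdom", "Germany", "France"],
    "revenue": [100.0, 40.0, 60.0, 25.0, 10.0],
})

top_n = 2  # assumed cut-off
top_countries = (
    sample.groupby("Country")["revenue"]
    .sum()
    .sort_values(ascending=False)
    .head(top_n)
    .reset_index()
)
print(top_countries)
```

The resulting frame can be passed to `sns.barplot` in place of `data3`.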

Revenue by country

