Oct 29, 2021

Online Retail Dataset - Classification, Clustering and Regression

Updated: Nov 3, 2021

Description :

The Online Retail II data set contains all transactions of a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. It can be used for regression, clustering and classification, e.g. to predict the sales of items, or to predict which previously purchased products a customer is most likely to buy again in their next order.

Recommended Model :

Algorithms that can be used: Apriori, FP-Growth, random forest regressor, linear regression, ridge regression, lasso regression, etc.
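Apriori and FP-Growth both search for itemsets whose support (the share of baskets that contain them) reaches a threshold. The sketch below is not either algorithm; it only counts support for item pairs with the standard library, to show the quantity those algorithms work with. The function name `pair_support`, the threshold and the baskets are illustrative assumptions, not part of the dataset documentation.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets, min_support=0.5):
    """Return {(item_a, item_b): support} for pairs meeting min_support."""
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical order; set() drops repeated items
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical baskets: one list of StockCode values per invoice
baskets = [
    ["85048", "79323P", "22041"],
    ["85048", "22041"],
    ["79323P", "21232"],
    ["85048", "22041", "21232"],
]
frequent_pairs = pair_support(baskets, min_support=0.5)
print(frequent_pairs)
```

On the real data, the baskets could be built by grouping StockCode values per invoice; a dedicated library is likely a better choice than this brute-force count once itemsets larger than pairs are needed.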

Recommended Projects :

Predict the sales of items, or predict which previously purchased products a customer is most likely to buy again in their next order.

Dataset link

UCI MLR - https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

Kaggle - https://www.kaggle.com/lakshmi25npathi/online-retail-dataset

Overview of data

Detailed overview of dataset

  • Records in the dataset = 525,461 rows (the first sheet of the Excel file, which is what pd.read_excel loads by default)

  • Columns in the dataset = 8 columns

  1. Invoice: A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation (Nominal)

  2. StockCode: A 5-digit integral number uniquely assigned to each distinct product (Nominal)

  3. Description: Product (item) name (Nominal)

  4. Quantity: The quantity of each product (item) per transaction (Numeric)

  5. InvoiceDate: The day and time when each transaction was generated (Numeric)

  6. Price: Product price per unit in sterling (Numeric)

  7. Customer ID: A 5-digit integral number uniquely assigned to each customer (Nominal)

  8. Country: Name of the country where each customer resides (Nominal)
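Because cancellations are marked only by a 'C' prefix on the invoice number, it is usually worth separating them before computing sales figures. A minimal sketch on toy rows, assuming the invoice column is named `Invoice` (the name used by the code later in this post); the values are made up for illustration.

```python
import pandas as pd

# Toy rows using the column names that the code later in this post relies on
toy = pd.DataFrame({
    "Invoice": [489434, "C489449", 489435, "C489459"],
    "Quantity": [12, -12, 6, -3],
    "Price": [6.95, 2.95, 1.25, 4.25],
})

# Invoice can mix integers and strings, so cast to str before testing the prefix
is_cancelled = toy["Invoice"].astype(str).str.startswith("C")
sales = toy[~is_cancelled]
cancellations = toy[is_cancelled]
print(len(sales), len(cancellations))
```

Whether cancellations should be dropped or netted against the original order depends on the question being asked; the split above just makes either option possible.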

EDA [Code]

Dataset

import pandas as pd

# Load data
file_loc = "data\\online_retail_II.xlsx"
online_reatail_data = pd.read_excel(file_loc)
online_reatail_data.head()
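By default `pd.read_excel` reads only the first sheet of a workbook. If your copy of the file holds the data on more than one sheet (worth checking, since the stated date range spans two years), passing `sheet_name=None` returns every sheet as a dictionary of DataFrames. The helper below is a sketch of how those could be stacked; `combine_sheets` and the toy sheet names are hypothetical.

```python
import pandas as pd

def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} dict into one frame, keeping the sheet name."""
    frames = [df.assign(sheet=name) for name, df in sheets.items()]
    return pd.concat(frames, ignore_index=True)

# With the real workbook (path as used above), sheet_name=None returns all sheets:
#   sheets = pd.read_excel(file_loc, sheet_name=None)
# Toy stand-in so the sketch runs without the file; the sheet names are made up.
sheets = {
    "sheet_a": pd.DataFrame({"Invoice": [1, 2], "Quantity": [3, 4]}),
    "sheet_b": pd.DataFrame({"Invoice": [3], "Quantity": [5]}),
}
all_rows = combine_sheets(sheets)
print(all_rows.shape)
```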

Total Number of Rows and Columns

rows_col = online_reatail_data.shape
print("Total number of records in the dataset : ", rows_col[0])
print("Total number of columns in the dataset : ", rows_col[1])

Check Details

# data information
online_reatail_data.info()

Check the number of missing values

# check missing values
online_reatail_data.isna().sum()
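Once the missing counts are known, a common next step is to look at them as a share of all rows and to drop rows that cannot be attributed to a customer. The sketch below uses toy data and assumes the customer column is named `Customer ID`; check the output of `.info()` for the exact name in your copy.

```python
import pandas as pd

toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP"],
    "Customer ID": [13085.0, None, None],
    "Quantity": [2, 1, 5],
})

# Share of missing values per column, as a fraction of all rows
missing_share = toy.isna().mean()

# Keep only rows that can be attributed to a customer
with_customer = toy.dropna(subset=["Customer ID"])
print(missing_share)
print(len(with_customer))
```

Dropping rows is only one option; for tasks that do not need a customer key, such as total revenue per month, the rows without a customer can stay.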

Statistical Information

# statistical information
online_reatail_data.describe()

Data Visualization

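Number of records per country

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
plt.figure(figsize=(15, 8))
plt.xticks(rotation=65, size=10)
# One bar per country; bar height = number of rows (line items) for that country
sns.countplot(x='Country', data=online_reatail_data)
plt.show()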

Figure: Country count

Orders per month

import matplotlib.pyplot as plt
import seaborn as sns

# Derive month-level fields from the invoice timestamp (.dt is the pandas datetime accessor)
online_reatail_data['month'] = online_reatail_data['InvoiceDate'].dt.month
online_reatail_data['year'] = online_reatail_data['InvoiceDate'].dt.year
online_reatail_data['month_year'] = pd.to_datetime(online_reatail_data[['year', 'month']].assign(day=1))
# Revenue of each line item
online_reatail_data['revenue'] = online_reatail_data['Price'] * online_reatail_data['Quantity']

sns.set_theme(style="whitegrid")
plt.figure(figsize=(10,5))
# Count distinct invoices per month; .count() would count line items, not orders
plot = pd.DataFrame(online_reatail_data.groupby(['month_year'])['Invoice'].nunique()).reset_index()
ax = sns.lineplot(x="month_year", y="Invoice", data=plot)
plt.show()

Figure: Orders per month

Revenue per month

# Total revenue per calendar month
data2 = pd.DataFrame(online_reatail_data.groupby(['month_year'])['revenue'].sum()).reset_index()
plt.figure(figsize=(10,5))
ax = sns.lineplot(x='month_year', y='revenue', data=data2)
plt.show()

Figure: Revenue per month

Revenue by country

# Total revenue per country
data3 = pd.DataFrame(online_reatail_data.groupby(['Country'])['revenue'].sum()).reset_index()
plt.figure(figsize=(15,5))
ax = sns.barplot(x='Country', y='revenue', data=data3)
plt.xticks(rotation=65, size=10)
plt.show()

Figure: Revenue by country
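With many countries on the x-axis, an unsorted bar chart makes it hard to see which ones contribute most. One option is to sort the totals and keep only the largest few before plotting. The sketch below runs on toy figures; `top_countries` is a hypothetical helper and the revenue values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["United Kingdom", "France", "United Kingdom", "Germany", "France"],
    "revenue": [100.0, 40.0, 60.0, 25.0, 10.0],
})

def top_countries(df, n=10):
    """Total revenue per country, largest first, limited to n rows."""
    totals = df.groupby("Country", as_index=False)["revenue"].sum()
    return totals.sort_values("revenue", ascending=False).head(n).reset_index(drop=True)

top = top_countries(toy, n=2)
print(top)
```

The resulting frame can be passed to `sns.barplot` in place of the full per-country table.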

Other related data

Occupancy Detection Data Set - Classification

Census Income Data Set - Classification

Wholesale Customer - Classification and Clustering

Student Performance Dataset - Classification and Regression

Cervical Cancer Risk Factor Dataset - Classification and Clustering

Blood Transfusion Service Center Dataset - Classification

Divorce Predictor Dataset - Classification

Fire Forest Dataset - Regression

Heart Disease Dataset - Classification
