Online Retail Dataset - Classification, Clustering and Regression

Oct 29, 2021

Updated: Nov 3, 2021

Description :

The Online Retail II data set contains all transactions of a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. It can be used for regression, clustering and classification, for example to predict item sales, or to predict which previously purchased products a customer is most likely to buy again in their next order.

Recommended Model :

Suggested algorithms: Apriori, FP-Growth, random forest regressor, linear regression, ridge regression, lasso regression, etc.
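Apriori and FP-Growth both rest on the idea of itemset support, i.e. the fraction of baskets that contain a given set of items. The following is a minimal sketch of support counting for item pairs only, not a full Apriori or FP-Growth implementation. The baskets, the item codes and the `min_support` threshold are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy baskets: each set stands for the product codes on one invoice (made-up codes)
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]


def pair_support(baskets):
    """Return {pair: fraction of baskets that contain both items}."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a canonical order, so (A, B) and (B, A) are one key
        counts.update(combinations(sorted(basket), 2))
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items()}


support = pair_support(baskets)
min_support = 0.4  # illustrative threshold, not a recommended value
frequent_pairs = {p: s for p, s in support.items() if s >= min_support}
print(frequent_pairs)
```

On the real data, one basket would be the set of StockCode values sharing an invoice number. A library implementation is likely the better choice at that scale.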

Recommended Projects :

Predict item sales, or predict which previously purchased products a customer is most likely to buy again in their next order.
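As a starting point for the repeat-purchase project, one simple baseline is to treat products that a customer has bought in several separate orders as reorder candidates. The sketch below uses a toy frame. The column name "Customer ID" and the cut-off of two orders are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy order lines; "Customer ID" is an assumed column name, values are made up
lines = pd.DataFrame({
    "Customer ID": [1, 1, 1, 1, 2, 2, 2],
    "Invoice": ["100", "100", "101", "102", "200", "201", "201"],
    "StockCode": ["A", "B", "A", "A", "C", "C", "D"],
})

# Number of distinct invoices in which each customer bought each product
orders_per_product = (
    lines.groupby(["Customer ID", "StockCode"])["Invoice"]
    .nunique()
    .rename("n_orders")
    .reset_index()
)

# Products bought in at least two separate orders count as reorder candidates
reorder_candidates = orders_per_product[orders_per_product["n_orders"] >= 2]
print(reorder_candidates)
```

A trained classifier would replace the fixed cut-off, but this count is a useful feature and a sanity-check baseline.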

Dataset link

UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

Kaggle: https://www.kaggle.com/lakshmi25npathi/online-retail-dataset

Overview of data

Detailed overview of dataset

Records in the dataset: 525,461 rows
Columns in the dataset: 8 columns

Invoice: A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation (Nominal)
StockCode: A 5-digit integral number uniquely assigned to each distinct product (Nominal)
Description: Product (item) name (Nominal)
Quantity: The quantity of each product (item) per transaction (Numeric)
InvoiceDate: The day and time when each transaction was generated (Numeric)
Price: Product price per unit in sterling (Numeric)
Customer ID: A 5-digit integral number uniquely assigned to each customer (Nominal)
Country: Name of the country where each customer resides (Nominal)
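Because cancellations are marked by a leading 'C' in the invoice number, they can be separated from regular sales before any modelling. The sketch below shows one way to do that on a toy frame. The invoice numbers are made up, and the column is assumed to be named Invoice, as in the EDA code.

```python
import pandas as pd

# Toy data: the invoice column may hold a mix of integers and strings
df = pd.DataFrame({
    "Invoice": [489434, "C489449", 489435, "C489450"],
    "Quantity": [12, -12, 6, -1],
})

# Cast to str first so the prefix test also works on integer invoice numbers
df["is_cancellation"] = df["Invoice"].astype(str).str.startswith("C")

sales = df[~df["is_cancellation"]]
cancellations = df[df["is_cancellation"]]
print(len(sales), len(cancellations))
```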

EDA[Code]

Dataset

import pandas as pd

# Load data (pd.read_excel reads only the first sheet of the workbook by default)
file_loc = "data\\online_retail_II.xlsx"

online_reatail_data = pd.read_excel(file_loc)
online_reatail_data.head()

Total Number of Rows and Columns

rows_col = online_reatail_data.shape
print("Total number of records in the dataset : ", rows_col[0])
print("Total number of columns in the dataset : ", rows_col[1])

Check Details

# data information
online_reatail_data.info()

Check the number of missing values

# check missing values
online_reatail_data.isna().sum()
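Once the missing counts are known, a common next step is to look at the share of missing values per column and to decide which rows to keep. The sketch below does this on a toy frame. The column name "Customer ID" and the choice to drop rows without a customer are assumptions for illustration; whether dropping is appropriate depends on the task.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the description and customer columns (made-up values)
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP", "BOX"],
    "Customer ID": [13085.0, np.nan, np.nan, 13078.0],
})

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Keep only rows with a known customer, e.g. for customer-level models
with_customer = toy.dropna(subset=["Customer ID"])
print(missing_share)
print(len(with_customer))
```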

Statistical Information

# statistical information
online_reatail_data.describe()

Data Visualization

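Number of records per country

import matplotlib.pyplot as plt
import seaborn as sns

# Each row is one invoice line, so the bars count invoice lines per country
sns.set_style("whitegrid")
plt.figure(figsize=(15, 8))
sns.countplot(x='Country', data=online_reatail_data)
# Rotate after plotting so the setting applies to the country labels
plt.xticks(rotation=65, size=10)
plt.show()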

Orders per month

# The .dt accessor comes from pandas, so no separate datetime import is needed
online_reatail_data['month'] = online_reatail_data['InvoiceDate'].dt.month
online_reatail_data['year'] = online_reatail_data['InvoiceDate'].dt.year
# First day of each month, used as the x-axis of the monthly plots
online_reatail_data['month_year'] = pd.to_datetime(online_reatail_data[['year', 'month']].assign(day=1))
# Revenue per invoice line
online_reatail_data['revenue'] = online_reatail_data['Price'] * online_reatail_data['Quantity']

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")
plt.figure(figsize=(10, 5))
# nunique() counts distinct invoices (orders); count() would count invoice lines instead
plot = online_reatail_data.groupby('month_year')['Invoice'].nunique().reset_index()
ax = sns.lineplot(x="month_year", y="Invoice", data=plot)
plt.show()

Revenue per month

# Sum the per-line revenue within each month
data2 = online_reatail_data.groupby('month_year')['revenue'].sum().reset_index()
plt.figure(figsize=(10, 5))
ax = sns.lineplot(x='month_year', y='revenue', data=data2)
plt.show()
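The monthly aggregation can be checked on a small frame before running it on the full data. The sketch below uses made-up rows and derives the month with `dt.to_period("M")`, which is an alternative to building separate month and year columns, not the approach used above.

```python
import pandas as pd

# Toy invoice lines (made-up values)
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2009-12-01 07:45", "2009-12-15 10:00", "2010-01-04 09:30"]
    ),
    "Quantity": [2, 3, 10],
    "Price": [1.5, 2.0, 0.5],
})

# Revenue per invoice line
toy["revenue"] = toy["Quantity"] * toy["Price"]

# Truncate each timestamp to the first day of its month
toy["month_year"] = toy["InvoiceDate"].dt.to_period("M").dt.to_timestamp()

# Total revenue per month
monthly = toy.groupby("month_year")["revenue"].sum()
print(monthly)
```

Note that cancellations carry negative quantities, so they reduce the monthly totals unless they are filtered out first.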

Revenue by country

data3 = pd.DataFrame(online_reatail_data.groupby(['Country'])['revenue'].sum()).reset_index()
plt.figure(figsize=(15,5))
ax=sns.barplot(x='Country', y='revenue',data=data3)
plt.xticks(rotation=65,size=10)
plt.show()
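When one country dominates, a bar chart with every country on the same scale can be hard to read. A sorted table with revenue shares is a useful companion. The sketch below uses made-up numbers, and the names `by_country` and `share` are illustrative.

```python
import pandas as pd

# Toy per-line revenue by country (made-up values)
toy = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "France", "Germany", "France"],
    "revenue": [100.0, 50.0, 30.0, 20.0, 10.0],
})

# Total revenue per country, largest first
by_country = toy.groupby("Country")["revenue"].sum().sort_values(ascending=False)

# Each country's share of total revenue
share = by_country / by_country.sum()
print(by_country.head(2))
print(share.round(3))
```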