Description:
The Online Retail II data set contains all transactions of a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. It can be used for regression, clustering and classification tasks, for example to predict item sales or to predict which previously purchased products a customer is most likely to buy again in their next order.
Recommended Model:
Suitable algorithms include Apriori, FP-Growth, random forest regressor, linear regression, ridge regression and lasso regression.
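To make the Apriori idea concrete, here is a minimal sketch of its first two passes: count single items, drop the infrequent ones, then count pairs built only from the survivors. The baskets, the `frequent_pairs` helper and the `min_support` threshold are hypothetical and chosen for illustration; a real project would more likely use a library implementation.

```python
from collections import Counter
from itertools import combinations

# Toy baskets (hypothetical data): each set holds the StockCodes on one invoice.
baskets = [
    {"85048", "79323P", "22041"},
    {"85048", "22041"},
    {"79323P", "21232"},
    {"85048", "22041", "21232"},
]

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs whose support >= min_support."""
    n = len(baskets)
    # Pass 1: count single items and keep only the frequent ones (Apriori pruning).
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {item for item, c in item_counts.items() if c / n >= min_support}
    # Pass 2: count pairs built only from frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(basket & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

With these toy baskets only the pair ("22041", "85048") appears in at least half of the invoices, so it is the only pair returned.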
Recommended Projects:
Predict the sales of items, or predict which previously purchased products a customer is most likely to buy again in their next order.
Dataset link
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Online+Retail+II
Overview of data
Detailed overview of dataset
Records in the dataset = 525461 ROWS
Columns in the dataset = 8 COLUMNS
InvoiceNo: A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation (Nominal). The code below refers to this column as Invoice.
StockCode: A 5-digit integral number uniquely assigned to each distinct product (Nominal)
Description: Product (item) name (Nominal)
Quantity: The quantity of each product (item) per transaction (Numeric)
InvoiceDate: The day and time when each transaction was generated (Date/time)
UnitPrice: Product price per unit in sterling (Numeric). The code below refers to this column as Price.
CustomerID: A 5-digit integral number uniquely assigned to each customer (Nominal)
Country: Name of the country where each customer resides (Nominal)
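Because cancellations are marked only by a leading 'C' in the invoice number, it is often useful to flag them before any analysis. The sketch below does this on a few made-up rows; the values and the `is_cancellation` column name are hypothetical, and the column names `Invoice`, `Quantity` and `Price` follow the ones used by the code in this post.

```python
import pandas as pd

# Toy rows (hypothetical values) standing in for the real dataset.
df = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "C489459"],
    "Quantity": [12, -12, 6, -3],
    "Price": [6.95, 2.95, 1.25, 4.25],
})

# Cast to string first: invoice numbers may be loaded as integers or strings.
df["is_cancellation"] = df["Invoice"].astype(str).str.startswith("C")

# Keep only regular sales for downstream analysis.
sales = df[~df["is_cancellation"]]
print(df["is_cancellation"].sum(), "cancelled rows;", len(sales), "sale rows")
```

Whether cancellations should be dropped or kept depends on the question being asked, for example net revenue versus gross orders.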
EDA[Code]
Dataset
import pandas as pd
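# Load data (pd.read_excel reads only the first worksheet unless sheet_name is given)
file_loc = "data\\online_retail_II.xlsx"
online_reatail_data = pd.read_excel(file_loc)
# Preview the first five rows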
online_reatail_data.head()
Total Number of Rows and Columns
rows_col = online_reatail_data.shape
print("Total number of records in the dataset : ", rows_col[0])
print("Total number of columns in the dataset : ", rows_col[1])
Check Details
# data information
online_reatail_data.info()
Check the number of missing values
# check missing values
online_reatail_data.isna().sum()
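Raw missing-value counts are easier to judge next to the share of rows they represent. The following sketch builds such a summary on a toy frame; the data and the column name `Customer ID` are assumptions made for illustration, and the same two lines could be applied to `online_reatail_data`.

```python
import pandas as pd

# Toy frame (hypothetical values) with some missing entries.
toy = pd.DataFrame({
    "Description": ["MUG", None, "BOWL", "PLATE"],
    "Customer ID": [13085.0, None, None, 13078.0],
})

# Count of missing values and percentage of rows affected, per column.
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)
```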
Statistical Information
# statistical information
online_reatail_data.describe()
Data Visualization
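Number of transaction rows per country
import matplotlib.pyplot as plt
import seaborn as sns
# One bar per country, counting rows (line items) in the dataset
sns.set_style("whitegrid")
plt.figure(figsize=(15, 8))
sns.countplot(x='Country', data=online_reatail_data)
# Rotate the labels after plotting so the setting applies to the final ticks
plt.xticks(rotation=65, size=10)
plt.show()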
Orders per month
import matplotlib.pyplot as plt
import seaborn as sns
# Derive month and year from the invoice timestamp (.dt is the pandas datetime accessor)
online_reatail_data['month'] = online_reatail_data['InvoiceDate'].dt.month
online_reatail_data['year'] = online_reatail_data['InvoiceDate'].dt.year
# First day of each month, used as the x-axis for the monthly plots
online_reatail_data['month_year'] = pd.to_datetime(online_reatail_data[['year', 'month']].assign(day=1))
# Revenue per line item
online_reatail_data['revenue'] = online_reatail_data['Price'] * online_reatail_data['Quantity']
sns.set_theme(style="whitegrid")
plt.figure(figsize=(10, 5))
# Count distinct invoices per month (count() would count line items instead)
plot = online_reatail_data.groupby('month_year')['Invoice'].nunique().reset_index()
ax = sns.lineplot(x="month_year", y="Invoice", data=plot)
plt.show()
Revenue per month
# Sum the line-item revenue for each month
data2 = online_reatail_data.groupby('month_year')['revenue'].sum().reset_index()
plt.figure(figsize=(10, 5))
ax = sns.lineplot(x='month_year', y='revenue', data=data2)
plt.show()
Total revenue by country
data3 = pd.DataFrame(online_reatail_data.groupby(['Country'])['revenue'].sum()).reset_index()
plt.figure(figsize=(15,5))
ax=sns.barplot(x='Country', y='revenue',data=data3)
plt.xticks(rotation=65,size=10)
plt.show()
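A bar chart with every country can be hard to read, so one option is to sort the totals and keep only the largest few before plotting. This is a sketch on made-up numbers; the toy values and the cut-off of two countries are arbitrary, and the same chain could be applied to the real `Country` and `revenue` columns.

```python
import pandas as pd

# Toy line items (hypothetical values).
toy = pd.DataFrame({
    "Country": ["United Kingdom", "France", "United Kingdom", "Germany", "France"],
    "revenue": [100.0, 40.0, 250.0, 30.0, 20.0],
})

# Total revenue per country, largest first, limited to the top two.
top = (
    toy.groupby("Country")["revenue"]
    .sum()
    .sort_values(ascending=False)
    .head(2)
    .reset_index()
)
print(top)
```

The resulting frame can be passed to `sns.barplot` in place of the unsorted totals.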