Description:
This data set refers to clients of a wholesale distributor operating in different regions of Portugal. It records each client's annual spending in monetary units (m.u.) on diverse product categories. The dataset consists of the annual spending of 440 clients on 6 product categories, across 3 regions (Lisbon, Oporto, Other) and 2 sales channels (Hotel, Retail).
Recommended Models:
Suggested algorithms include the XGBoost classifier, logistic regression, and k-means clustering.
Recommended Projects:
Predict which regions and channels will spend more and which will spend less.
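Before fitting any model, a simple first pass at this question is to total the spending per region and channel and rank the groups. The sketch below is only an illustration: the toy values are made up (not taken from the dataset), the helper name `rank_spending` is hypothetical, and the column names `Channel`, `Region`, `Fresh` and `Milk` are assumed. With the real data, the same function would be applied to the loaded DataFrame and all six spending columns.

```python
import pandas as pd

def rank_spending(df, spend_cols, group_cols=("Region", "Channel")):
    """Return total spending per group, highest spender first."""
    # Sum the chosen spending columns for each client (row-wise).
    total = df[list(spend_cols)].sum(axis=1)
    # Aggregate the per-client totals by the grouping columns and rank them.
    ranked = (
        total.groupby([df[c] for c in group_cols])
        .sum()
        .sort_values(ascending=False)
    )
    return ranked.rename("TotalSpending")

# Illustrative toy data only (not real records from the dataset).
toy = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Lisbon", "Oporto", "Other"],
    "Fresh": [100, 50, 30, 10],
    "Milk": [20, 80, 10, 5],
})

ranking = rank_spending(toy, ["Fresh", "Milk"])
print(ranking)
```

The first entry of `ranking` is the region and channel combination with the highest total spending, and the last entry is the lowest.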
Dataset link
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wholesale+customers
Overview of data
Detailed overview of dataset
Records in the dataset = 440 rows
Columns in the dataset = 8 columns
FRESH: annual spending (m.u.) on fresh products (Continuous)
MILK: annual spending (m.u.) on milk products (Continuous)
GROCERY: annual spending (m.u.) on grocery products (Continuous)
FROZEN: annual spending (m.u.) on frozen products (Continuous)
DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
CHANNEL: sales channel, Hotel or Retail (Categorical)
REGION: customer region, Lisbon, Oporto or Other (Categorical)
EDA[Code]
Data
import pandas as pd
#Load Data
file_loc = "data\\Wholesale customers data.csv"
wholesale_cust_data = pd.read_csv(file_loc)
wholesale_cust_data.head()
Total number of Rows and Columns
row_col = wholesale_cust_data.shape
print("Total number of rows in the dataset : {}".format(row_col[0]))
print("Total number of columns in the dataset : {}".format(row_col[1]))
Details about dataset
# check information
wholesale_cust_data.info()
Check the number of Missing values in the dataset
# missing values
wholesale_cust_data.isna().sum()
Statistical information
# statistical information
wholesale_cust_data.describe()
Data Visualization
Correlation
import seaborn as sns
import matplotlib.pyplot as plt
# correlation
corr = wholesale_cust_data.corr()
corr.style.background_gradient(cmap='cubehelix')
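The styled matrix shows every pair twice and includes the diagonal, so it can help to list the strongest pairs directly. The following is a minimal sketch, not part of the original notebook: the helper name `top_correlated_pairs` is hypothetical, and the toy frame (with assumed column names) only stands in for the numeric columns of `wholesale_cust_data`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Order by absolute value so strong negative correlations are not missed.
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Illustrative toy data only (not real records from the dataset).
toy = pd.DataFrame({
    "Grocery": [1, 2, 3, 4, 5],
    "Detergents_Paper": [2, 4, 6, 8, 10],  # exactly twice Grocery
    "Fresh": [5, 3, 4, 1, 2],
})

top = top_correlated_pairs(toy, n=1)
print(top)
```

On the real data, the call would be `top_correlated_pairs(wholesale_cust_data)`, run while all columns are still numeric.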
Count the values in the Channel and Region columns
# Replace the numeric channel codes with readable labels
wholesale_cust_data["Channel"] = wholesale_cust_data["Channel"].replace({1: "Hotel", 2: "Retail"})
# Replace the numeric region codes with readable labels
wholesale_cust_data["Region"] = wholesale_cust_data["Region"].replace({1: "Lisbon", 2: "Oporto", 3: "Other"})
import matplotlib.pyplot as plt
import seaborn as sns
cat_cols = ['Channel', 'Region']
sns.set_style("whitegrid")
# Draw one count plot per categorical column
for col in cat_cols:
    plt.figure(figsize=(8, 5))
    sns.countplot(x=col, data=wholesale_cust_data)
    plt.show()
Channel count by Region
sns.set_style('whitegrid')
sns.countplot(x="Channel",hue='Region',data=wholesale_cust_data)
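The same breakdown can be read as exact numbers with a cross-tabulation. This is a small sketch using made-up labels rather than the real records. It assumes the `Channel` and `Region` columns already hold the text labels assigned earlier.

```python
import pandas as pd

# Illustrative toy labels only (not real records from the dataset).
toy = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Other", "Other", "Other", "Oporto"],
})

# Rows are channels, columns are regions; margins=True adds the "All" totals.
counts = pd.crosstab(toy["Channel"], toy["Region"], margins=True)
print(counts)
```

With the real data, the call would be `pd.crosstab(wholesale_cust_data["Channel"], wholesale_cust_data["Region"], margins=True)`.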
Box plots of each column
# Select the numeric columns and draw one box plot per column
cols = wholesale_cust_data.select_dtypes(include='number').columns
sns.set_theme(style="whitegrid")
for i in cols:
    plt.figure(figsize=(10, 3))
    ax = sns.boxplot(x=wholesale_cust_data[i])
    plt.show()
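The points drawn beyond the whiskers can also be counted numerically. The sketch below applies the 1.5 x IQR rule, which matches seaborn's default whisker length. The helper name `count_iqr_outliers` is hypothetical and the toy values are made up for illustration.

```python
import pandas as pd

def count_iqr_outliers(df, whis=1.5):
    """Count, per numeric column, the values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - whis * iqr) | (numeric > q3 + whis * iqr)
    return outside.sum()

# Illustrative toy data only (not real records from the dataset).
toy = pd.DataFrame({
    "Fresh": [10, 12, 11, 13, 12, 500],  # 500 lies far outside the rest
    "Milk": [5, 6, 7, 6, 5, 6],
    "Channel": ["Hotel"] * 6,            # non-numeric, ignored by the helper
})

outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Calling `count_iqr_outliers(wholesale_cust_data)` would give one count per numeric column, which helps when deciding whether a transformation or a robust scaler is worth considering before modelling.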