Oct 29, 2021

Wholesale data set - classification and clustering

Updated: Nov 3, 2021

Description:

This data set describes the clients of a wholesale distributor operating in different regions of Portugal. It records each client's annual spending in monetary units (m.u.) on several product categories. The dataset consists of the annual spending of 440 clients on 6 product categories, across 3 regions (Lisbon, Oporto, Other) and 2 sales channels (Hotel, Retail).

Recommended Model:

Suggested algorithms include an XGBoost classifier, logistic regression, and k-means clustering.
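As a rough starting point for two of the algorithms listed above, the sketch below fits a logistic regression that predicts Channel from the spending columns and runs k-means on the same scaled features. It assumes scikit-learn is installed and that the DataFrame has the `Channel` and `Region` columns used later in this post. The helper names (`spending_features`, `fit_channel_classifier`, `cluster_customers`), the choice of two clusters, and the tiny demo frame are illustrative assumptions, not part of the original analysis. XGBoost is not shown.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def spending_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return every column except the two categorical ones."""
    return df.drop(columns=["Channel", "Region"])


def fit_channel_classifier(df: pd.DataFrame):
    """Fit a scaled logistic regression that predicts Channel from spending."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(spending_features(df), df["Channel"])
    return model


def cluster_customers(df: pd.DataFrame, n_clusters: int = 2) -> pd.Series:
    """Assign each customer to a k-means cluster based on scaled spending."""
    scaled = StandardScaler().fit_transform(spending_features(df))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return pd.Series(km.fit_predict(scaled), index=df.index, name="cluster")


# Tiny made-up frame, only to show the calls; these are not real rows.
demo = pd.DataFrame({
    "Channel": [1, 1, 1, 2, 2, 2],
    "Region": [1, 2, 3, 1, 2, 3],
    "Fresh": [12000, 9000, 15000, 3000, 2500, 4000],
    "Grocery": [2000, 2500, 1800, 14000, 16000, 12000],
})
clf = fit_channel_classifier(demo)
labels = cluster_customers(demo)
print(clf.predict(spending_features(demo)))
print(labels.tolist())
```

On the real data, the same functions can be called with the loaded DataFrame instead of `demo`. A held-out test split would be needed before reading anything into the classifier's accuracy.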

Recommended Projects:

Predict which region and channel combinations spend the most and which spend the least.
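Before any modelling, the question above can be explored with a plain aggregation. The sketch below sums each customer's spending across all product columns and totals it per Region and Channel. The function name `total_spend_by_group` and the demo numbers are made up for illustration; it assumes only that the frame has `Channel` and `Region` columns and that every other column is a spending amount.

```python
import pandas as pd


def total_spend_by_group(df: pd.DataFrame) -> pd.Series:
    """Sum all spending columns per customer, then total by Region and Channel."""
    spend_cols = [c for c in df.columns if c not in ("Channel", "Region")]
    per_customer = df[spend_cols].sum(axis=1)
    totals = per_customer.groupby([df["Region"], df["Channel"]]).sum()
    return totals.sort_values(ascending=False)


# Made-up numbers, only to show the shape of the result.
demo = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Lisbon", "Oporto", "Oporto"],
    "Fresh": [100, 50, 30, 10],
    "Milk": [20, 80, 5, 15],
})
ranked = total_spend_by_group(demo)
print(ranked)
print("highest:", ranked.index[0], "lowest:", ranked.index[-1])
```

Because the groups differ in size, replacing the final `.sum()` with `.mean()` may give a fairer comparison of spending per customer.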

Dataset link

UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wholesale+customers

Kaggle: https://www.kaggle.com/binovi/wholesale-customers-data-set

Overview of data

Detailed overview of dataset

  • Records in the dataset = 440 ROWS

  • Columns in the dataset = 8 COLUMNS

  1. FRESH: annual spending (m.u.) on fresh products (Continuous)

  2. MILK: annual spending (m.u.) on milk products (Continuous)

  3. GROCERY: annual spending (m.u.) on grocery products (Continuous)

  4. FROZEN: annual spending (m.u.) on frozen products (Continuous)

  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)

  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)

  7. CHANNEL: sales channel, Hotel or Retail (Nominal)

  8. REGION: customer region, Lisbon, Oporto or Other (Nominal)

EDA[Code]

Data

import pandas as pd

# Load the data (Windows-style path, relative to the working directory)
file_loc = "data\\Wholesale customers data.csv"
wholesale_cust_data = pd.read_csv(file_loc)

# Preview the first five rows
wholesale_cust_data.head()

Dataset

Total number of Rows and Columns

# shape is a (rows, columns) tuple
row_col = wholesale_cust_data.shape
print("Total number of rows in the dataset : {}".format(row_col[0]))
print("Total number of columns in the dataset : {}".format(row_col[1]))

rows and column

Details about dataset

# Column names, dtypes and non-null counts
wholesale_cust_data.info()

data information

Check the number of Missing values in the dataset

# Number of missing values per column
wholesale_cust_data.isna().sum()

Missing values output

Statistical information

# Count, mean, std, min, quartiles and max for each numeric column
wholesale_cust_data.describe()

Statistical information

Data Visualization

Correlation

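import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlation (all columns are still numeric at this point)
corr = wholesale_cust_data.corr()

# Colour the correlation table; the styled table renders in a notebook
corr.style.background_gradient(cmap='cubehelix')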

Correlation
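A coloured matrix is easy to scan, but a ranked list of the strongest pairs can be handier when writing up findings. The sketch below is one possible way to extract them. The helper name `top_correlated_pairs` and the demo values are assumptions for illustration, not output from the real data.

```python
from itertools import combinations

import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Visit each unordered pair once, which also skips the diagonal.
    pairs = pd.Series(
        {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    )
    # Sort by magnitude so strong negative correlations rank highly too.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Made-up values, only to show the call.
demo = pd.DataFrame({
    "Milk": [1, 2, 3, 4],
    "Grocery": [2, 4, 6, 8.5],
    "Frozen": [4, 3, 2, 1],
})
top = top_correlated_pairs(demo, n=2)
print(top)
```

With the real frame, calling `top_correlated_pairs(wholesale_cust_data)` before the label replacement step would include Channel and Region in the ranking, since they are still numeric codes at that point.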

Count the values in the Channel and Region columns

# Replace the numeric channel codes with readable labels
wholesale_cust_data["Channel"] = wholesale_cust_data["Channel"].replace({1: "Hotel", 2: "Retail"})

# Replace the numeric region codes with readable labels
wholesale_cust_data["Region"] = wholesale_cust_data["Region"].replace({1: "Lisbon", 2: "Oporto", 3: "Other"})
 

 
import matplotlib.pyplot as plt
import seaborn as sns

colmns = ['Channel', 'Region']

# Draw one count plot per categorical column
sns.set_style("whitegrid")
for col in colmns:
    plt.figure(figsize=(8, 5))
    sns.countplot(x=col, data=wholesale_cust_data)
    plt.show()

channel count

Region count

Channel count by Regions

# Channel counts, split by region
sns.set_style('whitegrid')
sns.countplot(x="Channel", hue='Region', data=wholesale_cust_data)

channel count by region
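The same breakdown can be read as exact numbers with a cross-tabulation. This is a minimal sketch, assuming the `Channel` and `Region` columns already hold the labels assigned earlier. The function name `channel_by_region_table` and the demo rows are made up for illustration.

```python
import pandas as pd


def channel_by_region_table(df: pd.DataFrame) -> pd.DataFrame:
    """Count customers per Channel (rows) and Region (columns), with totals."""
    return pd.crosstab(df["Channel"], df["Region"], margins=True)


# Made-up rows, only to show the call; these are not the real counts.
demo = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Lisbon", "Lisbon", "Other", "Other"],
})
table = channel_by_region_table(demo)
print(table)
```

The `margins=True` argument adds an `All` row and column, so the row and column totals appear alongside the individual cells.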

Box plots of each column

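# Numeric (spending) columns; Channel and Region now hold strings
cols = wholesale_cust_data.select_dtypes(exclude='object').columns

# Draw one horizontal box plot per spending column
sns.set_theme(style="whitegrid")
for i in cols:
    plt.figure(figsize=(10, 3))
    ax = sns.boxplot(x=wholesale_cust_data[i])
    plt.show()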

Box plots
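Box plots mark points beyond the whiskers as outliers. To put a number on that, the sketch below counts values outside 1.5 times the interquartile range per column, which is the usual whisker rule and, to my understanding, the seaborn default. The helper name `iqr_outlier_counts` and the demo values are illustrative assumptions.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per numeric column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return outside.sum()


# Made-up values, only to show the call.
demo = pd.DataFrame({
    "Fresh": [10, 12, 11, 13, 12, 100],
    "Milk": [5, 6, 5, 6, 5, 6],
})
counts = iqr_outlier_counts(demo)
print(counts)
```

Spending data of this kind is often right-skewed, so a high outlier count may point to a need for a log transform rather than for dropping rows.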

Other related data

Occupancy Detection Data Set - Classification

Census income Data Set - Classification

Divorce Predictor Dataset - Classification

Online retail dataset - classification, clustering and regression

Cervical Cancer Risk Factor Dataset - classification and clustering

Blood Transfusion service center dataset - Classification

Student performance dataset - Classification and Regression

Fire Forest Dataset - Regression

Heart Disease dataset - Classification
