Wholesale data set - classification and clustering

Oct 29, 2021

Updated: Nov 3, 2021

Description :

This data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. The distributor operates in different regions of Portugal and records the annual spending on several product categories across regions and sales channels. The dataset consists of the annual spending of 440 clients on 6 product categories in 3 regions (Lisbon, Oporto, Other) and across 2 sales channels (Hotel, Retail).

Recommended Model :

Algorithms to consider: XGBoost classifier, logistic regression, k-means clustering, etc.
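As a rough illustration of the clustering option, the sketch below runs k-means on log-transformed, standardized spending columns. The toy values, the choice of two clusters, and the log/standardize preprocessing are assumptions made for this example only, and scikit-learn is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy spending data (made-up values, NOT taken from the real dataset).
toy = pd.DataFrame({
    "Fresh":   [12000, 9000, 15000, 800, 600, 1200],
    "Grocery": [2000, 1500, 2500, 14000, 16000, 12000],
})

# Spending is usually right-skewed, so compress large values with log1p,
# then standardize so each column contributes equally to the distances.
features = StandardScaler().fit_transform(np.log1p(toy))

# n_clusters=2 is an arbitrary choice for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(features)
print(toy)
```

On the real data, the number of clusters would need to be chosen with something like an elbow plot or silhouette score rather than fixed up front.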

Recommended Projects :

Predict which region and channel will spend more and which will spend less.
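Before any modelling, a simple aggregation already shows which region and channel combinations spend the most. The sketch below is one possible starting point; the toy rows and the `TotalSpend` helper column are hypothetical and only mirror the shape of the dataset.

```python
import pandas as pd

# Toy rows (made-up values); columns mirror the dataset with decoded labels.
toy = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region":  ["Lisbon", "Lisbon", "Oporto", "Other"],
    "Fresh":   [100, 50, 200, 30],
    "Milk":    [20, 80, 10, 90],
})

# On the real data, list all six spending columns here.
spend_cols = ["Fresh", "Milk"]
toy["TotalSpend"] = toy[spend_cols].sum(axis=1)  # hypothetical helper column

# Total spending per (Region, Channel) pair, largest first.
spend_by_group = (
    toy.groupby(["Region", "Channel"])["TotalSpend"]
    .sum()
    .sort_values(ascending=False)
)
print(spend_by_group)
```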

Dataset link

UCI MLR: https://archive.ics.uci.edu/ml/datasets/wholesale+customers

Kaggle: https://www.kaggle.com/binovi/wholesale-customers-data-set

Overview of data

Detailed overview of dataset

Records in the dataset = 440 ROWS
Columns in the dataset = 8 COLUMNS

FRESH: annual spending (m.u.) on fresh products (Continuous)
MILK: annual spending (m.u.) on milk products (Continuous)
GROCERY: annual spending (m.u.) on grocery products (Continuous)
FROZEN: annual spending (m.u.) on frozen products (Continuous)
DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
CHANNEL: sales channel (Hotel or Retail)
REGION: one of three regions (Lisbon, Oporto, Other)

EDA[Code]

Data

import pandas as pd
# Load the data (adjust the path to wherever the CSV is stored)
file_loc = "data/Wholesale customers data.csv"
wholesale_cust_data = pd.read_csv(file_loc)
wholesale_cust_data.head()

Total number of Rows and Columns

row_col = wholesale_cust_data.shape
print("Total number of rows in the dataset : {}".format(row_col[0]))
print("Total number of columns in the dataset : {}".format(row_col[1]))

Details about dataset

# check information 
wholesale_cust_data.info()

Check the number of Missing values in the dataset

# missing values
wholesale_cust_data.isna().sum()

Statistical information

# statistical information 
wholesale_cust_data.describe()

Data Visualization

Correlation

import seaborn as sns
import matplotlib.pyplot as plt
# correlation
corr = wholesale_cust_data.corr()
corr.style.background_gradient(cmap='cubehelix')
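A colored matrix is easy to scan, but a ranked list of column pairs can make the strongest relationships explicit. The following is a minimal sketch of that idea on made-up numbers; the toy frame and the variable names `mask` and `pairs` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) standing in for the spending columns.
toy = pd.DataFrame({
    "Grocery": [1, 2, 3, 4, 5],
    "Milk":    [2, 4, 6, 8, 11],
    "Frozen":  [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order pairs by absolute correlation, strongest first.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top instead of burying them at the bottom.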

Count the values in the Channel and Region columns

# Replace the numeric channel codes with readable labels
wholesale_cust_data["Channel"] = wholesale_cust_data["Channel"].replace({1: "Hotel", 2: "Retail"})

# Replace the numeric region codes with readable labels
wholesale_cust_data["Region"] = wholesale_cust_data["Region"].replace({1: "Lisbon", 2: "Oporto", 3: "Other"})

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
# One count plot per categorical column
for col in ["Channel", "Region"]:
    plt.figure(figsize=(8, 5))
    sns.countplot(x=col, data=wholesale_cust_data)
    plt.show()

Channel count by Region

sns.set_style('whitegrid')
sns.countplot(x="Channel",hue='Region',data=wholesale_cust_data)
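The same counts can also be read as a table, which is handy when exact numbers matter more than bar heights. This is a small sketch using `pd.crosstab` on made-up rows; the toy frame is an assumption and does not reflect the real counts.

```python
import pandas as pd

# Toy labels (made-up rows) with the same column names as the decoded dataset.
toy = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Hotel", "Retail"],
    "Region":  ["Lisbon", "Other", "Other", "Other", "Oporto"],
})

# Rows are channels, columns are regions, cells are client counts.
counts = pd.crosstab(toy["Channel"], toy["Region"])
print(counts)
```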

Box plots of each numeric column

# Numeric (spending) columns only; Channel and Region now hold text labels
cols = wholesale_cust_data.select_dtypes(include='number').columns

sns.set_theme(style="whitegrid")
for col in cols:
    plt.figure(figsize=(10, 3))
    sns.boxplot(x=wholesale_cust_data[col])
    plt.show()
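The points drawn beyond the whiskers of a box plot can also be counted numerically. The sketch below applies the usual 1.5 × IQR rule (the default whisker length in seaborn box plots) to made-up values; the toy frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy spending values (made-up); 100 is a deliberately extreme entry.
toy = pd.DataFrame({
    "Fresh": [10, 12, 11, 13, 12, 100],
    "Milk":  [5, 6, 5, 7, 6, 6],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# A value is flagged when it lies more than 1.5 * IQR outside the quartiles.
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)
```

Flagged values are not necessarily errors; with spending data they may simply be very large customers, so it is worth inspecting them before dropping anything.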