
Wholesale data set - classification and clustering

Updated: Nov 3, 2021


Description :


This data set describes the clients of a wholesale distributor operating in different regions of Portugal. It records each client's annual spending, in monetary units (m.u.), on diverse product categories. The dataset consists of the annual spending of 440 large retailers on 6 product categories, across 3 regions (Lisbon, Oporto, Other) and 2 sales channels (Hotel, Retail).


Recommended Model :


Suggested algorithms include an XGBoost classifier, logistic regression, and k-means clustering.
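As a rough illustration of the clustering option, the sketch below runs k-means on a small synthetic table. It assumes scikit-learn is installed; the `demo` frame, its values, and the choice of two clusters are made up for illustration and are not taken from the wholesale data. Spending figures are often right-skewed, so the sketch log-transforms and standardises them first, which is one common choice rather than a required step.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for two of the spending columns; values are synthetic.
rng = np.random.default_rng(0)
small = rng.normal(loc=2000, scale=200, size=(20, 2))    # low spenders
large = rng.normal(loc=20000, scale=2000, size=(20, 2))  # high spenders
demo = pd.DataFrame(np.vstack([small, large]), columns=["Fresh", "Grocery"])

# Log-transform, then scale each column to zero mean and unit variance.
features = StandardScaler().fit_transform(np.log1p(demo))

# Assumption: two clusters, chosen only because the synthetic data has two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
demo["cluster"] = kmeans.fit_predict(features)

# Average spending per cluster, to see what each cluster represents.
print(demo.groupby("cluster")[["Fresh", "Grocery"]].mean().round(0))
```

On the real data, the number of clusters would normally be chosen by inspecting something like an elbow plot or silhouette scores rather than fixed in advance.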


Recommended Projects :


Predict which regions and channels will spend more and which will spend less.
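Before any modelling, a simple aggregation already shows which region and channel combinations spend the most. The sketch below is a minimal example on made-up rows; the `demo` frame and the `total_spend` column are hypothetical names, and only two spending columns are included to keep it short.

```python
import pandas as pd

# Made-up rows for illustration only; labels follow the Channel/Region names used in this post.
demo = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail"],
    "Region": ["Lisbon", "Other", "Lisbon", "Other"],
    "Fresh": [100, 300, 50, 200],
    "Milk": [20, 40, 80, 60],
})

# Total annual spending per customer across the product columns.
spend_cols = ["Fresh", "Milk"]
demo["total_spend"] = demo[spend_cols].sum(axis=1)

# Sum spending per (Region, Channel) segment and rank the segments.
totals = (
    demo.groupby(["Region", "Channel"])["total_spend"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)

top_segment = totals.idxmax()      # segment with the highest total spend
bottom_segment = totals.idxmin()   # segment with the lowest total spend
```

The same pattern should carry over to the full dataset by listing all six spending columns in `spend_cols`.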

Dataset link



Overview of data


Detailed overview of dataset

  • Records in the dataset = 440 ROWS

  • Columns in the dataset = 8 COLUMNS

  1. FRESH: annual spending (m.u.) on fresh products (Continuous)

  2. MILK: annual spending (m.u.) on milk products (Continuous)

  3. GROCERY: annual spending (m.u.) on grocery products (Continuous)

  4. FROZEN: annual spending (m.u.) on frozen products (Continuous)

  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)

  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)

  7. CHANNEL: sales channel (Hotel or Retail)

  8. REGION: customer region (Lisbon, Oporto, Other)


EDA[Code]


Data

import pandas as pd
#Load Data

file_loc = "data\\Wholesale customers data.csv"
wholesale_cust_data = pd.read_csv(file_loc)
wholesale_cust_data.head()

Dataset

Total number of Rows and Columns


row_col = wholesale_cust_data.shape
print("Total number of rows in the dataset : {}".format(row_col[0]))
print("Total number of columns in the dataset : {}".format(row_col[1]))

rows and column

Details about dataset


# check information 
wholesale_cust_data.info()

data information

Check the number of Missing values in the dataset


# missing values
wholesale_cust_data.isna().sum()

Missing values output

Statistical information


# statistical information 
wholesale_cust_data.describe()

Statistical information

Data Visualization


Correlation

import seaborn as sns
import matplotlib.pyplot as plt
# correlation
corr = wholesale_cust_data.corr()
corr.style.background_gradient(cmap='cubehelix')

Correlation

Count the values in the Channel and Region columns


# Replace data values
wholesale_cust_data["Channel"] = wholesale_cust_data["Channel"].replace(1,"Hotel")
wholesale_cust_data["Channel"] = wholesale_cust_data["Channel"].replace(2,"Retail")

# Replace values
wholesale_cust_data["Region"] = wholesale_cust_data["Region"].replace(1,"Lisbon")
wholesale_cust_data["Region"] = wholesale_cust_data["Region"].replace(2,"Oporto")
wholesale_cust_data["Region"] = wholesale_cust_data["Region"].replace(3,"Other")

# Plot the number of customers per channel and per region
sns.set_style("whitegrid")
for col in ['Channel', 'Region']:
    plt.figure(figsize=(8, 5))
    sns.countplot(x=col, data=wholesale_cust_data)
    plt.show()


channel count

Region count

Channel count by Regions


sns.set_style('whitegrid')
sns.countplot(x="Channel",hue='Region',data=wholesale_cust_data)


channel count by region

Box plots of each column


# Select the numeric spending columns (Channel and Region now hold text labels)
cols = wholesale_cust_data.select_dtypes(include='number').columns

sns.set_theme(style="whitegrid")
for i in cols:
    plt.figure(figsize=(10, 3))
    ax = sns.boxplot(x=wholesale_cust_data[i])
    plt.show()



Box plots
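The points drawn beyond the whiskers of a box plot can also be listed numerically. The sketch below applies the 1.5 × IQR rule, which is seaborn's default whisker setting, to a made-up series; the `spend` values are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Made-up spending values; 9000 sits far away from the rest.
spend = pd.Series([100, 120, 130, 150, 160, 9000], name="Fresh")

# First and third quartiles, and the interquartile range between them.
q1, q3 = spend.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = spend[(spend < lower) | (spend > upper)]
print(outliers.tolist())
```

Whether such values should be removed, capped, or kept depends on the modelling goal; for spending data they are often genuine large customers rather than errors.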





If you need implementation for any of the topics mentioned above or assignment help on any of its variants, feel free to contact us.
