Nov 3, 2021

Blood Transfusion service center data set - Classification

Description :

This dataset was taken from a blood transfusion service center in Taiwan. It contains information about each blood donor, e.g. the number of months since the last donation, the number of times blood was donated, the total volume of blood donated, and the number of months since the first donation. The dataset consists of 748 instances and 5 attributes. We can use it to predict whether a donor donated blood in March 2007.

Recommended Model :

Algorithms that can be used: TPOT Classifier, logistic regression, etc.
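As a rough illustration of the logistic regression option, the sketch below fits a scaled logistic regression on a stratified train/test split. The helper name `fit_logistic_baseline`, the split size, the seed, and the synthetic demo frame are all assumptions made for this example; the demo data is not the real dataset, and TPOT is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_logistic_baseline(df, target_col, test_size=0.25, seed=0):
    """Fit a scaled logistic regression and return (model, test accuracy)."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    # Stratify so both splits keep the same class ratio as the full data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Synthetic placeholder data (NOT the real dataset), only to show the call.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Recency (months)": rng.integers(0, 40, size=200),
    "Frequency (times)": rng.integers(1, 30, size=200),
})
demo["donated"] = (demo["Recency (months)"] < 10).astype(int)
model, accuracy = fit_logistic_baseline(demo, "donated")
print(f"Held-out accuracy on the synthetic demo: {accuracy:.2f}")
```

With the real data, the same helper could be called with the loaded DataFrame and the target column name "whether he/she donated blood in March 2007".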

Recommended Projects :

To predict whether a donor donated blood in March 2007.

Dataset link

UCI MLR : - https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center

Kaggle : - https://www.kaggle.com/shabbir94/blood-transfusion

Overview of data

Detailed overview of dataset

  • Records in the dataset = 748 ROWS

  • Columns in the dataset = 5 COLUMNS

  1. Recency (months) - The number of months since the most recent donation

  2. Frequency (times) - Total number of blood donations made by the donor

  3. Monetary (c.c. blood) - Total amount of blood the donor has donated, in c.c. (cubic centimetres)

  4. Time (months) - Number of months since the donor's first donation

Target Variable

  1. whether he/she donated blood in March 2007 - A binary variable indicating whether the donor donated blood in March 2007 (0 - did not donate, 1 - donated)

EDA[Code]

Blood donation Dataset

import pandas as pd

# Load Data
file_loc = "data\\transfusion.DATA"
blood_transfusion_data = pd.read_csv(file_loc)
blood_transfusion_data.head()

Total number of rows and columns in the dataset.

# Number of rows and columns
rows_col = blood_transfusion_data.shape
print("Total number of Rows in the dataset : {}".format(rows_col[0]))
print("Total number of columns in the dataset : {}".format(rows_col[1]))
 

Dataset information

# Data information
blood_transfusion_data.info()

Check the number of missing values in the dataset.

# Check the number of missing values in each column
blood_transfusion_data.isna().sum()

Statistical information.

# Statistical information
blood_transfusion_data.describe()

Data Visualization

Correlation

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlation between all columns
corr = blood_transfusion_data.corr()
corr.style.background_gradient(cmap='coolwarm')
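The colored correlation table can be complemented by listing the column pairs whose correlation is very strong, which helps spot redundant features. For instance, if every donation had the same fixed volume, Monetary would be a constant multiple of Frequency and the pair would show a correlation close to 1. The sketch below is illustrative only: the helper name `correlated_pairs`, the 0.9 threshold, and the toy frame are assumptions, not part of the original analysis.

```python
import pandas as pd


def correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Only look at columns after `a` so each pair is reported once.
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Tiny hand-made frame (not the real data): "b" is an exact multiple of "a".
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [250, 500, 750, 1000], "c": [4, 1, 3, 2]})
pairs = correlated_pairs(toy)
print(pairs)
```

On the real data this could be called as `correlated_pairs(blood_transfusion_data)`; any pair it reports is a candidate for dropping one of the two columns before modeling.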

Plot the count plot of Target Variable

# 0 = did not donate, 1 = donated
sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
sns.countplot(x= "whether he/she donated blood in March 2007",data=blood_transfusion_data)
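The count plot shows the class balance visually; the same information can be read as numbers. The following is a minimal sketch, assuming a hypothetical helper named `class_balance` and a hand-made label series that stands in for the real target column.

```python
import pandas as pd


def class_balance(target):
    """Return counts and percentage share for each class of a target column."""
    counts = target.value_counts().sort_index()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": share})


# Hand-made labels (not the real data): three 0s and one 1.
toy_target = pd.Series([0, 0, 1, 0], name="donated")
balance = class_balance(toy_target)
print(balance)
```

For the real data, the call would be `class_balance(blood_transfusion_data["whether he/she donated blood in March 2007"])`. If one class dominates, accuracy alone can be misleading and a stratified split is worth considering.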

Count plot of Recency (months)

plt.figure(figsize=(8,5))
sns.countplot(x= "Recency (months)",data=blood_transfusion_data)

Count plot of Frequency (times)

plt.figure(figsize=(8,5))
sns.countplot(x= "Frequency (times)",data=blood_transfusion_data)

Count plot of Monetary (c.c. blood)

plt.figure(figsize=(18,5))
sns.countplot(x= "Monetary (c.c. blood)",data=blood_transfusion_data)

Count plot of Time (months)

plt.figure(figsize=(20,5))
sns.countplot(x= "Time (months)",data=blood_transfusion_data)

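Box plot of each feature column

# All columns except the target (the last one)
num_cols = blood_transfusion_data.columns[:-1]
sns.set_theme(style="whitegrid")
# Draw one box plot per feature column, each in its own figure
for col in num_cols:
    plt.figure(figsize=(10,5))
    ax = sns.boxplot(x=blood_transfusion_data[col])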
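The points drawn beyond the whiskers of a box plot are, by default, values more than 1.5 times the interquartile range away from the quartiles. The sketch below counts such values per column so the plots can be compared with numbers. The helper name `iqr_outlier_counts` and the toy frame are assumptions made for illustration.

```python
import pandas as pd


def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series index to the columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Hand-made frame (not the real data): 100 is far from the rest of "x".
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [5, 5, 6, 6, 7]})
outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

On the real data, `iqr_outlier_counts(blood_transfusion_data[num_cols])` would give one count per feature column. A high count does not by itself mean the values are errors; skewed columns naturally produce many points beyond the whiskers.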

Other related data

Occupancy Detection Data Set - Classification

Census income Data Set - Classification

Wholesale customer - Classification and Clustering

Online retail dataset - classification, clustering and regression

Cervical Cancer Risk Factor Dataset - classification and clustering

Divorce Predictor Dataset -classification

Student performance dataset - Classification and Regression

Fire Forest Dataset - Regression

Heart Disease dataset -Classification
