Blood Transfusion Service Center Data Set

Description :

This dataset was taken from a blood transfusion service center in Taiwan. It contains information about each blood donor, e.g. the number of months since the last donation, the total number of donations, the total volume of blood donated and the number of months since the first donation. The dataset consists of 748 instances and 5 attributes (4 features and 1 target). We can use it to predict whether a donor donated blood in March 2007.

Recommended Model :

Suggested algorithms: TPOT Classifier, logistic regression, etc.
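As a starting point for the logistic regression option, a minimal baseline could look like the sketch below. It assumes scikit-learn is installed and that the data has been loaded into a DataFrame with the column names listed in the overview. The helper name `fit_baseline`, the 75/25 split and the scaling step are illustrative choices, not part of the dataset description. TPOT is not shown here.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

TARGET = "whether he/she donated blood in March 2007"


def fit_baseline(df: pd.DataFrame, target: str = TARGET, seed: int = 42):
    """Fit a scaled logistic regression and return (model, held-out accuracy)."""
    X = df.drop(columns=[target])
    y = df[target]
    # Stratify so that both classes appear in the train and test splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Usage, assuming the file has been loaded as in the EDA section:
# model, accuracy = fit_baseline(blood_transfusion_data)
```

Accuracy alone can be misleading if one class dominates, so it is worth comparing the score against the share of the majority class.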

Recommended Projects :

Predict whether a donor donated blood in March 2007.

Dataset links

UCI MLR - https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center

Kaggle - https://www.kaggle.com/shabbir94/blood-transfusion

Overview of data

Detailed overview of dataset

Records in the dataset = 748 ROWS
Columns in the dataset = 5 COLUMNS

Recency (months) - The number of months since the most recent donation
Frequency (times) - Total number of donations made by the donor
Monetary (c.c. blood) - Total volume of blood the donor has donated, in c.c. (cubic centimeters)
Time (months) - Number of months since the donor's first donation
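Frequency and Monetary may carry the same information if every donation has a fixed volume. The sketch below checks this. The 250 c.c. per donation figure is an assumption to verify against the data, and `is_fixed_volume` is a hypothetical helper name.

```python
import pandas as pd


def is_fixed_volume(df: pd.DataFrame, volume_cc: int = 250) -> bool:
    """Return True if Monetary equals Frequency * volume_cc in every row."""
    expected = df["Frequency (times)"] * volume_cc
    return bool((df["Monetary (c.c. blood)"] == expected).all())


# Made-up rows for illustration only (not taken from the dataset).
sample = pd.DataFrame({
    "Frequency (times)": [1, 4, 10],
    "Monetary (c.c. blood)": [250, 1000, 2500],
})
print(is_fixed_volume(sample))  # True
```

If the check returns True on the real data, one of the two columns can be dropped before modelling without losing information.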

Target Variable

whether he/she donated blood in March 2007 - A binary variable indicating whether the donor donated blood in March 2007 (0 = did not donate, 1 = donated)

EDA[Code]

Blood donation Dataset

import pandas as pd
# Load data (forward slashes work on Windows, Linux and macOS)
file_loc = "data/transfusion.DATA"
blood_transfusion_data = pd.read_csv(file_loc)
blood_transfusion_data.head()

Total number of rows and columns in the dataset.

# Number of Rows and columns 
rows_col = blood_transfusion_data.shape
print("Total number of Rows in the dataset : {}".format(rows_col[0]))
print("Total number of columns in the dataset : {}".format(rows_col[1]))

Dataset information

# Data information
blood_transfusion_data.info()

Check the number of missing values in the dataset.

# Check the number of missing values in each column
blood_transfusion_data.isna().sum()

Statistical information.

# Statistical information
blood_transfusion_data.describe()

Data Visualization

Correlation

import seaborn as sns
import matplotlib.pyplot as plt
# correlation
corr = blood_transfusion_data.corr()
corr.style.background_gradient(cmap='coolwarm')
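The full matrix can be hard to read at a glance. As a complement, the sketch below ranks the features by the strength of their correlation with the target. The helper name `correlation_with_target` is hypothetical, and it assumes all columns are numeric.

```python
import pandas as pd

TARGET = "whether he/she donated blood in March 2007"


def correlation_with_target(df: pd.DataFrame, target: str = TARGET) -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Usage, assuming the DataFrame loaded earlier:
# print(correlation_with_target(blood_transfusion_data))
```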

Plot the count plot of Target Variable

# 0 means no, 1 - yes
sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
sns.countplot(x= "whether he/she donated blood in March 2007",data=blood_transfusion_data)
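The count plot shows the class balance visually. To get the same information as numbers, a small helper such as the one below can be used. The name `class_balance` is hypothetical.

```python
import pandas as pd

TARGET = "whether he/she donated blood in March 2007"


def class_balance(df: pd.DataFrame, target: str = TARGET) -> pd.DataFrame:
    """Return the count and the share of each target class."""
    counts = df[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Usage, assuming the DataFrame loaded earlier:
# print(class_balance(blood_transfusion_data))
```

If the shares turn out to be far from 50/50, stratified splits and metrics other than plain accuracy are worth considering.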

Count plot of Recency (months)

plt.figure(figsize=(8,5))
sns.countplot(x= "Recency (months)",data=blood_transfusion_data)

Count plot of Frequency (times)

plt.figure(figsize=(8,5))
sns.countplot(x= "Frequency (times)",data=blood_transfusion_data)

Count plot of Monetary (c.c. blood)

plt.figure(figsize=(18,5))
sns.countplot(x= "Monetary (c.c. blood)",data=blood_transfusion_data)

Count plot of Time (months)

plt.figure(figsize=(20,5))
sns.countplot(x= "Time (months)",data=blood_transfusion_data)

Box plots of the feature columns

# Box plot of each feature column (every column except the last, which is the target)
sns.set_theme(style="whitegrid")
num_cols = blood_transfusion_data.columns[:-1]
for col in num_cols:
    plt.figure(figsize=(10,5))
    sns.boxplot(x=blood_transfusion_data[col])
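Box plots mark points beyond the whiskers as outliers, conventionally using the 1.5 x IQR rule. The sketch below counts such points per column so the plots can be backed by numbers. The helper name `iqr_outlier_counts` and the default multiplier are illustrative assumptions.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()


# Usage, assuming the DataFrame loaded earlier (all columns except the target):
# print(iqr_outlier_counts(blood_transfusion_data.iloc[:, :-1]))
```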