Description :
The dataset was collected at the Hospital Universitario de Caracas in Caracas, Venezuela. It contains habits, demographic information, and historical medical records for 858 patients. Of these, 55 were diagnosed with cervical cancer and 803 were healthy. Cervical cancer is the leading gynecological malignancy worldwide and, according to the WHO, it is most common among women in developing countries.
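The counts above imply a strongly imbalanced dataset, which matters when judging any model trained on it. A minimal sketch of the arithmetic, using only the numbers quoted in the description (the variable names are illustrative):

```python
# Counts quoted in the description
n_total = 858
n_positive = 55   # patients diagnosed with cervical cancer
n_negative = 803  # healthy patients

# Share of positive cases in the dataset
positive_rate = n_positive / n_total

# Accuracy of a trivial model that always predicts "healthy"
majority_baseline = n_negative / n_total

print(f"Positive rate: {positive_rate:.1%}")          # Positive rate: 6.4%
print(f"Majority baseline: {majority_baseline:.1%}")  # Majority baseline: 93.6%
```

Because a model that never predicts cancer already reaches about 93.6% accuracy, metrics such as recall or balanced accuracy are likely to be more informative than plain accuracy here.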
Recommended Model :
Suggested algorithms include decision trees, logistic regression, support vector machines, and KNN.
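As a rough sketch of how these algorithms could be compared, the snippet below runs each one through 5-fold cross-validation with scikit-learn. It uses synthetic stand-in data from `make_classification` (not the real patient records), sized to resemble the dataset's 858 rows and roughly 6% positive class. The feature count, hyperparameters, and scoring metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the real dataset): 858 rows, about 6% positives
X, y = make_classification(
    n_samples=858, n_features=10, weights=[0.94], random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler
models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

Balanced accuracy is used instead of plain accuracy because of the class imbalance; on the real data, the missing values would need handling before any of these models can be fitted.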
Recommended Projects :
Predict the likelihood of cervical cancer from a patient's risk factors.
Dataset link
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29
Overview of data
Detailed overview of dataset
Records in the dataset = 858 ROWS
Columns in the dataset = 36 COLUMNS
Age : Age of patients
Number of sexual partners (numerical)
First sexual intercourse (age)
Number of pregnancies
Smokes (0-no, 1-yes)
Smokes (years)
Smokes (packs/year)
Hormonal Contraceptives
Hormonal Contraceptives (years)
IUD
IUD (years)
STDs
STDs: number (integer)
STDs:condylomatosis
STDs:cervical condylomatosis
STDs:vaginal condylomatosis
STDs:vulvo-perineal condylomatosis
STDs:syphilis
STDs:pelvic inflammatory disease
STDs:genital herpes
STDs:molluscum contagiosum
STDs:AIDS
STDs:HIV
STDs:Hepatitis B
STDs:HPV
STDs: Number of diagnosis
STDs: Time since first diagnosis
STDs: Time since last diagnosis
Dx:Cancer
Dx:CIN
Dx:HPV
Dx
Target variables - There are four target variables in this dataset
Hinselmann: target variable
Schiller: target variable
Cytology: target variable
Biopsy: class or target variable
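With four target columns, the risk-factor features need to be separated from the targets before modelling. The sketch below shows one way to do this on a tiny made-up frame (not real records). The column names follow the list above, and picking `Biopsy` as the label is an assumption for illustration.

```python
import pandas as pd

# Made-up two-row frame with a few of the listed columns (not real records)
df = pd.DataFrame({
    "Age": [18, 52],
    "Smokes": [0, 1],
    "Hinselmann": [0, 1],
    "Schiller": [0, 1],
    "Cytology": [0, 0],
    "Biopsy": [0, 1],
})

# Names follow the list above; the spelling in the CSV header may differ
target_cols = ["Hinselmann", "Schiller", "Cytology", "Biopsy"]

# Drop all four targets so none of them leaks into the features
X = df.drop(columns=target_cols)

# Use one target as the label to predict
y = df["Biopsy"]

print(X.columns.tolist())
print(y.tolist())
```

Dropping all four targets from the features matters because the other three test results are closely related to the biopsy outcome and would otherwise leak label information.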
EDA[Code]
Dataset
import pandas as pd
# Load Data
file_loc = "data/risk_factors_cervical_cancer.csv"  # forward slash works on Windows, macOS, and Linux
cervical_cancer_data = pd.read_csv(file_loc)
cervical_cancer_data.head()
Total number of rows and columns in the dataset
# Number of Rows and columns
rows_col = cervical_cancer_data.shape
print("Total number of Rows in the dataset : {}".format(rows_col[0]))
print("Total number of columns in the dataset : {}".format(rows_col[1]))
Dataset information
# Data information
cervical_cancer_data.info()
Check the number of missing values in the dataset
# Missing Values
cervical_cancer_data.isna().sum()
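One caveat: this dataset marks unknown entries with "?", so `isna()` may report zero missing values even when gaps exist. The sketch below illustrates the effect on a tiny made-up sample (not real patient records) and shows two ways to surface the placeholders. The sample values and variable names are illustrative.

```python
import io

import pandas as pd

# Tiny made-up sample (not real records) that mimics the "?" placeholder
sample_csv = io.StringIO(
    "Age,Smokes,Smokes (years)\n"
    "18,0.0,0.0\n"
    "34,?,?\n"
    "52,1.0,37.0\n"
    "46,?,?\n"
)
sample = pd.read_csv(sample_csv)

# isna() finds nothing because "?" is read as an ordinary string
print(sample.isna().sum())

# Counting "?" per column reveals the hidden gaps
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)

# Re-reading with na_values="?" turns the placeholders into NaN
sample_csv.seek(0)
cleaned = pd.read_csv(sample_csv, na_values="?")
print(cleaned.isna().sum())
```

Applied to the real file, the same `na_values="?"` argument could be passed to the `pd.read_csv` call used earlier, which would also let the affected columns load as numeric types.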
Statistical information about the dataset
# Statistical information
cervical_cancer_data.describe()
Data Visualization
The number of patients diagnosed with cervical cancer
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
sns.countplot(x= "Biopsy",data=cervical_cancer_data)
Age of Patients
# Histogram
plt.figure(figsize=(8,5))
sns.histplot(x="Age",data=cervical_cancer_data)
Number of patients who smoke (0 = no, 1 = yes, ? = unknown)
# Count plot
plt.figure(figsize=(8,5))
sns.countplot(x= "Smokes",data=cervical_cancer_data)
Smoke years
# Histogram
plt.figure(figsize=(15,5))
sns.histplot(x="Smokes (years)",data=cervical_cancer_data)
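If a column such as "Smokes (years)" contains "?" entries, it is loaded as text, and a histogram would then treat each distinct string as a category rather than a number. A minimal sketch of converting such a column to numeric first, using made-up values for illustration:

```python
import pandas as pd

# Made-up values for illustration; "?" marks an unknown entry
smoke_years = pd.Series(["0.0", "?", "37.0", "1.27", "?"], name="Smokes (years)")

# errors="coerce" turns anything non-numeric (here "?") into NaN
smoke_years_num = pd.to_numeric(smoke_years, errors="coerce")

print(smoke_years_num.dtype)         # float64
print(smoke_years_num.isna().sum())  # 2
```

The converted series, with its NaN entries dropped, can then be passed to `sns.histplot` to get properly binned numeric axes.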