Problem Statement
The aim of the project is to build a machine learning model to predict whether a policyholder will initiate an auto insurance claim in the next year.
Dataset
In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policyholder. This dataset has the following files:
train.csv contains the training data, where each row corresponds to a policyholder and the target column signifies whether a claim was filed.
Implementation
Importing the required libraries
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Reading the training CSV file
main_df = pd.read_csv('./datasets/train.csv')
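As noted in the dataset description, missing values are encoded as -1. A minimal sanity check of how many such values each feature contains might look like this:
# Count how many values are encoded as -1 (i.e., missing) in each feature
missing_counts = (main_df == -1).sum()
# Show only the features that actually contain missing values
print(missing_counts[missing_counts > 0].sort_values(ascending=False))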
Since the number of policyholders who file a claim is small compared to the total number of policies issued in a year, the target variable will be imbalanced. To check the class imbalance:
print(f'The class count for the class 0 is {main_df.target.value_counts()[0]}')
print(f'The class count for the class 1 is {main_df.target.value_counts()[1]}')
Now let's check how many categorical, binary and numerical features are present in the dataset.
cat_count, bin_count, cont_count = 0, 0, 0
for i in main_df.columns:
    if i.endswith('cat'):
        cat_count += 1
    elif i.endswith('bin'):
        bin_count += 1
    elif i == 'id' or i == 'target':
        pass
    else:
        cont_count += 1
print(f'The number of categorical features is {cat_count}')
print(f'The number of binary features is {bin_count}')
print(f'The number of continuous features is {cont_count}')
Next, we perform Exploratory Data Analysis (EDA) along with correlation analysis to study the features and select the important ones. We then one-hot encode the categorical features and normalize the numerical features, as sketched below. Finally, we address the class imbalance by upsampling the minority class with the resample utility from scikit-learn (sklearn.utils.resample).
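A minimal sketch of the encoding and normalization step, assuming all categorical columns are one-hot encoded with pd.get_dummies and the remaining non-binary columns are scaled with scikit-learn's StandardScaler (the exact feature subset kept after EDA is not shown here):
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical features (columns ending with 'cat')
cat_cols = [c for c in main_df.columns if c.endswith('cat')]
main_df = pd.get_dummies(main_df, columns=cat_cols)

# Normalize the continuous/ordinal features
# (everything except id, target, binary flags, and the new dummy columns)
num_cols = [c for c in main_df.columns
            if c not in ('id', 'target') and not c.endswith('bin') and '_cat' not in c]
scaler = StandardScaler()
main_df[num_cols] = scaler.fit_transform(main_df[num_cols])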
from sklearn.utils import resample

class_0_df = main_df[main_df.target == 0]
class_1_df = main_df[main_df.target == 1]

# Upsample minority class
class_1_upsampled = resample(class_1_df,
                             replace=True,     # sample with replacement
                             n_samples=573518, # to match majority class
                             random_state=9)   # reproducible results

main_df_balanced = pd.concat([class_0_df, class_1_upsampled])
main_df_balanced.target.value_counts()
After class balancing, we split the dataset into training and testing sets. We then train different ML models to predict the target and, once the best-performing model is identified, tune its hyperparameters.
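A minimal sketch of this step, assuming logistic regression and a random forest as the candidate models and a small grid search over the forest's hyperparameters (the actual models and search space used may differ):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X = main_df_balanced.drop(columns=['id', 'target'])
y = main_df_balanced.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9, stratify=y)

# Compare a couple of baseline models on the held-out split
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=100)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))

# Tune the better-performing model with a small grid search
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=9), param_grid,
                    cv=3, scoring='f1', n_jobs=-1)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_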
After hyperparameter tuning, we generate the classification report for the best model.
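A sketch of producing that report with scikit-learn, reusing the tuned best_model and the held-out split from the sketch above:
from sklearn.metrics import classification_report

# Evaluate the tuned model on the held-out test split
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))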
If you need implementation for the above problem or any of its variants, feel free to contact us.