Problem Statement
We have compiled records of every fatal car accident reported in the United States since 2003. The challenge will be to predict whether or not a drunk driver was involved in the accident based on other data about the accident. For each accident, information was collected on the accident itself, the vehicles involved, and all persons in each of the vehicles. A large number of interesting variables are present, including the manner of collision, the number of fatalities, the types of injuries, the location, the time, the make/model of the cars, and detailed demographic information on the people involved.
Dataset
This dataset is a collection of statistics on US road traffic accidents. It contains roughly 100K examples with 20 features. Known as the FARS (Fatality Analysis Reporting System) dataset, it is publicly available on Kaggle and can also be loaded directly through dataset libraries such as PMLB.
Implementation
First, import the required libraries.
# Importing Library
from pmlb import fetch_data
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
Read the .csv file using pandas' read_csv function and view the first few rows of the data.
# Reading the .csv files
main_df = pd.read_csv('./dataset_1/fars.csv')
main_df.head()
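Since fetch_data is imported from pmlb above, the same dataset can also be pulled straight from the PMLB repository instead of a local .csv file. This is only a sketch, assuming the PMLB dataset identifier is 'fars' and that, as is conventional for PMLB, the label is stored in a 'target' column.
# Alternative: load the dataset directly from PMLB (assumed dataset name 'fars')
fars_df = fetch_data('fars')   # returns a pandas DataFrame with a 'target' column
print(fars_df.shape)           # sanity-check the number of rows and columns
fars_df.head()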
Now we encode the categorical features, and the numerical features are scaled and normalised so that no feature dominates purely because of the magnitude of its values. We then encode the labels into numerical values using the LabelEncoder class from the sklearn library. A minimal preprocessing sketch is shown below.
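In this sketch the label column name 'drunk_driver' is a placeholder assumed for illustration and should be matched to the actual FARS column; the steps themselves are standard sklearn calls.
# Preprocessing sketch (the label column name is assumed for illustration)
from sklearn.preprocessing import LabelEncoder, StandardScaler

label_col = 'drunk_driver'                      # hypothetical label column name
X = main_df.drop(columns=[label_col])
y = main_df[label_col]

cat_cols = X.select_dtypes(include='object').columns
num_cols = X.select_dtypes(include=np.number).columns

# Scale/normalise the numerical features so magnitude alone carries no weight
X[num_cols] = StandardScaler().fit_transform(X[num_cols])

# One-hot encode the categorical features
X = pd.get_dummies(X, columns=cat_cols)

# Encode the labels into numerical values
y = LabelEncoder().fit_transform(y)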
After preprocessing, we check for any missing values in the dataset, since some ML models are not designed to handle them. Next we perform feature selection and feature extraction, using correlation analysis and PCA to retain the important features. We then check for class imbalance, which is an important factor to keep in mind while developing ML models. A plot of the class imbalance is shown below.
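A sketch of those checks, assuming the preprocessed X and encoded y from the previous step, could look like the following; the correlation threshold and the number of PCA components are illustrative choices, not the values used in the original experiment.
# Check for missing values
print(main_df.isnull().sum())

# Feature selection via correlation analysis: drop one of each highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Feature extraction with PCA (number of components is an illustrative choice)
from sklearn.decomposition import PCA
X_pca = PCA(n_components=10).fit_transform(X_reduced)

# Visualise the class distribution to check for imbalance
sns.countplot(x=y)
plt.title('Class distribution')
plt.show()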
It is clearly visible that there is a huge class imbalance in the dataset. To balance it, we use the resample function from the sklearn library to oversample all the classes to match the count of class 1 labels. We then split the dataset into training and testing sets, try out different ML models, compare them, select the best-performing model, and perform hyperparameter tuning to optimise it.
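The sketch below, assuming the X_pca features and encoded labels y from above, illustrates that pipeline: oversampling each class with sklearn's resample (here up to the majority-class count, in the same spirit as matching the count of class 1 labels), splitting the data, comparing a few candidate models, and tuning one of them with a grid search. The candidate models and the parameter grid are illustrative choices rather than the tuned configuration from the original experiment.
from sklearn.utils import resample
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Oversample every class up to a common count (the size of the largest class)
data = pd.DataFrame(X_pca)
data['label'] = y
target_size = data['label'].value_counts().max()
balanced = pd.concat([
    resample(group, replace=True, n_samples=target_size, random_state=42)
    for _, group in data.groupby('label')
])
X_bal = balanced.drop(columns=['label'])
y_bal = balanced['label']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, random_state=42, stratify=y_bal)

# Compare a few candidate models (illustrative selection)
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))

# Hyperparameter tuning of the best-performing model (grid is illustrative)
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)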
For the full code on comparing the ML models and optimising the hyperparameters, visit or contact www.codersarts.com