Building a Student Intervention System with Machine Learning and a pandas DataFrame
Our goal for this project is to identify students who might need early intervention before they fail to graduate.
Which technique is best for this: classification or regression?
This is a classification problem, because there are only two discrete outcomes to predict:
Students who need early intervention.
Students who do not need early intervention.
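As a minimal sketch of why two discrete outcomes point to classification, the snippet below checks how many distinct labels a target column holds and turns it into a binary "needs intervention" flag. The values in the Series are made up for illustration, and the name needs_intervention is hypothetical, not something from the dataset.

```python
import pandas as pd

# Made-up stand-in for the "passed" target column
passed = pd.Series(["yes", "no", "yes", "yes"], name="passed")

# A target with a small, fixed set of labels calls for classification
labels = sorted(passed.unique())
is_binary = len(labels) == 2

# Hypothetical binary flag: 1 = student needs early intervention
needs_intervention = (passed == "no").astype(int)

print(labels, is_binary)
print(needs_intervention.tolist())
```

A regression target, by contrast, would be a continuous quantity such as a final grade.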
Exploring the Data
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score
# Read student data
student_data = pd.read_csv("student-data.csv")
print("Student data read successfully!")
Now show the first rows of the data using head():
# Further exploration using .head()
student_data.head()
Checking the shape of the dataset:
# This is a 395 x 31 DataFrame
student_data.shape
(395, 31)
Data Exploration
# TODO: Calculate number of students
n_students = student_data.shape[0]
# TODO: Calculate number of features
n_features = student_data.shape[1] - 1
# TODO: Calculate passing students
passed = student_data.loc[student_data.passed == 'yes', 'passed']
n_passed = passed.shape[0]
# TODO: Calculate failing students
failed = student_data.loc[student_data.passed == 'no', 'passed']
n_failed = failed.shape[0]
# TODO: Calculate graduation rate
total = float(n_passed + n_failed)
grad_rate = float(n_passed * 100 / total)
# Print the results
print("Total number of students: {}".format(n_students))
print("Number of features: {}".format(n_features))
print("Number of students who passed: {}".format(n_passed))
print("Number of students who failed: {}".format(n_failed))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))
Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%
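The same counts can also be read off in one step with value_counts(). The sketch below is only an illustration: it rebuilds a stand-in "passed" column from the counts reported above (265 "yes", 130 "no") rather than reading student-data.csv.

```python
import pandas as pd

# Stand-in for student_data["passed"], built from the counts reported above
passed = pd.Series(["yes"] * 265 + ["no"] * 130, name="passed")

# Absolute counts per class
counts = passed.value_counts()

# Class shares in percent
rates = passed.value_counts(normalize=True) * 100
grad_rate = round(float(rates["yes"]), 2)

print(counts["yes"], counts["no"])
print("Graduation rate: {:.2f}%".format(grad_rate))
```

Roughly two thirds of the students passed, so the classes are imbalanced, which is worth keeping in mind when splitting the data and choosing a metric.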
Preparing the Data
Identify feature and target columns
# List all columns
student_data.columns
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'passed'], dtype='object')
Target column:
# The target column "passed" is the last one: student_data.columns[-1]
# Slicing with [:-1] gets everything except that last element
student_data.columns[:-1]
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences'], dtype='object')
# Extract feature columns
feature_cols = list(student_data.columns[:-1])
# Extract target column 'passed'
target_col = student_data.columns[-1]
# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))
# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]
# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())
Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...
3     GP   F   15       U     GT3       T     4     2   health  services  ...
4     GP   F   16       U     GT3       T     3     3    other     other  ...

  higher internet romantic  famrel  freetime  goout  Dalc  Walc  health  absences
0    yes       no       no       4         3      4     1     1       3         6
1    yes      yes       no       5         3      3     1     1       3         4
2    yes      yes       no       4         3      2     2     3       3        10
3    yes      yes      yes       3         2      2     1     1       5         2
4    yes       no       no       4         3      2     1     2       5         4

[5 rows x 30 columns]
Preprocess Feature Columns
def preprocess_features(X):
    """Convert yes/no columns to 1/0 and categorical columns to dummy variables."""
    # Initialize new output DataFrame
    output = pd.DataFrame(index=X.index)
    # Investigate each feature column for the data
    # (DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement)
    for col, col_data in X.items():
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # If data type is still non-numeric, it is categorical: convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix=col)
        # Collect the revised columns
        output = output.join(col_data)
    return output
X_all = preprocess_features(X_all)
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))
Processed feature columns (48 total features): ['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
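To make the two encoding steps easier to follow, here is a small sketch on a made-up three-row frame. It is not the function used above: the helper name encode is hypothetical, and it uses map() for the yes/no columns instead of replace(), but it follows the same idea of turning binary columns into 1/0 and other text columns into dummy variables.

```python
import pandas as pd

# Made-up toy data with one categorical, one yes/no, and one numeric column
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "internet": ["yes", "no", "yes"],
    "age": [18, 17, 15],
})

def encode(frame):
    """Hypothetical helper: encode yes/no as 1/0, other text columns as dummies."""
    parts = []
    for col in frame.columns:
        data = frame[col]
        if pd.api.types.is_numeric_dtype(data):
            # Numeric columns pass through unchanged
            parts.append(data)
        elif set(data.unique()) <= {"yes", "no"}:
            # Binary yes/no columns become a single 1/0 column
            parts.append(data.map({"yes": 1, "no": 0}))
        else:
            # Other text columns become one dummy column per category
            parts.append(pd.get_dummies(data, prefix=col))
    return pd.concat(parts, axis=1)

encoded = encode(toy)
print(list(encoded.columns))
```

This is why the 30 original features grow to 48: each multi-valued categorical column is replaced by one column per category.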
Training and Testing Data Split
# TODO: Import any additional functionality you may need here
from sklearn.model_selection import train_test_split
# For initial train/test split, we can obtain stratification by simply using stratify = y_all:
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, stratify = y_all, test_size=95, random_state=42)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
Training set has 300 samples.
Testing set has 95 samples.
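One way to check what stratify does is to compare the pass rate in each split with the overall rate. The sketch below assumes synthetic data: the labels are rebuilt from the class counts reported earlier (265 "yes", 130 "no"), and row_id is a placeholder feature, not a column of the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the reported class counts
y_demo = pd.Series(["yes"] * 265 + ["no"] * 130, name="passed")
# Placeholder feature so there is something to split
X_demo = pd.DataFrame({"row_id": range(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=95, random_state=42
)

# Share of passing students overall and in each split
overall_rate = (y_demo == "yes").mean()
train_rate = (y_tr == "yes").mean()
test_rate = (y_te == "yes").mean()

print("overall: {:.3f}, train: {:.3f}, test: {:.3f}".format(
    overall_rate, train_rate, test_rate))
```

With stratification, both splits should keep a pass rate close to the overall one, so the test score is not distorted by an unlucky split.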
Training and Evaluating Models
def train_classifier(clf, X_train, y_train):
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    # Print the results
    print("Trained model in {:.4f} seconds".format(end - start))
def predict_labels(clf, features, target):
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    # Print and return results
    print("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')
def train_predict(clf, X_train, y_train, X_test, y_test):
    # Indicate the classifier and the training set size
    print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    # Print the results of prediction for both training and testing
    print("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
    print("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))
Finding Score:
# TODO: Import the three supervised learning models from sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# TODO: Initialize the three models
clf_A = GaussianNB()
clf_B = LogisticRegression(random_state=42)
clf_C = SVC(random_state=42)
# TODO: Set up the training set sizes
X_train_100 = X_train.iloc[:100, :]
y_train_100 = y_train.iloc[:100]
X_train_200 = X_train.iloc[:200, :]
y_train_200 = y_train.iloc[:200]
X_train_300 = X_train.iloc[:300, :]
y_train_300 = y_train.iloc[:300]
# train_predict(clf, X_train, y_train, X_test, y_test)
for clf in [clf_A, clf_B, clf_C]:
    print("\n{}: \n".format(clf.__class__.__name__))
    for n in [100, 200, 300]:
        train_predict(clf, X_train[:n], y_train[:n], X_test, y_test)
GaussianNB:

Training a GaussianNB using a training set size of 100. . .
Trained model in 0.0010 seconds
Made predictions in 0.0110 seconds.
F1 score for training set: 0.7752.
Made predictions in 0.0010 seconds.
F1 score for test set: 0.6457.
Training a GaussianNB using a training set size of 200. . .
Trained model in 0.0020 seconds
Made predictions in 0.0010 seconds.
F1 score for training set: 0.8060.
Made predictions in 0.0020 seconds.
F1 score for test set: 0.7218.
Training a GaussianNB using a training set size of 300. . .
Trained model in 0.0030 seconds
Made predictions in 0.0010 seconds.
F1 score for training set: 0.8134.
Made predictions in 0.0010 seconds.
F1 score for test set: 0.7761.

LogisticRegression: .............
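When reading F1 scores like these, it can help to have a reference point. The sketch below is an assumption-laden illustration, not part of the original analysis: it computes the F1 score of a trivial model that predicts 'yes' for every student, using the full-dataset counts reported earlier (265 passed, 130 failed). The value on the 95-sample test split may differ slightly.

```python
# Full-dataset counts reported earlier in the article
n_passed = 265
n_failed = 130

# A model that always predicts "yes" (positive label):
tp = n_passed   # every passing student is predicted correctly
fp = n_failed   # every failing student is wrongly predicted to pass
fn = 0          # no passing student is missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
baseline_f1 = 2 * precision * recall / (precision + recall)

print("Always-yes baseline F1: {:.4f}".format(baseline_f1))
```

Because of the class imbalance, this naive baseline already reaches an F1 of about 0.80, so a classifier is only useful here if it scores clearly above that.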
