Building a Student Intervention System with Machine Learning and a pandas DataFrame
Introduction
Our goal in this project is to identify students who might need early intervention before they fail to graduate.
Which technique is best for this: Classification vs. Regression?
This is a classification problem, because there are two possible discrete outcomes, typical of a classification problem:
Students who need early intervention.
Students who do not need early intervention.
Exploring the Data
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score
# Read student data
student_data = pd.read_csv("student-data.csv")
print("Student data read successfully!")
Now let's look at the first few rows of the data using head():
# Further Exploration using .head()
student_data.head()
Output: (the first five rows of the DataFrame)
Checking the shape of the dataset:
# This is a 395 x 31 DataFrame
student_data.shape
Output:
(395, 31)
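Since we framed the task as binary classification, we can also confirm that the target column takes exactly two values; a quick check:
# The target column 'passed' should hold only the two labels 'yes' and 'no'
print(student_data['passed'].unique())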
Data Exploration
# TODO: Calculate number of students
n_students = student_data.shape[0]
# TODO: Calculate number of features
n_features = student_data.shape[1] - 1
# TODO: Calculate passing students
passed = student_data.loc[student_data.passed == 'yes', 'passed']
n_passed = passed.shape[0]
# TODO: Calculate failing students
failed = student_data.loc[student_data.passed == 'no', 'passed']
n_failed = failed.shape[0]
# TODO: Calculate graduation rate
total = float(n_passed + n_failed)
grad_rate = float(n_passed * 100 / total)
# Print the results
print("Total number of students: {}".format(n_students))
print("Number of features: {}".format(n_features))
print("Number of students who passed: {}".format(n_passed))
print("Number of students who failed: {}".format(n_failed))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))
Output:
Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%
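As an aside, value_counts() yields the same pass/fail counts more compactly; a minimal equivalent sketch:
# Equivalent computation using value_counts()
counts = student_data['passed'].value_counts()
print(counts)  # yes: 265, no: 130
print("Graduation rate: {:.2f}%".format(100.0 * counts['yes'] / counts.sum()))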
Preparing the Data
Identify feature and target columns
# Columns
student_data.columns
Output:
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'passed'], dtype='object')
Target Column:
# We want to get the column name "passed", which is the last one
student_data.columns[-1]
Output:
'passed'
# This gets everything except the last element, which is "passed"
student_data.columns[:-1]
Output:
Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences'], dtype='object')
# Extract feature columns
feature_cols = list(student_data.columns[:-1])
# Extract target column 'passed'
target_col = student_data.columns[-1]
# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))
# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]
# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())
Output:
Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...
3     GP   F   15       U     GT3       T     4     2   health  services  ...
4     GP   F   16       U     GT3       T     3     3    other     other  ...

  higher internet romantic  famrel  freetime  goout  Dalc  Walc  health  absences
0    yes       no       no       4         3      4     1     1       3         6
1    yes      yes       no       5         3      3     1     1       3         4
2    yes      yes       no       4         3      2     2     3       3        10
3    yes      yes      yes       3         2      2     1     1       5         2
4    yes       no       no       4         3      2     1     2       5         4

[5 rows x 30 columns]
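Notice that many of these feature columns are non-numeric, which is why the preprocessing step in the next section is needed; a quick way to see how many:
# Count feature columns by dtype: 'object' columns are the categorical ones
# that scikit-learn estimators cannot consume directly
print(X_all.dtypes.value_counts())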
Preprocess Feature Columns
def preprocess_features(X):
    """Convert non-numeric binary columns to 0/1 and categorical columns to dummy variables."""
    # Initialize new output DataFrame
    output = pd.DataFrame(index=X.index)

    # Investigate each feature column for the data
    for col, col_data in X.items():  # .iteritems() is deprecated; .items() works in recent pandas
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is still non-numeric, the column is categorical: convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix=col)

        # Collect the revised columns
        output = output.join(col_data)

    return output
X_all = preprocess_features(X_all)
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))
Output:
Processed feature columns (48 total features): ['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
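For comparison, pandas can one-hot encode an entire DataFrame in a single call. This is not identical to the function above: pd.get_dummies() on its own would expand the binary yes/no columns into two dummy columns each instead of a single 1/0 column, so the feature count comes out higher than 48; a minimal sketch:
# Alternative (not used here): one-hot encode all object columns at once
X_alt = pd.get_dummies(student_data[feature_cols])
print(len(X_alt.columns))  # larger than 48, since yes/no columns are also expanded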
Training and Testing Data Split
# TODO: Import any additional functionality you may need here
from sklearn.model_selection import train_test_split
# For the initial train/test split, we can obtain stratification simply by passing stratify=y_all:
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, stratify = y_all, test_size=95, random_state=42)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
Output:
Training set has 300 samples.
Testing set has 95 samples.
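To verify that stratification preserved the class balance, we can compare the label proportions in the two splits; a minimal check:
# Both splits should show roughly the same ~67%/33% yes/no ratio as the full dataset
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))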
Training and Evaluating Models
def train_classifier(clf, X_train, y_train):
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    # Print the results
    print("Trained model in {:.4f} seconds".format(end - start))

def predict_labels(clf, features, target):
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    # Print timing and return the F1 score for the positive ('yes') class
    print("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')

def train_predict(clf, X_train, y_train, X_test, y_test):
    # Indicate the classifier and the training set size
    print("")
    print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    # Print the results of prediction for both training and testing
    print("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
    print("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))
Finding the Scores:
# TODO: Import the three supervised learning models from sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# TODO: Initialize the three models
clf_A = GaussianNB()
clf_B = LogisticRegression(random_state=42)
clf_C = SVC(random_state=42)
# TODO: Set up the training set sizes
X_train_100 = X_train.iloc[:100, :]
y_train_100 = y_train.iloc[:100]
X_train_200 = X_train.iloc[:200, :]
y_train_200 = y_train.iloc[:200]
X_train_300 = X_train.iloc[:300, :]
y_train_300 = y_train.iloc[:300]
# Execute train_predict for each classifier and each training set size
for clf in [clf_A, clf_B, clf_C]:
    print("\n{}: \n".format(clf.__class__.__name__))
    for X_tr, y_tr in [(X_train_100, y_train_100),
                       (X_train_200, y_train_200),
                       (X_train_300, y_train_300)]:
        train_predict(clf, X_tr, y_tr, X_test, y_test)
Output:
GaussianNB:

Training a GaussianNB using a training set size of 100. . .
Trained model in 0.0010 seconds
Made predictions in 0.0110 seconds.
F1 score for training set: 0.7752.
Made predictions in 0.0010 seconds.
F1 score for test set: 0.6457.

Training a GaussianNB using a training set size of 200. . .
Trained model in 0.0020 seconds
Made predictions in 0.0010 seconds.
F1 score for training set: 0.8060.
Made predictions in 0.0020 seconds.
F1 score for test set: 0.7218.

Training a GaussianNB using a training set size of 300. . .
Trained model in 0.0030 seconds
Made predictions in 0.0010 seconds.
F1 score for training set: 0.8134.
Made predictions in 0.0010 seconds.
F1 score for test set: 0.7761.

LogisticRegression: ...