Answer the topics below (1):
Understand the measures used to evaluate classification results and describe what they are:
Confusion matrix
Precision, Recall, Accuracy rate
Precision-Recall Curve
ROC curve
Explain in simple terms the concept of n-fold cross validation
Ans
Confusion Matrix
A confusion matrix is a performance measurement for machine learning classification problems. It is represented by an N×N matrix, where N is the number of target classes. For a binary problem it has four entries:
TP: "true positive" (a positive instance correctly predicted as positive)
FP: "false positive" (a negative instance incorrectly predicted as positive)
FN: "false negative" (a positive instance incorrectly predicted as negative)
TN: "true negative" (a negative instance correctly predicted as negative)
Precision, Recall, Accuracy rate
These are also metrics for measuring classification performance. Using the confusion matrix above, we can compute these metrics with the following formulas:
Precision= True positive/(True positive + False positive)
Recall = True positive/(True positive + False negative)
Accuracy Rate = (True positive + True negative)/(True positive + True negative + False positive + False negative)
(Note: the harmonic mean 2 × (Precision × Recall)/(Precision + Recall) is the F1 score, a separate metric.)
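As a minimal sketch (the y_true and y_pred arrays here are hypothetical placeholders for real labels and predictions), these metrics can be computed from the confusion matrix with scikit-learn:
Example:
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

# For binary labels, ravel() unpacks the 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('Precision:', tp / (tp + fp))
print('Recall:   ', tp / (tp + fn))
print('Accuracy: ', (tp + tn) / (tp + tn + fp + fn))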
Precision-Recall Curve & ROC curve
Precision-Recall Curve
These curves are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance.
This curve can be computed in scikit-learn with the precision_recall_curve() function, which takes the class labels and the predicted probabilities for the minority class and returns the precision, recall, and thresholds.
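A minimal sketch of that function in use (y_true and y_scores are hypothetical labels and predicted probabilities):
Example:
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1]             # hypothetical labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
plt.plot(recall, precision)       # recall on the x-axis, precision on the y-axis
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()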
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
True Positive Rate
False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP/(TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP/(FP + TN)
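As a minimal sketch (reusing the hypothetical y_true and y_scores from the precision-recall example), scikit-learn's roc_curve() returns FPR and TPR at every threshold:
Example:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
plt.plot(fpr, tpr)                # TPR against FPR across all thresholds
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
print('AUC:', roc_auc_score(y_true, y_scores))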
Explain in simple terms the concept of n-fold cross validation
Cross-validation is a technique to evaluate predictive models by partitioning the original sample into training and test sets:
A training set is used to train the model,
and a test set is used to evaluate it.
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples.
The steps are as follows (a code sketch follows this list):
Split your entire dataset into k "folds".
For each fold, build your model on the other k − 1 folds of the dataset and test it on the held-out fold.
Record the error you see on each of the predictions.
Repeat this until each of the k folds has served as the test set.
The average of your k recorded errors is called the cross-validation error and serves as the performance metric for the model.
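A minimal sketch of these steps (the X, y arrays and the LogisticRegression model are hypothetical stand-ins for your own data and model):
Example:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 4)            # hypothetical features
y = np.random.randint(0, 2, 100)      # hypothetical binary labels

errors = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # Error on the held-out fold = 1 - accuracy
    errors.append(1 - accuracy_score(y[test_idx], model.predict(X[test_idx])))

print('Cross-validation error:', np.mean(errors))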
Answer the topics below (2):
Linear Regression
What is the cost function for Linear Regression
Polynomial Regression
Describe how polynomial regression works based on linear regression.
Ans:
Linear Regression: Cost Function of Linear Regression
Linear Regression is a machine learning algorithm based on supervised learning. It is used to predict a real-valued output based on an input value.
The cost function F of Linear Regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the true y value (y):
F = sqrt((1/n) × Σ_i (pred_i − y_i)²)
where pred_i is the predicted value and y_i is the actual value.
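A minimal sketch of this cost computation (the y and pred arrays are hypothetical):
Example:
import numpy as np

y = np.array([3.0, 5.0, 7.0])             # hypothetical true values
pred = np.array([2.5, 5.5, 7.5])          # hypothetical predicted values

rmse = np.sqrt(np.mean((pred - y) ** 2))  # square root of the mean squared error
print('RMSE cost:', rmse)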
Polynomial Regression
Polynomial regression is a special case of linear regression in which we fit a polynomial equation to data that have a curvilinear relationship between the target variable and the independent variable.
Equation for Linear Regression:
Y = 𝜃0 + 𝜃1·x
where Y is the target, x is the predictor, 𝜃0 is the bias, and 𝜃1 is the weight in the regression equation.
This linear equation can represent a linear relationship. In polynomial regression, however, we have a polynomial equation of degree n:
Equation for Polynomial Regression:
Y = 𝜃0 + 𝜃1·x + 𝜃2·x² + … + 𝜃n·xⁿ
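A minimal sketch of polynomial regression built on top of linear regression (the x and y data are hypothetical): the predictor is expanded into polynomial features, and an ordinary linear regression is fitted on the expanded features.
Example:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 5, 50).reshape(-1, 1)      # hypothetical predictor
y = 2 + 3 * x.ravel() - 0.5 * x.ravel() ** 2  # hypothetical curvilinear target

# Expand x into [x, x^2]; LinearRegression fits the intercept (theta_0) itself
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print('Weights:', model.coef_, 'Bias:', model.intercept_)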
Answer the topics below (3):
Logistic Regression
The formula that updates the weights of attributes for each iteration
Softmax Regression
What is the purpose of Softmax Regression?
Given a softmax Regression model, please calculate the probability that the input attribute belongs to each class.
Support Vector Machine
Compared with logistic regression, what is the advantage of Support Vector Machine?
Ans:
Logistic Regression
The formula that updates the weights of attributes for each iteration
Logistic regression uses an equation as its representation, very much like linear regression: input values (X) are combined linearly using weights or coefficient values to predict an output value (y), which is passed through the sigmoid function:
ŷ = 1 / (1 + e^−(w·x + b))
With gradient descent on the cross-entropy loss, each weight is updated at every iteration as:
w_j := w_j + α × (y − ŷ) × x_j
where α is the learning rate, y is the true label, and ŷ is the current prediction.
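A minimal sketch of this per-iteration update (stochastic gradient descent; the X, y data and the learning rate are hypothetical):
Example:
import numpy as np

X = np.array([[0.5, 1.2], [1.5, 0.3], [0.2, 0.8], [1.1, 1.9]])  # hypothetical inputs
y = np.array([0, 1, 0, 1])                                      # hypothetical labels

w = np.zeros(X.shape[1])
b = 0.0
alpha = 0.1  # learning rate

for epoch in range(100):
    for x_i, y_i in zip(X, y):
        y_hat = 1.0 / (1.0 + np.exp(-(np.dot(w, x_i) + b)))  # sigmoid prediction
        w += alpha * (y_i - y_hat) * x_i  # w_j := w_j + alpha * (y - y_hat) * x_j
        b += alpha * (y_i - y_hat)
print('Weights:', w, 'Bias:', b)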
Softmax Regression
Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification.
In softmax regression (SMR), we replace the sigmoid logistic function with the so-called softmax function φ:
φ(z_j) = e^(z_j) / Σ_k e^(z_k)
where we define the net input z for each class j as
z_j = w_j · x + b_j
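To calculate the probability that an input belongs to each class, compute the net input z for every class and apply the softmax. A minimal sketch (the weight matrix W, biases b, and input x are hypothetical):
Example:
import numpy as np

W = np.array([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]])  # hypothetical weights: 3 classes x 2 features
b = np.array([0.01, 0.1, -0.05])                      # hypothetical biases
x = np.array([1.0, 2.0])                              # hypothetical input attributes

z = W @ x + b                        # net input z for each class
z -= z.max()                         # shift for numerical stability
probs = np.exp(z) / np.exp(z).sum()  # phi(z_j) = e^(z_j) / sum_k e^(z_k)
print('Class probabilities:', probs) # the probabilities sum to 1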
Support Vector Machine
Logistic regression and support vector machines are supervised machine learning algorithms. They are both used to solve classification problems.
SVM tries to find the "best" (maximum-margin) boundary that separates the classes, which reduces the risk of error on the data; logistic regression does not, and can settle on different decision boundaries, with different weights, that are merely near the optimal point.
Advantages of Support Vector Machine (SVM)
1. Regularization capabilities: SVM has a built-in L2 regularization feature, so it has good generalization capabilities, which prevents over-fitting.
2. Handles non-linear data efficiently: SVM can efficiently handle non-linear data using the kernel trick (see the sketch after this list).
3. Solves both classification and regression problems: SVM is used for classification problems, while SVR (Support Vector Regression) is used for regression problems.
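A minimal sketch of the kernel trick in practice (make_moons provides a toy non-linear dataset; the kernel and C value are hypothetical choices):
Example:
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
clf = SVC(kernel='rbf', C=1.0)  # the RBF kernel handles the non-linear boundary
clf.fit(X, y)
print('Training accuracy:', clf.score(X, y))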
Answer the topics below (4):
Decision Tree
How is a Decision Tree Model trained?
How to make a prediction on a new instance, and how to calculate the prediction probability?
What is Gini Impurity Measure?
How to calculate Gini Impurity Measure?
What is Regularization?
What are the typical ways to regularize a tree model?
Random Forest
How is a Random Forest trained?
Ans:
Decision Tree
How is a Decision Tree Model trained?
Below are the basic steps (Step 1 – Step 3) used before training the decision tree:
Step 1: Loading the Libraries and Dataset
Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Importing the dataset
df = pd.read_csv('dataset.csv')
df.head()
Step 2: Data Preprocessing
The most important part of data science is data preprocessing and feature engineering.
In this step we deal with the categorical variables in the data and impute the missing values.
Step 3: Creating Train and Test Sets
In this step we split the dataset into training and test sets, separating out the target variable we want to predict; a sketch follows.
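A minimal sketch of the split (the 'target' column name is hypothetical; use your dataset's actual target column):
Example:
X = df.drop('target', axis=1)  # 'target' is a hypothetical column name
Y = df['target']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)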
Step 4: Building and Evaluating the Model(Train Model)
With the training and testing sets ready, it's time to train our model and classify the data. First, we will train a decision tree on this dataset:
Example:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
dt.fit(X_train, Y_train)
dt_pred_train = dt.predict(X_train)
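To make a prediction on a new instance and obtain the prediction probability, scikit-learn exposes predict and predict_proba on the fitted tree. A minimal sketch (reusing X_test from the hypothetical split above):
Example:
new_instance = X_test.iloc[[0]]  # hypothetical new instance
print('Predicted class:', dt.predict(new_instance))
# predict_proba returns, for each class, the fraction of training samples of
# that class in the leaf node the instance falls into.
print('Class probabilities:', dt.predict_proba(new_instance))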
What is Gini Impurity Measure?
Gini Impurity measures the disorder of a set of elements. It is calculated as the probability of mislabeling an element, assuming that the element is randomly labeled according to the distribution of all the classes in the set.
Formula:
G = 1 − Σ_i p_i²
where p_i is the probability of class i (e.g., p1 and p2 for classes 1 and 2).
How to calculate Gini Impurity Measure?
Suppose 3 apples, 3 bananas, and 6 cherries are given; then we find the Gini impurity as in the following worked example (a code sketch follows):

          apples   bananas   cherries
count =     3        3         6
p     =    3/12     3/12      6/12
      =    1/4      1/4       1/2

GI = 1 - [ (1/4)^2 + (1/4)^2 + (1/2)^2 ]
   = 1 - [ 1/16 + 1/16 + 1/4 ]
   = 1 - 6/16
   = 10/16
   = 0.625
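A minimal sketch reproducing the worked example:
Example:
counts = {'apples': 3, 'bananas': 3, 'cherries': 6}
total = sum(counts.values())
gini = 1 - sum((c / total) ** 2 for c in counts.values())
print('Gini impurity:', gini)  # 0.625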
What is Regularization?
Regularization is used to reduce the complexity of the learned function without actually reducing the degree of the underlying polynomial function. In other words, it is an attempt to solve the overfitting problem in statistical models.
What are the typical ways to regularize a tree model?
There are several simple regularization methods (a hyperparameter sketch follows this list):
Minimum number of points per cell: require that each cell (i.e., each leaf node) cover a given minimum number of training points.
Maximum number of cells: limit the maximum number of cells of the partition (i.e., leaf nodes).
Maximum depth: limit the maximum depth of the tree.
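A minimal sketch mapping these three methods to scikit-learn hyperparameters (the specific values are hypothetical):
Example:
from sklearn.tree import DecisionTreeClassifier

dt_reg = DecisionTreeClassifier(
    min_samples_leaf=5,   # minimum number of points per cell (leaf)
    max_leaf_nodes=20,    # maximum number of cells (leaves)
    max_depth=4,          # maximum depth of the tree
    random_state=42)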
How is a Random Forest trained?
A random forest is trained by building many decision trees, each on a bootstrap sample of the training data, with each split considering a random subset of the features; the trees' predictions are then combined by majority vote (classification) or averaging (regression). Below is an example of training a random forest:
Example:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(criterion='entropy', random_state=42)
rfc.fit(X_train, Y_train)

# Evaluating on the training set
rfc_pred_train = rfc.predict(X_train)
print('Training Set Evaluation F1-Score =>', f1_score(Y_train, rfc_pred_train))