Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
Cross-validation in machine learning is used to divide the dataset into train and test subsets in a more systematic way, to overcome the problems of a single train_test_split().
With train_test_split(), the data are divided into train and test sets randomly, so a single split may not give a reliable accuracy estimate. To overcome this problem, several cross-validation techniques are used, each of which selects the train and test data in a different way (a brief train_test_split() sketch follows the list below):
Leave one out Cross-Validation
k-fold Cross-Validation
Time series Cross-Validation
Stratified Cross-Validation
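For comparison, a single random split with scikit-learn's train_test_split() looks like the sketch below; the toy X and y arrays are made up purely for illustration.

from sklearn.model_selection import train_test_split
import numpy

X = numpy.arange(20).reshape(10, 2)   # toy features (illustration only)
y = numpy.arange(10)                  # toy labels

# One random 75/25 split; a different random_state gives a different split,
# so the measured accuracy depends on which rows happen to land in the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)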
Leave one out Cross-Validation
In leave-one-out cross-validation, a single record is held out as the test set and all the remaining records are used as the training set. In the next step the second record is held out and the rest are used to train the model, and this repeats until every record has been used as the test set once. The model therefore runs as many iterations as there are records, and its accuracy is reported as the average of all the results. Because the number of iterations grows with the size of the dataset, another technique, k-fold cross-validation, is often used instead.
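As a rough sketch, scikit-learn's LeaveOneOut iterator behaves this way: with n records it produces n train/test splits, each holding out a single record (the small data array below is made up for illustration).

from sklearn.model_selection import LeaveOneOut
import numpy

data = numpy.array([[0, 1], [2, 3], [4, 5], [6, 7]])
loo = LeaveOneOut()

# One iteration per record: each record is the test set exactly once.
for train_index, test_index in loo.split(data):
    print("Train:", train_index, "Test:", test_index)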
k-fold cross-validation
In k-fold cross-validation, the data are split into k folds; the model is trained and evaluated over k iterations, and the final accuracy is the average of the k iteration results.
For example, suppose a dataset has 1000 records and we select k = 4. The model runs for 4 iterations; in each iteration, 1000/4 = 250 records are used as test data and the remaining 750 are used to train the model.
Fig:
iteration1: [ T - - - ] iteration2: [ - T - - ] iteration3: [ - - T - ] iteration4: [ - - - T ]
Here T marks the fold used for testing in each iteration; the remaining folds are used for training.
Machine learning code for k-fold cross-validation
Example:
In this example, we will split the data into K = 5 folds.
import numpy
from sklearn.model_selection import KFold

NUM_SPLITS = 5
data = numpy.array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]])

kfold = KFold(n_splits=NUM_SPLITS)
split_data = kfold.split(data)
After this, print the train and test indices produced by each split:
print("Train: ", "TEST:")
for train_index, test_index in split_data:
print(train_index, test_index)
Output:
Train:  TEST:
[1 2 3 4] [0]
[0 2 3 4] [1]
[0 1 3 4] [2]
[0 1 2 4] [3]
[0 1 2 3] [4]
Time series Cross-Validation
This cross-validation is used for time-series data. Two methods are commonly used:
Predict Second Half
Day Forward-Chaining
Predict Second Half
Predict Second Half uses only a single train/test split: the first half of the data (split temporally) is assigned to the training set and the latter half becomes the test set. The advantage is that this method is easy to implement; however, it still suffers from the limitation of an arbitrarily chosen test set.
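A minimal sketch of this single temporal split, assuming the records are already sorted in time order (the toy series is made up for illustration):

import numpy

series = numpy.arange(10)                 # toy time-ordered data (illustration only)
mid = len(series) // 2

train, test = series[:mid], series[mid:]  # first half trains, second half tests
print("Train:", train)
print("Test: ", test)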
Day Forward-Chaining
The technique we use, called Day Forward-Chaining, is based on a method called forward-chaining. Using this method, we successively consider each day as the test set and assign all previous data to the training set.
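scikit-learn's TimeSeriesSplit follows the same forward-chaining idea, although it expands the training window fold by fold rather than literally day by day; the toy array below is for illustration only.

from sklearn.model_selection import TimeSeriesSplit
import numpy

data = numpy.arange(12)            # toy time-ordered observations
tscv = TimeSeriesSplit(n_splits=3)

# Each split tests on a later block and trains on everything before it.
for train_index, test_index in tscv.split(data):
    print("Train:", train_index, "Test:", test_index)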
Stratified K-Fold Cross-Validation
Stratification is the process of rearranging the data to ensure each fold is a good representative of the whole. For example, in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold each class comprises around half the instances.
For example:
Suppose we have a dataset with 80 class-0 records and 20 class-1 records. The mean response value is (80*0 + 20*1)/100 = 0.2, and we want 0.2 to be the mean response value of every fold as well. Checking the mean response this way is also a quick EDA check for whether a dataset is imbalanced, instead of counting the classes.
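A short sketch of how this looks with scikit-learn's StratifiedKFold, using a made-up 80/20 label vector like the one above; each fold should keep a mean response close to 0.2.

from sklearn.model_selection import StratifiedKFold
import numpy

X = numpy.zeros((100, 1))                  # dummy features
y = numpy.array([0] * 80 + [1] * 20)       # 80 class-0 and 20 class-1 records

skf = StratifiedKFold(n_splits=5)

# Stratification keeps each fold's class balance close to the full dataset's.
for train_index, test_index in skf.split(X, y):
    print("Fold mean response:", y[test_index].mean())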