Machine Learning Project Help - Decision Tree from Scratch

Decision trees can handle both categorical and numerical data, and they are used for both classification and regression problems. Some tree algorithms can also cope with missing values.

Tree-building algorithms break a data set down into smaller and smaller subsets while the associated decision tree is developed incrementally. The final result is a tree with decision nodes and leaf nodes.

Collect the data

Here we implement a decision tree from scratch and apply it to the Banknote case study.

You can collect the data from here (Banknote data).

The steps we cover are:

1. Gini Index.

2. Create a Split.

3. Build a Tree.

4. Make a Prediction.

5. Banknote Case Study.

Step One: Load the Data Set

The first step is to load the dataset and convert the loaded data to numbers that we can use to calculate split points. For this, we will use the helper function load_csv() to load the file and str_column_to_float() to convert string numbers to floats.

We will evaluate the algorithm using k-fold cross-validation with 5 folds. Since 1372/5 = 274.4 and the fold size is rounded down, each fold will hold 274 records. We will use the helper function evaluate_algorithm() to evaluate the algorithm with cross-validation and accuracy_metric() to calculate the accuracy of predictions.
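The fold arithmetic can be checked with a few lines of Python. This is only a small sketch; the variable names are ours, and the row count of 1372 is the one quoted above.

```python
# Fold-size arithmetic for 5-fold cross-validation on 1372 rows.
n_rows = 1372
n_folds = 5

# Integer division mirrors int(len(dataset) / n_folds) in cross_validation_split().
fold_size, leftover = divmod(n_rows, n_folds)

print(fold_size)                  # 274 rows in each fold
print(leftover)                   # 2 rows are not assigned to any fold
print(fold_size * (n_folds - 1))  # 1096 rows left for training in each round
```

Because the fold size is rounded down, the two leftover rows are simply never used in a given run.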

A new function named decision_tree() was developed to manage the application of the CART algorithm, first creating the tree from the training dataset, then using the tree to make predictions on a test dataset.

Import libraries


#Import Libraries
from random import seed
from random import randrange
from csv import reader

Load CSV


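# Load a CSV file
def load_csv(filename):
    # The with-block closes the file once it has been read
    with open(filename, "rt", newline="") as file:
        # Skip blank lines so the later float conversion does not fail
        dataset = [row for row in reader(file) if row]
    return dataset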

Convert non-numeric to numeric

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        # Strip surrounding whitespace, then convert in place
        row[column] = float(row[column].strip())
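To see what the conversion does, here is a minimal sketch that reads CSV text from memory instead of a file. The two rows are made-up values, and the five-column layout (four features followed by a class label) is our assumption about the banknote file, not something stated above.

```python
from csv import reader
from io import StringIO

def str_column_to_float(dataset, column):
    # Convert one column of every row from string to float, in place
    for row in dataset:
        row[column] = float(row[column].strip())

# Made-up rows: four feature values, then the class label (assumed layout)
sample_text = "1.5,2.0,-0.5,0.25,0\n-3.0,4.5,1.0,-1.25,1\n"
dataset = list(reader(StringIO(sample_text)))
print(dataset[0])  # ['1.5', '2.0', '-0.5', '0.25', '0']

# Convert every column, including the class label
for column in range(len(dataset[0])):
    str_column_to_float(dataset, column)
print(dataset[0])  # [1.5, 2.0, -0.5, 0.25, 0.0]
```

Note that the class label becomes a float as well (0.0 and 1.0), which is fine as long as labels are compared consistently.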

Split the Data Set into k Folds


# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

Calculate Accuracy


# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        # Count the predictions that match the true label
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

Applying k-fold cross-validation


# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        # Train on every fold except the one held out for testing
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        # Copy the held-out rows and hide their class labels
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores
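evaluate_algorithm() accepts any function that takes a training set, a test set and optional extra arguments, and returns one prediction per test row. The sketch below illustrates that interface with a hypothetical majority_class() baseline and made-up data; neither comes from the case study, and the helpers are restated in compact form only so the block runs on its own.

```python
from random import seed, randrange

def cross_validation_split(dataset, n_folds):
    # Randomly deal rows into n_folds folds of equal size
    dataset_split = []
    dataset_copy = list(dataset)
    fold_size = len(dataset) // n_folds
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            fold.append(dataset_copy.pop(randrange(len(dataset_copy))))
        dataset_split.append(fold)
    return dataset_split

def accuracy_metric(actual, predicted):
    # Percentage of predictions that match the true labels
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0

def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = []
    for fold in folds:
        # Train on all other folds; hide the labels of the held-out fold
        train_set = [row for other in folds if other is not fold for row in other]
        test_set = [list(row[:-1]) + [None] for row in fold]
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        scores.append(accuracy_metric(actual, predicted))
    return scores

def majority_class(train, test):
    # Hypothetical baseline: always predict the most common training label
    labels = [row[-1] for row in train]
    most_common = max(set(labels), key=labels.count)
    return [most_common for _ in test]

seed(1)
# Made-up data: 20 rows with one feature; 15 labelled 0 and 5 labelled 1
toy = [[float(i), 0 if i < 15 else 1] for i in range(20)]
scores = evaluate_algorithm(toy, majority_class, 5)
print(scores)
print(sum(scores) / len(scores))
```

The per-fold scores depend on the random split, but the baseline always predicts class 0 here, so the mean accuracy works out to 75.0. A decision tree should beat a baseline like this to be worth using.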

Split Datasets


# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        # Rows below the threshold go left, all others go right
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

Calculating the Gini index


# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
    # count all samples at split point
    n_instances = float(sum([len(group) for group in groups]))
    # sum weighted Gini index for each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        # avoid divide by zero
        if size == 0:
            continue
        score = 0.0
        # score the group based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini

If you need help or the complete code, contact us or leave a comment in the section below.

Get your project or assignment completed by deep learning experts and experienced developers and researchers.

If you have project files, you can send them directly to codersarts@gmail.com.