Split the cells — Decision Tree algorithm from scratch


This algorithm is the backbone of some of the best algorithms used in industry, like Random Forest and XGBoost. The harder algorithms start with this one; it is a little tough to understand at first, but if you keep up, it quickly becomes very intuitive!

Photo by Irina Iriser on Unsplash

So let’s go!

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.

It builds a flowchart-like tree structure where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. It is constructed by recursively splitting the training data into subsets based on the values of the attributes until a stopping criterion is met, such as the maximum depth of the tree or the minimum number of samples required to split a node.

A node is split based on a metric such as entropy or Gini impurity; the goal is to maximize information gain and reduce impurity after each split. For deeper insight, give the scikit-learn docs a read: https://scikit-learn.org/stable/modules/tree.html
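
To make the two metrics concrete, here is a tiny standalone sketch (not part of the classifier we build below; the function names are just for illustration) of how each one is computed from a list of class proportions:

import math

def entropy(distribution):
    # higher = more mixed; 0 for a pure node
    return -sum(p * math.log(p, 2) for p in distribution if p > 0)

def gini(distribution):
    # 1 minus the sum of squared proportions; also 0 for a pure node
    return 1 - sum(p ** 2 for p in distribution)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0 0.5 for a 50/50 split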

Let’s code!

We start by importing the libraries we need and defining the tree node:

import math
import pandas as pd
import numpy as np

class TreeNode(object):
    def __init__(self, parents=None):
        self.children = []                # child nodes, one per split value
        self.split_feature = None         # index of the feature this node splits on
        self.split_feature_value = None   # the parent's split-feature value that leads here
        self.parents = parents            # reference to the parent node
        self.label = None                 # class label (set on leaf nodes)

Don’t worry about the TreeNode attributes; we’ll get to them as we write the classifier.

Next we start with the heavy lifting, the decision tree classifier itself.

Initialization

class DecisionTreeClassifier():
    def __init__(self):
        self.fit_tree = None   # the trained tree lives here after fit()

Calculate the distribution of class labels

    def data_to_distribution(self, y_train):
        types = set(y_train)   # unique class labels
        distribution = []
        for i in types:
            # fraction of samples that carry label i
            distribution.append(list(y_train).count(i) / float(len(y_train)))
        return distribution

Here we first find the unique labels in y_train, then count the occurrences of each label and divide by the total number of instances to obtain its proportion.
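
A quick sanity check (clf is just an illustrative instance name):

clf = DecisionTreeClassifier()
print(clf.data_to_distribution(np.array(['yes', 'yes', 'no', 'yes'])))
# e.g. [0.75, 0.25] -- 3 of 4 samples are 'yes' (set order may vary)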

Entropy

    def entropy(self, distribution):
        # Shannon entropy (log base 2) of a list of class proportions
        return -sum([p * math.log(p, 2) for p in distribution])

Entropy is a measure of the average amount of information (or uncertainty) in a set of data. In the context of classification, it quantifies the impurity of a collection of examples with respect to their class labels.

Entropy is maximum when the distribution is uniform (i.e., all classes are equally likely), and it decreases as the distribution becomes more skewed (i.e., one class becomes more dominant).

A decision tree aims to minimize entropy at each split, as it seeks to create subsets with increasingly pure (homogeneous) class distributions.
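
Using the clf instance from above, a skewed distribution has much lower entropy than a balanced one, and a pure one has none at all:

print(clf.entropy([0.9, 0.1]))   # ~0.47, much less uncertain than a 50/50 split
print(clf.entropy([1.0]))        # -0.0, a pure node has no uncertainty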

Splitting the data

    def split_data(self, x_train, y_train, feature_index):
        attribute_values = x_train[:, feature_index]
        for attribute in set(attribute_values):
            # collect every (point, label) pair with this feature value
            data_subset = []
            for index, point in enumerate(x_train):
                if point[feature_index] == attribute:
                    data_subset.append([point, y_train[index]])
            yield data_subset

This method is responsible for splitting the training data into subsets based on the values of a specified feature. These subsets will be used to construct decision nodes in the decision tree. The method yields each data subset as it is generated. Using yield allows the subsets to be generated lazily, one at a time, conserving memory.

For each unique attribute value, the method creates a subset of the training data consisting of instances with that attribute value.
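
As a small illustration (the toy arrays are made up for this example), splitting on column 0 of a three-row dataset yields one subset per distinct value:

x_toy = np.array([[0, 1], [0, 0], [1, 1]])
y_toy = np.array([0, 0, 1])
for subset in clf.split_data(x_toy, y_toy, 0):
    print(subset)
# one subset for feature value 0 (two points) and one for value 1 (one point)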

Calculating gain

    def gain(self, x_train, y_train, feature_index):
        # entropy of the labels before the split
        entropy_gain = self.entropy(self.data_to_distribution(y_train))
        for data_subset in self.split_data(x_train, y_train, feature_index):
            # subtract each subset's entropy, weighted by its share of the samples
            weight = len(data_subset) / float(len(y_train))
            entropy_gain -= weight * self.entropy(
                self.data_to_distribution([label for (point, label) in data_subset]))
        return entropy_gain

The gain method in a decision tree classifier calculates the information gain achieved by splitting the data on a particular feature. Information gain measures the reduction in entropy (or increase in homogeneity) that results from splitting the data.

The method calculates the initial entropy of the entire dataset before splitting. This represents the uncertainty or impurity in the class labels before any split.

For each data subset, the method calculates the entropy of the class labels within that subset; this measures the impurity or uncertainty remaining after the split. Each subset's entropy, weighted by the fraction of samples it contains, is subtracted from the initial entropy, and what remains is the reduction in entropy achieved by the split.
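
Continuing the toy example above, feature 0 separates the labels perfectly, so splitting on it removes all of the starting entropy, while feature 1 leaves the labels mixed:

print(clf.gain(x_toy, y_toy, 0))   # ~0.918, the full starting entropy
print(clf.gain(x_toy, y_toy, 1))   # noticeably smaller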

Homogeneity and Majority Vote

    def homogeneous(self, y_train):
        # True when only one distinct class label remains
        return len(set(y_train)) <= 1

    def majority_vote(self, y_train, node):
        labels = y_train
        # most frequent class label in this subset
        choice = max(set(labels), key=list(labels).count)
        node.label = choice
        return node

Homogeneity — It converts the list of class labels into a set to remove duplicates. If the length of the set is <= 1, it means that there is only one unique class label in the set, indicating homogeneity.

Majority vote — This process ensures that each leaf node in the decision tree is assigned a class label based on the majority of class labels present in the training data subset associated with that node.
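
For instance (again purely illustrative):

print(clf.homogeneous(np.array([1, 1, 1])))        # True, only one label present
leaf = clf.majority_vote(np.array([0, 1, 1]), TreeNode())
print(leaf.label)                                  # 1, the most common label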

Building the Tree

    def build_decision_tree(self, x_train, y_train, root, remaining_features):
        # base case 1: every label is identical, so this becomes a leaf
        if self.homogeneous(y_train):
            root.label = y_train[0]
            return root

        # base case 2: no features left to split on, fall back to a majority vote
        if len(remaining_features) == 0:
            return self.majority_vote(y_train, root)

        # pick the feature with the highest information gain
        best_feature = max(remaining_features,
                           key=lambda index: self.gain(x_train, y_train, index))

        # if even the best split gains nothing, stop and take the majority vote
        if self.gain(x_train, y_train, best_feature) == 0:
            return self.majority_vote(y_train, root)

        root.split_feature = best_feature

        # grow one child per value of the chosen feature
        for data_subset in self.split_data(x_train, y_train, best_feature):
            child = TreeNode(parents=root)
            child.split_feature_value = data_subset[0][0][best_feature]
            root.children.append(child)

            new_x = np.array([point for (point, label) in data_subset])
            new_y = np.array([label for (point, label) in data_subset])

            self.build_decision_tree(new_x, new_y, child,
                                     remaining_features - set([best_feature]))

        return root

This whole function is recursive, so we start off with the base cases.

We start by checking whether all class labels are the same (homogeneous); if so, we assign that label to the node and return it.

Next we check whether any features remain to split on; if there are none, we return the majority-vote label for the current y_train data.

We find the best feature by calculating the information gain for each remaining feature and selecting the one with the highest gain as the best_feature to split on.

Then, if the information gain for the best feature is zero, we return the majority-vote label for the current y_train data.

For each subset of data based on the best_feature value, create a child node, update its split feature value, and recursively build the subtree using the subset of data. Update the remaining_features set by removing the best_feature that has been used for splitting.

Finally return the root node of the tree that is built.

Find Nearest

    def find_nearest(self, array, value):
        # element of array closest to value
        array = np.asarray(array)
        nearest = (np.abs(array - value)).argmin()
        return array[nearest]

This method finds the nearest value in an array to a given value. It calculates the absolute difference between each element in the array and the given value, then returns the element with the minimum absolute difference.
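
For example:

print(clf.find_nearest(np.array([0, 2, 5]), 4))    # 5 is the closest stored value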

Classification

    def classify(self, tree, point):
        if tree.children == []:
            # leaf node: return its label
            return tree.label
        else:
            try:
                matching_children = [child for child in tree.children
                                     if child.split_feature_value == point[tree.split_feature]]
                return self.classify(matching_children[0], point)
            except IndexError:
                # unseen feature value: snap it to the nearest value seen in training
                array = [child.split_feature_value for child in tree.children]
                point[tree.split_feature] = self.find_nearest(array, point[tree.split_feature])
                matching_children = [child for child in tree.children
                                     if child.split_feature_value == point[tree.split_feature]]
                return self.classify(matching_children[0], point)

This method classifies a data point by traversing the decision tree. If the current node (tree) has no children, it returns the label of that node. Otherwise, it tries to find a child node whose split_feature_value matches the value of the data point at the split_feature. If found, it recursively calls itself with that child node and the data point. If no matching child is found, it calculates the nearest split feature value among the child nodes and updates the data point’s value accordingly. Then it recursively calls itself with the updated data point.

Text Classification

    def text_classification(self, x_train, tree):
        # classify every point in the dataset with the given tree
        predicted_labels = [self.classify(tree, point) for point in x_train]
        return predicted_labels

Despite its name, this method simply classifies a whole batch of input data points (x_train) using the trained decision tree (tree). It iterates over each data point in x_train and uses the classify method to predict its label, then returns the list of predicted labels.

Fit

    def fit(self, x_train, y_train):
        # all feature indices are available for splitting at the root
        tree = self.build_decision_tree(x_train, y_train, TreeNode(),
                                        set(range(len(x_train[0]))))
        self.fit_tree = tree

This method fits (trains) the decision tree classifier on the input training data (x_train, y_train). It builds a decision tree (tree) using the build_decision_tree method and assigns it to the fit_tree attribute of the classifier.

Predict

    def predict(self, x_test):
        predictions = self.text_classification(x_test, self.fit_tree)
        return predictions

This method predicts the class labels for the input test data (x_test) using the trained decision tree. It calls the text_classification method with the test data and the trained decision tree (fit_tree) to obtain the predicted labels. It returns the list of predicted labels for the test data.
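
To wrap up, here is a minimal end-to-end run on a made-up dataset (the data and variable names are only for illustration). The first column fully determines the label, so the tree should learn to split on it:

x = np.array([[0, 1],
              [0, 0],
              [1, 1],
              [1, 0]])
y = np.array([0, 0, 1, 1])

model = DecisionTreeClassifier()
model.fit(x, y)
print(model.predict(np.array([[0, 1], [1, 0]])))
# predicts label 0 for the first point and 1 for the second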

And that’s it for this blog. It got pretty long because of the implementation, but I think it sets the foundation for the algorithms to come.

Hope you had fun reading this and trying to implement your own too! I’ll catch you next time with an even better algorithm!

Happy Coding! See ya next time!
