Decision Tree From Scratch!! -Part I


In this blog post, I am going to talk about a powerful supervised learning algorithm that is often used in Machine Learning competitions. It is called the Decision Tree algorithm. It can be used for both classification & regression tasks. In this post, I will discuss the need for tree-based algorithms, the basics of Decision Trees & the math behind Decision Trees. In the next part, we will code the entire algorithm from scratch in Python.


So, without a further adieu let’s get into our today’s topic and try to understand the need for tree-based/complex algorithms.

The Need for Decision Trees

Before proceeding into this section, I would prefer if you have basic knowledge of concepts like Decision Boundary & linear classifiers like Logistic Regression. If not, please feel free to refer to the blog post attached below for a refresher. Here, I discuss the very basics of classification in Machine Learning & then talk about a linear classifier called Logistic Regression.

Logistic Regression in Machine Learning (from Scratch !!)

We are aware of the fact that models like Logistic Regression works best when the classes are well separated in the feature space.

    • The classes in the image on the right are well separated.
    • They have a Linear Decision Boundary.

    Let us recall that the Decision Boundary is the surface where the probability of being in a positive class is equal to the probability of being in a negative class.

    Okay, but what if the classes are not perfectly or linearly separable? The below-attached image shows an example of the same.

    Non-Linear problem

    Notice that in the dataset shown above, the classes are still well-separated in the feature space, but the decision boundary cannot easily be described by a single linear equation. So, to solve such problems we require models that allow complex decision boundaries & are easy to interpret at the same time. Here, the Decision Tree comes to our rescue. It is a tree-based model that allows for complex decision boundaries and is easy to interpret as well. Next, we look into the basics of the Decision Tree.

    Basics of a Decision Tree

    Formally, a Decision Tree model is defined as a tree-like structure/flow-chart in which the final outcome of the model is based on a series of comparisons of values of predictors against threshold values.

    After reading the above definition, I would define a Decision Tree to be nothing but a set of fancy if-else statements!!

    In a graphical representation (flow chart),

    • The internal nodes of the tree represent attribute testing.
    • Branching in the next level is determined by the attribute value (yes/no).
    • Terminal leaf nodes represent class assignments.
    A Decision Tree

    The above image represents a Decision Tree. The blocks in the flow chart are called nodes of the tree. The first node is known as the root node & the last nodes of the tree are known as the leaf nodes. Each comparison and branching represents splitting a region in the feature space on a single feature. Typically, at each iteration, we split once along one dimension (one predictor).

    Now, a few questions might come to your mind like, how do we decide which predictor gets to be the root node? till when we should grow the tree? how do we split the tree? We will answer all such questions in the next section where we talk about the math behind the Decision Tree.


    Decision Tree Algorithm & the Math behind it

    The algorithm used to train a Decision Tree model (also called “growing” trees) is called the CART algorithm & it stands for Classification And Regression Trees. The algorithm can be stated as follows,

    1. Start with an empty decision tree (undivided feature space).
    2. Choose the ‘optimal’ predictor on which to split and choose the‘ optimal’ threshold value for splitting.
    3. Recurse on each new node until the stopping condition is met.

    Now, we are aware of the algorithm & can focus on answering the questions like “How to split the tree?”, “How to decide the root node?”, “How to decide the stopping condition?” Next, we answer these questions one at a time.

    Q. How to split the tree?

    Generally, there are 2 criteria on which we can split the tree, one is called the Gini Index & the other is called the Entropy. We will discuss this one by one.

    1. Gini Index

    Gini Index of a node is a measure of its impurity. A node is said to be “pure” when it has a Gini Index of 0 i.e all the training points belong to the same class. Mathematically, Gini Index is given by,

    Gini Index

    Where pi is the ratio of the number of instances belonging to the ith class to the total number of instances in the given node.

    2. Entropy

    Entropy is another criterion that can be used to decide the split of the tree. In thermodynamics, it is the measure of randomness of molecules & can be used in Machine Learning as well. A node has an entropy of zero when all the instances in it belong to the same class i.e no randomness or uncertainty in determining the class of the data points in that particular node. Mathematically, it is given by,


    So, now we are aware of the 2 criteria that can be used to split the tree. Next, we will try to answer the question “How to decide the root node?”

    Q. How to decide the root node?

    To decide on the root node we define another metric called the Information Gain. It is defined as the reduction in entropy or randomness of the system.

    1. We compute the original Entropy for the entire system i.e entropy before any split has taken place.
    2. We take a predictor at random and split it on the basis of a threshold and compute the entropy for each child node.
    3. Next, we take the weighted sum of the entropies of both the child nodes where the weight is equal to the ratio of the number of data points in the respective child node to the total number of data points in both child nodes.
    4. Then we compute the difference between the Entropy of the original step i.e the one calculated in step 1 & the Entropy of the transformed system i.e the one calculated in step 3.
    5. The variable or the predictor that gives the maximum information gain out of all the predictors is selected as the root node.

    To understand the concept of Information Gain with the help of an example refer to this awesome video by StatQuest. Next, we will answer the question “How to decide the stopping condition?”


    Q. How to decide on the stopping condition?

    This one is a bit (no pun intended!) easier to answer. To decide on the stopping condition we need to take into account two things.

    1. The tree stops growing once it reaches its maximum depth, defined by the “max_depth” hyperparameter.
    2. The other way the tree stops growing is that it cannot find a split that will reduce the impurity further.


    I would now like to conclude part-I of the “Decision Tree From Scratch.” In the next part, we will code the entire algorithm from scratch in Python & test its performance on a mock dataset.

    I am giving FREE one-on-one sessions for those who want to enter the field of Data Science. I will be happy to share my experience with you. Please feel free to book a 30-minute one-on-one session with me using the calendar below.

    You can also connect with me on social media the link to my social handles is attached below. I hope you found this blog post informative. Please feel free to share your feedback in the feedback form attached below. Thanks for reading & Happy Learning 🙂

    Rating: 3.5 out of 5.

    Latest Blog Posts


    One response to “Decision Tree From Scratch!! -Part I”

    Leave a Reply

    Fill in your details below or click an icon to log in: Logo

    You are commenting using your account. Log Out /  Change )

    Twitter picture

    You are commenting using your Twitter account. Log Out /  Change )

    Facebook photo

    You are commenting using your Facebook account. Log Out /  Change )

    Connecting to %s

    %d bloggers like this: