Introduction
Decision trees are versatile machine learning algorithms that can perform both classification and regression tasks. They make decisions by asking questions about the data based on its features, following an IF-ELSE structure down a path that ultimately leads to the final prediction. The challenge is to figure out which question to ask at each step of the decision-making process, which is equivalent to asking how to determine the best split at each decision node.
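To make that IF-ELSE idea concrete, here is a tiny hand-written sketch of what the decision logic of a fitted tree could look like for the fish-classification task we tackle below. The threshold values are purely illustrative; finding good thresholds automatically is exactly the problem this article works through.

def toy_tree_predict(length, weight):
    # hypothetical, hand-picked rules -- a real decision tree learns these from data
    if length <= 3.0:
        return "tuna"
    elif weight <= 5.0:
        return "salmon"
    else:
        return "tuna"

print(toy_tree_predict(length=2.5, weight=4.0))  # -> tuna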
In this article, we will attempt to build a decision tree for a simple binary classification task. The objective of this article is to understand how an impurity measure (e.g. entropy) is used at each node to determine the best split, eventually constructing a tree-like structure that uses a rule-based approach to get to the final prediction.
To build intuition behind entropy and Gini impurity (another metric used to measure randomness and determine the quality of a split in decision trees), quickly check out this article.
Problem definition and data
Problem: Given its length and weight measurements, predict whether a fish is tuna or salmon.
The challenge is to predict the type (our target variable) of a fish given its weight and length. This is a binary classification task since the target variable, type, takes only two possible values: tuna and salmon.
You can download the dataset from here.
It’s highly encouraged to code along as you’re reading this article to get the maximum understanding 🙂
Code-along prerequisites
Let’s make sure you have everything to get started (I bet you already do, but just in case).
- Python
- Any code editor that lets you work with Jupyter (.ipynb) notebooks – Visual Studio Code, Jupyter Notebook, and Google Colab, to name a few.
- Libraries: pandas and numpy for data manipulation; plotly for visualization. (Feel free to use any other data viz library of your choice if you want).
- Data, of course.
That’s all we need, and most probably you are already good to go. Now let’s start coding!
A step-by-step solution
Create a new .ipynb file and import the libraries first.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
Read the data into a pandas data frame.
# read the csv file
df = pd.read_csv("fish.csv")
# how many rows and columns?
print(df.shape)
# print column names
print(df.columns)
# print class distribution
print(df["type"].value_counts())
Cell Output:

There are 1000 rows and 3 columns in our data frame. ‘length’ and ‘weight’ are the features and ‘type’ is the target variable. Since the ‘type’ column has values – ‘tuna’ and ‘salmon’, let’s label encode them.
Label encoding our target column:
df["type"] = df["type"].apply(lambda x: 1 if x=="tuna" else 0)
We have labeled our classes as: {salmon: 0 and tuna: 1}
Now, let’s plot a scatter plot to visualize our classes.
# create a Figure
fig = go.Figure()

# specify custom colors for the plot
color_map = {
    0: "red",
    1: "blue",
}

# apply the color map to the 'type' column
colors = df["type"].map(color_map)

# add a scatter trace to the figure
fig.add_trace(go.Scatter(x=df["length"],
                         y=df["weight"],
                         mode="markers",
                         marker=dict(color=colors, size=8)))

# add x-label, y-label and title
fig.update_layout(
    width=800,
    height=600,
    title_text="Scatter Plot of Data",
    xaxis=dict(title="length",
               tickvals=[i for i in range(10)]),
    yaxis=dict(title="weight",
               tickvals=[i for i in range(10)])
)
Cell Output:

We can now clearly see our two classes marked in red and blue. That wraps up the data preparation; now let's get into the decision tree. Assign the feature column names to a list that we will use later.
features = ["length", "weight"]
Finding our first split
A split at a node in a decision tree refers to the (feature, value) pair that divides the data into two (or more) partitions.
In the case of a numerical feature, the split creates two partitions of the data: one with feature ≤ value and the other with feature > value.
In the case of a categorical feature, the split creates two partitions of the data: one with feature = value and the other with feature ≠ value.
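As a quick illustration of what such a partition looks like in pandas (the split value 3.0 is just an example, and the categorical column 'habitat' is hypothetical, since our fish data has only numerical features):

# numerical feature: feature <= value vs feature > value
left = df[df["length"] <= 3.0]
right = df[df["length"] > 3.0]

# categorical feature (hypothetical column 'habitat'): feature == value vs feature != value
# left = df[df["habitat"] == "freshwater"]
# right = df[df["habitat"] != "freshwater"]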
# Finding the first split:

# initialize best_params, a dictionary that will keep track
# of the best feature and split value at each node
best_params = {"feature": None, "impurity": np.inf, "split_value": None}

# for each feature in features, do the following:
###   for val in all feature values
###   (starting from the minimum value of the feature up to its maximum value,
###   incrementing by 'step_size'), check the following:
######     if the impurity at this feature value is less than the previously recorded impurity,
######     then update best_params

# Following is the code for the above interpretation
for feature in features:
    curr_val = df[feature].min()
    step_size = 0.1
    while curr_val <= df[feature].max():
        curr_feature_split_impurity = compute_impurity(df, feature, curr_val)
        if curr_feature_split_impurity < best_params["impurity"]:
            best_params["impurity"] = curr_feature_split_impurity
            best_params["feature"] = feature
            best_params["split_value"] = curr_val
        curr_val += step_size
Running this cell will produce an error because we haven't defined the function compute_impurity
yet. We need this function to compare the impurity of the data before and after making a split. The (feature, value) pair that results in the lowest impurity after splitting will be chosen as the best split at the current node, and we will update best_params accordingly.
Define the function as follows:
def compute_impurity(df, feature, val):
    """
    Inputs:
        df: dataframe before splitting
        feature: column name to test for the best split
        val: value of 'feature' to test for the best split
    Output: float
        Returns the weighted entropy after the split
    """
    # Make the split at (feature, val)
    left = df[df[feature] <= val]["type"]
    right = df[df[feature] > val]["type"]

    # an empty partition means this is not a valid split
    if len(left) == 0 or len(right) == 0:
        return np.inf

    # calculate the entropy of both partitions
    left_impurity = compute_entropy(left)
    right_impurity = compute_entropy(right)

    # return the weighted entropy
    n = len(df)           # total number of data points
    left_n = len(left)    # number of data points in the left partition
    right_n = len(right)  # number of data points in the right partition
    return (left_n / n) * left_impurity + (right_n / n) * right_impurity
This function uses another helper function compute_entropy
that gives us the entropy of a given list of classes. Let’s define this as well:
def compute_entropy(vals):
    """
    Input:
        vals: list of 0s and 1s corresponding to the two classes
    Output:
        entropy: float
    """
    # probability of the class labeled as 1
    # will be equal to the average of vals
    p1 = np.mean(vals)
    p0 = 1 - p1

    # if either probability is 0, the data is homogeneous
    # and the entropy is 0
    if p1 == 0 or p0 == 0:
        return 0

    # entropy formula for two classes
    return -(p0 * np.log2(p0) + p1 * np.log2(p1))
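If you want a quick, optional sanity check: a perfectly balanced set of labels should give an entropy of 1 (maximum uncertainty), while a pure set should give 0.

print(compute_entropy([0, 1, 0, 1]))  # 1.0 -> maximum uncertainty
print(compute_entropy([1, 1, 1, 1]))  # 0   -> homogeneous data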
Now that both of our helper functions are defined, re-run the cell that previously led to an error; it should run successfully now. Print the best_params dictionary to check whether it got updated.
print(best_params)
Cell Output:

Voila! We got our first best split at length = 3, which means we will now split our data into two partitions – data['length'] <= 3 and data['length'] > 3.
Note: Here we are going with the split that results in the minimum weighted entropy. Decision trees can vary in this split criterion; for example, ID3 uses Information Gain (the difference between the entropy of the data before the split and the weighted sum of the entropies of the branches after the split), while CART uses the Gini index.
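Since Information Gain is just the entropy of the parent node minus the weighted child entropy that compute_impurity already returns, we could compute it for our first split with the helpers defined above (a small illustrative sketch, not part of the tree-building code):

# information gain of the split (feature='length', value=3)
parent_entropy = compute_entropy(df["type"])
weighted_child_entropy = compute_impurity(df, "length", 3)
print(parent_entropy - weighted_child_entropy)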
Following is how our decision tree looks right now. Since the data in the left branch consists of a single class, we will make it a leaf node; any data point with length <= 3 will be predicted as tuna.
(Reminder: according to our color_map, we have assigned blue color to tuna and red color to salmon):
color_map = {
    0: "red",   # salmon
    1: "blue",  # tuna
}
For the data in the right branch, we can recursively follow the same process to find the best split, and repeat for subsequent branches until we reach a maximum depth or no further splitting is possible.

This process then gets repeated for each of the partitions until a stopping condition is met (such as reaching the maximum depth of the tree, or the number of samples in a leaf node falling below a threshold). This is what hyperparameters allow us to define when we use classifiers or regressors from packages such as scikit-learn.
Generalize the code using recursion
We will wrap the above code in a function build_tree
that can be called recursively to build the decision tree.
Helper functions:
compute_entropy:
Returns the entropy of a dataset with two classes. Defined above.
compute_gini:
Returns the gini index of a dataset with two classes. We can choose between entropy and gini index as our impurity measures.
def compute_gini(vals):
    """
    Input:
        vals: list of 0s and 1s
    Output:
        gini: float
    """
    # probability of '1' will be equal to the average of vals
    p1 = np.mean(vals)
    p0 = 1 - p1  # since there are just two classes and p0 + p1 = 1

    if p1 == 0 or p0 == 0:
        return 0

    return 1 - p1**2 - p0**2
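As with entropy, a quick optional check against the formula 1 - p1^2 - p0^2: a 50/50 mix of classes should give a Gini index of 0.5, and a pure set should give 0.

print(compute_gini([0, 1, 0, 1]))  # 0.5 -> maximum impurity for two classes
print(compute_gini([0, 0, 0, 0]))  # 0   -> homogeneous data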
compute_impurity:
An extended version of the compute_impurity function defined above; it returns the weighted impurity of the data after a split. It uses the compute_entropy and compute_gini functions to calculate the entropy or the Gini index at the given split point, according to the criterion specified.
def compute_impurity(df, feature, val, criterion):
    """
    Inputs:
        df: dataframe before splitting
        feature: column name to test for the best split
        val: value of 'feature' to test for the best split
        criterion: impurity measure to use ("entropy" or "gini")
    Output: float
        Returns the weighted impurity after the split
    """
    # Make the split at (feature, val)
    left = df[df[feature] <= val]["type"]
    right = df[df[feature] > val]["type"]

    # an empty partition means this is not a valid split
    if len(left) == 0 or len(right) == 0:
        return np.inf

    # calculate the impurity of both partitions
    if criterion == "entropy":
        left_impurity = compute_entropy(left)
        right_impurity = compute_entropy(right)
    else:
        left_impurity = compute_gini(left)
        right_impurity = compute_gini(right)

    # return the weighted impurity
    n = len(df)           # total number of data points
    left_n = len(left)    # number of data points in the left partition
    right_n = len(right)  # number of data points in the right partition
    return (left_n / n) * left_impurity + (right_n / n) * right_impurity
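As a quick usage example (assuming the df loaded earlier), we can compare the two criteria at the split we already found; the exact numbers depend on the data, so treat the output as illustrative:

# weighted impurity of the split (length, 3) under both criteria
print(compute_impurity(df, "length", 3, criterion="entropy"))
print(compute_impurity(df, "length", 3, criterion="gini"))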
get_best_params:
Returns the best_params dictionary containing the feature and value to use for the split at the current node.
def get_best_params(df, features, criterion):
    """
    A function to determine the best split at a node
    Input:
        df: dataframe before split
        features: list of features
        criterion: impurity measure to use ("gini" or "entropy")
    Output:
        best_params: dict
    """
    # Initialize best_params
    best_params = {"feature": None, "val": None, "impurity": np.inf}

    # iterate over all features
    for feature in features:
        curr_val = df[feature].min()
        step_size = 0.1
        # iterate over all values of the feature (according to step_size)
        while curr_val <= df[feature].max():
            # calculate impurity (gini or entropy) for the current value of the feature
            impurity = compute_impurity(df, feature, curr_val, criterion)
            # update best_params if the impurity is less than the previous best
            if impurity <= best_params["impurity"]:
                best_params["feature"] = feature
                best_params["val"] = curr_val
                best_params["impurity"] = impurity
            curr_val += step_size

    best_params["val"] = np.round(best_params["val"], 2)
    best_params["impurity"] = np.round(best_params["impurity"], 2)
    return best_params
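We can sanity-check this function against the first split we found manually; called on the full data frame, it should return the same (feature, value) pair, i.e. a split on length around 3:

# best split at the root node -- should match the first split found earlier
print(get_best_params(df, ["length", "weight"], criterion="entropy"))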
Main function
build_tree:
It is the main driver function that uses the helper functions above to build the decision tree recursively for the given data. I have also added print statements to display the decision tree in an interpretable format as it is created.
def build_tree(data, features, curr_depth=0, max_depth=3, criterion="entropy"):
    """
    A function to build the decision tree recursively.
    Input:
        data: dataframe with columns length, weight, type
        features: ['length', 'weight']
        curr_depth: keeps track of the depth at the current node
        max_depth: the decision tree will stop growing once max_depth is reached
        criterion: "gini" or "entropy"
    """
    # Base case: max depth reached -> predict the majority class
    if curr_depth >= max_depth:
        classes, counts = np.unique(data['type'].values, return_counts=True)
        predicted_class = classes[np.argmax(counts)]
        print(("--" * curr_depth) + f"Predict: {predicted_class}")
        return

    # Get the best feature and value to split the data
    best_params = get_best_params(data, features, criterion)

    # Base case: pure node (single class)
    if best_params["impurity"] == 0:
        predicted_class = data['type'].iloc[0]
        print(("--" * curr_depth) + f"Predict: {predicted_class}")
        return

    # If there's no feature that can improve the purity (no valid split possible)
    if best_params["feature"] is None:
        predicted_class = data['type'].iloc[0]
        print(("--" * curr_depth) + f"Predict: {predicted_class}")
        return

    # Print the current question (decision rule)
    best_feature = best_params["feature"]
    best_split_val = best_params["val"]
    question = f"Is {best_feature} <= {best_split_val}?"
    print(("--" * (curr_depth * 2)) + ">" + question)

    # Split the dataset
    left_df = data[data[best_feature] <= best_split_val]
    right_df = data[data[best_feature] > best_split_val]

    # Recursive calls for the left and right subtrees
    if not left_df.empty:
        print(("--" * curr_depth) + "Yes ->")
        build_tree(left_df, features, curr_depth + 1, max_depth, criterion)
    if not right_df.empty:
        print(("--" * curr_depth) + "No ->")
        build_tree(right_df, features, curr_depth + 1, max_depth, criterion)
Now, we can pass our fish dataset and test the output of our code as follows.
build_tree(df, ["length", "weight"], max_depth=4, criterion="entropy")
Cell Output:

Following is what our final decision tree looks like:

This corresponds to the following decision boundaries:

Note: What happens if a leaf node is not pure, i.e., there is more than one class in the leaf node's partition? We simply go for the majority class.
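This is exactly what the max-depth base case in build_tree does; in isolation, the majority vote looks like this:

# majority class of an (example) impure leaf partition
labels = np.array([1, 1, 0, 1])
classes, counts = np.unique(labels, return_counts=True)
print(classes[np.argmax(counts)])  # -> 1 (the majority class)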
Link to Code
You can get the final code notebook from here.
Takeaways
If you've come this far, you should now be much more comfortable making sense of some important properties of decision trees, such as interpretability being a key advantage and overfitting being a top disadvantage, which you may have already come across in previous encounters with decision trees. It would be unfair to leave these topics out, so let's touch on them briefly below.
Let’s first look at the advantages. There are more, but we are sticking with the most important ones.
What are the (top) advantages of a decision tree?
- Interpretability: A decision tree prediction is easier to interpret as compared to other Machine Learning models since we can take a visual look at the path that was followed to get to the final prediction.
A decision tree is intuitive, and can be explained easily, even to a non-technical person.
For example, let's say a bank is using a decision tree to predict whether it should grant a loan to a customer based on attributes such as income, bank balance, age, and profession. If the classification system suggests that the bank should not grant the loan, the bank needs to give a proper response stating the reasons for rejection; otherwise, it can harm their business and reputation. A decision tree makes this straightforward, since the reasons can be read directly off the path of questions that led to the rejection.
- No need for heavy-duty preprocessing: Decision trees don’t expect the data to be normalized (or standardized) as opposed to some other machine learning models.
You do minimal data pre-processing and the decision tree won’t mind much.
Moreover, it can intrinsically handle categorical features and we don’t need to worry about one-hot encoding (or other solutions) as distinct categories will simply be considered as different branches during a split.
And that major drawback -> Overfitting
Overfitting is when our model looks too good to be true, i.e., the model adapts to the training data so closely that it loses its ability to generalize and fails to show a similar level of accuracy when presented with test data.
Decision trees, when not controlled properly, are highly prone to overfitting. Notice in our example above that if we keep splitting the training data without defining any limit, the decision tree will keep creating more and more decision boundaries, learning the noise in the training data.
In order to retain the model's ability to generalize, it's important to take measures against overfitting. In the case of decision trees, we can take the following steps:
- Pre-pruning: Pruning refers to preventing the decision tree from growing to its full capacity. It can be done proactively by:
- Setting max_depth: don't allow the tree to grow beyond a pre-defined depth.
- Setting min_samples_split: don't allow a split to happen if the number of samples at a decision node is below this value.
- Setting min_samples_leaf: don't allow a split to happen if the number of samples at any of the resulting leaf nodes is lower than this value.
These (plus many others) are the hyperparameters that we can tune as per our needs when we implement decision trees via packages such as scikit-learn; you can find all of them and their definitions in this documentation. A short scikit-learn sketch of these pre-pruning options follows after this list.
- Post-pruning: This refers to letting the decision tree grow to its full capacity and then discarding some sections/branches afterward that seem unnecessary or lead to high variance.
- Another possible solution: don't use decision trees at all, and instead opt for their more advanced descendants – such as random forests or gradient-boosted trees 🙂. These still require a basic understanding of their predecessors, so reading this article was not a waste of time at all!
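For reference, here is a minimal sketch of how the pre-pruning hyperparameters above map onto scikit-learn's DecisionTreeClassifier, assuming the same df and label encoding used throughout this article (the hyperparameter values are just examples):

from sklearn.tree import DecisionTreeClassifier

# pre-pruned tree using the hyperparameters discussed above (example values)
clf = DecisionTreeClassifier(
    criterion="entropy",   # or "gini"
    max_depth=4,           # don't grow beyond this depth
    min_samples_split=20,  # don't split a node with fewer samples than this
    min_samples_leaf=10,   # every leaf must keep at least this many samples
)
clf.fit(df[["length", "weight"]], df["type"])

# predict the class of a new fish (length, weight); 1 means tuna, 0 means salmon
new_fish = pd.DataFrame([[2.5, 4.0]], columns=["length", "weight"])
print(clf.predict(new_fish))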
Bonus Points
There are some other properties of decision trees that are worth noting:
- Non-parametric: Decision trees are non-parametric machine learning models, which means that they do not make assumptions about the training data related to its distribution, independence of features, etc.
- Greedy approach: Decision trees follow a greedy approach, which means that they opt for the split that looks best at the given node (i.e., a locally optimal solution) and cannot backtrack, which can lead to a globally sub-optimal tree.
Conclusion
In this article, we learned to build a decision tree for a binary classification task without using any machine learning packages, in order to get strong at the conceptual level. We went through a step-by-step process to understand how a decision rule is generated at each node using data impurity measures such as entropy, and then implemented a recursive algorithm in Python to output the final decision tree. The goal of this article was to grasp the fundamentals of decision trees by looking under the hood.
In practice, however, when dealing with real-life data and the challenges at hand, we will rarely need to do this from scratch, as there are numerous packages available that make things far more convenient and robust for us. But having a strong background will help us better utilize those packages, and we will be more confident while leveraging them.
Hopefully, the next time we build a decision tree or a random forest (which is an ensemble of multiple decision trees), we will be more thoughtful while configuring our models (and there will be less struggle to understand what a hyperparameter really means 😺).
I hope this was helpful. Open to any feedback or suggestions.
I’d like to acknowledge Ritvik Kharkar for his amazing YouTube video that helped me better understand decision trees conceptually. I’ve taken inspiration from his video to write this article, using the same example he used, and taking the solution a step ahead by adding recursive implementation and logic to print the decision tree. The link to his video is in the references below.
Related Reading
- Want to get the intuition behind impurity measures? Check out this article related to Entropy and Gini index:
- How to evaluate a decision tree classifier? Check out the article below to learn about different evaluation metrics for classification models:
References
[1] Aurélien Géron, (2019). Hands-on machine learning with Scikit-Learn, Keras and TensorFlow: concepts, tools, and techniques to build intelligent systems (2nd ed.). O’Reilly.
[2] ritvikmath’s YouTube video
[3] StatQuest