
Imagine that you have built an awesome machine learning model that can forecast your target value with great accuracy. In some cases, your job might be over at this point. However, often the business not only wants to know what will happen but also how to influence the outcome. True to the motto:
Knowing the future is silver, being able to change it is golden.
This simple truth goes without saying; you know it from your personal life. Knowing the lottery numbers for next week is good, but only if you can adjust your own numbers accordingly.
As a business example, take the problem of customer churn, i.e., customers who stop doing business with you. Knowing that a customer wants to leave you is good, but the real question is: how do you prevent this customer from churning?
The business wants some way to intervene, for example by giving out a coupon or granting this customer some kind of membership upgrade – something the business can influence to decrease the probability of churn.
If x = "give the customer a coupon" and y = churn probability, we want to be able to make causal statements: if I do x, what happens to y? People also like to call these what-if scenarios. x is called a treatment variable.

This is more difficult than making correlational statements. Observing that ice cream sales correlate with shark attacks is easy. But does one cause the other? Probably not. It is rather the good weather that drives people to buy ice cream and take a swim in the sea, exposing them to sharks. So, closing all ice cream parlors as a way to decrease shark attacks will most likely not work.
In this article, I will show you how to reach the correct conclusions. The methods I am going to show you are extremely simple to carry out on an algorithmic level: they just involve training models and slightly post-processing their predictions. However, in causal inference, you have to be careful about which features to include during training – more is not always better. That’s why I will first show you a simple tool to check which features you should include, and only then the actual methods.
Be Careful With These Features
Let us take another example from one of my other articles. You can read it here, but I will also give a short recap:
Assume that we have a dataset about employees in a company. We know their Sense of duty, the Overtime they work, the Income they have, and their Happiness.

You want to answer the following question:
How does overtime influence income?
Here, Overtime is our treatment, and Income is the outcome. If you have never heard about causal inference before, the solution seems so obvious:
- train a model with all features Sense of duty, Overtime, and Happiness to predict Income, and then
- plug in many numbers for Overtime to see how Income changes.
If the model has good predictive power, this should work out, right?
Unfortunately not.
This is not universally true, as I demonstrated in my other article linked above. The reason is that typical machine learning models only learn correlations, not causations. If you are not careful, the model will learn to lower ice cream sales in order to reduce the number of shark attacks.
However, by carefully selecting your features, the above method still works! Your feature set has to be what is called a sufficient adjustment set. Let us see what that means.
Sufficient adjustment sets
The concept of sufficient adjustment sets is fairly theoretical, and I will not go into detail in this article since I could fill a whole article series about it. I just want to point you to https://www.dagitty.net/dags.html, where you can check for yourself whether your features form a sufficient adjustment set.
First of all, you have to specify a causal graph, which is by far the hardest part of doing causal inference. This is a graph that tells you which features potentially influence which other features. For the data above, let us assume the following causal structure:

This graph encodes the following assumptions:
- The sense of duty influences how much overtime somebody works, as well as the income. (If I have a high sense of duty, I might work longer, and also harder or more diligently, which might lead to higher income.)
- Overtime influences happiness and income. (Too much overtime makes people unhappy, but results in more money.)
- Income influences happiness as well. (More money, more happiness!)
There are also hidden assumptions that you can recognize by the absence of arrows. For example, the graph encodes that income does not influence overtime, and that happiness does not influence the sense of duty. We also assume that no other factors influence income or any of the other variables, which is quite a strong assumption.
You can debate whether these assumptions make sense or not. They are often not verifiable, which makes causal reasoning so damn hard. But for our purposes, let us go with this graph since it is sensible enough.
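To make the assumed structure concrete, here is one way to write the graph down in plain Python (a minimal sketch; an edge from A to B means "A influences B"):

```python
# The assumed causal DAG, written as a plain adjacency mapping: parent -> children.
causal_graph = {
    "Sense of duty": ["Overtime", "Income"],
    "Overtime": ["Income", "Happiness"],
    "Income": ["Happiness"],
    "Happiness": [],
}

# The absent edges matter too: Happiness influences nothing,
# and no variable points back into Sense of duty.
assert causal_graph["Happiness"] == []
print(causal_graph["Overtime"])  # ['Income', 'Happiness']
```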
Ok, so we have decided on a causal structure for our dataset. Again, this was the hard part, and you have to be careful here. Finding a sufficient adjustment set – the set that makes our naive method of training a model and plugging in different values for the treatment variable work – is now a purely graph-theoretic task. Dagitty can help us with that.
Dagitty
You can easily click together the graph from the image above. Then, in the top left corner, you can flag
- Overtime as Exposure (treatment), i.e., the thing you want to play around with to see how it changes the outcome, and
- Income as Outcome.
In the top right corner, the website will tell you that Sense of duty is a sufficient adjustment set.

You can now click Sense of duty and flag it as Adjusted, and the same box will tell you that you adjusted correctly. If you set Happiness to Adjusted as well, it will say that you adjusted incorrectly.
What this means for you
You should only train a model with the features
- Overtime and
- Sense of duty.
You should not include Happiness, and you should not omit Sense of duty, **if you want to make causal statements about how Overtime influences Income**.
Again, more features is not always better in causality.
For the remainder of this article, let us make the following simple assumption: All features in our dataset form a sufficient adjustment set.
Meta-Learners
Now, the last part got longer than anticipated – sorry for that. However, it is necessary: otherwise, you might use supposedly simple methods and come to the wrong conclusions. But under our sufficient adjustment set assumption – which you should always think about before you do anything causal – everything will work out. Let us now get real and see how to find causal effects.
Meta-learners are the easiest way for data scientists to compute causal effects – quote me on that. Basically, you can take any machine learning model that you know and love and plug it into another algorithm – the meta-learner – that uses this model to output the causal effects.
The meta algorithms that I am about to show you are simple to implement yourself, but you can also use libraries such as EconML or CausalML to achieve the same thing with fewer lines of code. But nothing is better than knowing what is going on under the hood, right?
Binary treatment
So far, I did not specify what kind of variable the treatment (Overtime, Coupon, …) is. In the following, I will assume that the treatment is binary, since most meta-learners only work with binary treatments.
Note that this is often not a problem, since you can always discretize continuous features like overtime into "works less/more than 3 hours of overtime a week". A treatment such as "customer got a coupon" is binary already.
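Such a discretization is a one-liner. As a minimal sketch (the overtime values and the 3-hour threshold are made up here):

```python
import numpy as np

# Hypothetical weekly overtime hours for five employees.
overtime_hours = np.array([0.5, 2.0, 3.5, 6.0, 1.0])

# Binarize: treatment t = 1 if more than 3 hours of overtime per week, else 0.
t = (overtime_hours > 3).astype(int)
print(t)  # [0 0 1 1 0]
```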
In this case, we can see what the uplift of our intervention/treatment is: what happens if we do a certain action vs. what happens if we don’t.
S-learner
Remember the naive method I proposed, i.e., just training a model and plugging in different values for your treatment variable? This is what is called an S-learner, where S stands for single. In the case of a binary treatment T, we plug in 1 and 0 for T and subtract the results. That’s it already.
First, train:

T is just one of our features, but the special one that we want to toggle to see how it influences the outcome. After training, we use the model like this:

Then, for each row, we get an estimate for the uplift, i.e., the result of setting the treatment from 0 to 1.
T-learner
This one is simple as well. When using the T-learner – the T stands for two – you train two models: one on the part of the dataset where T = 1, and one on the part where T = 0.

Then, use it in the natural way:

If you have these treatment effects, you can compute any other statistic you want, such as:
- the average treatment effect over all observations, or
- conditional treatment effects in subgroups of the observations.
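Given an array of per-row uplift estimates, both statistics are one-liners. A minimal sketch (the uplift values and the subgroup mask are made up here):

```python
import numpy as np

# Hypothetical per-row uplift estimates from a meta-learner.
treatment_effects = np.array([0.1, 4.0, 4.2, 3.9, 0.0, -0.1])

# Average treatment effect (ATE) over all observations.
ate = treatment_effects.mean()

# Conditional average treatment effect (CATE) in a subgroup,
# e.g. rows where some feature exceeds a threshold (mask is invented here).
subgroup = np.array([False, True, True, True, False, False])
cate = treatment_effects[subgroup].mean()

print(ate, cate)
```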
Implementation
Let us implement both learners, so you see how easy it is. First, let us load the data.
!pip install polars
import polars as pl
data = pl.read_csv("https://raw.githubusercontent.com/Garve/datasets/main/causal_data.csv")
The data looks like this:

S-learner
Now, train a single model including the treatment variable.
from sklearn.ensemble import HistGradientBoostingRegressor
model = HistGradientBoostingRegressor()
X = data.select(pl.all().exclude("y"))
y = data.select("y").to_numpy().ravel()
model.fit(X.to_numpy(), y)
Now, plug in the training data – or any other dataset with the same format – twice: once replacing t with all ones, and once with all zeros.
X_all_treatment = X.with_columns(t=1).to_numpy() # set treatment to one
X_all_control = X.with_columns(t=0).to_numpy() # set treatment to zero
treatment_effects = model.predict(X_all_treatment) - model.predict(X_all_control)
print(treatment_effects)
# Output:
# [ 0.02497236 3.95121282 4.15999904 ... 3.89655012 0.04411704
# -0.06875453]
You can see that for some rows, the treatment increased the output by about 4, but for others, it did not. With a simple
print(treatment_effects.mean())
you will find that the average treatment effect is around 2. So, if you treated every individual in your dataset, your outcome would increase by around 2 on average, compared to treating nobody.
T-learner
Here, we will train two models.
model_0 = HistGradientBoostingRegressor()
model_1 = HistGradientBoostingRegressor()
data_0 = data.filter(pl.col("t") == 0)
data_1 = data.filter(pl.col("t") == 1)
X_0 = data_0.select(["x1", "x2", "x3"]).to_numpy()
y_0 = data_0.select("y").to_numpy().ravel()
X_1 = data_1.select(["x1", "x2", "x3"]).to_numpy()
y_1 = data_1.select("y").to_numpy().ravel()
model_0.fit(X_0, y_0)
model_1.fit(X_1, y_1)
data_without_treatment = data.select(["x1", "x2", "x3"]).to_numpy()
treatment_effects = model_1.predict(data_without_treatment) - model_0.predict(data_without_treatment)
The results are similar to the S-learner’s output.
Conclusion
In this article, we have learned that estimating causal effects is not as trivial as one might think. The methodology of the S-learner is very natural and easy to implement, but it only yields valid causal insights if the training features form a sufficient adjustment set. The same is true for the T-learner, another option for reaching causal conclusions.
Both methods have their strengths and weaknesses. When doing S-learning, the model could choose to ignore the treatment feature, for example. In this case, the predicted treatment effects would all be zero, even if the truth is different. The T-learner has the problem that one of the two datasets could be really small: if there are only 10 observations with a treatment value of 1, you probably cannot trust that model too much.
There are other methods, such as the X-learner by Künzel et al., that address these problems; we might cover them in a future post.
I hope that you learned something new, interesting, and valuable today. Thanks for reading!
If you have any questions, write me on LinkedIn!
And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!