Dr. Robert Kübler, Author at Towards Data Science
https://towardsdatascience.com/author/dr-robert-kuebler/

Advanced Time Series Forecasting With sktime
https://towardsdatascience.com/advanced-time-series-forecasting-with-sktime-af8eabc76173/
Mon, 11 Nov 2024
Learn how to optimize model hyperparameters and even the architecture in a few lines of code

Photo by Johnny on Unsplash

In my previous article, we explored the basics of time series forecasting with sktime, looking at how to leverage this powerful library for straightforward forecasting tasks. Now, it’s time to take our journey further and dive into the advanced techniques that can help you optimize your forecasts and improve their accuracy. In this follow-up, we’ll explore how to build more sophisticated models, tune hyperparameters, and even do model architecture search with sktime.

Convenient Time Series Forecasting with sktime

Recap

First, for an easy start, let me demonstrate the basic sktime workflow again. This time, we will use the Longley dataset, which is part of sktime (BSD-3 license). It contains various US macroeconomic variables from the years 1947 to 1962 and looks like this:

The entire dataset. Image by the author, data by J. W. Longley.

The columns represent the following variables:

  • GNPDEFL: Gross National Product deflator
  • GNP: Gross National Product
  • UNEMP: Number of unemployed individuals
  • ARMED: Size of the armed forces
  • POP: Population
  • TOTEMP: Total employment

For this article, we can set aside the specific meanings of these variables and simply treat them as six time series that are correlated. Our goal is to forecast TOTEMP using the other variables. So, let us load the data, split it, and visualize it.

Python">import numpy as np
from sktime.datasets import load_longley
from Sktime.forecasting.model_selection import temporal_train_test_split
from sktime.utils import plot_series

y, X = load_longley()

y_train, y_test, X_train, X_test = temporal_train_test_split(y, X, test_size=5)

plot_series(y_train, y_test, labels=["Train", "Test"])
Image by the author.

In the previous article, we didn’t use any exogenous variable X, so let’s begin by ignoring it here as well. We’ll start by building an ARIMA model that uses only y up to the year 1957, where the data split occurs.

from sktime.forecasting.arima import ARIMA

arima = ARIMA()
arima.fit(y_train)
y_pred = arima.predict(fh=np.arange(1, 6))

plot_series(y_train, y_test, y_pred, labels=["Train", "Test", "Prediction"])
Image by the author.

Not a great fit, partly because, by default, ARIMA is just an AR(1) model. However, instead of tweaking hyperparameters, let us use the exogenous variables X to create a better forecast. It is as easy as this:

arimax = ARIMA()
arimax.fit(y_train, X_train)
y_pred_x = arimax.predict(fh=np.arange(1, 6), X=X_test)

plot_series(y_train, y_test, y_pred_x, labels=["Train", "Test", "Prediction with exogenous variables"])
Image by the author.

Adding exogenous data results in a much better fit on the test set! However, note that we also need the values of X when calling the predict method. If these values aren’t available, we’ll need to forecast them separately—a task that can be challenging in itself.
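If the future values of X were not available, one hedged option (not needed for our dataset) is to forecast each exogenous column with its own simple forecaster and feed those forecasts into predict; the overall accuracy then hinges on how well X itself can be forecast. A minimal sketch:

import pandas as pd
from sktime.forecasting.naive import NaiveForecaster

# forecast each exogenous column separately with a simple drift forecaster
X_future = pd.concat(
    {
        col: NaiveForecaster(strategy="drift").fit(X_train[col]).predict(fh=np.arange(1, 6))
        for col in X_train.columns
    },
    axis=1,
)

# then feed the forecasted exogenous variables into the ARIMAX model
y_pred_x_hat = arimax.predict(fh=np.arange(1, 6), X=X_future)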

Pipelines

Now, let’s explore how to build more sophisticated pipelines rather than simply fitting a single model in a one-step process. This approach is similar to scikit-learn, where we can construct a pipeline that imputes missing values, standardizes numerical features, one-hot encodes categorical variables, and trains a KNeighborsRegressor at the end.

To demonstrate the power of pipelines, we’ll replace ARIMA with scikit-learn’s GradientBoostingRegressor in a recursive approach. In this setup, we train a model for one-step-ahead forecasts and then recursively call the model to generate forecasts further into the future:

from sktime.forecasting.compose import make_reduction
from sklearn.ensemble import GradientBoostingRegressor

gb_forecaster = make_reduction(GradientBoostingRegressor(), window_length=4)
gb_forecaster.fit(y_train, X_train)
y_pred = gb_forecaster.predict(fh=np.arange(1, 6), X=X_test)

plot_series(y_train, y_test, y_pred, labels=["Train", "Test", "GradientBoostingRegressor (recursive)"])
Image by the author.

The fit is poor, but the reason for this is simple. Since this isn’t always on the radar of many data science practitioners, let me highlight a fundamental characteristic of decision tree-based algorithms:

Trees can never output values that are higher than the largest or lower than the smallest target they have seen during training.

This means that using tree-based algorithms without additional adjustments will result in poor models when there’s a trend in the data, as we see here.
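A minimal toy illustration of this behavior, using made-up data: a tree sticks to the largest target it has seen, while a linear model happily extrapolates.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10, dtype=float)  # a perfect upward trend

tree = DecisionTreeRegressor().fit(X_toy, y_toy)
line = LinearRegression().fit(X_toy, y_toy)

print(tree.predict([[20]]))  # [9.], capped at the largest training target
print(line.predict([[20]]))  # [20.], follows the trend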

Dealing with trends

There are several ways to help the tree produce more meaningful values, including:

  • imposing a trend, or
  • differencing the data.

Imposing a trend means that you fit a really simple model, such as a line, through the data y and subtract it from y, giving you detrended targets y‘ = y – detrend(y). Then, you train the model on y‘. While it is easy to do with numpy or scikit-learn, it is even easier with sktime.

from sktime.transformations.series.detrend import Detrender

detrender = Detrender().fit(y_train)
detrended_y = detrender.transform(y_train)
plot_series(y_train, y_train - detrended_y, labels=["Train", "Trend"])
Image by the author.

The actual detrended time series – subtracting the orange from the blue line – looks like this:

plot_series(detrended_y, labels=["Detrended"])
Image by the author.

You can see that the time series no longer shows a trend, making it easier to train a tree-based model. Fortunately, you don’t need to manually detrend the time series, train a model on the detrended version, and then reintroduce the trend – sktime has you covered!

from sktime.forecasting.compose import TransformedTargetForecaster

forecaster = make_reduction(GradientBoostingRegressor(), window_length=4)
pipeline = TransformedTargetForecaster([
    ("Detrend", Detrender()),
    ("Forecast", forecaster)
])
pipeline.fit(y_train, X_train)

y_pred = pipeline.predict(fh=np.arange(1, 6), X=X_test)

plot_series(y_train, y_test, y_pred, labels=["Train", "Test", "Prediction"])
Image by the author.

Much better! Also, note that without changing any hyperparameters, you get a linear trend. You can also change it to a polynomial, exponential, or square root trend, and many more. You can even use Prophet’s trend logic via ProphetPiecewiseLinearTrendForecaster, if you like.
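For example, here is a hedged sketch of swapping in a quadratic trend via sktime’s PolynomialTrendForecaster:

from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.transformations.series.detrend import Detrender

# fit a quadratic trend instead of the default linear one and remove it
quadratic_detrender = Detrender(forecaster=PolynomialTrendForecaster(degree=2))
detrended_quadratically = quadratic_detrender.fit_transform(y_train)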

Note: You can also use the `*` operator to create TransformedTargetForecaster pipelines! Just do `Detrender() * forecaster` and see the magic!


As an alternative to imposing a trend, we can use differencing, which is precisely what the "I" in ARIMA represents. Essentially, instead of forecasting y directly, you forecast y.diff() in pandas logic – that is, the difference between two consecutive time series values.

from sktime.transformations.series.difference import Differencer

differencer = Differencer().fit(y_train)
detrended_y = differencer.transform(y_train)
plot_series(detrended_y, labels=["Difference"])
Image by the author.

And with this object, you can just change the old pipeline to

pipeline = TransformedTargetForecaster([
    ("Difference", Differencer()),
    ("Forecast", forecaster)
])

if you want.

Imputation and more

Let us assume that there are missing values in the time series y, and your model does not know how to handle them. You can easily impute them using sktime’s versatile Imputer class.

import pandas as pd
from sktime.transformations.series.impute import Imputer

y = pd.Series([1, 2, 3, np.nan, 5])
imputer = Imputer(method="linear")
imputer.fit_transform(y)

# Output:
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    5.0
# dtype: float64

You can also just add it to the pipeline to let it grow even further:

pipeline = TransformedTargetForecaster([
    ("Impute", Imputer()),
    ("Difference", Differencer()),
    ("Forecast", forecaster)
])

Or we can add a log transformation as well:

from sktime.transformations.series.boxcox import LogTransformer

pipeline = TransformedTargetForecaster([
    ("Impute", Imputer()),
    ("Log", LogTransformer()),
    ("Difference", Differencer()),
    ("Forecast", forecaster)
])

There are many more transformations available, and I recommend exploring the API reference of sktime. For now, though, let’s shift our focus to hyperparameter optimization.

Hyperparameter Optimization

Alright, we’ve chained several transformations, each consisting of different objects. Each of these objects has a set of hyperparameters, and naturally, we want to find the combination that yields the best results. Let us take a look at our pipeline again:

pipeline = TransformedTargetForecaster([
    ("Impute", Imputer()),
    ("Log", LogTransformer()),
    ("Difference", Differencer()),
    ("Forecast", make_reduction(GradientBoostingRegressor()))
])

We can use pipeline.get_params() to see all the hyperparameters we can tune.

...
'Impute__forecaster': None,
'Impute__method': 'drift',
...
'Log__offset': 0,
'Log__scale': 1,
'Difference__lags': 1,
'Difference__memory': 'all',
'Difference__na_handling': 'fill_zero',
...
'Forecast__window_length': 10,
'Forecast__estimator__alpha': 0.9,
'Forecast__estimator__ccp_alpha': 0.0,
'Forecast__estimator__criterion': 'friedman_mse',
...

Just like scikit-learn’s GridSearchCV or RandomizedSearchCV, sktime offers [ForecastingGridSearchCV](https://www.sktime.net/en/stable/api_reference/auto_generated/sktime.forecasting.model_selection.ForecastingGridSearchCV.html), [ForecastingRandomizedSearchCV](https://www.sktime.net/en/stable/api_reference/auto_generated/sktime.forecasting.model_selection.ForecastingRandomizedSearchCV.html), and even [ForecastingOptunaSearchCV](https://www.sktime.net/en/stable/api_reference/auto_generated/sktime.forecasting.model_selection.ForecastingOptunaSearchCV.html). Let us stick with the grid search version for now. Here we go:

from sktime.forecasting.model_selection import ForecastingGridSearchCV
from sktime.split import ExpandingWindowSplitter

pipeline = TransformedTargetForecaster([
    ("Impute", Imputer()),
    ("Log", LogTransformer()),
    ("Difference", Differencer()),
    ("Forecast", make_reduction(GradientBoostingRegressor()))
])

# forecast 4 steps
cv = ExpandingWindowSplitter(fh=np.arange(1, 5), initial_window=5)

grid = ForecastingGridSearchCV(
    forecaster=pipeline,
    cv=cv,
    param_grid={
        "Forecast__window_length": [1, 2, 3, 4, 5],
        "Forecast__estimator__learning_rate": [0.1, 0.05, 0.01],
        "Log__scale": [1, 2, 3],
    }
)

grid.fit(y_train, X_train)

Just like in scikit-learn, we can get the best parameters via grid.best_params_:

{'Forecast__estimator__learning_rate': 0.01,
 'Forecast__window_length': 2,
 'Log__scale': 1}

And make predictions using

y_pred = grid.predict(fh=np.arange(1, 6), X=X_test)

plot_series(y_train, y_test, y_pred, labels=["Train", "Test", "Prediction"])
Image by the author.

Optimize the architecture

A great feature of sktime is that we can optimize not only the hyperparameters but also the steps in the pipeline. For example, we’ve included a log transformer in the pipeline – should we keep it, or would it be better to exclude it? sktime offers the [OptionalPassthrough](https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.transformations.compose.OptionalPassthrough.html) class to answer this question in an easy way:

from sktime.transformations.compose import OptionalPassthrough

pipeline = TransformedTargetForecaster([
    ("Impute", Imputer()),
    ("Log", OptionalPassthrough(LogTransformer())), # maybe use log
    ("Difference", Differencer()),
    ("Forecast", make_reduction(GradientBoostingRegressor()))
])

grid = ForecastingGridSearchCV(
    forecaster=pipeline,
    cv=cv,
    param_grid={
        "Forecast__window_length": [1, 2, 3, 4, 5],
        "Forecast__estimator__learning_rate": [0.1, 0.05, 0.01],
        "Log__passthrough": [True, False], # use log?
    }
)

grid.fit(y_train, X_train)

This can fundamentally change the architecture of the model and is a very powerful way to build your own AutoML pipeline.


Another architectural choice to consider is the order of the Log and Difference steps. It might be better to apply Difference first and then Log. However, you don’t want to manually define both versions. The problem grows as you add more steps to permute since you can reorder n steps in n! ways, which quickly becomes unmanageable. Let’s check out the Permute class!

from sktime.forecasting.compose import Permute

pipeline = TransformedTargetForecaster([
    ("Impute", Imputer()),
    ("Log", OptionalPassthrough(LogTransformer())),
    ("Difference", Differencer()),
    ("Forecast", make_reduction(GradientBoostingRegressor()))
])

permute = Permute(pipeline)

grid = ForecastingGridSearchCV(
    forecaster=permute,
    cv=cv,
    param_grid={
        "permutation": [
            ["Impute", "Log", "Difference", "Forecast"],
            ["Impute", "Difference", "Log", "Forecast"],
        ],
        "estimator__Forecast__window_length": [1, 2, 3, 4, 5],
        "estimator__Forecast__estimator__learning_rate": [0.1, 0.05, 0.01],
        "estimator__Log__passthrough": [True, False],
    }
)

grid.fit(y_train, X_train)

I think this flexibility is crazy! In our toy example, this is the result:

{'estimator__Forecast__estimator__learning_rate': 0.01,
 'estimator__Forecast__window_length': 1,
 'estimator__Log__passthrough': True,
 'permutation': ['Impute', 'Log', 'Difference', 'Forecast']}

So, essentially, our pipeline consists of imputation, differencing, and the actual forecasting step. The optimal window length is 1, and the learning rate has been reduced from the default 0.1 to 0.01.

Conclusion

sktime stands out as a powerful tool for Time Series Forecasting, offering a highly flexible and user-friendly framework that enables the seamless construction and optimization of complex pipelines. Whether you’re adding exogenous variables, experimenting with transformations like differencing and detrending, or fine-tuning hyperparameters, sktime allows you to quickly assemble the necessary components and optimize them with just a few lines of code.

Its ability to handle model architecture searches, such as testing different transformations or reordering steps, makes it a great choice for anyone looking to build sophisticated time series forecasting models.


I hope that you learned something new, interesting, and valuable today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!

All About Algorithms

Convenient Time Series Forecasting with sktime
https://towardsdatascience.com/convenient-time-series-forecasting-with-sktime-bb82375e846c/
Wed, 25 Sep 2024
How to make forecasting as easy as a walk in the park

Ah, time series forecasting. It’s the quintessential task for many data scientists, somewhat universal across various industries. This field is so valuable because if you have a crystal ball to see some key numbers ahead of time, you can use that information to get a head start and prepare for what’s coming down the pipeline.

Image by the author.

Consider a call center: forecasting call volume allows for optimized staffing, ensuring efficient handling of customer inquiries. In retail, predicting when an item will go out of stock enables timely reordering, preventing lost sales and maximizing revenue. And of course, the holy grail of stock market prediction: if you could do this, you would be rich.

In this article, I want to show you how to do it the easy way using the awesome library sktime, the scikit-learn of time series forecasting.

Why Not Just Use scikit-learn?

Fair question! It’s kind of like asking, "Why use a fancy food processor when I have a knife and a cutting board?". Sure, you could chop everything by hand, but the food processor is designed specifically for that task and makes it way easier and faster. Let’s see how our food processor called sktime can help us.

Image by the author.

Assume that we are given a history of data points y = (y(1), y(2), …, y(T)). We want to use them to predict the next time step y(T + 1). T might be today, and T + 1 the day tomorrow.

Modeling via regression

We can use standard Machine Learning techniques to create forecasts, but we have to rephrase the forecasting problem as a regression problem first. We have seen that in time series forecasting, we only use a vector y to predict another value, while in supervised machine learning, we need features and labels to train a model. Luckily, the translation is quite simple:

Animation by the author.

Here, we use a sliding window approach using the first three y values as features and the fourth as the label. These numbers of features and labels are hyperparameters that you can tune later.

Manual Implementation

It is straightforward to implement the logic from above:

import numpy as np

def reduce(time_series: np.ndarray, n_lags: int):
    X = []
    y = []

    for window_start in range(len(time_series) - n_lags):
        X.append(time_series[window_start : window_start + n_lags])
        y.append(time_series[window_start + n_lags])

    return np.array(X), np.array(y)

Here, I have just implemented the special case with a single label. You can use it like this:

ts = np.array([1, 2, 3, 4, 5, 6, 7, 8])
X, y = reduce(ts, n_lags=3)

print(X)
print(y)

# Output:
# [[1 2 3]
#  [2 3 4]
#  [3 4 5]
#  [4 5 6]
#  [5 6 7]]
#
# [4 5 6 7 8]

You can now take any supervised model you like and train it on (X, y). So, why do we need another library for that again? Well, in this easy case, everything is fine with our custom code, but if we want to:

  • forecast several time steps with a special logic,
  • preprocess the time series, or
  • use time series methods that don’t rely on scikit-learn models

we’ll be thankful for sktime’s flexibility. Let us see what I mean by that.

Introduction to sktime

I have told you now how awesome sktime is, but I still owe you the proof. So, let us express the sliding window logic and the model training with sktime’s syntax.

First example

Let us import some functionality and data. We will use the classical airline dataset created by Box & Jenkins that comes with sktime (BSD-3 license). It contains the monthly number of airline passengers over time, starting from January 1949. We want to extrapolate the passenger numbers into the future so that, as an airline, we can deploy enough airplanes.

from sklearn.linear_model import LinearRegression
from sktime.datasets import load_airline
from sktime.forecasting.compose import make_reduction

y = load_airline()
Image by the author.

Now, we can train the model using the sliding window approach like this:

ml_model = make_reduction(LinearRegression(), window_length=12)
ml_model.fit(y)

That’s it. We use a window length of 12 – meaning that we use the past 12 months to forecast the next one – because we can observe yearly seasonality. We can also tune this number, but 12 is good already. Now, sktime lets us create nice plots using the function plot_series:

from sktime.utils.plotting import plot_series

plot_series(
    y, # plot the history
    ml_model.predict(fh=np.arange(1, 13)), # plot the prediction
    labels=["History", "Forecast"]
)
Image by the author.

A few things to mention here. First, look how easy it is to make predictions. We can just call .predict and give it the (relative) time steps we are interested in. fh stands for forecasting horizon, and in order to forecast the next 12 months, we can set it to [1, 2, …, 11, 12].

The .predict method yields a pandas Series with a consistent time index, which enables the plot function to properly align the original time series and the forecast.
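As a side note, fh does not have to be relative. Here is a hedged sketch with an absolute ForecastingHorizon, assuming the airline series ends in December 1960 (which it does in sktime’s version):

import pandas as pd
from sktime.forecasting.base import ForecastingHorizon

# build an absolute horizon for the 12 months of 1961
fh_abs = ForecastingHorizon(
    pd.period_range("1961-01", periods=12, freq="M"), is_relative=False
)
y_pred_abs = ml_model.predict(fh=fh_abs)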

Under the hood

We have developed a linear regression model designed to predict just the subsequent time step. So, how have we forecasted twelve steps ahead? There are multiple methods to accomplish this, but sktime’s default method is recursive forecasting. Here is how it functions: Initially, we forecast the next time step, which is straightforward.

Image by the author.

To predict the second one, we treat the first prediction we have created as an actual input.

Image by the author.

Just repeat this process until you have created as many predictions as you need!

Note: I won’t go into great detail about the recursive approach. Just bear in mind that eventually, the model begins to make predictions based only on its previous forecasts, which might reduce the accuracy as we predict further ahead. sktime also offers other strategies, such as direct forecasting, which uses a separate model for each time step, and multioutput forecasting, where a single model predicts multiple time steps at once.
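A hedged sketch of these alternative strategies; the non-recursive strategies typically need the forecasting horizon already at fit time:

import numpy as np
from sklearn.linear_model import LinearRegression
from sktime.forecasting.compose import make_reduction

# one model per forecast step
direct_model = make_reduction(LinearRegression(), window_length=12, strategy="direct")
direct_model.fit(y, fh=np.arange(1, 13))
y_pred_direct = direct_model.predict(fh=np.arange(1, 13))

# one model predicting all 12 steps at once
multi_model = make_reduction(LinearRegression(), window_length=12, strategy="multioutput")
multi_model.fit(y, fh=np.arange(1, 13))
y_pred_multi = multi_model.predict(fh=np.arange(1, 13))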

Classic time series models

If you want to try out classical models such as ARIMA, Exponential Smoothing (ETS), or TBATS, sktime has you covered as well. For example, if you want to train a triple exponential smoothing (Holt-Winters) model with multiplicative trend and seasonality, you can do it like this:

from sktime.forecasting.exp_smoothing import ExponentialSmoothing

classical_model = ExponentialSmoothing(trend="mul", seasonal="mul")
classical_model.fit(y)

You can also plot the two forecasts in the same image, just to eyeball which one looks better.

plot_series(
    y,
    ml_model.predict(fh=np.arange(1, 37)),
    classical_model.predict(fh=np.arange(1, 37)),
    labels=["History", "ML Forecast", "ETS Forecast"]
)
Image by the author.

We observe that both forecasts begin somewhat similarly, but the peaks in the exponential smoothing (ETS) version appear to escalate more rapidly. In my view, both forecasts seem plausible; however, to determine which one is objectively better, we would need to conduct a more thorough analysis using performance metrics. Let’s do it!

Cross-validation

We can import some more functions and classes to measure performance properly.

from sktime.forecasting.model_evaluation import evaluate
from sktime.performance_metrics.forecasting import MeanSquaredError, MeanAbsolutePercentageError, MeanAbsoluteError
from sktime.split import ExpandingWindowSplitter

cv = ExpandingWindowSplitter(
    initial_window=36,
    step_length=12,
    fh=np.arange(1, 13)
)

metrics = [
    MeanAbsolutePercentageError(), # MAPE
    MeanSquaredError(square_root=True), # RMSE by setting square_root = True, otherwise MSE
    MeanAbsoluteError() # MAE
]

ml_evaluation = evaluate(ml_model, cv, y, scoring=metrics)
classical_evaluation = evaluate(classical_model, cv, y, scoring=metrics)

We evaluate using an expanding window: it starts with a length of 36 months, is shifted by 12 months in each step, and we always predict 12 months ahead. Visually:

Blue is the train, orange the test period. Image by the author.

For these 9 different train-test splits, we get the following results for the linear regression model:

Image by the author.

For each of the 9 splits, we can see MAPE, RMSE, and MAE on the test set. Furthermore, sktime gives the fit and prediction time durations as well as some information about the training dataset. Similarly, for the exponential smoothing model, we get:

Image by the author.

Since it is a bit hard to parse, let us just take the mean per dataframe to boil down the information, as sketched below.
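A minimal sketch of that aggregation, assuming the metric columns follow sktime’s test_<MetricName> naming convention (worth double-checking against the actual evaluate() output):

import pandas as pd

# average each metric over the 9 splits, for both models
summary = pd.concat(
    {
        "ML": ml_evaluation.filter(like="test_").mean(),
        "ETS": classical_evaluation.filter(like="test_").mean(),
    },
    axis=1,
)
print(summary)

If we visualize the RMSE and MAE, we get: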

Image by the author.

So, in this case, ETS wins. However, keep in mind that the linear regression model is not tuned. One fundamental hyperparameter to play around with is the window length. You can also use sktime for feature engineering, e.g., creating running means or standard deviations.

Conclusion

In this article, I have shown how simple it is to employ sktime for daily forecasting tasks. It is as user-friendly as scikit-learn, and you can even integrate your preferred scikit-learn models to make predictions!

However, we have only explored the basic features of sktime. In a future article, we will tackle:

  • transforming target variables,
  • utilizing pipelines,
  • performing hyperparameter tuning,
  • feature engineering,
  • and much more.

Additionally, sktime provides tools for reconciling forecasts, which is useful when working with hierarchical time series. Find out more here:

How to Forecast Hierarchical Time Series


I hope that you learned something new, interesting, and valuable today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!

All About Algorithms

How to Forecast Hierarchical Time Series
https://towardsdatascience.com/how-to-forecast-hierarchical-time-series-75f223f79793/
Tue, 20 Aug 2024
A beginner's guide to forecast reconciliation

Imagine you’re a data scientist at a charming little pet shop specializing in just five products: two varieties of cat food and three types of dog food. Your mission? To help this small business flourish by accurately forecasting the weekly sales for each product. The goal is to provide a comprehensive sales forecast – total sales, as well as detailed predictions for cat food and dog food sales, and even individual product sales.

The Data

You have data on the sales of the different types of cat food A and B, as well as the different types of dog food C, D, and E for 200 days. Lucky for us, the data is exceptionally clean, with no missing values, and no outliers. Also, there is no trend. It looks like this:

Image by the author.

Note: I generated the data myself.

In addition to the individual sales, we also have the aggregated sales for all cat food products, all dog food products, and all products. We call such a collection of time series hierarchical time series. In our case, they respect the following sales hierarchy:

Image by the author.

The individual sales A, B, C, D, and E are at the bottom of the hierarchy, the total sales are at the top.

This structure translates to the following equations:

  • sales_cat_A + sales_cat_B = sales_cat,
  • sales_dog_C + sales_dog_D + sales_dog_E = sales_dog,
  • sales_dog + sales_cat = sales_total.

So far so good.

Modeling

Now, the idea is to train a model for each of the eight time series and call it a day. You train the model on the first 140 days and try to predict the remaining 60. You want to produce forecasts for the next 60 days as well. After playing around with it a bit, this is what you came up with (using exponential smoothing with an additive trend):

Image by the author.
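For reference, a hedged sketch of what this per-series setup could look like in code; the dataset itself is not shown here, so sales stands for a hypothetical DataFrame holding the eight columns:

import numpy as np
from sktime.forecasting.exp_smoothing import ExponentialSmoothing

forecasts = {}
for column in sales.columns:  # sales: hypothetical DataFrame with the eight series
    model = ExponentialSmoothing(trend="add")
    model.fit(sales[column].iloc[:140])                       # train on the first 140 days
    forecasts[column] = model.predict(fh=np.arange(1, 61))    # predict 60 days ahead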

The performance on the test set (days 140–200) looks quite good, so you decide to send the forecasts for days 200–260 to your stakeholders. But a day later, you receive the following message:

"Hi, thanks again for these awesome forecasts! But there is something weird about them: Somehow they don’t add up. For example, if I add the predicted sales of products A and B, I don’t get the predicted sales of the whole cat food category. Can you please look into that again?" – Bob

Sweating, you look at the first few rows of your produced outputs:

Image by the author.

You calculate the first row in the cat food category by hand: 93.85 + 160.42 = 254.27, but the direct model for sales_cat says 254.45, which is only 0.18 off, but still, it is not the same. Which numbers should you trust now?

Note: Often, when the time series are harder to learn, and the numbers get bigger, you can also be off by a few thousand or even worse.

Forecast Reconciliation

Oh damn, that makes sense from a technical perspective. You have just trained eight independent models, so it would have been a miracle if their forecasts had magically added up. But don’t despair, there is a quite simple technique to save the day: forecast reconciliation!

The goal is to adjust the outputs that you have just created in a way that they respect the hierarchy again. Concretely:

  • Individual cat food product sale forecasts add up to the cat food sale forecast.
  • Individual dog food product sale forecasts add up to the dog food sale forecast.
  • Last but not least, cat and dog food sales forecasts add up to the total sales forecast.

Forecast reconciliation sounds complicated, but it is a technique that comes in many flavors, some of which are very simple to understand and implement. Let us start with the easiest one.

Bottom-Up

The natural thing to do for me is to only forecast the individual product sales of A, B, C, D, and E – the bottom forecasts – and then add them up according to the hierarchy to create the forecasts of the higher levels.

So, basically:

  • sales_cat_forecast := sales_cat_A_forecast + sales_cat_B_forecast,
  • sales_dog_forecast := sales_dog_C_forecast + sales_dog_D_forecast + sales_dog_E_forecast,
  • sales_total_forecast := sales_dog_forecast + sales_cat_forecast.

This also means that you only have to train five models instead of eight. So, you can take your individual product forecasts again and just add up a bunch of columns.

The individual (bottom) level forecasts. Image by the author.

Now, you can add a column sales_cat by just summing sales_cat_A and sales_cat_B. Doing the same for sales_dog and finally sales_total, you create bottom-up reconciled forecasts:
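A minimal sketch of that column arithmetic, where forecast_df is a hypothetical DataFrame holding the five bottom-level forecasts:

# sum the bottom-level forecasts to fill in the higher levels of the hierarchy
forecast_df["sales_cat"] = forecast_df["sales_cat_A"] + forecast_df["sales_cat_B"]
forecast_df["sales_dog"] = forecast_df[["sales_dog_C", "sales_dog_D", "sales_dog_E"]].sum(axis=1)
forecast_df["sales_total"] = forecast_df["sales_cat"] + forecast_df["sales_dog"]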

Bottom-up reconciled forecasts. Image by the author.

By definition, your forecasts respect the hierarchy now! 🎉 That was easy, right?

But what happens with the quality of your forecasts in the higher hierarchy levels if you do that?

Sure, at the bottom level, your forecasts are as good as before. But what is better on the higher, more aggregated levels? The direct modeling approach from before, or summing up forecasts from lower levels? The answer, as so often: It depends.

Some time series are hard to forecast, but the disaggregated, lower series might be easier. In this case, the bottom-up approach – apart from giving you coherent forecasts – could even improve the forecast quality.

But sometimes, the disaggregated time series are very jumpy and hard to predict, while on an aggregated level, forecasting would be simpler. Here, a direct model is better from the performance perspective.

In our small example, the result for sales_total looks like this:

Image by the author.

Not a big difference visually, but in numbers, the direct forecast has a test MSE of 2547, while the reconciled bottom-up forecast has a higher MSE of 2699.

Top-Down

Where there is a bottom-up, there must also be a top-down. It is simple as well, but for me, it does not come as naturally as the bottom-up approach. In this approach, you only forecast a single time series: the topmost one! Once you have this, you can break it down to get forecasts for the lower levels. The idea is the following:

Assume you have an overall forecast for the total sales, and you know that historically, 40% of the sales come from cat food and 60% from dog food. Then, you can just multiply the total sales forecast by 40% to get the cat food sales, and multiply it by 60% to get the dog food sales.

You can do the same one level lower. If you know that 20% of the cat food sales come from product A and 80% from product B, multiply the cat food sales forecast by 20% to get a forecast for cat food product A, and multiply it by 80% to get a forecast for cat food product B. If you do it like that, you get a set of forecasts that respect the sales hierarchy.
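A hedged sketch of that top-down logic, using the proportions from the text (40% cat, 60% dog, and a 20/80 split within cat food); the dog food shares below are made up for illustration, and total_forecast stands for a hypothetical pandas Series with the total sales forecast:

import pandas as pd

top_down = pd.DataFrame({"sales_total": total_forecast})  # total_forecast: hypothetical Series

# split the total into the two categories
top_down["sales_cat"] = 0.4 * top_down["sales_total"]
top_down["sales_dog"] = 0.6 * top_down["sales_total"]

# split each category into its products
top_down["sales_cat_A"] = 0.2 * top_down["sales_cat"]
top_down["sales_cat_B"] = 0.8 * top_down["sales_cat"]
for product, share in {"sales_dog_C": 0.3, "sales_dog_D": 0.3, "sales_dog_E": 0.4}.items():
    top_down[product] = share * top_down["sales_dog"]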

So, you start with your direct model forecast of sales_total:

Image by the author.

compute the percentages from historical data, and end up with the following:

Image by the author.

However, the big problem that I have with this method is that all the time series that you create are perfectly positively correlated by construction.

Image by the author.

Of course, there could be cases where all products behave the same way, but usually, this is very unlikely. Think about a department store selling fans and heaters. In summer, they sell a lot of fans, but rarely any heater. In winter, it is the other way around. This means that the time series are negatively correlated, if one goes up, the other goes down.

With the top-down approach, both time series can only have the same pattern, i.e., both go up together or both go down together. That’s why you have to be careful if you want to use the top-down approach.

Here is a comparison between a direct forecast and the top-down reconciled forecast for sales_cat_A:

The top-down forecast is way off. Image by the author.

Note: There is also the middle-out approach, where you forecast time series on a level in the middle and work yourself up or down again, as described in the bottom-up and top-down approaches. However, it has the same disadvantage as the top-down approach for the lower levels.

General Reconciliation

Ok, cool, so we have seen two simple techniques to reconcile forecasts. Just presented like this they look quite different, but they follow a general pattern that can be used to develop even better forecast reconciliation techniques!

We can describe this general reconciliation as matrix multiplication with some matrix A, turning the raw, incoherent forecasts into reconciled ones.

Image by the author.

In order to find A, we define two other matrices first.

Summing Matrix

Let us define a matrix S – called the summing matrix – that encodes the hierarchy. We want this matrix to take all five bottom forecasts and sum them together to also get the higher forecasts in the hierarchy, like this:

Image by the author.

It is easy to derive this matrix, in our case, it is the following:

Image by the author.

If you do the matrix-vector multiplication (and you should!), you will see that you get what you want. But you can also read it as:

To get sales_total, you need all five bottom forecasts. To get sales_cat, you only need the first two bottom forecasts, which correspond to sales_cat_A and sales_cat_B, and so on.

This matrix is used for the bottom-up, top-down, and all other approaches.

Mapping Matrix

We need to define another matrix M – called the mapping matrix – that changes the reconciliation logic. This matrix should take all direct forecasts and turn them into a smaller number of forecasts – as many as there are bottom time series. In our example, it would take a vector of length 8 and turn it into a vector of length 5.

Since this is very vague, let us check how the M looks for the bottom-up approach in our example:

The bottom-up mapping matrix. Image by the author.

So, what can you do with this matrix now? Well, if you multiply this matrix with your eight incoherent base forecasts from the start, you end up with a shorter vector containing only the five bottom forecasts.

Image by the author.

And if we multiply the vector on the right-hand side with S, we get exactly what the bottom-up approach gives us. So, if we define A := S · M, we get what we want. We go from eight unreconciled base forecasts to eight reconciled forecasts, using the bottom-up approach.


Let us also do the top-down approach now. We only have to define another matrix M as

The top-down mapping matrix. Image by the author.

Here, the pᵢ‘s are the percentages that you multiply by the top forecast to get the bottom forecasts. Try it out by multiplying the matrix above with a concrete vector of length 8 of your choice!

Technically, implementing it like this works a bit differently than I explained the top-down approach before, but the result is exactly the same. If you do it via matrix multiplication, you first compute the bottom-level forecasts by multiplying the top-level forecast with some percentages. And then you use the matrix S to build the higher levels again.
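To make this concrete, here is a hedged numpy sketch of such a top-down mapping matrix M; the shares are made up for illustration and must sum to 1. Only the first column is non-zero, so M picks out the total forecast and distributes it over the five bottom series:

import numpy as np

p = np.array([0.08, 0.32, 0.18, 0.18, 0.24])  # illustrative bottom-level shares, summing to 1
M_top_down = np.zeros((5, 8))
M_top_down[:, 0] = p  # only the top-level (total) forecast is used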

More Mapping Matrices

So, we have seen that you can express the bottom-up as well as the top-down reconciliation approaches via matrix multiplication. To be more precise, the reconciliation works as follows:

  1. For each time series, train a model.
  2. Compute the matrix S (defined by the hierarchy, unique), and choose a matrix M (defines the algorithm, your choice).
  3. Make forecasts y with your models, and multiply the result with S · M (y_rec = S · M · y).
  4. You now have new, reconciled forecasts y_rec.

Here is some code, so you can see how it works in detail:

import numpy as np

y_raw = np.array([
    [1, 2], # forecasts of the total sales for the next 2 days
    [4, 2], # forecasts of the cat food sales for the next 2 days
    [2, 3], # ... dog food ...
    [1, 0], # ... cat food A ...
    [1, 2], # ... cat food B ...
    [2, 1], # ... dog food C...
    [0, 2], # ... dog food D...
    [4, 3], # ... dog food E...
]) # you see that the cat food sales is not the sum of A and B

S = np.array(
    [
        [1, 1, 1, 1, 1], # the total sales are the sum of all bottom sales
        [1, 1, 0, 0, 0], # cat food sales is the sum of only A and B
        [0, 0, 1, 1, 1], # dog food sales is the sum of only C, D, and E
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1],
    ],
)

M = np.array(
    [
        [0, 0, 0, 1, 0, 0, 0, 0], # bottom up matrix
        [0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 1],
    ],
)

y_reconciled = S @ M @ y_raw # @ = matrix multiplication

print(y_reconciled)

# Output:
# [[8 8]
# [2 2]
# [6 6]
# [1 0]
# [1 2]
# [2 1]
# [0 2]
# [4 3]]

# You can see that the bottom 5 rows stay the same, 
# as in the bottom up approach. The aggregates sales get replaced
# by sums of the bottom forecasts, e.g., [8 8] is the sum of the 
# 5 bottom forecasts [1 0] + [1 2] + [2 1] + [0 2] + [4 3].

But what happens if we choose a completely different M? There is no reason why the special forms of the bottom-up or top-down mapping matrices M should be any good.

For example, Wickramasuriya et al. [1] found a matrix M that minimizes the total forecast variance of the reconciled forecasts. This is called the MinT (Minimum Trace) optimal reconciliation approach. However, this method is a bit more complicated and has its problems since you have to estimate error variances that you do not observe. There are some heuristics to construct M that work well in practice, though.
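To give you an idea, here is a hedged numpy sketch of the MinT recipe from [1], reusing S and y_raw from the code above. With W an estimate of the covariance of the base forecast errors, the mapping matrix is M = (Sᵀ W⁻¹ S)⁻¹ Sᵀ W⁻¹; setting W to the identity, as done here purely for illustration, reduces this to the simple OLS special case.

W = np.eye(8)  # assumption: identity error covariance (the OLS special case of MinT)
W_inv = np.linalg.inv(W)
M_mint = np.linalg.inv(S.T @ W_inv @ S) @ S.T @ W_inv

y_reconciled_mint = S @ M_mint @ y_raw  # reconciled forecasts, same shape as y_raw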

Luckily, many libraries, such as sktime and the libraries created by Nixtla, implement all of the approaches I mentioned earlier.

Conclusion

In this article, we have explored the challenges associated with training independent models when there’s an expectation for aggregate consistency in predictions. For instance, the total forecasts for sales of individual items within a category should ideally match the forecasted sales for the entire category. However, independent models do not inherently ensure this alignment. To achieve coherence, one must engage in a subsequent step known as reconciliation, where the outputs of these models are adjusted to ensure they sum up correctly.

There are two simple ways to do the reconciliation that you can even implement on your own right away. For the more complicated, but potentially better methods, you can also resort to Time Series Forecasting libraries.

References

[1] Wickramasuriya, S. L., Athanasopoulos, G., & Hyndman, R. J. (2019). Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association, 114(526), 804–819.


I hope that you learned something new, interesting, and valuable today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!

All About Algorithms

A Small Python Library For Marketing Mix Modeling: MaMiMo
https://towardsdatascience.com/a-small-python-library-for-marketing-mix-modeling-mamimo-100f31666e18/
Fri, 17 May 2024
Create marketing mix models the scikit-learn way

Photo by Chris Lawton on Unsplash

Hello! I noticed that people are really interested in my articles about Marketing mix modeling, and that’s why I have created a small present for you: a small library that helps you create simple marketing mix models yourself! In this article, I will show you how to use it.

⚠ Obligatory warning ⚠: I tried my best to make this library as solid and error-free as possible. However, there might still be bugs, so please always do some sanity checks afterwards before reporting anything to the stakeholders. If you spot an error or request a feature, just drop me a message or – even better – create a pull request on Github! 😉


In case you don’t know what Marketing Mix Modeling is, imagine you are in a company selling stuff. To sell even more of said stuff, you do advertising. At some point, you want to know how well your advertising performed per channel, i.e., TV, radio, web banners, … and answer questions like: "This 1000 € that I put into TV ads that week, by how much did this increase my revenue?". Marketing mix modeling is a simple way to do that. You can find more information in my articles about it:

Introduction to Marketing Mix Modeling in Python

An Upgraded Marketing Mix Modeling in Python

The library I created is creatively called MaMiMo, and if you know how to use scikit-learn, you can also use this one. Let the fun begin with a simple

pip install mamimo

A Small Example

If you have read both of the top articles – which I assume from now on – you probably remember that we used an artificial dataset there. We then defined some saturation and carryover transformations to conduct marketing mix modeling. I put a similar but slightly more complicated example dataset into mamimo to get you started:

from mamimo.datasets import load_fake_mmm

data = load_fake_mmm()

X = data.drop(columns=['Sales'])
y = data['Sales']

This gives us a dataset with a date index on a weekly basis, three media channels, and a sales column that we want to explain.

Image by the author.

Let us plot the sales to see what is going on:
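For example, with pandas’ built-in plotting (a minimal sketch):

y.plot(figsize=(16, 6), title="Sales")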

Image by the author.

Here, we can see a general upward trend. There might also be seasonality, but it is hard to tell just by looking at it. Let us also address the elephant in the room: this big spike in the first week of January 2020. We will assume this was a very important day for our product, not just an outlier.

However, let us keep it simple for the moment and just try to explain the sales using the TV, radio, and banner spendings. As in the second article, we want to build the following model:

Image by the author.

Starting Out

We can straightforwardly implement this pipeline using mamimo’s carryover and saturation submodules, as well as its LinearRegression that is more flexible than scikit-learn’s version. What we need from scikit-learn, though, is the pipelining functionality.

As a reminder: ColumnTransformer applies different pipelines to different columns. We need it to have different hyperparameters for different channels.

from mamimo.carryover import ExponentialCarryover
from mamimo.saturation import ExponentialSaturation
from mamimo.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

adstock = ColumnTransformer(
    [
     ('tv_pipe', Pipeline([
            ('carryover', ExponentialCarryover()),
            ('saturation', ExponentialSaturation())
     ]), ['TV']),
     ('radio_pipe', Pipeline([
            ('carryover', ExponentialCarryover()),
            ('saturation', ExponentialSaturation())
     ]), ['Radio']),
     ('banners_pipe', Pipeline([
            ('carryover', ExponentialCarryover()),
            ('saturation', ExponentialSaturation())
     ]), ['Banners']),
    ]
)

model = Pipeline([
    ('adstock', adstock),
    ('regression', LinearRegression(positive=True))
])

This yields a model that is not very good on its own because it still needs hyperparameter tuning.

print(model.fit(X, y).score(X, y))

# Output:
# 0.10985072579909416

Even training and evaluation on the same set (never 👏 do 👏 that 👏 in 👏 production 👏 ) yields a pretty bad result – we are underfitting. So, let us tune some hyperparameters.

Hyperparameter Tuning

Let us use sklearn’s RandomizedSearchCV today to tune the hyperparameters. We will tune the saturation exponent and the carryover strength and length for all channels separately. You can address the radio carryover length via adstock__radio_pipe__carryover__window, for example.

from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

tuned_model = RandomizedSearchCV(
    model,
    param_distributions={
        'adstock__tv_pipe__carryover__window': randint(1, 10),
        'adstock__tv_pipe__carryover__strength': uniform(0, 1),
        'adstock__tv_pipe__saturation__exponent': uniform(0, 1),
        'adstock__radio_pipe__carryover__window': randint(1, 10),
        'adstock__radio_pipe__carryover__strength': uniform(0, 1),
        'adstock__radio_pipe__saturation__exponent': uniform(0, 1),
        'adstock__banners_pipe__carryover__window': randint(1, 10),
        'adstock__banners_pipe__carryover__strength': uniform(0, 1),
        'adstock__banners_pipe__saturation__exponent': uniform(0,1),
    },
    cv=TimeSeriesSplit(),
    random_state=0,
    n_iter=100
)

This basically tries to find the best hyperparameters in the ranges 0 to 1 for the carryover strengths and saturation exponents and integers between 1 and 10 (weeks) for the carryover lengths. The algorithm tries n_iter=100 different random hyperparameter combinations and evaluates the r² with a 5-fold expanding window time series split, sklearn’s TimeSeriesSplit(). If we feel like it, we could also use the MAPE, as business people enjoy this metric, using the scoring='neg_mean_absolute_percentage_error' keyword.


After training, we can check the best hyperparameters:

print(tuned_model.best_params_)

# Output:
# 'adstock__banners_pipe__carryover__strength': 0.6817399450693612,
# 'adstock__banners_pipe__carryover__window': 1,
# 'adstock__banners_pipe__saturation__exponent': 0.097493384215085,
# 'adstock__radio_pipe__carryover__strength': 0.8518536993666015,
# 'adstock__radio_pipe__carryover__window': 1,
# 'adstock__radio_pipe__saturation__exponent': 0.1598452868541913,
# >>> 'adstock__tv_pipe__carryover__strength': 0.04680635471218875,
# >>> 'adstock__tv_pipe__carryover__window': 4,
# 'adstock__tv_pipe__saturation__exponent': 0.0038603515102610952

So the model thinks that the TV carryover effect lasts four weeks, with about 4.68% of the effect carried into the next week, for example. Statements like this are gold for the business.

This output lets you explain your model to the stakeholders in a simple and comprehensible way.

Still, the model is terrible, so I would not trust the hyperparameters found so far.

Image by the author.

The output has the same problems as before: no trend, and the spike can also not be explained.

In times like this, we should incorporate more features that might explain the sales. This can be the price of the product, time features, google trends, the weather, holidays, and anything you want.

We will now enhance the model by giving it a trend, the one-hot encoded month, as well as the information that something happened in the first week of January 2020.

Incorporating Time Features

I also added some convenience functions to add more features. Just look at that:

from mamimo.time_utils import add_time_features, add_date_indicators

X = (X
     .pipe(add_time_features, month=True)
     .pipe(add_date_indicators, special_date=["2020-01-05"])
     .assign(trend=range(200))
)

This adds

  • a month column (integers between 1 and 12),
  • a binary column named special_date that is 1 on the 5th of January 2020 and 0 everywhere else, and
  • a (so far linear) trend which is only counting up from 0 to 199.
Image by the author.

All of these features get refined in a moment when we build the next model. In addition to the media channels, we will do the following preprocessing of the new features:

  • the month gets one-hot encoded
  • the linear trend may be raised to some power, e.g., trend² for a trend that grows quadratically (business would be happy)
  • the special date can have a carryover effect as well, i.e., we say that not only the week of the 5th of January 2020 is important, but also potentially the weeks after it, same logic as for carryover in media channels
from mamimo.time_utils import PowerTrend
from mamimo.carryover import ExponentialCarryover
from mamimo.saturation import ExponentialSaturation
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

cats =  [list(range(1, 13))] # different months, known beforehand

preprocess = ColumnTransformer(
    [
     ('tv_pipe', Pipeline([
            ('carryover', ExponentialCarryover()),
            ('saturation', ExponentialSaturation())
     ]), ['TV']),
     ('radio_pipe', Pipeline([
            ('carryover', ExponentialCarryover()),
            ('saturation', ExponentialSaturation())
     ]), ['Radio']),
     ('banners_pipe', Pipeline([
            ('carryover', ExponentialCarryover()),
            ('saturation', ExponentialSaturation())
     ]), ['Banners']),
    ('month', OneHotEncoder(sparse=False, categories=cats), ['month']),
    ('trend', PowerTrend(), ['trend']),
    ('special_date', ExponentialCarryover(), ['special_date'])
    ]
)

new_model = Pipeline([
    ('preprocess', preprocess),
    ('regression', LinearRegression(
        positive=True,
        fit_intercept=False) # no intercept because of the months
    )
])

Fitting this still-untuned model shows a much better performance:

Image by the author.

Wow! This is nearly a perfect fit already. Adding the trend and the indicator on the 5th of January 2020 helped a lot already. Just imagine what happens after we tune the hyperparameters.

The Last Tune

Let us tune the new hyperparameters as well. Maybe a linear trend is not the best we can do. Furthermore, all the carryovers are non-existent so far because this is the default behavior of ExponentialCarryover if you don’t provide hyperparameters. We will now create one more hyperparameter tuning job:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

tuned_new_model = RandomizedSearchCV(
  new_model,
  param_distributions={
    'preprocess__tv_pipe__carryover__window': randint(1, 10),
    'preprocess__tv_pipe__carryover__strength': uniform(0, 1),
    'preprocess__tv_pipe__saturation__exponent': uniform(0, 1),
    'preprocess__radio_pipe__carryover__window': randint(1, 10),
    'preprocess__radio_pipe__carryover__strength': uniform(0,1),
    'preprocess__radio_pipe__saturation__exponent': uniform(0, 1),
    'preprocess__banners_pipe__carryover__window': randint(1, 10),
    'preprocess__banners_pipe__carryover__strength': uniform(0, 1),
    'preprocess__banners_pipe__saturation__exponent': uniform(0, 1),
    'preprocess__trend__power': uniform(0, 2),           # new
    'preprocess__special_date__window': randint(1, 10),  # new
    'preprocess__special_date__strength': uniform(0, 1), # new
  },
  cv=TimeSeriesSplit(),
  random_state=0,
  n_iter=1000, # some more iterations, takes more time
)

tuned_new_model.fit(X, y)
Image by the author.

The modeling of the spike got worse, but I would argue that the rest looks much better now. Maybe we could model the spike better again by trying more hyperparameter combinations, but let us assume we are satisfied with the result now. Let us get the hyperparameters:

print(tuned_new_model.best_params_)

# Output:
# 'preprocess__banners_pipe__carryover__strength': 0.98037507922965,
# 'preprocess__banners_pipe__carryover__window': 1,
# 'preprocess__banners_pipe__saturation__exponent': 0.1763329074644,
# 'preprocess__radio_pipe__carryover__strength': 0.9417421432655166,
# 'preprocess__radio_pipe__carryover__window': 1,
# 'preprocess__radio_pipe__saturation__exponent': 0.069184804692642,
# 'preprocess__special_date__strength': 0.8667029791268241,
# 'preprocess__special_date__window': 6,
# > 'preprocess__trend__power': 1.463860555363072,
# 'preprocess__tv_pipe__carryover__strength': 0.3422263312509606,
# 'preprocess__tv_pipe__carryover__window': 1,
# 'preprocess__tv_pipe__saturation__exponent': 0.3591065076533001

Great, so the trend is not linear, but of the form t^1.464 instead, meaning even stronger than linear.

If you are interested in the linear regression coefficients as well, you can get them via

import pandas as pd

best_model = tuned_new_model.best_estimator_
pd.Series(
    best_model.named_steps['regression'].coef_,
    index=best_model[:-1].get_feature_names_out()
)

# Output:
# tv_pipe__TV                    3389.936227
# radio_pipe__Radio              2278.722723
# banners_pipe__Banners          2455.014524
# month__month_1                 2724.333162
# month__month_2                 2991.294009
# month__month_3                 4080.414741
# month__month_4                 4542.696378
# month__month_5                 3484.384654
# month__month_6                 3785.648376
# month__month_7                 2497.006415
# month__month_8                 2068.016338
# month__month_9                 1883.746572
# month__month_10                2179.914547
# month__month_11                2135.526582
# month__month_12                2233.341158
# trend__trend                      9.801207
# special_date__special_date    96822.051131

Computing The Channel Contributions

Now that we have a trained model, we want to know how much every channel contributed to weekly sales. For convenience, I created a breakdown function to do just that.

from mamimo.analysis import breakdown

contributions = breakdown(tuned_new_model.best_estimator_, X, y)
ax = contributions.plot.area(
    figsize=(16, 10),
    linewidth=1,
    title="Predicted Sales and Breakdown",
    ylabel="Sales",
    xlabel="Date",
)

handles, labels = ax.get_legend_handles_labels()
ax.legend(
    handles[::-1],
    labels[::-1],
    title="Channels",
    loc="center left",
    bbox_to_anchor=(1.01, 0.5),
)
Image by the author.

Phew, that’s a mishmash of colors – highly incomprehensible. That is because the grouping is too unnecessarily fine-grained; we have a color for each channel and month, for example.

We can take all the month__ variables, as well as trend__trend, and put them into the baseline, i.e., the sales that we would have had regardless of our media spending, according to the model. We can tell the breakdown function to sum up these fine-grained contributions like this:

group_channels = {
    'Baseline': [f'month__month_{i}' for i in range(1, 13)] + ['Base', 'trend__trend']
} # read: 'Baseline consists of the months, base and trend.'
  # You can add more groups!

contributions = breakdown(
    tuned_new_model.best_estimator_,
    X,
    y,
    group_channels
)

This produces a smaller data frame

Image by the author.

Let’s give this data frame nicer names again

contributions.columns = [
    'TV', 'Radio', 'Banners',
    'Special Date', 'Baseline'
]

and then plot
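The plotting code is the same area plot as before, just applied to the grouped and renamed contributions data frame:

ax = contributions.plot.area(
    figsize=(16, 10),
    linewidth=1,
    title="Predicted Sales and Breakdown",
    ylabel="Sales",
    xlabel="Date",
)

handles, labels = ax.get_legend_handles_labels()
ax.legend(
    handles[::-1],
    labels[::-1],
    title="Channels",
    loc="center left",
    bbox_to_anchor=(1.01, 0.5),
)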

Sweet! The content is relatively easy to understand now:

  • the baseline is increasing over time and seems to be seasonal as well
  • TV, radio, and banners all have some contribution
  • we can see the effect of the special date

Computing Return on Investments

I did not create a convenience function for this (yet?), but you can also calculate the ROIs for each channel like this:

for channel in ['TV', 'Radio', 'Banners']:
    roi = contributions[channel].sum() / X[channel].sum()
    print(f'{channel}: {roi:.2f}')

# Output: 
# TV: 0.33
# Radio: 0.47
# Banners: 1.23

From here, it seems like TV and radio do not perform well because the ROIs are below 1, i.e., each Euro that we spent on TV (radio) in the last 200 weeks only generated 0.33 € (0.47 €) in return on average.

Banners performed well on average; each Euro we spent there turned into 1.23 €.

Note: These statements might change if we consider other time periods, e.g., only the year 2019 or only 2020. For example, for 2019 we could compute it like this:

for channel in ['TV', 'Radio', 'Banners']:
    roi = contributions.loc['2019-01-01':'2019-12-31', channel].sum() / X.loc['2019-01-01':'2019-12-31', channel].sum()
    print(f'{channel}: {roi:.2f}')

# Output:
# TV: 0.36
# Radio: 0.50
# Banners: 1.22

That’s all, folks!

Conclusion

I presented my new library MaMiMo to you to make your marketing mix modeling life easier. It integrates well with scikit-learn and similar libraries, and it lets you do the following things (so far!):

  • define saturations (Exponential, Hill, Adbudg, BoxCox)
  • define carryovers (Exponential, Gaussian)
  • add time features (day, month, week of month, …, trend)
  • change the trend
  • analyze the model by checking the channel contributions

Install it via pip install mamimo !


I hope that you learned something new, interesting, and useful today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!

All About Algorithms

The post A Small Python Library For Marketing Mix Modeling: MaMiMo appeared first on Towards Data Science.

]]>
TARNet and Dragonnet: Causal Inference Between S- And T-Learners https://towardsdatascience.com/tarnet-and-dragonnet-causal-inference-between-s-and-t-learners-0444b8cc65bd/ Tue, 05 Mar 2024 17:32:18 +0000 https://towardsdatascience.com/tarnet-and-dragonnet-causal-inference-between-s-and-t-learners-0444b8cc65bd/ Learn how to build neural networks for direct causal inference

The post TARNet and Dragonnet: Causal Inference Between S- And T-Learners appeared first on Towards Data Science.

]]>
Photo by Geranimo on Unsplash
Photo by Geranimo on Unsplash

Building machine learning models is fairly easy nowadays, but often, making good predictions is not enough. On top, we want to make causal statements about interventions. Knowing with high accuracy that a customer will leave our company is good, but knowing what to do about it – for example sending a coupon – is much better. This is a bit more involved, and I explained the basics in my other article.

Easy Methods for Causal Inference

I recommend reading this article before you continue. I showed you how you can easily come to causal statements whenever your features form a sufficient adjustment set, which I will also assume for the rest of the article.

The estimation works using so-called meta-learners. Among them, there are the S- and the T-learners, each with their own set of disadvantages. In this article, I will show you another approach that can be seen as a tradeoff between these two meta-learners that can give you better results.

Recap: Meta-Learners

Let us assume that you have a dataset (X, t, y), where X denotes some features, t is a distinct binary treatment, and y is the outcome. Let us briefly recap how the S- and T-learners work and when they don’t perform well.

S-learner

If you use an S-learner, you fix a model M and train it on the dataset such that M(X, t) ≈ y. Then, you compute

Treatment Effects = M(X, 1) – M(X, 0)

and that’s it.

Image by the author.

The problem with this approach is that the model could choose to ignore the feature t completely. This typically happens if you already have hundreds of features in X, and t drowns in this noise. If this happens, the treatment effects will all be zero, although there might be a causal relationship between the treatment t and the outcome y.

T-learner

Here, you fix two models M₀ and M₁, and train M₀ such that M₀(X₀) ≈ y₀ and M₁ such that M₁(X₁) ≈ y₁. Here, (X₀, y₀) and (X₁, y₁) are the subsets of the dataset (X, t, y) where t = 0 and t = 1 respectively. Then, you compute

Treatment Effects = M₁(X) – M₀(X).

Image by the author.

Here, a problem arises if one of the two datasets is too small. For example, if there are not many treatments, then (X₁, y₁) will be really small, and the model M₁ might not generalize well.

TARNet

TARNet – short for Treatment-Agnostic Representation Network – is a simple neural network architecture by Shalit et al. in their paper Estimating individual treatment effect: generalization bounds and algorithms. You can describe it shortly as an S-learner forced to use the treatment feature t because it is baked into the architecture. Here is how it works:

Note: In this article, I simplify their model a bit by dropping their metric IPM. Still, it works just fine. Image by the author.

You can see that first, there are some feed-forward layers to preprocess the input. Then, t is explicitly used to decide which sub-network to choose on the right side. Note that we will only use this during training. If we have a trained model, we can use the model to output both versions at the same time – the prediction for t = 1 and t = 0 – and subtract them.

Implementation in TensorFlow

We will now learn how to implement it. If you know TensorFlow, you will see that it is straightforward and does not need much code. I will define a meta-model that does the following:

  1. Take an input and process it with a common_model .
  2. Take the result and process it via both the control_model and treatment_model .
  3. Output the control_model result if t = 0, and the treatment_model output if t = 1.
import tensorflow as tf

class TARNet(tf.keras.Model):
    def __init__(self, common_model, control_model, treatment_model):
        super().__init__()
        self.common_model = common_model
        self.control_model = control_model
        self.treatment_model = treatment_model

    def call(self, inputs):
        x, t = inputs

        # Step 1
        split_point = self.common_model(x)

        # Step 2
        control = self.control_model(split_point)
        treatment = self.treatment_model(split_point)

        # Step 3: for each row, pick the control output if t = 0 and the treatment output if t = 1
        concat = tf.concat([control, treatment], axis=1)
        t = tf.reshape(tf.cast(t, tf.int32), (-1, 1))  # ensure integer type and shape (n, 1)
        indices = tf.concat([tf.reshape(tf.range(tf.shape(t)[0]), (-1, 1)), t], axis=1)
        selected_output = tf.gather_nd(concat, indices)
        return selected_output

You can pass in any base models that you like. For example, if you have your features X and a dedicated treatment vector t, as well as the labels y, you can instantiate and fit it like this:

common_model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(10, activation="relu"),
    ],
)

treatment_model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(1),
    ],
)

control_model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(1),
    ],
)

tar = TARNet(common_model, control_model, treatment_model)
tar.compile(loss="mse")

tar.fit(x=(X, t), y=y, epochs=3)
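Once the model is trained, we can estimate the treatment effects by sending every observation through both heads and subtracting, as described above. A minimal sketch, reusing the sub-models of the fitted TARNet instance and the X from before:

shared = tar.common_model(X)  # common representation of the features
treatment_effects = tar.treatment_model(shared) - tar.control_model(shared)  # uplift per row

print(tf.reduce_mean(treatment_effects))  # average treatment effect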

Now, what’s better than a neural network with two heads? Right, a neural network with three heads!

Created by the author using Leonardo AI.

Dragonnet

Dragonnet is a neat little extension of TARNet developed by Shi et al. in the paper Adapting Neural Networks for the Estimation of Treatment Effects in 2019. It is based on the following intuition:

You don’t need X itself, but only the information of X that is relevant for predicting whether a treatment was given or not.

I know that this is very abstract, but since we did not cover the theory about causality to make this statement precise, I am not going to slap random formulas in your face. You can find them in the paper if you are interested.

The authors depict their method like this:

From their paper.

Here, g(z) = P(t = 1 | Z = z), i.e., the probability that a treatment was given, also known as the propensity score. The Q's are the predictions for the cases t = 1 and t = 0 respectively, just like in TARNet.

If you want to see how to implement Dragonnet, you can find the authors’ version here.


This small change of adding a third head that computes the probability of getting treated regularizes the network in a way that it can estimate uplifts more robustly than before. The network is forced to distill a good representation Z of the original data X such that the middle head can make good predictions for the propensity score.
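To make this more concrete, here is a minimal Dragonnet-style sketch in the spirit of the TARNet class from above. This is not the authors' implementation (that one is linked above), and I ignore their loss weighting and targeted regularization – the only point is to show the third head whose cross-entropy loss gets added to the outcome loss:

class Dragonnet(tf.keras.Model):
    def __init__(self, common_model, control_model, treatment_model, propensity_model):
        super().__init__()
        self.common_model = common_model
        self.control_model = control_model
        self.treatment_model = treatment_model
        self.propensity_model = propensity_model

    def call(self, inputs):
        x, t = inputs
        z = self.common_model(x)  # shared representation

        control = self.control_model(z)
        treatment = self.treatment_model(z)
        propensity = self.propensity_model(z)  # estimated probability of being treated

        # pick the head that matches the observed treatment, exactly as in TARNet
        concat = tf.concat([control, treatment], axis=1)
        t_int = tf.reshape(tf.cast(t, tf.int32), (-1, 1))
        indices = tf.concat([tf.reshape(tf.range(tf.shape(t_int)[0]), (-1, 1)), t_int], axis=1)
        outcome = tf.gather_nd(concat, indices)

        # the propensity head gets its own loss; the outcome loss comes from compile(loss="mse")
        t_float = tf.reshape(tf.cast(t, tf.float32), (-1,))
        p = tf.reshape(propensity, (-1,))
        self.add_loss(tf.reduce_mean(tf.keras.losses.binary_crossentropy(t_float, p)))

        return outcome

propensity_model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ],
)

You can compile and fit it exactly like the TARNet above, ideally with freshly initialized sub-models.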

They quantify this by doing experiments with synthetic data and summarize it as follows:

From their paper.

Conclusion

We have seen that the S- and T-learners can be viewed as two extremes for computing causal impacts. Both have their drawbacks and sometimes it can be beneficial to have something in the middle.

TARNet and Dragonnet fill in this gap – they function similarly to an S-learner, but with a mechanism that forces them to use the treatment variable explicitly. This might lead to improved performance, but, as so often, there is no free lunch in causal inference either. There might be cases where the S- or T-learner performs best, but it is still important to know alternative approaches for uplift modeling.


I hope that you learned something new, interesting, and valuable today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!

All About Algorithms

The post TARNet and Dragonnet: Causal Inference Between S- And T-Learners appeared first on Towards Data Science.

]]>
Easy Methods for Causal Inference https://towardsdatascience.com/easy-methods-for-causal-inference-bb5d3da4f8ca/ Tue, 20 Feb 2024 17:32:22 +0000 https://towardsdatascience.com/easy-methods-for-causal-inference-bb5d3da4f8ca/ Use your favorite models in combination with meta-learners to make valid causal statements

The post Easy Methods for Causal Inference appeared first on Towards Data Science.

]]>
Photo by Mika Baumeister on Unsplash

Imagine that you have built an awesome machine learning model that can forecast your target value with great accuracy. In some cases, your job might be over at this point. However, often the business does not only want to know what will happen but how to influence the outcome as well. True to the motto:

Knowing the future is silver, being able to change it is golden.

This simple truth goes without saying, you know it from your personal life. Knowing the lottery numbers for the next week is good, but only if you can adjust your numbers accordingly.


As a business example, take the problem of customer churn, i.e., customers who stop doing business with you. Knowing that a customer wants to leave you is good, but the real question is: how to prevent this customer from churning?

The business wants some way of intervention for example by giving out a coupon or granting this customer some kind of membership upgrade. Something that the business can influence to decrease the probability of churn.

If x = "give the customer a coupon" and y = churn probability, we want to be able to make causal statements: If I do x, what happens with y? People also like to call it what-if scenarios. x is called a treatment variable.

Image by the author.

This is more difficult than making correlational statements. Observing that ice cream sales correlate with shark attacks is easy. But does one cause the other? Probably not. It is rather the good weather that drives people to buy ice cream and take a swim in the sea, exposing them to sharks. So, closing all ice cream parlors as a way to decrease shark attacks will most likely not work.


In this article, I will show you how to reach the correct conclusions. The methods I am going to show you are extremely simple to carry out on an algorithmic level: it just involves training models and post-processing their predictions slightly. However, in Causal Inference, you have to be careful which features to include during training. More is not always better. That’s why I will first show you a simple tool to check which features you should include and only then the actual methods.

Be Careful With These Features

Let us take another example from one of my other articles. You can read it here, but I will also shortly recap it as well:

The Need for Causal Inference

Assume that we have a dataset about employees in a company. We know their Sense of duty, the Overtime they work, the Income they have, and their Happiness.

Image by the author.

You want to answer the following question:

How does overtime influence income?

Here, Overtime is our treatment, and Income is the outcome. If you have never heard about causal inference before, the solution seems so obvious:

  1. train a model with all features Sense of duty, Overtime, and Happiness to predict Income, and then
  2. plug in many numbers for Overtime to see how Income changes.

If the model has a good predictive power, this should work out, right?

Unfortunately, not.

This is not universally true, as I have demonstrated in my other article from above. The reason is that the typical Machine Learning models only learn correlations, not causations. If you are not careful, the model will learn to lower ice cream sales in order to reduce the number of shark attacks.

However, by carefully selecting your features, the above method still works! Your feature set has to be what is called a sufficient adjustment set. Let us see what that means.

Sufficient adjustment sets

The concept of sufficient adjustment sets is pretty theoretical, and I will not go into detail in this article since I could fill a whole article series about it. I just want to point you to https://www.dagitty.net/dags.html where you can check yourself if your features form a sufficient adjustment set.

First of all, you have to specify a causal graph, which is by far the hardest part of doing causal inference. This is a graph that tells you which feature potentially influences which other features. For the data above, let us assume the following causal structure:

Image by the author. From https://www.dagitty.net/.

This graph encodes the following assumptions:

  1. The sense of duty influences how much overtime somebody works, as well as the income. (If I have a high sense of duty, I might work longer, and also harder or more diligently, which might lead to higher income.)
  2. Overtime influences happiness and income. (Too much overtime makes people unhappy, but results in more money.)
  3. Income influences happiness as well. (More money, more happiness!)

There are also more hidden assumptions that you can recognize by the absence of arrows. For example, the graph encodes that income does not influence overtime, or happiness does not influence the sense of duty. We also assume that no other factors influence income or any of the other variables, which is a quite strong assumption.

You can debate whether these assumptions make sense or not. They are often not verifiable, which makes causal reasoning so damn hard. But for our purposes, let us go with this graph since it is sensible enough.

Ok, so we have decided on a causal structure of our dataset. Again, this was the hard part, and you have to be careful here. Finding a sufficient adjustment set now – the set that makes our naive method of training a model and plugging in different values for the treatment variable work – is a purely graph-theoretic task. Dagitty can help us with that for now.

Dagitty

You can easily click together the graph from the image above. Then, in the top left corner, you can flag

  • Overtime as Exposure (treatment), i.e., the thing you want to play around with to see how it changes the outcome, and
  • Income as Outcome.

The website will tell you that Sense of duty is a sufficient adjustment set in the top right corner.

Image by the author.

You can now click Sense of duty and flag it as Adjusted, and the same box will tell you that you adjusted correctly. If you set Happiness to Adjusted as well, it will say that you adjusted incorrectly.

What this means for you

You should only train a model with the features

  • Overtime and
  • Sense of duty.

You should not include Happiness, and you should not omit Sense of duty, if you want to make causal statements about how Overtime influences Income.

Again, more features are not always better when doing causal inference.

For the remainder of this article, let us make the following simple assumption: All features in our dataset form a sufficient adjustment set.

Meta-Learners

Now, the last part got longer than anticipated, sorry for that. However, it is necessary since otherwise, you might use supposedly simple methods to come to the wrong conclusions. But with our sufficient adjustment set assumption – that you should always think about before you do anything causal – everything will work out. Let us now get real, and see how to find causal effects.

Meta-learners are the easiest method to compute causal effects for data scientists, quote me on that. Basically, you can take any machine learning model that you know and love, and plug it into another algorithm – the meta-learner – that uses this model to output the causal effects.

The meta algorithms that I am about to show you are simple to implement yourself, but you can also use libraries such as EconML or CausalML to achieve the same thing with fewer lines of code. But nothing is better than knowing what is going on under the hood, right?

Binary treatment

So far, I did not specify what kind of variable the treatment (Overtime, Coupon, …) is. In the following, I will assume that the treatment is binary since most meta-learners only work with those.

Note that this is often no problem since you can always discretize continuous features like overtime as "works less/more than 3 hours overtime a week". A treatment such as "customer got coupon" comes as binary already.
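With polars (which we will use below) and a hypothetical continuous Overtime column, such a discretization could look like this – a small sketch, where the column names and the 3-hour threshold are made up for illustration:

import polars as pl

df = pl.DataFrame({"Overtime": [1.5, 4.0, 2.0, 5.5]})  # hypothetical data
df = df.with_columns(
    (pl.col("Overtime") > 3).cast(pl.Int8).alias("t")  # 1 = more than 3 hours of overtime per week
)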

In this case, we can see what the uplift of our intervention/treatment is: what happens if we do a certain action vs. what happens if we don’t.

S-learner

Remember the naive method I proposed, i.e., just training a model and plugging in different values for your treatment variable? This is what is called an S-learner, where S stands for single. In the case of a binary treatment T, we plug in 1 and 0 for T and subtract the results. That’s it already.

First, train:

Image by the author.

T is just one of our features, but the special one that we want to toggle to see how it influences the outcome. After training, we use the model like this:

Image by the author.

Then, for each row, we get an estimate for the uplift, i.e., the result of setting the treatment from 0 to 1.

T-learner

This one is simple as well. When using the T-learner – the T stands for two – you train two models. One on the part of the dataset where T = 1, and one on the dataset where T = 0.

Image by the author.

Then, use it in the natural way:

Image by the author.

If you have these treatment effects, you can compute any other statistic you want, such as:

  • the average treatment effect over all observations, or
  • conditional treatment effects in subgroups of the observations.

Implementation

Let us implement both learners, so you see how easy it is. First, let us load the data.

!pip install polars
import polars as pl

data = pl.read_csv("https://raw.githubusercontent.com/Garve/datasets/main/causal_data.csv")

The data looks like this:

Image by the author.

S-learner

Now, train a single model including the treatment variable.

from sklearn.ensemble import HistGradientBoostingRegressor

model = HistGradientBoostingRegressor()
X = data.select(pl.all().exclude("y"))
y = data.select("y").to_numpy().ravel()

model.fit(X.to_numpy(), y)

Now, plug in the train data – or any other dataset with the same format – once replacing t with all ones and once with all zeroes.

X_all_treatment = X.with_columns(t=1).to_numpy() # set treatment to one
X_all_control = X.with_columns(t=0).to_numpy() # set treatment to zero

treatment_effects = model.predict(X_all_treatment) - model.predict(X_all_control)

print(treatment_effects)

# Output:
# [ 0.02497236  3.95121282  4.15999904 ...  3.89655012  0.04411704
#  -0.06875453]

You can see that for some rows, the treatment increased the output by 4, but for some, it did not. With a simple

print(treatment_effects.mean())

you will find out that the average treatment effect is around 2. So, if you treated every individual in your dataset, your outcome would increase by around 2 on average compared to treating nobody.
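If you are interested in conditional treatment effects, you can also group the estimated effects by some feature. A small sketch, reusing the data frame and the treatment_effects from above – splitting at x1 > 0 is just an arbitrary example, not something the data dictates:

cate = (
    data
    .with_columns(pl.Series("treatment_effect", treatment_effects))
    .group_by((pl.col("x1") > 0).alias("x1_positive"))  # newer polars versions; older ones call this .groupby
    .agg(pl.col("treatment_effect").mean())
)
print(cate)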

T-learner

Here, we will train two models.

model_0 = HistGradientBoostingRegressor()
model_1 = HistGradientBoostingRegressor()

data_0 = data.filter(pl.col("t") == 0)
data_1 = data.filter(pl.col("t") == 1)

X_0 = data_0.select(["x1", "x2", "x3"]).to_numpy()
y_0 = data_0.select("y").to_numpy().ravel()
X_1 = data_1.select(["x1", "x2", "x3"]).to_numpy()
y_1 = data_1.select("y").to_numpy().ravel()

model_0.fit(X_0, y_0)
model_1.fit(X_1, y_1)

data_without_treatment = data.select(["x1", "x2", "x3"]).to_numpy()
treatment_effects = model_1.predict(data_without_treatment) - model_0.predict(data_without_treatment)

The results are similar to the S-learner’s output.

Conclusion

In this article, we have learned that estimating causal effects is not as trivial as one might think. The methodology of the S-learner is very natural and easy to implement, but it only yields valid causal insights if the training features form a sufficient adjustment set. The same is true for the T-learner, another option for coming to causal conclusions.

Both methods have their strengths and weaknesses. When doing S-learning, the model could choose to ignore the treatment feature, for example. In this case, the predicted treatment effects would all be zero, even if the truth is different. The T-learner has the problem that one of the two datasets could be really small. If there are only 10 observations with a treatment value of 1, you probably cannot trust this model too much.

There are other methods, such as the X-learner by Künzel et al., to address these problems, and we might cover them in a future post.


I hope that you learned something new, interesting, and valuable today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!

All About Algorithms

The post Easy Methods for Causal Inference appeared first on Towards Data Science.

]]>
Find Hidden Laws Within Your Data with Symbolic Regression https://towardsdatascience.com/find-hidden-laws-within-your-data-with-symbolic-regression-ebe55c1a4922/ Fri, 09 Feb 2024 17:32:23 +0000 https://towardsdatascience.com/find-hidden-laws-within-your-data-with-symbolic-regression-ebe55c1a4922/ Automatically discover fundamental formulas like Kepler and Newton

The post Find Hidden Laws Within Your Data with Symbolic Regression appeared first on Towards Data Science.

]]>
Photo by NASA on Unsplash

As machine learning practitioners, we usually have a dataset (X, y), and we want to find a function M – also known as a model – such that M(X) ≈ y. Typically, we do not care about the functional form of M. As far as we are concerned, our model can be a neural network, a tree-based algorithm, or something completely different – as long as the performance on the test set is good, we are happy.

However, if we use complex models like these, we might miss out on interesting patterns, maybe even fundamental physics or economic laws within our data. In order to do better, I will show you how to build models using symbolic regression. These models have the property that they consist of only a few terms and can be (re-)implemented easily wherever you want. Let us see what I mean by that.

A Physics Experiment

Let us assume that we are an experimental physicist and want to find out how long it takes for an object to reach the ground when we drop it from some height h. For example, if you drop an object (that is heavy enough not to be influenced by air resistance) from a height of h = 1.5 m, it will take about t = 0.55 s until it reaches the ground. Try it out!

However, this is only true for Earth, or other celestial bodies with a gravitational acceleration of g = 9.8067 m/s². The moon, for example, has a gravitational acceleration of 1.625 m/s², and dropping the same object there from 1.5m would take a longer time of about 1.36s, which should align with what you know from movies.

Now, our task is to find a general formula t(h, g) that tells us how long the object needs to reach the ground. This is nothing more than building a model that takes the values height h and gravitational acceleration g and predicts time t. Dear physicists, please bear with me. 😃

Collect data and train a model

Let us assume that we were flying around with our spaceship and dropped a few things from different heights on different planets, always measuring how long it takes to reach the ground. Here are the first rows of the measurements:

Image by the author.

We can now split it into train and test, and then train a standard machine learning model.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = pd.read_csv("https://raw.githubusercontent.com/Garve/towards_data_science/main/Symbolic%20Regression/free_fall.csv")
X = data[["h", "g"]]
y = data["t"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

rf = RandomForestRegressor(random_state=123).fit(X_train, y_train)
print(rf.score(X_test, y_test))

# Output:
# 0.9874788707086176

Cool, so we have built a model that learned the underlying pattern very well. But… what is that pattern?

Model interpretation

It is hard to tell since the actual formula that lives within the model is quite complicated. We can try to see what it has learned using Shapley values. It is not really necessary to know what they are for this article, but if you are still interested, I recommend checking out my other article:

Shapley Values Clearly Explained

Basically, they let you determine how the different features drive the model output. Using the awesome library shap, you can do the following:

# !pip install shap
import shap

te = shap.TreeExplainer(rf, feature_names=["h", "g"])
shapley_values = te(X)
shap.plots.scatter(shap_values=shapley_values)
Image by the author.

On the left side, you can see that higher values of h result in higher model outputs, i.e., higher values of t. On the right side, you can see that lower values of g produce higher values of t according to the model. Both observations make sense: if we drop an object from a higher place, it will take longer until it reaches the ground. And if the object gets pulled towards the ground with a stronger force, it will reach the ground faster.

Usually, we are happy with a result like this and can deploy the model. Still, we can argue that this approach has a few issues:

  • the model is quite complicated, there is zero theoretical insight,
  • and because of this, we have to deploy it as a scikit-learn model, i.e., we cannot just reimplement it easily when our deployment service does not like scikit-learn models.

Let us see how we can build another model that fixes both issues.

Symbolic Regression

We can utilize a technique called symbolic regression that tries to find simple algebraic expressions that describe the data. For example, the output of a model trained on a dataset with the features x₁, x₂, and x₃ with the target y could be y = √x₁ + sin(x₂/(x₃+1)) or y = exp(x₁²) – x₂. Let us see how it fares on our "free fall" dataset.

There are many packages for symbolic regression, for example, gplearn and PySR. I will use gplearn here since it is slightly easier to use, but it has not been updated for 2 years now, unfortunately. PySR is in active development but uses Julia under the hood, so you have another dependency.

Fitting symbolic regression

First, let us install gplearn via pip install gplearn . Then, we can define a SymbolicRegressor and specify a set of operations – such as addition, or taking a logarithm – that it can use within the formula.

from gplearn.genetic import SymbolicRegressor

sr = SymbolicRegressor(
    function_set=(
        "add",
        "sub",
        "mul",
        "div",
        "sqrt",
        "log",
        "abs",
        "neg",
    ),
    random_state=0,
)

sr.fit(X_train, y_train)

The model now tries to find a formula that

  • is simple, but
  • also describes the data well, i.e., with a small loss.
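After the fit, we can print the expression that was found and check the performance on the test set. A small sketch – gplearn stores the best program in the _program attribute, and score follows the scikit-learn regressor API, so it returns the r²:

print(sr._program)               # the best formula found
print(sr.score(X_test, y_test))  # r² on the test set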

In my case, this is the result:

Image by the author.

which is the ugly version of

Image by the author.

since x₀ = h and x₁ = g. Apart from having a test r² of over 0.999, this is actually the correct physics formula t = √(2h/g) that describes falling objects without air resistance! You might also find the equivalent formula h = 0.5 · g · t² in the literature.

We have rediscovered what smart physicists found out hundreds of years ago, but without any physics knowledge!

And don’t only take it from me. The authors of PySR published a paper in which they show how many more physical laws could be rediscovered. On page 16, you can see this table:

From https://arxiv.org/pdf/2305.01582.pdf.

In the columns, you can see a bunch of symbolic regression libraries. On the left side, there are physical laws. The fractions in the cells are the number of correct expressions rediscovered, divided by the number of total trials. Here are the formulas:

From https://arxiv.org/pdf/2305.01582.pdf.

Planck’s law and the Rydberg formula were too hard to discover for any library. Still, (only) PySR was doing a good job in the other cases. gnlearn was not part of the comparison.

How does this sorcery work?

So, the functional form of our model can be completely crazy. This is fundamentally different from what happens in neural networks, where the formula itself is fixed, and only the parameters are adjusted. Or from tree-based algorithms such as gradient boosting, where the result is always a piecewise constant function.

Long story short, these libraries use Evolutionary Algorithms under the hood, which would take too long to cover in this article. These are heavily randomized and heuristic processes that are extremely hard to analyze. Still, they can work quite well in practice, as we have seen above. If you want a beginner-friendly article about evolutionary algorithms, check out my other one:

An extensible Evolutionary Algorithm Example in Python

If you don’t feel like reading it now, here is the gist of it for our current use case:

  1. Start with a bunch of simple random formulas, e.g. t = h, t = g², t = g + h, t = √h, …
  2. Check how good they are using a loss function, e.g. mean squared error.
  3. Pick some of the best formulas and combine them, e.g. combining t = h and t = g² could result in t = h + g². Or combining t = √h and t = h · g could result in t = √(h · g).
  4. You can also mutate them, e.g. slightly changing t = √(h·g) to t = √(h + g).
  5. Now, you have a new set of formulas and can go back to step 2. Or, if you can spot a very good formula already, stop.

How these steps are implemented depends on the library you use. You have a lot of freedom to design these steps, making writing these algorithms really fun. The PySR authors describe their specific version of evolutionary algorithms in their paper as well.

Conclusion

In this article, we have seen how easy it is to build models with meaningful formulas. In our example, not only was the performance on the test set better, but the resulting model is also much simpler.

Since our model is just a simple formula, using this model anywhere is trivial.

For example, you can go to SQL and write this:

SELECT
  h,
  g,
  SQRT(2 * h / g) AS t_pred
FROM
  your_table

and here we go! You can also reimplement this formula in any programming language of your choice easily.
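Or, as a tiny Python function – just to drive the point home, with the numbers from the experiment above:

import numpy as np

def predict_fall_time(h, g):
    """Time for an object dropped from height h to reach the ground at gravitational acceleration g."""
    return np.sqrt(2 * h / g)

print(predict_fall_time(1.5, 9.8067))  # roughly 0.55 s, as on Earth
print(predict_fall_time(1.5, 1.625))   # roughly 1.36 s, as on the moon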

Sure, the formula found is not always short, nor does it always describe the data well. In that case, you can add more operations for the algorithm to choose from, like the sine or the exponential function, or you can even create your own small building blocks. This can help produce a crisp formula again.


I hope that you learned something new, interesting, and valuable today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!

All About Algorithms

The post Find Hidden Laws Within Your Data with Symbolic Regression appeared first on Towards Data Science.

]]>
Shapley Values Clearly Explained https://towardsdatascience.com/shapley-values-clearly-explained-a7f7ef22b104/ Sun, 04 Feb 2024 17:32:24 +0000 https://towardsdatascience.com/shapley-values-clearly-explained-a7f7ef22b104/ Distribute the teamwork result fairly among the members

The post Shapley Values Clearly Explained appeared first on Towards Data Science.

]]>
Photo by Vadim Sherbakov on Unsplash

When was the last time you teamed up with some buddies and achieved something great together? Whether it was winning a game, mastering a project at work, or finishing in the top 3 in a Kaggle competition. And if you can’t remember anything (poor you), what about a nice evening with your friends? Picture this: a fantastic night out, followed by sharing a taxi ride home, only to be greeted with a sizable taxi bill. It’s in such moments that you may find yourself wondering:

How can we distribute our team result fairly across every member?

A team can also be the features of a Machine Learning model, the team’s outcome the prediction that the model makes. In this case, it is interesting to know how much each feature contributed to the final prediction. This is what data scientists are typically interested in.

Fair or even?

As an example, imagine that you teamed up for a competition with two more friends. You performed well and received 12,000 € as a team. How do you distribute the money fairly? Distributing it evenly is no problem, everybody gets 4,000€. But we all know that often some people contribute more to the team’s success than others. So maybe a split like 5,000€ + 4,000€ + 3,000€ would be more appropriate, depending on the circumstances.

Or take that taxi bill. You and two other friends went home after hanging out, and that taxi ride home cost you 60€ in total. Is it fair for everyone to pay 20€? Or isn’t it fairer if a person who lives really far away pays more than the ones who live closer, maybe even close together? And if this person has to pay more, how much more should it be?


In this article, I will show you how to answer questions like this using Shapley Values, named after the Nobel laureate Lloyd Shapley. They provide a mathematical framework for assigning a fair share of a team’s success to each contributing member.

Three gentlemen in a taxi

As a running example, let us use the taxi scenario I mentioned before. Here comes a slightly more fleshed-out story:

In the charming town of Luton, three British gentlemen from Oxford, Cambridge, and London, dressed in dapper suits, spent a jolly evening exploring its cobblestone streets. Starting at a local pub, they clinked glasses in celebration, moved on to the bustling marketplace for street food delights, and ended their night in a refined tearoom, sipping tea to the soothing tunes of a live pianist. With hearts warmed by laughter and shared moments, the trio bid farewell to Luton, leaving behind the echoes of a truly delightful evening.

What a lovely story, innit? In the end, the friends shared a single taxi – to spend some more time together – to get home.

https://www.google.com/maps/@51.7811482,-0.4596552,8.75z?entry=ttu, image by the author.

As you can see, the taxi has to drive quite a route to go from Luton all the way around to reach all three cities. The taxi price was 3£/mile, and it went 161 miles, resulting in a price of 483£. Phew. But now comes the question of questions:

How much should everyone pay?

Let us start with a few observations.

  1. London is closest to Luton, 34 miles.
  2. If the London friend had gone home on his own, the other two friends would still have needed to travel 127 miles to get home, which is still quite a lot.

The same is not true for the other cities. Intuitively, this means that the friend going to London did not increase the mileage too much and should pay less compared to the other two. We will now learn how to compute the exact and fair numbers according to Shapley.

Computing Shapley Values

All we need to know to compute Shapley values are the prices for each combination – also known as a coalition – of friends taking the taxi. Let us call the friends C(ambridge), L(ondon), and O(xford) from now on. C, L, and O are also called players in general.

Coalition values

We have established that a coalition of C, L, and O would have cost 483£, this is what we have observed. We also write this as v({C, L, O}) = 483, meaning that the value of the coalition {C, L, O} is 483.

However, we did not observe any other coalition value. This is usually the case when you want to compute Shapley values. You have the value of the full coalition, but you have to come up with the other values as well. For example, what would the price have been if only C and O had taken the taxi?

In our case, we can reconstruct these values using Google Maps. Visiting Cambridge and Oxford would have been a 127-mile ride that cost 127 miles · 3£/mile = 381£, so v({C, O}) = 381. As another example, take the coalition {L}, i.e., only going to London. That would have been a 34-mile ride from Luton, resulting in a coalition value (price) of v({L}) = 34 · 3 = 102.

You can do the same for all other coalitions and get the following list:

Image by the author.

Contribution values

Now that we have this list, the idea of Shapley values revolves around the concept of contributions. As an example, let us find out how much C contributes. But what does contributing even mean?

Well, if the taxi is empty – with a cost of 0 – and C gets in, the price goes from 0 to 114 according to the table above. The contribution of C in this case is v({C}) – v({}) = 114 – 0 = 114.

As another example, if there is only O in the taxi — with a cost of 150 — and C joins him, the cost increases to 381. Hence the contribution in this case is v({C, O}) – v({O}) = 381 – 150 = 231.

The fair share that everyone has to pay should intuitively be connected to how much their presence in this coalition (taxi) contributes to the final value (price).

However, we have the following dilemma:

Contributions are heavily tied to the concept of time.

You can see this reflected in how we came up with the contributions for C.

  • First, the taxi was empty, then C joined, driving up the price by 114 = v({C}) – v({}).
  • First, only O was in the taxi, then C joined, driving up the price by 231 = v({C, O}) – v({O}).
  • First, O and L were in the taxi, then C joined, driving up the price by 195 = v({C, L, O}) – v({L, O}).

But you only know that everyone was in that taxi; there is no inherent temporal order. So, did C drive up the price by 114, 231, or 195? And here is the magic sauce to solve this dilemma:

Just average the contribution values over all potential orderings of the coalition in time.

Sounds abstract, but let us compute how it works for C.

Average over permutations

If we just know that C, L, and O were in the taxi, let us consider all orderings in which they could have entered it. Let us make another list!

Image by the author.

In the table above, you can see a lot of numbers that we have seen before. If C comes first, his incremental value is 114. The order of L and O afterward does not matter, it is 114 in both cases. Similarly, if C gets in last, his incremental value is 195. If C comes after O, it is 231. The only new number here is at L, C, O, i.e., the incremental value of C when only L was in the taxi before.

And now? Take the average! The result is the Shapley value of C:

Image by the author.

So, the fair price according to the Shapley values would be 172£ for C. You can do the math for L and O as well and get

The fair share for everyone. Image by the author.

Note that the sum is equal to the total price: 483 = 172 + 119.5 + 191.5. This is not a coincidence, but a provable result called Efficiency. Shapley values would be useless without this property.

Python code

I have written some code to replicate the steps we have taken. This code is definitely not the fastest way to calculate Shapley values, but you should be able to follow along.

from itertools import permutations

# from the article, change it as you wish
coalition_values = {
    frozenset(): 0,
    frozenset("C"): 114,
    frozenset("L"): 102,
    frozenset("O"): 150,
    frozenset(("C", "L")): 285,
    frozenset(("C", "O")): 381,
    frozenset(("L", "O")): 288,
    frozenset(("C", "L", "O")): 483,
}

def compute_shapley_values(coalition_values, player):
    players = max(coalition_values, key=lambda x: len(x))
    contributions = []

    for permutation in permutations(players):
        player_index = permutation.index(player)
        coalition_before = frozenset(permutation[:player_index])  # excluding player
        coalition_after = frozenset(permutation[: player_index + 1])  # player joined coalition
        contributions.append(coalition_values[coalition_after] - coalition_values[coalition_before])

    return sum(contributions) / len(contributions)  # average, results in Shapley values

for player in ("C", "L", "O"):
    print(player, compute_shapley_values(coalition_values, player))

# Output:
# C 172.0
# L 119.5
# O 191.5

If n is the number of players, we loop over n! player permutations. Even for small values of n, such as 10, this is a huge number already: 3,628,800 iterations. This makes this implementation rather useless. Also, I store all values in lists before taking the average, so memory-wise, the above code only works for small instances as well.

The General Formula

Now that we know the single steps, let me give you the classical Shapley value formula:

Image by the author.
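Written out in plain text, the formula reads (this is the standard form; it matches the annotations and the implementation below):

ϕᵢ(v) = Σ_{S ⊆ N ∖ {i}} [ |S|! · (n − |S| − 1)! / n! ] · ( v(S ∪ {i}) − v(S) )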

Phew, that’s quite a mouthful, even with the knowledge we’ve gained over the course of the article. Let me annotate a few things here.

Image by the author.

So, let us get started. N is the set of all players, in our example N = {C, L, O}, and n = |N| is the number of players. S is a coalition, such as {C, O}. You can see that we have coalition values of coalitions S with and without player i. Taking the difference gives us the contribution values. x! is the factorial function, i.e., x! = x · (x – 1) · (x – 2) · … · 2 · 1.

Multiplicity

Now, the only mystery is this multiplicity factor. To understand it, let us go back to our contribution table of C.

Image by the author.

You can see how some numbers appear multiple times, namely 114 and 195. The numbers are the same since the order of all players before C and after C does not matter for the contribution values. For example, the orderings C, L, O and C, O, L have the same contribution value since both use the formula v({C}) – v({}). C comes first, the order after it does not matter. The same with L, O, C and O, L, C. C comes last, the order of everything before does not matter.

If there were two more players X and Y, the following orderings would also have the same contribution values:

  • X, O, C, Y, L
  • X, O, C, L, Y
  • O, X, C, Y, L
  • O, X, C, L, Y

C is in the third place, the order of what happens before and the order of what happens after does not matter. In any case, the contribution value is v({C, O, X}) – v({O, X}).

The multiplicity counts the number of permutations with the same contribution value.

The combinatorial argument goes as follows: There is an ordering of n players, and I want to compute the Shapley value of the player in position |S| + 1. Before this player, there are |S| other players, and there are |S|! ways to rearrange those. After this player, there are still n – |S| – 1 other players, which we can rearrange in (n – |S| – 1)! ways. We can freely combine both permutations, leading to |S|! · (n – |S| – 1)! orderings with the same contribution value, which is exactly the multiplicity factor in the Shapley formula.

As a numerical example, take the five-player example again with an ordering like X, O, C, Y, L. We care about C. There are two ways to permute X and O, as well as two ways to permute Y and L, leading to a multiplicity of 2! · 2! = 4, which is why this list had a size of four.
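As a quick sanity check that the multiplicities add up: summing |S|! · (n − |S| − 1)! over all subsets S of the remaining n − 1 players must give n!, since every ordering is counted exactly once. A few lines of Python confirm this for n = 5:

from itertools import combinations
from math import factorial

n = 5  # the five-player example from above
total = sum(
    factorial(len(S)) * factorial(n - len(S) - 1)
    for k in range(n)
    for S in combinations(range(n - 1), k)  # subsets of the other n - 1 players
)
print(total, factorial(n))  # 120 120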

Faster implementation

Here is a much better implementation. I used the variable names from the formula to make it easier to understand.

from itertools import combinations
from math import factorial

coalition_values = {
    frozenset(): 0,
    frozenset("C"): 114,
    frozenset("L"): 102,
    frozenset("O"): 150,
    frozenset(("C", "L")): 285,
    frozenset(("C", "O")): 381,
    frozenset(("L", "O")): 288,
    frozenset(("C", "L", "O")): 483,
}

def powerset(set_):
    """Create all subsets of a given set."""
    for k in range(len(set_) + 1):
        for subset in combinations(set_, k):
            yield set(subset)

def compute_shapley_values(coalition_values, i):
    N = max(coalition_values, key=lambda x: len(x))
    n = len(N)
    contribution = 0

    for S in powerset(N - {i}):
        multiplicity = factorial(len(S)) * factorial(n - len(S) - 1)
        coalition_before = frozenset(S)
        coalition_after = frozenset(S | {i})
        contribution += multiplicity * (coalition_values[coalition_after] - coalition_values[coalition_before])

    return contribution / factorial(n)

for player in ("C", "L", "O"):
    print(player, compute_shapley_values(coalition_values, player))

# Output:
# C 172.0
# L 119.5
# O 191.5

Since we enumerate over all subsets of N ∖ {i}, this implementation loops over 2ⁿ⁻¹ sets. This is still exponential, but not as horrible as the n! iterations of the first implementation. For n = 10, there are only 512 iterations instead of over 3 million.
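To see how much of a difference this makes, we can compare both counts for a few values of n:

from math import factorial

for n in (3, 5, 10, 15):
    print(f"n={n}: permutations={factorial(n):,}, subsets={2**(n - 1):,}")

# Output:
# n=3: permutations=6, subsets=4
# n=5: permutations=120, subsets=16
# n=10: permutations=3,628,800, subsets=512
# n=15: permutations=1,307,674,368,000, subsets=16,384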

Properties of Shapley Values

Let us end this article with some important properties of Shapley values.

  1. Efficiency: The sum of all players’ Shapley values equals the full coalition value. (Proof: Click here.)
  2. Symmetry: If two players i and j have the same contribution values, i.e., v(S ∪ {i}) = v(S ∪ {j}) for all coalitions S not containing the players i and j, their Shapley values are also equal. (Proof: Both sums have the same summands, hence, the whole sum is equal.)
  3. Linearity: If there are two games with coalition value functions v and w, then the Shapley value for a player i in the combined game v + w is equal to the sum of the Shapley values in the single games, i.e., ϕᵢ(v+w) = ϕᵢ(v) + ϕᵢ(w). (Proof: Take the Shapley formula from above and plug in v+w instead of v. Then you can separate the sum into two sums, which correspond to ϕᵢ(v) and ϕᵢ(w).)
  4. Null Player: If a player i never contributes anything, their Shapley value is zero, i.e., if v(S ∪ {i}) = v(S) for each S not containing i, then ϕᵢ(v) = 0. (Proof: All contribution values v(S ∪ {i}) – v(S) are zero, so the whole sum is zero.)

Another interesting fact is that the Shapley values are the only values that satisfy the above four properties.

Conclusion

The problem of dividing a joint team effort fairly among the team members is a problem everyone knows from real life. We often fall back on simple solutions, such as dividing the pot of gold into equal parts. After all, it’s easy to do and it sounds fair on paper.

However, now you have a tool to compute actual fair shares for each member. At least if you can quantify all coalition values, which is the main obstacle to using Shapley values in practice, along with the exponential complexity of the calculation.

In a later post, we will look in more detail at the use of Shapley values in machine learning to break down model predictions.


I hope that you learned something new, interesting, and valuable today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!

All About Algorithms

The post Shapley Values Clearly Explained appeared first on Towards Data Science.

]]>
Neural Networks For Periodic Functions https://towardsdatascience.com/neural-networks-for-periodic-functions-648cfc940437/ Tue, 16 Jan 2024 17:32:26 +0000 https://towardsdatascience.com/neural-networks-for-periodic-functions-648cfc940437/ When ReLU's extrapolation capabilities are not enough

The post Neural Networks For Periodic Functions appeared first on Towards Data Science.

]]>
Photo by Willian Justen de Vasconcellos on Unsplash

Neural networks are known to be great approximators for any function – at least whenever we don’t move too far away from our dataset. Let us see what that means. Here is some data:

Image by the author.

It does not only look like a sine wave – it actually is one, with some noise added. We can now train a normal feed-forward neural network having 1 hidden layer with 1000 neurons and ReLU activation. We get the following fit:

Image by the author.

It looks quite decent, apart from the edges. We could fix this by adding more neurons to the hidden layer according to Cybenko's universal approximation theorem. But I want to point out something else:

Image by the author.

We could argue now that this extrapolation behavior is bad if we assume the wave pattern to continue outside of the observed range. But if there is no domain knowledge or more data we can resort to, it would just be this: an assumption.

However, in the remainder of this article, we will assume that any periodic pattern we can pick up within the data continues outside as well. This is a common assumption when doing time series modeling, where we naturally want to extrapolate into the future. We assume that any observed seasonality in the training data will just continue like that, because what else can we say without any additional information? In this article, I want to show you how using sine-based activation functions helps bake this assumption into the model.

But before we go there, let us briefly dive deeper into how ReLU-based neural networks extrapolate in general, and why we should not use them for time series forecasting as is.

Extrapolation of Neural Networks

We have seen that the output of our ReLU network turns into a line the further we go away from our dataset. This is a general pattern, as shown by Ziyin et al. in their paper Neural Networks Fail to Learn Periodic Functions and How to Fix It. They prove:

From their paper.

To simplify it a little bit, you can read it as f(z·u) ≈ z·W·u + b for large z. This shows that if you plug in values with a large absolute value into the ReLU neural network – meaning very far away from 0 – the network behaves like a linear function.

If the input and output are one-dimensional, such as in our example, we can set u = 1 and get f(z) ≈ W·z + b for large z, where W is a 1×1 matrix, i.e., a real number. Setting u = -1, we get f(-z) ≈ W’·z + b for large z and some other number W’. This is exactly what we have seen on the edges.

Sigmoid- or tanh-based networks

Something similar holds for bounded activation functions such as sigmoid or tanh:

Image by the author.

The authors formalize it in

From their paper.

This means that our model behaves like a constant function f(z·u) ≈ v whenever we go far away from 0, as opposed to a more general linear function in the ReLU case.

While this is all interesting, let us get to the activation functions that can model periodic behavior.

Sine-Based Activation Functions

There are several ways to proceed from here. The easiest thing is just to use sin(x) as an activation function and see what happens. Ziyin et al. put even more thought into it and suggest using their so-called snake function Snake(x) = x + sin²(x), or even Snake(x) = x + sin²(ax) / a if you want to parameterize it.

Pure sine

Let us build a model using only a single hidden node with the sine function as the activation. I will use TensorFlow to implement it, but it is so simple that you can use any Deep Learning framework to follow along.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation=tf.math.sin),
    tf.keras.layers.Dense(1)
])

model.compile(
    loss="mse",
    optimizer=tf.keras.optimizers.AdamW() # AdamW is a better version of Adam, see here: https://arxiv.org/abs/1711.05101
) 

We can now train the model:

import numpy as np

X = tf.random.uniform((100000, 1), minval=-6, maxval=6)
y = tf.math.sin(X) + 0.1 * np.random.randn(100000, 1)

model.fit(
    X,
    y,
    validation_split=0.1,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True), # stop whenever the validation mse does not improve in two epochs
        tf.keras.callbacks.ReduceLROnPlateau(patience=0) # lower learning rate if the validation mse does not improve
    ]
)
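The plotting code is not shown in the article, but a minimal sketch could look like this – predicting on a much wider range than the training interval [-6, 6] to see the extrapolation behavior:

import matplotlib.pyplot as plt

X_wide = tf.linspace(-20.0, 20.0, 1000)[:, tf.newaxis]
plt.scatter(X[:1000], y[:1000], s=1, alpha=0.3, label="Training data (sample)")
plt.plot(X_wide, model.predict(X_wide), color="red", label="Model")
plt.legend()
plt.show()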

We get the following fit:

Image by the author.

This is what we would like to see when doing a time series forecast! Although the left side (the past) usually does not matter, it is nice to get a reasonable extrapolation there as well for free.

A word of caution: do not throw in too many hidden nodes, otherwise the model might pick up seasonalities that do not exist in the data, which is classical overfitting. Look at what can happen if we use 100 hidden nodes:

Image by the author.

In addition to the short seasonality, the model picks up a longer seasonality that trends downward over the training period. We know that this pattern is not actually present in our data, though.

Less is more.

Snake function

The authors did not design the snake function for stationary data like the data we have at the moment. It will not work well with our data, since it imposes a trend on the model.

def snake(x):
    # snake activation: sin²(x) + x
    return tf.math.sin(x)**2 + x

model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation=snake),  # five snake-activated hidden nodes
    tf.keras.layers.Dense(1)
])
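Training works exactly as before; for completeness, here is a short sketch reusing the data and settings from the pure-sine example above:

model.compile(
    loss="mse",
    optimizer=tf.keras.optimizers.AdamW()
)

model.fit(
    X,
    y,
    validation_split=0.1,
    epochs=100,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)]
)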

Training this model gives something like this:

Image by the author.

We can see an upward trend that comes out of nowhere. This is a consequence of how the snake function is designed, namely the "+ x" term. For functions or time series with a trend, though, the snake function can shine.

The authors also present a universal extrapolation theorem:

From their paper. C¹ functions have continuous first derivatives.

This is similar to the Fourier approximation theorem, but not the same, since we use the snake function here instead of pure sine and cosine waves.

Time Series With a Trend

Let us say that we want to forecast a time series now, and we assume that it

  • has potentially many seasonalities, such as hourly, monthly, yearly, or even some period we cannot easily describe, and
  • has a trend.

There are a lot of classical statistical methods to tackle this problem, such as SARIMA or triple exponential smoothing, also known as Holt-Winters. On the neural network side, you can use recurrent neural networks, convolutions, or transformers.

However, I will not go into the details of these methods but focus on our main goal: building simple sine-based feed-forward neural networks. In contrast to many of the aforementioned methods, we do not have to deal with recursive formulas in this case but model our target directly. Let us first create some data we can train on.

# 50 time steps: three sine components, a linear trend, and some noise
X = tf.cast(tf.range(50)[:, tf.newaxis], tf.float32)
y = 3*tf.math.sin(X) + 2*tf.math.sin(2*X) - tf.math.sin(3*X) + X + 0.1 * tf.random.normal((50, 1))

The data looks like this:

Image by the author.

Additive trend + seasonality model

The idea is to combine two sub-networks: a ReLU-based one that extrapolates linearly and therefore captures the trend, and a sine-based one that extrapolates periodically and therefore captures the seasonality. We add their outputs to obtain the forecast. We can implement this in Keras like this:

class AdditiveTrendAndSeasonalityModel(tf.keras.Model):
    def __init__(self, trend_model, seasonality_model):
        super().__init__()
        self.trend_model = trend_model
        self.seasonality_model = seasonality_model

    def call(self, x):
        return self.trend_model(x) + self.seasonality_model(x)

trend_model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="relu"),  # ReLU extrapolates linearly -> trend
    tf.keras.layers.Dense(1)
])

seasonality_model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation=tf.math.sin),  # sine extrapolates periodically -> seasonality
    tf.keras.layers.Dense(1, use_bias=False)  # no bias, so the overall level is captured by the trend model
])

model = AdditiveTrendAndSeasonalityModel(trend_model, seasonality_model)

Now we can compile and train the model.

model.compile(
    loss="mse",
    optimizer=tf.keras.optimizers.AdamW(learning_rate=0.0005)
)

model.fit(
    X,
    y,
    epochs=3000,
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5)]  # no validation split here, so we monitor the training loss
)

We can use our model to extrapolate now:

Image by the author.
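To create a plot like the one above, you can simply evaluate the model beyond the 50 training time steps; a minimal sketch (the forecast horizon and plotting details are my own choices):

# extrapolate 30 time steps beyond the training range
X_future = tf.cast(tf.range(80)[:, tf.newaxis], tf.float32)
y_pred = model(X_future)

plt.plot(X, y, label="training data")
plt.plot(X_future, y_pred, "--", label="model")
plt.legend()
plt.show()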

The extrapolation behavior of our model looks great! The trend is nice and linear, just as in the training data, but the seasonality pattern could be more accurate. The pattern that the model came up with is too simple, although the model had the chance to learn the true pattern since we gave it five sine-activated nodes. Only three would have been necessary.

I have also found that initialization matters a lot in sine-based neural networks. Sometimes we get fast convergence to a model like the one in the image above, but sometimes we get stuck in very bad local minima. It is a bit hit-or-miss.


Still, since our model consists of a trend_model and a seasonality_model, we can easily decompose our time series into a trend and a seasonal part:

Image by the author.
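Producing such a decomposition is just a matter of calling the two sub-models separately; here is a sketch (the plot layout is my own choice):

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(X, model.trend_model(X))
ax1.set_title("Trend")
ax2.plot(X, model.seasonality_model(X))
ax2.set_title("Seasonality")
plt.show()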

Nice! And if you want a multiplicative decomposition, just replace the + with a * in the model class.
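For reference, such a multiplicative variant could look like this; the constructor stays the same, only the call method changes:

class MultiplicativeTrendAndSeasonalityModel(tf.keras.Model):
    def __init__(self, trend_model, seasonality_model):
        super().__init__()
        self.trend_model = trend_model
        self.seasonality_model = seasonality_model

    def call(self, x):
        # multiply the trend and seasonality components instead of adding them
        return self.trend_model(x) * self.seasonality_model(x)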

Conclusion

In this article, we have seen that the extrapolation capabilities of neural networks are determined by their activation functions, a fact that many people are not aware of.

Furthermore, for time series modeling, we have seen how to create a simple neural network that consists of a trend and seasonality part that are then added together, giving us a nice decomposition. And because we are dealing with time series, extrapolation capabilities are crucial. Hence we have to be careful which activation functions we choose. In the case of the seasonality part of our network, we used sine activation functions since they make this part of the neural network periodic.

You can also use the snake function that Ziyin et al. proposed, but then you will lose the ability to decompose the time series into trend and seasonality components. Give it a try if the decomposition is not important to you and predictive performance is what matters most.


I hope that you learned something new, interesting, and valuable today. Thanks for reading!

If you have any questions, write me on LinkedIn!

And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!

All About Algorithms

The post Neural Networks For Periodic Functions appeared first on Towards Data Science.
