
Tidying Up the Framework of Dataset Shifts: The Example

How the conditional probability changes as a function of the three probability elements

Image by author

I recently talked about the causes of model performance degradation, that is, the drop in a model’s prediction quality compared to the moment we trained and deployed it. In that other post, I proposed a new way of thinking about the causes of model degradation. In that framework, the so-called conditional probability comes out as the global cause.

The conditional probability is, by definition, composed of three probabilities, which I call the specific causes. The most important takeaway from this restructuring of concepts is that covariate shift and conditional shift are not two separate or parallel concepts: conditional shift can happen as a function of covariate shift.

With this restructuring, I believe it becomes easier to reason about the causes and more natural to interpret the shifts we observe in our applications.

This is the scheme of causes and model performance for Machine Learning models:

Image by author. Adapted from https://towardsdatascience.com/tidying-up-the-framework-of-dataset-shifts-cd9f922637b7

In this scheme, we see the clear path that connects the causes to the prediction performance of our estimated models. One fundamental assumption we need to make in Statistical Learning is that our models are "good" estimators of the real models (real decision boundaries, real regression functions, etc.). "Good" can have different meanings, such as unbiased estimators, precise estimators, complete estimators, sufficient estimators, etc. But, for the sake of simplicity and the upcoming discussion, let’s say that they are good in the sense that they have a small prediction error. In other words, we assume that they are representative of the real models.

With this assumption, we are able to look for the causes of model degradation of the estimated model in the probabilities P(X), P(Y), P(X|Y), and consequently, P(Y|X).

So, what we will do today is walk through different scenarios to see how P(Y|X) changes as a function of the three probabilities P(X|Y), P(X), and P(Y). We will do so by using a population of a few points in a 2D space and calculating the probabilities from these sample points the way Laplace would: counting favorable cases over total cases. The purpose is to digest the hierarchy scheme of causes of model degradation, keeping P(Y|X) as the global cause and the other three as the specific causes. That way, we can understand, for example, how a potential covariate shift can sometimes be the argument of the conditional shift rather than a separate shift of its own.
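To make that counting concrete, here is a minimal sketch of how such Laplace-style probabilities can be computed from a handful of labelled points. The point format and the function name `empirical_probs` are my own choices for illustration; the coordinates used later are not taken from the article’s figures.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, int]  # (x1, x2, y) with y in {0, 1}

def empirical_probs(points: List[Point], a: float = 0.0) -> Dict[str, float]:
    """Laplace-style counting: every probability is favorable cases / total cases."""
    n = len(points)
    in_region = [p for p in points if p[0] > a]   # samples with X1 > a
    class_1 = [p for p in points if p[2] == 1]    # samples with Y = 1
    return {
        "P(X1>a)": len(in_region) / n,
        "P(Y=1)": len(class_1) / n,
        "P(X1>a|Y=1)": sum(p[0] > a for p in class_1) / len(class_1),
        "P(Y=1|X1>a)": sum(p[2] == 1 for p in in_region) / len(in_region),
    }
```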

The example

The case we will draw for our lesson today is a very simple one. We have a space of two covariates X1 and X2 and the output Y is a binary variable. This is what our model space looks like:

Image by author

You see there that the space is organized into 4 quadrants, and the decision boundary in this space is the cross. This means that the model classifies samples in class 1 if they lie in the 1st and 3rd quadrants, and in class 0 otherwise. For the sake of this exercise, we will walk through the different cases comparing P(Y=1|X1>a); this will be our conditional probability to showcase. If you are wondering why we don’t also condition on X2, it’s only to keep the exercise simple. It doesn’t affect the insight we want to understand.
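The cross boundary itself can be written as a tiny classifier. Which diagonal pair of quadrants carries class 1 depends on the figure, so the orientation below (class 1 when X1 and X2 fall on the same side of their thresholds) is only an assumption for illustration.

```python
def predict(x1: float, x2: float, a: float = 0.0, b: float = 0.0) -> int:
    """Cross-shaped decision boundary at x1 = a and x2 = b.

    One diagonal pair of quadrants is class 1, the other is class 0.
    The orientation chosen here is an assumption for illustration only.
    """
    return 1 if (x1 > a) == (x2 > b) else 0
```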

If that still leaves you with a bittersweet feeling, note that P(Y=1|X1>a) is equivalent to P(Y=1|X1>a, -inf < X2 < inf), so, theoretically, we are still taking X2 into account.

Image by author

Reference model

To start with, we calculate our showcase probability and obtain 1/2. Here, our group of samples is spread quite uniformly throughout the space, and the prior probabilities are also uniform:

Image by author
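As a sanity check, here is one hypothetical layout with a single point per quadrant, labelled consistently with the cross boundary, that reproduces these uniform probabilities (reusing the `empirical_probs` sketch from above; the coordinates are mine, not the figure’s).

```python
# One hypothetical point per quadrant, labelled consistently with the cross boundary.
reference = [(1, 1, 1), (-1, 1, 0), (-1, -1, 1), (1, -1, 0)]

print(empirical_probs(reference))
# {'P(X1>a)': 0.5, 'P(Y=1)': 0.5, 'P(X1>a|Y=1)': 0.5, 'P(Y=1|X1>a)': 0.5}
```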

Shifts are coming up

  1. One extra sample appears in the bottom right quadrant. So the first thing we ask is: Are we talking about a covariate shift?

Well, yes, because there is more sampling in X1>a than there was before. So, is this only a covariate shift and not a conditional shift? Let’s see. Here is the calculation of all the same probabilities as before with the updated set of points (the probabilities that changed are in orange):

Image by author

What do we see here? Not only did we get a covariate shift; overall, all the probabilities changed. The prior probability changed because the covariate shift brought in a new point of class 1, making the incidence of this class larger than that of class 0. The inverse probability P(X1>a|Y=1) then changed precisely because of the prior shift. All of this led to a conditional shift, so we now have P(Y=1|X1>a) = 2/3 instead of 1/2.
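Here is one hypothetical configuration that reproduces those numbers, again reusing the `empirical_probs` sketch from above; the coordinates of the extra point are illustrative, not taken from the figure.

```python
# The reference layout from before plus one extra class-1 point on the X1 > a side.
reference = [(1, 1, 1), (-1, 1, 0), (-1, -1, 1), (1, -1, 0)]
shifted = reference + [(2, -1, 1)]

print(empirical_probs(shifted))
# P(X1>a) = 0.6, P(Y=1) = 0.6, P(X1>a|Y=1) ≈ 0.667, P(Y=1|X1>a) ≈ 0.667 (= 2/3)
```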

Here’s a thought bubble. A very important one actually.

With this shift in the sampling distribution, we obtained shifts in all the probabilities that play a role in the whole scheme of our models. Yet, the decision boundary that existed based on the initial sampling remained valid for this shift.

What does this mean?

Even though we obtained a conditional shift, the decision boundary did not necessarily degrade. Because the decision boundary is derived from the expected value of Y given X, if we recompute that expectation under the shifted distribution, the boundary may remain exactly the same even though the conditional probability behind it has changed.

2. The samples in the first quadrant don’t exist anymore.

So, for X1>a things remained unchanged. Let’s see what happens to the conditional probability we’re showcasing and its elements.

Image by author

Intuitively, because nothing changed within X1>a, the conditional probability remained the same. Yet, when we look at P(X1>a), we obtain 2/3 instead of the 1/2 we had under the training sampling. So here we have a covariate shift without a conditional shift.

From a math perspective, how can the covariate probability change without the conditional probability changing? Because P(Y=1) and P(X1>a|Y=1) changed along with the covariate probability, and their changes compensate each other, leaving the conditional probability untouched.
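This compensation can be checked directly with Bayes’ rule, P(Y=1|X1>a) = P(X1>a|Y=1) · P(Y=1) / P(X1>a). Below is a self-contained sketch with one hypothetical version of this case, in which a single class-1 point on the X1 ≤ a side disappears; the coordinates are mine, not the figure’s.

```python
# Hypothetical configuration: a class-1 point on the X1 <= a side disappears.
before = [(1, 1, 1), (-1, 1, 0), (-1, -1, 1), (1, -1, 0)]
after = [(1, 1, 1), (-1, 1, 0), (1, -1, 0)]

def bayes_check(points, a=0.0):
    class_1 = [p for p in points if p[2] == 1]
    p_x = sum(p[0] > a for p in points) / len(points)            # P(X1>a)
    p_y = len(class_1) / len(points)                              # P(Y=1)
    p_x_given_y = sum(p[0] > a for p in class_1) / len(class_1)   # P(X1>a|Y=1)
    return p_x, p_y, p_x_given_y, p_x_given_y * p_y / p_x         # Bayes: P(Y=1|X1>a)

print(bayes_check(before))  # (0.5, 0.5, 0.5, 0.5)
print(bayes_check(after))   # ≈ (0.667, 0.333, 1.0, 0.5) -> conditional unchanged
```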

With these changes, just as before, the decision boundary remained valid.

3. Throwing in some samples in different quadrants while the decision boundary remains valid.

Here we have two extra combinations. In the first case, the prior remained the same while the other two probabilities changed, still without changing the conditional probability. In the second case, a change in the inverse probability alone came with a conditional shift. Check the shifts below; the latter is a pretty important one, so don’t miss it!

Image by author

With this, we have now a pretty solid perspective on how the conditional probability can change as a function of the other three probabilities. But most importantly, we also know that not all conditional shifts invalidate the existing decision boundary. So what’s the deal with it?

Concept drift

In the previous post, I also proposed a more specific way of defining concept drift (or concept shift). The proposal is:

We refer to a change in the concept when the decision boundary or regression function becomes invalid as the probabilities at play shift.

So, the crucial point about this is that if the decision boundary becomes invalid, surely there is a conditional shift. The reverse, as we discussed in the previous post and as we saw in the examples above, is not necessarily true.

This might not be so fantastic from a practical perspective, because it means that to truly know whether there is concept drift, we might be forced to re-estimate the boundary or function. But for our theoretical understanding, it is just as fascinating.

Here’s an example in which we have a concept drift, naturally with a conditional shift, but actually without a covariate or a prior shift.

Image by author

How cool is this separation of components? The only element that changed here was the inverse probability, but, contrary to the previous shift we studied above, this change in the inverse probability was linked to a change in the decision boundary. Now the only valid decision boundary is the separation along X1 = a; the boundary dictated by X2 no longer applies.
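Here is one hypothetical layout consistent with this case: the marginals P(X1>a) and P(Y=1) keep their reference value of 1/2, while the inverse and conditional probabilities shift, and the cross boundary stops being valid. The coordinates are illustrative, not taken from the figure.

```python
# Hypothetical drifted layout: class 1 now fills the whole X1 > a half,
# class 0 the whole X1 <= a half (coordinates are illustrative).
drifted = [(1, 1, 1), (1, -1, 1), (-1, 1, 0), (-1, -1, 0)]

n = len(drifted)
class_1 = [p for p in drifted if p[2] == 1]
right = [p for p in drifted if p[0] > 0]

print(len(right) / n)                                 # P(X1>a)     = 0.5 (unchanged)
print(len(class_1) / n)                               # P(Y=1)      = 0.5 (unchanged)
print(sum(p[0] > 0 for p in class_1) / len(class_1))  # P(X1>a|Y=1) = 1.0 (shifted)
print(sum(p[2] == 1 for p in right) / len(right))     # P(Y=1|X1>a) = 1.0 (shifted)
# The cross boundary now mislabels two of the four points; only X1 = a separates the classes.
```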

What have we learned?

We have walked very slowly through the decomposition of the causes of model degradation. We studied different shifts of the probability elements and how they relate to the degradation of the prediction performance of our machine learning models. The most important insights are:

  • A conditional shift is a global cause of prediction degradation in machine learning models
  • The specific causes are covariate shift, prior shift, and inverse probability shift
  • We can have many different cases of probability shifts while the decision boundary remains valid
  • A change in the decision boundary causes a conditional shift, but the reverse is not necessarily true!
  • Concept drift may be more specifically associated with the decision boundary rather than with the overall conditional probability distribution

What follows from this? My biggest invitation is to reorganize our practical solutions in light of this hierarchy of definitions. There, we might find many of the answers we are looking for regarding how to monitor our models.

If you are currently working on model performance monitoring using these definitions, don’t hesitate to share your thoughts on this framework.

Happy thinking to everyone!

