How to Spot and Prevent Model Drift Before it Impacts Your Business
https://towardsdatascience.com/how-to-spot-and-prevent-model-drift-before-it-impacts-your-business/
3 essential methods to track model drift you should know

Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. 

I’ve seen firsthand how undetected drifts can result in significant costs — missed fraud detection, lost revenue, and suboptimal business outcomes, just to name a few. So, it’s crucial to have robust monitoring in place if your company has deployed or plans to deploy machine learning models into production.

Undetected Model Drift can lead to significant financial losses, operational inefficiencies, and even damage to a company’s reputation. To mitigate these risks, it’s important to have effective model monitoring, which involves:

  • Tracking model performance
  • Monitoring feature distributions
  • Detecting both univariate and multivariate drifts

A well-implemented monitoring system can help identify issues early, saving considerable time, money, and resources.

In this comprehensive guide, I’ll provide a framework for how to think about and implement effective model monitoring, helping you stay ahead of potential issues and ensure the stability and reliability of your models in production.

What’s the difference between feature drift and score drift?

Score drift refers to a gradual change in the distribution of model scores. If left unchecked, this could lead to a decline in model performance, making the model less accurate over time.

On the other hand, feature drift occurs when one or more features experience changes in the distribution. These changes in feature values can affect the underlying relationships that the model has learned, and ultimately lead to inaccurate model predictions.

Simulating score shifts

To model real-world fraud detection challenges, I created a synthetic dataset with five financial transaction features.

The reference dataset represents the original distribution, while the production dataset introduces shifts to simulate an increase in high-value transactions without PIN verification on newer accounts, indicating an increase in fraud.

Each feature has different underlying distributions:

  • Transaction Amount: Log-normal distribution (right-skewed with a long tail)
  • Account Age (months): clipped normal distribution between 0 and 60 (assuming a 5-year-old company)
  • Time Since Last Transaction: Exponential distribution
  • Transaction Count: Poisson distribution
  • Entered PIN: Binomial distribution.

To approximate model scores, I randomly assigned weights to these features and applied a sigmoid function to constrain predictions between 0 and 1. This mimics how a logistic regression fraud model generates risk scores.
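To make this concrete, below is a minimal sketch of how such a reference dataset and its scores could be generated. The feature names, distribution parameters, and weights are assumptions for this sketch, not the article's actual values:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Assumed distribution parameters, for illustration only
ref_data = pd.DataFrame({
    'transaction_amount': rng.lognormal(mean=4, sigma=1, size=n),          # right-skewed, long tail
    'account_age_in_months': np.clip(rng.normal(30, 15, size=n), 0, 60),   # clipped normal
    'time_since_last_transaction': rng.exponential(scale=5, size=n),       # exponential
    'transaction_count': rng.poisson(lam=3, size=n),                       # Poisson
    'entered_pin': rng.binomial(n=1, p=0.8, size=n),                       # binomial
})

# Randomly assigned weights passed through a sigmoid, mimicking a logistic regression risk score
weights = rng.normal(size=ref_data.shape[1])
standardized = (ref_data - ref_data.mean()) / ref_data.std()
ref_data['model_score'] = 1 / (1 + np.exp(-(standardized.values @ weights)))

The production dataset can be built the same way with shifted parameters (for example, a higher log-normal mean and a lower PIN-entry rate) to simulate the fraud pattern described above.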

As shown in the plot below:

  • Drifted features: Transaction Amount, Account Age, Transaction Count, and Entered PIN all experienced shifts in distribution, scale, or relationships.
Distribution of drifted features (image by author)
  • Stable feature: Time Since Last Transaction remained unchanged.
Distribution of stable feature (image by author)
  • Drifted scores: As a result of the drifted features, the distribution in model scores has also changed.
Distribution of model scores (image by author)

This setup allows us to analyze how feature drift impacts model scores in production.

Detecting model score drift using PSI

To monitor model scores, I used population stability index (PSI) to measure how much model score distribution has shifted over time.

PSI works by binning continuous model scores and comparing the proportion of scores in each bin between the reference and production datasets. It combines the differences in proportions with the logarithm of their ratios into a single summary statistic that quantifies the drift.
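Written out, if r_i and p_i are the proportions of scores falling into bin i for the reference and production datasets respectively, the statistic is:

PSI = Σ_i (r_i - p_i) · ln(r_i / p_i)

which is exactly what the implementation below computes bin by bin.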

Python implementation:

import numpy as np

# Define function to calculate PSI given two datasets
def calculate_psi(reference, production, bins=10):
    # Discretize scores into bins
    min_val, max_val = 0, 1
    bin_edges = np.linspace(min_val, max_val, bins + 1)

    # Calculate proportions in each bin
    ref_counts, _ = np.histogram(reference, bins=bin_edges)
    prod_counts, _ = np.histogram(production, bins=bin_edges)

    ref_proportions = ref_counts / len(reference)
    prod_proportions = prod_counts / len(production)

    # Avoid division by zero
    ref_proportions = np.clip(ref_proportions, 1e-8, 1)
    prod_proportions = np.clip(prod_proportions, 1e-8, 1)

    # Calculate PSI across all bins
    psi = np.sum((ref_proportions - prod_proportions) * np.log(ref_proportions / prod_proportions))

    return psi

# Calculate PSI
psi_value = calculate_psi(ref_data['model_score'], prod_data['model_score'], bins=10)
print(f"PSI Value: {psi_value}")

Below is a summary of how to interpret PSI values:

  • PSI < 0.1: No drift, or very minor drift (distributions are almost identical).
  • 0.1 ≤ PSI < 0.25: Some drift. The distributions are somewhat different.
  • 0.25 ≤ PSI < 0.5: Moderate drift. A noticeable shift between the reference and production distributions.
  • PSI ≥ 0.5: Significant drift. There is a large shift, indicating that the distribution in production has changed substantially from the reference data.
Histogram of model score distributions (image by author)

The PSI value of 0.6374 suggests a significant drift between our reference and production datasets. This aligns with the histogram of model score distributions, which visually confirms the shift towards higher scores in production — indicating an increase in risky transactions.

Detecting feature drift

Kolmogorov-Smirnov test for numeric features

The Kolmogorov-Smirnov (K-S) test is my preferred method for detecting drift in numeric features, because it is non-parametric, meaning it doesn’t assume a normal distribution.

The test compares a feature’s distribution in the reference and production datasets by measuring the maximum difference between the empirical cumulative distribution functions (ECDFs). The resulting K-S statistic ranges from 0 to 1:

  • 0 indicates no difference between the two distributions.
  • Values closer to 1 suggest a greater shift.

Python implementation:

import pandas as pd
from scipy.stats import ks_2samp

# Create an empty dataframe to collect the results
ks_results = pd.DataFrame(columns=['Feature', 'KS Statistic', 'p-value', 'Drift Detected'])

# Loop through all numeric features and perform the K-S test
for col in numeric_cols:
    ks_stat, p_value = ks_2samp(ref_data[col], prod_data[col])
    drift_detected = p_value < 0.05

    # Store results in the dataframe
    ks_results = pd.concat([
        ks_results,
        pd.DataFrame({
            'Feature': [col],
            'KS Statistic': [ks_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

Below are ECDF charts of the four numeric features in our dataset:

ECDFs of four numeric features (image by author)

Let’s look at the account age feature as an example: the x-axis represents account age (0-50 months), while the y-axis shows the ECDF for both reference and production datasets. The production dataset skews towards newer accounts, as it has a larger proportion of observations with lower account ages.

Chi-Square test for categorical features

To detect shifts in categorical and boolean features, I like to use the Chi-Square test.

This test compares the frequency distribution of a categorical feature in the reference and production datasets, and returns two values:

  • Chi-Square statistic: A higher value indicates a greater shift between the reference and production datasets.
  • P-value: A p-value below 0.05 suggests that the difference between the reference and production datasets is statistically significant, indicating potential feature drift.

Python implementation:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Create empty dataframe with corresponding column names
chi2_results = pd.DataFrame(columns=['Feature', 'Chi-Square Statistic', 'p-value', 'Drift Detected'])

for col in categorical_cols:
    # Get normalized value counts for both reference and production datasets
    ref_counts = ref_data[col].value_counts(normalize=True)
    prod_counts = prod_data[col].value_counts(normalize=True)

    # Ensure all categories are represented in both
    all_categories = set(ref_counts.index).union(set(prod_counts.index))
    ref_counts = ref_counts.reindex(all_categories, fill_value=0)
    prod_counts = prod_counts.reindex(all_categories, fill_value=0)

    # Create contingency table
    contingency_table = np.array([ref_counts * len(ref_data), prod_counts * len(prod_data)])

    # Perform Chi-Square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    drift_detected = p_value < 0.05

    # Store results in chi2_results dataframe
    chi2_results = pd.concat([
        chi2_results,
        pd.DataFrame({
            'Feature': [col],
            'Chi-Square Statistic': [chi2_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

The Chi-Square statistic of 57.31 with a p-value of 3.72e-14 confirms a large shift in our categorical feature, Entered PIN. This finding aligns with the histogram below, which visually illustrates the shift:

Distribution of categorical feature (image by author)

Detecting multivariate shifts

Spearman Correlation for shifts in pairwise interactions

In addition to monitoring individual feature shifts, it’s important to track shifts in relationships or interactions between features, known as multivariate shifts. Even if the distributions of individual features remain stable, multivariate shifts can signal meaningful differences in the data.

By default, Pandas’ .corr() function calculates Pearson correlation, which only captures linear relationships between variables. However, relationships between features are often non-linear yet still follow a consistent trend.

To capture this, we use Spearman correlation to measure monotonic relationships between features. It captures whether features change together in a consistent direction, even if their relationship isn’t strictly linear.

To assess shifts in feature relationships, we compare:

  • Reference correlation (ref_corr): Captures historical feature relationships in the reference dataset.
  • Production correlation (prod_corr): Captures new feature relationships in production.
  • Absolute difference in correlation: Measures how much feature relationships have shifted between the reference and production datasets. Higher values indicate more significant shifts.

Python implementation:

# Calculate correlation matrices
ref_corr = ref_data.corr(method='spearman')
prod_corr = prod_data.corr(method='spearman')

# Calculate correlation difference
corr_diff = abs(ref_corr - prod_corr)
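To surface the most affected pairs from this difference matrix, one option (a small sketch of mine, not from the original article) is to keep the upper triangle, flatten it, and sort by the size of the shift:

import numpy as np

# Keep only the upper triangle so each feature pair appears once
mask = np.triu(np.ones(corr_diff.shape, dtype=bool), k=1)

# Long format: (feature_1, feature_2) -> absolute change in Spearman correlation
top_shifts = corr_diff.where(mask).stack().sort_values(ascending=False)
print(top_shifts.head(10))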

Example: Change in correlation

Now, let’s look at the correlation between transaction_amount and account_age_in_months:

  • In ref_corr, the correlation is 0.00095, indicating a weak relationship between the two features.
  • In prod_corr, the correlation is -0.0325, indicating a weak negative correlation.
  • Absolute difference in the Spearman correlation is 0.0335, which is a small but noticeable shift.

The absolute difference in correlation indicates a shift in the relationship between transaction_amount and account_age_in_months.

There used to be no relationship between these two features, but the production dataset indicates that there is now a weak negative correlation, meaning that newer accounts have higher transaction amounts. This is spot on!

Autoencoder for complex, high-dimensional multivariate shifts

In addition to monitoring pairwise interactions, we can also look for shifts across more dimensions in the data.

Autoencoders are powerful tools for detecting high-dimensional multivariate shifts, where multiple features collectively change in ways that may not be apparent from looking at individual feature distributions or pairwise correlations.

An autoencoder is a neural network that learns a compressed representation of data through two components:

  • Encoder: Compresses input data into a lower-dimensional representation.
  • Decoder: Reconstructs the original input from the compressed representation.

To detect shifts, we compare the reconstructed output to the original input and compute the reconstruction loss.

  • Low reconstruction loss → The autoencoder successfully reconstructs the data, meaning the new observations are similar to what it has seen and learned.
  • High reconstruction loss → The production data deviates significantly from the learned patterns, indicating potential drift.

Unlike traditional drift metrics that focus on individual features or pairwise relationships, autoencoders capture complex, non-linear dependencies across multiple variables simultaneously.

Python implementation:

import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ref_features = ref_data[numeric_cols + categorical_cols]
prod_features = prod_data[numeric_cols + categorical_cols]

# Normalize the data
scaler = StandardScaler()
ref_scaled = scaler.fit_transform(ref_features)
prod_scaled = scaler.transform(prod_features)

# Split reference data into train and validation sets
np.random.shuffle(ref_scaled)
train_size = int(0.8 * len(ref_scaled))
train_data = ref_scaled[:train_size]
val_data = ref_scaled[train_size:]

# Build autoencoder
input_dim = ref_features.shape[1]
encoding_dim = 3
# Input layer
input_layer = Input(shape=(input_dim,))
# Encoder
encoded = Dense(8, activation="relu")(input_layer)
encoded = Dense(encoding_dim, activation="relu")(encoded)
# Decoder
decoded = Dense(8, activation="relu")(encoded)
decoded = Dense(input_dim, activation="linear")(decoded)
# Autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train autoencoder on the reference data only
history = autoencoder.fit(
    train_data, train_data,
    epochs=50,
    batch_size=64,
    shuffle=True,
    validation_data=(val_data, val_data),
    verbose=0
)

# Calculate per-row reconstruction error for both datasets
ref_pred = autoencoder.predict(ref_scaled, verbose=0)
prod_pred = autoencoder.predict(prod_scaled, verbose=0)

ref_mse = np.mean(np.power(ref_scaled - ref_pred, 2), axis=1)
prod_mse = np.mean(np.power(prod_scaled - prod_pred, 2), axis=1)

The charts below show the distribution of reconstruction loss between both datasets.

Distribution of reconstruction loss between actuals and predictions (image by author)

The production dataset has a higher mean reconstruction error than the reference dataset, indicating a shift in the overall data. This aligns with the changes we introduced in the production dataset: a higher number of newer accounts making high-value transactions.
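One simple way to turn these reconstruction errors into an ongoing drift signal (a sketch under the assumption that the reference errors define what is "normal") is to set a threshold from the reference distribution and track how often production observations exceed it:

import numpy as np

# Use the 95th percentile of the reference reconstruction error as the threshold
threshold = np.percentile(ref_mse, 95)

# Share of production rows the autoencoder struggles to reconstruct
prod_exceed_rate = np.mean(prod_mse > threshold)

print(f"Reference mean MSE:  {ref_mse.mean():.4f}")
print(f"Production mean MSE: {prod_mse.mean():.4f}")
print(f"Production rows above the reference 95th percentile: {prod_exceed_rate:.1%}")

A rate well above the expected 5% suggests a multivariate shift worth investigating.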

Summarizing

Model monitoring is an essential, yet often overlooked, responsibility for data scientists and machine learning engineers.

All the statistical methods led to the same conclusion, which aligns with the observed shifts in the data: they detected a trend in production towards newer accounts making higher-value transactions. This shift resulted in higher model scores, signaling an increase in potential fraud.

In this post, I covered techniques for detecting drift on three different levels:

  • Model score drift: Using Population Stability Index (PSI)
  • Individual feature drift: Using Kolmogorov-Smirnov test for numeric features and Chi-Square test for categorical features
  • Multivariate drift: Using Spearman correlation for pairwise interactions and autoencoders for high-dimensional, multivariate shifts.

These are just a few of the techniques I rely on for comprehensive monitoring — there are plenty of other equally valid statistical methods that can also detect drift effectively.

Detected shifts often point to underlying issues that warrant further investigation. The root cause could be as serious as a data collection bug, or as minor as a time change like daylight saving time adjustments.

There are also fantastic Python packages, like evidently.ai, that automate many of these comparisons. However, I believe there’s significant value in deeply understanding the statistical techniques behind drift detection, rather than relying solely on these tools.

What’s the model monitoring process like at places you’ve worked?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, career advice for professionals in data.



Tidying Up the Framework of Dataset Shifts: The Example
https://towardsdatascience.com/tidying-up-the-framework-of-dataset-shifts-the-example-77807ee952f5/

How the conditional probability changes as a function of the three probability elements
Image by author

I recently talked about the causes of model performance degradation, meaning when a model’s prediction quality drops with respect to the moment we trained and deployed it. In that post, I proposed a new way of thinking about the causes of model degradation. In that framework, the so-called conditional probability comes out as the global cause.

The conditional probability is, by definition, composed of three probabilities, which I call the specific causes. The most important lesson of this restructuring of concepts is that covariate shift and conditional shift are not two separate or parallel concepts: conditional shift can happen as a function of covariate shift.

With this restructuring, I believe it becomes easier to think about the causes and it becomes more logical to interpret the shifts that we observe in our applications.

This is the scheme of causes and model performance for Machine Learning models:

Image by author. Adapted from https://towardsdatascience.com/tidying-up-the-framework-of-dataset-shifts-cd9f922637b7

In this scheme, we see the clear path that connects the causes to the prediction performance of our estimated models. One fundamental assumption we need to make in Statistical Learning is that our models are "good" estimators of the real models (real decision boundaries, real regression functions, etc.). "Good" can have different meanings, such as unbiased estimators, precise estimators, complete estimators, sufficient estimators, etc. But, for the sake of simplicity and the upcoming discussion, let’s say that they are good in the sense that they have a small prediction error. In other words, we assume that they are representative of the real models.

With this assumption, we are able to look for the causes of model degradation of the estimated model in the probabilities P(X), P(Y), P(X|Y), and consequently, P(Y|X).

So, what we will do today is walk through different scenarios to see how P(Y|X) changes as a function of the three probabilities P(X|Y), P(X), and P(Y). We will do so by using a population of a few points in a 2D space and calculating the probabilities from these sample points the way Laplace would: by counting cases. The purpose is to digest the hierarchy of causes of model degradation, keeping P(Y|X) as the global cause and the other three as the specific causes. In that way, we can understand, for example, how a covariate shift can sometimes be the driver of a conditional shift rather than a separate shift of its own.

The example

The case we will draw for our lesson today is a very simple one. We have a space of two covariates X1 and X2 and the output Y is a binary variable. This is what our model space looks like:

Image by author

You can see that the space is organized into 4 quadrants and that the decision boundary in this space is the cross. This means that the model classifies samples as class 1 if they lie in the 1st and 3rd quadrants, and as class 0 otherwise. For the sake of this exercise, we will walk through the different cases comparing P(Y=1|X1>a). This will be our conditional probability to showcase. If you are wondering why we don’t also condition on X2, it’s only to keep the exercise simple; it doesn’t affect the insight we want to illustrate.

If that still leaves you with a bittersweet feeling, note that P(Y=1|X1>a) is equivalent to P(Y=1|X1>a, -inf < X2 < inf), so theoretically we are still taking X2 into account.

Image by author

Reference model

To start with, we calculate our showcase probability and obtain 1/2. Here our samples are spread quite uniformly throughout the space, and the prior probabilities are also uniform:

Image by author
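To make the counting concrete, here is a small sketch of how these probabilities can be computed Laplace-style from a set of labeled points. The coordinates, labels, and threshold below are made up for illustration and do not reproduce the exact figures of this article:

import pandas as pd

a = 0  # assumed threshold on X1

# Hypothetical reference sample: (x1, x2, y), chosen so the probabilities come out simple
points = pd.DataFrame(
    [(-1, 1, 0), (-1, -1, 1), (1, 1, 1), (1, -1, 0),
     (-2, 2, 0), (-2, -2, 1), (2, 2, 1), (2, -2, 0)],
    columns=['x1', 'x2', 'y'],
)

def summarize(df):
    p_x = (df.x1 > a).mean()                      # P(X1 > a)
    p_y = (df.y == 1).mean()                      # P(Y = 1)
    p_x_given_y = (df[df.y == 1].x1 > a).mean()   # P(X1 > a | Y = 1)
    p_y_given_x = (df[df.x1 > a].y == 1).mean()   # P(Y = 1 | X1 > a)
    return p_x, p_y, p_x_given_y, p_y_given_x

print(summarize(points))   # reference: all four probabilities equal 1/2

# Simulate a shift: one extra class-1 point appears where X1 > a
shifted = pd.concat(
    [points, pd.DataFrame([(1.5, -1.5, 1)], columns=['x1', 'x2', 'y'])],
    ignore_index=True,
)
print(summarize(shifted))  # all four probabilities move away from 1/2

Counting favorable cases over total cases is all it takes; the figures in this article do exactly this on their own sets of points.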

Shifts are coming up

  1. One extra sample appears in the bottom right quadrant. So the first thing we ask is: Are we talking about a covariate shift?

Well, yes, because there is more sampling in X1>a than there was before. So, is this only a covariate shift but not a conditional shift? Let’s see. Here is the calculation of all the same probabilities as before with the updated set of points (The probabilities that changed are in orange):

Image by author

What did we see here? In fact, not only did we get a covariate shift; overall, all the probabilities changed. The prior probability also changed because the covariate shift brought in a new point of class 1, making the incidence of this class bigger than that of class 0. The inverse probability P(X1>a|Y=1) also changed, precisely because of the prior shift. All of that led to a conditional shift, so we now get P(Y=1|X1>a)=2/3 instead of 1/2.

Here’s a thought bubble. A very important one actually.

With this shift in the sampling distribution, we obtained shifts in all the probabilities that play a role in the whole scheme of our models. Yet, the decision boundary that existed based on the initial sampling remained valid for this shift.

What does this mean?

Even though we obtained a conditional shift, the decision boundary did not necessarily degrade. Because the decision boundary comes from the expected value, if we calculate this value based on the current shift, the boundary may remain the same but with a different conditional probability.

2. The samples in the first quadrant no longer exist.

So, for X1>a things remained unchanged. Let’s see what happens to the conditional probability we’re showcasing and its elements.

Image by author

Intuitively, because within X1>a things remain unchanged, the conditional probability remained the same. Yet, when we look at P(X1>a) we obtain 2/3 instead of 1/2 compared to the training sampling. So here we have a covariate shift without a conditional shift.

From a math perspective, how can the covariate probability change without the conditional probability changing? It’s because P(Y=1) and P(X1>a|Y=1) changed in accordance with the covariate probability, so the changes compensate each other and the conditional probability remains unchanged.

With these changes, just as before, the decision boundary remained valid.

3. Throwing in some samples in different quadrants while the decision boundary remained valid.

We have here 2 extra combinations. In one case, the prior remained the same while the other two probabilities changed, still not changing the conditional probability. In the second case, only the inverse probability was associated with a conditional shift. Check the shifts here below. The latter is a pretty important one, so don’t miss it!

Image by author

With this, we have now a pretty solid perspective on how the conditional probability can change as a function of the other three probabilities. But most importantly, we also know that not all conditional shifts invalidate the existing decision boundary. So what’s the deal with it?

Concept drift

In the previous post, I also proposed a more specific way of defining concept drift (or concept shift). The proposal is:

We refer to a change in the concept when the decision boundary or regression function becomes invalid as the probabilities at play shift.

So, the crucial point about this is that if the decision boundary becomes invalid, surely there is a conditional shift. The reverse, as we discussed in the previous post and as we saw in the examples above, is not necessarily true.

This might not be so fantastic from a practical perspective because it means that to truly know if there’s a concept drift, we might be forced to re-estimate the boundary or function. But at least, for our theoretical understanding, this is just as fascinating.

Here’s an example in which we have a concept drift, naturally with a conditional shift, but actually without a covariate or a prior shift.

Image by author

How cool is this separation of components? The only element that changed here was the inverse probability, but, contrary to the previous shift we studied above, this change in the inverse probability was linked to the change in the decision boundary. Now, the only valid decision boundary is the separation according to X1>a, discarding the boundary dictated by X2.

What have we learned?

We have walked very slowly through the decomposition of the causes of model degradation. We studied different shifts of the probability elements and how they relate to the degradation of the prediction performance of our machine learning models. The most important insights are:

  • A conditional shift is a global cause of prediction degradation in machine learning models
  • The specific causes are covariate shift, prior shift, and inverse probability shift
  • We can have many different cases of probability shifts while the decision boundary remains valid
  • A change in the decision boundary causes a conditional shift, but the reverse is not necessarily true!
  • Concept drift may be more specifically associated with the decision boundary rather than with the overall conditional probability distribution

What follows from this? The biggest invitation I make is to reorganize our practical solutions in light of this hierarchy of definitions. We might find many of the answers to our current questions about how we can monitor our models.

If you are currently working on model performance monitoring using these definitions, don’t hesitate to share your thoughts on this framework.

Happy thinking to everyone!

Tidying up the framework of dataset shifts
https://towardsdatascience.com/tidying-up-the-framework-of-dataset-shifts-cd9f922637b7/
Taking a step back about the causes of model degradation

In collaboration with Marco Dalla Vecchia as the image creator

We train models and use them to predict certain outcomes given a set of inputs. We all know that’s the game of ML. We know quite a lot about training them, so much so that now they have evolved into AI, the biggest level of intelligence that has ever existed. But when it comes to using them, we are not that far ahead, and we continue exploring and understanding every aspect that matters after models go into deployment.

So today, we will discuss the issue of model performance drift (or just Model Drift), also frequently known as model failure or model degradation. What we refer to is the issue of the quality of predictions that our ML model delivers. Be it a class or a number, we care about the gap between that prediction and what the real class or value would be. We talk about model performance drift when the quality of those predictions goes down with respect to the moment when we deployed the model. You may have found other terminology for this topic in the literature, but stay with me on model performance drift or simply model drift, at least for the purpose of our current conversation.

What we know

Several blogs, books, and many papers have explored and explained the core concepts of model drift, so let’s first enter this current picture. We have organized the ideas mainly into the concepts of covariate shift, prior shift, and conditional shift. The latter is also commonly known as concept drift. These shifts are known to be the main causes of model drift (remember, a drop in the quality of predictions). The summarized definitions go as follows:

  • Covariate shift: Changes in the distribution of P(X) without necessarily having changes in P(Y|X). This means that the distribution of the input features changes, and some of those shifts may cause the model to drift.
  • Prior shift: Changes in the distribution of P(Y). Here, the distribution of the labels or the numerical output variable changes. Most likely, if the probability distribution of the output variable shifts, the current model will have a large uncertainty on the given prediction so it may easily drift.
  • Conditional shift (aka concept drift): The conditional distribution P(Y|X) changes. This means that, for a given input, the probability of the output variable has changed. As far as we know until now, this shift usually leaves us with very little room to keep up the quality of predictions. Is that really so?

Many sources exist listing examples of the occurrence of these dataset shifts. One of the core opportunities for research is detecting these types of shifts without the need for new labels [1, 2, 3]. Interesting metrics have recently been released to monitor the prediction performance of the model in an unsupervised way [2, 3]. They are indeed motivated by the different concepts of dataset shifts, and they reflect quite accurately the changes in the real probability distributions of the data. So we are going to dive into the theory of these shifts. Why? Because perhaps there’s some order we can bring to these definitions. By tidying up, we might be able to move forward more easily or simply understand this entire framework more clearly.

To do that, let’s go back to the beginning and make a slow derivation of the story. Grab a coffee, read slowly, and stay with me. Or just, don’t drift!

The real and the estimated model

The ML models we train attempt to get us close to a real, yet unknown, relationship or function that maps a certain input X to an output Y. We naturally distinguish the real unknown relationship from the estimated one. Yet, the estimated model is bound to the behavior of the real unknown model. That is, if the real model changes and the estimated model is not robust against these changes, the estimated model’s predictions will be less accurate.

The performance we can monitor deals with the estimated function but the causes of model drift are found in the changes of the real model.

  • What is the real model? The real model is founded in the so-called conditional distribution P(Y|X). This is the probability distribution of an output given an input.
  • What is the estimated model? This one is a function ê(x) which estimates the expected value of the distribution P(Y|X=x), that is, E(Y|X=x). This function is the one connected to our ML model.

Here’s a visual representation of these elements:

(Image by author)

Good, so now that we clarified these two elements, we’re ready to organize the ideas behind the so-called dataset shifts and how the concepts connect to each other.


The new arrangement

The global cause of model drift

Our main objective is to understand the causes of model drift for our estimated model. Because we already understood the connection between the estimated model and the conditional probability distribution, we can state here what we already knew before: The global cause for our estimated model to drift is the change in P(Y|X).

Basic and apparently easy, but more fundamental than we think. We assume our estimated model to be a good reflection of the real model. The real model is governed by P(Y|X). So, if P(Y|X) changes, our estimated model will likely drift. We need to mind the path we are following in that reasoning which we showed in the figure above.

We knew this already before, so what’s new about it? The new thing is that we now baptize the changes in P(Y|X) here as the global cause, not just a cause. This will impose a hierarchy with respect to the other causes. This hierarchy will help us nicely position the concepts about the other causes.

The specific causes: Elements of the global cause

Knowing that the global cause lies in the changes in P(Y|X), it becomes natural to dig into what elements constitute this latter probability. Once we have identified those elements, we will continue talking about the causes of model drift. So what are those elements?

We have known it always. The conditional probability is theoretically defined as P(Y|X) = P(Y, X) / P(X), that is, the joint probability divided by the marginal probability of X. But we can open up the joint probability once more and we obtain the magical formula we’ve known from centuries ago:

P(Y|X) = P(X|Y) · P(Y) / P(X)
(Image by author)

Do you already see where we’re going? The conditional probability is something that is fully defined by three elements:

  • P(X|Y): The inverse conditional probability
  • P(Y): The prior probability
  • P(X): The covariates’ marginal probability

Because these are the three elements that define the conditional probability P(Y|X), we are ready to give a second statement: If P(Y|X) changes, those changes come from at least one of the three elements that define it. Put differently, the changes in P(Y|X) are defined by any change in P(X|Y), P(Y), or P(X).

That said, we have positioned the other elements from our current knowledge as specific causes of model drift rather than just parallel causes to P(Y|X).

Going back to the beginning of this post, we listed covariate shift and prior shift. We note, then, that there’s yet another specific cause: the changes in the inverse conditional distribution P(X|Y). We usually find some mention of this distribution when talking about the changes in P(Y) as if in general we were considering the inverse relationship from Y to X [1,4].

The new hierarchy of concepts

(Image by author)

We can have now a clear comparison between the current thinking about these concepts and the proposed hierarchy. Until now, we have been talking about the causes of model drift by identifying different probability distributions. The three main distributions, P(X), P(Y), and P(Y|X) are known to be the main causes of drift in the quality of predictions returned by our ML model.

The twist I propose here imposes a hierarchy on the concepts. In it, the global cause of drift of a model that estimates the relationship X -> Y is the changes in the conditional probability P(Y|X). Those changes in P(Y|X) can come from changes in P(X), P(Y), or P(X|Y).

Let’s list some of the implications of this hierarchy:

  • We could have cases where P(X) changes, but if P(Y) and P(X|Y) also change accordingly, then P(Y|X) remains the same.
  • We can also have cases where P(X) changes, but if P(Y) or P(X|Y) doesn’t change accordingly, P(Y|X) will change. If you have given some thought to this topic before, you have probably seen that in some cases we could see X changing and those changes do not seem entirely independent of Y|X, so in the end, Y|X also changes. Here, P(X) is the specific cause of the changes in P(Y|X), which in turn is the global cause of our model drifting.
  • The previous two statements are true also for P(Y).

Because the three specific causes may or may not change independently, overall, the changes in P(Y|X) can be explained by the changes in these specific elements altogether. It can be because P(X) moved a bit here, and P(Y) moved a bit over there, then those two also make P(X|Y) change, which in the end altogether causes P(Y|X) to change.
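A made-up numerical example (the numbers are not from the article) shows the first case in action. Suppose in the reference data P(X) = 0.5, P(Y) = 0.4, and P(X|Y) = 0.5, so that

P(Y|X) = P(X|Y) · P(Y) / P(X) = 0.5 · 0.4 / 0.5 = 0.4

Now let production shift to P(X) = 0.6 and P(Y) = 0.48 while P(X|Y) stays at 0.5:

P(Y|X) = 0.5 · 0.48 / 0.6 = 0.4

Both the covariate and prior distributions moved, yet the conditional probability stayed exactly the same. If instead P(Y) had stayed at 0.4 while P(X) rose to 0.6, we would get P(Y|X) = 0.5 · 0.4 / 0.6 ≈ 0.33: a conditional shift driven by the covariate shift.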

P(X) and P(Y|X) are not to be thought of independently; P(X) is a cause of changes in P(Y|X).

Where is the estimated ML model in all this?

Ok, now we know that the so-called covariate and prior shifts are causes of conditional shift rather than parallel to it. Conditional shifts encompass the set of specific causes for prediction performance degradation of the estimated model. But the estimated model is rather a decision boundary or function, not really a direct estimation of the probabilities at play. So what do the causes mean for the real and estimated decision boundaries?

Let’s gather all the pieces and draw the complete path connecting all the elements:

(Image by author)

Note that our ML model can come about analytically or numerically. Moreover, it can come as a parametric or non-parametric representation. So, in the end, our ML models are an estimation of the decision boundary or regression function which we can derive from the expected conditional value.

This fact has an important implication for the causes we have been discussing. While most of the changes happening in P(X), P(Y), and P(X|Y) will imply changes in P(Y|X) and so in E(Y|X), not all of them necessarily imply a change in the real decision boundary or function. In that case, the estimated decision boundary or function will remain valid if it was originally an accurate estimate. Look at this example below:

(Image by author)
  • See that P(Y) and P(X) changed. The density and location of the points account for a different probability distribution
  • These changes make P(Y|X) change
  • However, the decision boundary remained valid

Here’s one important bit. Imagine we are looking at the changes in P(X) only, without information about the real labels. We would like to know how good the predictions are. If P(X) shifts towards areas where the estimated decision boundary has a large uncertainty, the predictions are likely inaccurate. So in the case of a covariate shift towards uncertain regions of the decision boundary, most likely a conditional shift is also happening. But we would not know if the decision boundary is changing or not. In that case, we can quantify a change occurring at P(X), which can indicate a change in P(Y|X), but we would not know what is happening to the decision boundary or regression function.

So now that we have said all this, it’s time for yet one more statement. We talk about conditional shift when we refer to the changes in P(Y|X). It’s possible that what we have been calling concept drift refers specifically to the changes in the real decision boundary or regression function. See here below a typical example of a conditional shift with a change in the decision boundary but without a covariate or prior shift. In fact, the change came from the change in the inverse conditional probability P(X|Y).

(Image by author)

Implications for our current monitoring methods

We care about understanding these causes so we can develop methods to monitor the performance of our ML models as accurately as possible. None of the proposed ideas is bad news for the available practical solutions. Quite the opposite: with this new hierarchy of concepts, we might be able to push further our attempts to detect the causes of model performance degradation. We have methods and metrics that have been proposed to monitor the prediction performance of our models, mainly proposed in light of the different concepts we have listed here. However, it is possible that we have mixed up the concepts in the assumptions behind these metrics [2]. For example, we might have been referring to an assumption as "no conditional shift", when in reality it may be specifically "no change in the decision boundary" or "no change in the regression function". We need to keep thinking about this.

More about prediction performance degradation

Zooming in and zooming out. We have dived into the framework for thinking about the causes of prediction performance degradation. But there is another dimension to this topic: the types of prediction performance shifts. Our models suffer because of the listed causes, and those causes are reflected as different shapes of prediction misalignment. We find mainly four types: Bias, Slope, Variance, and Non-linear shifts. Check out this post to find out more about this other side of the coin.

Summary

We studied in this post the causes of model performance degradation and proposed a framework based on the theoretical connections of the concepts we already knew before. Here are the main points:

  • The probability P(Y|X) governs the real decision boundary or function.
  • The estimated decision boundary or function is assumed to be the best approximation to the real one.
  • The estimated decision boundary or function is the ML model.
  • The ML model can experience prediction performance degradation.
  • That degradation is caused by changes in P(Y|X).
  • P(Y|X) changes because there are changes in at least one of these elements: P(X), P(Y), or P(X|Y).
  • There can be changes in P(X) and P(Y) without having changes in the decision boundary or regression function.

The general statement is: if the ML model is drifting, then P(Y|X) is changing. The reverse is not necessarily true.

This framework of concepts is hopefully a seed for the crucial topic of ML prediction performance degradation. While the theoretical discussion is simply a delight, I trust that this connection will help us push further the aim of measuring these changes in practice while optimizing for the required resources (samples and labels). Please join the discussion if you have other insights to contribute.

What’s causing your model to drift in prediction performance?

Have a happy thinking!

References

[1] https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html

[2] https://www.sciencedirect.com/science/article/pii/S016974392300134X

[3] https://nannyml.readthedocs.io/en/stable/how_it_works/performance_estimation.html#performance-estimation-deep-dive

[4] https://medium.com/towards-data-science/understanding-dataset-shift-f2a5a262a766

Prediction Performance Drift: The Other Side of the Coin
https://towardsdatascience.com/prediction-performance-drift-the-other-side-of-the-coin-4bd3c7334b70/
We know the causes, let's talk about the types

The world of Machine Learning has moved and grown so fast that in less than two decades we are already at the next stage. The models are built, and now we need to know if they provide accurate predictions in the short, medium, and long term. So many methods, theoretical approaches, schools of thought, paradigms, and digital tools are in our pockets when it comes to building our models. Now then, we want to understand better what’s at stake when it comes to prediction performance in time.

One may think that the prediction performance of the model is guaranteed by the quality measured on a test set, a separate set of samples that were not involved in training the model at all, yet were measured or observed at that same moment. The reality is that, while the model may have been delivered with a certain (hopefully very satisfactory) prediction performance based on that test set, the observations that will come in time will not fully share the same properties or trends as the training and test sets. Recall that the models are machines, not thinking entities. If they see something new for which they were not trained, they cannot be critical of these new things. We, humans, can.

Prediction performance drift: Side A

The way in which we have been thinking about prediction performance involves separating the variability distributions that influence a model. There’s the distribution of the input features P(X), the distribution of the labels P(Y), and the conditional distribution of the labels given the inputs P(Y|X). Any changes in any of these components are potential causes for the model to make more inaccurate predictions with respect to its original performance. With this in mind, if we monitor these components and something in them changes at some point, well, that’s a moment for a deep check-up to see what’s happening and fix it if necessary.

Prediction performance drift: Side B

While the previous point is unquestionable, we can also look at another dimension when it comes to prediction performance drift: the types of drift. When we look at the prediction performance, it’s possible to discriminate between different types of drift. Now, why would this be of any use? Well, not only will we detect that something is changing, but we’ll also more easily know how to fix it.

There can essentially be 4 types of drift: Bias, Slope, Variance, and Non-linearities. While these names might hint more particularly at a linear regression model, they apply to the general menu of machine learning models. More generalized names for these types could be Constant shift, Rotation with respect to prediction boundary, Dispersion collapsing with boundary, and Change of boundary shape. We may agree that the latter names are long and time-consuming to write and talk about, so let’s stick to the former names. And while doing so, let’s not confuse any of the types with the nature of the model. Any of the 4 types of drifts can occur in any type of model.

Types of drift for regression models (Image by author)

By checking for the different types of drift, not only will we detect that something is changing, but we’ll also more easily know how to fix it.

Side B: The 4 types

A bias drift refers to a constant shift. In a regression model, a bias drift is happening if we observe that our predictions are a constant away from the observed values. In a classification model, a bias drift will happen if the features for each class have a constant shift, which may also be class-specific. This type of drift may simply happen because of a change in the average of the population without any other changes in the relationships among the features and their impact on the target variable. Being a simple change, it may be easily repairable by re-shifting the bias of the model or the average in the sample.
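As a rough sketch of what "re-shifting the bias" could look like for a regression model (the model and variable names here are assumptions, not something prescribed by the article), one can estimate the constant offset on a recent labeled sample and subtract it from future predictions:

import numpy as np

def correct_bias(model, X_recent, y_recent):
    # A bias drift shows up as a non-zero mean residual on recent labeled data
    residuals = model.predict(X_recent) - y_recent
    offset = np.mean(residuals)

    def corrected_predict(X):
        # Shift predictions back by the estimated constant offset
        return model.predict(X) - offset

    return corrected_predict, offset

If the mean residual is close to zero, no correction is needed; a clearly non-zero offset is the signature of this type of drift.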

A slope drift happens when there’s a rotation with respect to the center of the regression function or decision boundary. This type of drift can be very common in learning tasks involving images that are taken from different angles while being essentially the same image. Depending on the degree of rotation and whether the rotation is class-specific for classification tasks, a slope drift may be fixed by a simple adjustment of the rotation or by completely retraining the model.

Example of types of drift for a linear classifier (Image by author)

A variance drift, as its name suggests, is an increase in dispersion. It becomes damaging to the prediction performance if the dispersion pushes the samples across the task boundaries. In the case of regression, a variance drift manifests as larger residuals. In the case of classification, this drift may be an expansion of the original dispersion of the features in each class. Here again, this drift may be class-specific. This drift may occur due to different causes: it may correspond to heavier sampling close to the decision boundaries or an uneven shift of the input features. If due to sampling, this problem might be solved by recalculating the model with updated data that is more representative of the current sampling.

Example of types of drift for a non-linear classifier (Image by author)

The last type of drift is called non-linear drift. While it looks very specific and systematic in the examples shown, this type of drift may contain all kinds of combinations of the aforementioned three types. The famous case when teaching artificial neural networks involves 2 input features to separate 2 classes. In a 2D space, the 2 classes may be separated by a linear function, but the simple rotation of 2 points induces a non-linear change. If the classes can no longer be separated by one line, well, the problem has become non-linear. Just as in the other cases, for classification tasks, this drift may also be class-specific as shown in the linear classification example above. As this is the most complex type of drift, the non-linear patterns that may be observed in the prediction quality of the model may always force the retraining of the model.

Example from linear to non-linear classification (Image by author)

When is the prediction performance degrading?

Drifts are not always problematic. Let’s dive a bit into this. A change in the center of X without a change in the center of Y, or vice-versa, will lead to a bias drift. However, if both X and Y move in accordance with the model function, there won’t be a prediction performance drift, as the model will continue making accurate predictions. When it comes to the slope, while for regression there’s less room for rotation, in classification tasks the classes may rotate without damaging the performance when this rotation does not collapse with the decision boundary. This may be a very theoretical fact with little incidence in practice, but hey, that’s how it goes.

Variance drifts may be more realistic for classification problems than slope drifts. The model can handle a variance drift as long as the dispersion is not collapsing with the decision boundary. When the points start crossing decision boundaries, the prediction performance is compromised. The same is true for regression models, although, just as in the case of slope drift, there’s very little tolerance for dispersion expansion without accounting for prediction degradation.

Non-linear drifts are the least permissive. These changes might always be problematic. Models can be robust to a lot of changes corresponding to the previous drifts, but they are designed to handle a specific task with a specific shape. That shape is the representation of the concept that we created the model to predict, and if that shape starts changing, the whole concept might be changing. So here it’s hard to think of cases in which a non-linear change is not problematic.

Let’s keep flipping the coin

So here we have a new view for thinking about the degradation in the prediction performance of our models. Now that we are equipped with not only the causes but also the types, we can keep flipping the coin when monitoring the prediction quality of our models. That way, we not only get information about what is changing and when, but also why it’s changing as well as how it is changing.

Are we going to end up formulating the "who" and "where"?

3 Tested Techniques to Recover Your Failing Models
https://towardsdatascience.com/3-tested-techniques-to-recover-your-failing-models-67c070fb591/
Models that once did well may start to deteriorate. This is how you could stay on course.

There’s only one constant in the universe – CHANGE!

Everything you see and sense was a version of itself a moment ago – the data you used to fit your models are no exception.

A perfect model over time fails to predict as it used to. It’s not a bug; that’s the way ML models work. It’s the ML engineer’s job to expect it.

An ML model in production is very much the same as whitewater rafting. Worse yet – it is a river no one had traveled before. You must have a sense of the current to expect what’s ahead. Ignoring the dangers may lead your raft to a cliff.

Translating it to MLOps, you must actively track the underlying data changes before your model starts falling off a cliff – before they make inaccurate predictions.

This phenomenon is popularly known as the "Concept drift."

Since not all the changes are unique, handling them requires different strategies. For instance, social behavior, such as divorce rates, shows sluggish changes. On the other hand, stock markets often get shocks creating and destroying millionaires overnight.

And there are also cyclical changes. The change you experience now will likely reverse in the coming years.

Your domain may show one of these types of change. And you should be prepared before they occur.

Why Machine Learning Models Die In Silence?

Here are three techniques ML engineers often use to fight concept drift.

Have a retraining schedule for your models.

Retraining is a straightforward concept to understand. In every cycle, you get a new set of data. You can use them for training the model once again from scratch.

Given the simplicity, you may be tempted to use it often. But, retraining is a costly operation. You benefit from retraining only when your models are cheap and fast to train.

You could take one of three approaches to optimize the cost and training time. You can either retrain the model with…

  • your new data only;
  • all your past data, or;
  • the full dataset with more weight on the latest data points.

Retraining models with recent data only.

Though abandoning old data and retraining entirely on new data may sound counterintuitive, it is helpful in many situations, especially if you’re sure that old data are obsolete and only create noise in the model.

Short-term financial market predictions are a good example. Because there are numerous bot trading accounts, if there’s a pattern, soon all bots learn them. As a result, the pattern quickly becomes obsolete. Your model may have to forget the outdated patterns and find new ones in your new dataset.

This way of retraining is less costly as it involves little data. But the model loses the context from its prior learnings.

Go for it only when your old data is undoubtedly irrelevant.

You should be aware of model overfitting when using a smaller dataset. While retraining on only the most recent data may seem cheap and fast, if your model has a lot of parameters, it may overfit to those few data points.

Retrain on all your past data

You can’t discard what you already have when you need to preserve existing knowledge. Therefore, you may have to retrain the model with all your past data.

Weather forecasts, for instance, go through changes for sure. We talk about climate change more seriously for this reason. But the transition is gradual. Your model can benefit from past data while learning the changing patterns from new data.

The downside of this training approach is the cost. It’s no surprise that with more data points, you incur more costs – both for storage and training.

But it may be inevitable if your model has numerous parameters. To avoid overfitting, your dataset has to be sufficiently large.

Retrain with more weight on recent data points

You can use this approach when your dataset needs to stay large, yet you want to account for data aging.

You could use weights when you need to retrain the model on your entire dataset yet want to give more importance to the latest data. You can assign more weight to newer data points and lower weights to older ones.

But for this to work, your algorithm should support sample weighting. If it doesn't, you can resample the dataset, drawing recent data points with higher probability.
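As a minimal sketch of this idea, assume a chronologically sorted pandas DataFrame `df` with a `timestamp` column, feature columns listed in `feature_cols`, a binary `label` column, and an exponential decay with a 90-day half-life. None of these names or values come from the original post; they only illustrate the weighting scheme.

```python
# Hypothetical weighted retraining: newer observations get exponentially more weight.
import numpy as np
from sklearn.linear_model import LogisticRegression

# df is assumed to be a chronologically sorted DataFrame with a "timestamp" column.
age_in_days = (df["timestamp"].max() - df["timestamp"]).dt.days
half_life_days = 90  # an older point loses half its influence every 90 days (assumption)
sample_weight = np.power(0.5, age_in_days / half_life_days)

model = LogisticRegression(max_iter=1000)
model.fit(df[feature_cols], df["label"], sample_weight=sample_weight)
```

Any estimator that accepts a `sample_weight` argument in `fit` could be dropped in here; the half-life simply controls how quickly old observations lose influence.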

Regularly update your old model with new data.

This may sound similar to the previous strategy. But there’s a subtle difference. Here we don’t retrain the model from scratch as we did before. We update the already trained model with our new dataset.

Your algorithm should support initializing from previous weights and training in batches. Deep neural networks are a natural fit.

Updating models may be more cost-efficient than retraining from scratch. But this depends on the size of your model and of the input data. It may not be a great option if each update involves plenty of data points and your model has millions of parameters.

On the flip side, this method is suitable if you must preserve past knowledge while adapting to new changes.
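As a rough illustration, here is how incremental updates could look with scikit-learn's `partial_fit`, used here only as a stand-in for any algorithm that supports warm starts and batch training; the variable names are placeholders.

```python
# Hypothetical incremental update: the existing model is refreshed with a new batch,
# not retrained from scratch.
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
model.partial_fit(X_initial, y_initial, classes=[0, 1])  # initial training

# Later, when a new batch of production data arrives:
model.partial_fit(X_new_batch, y_new_batch)
```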

Expanding your models with new components.

Building on the two previous strategies, it may be beneficial to retrain only part of a model and keep the rest fixed.

More advanced techniques such as transfer learning and model ensemble make this possible.

In transfer learning, you take a pre-trained neural network and retrain only part of it. Instead of retraining the entire model, you unfreeze only the network's last layer (or last few layers) and train it on your new data. The model does not forget everything it learned before, yet the most recent data plays a critical role in the final decision.
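A rough sketch of this fine-tuning step, assuming Keras as the framework; the model file name, learning rate, and training data variables are all illustrative, not taken from the original post.

```python
# Hypothetical fine-tuning: freeze all layers except the last one, then train on new data.
import tensorflow as tf

model = tf.keras.models.load_model("model_v1.keras")  # previously trained network (placeholder name)

for layer in model.layers[:-1]:
    layer.trainable = False          # keep earlier layers, and their knowledge, fixed
model.layers[-1].trainable = True    # unfreeze only the final layer

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_new, y_new, epochs=5, batch_size=256)
```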

You could also use a new model to refine your predictions. We call such a group an ensemble of models. While your initial model stays untouched, a new model sits downstream of it: it learns from the new data and adjusts the behavior of the original model.
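One possible interpretation of this setup, sketched with scikit-learn: a second "corrector" model learns from the original model's score plus the raw features, trained on new data only. `original_model`, `X_new`, and `y_new` are assumed to exist; nothing here is prescribed by the original post.

```python
# Hypothetical ensemble: a corrector model adjusts the untouched original model's output.
import numpy as np
from sklearn.linear_model import LogisticRegression

original_score = original_model.predict_proba(X_new)[:, 1]   # original model stays untouched
stacked_features = np.column_stack([X_new, original_score])

corrector = LogisticRegression(max_iter=1000)
corrector.fit(stacked_features, y_new)                        # trained on new data only

def predict(X):
    """Final prediction: the corrector refines the original model's score."""
    score = original_model.predict_proba(X)[:, 1]
    return corrector.predict_proba(np.column_stack([X, score]))[:, 1]
```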

Sometimes, doing nothing is okay!

You’ve read it correctly.

Doing nothing means you train a model once and use it for a prolonged period. Such a model is known as a static model.

Static models are attractive for a couple of reasons.

You need a yardstick to measure anything. For data drift, a static model is that yardstick: by comparing the data it sees in production against the data it was trained on, you can identify drift as it happens.

Also, as you make changes to your existing model, you need a benchmark to evaluate them against. Static models are a perfect baseline for that as well.

Yet, you must be careful with static models. If you're using one in production, you should be sure that data and model drift don't affect it too much.

Final thoughts

If you think the job of a data scientist or ML engineer is over when the model is deployed, you’re making a risky bet.

It’s just the beginning.

Models in production deteriorate in performance. It’s known as concept drift, a widely researched challenge in Machine Learning.

Depending on the situation, you can use various techniques to ensure the continuous best performance of your models.

In this post, we’ve discussed five strategies often used by ML engineers to fight concept drift.

You can retrain, update or extend the model! Each has its benefits and drawbacks.

What would you do when your models fail to predict accurately?

Thanks for reading, friend! Say Hi to me on LinkedIn, Twitter, and Medium.

The post 3 Tested Techniques to Recover Your Failing Models appeared first on Towards Data Science.

]]>
Getting a Grip on Data and Model Drift with Azure Machine Learning https://towardsdatascience.com/getting-a-grip-on-data-and-model-drift-with-azure-machine-learning-ebd240176b8b/ Tue, 07 Jun 2022 10:50:08 +0000 https://towardsdatascience.com/getting-a-grip-on-data-and-model-drift-with-azure-machine-learning-ebd240176b8b/ Detect, analyze, and mitigate data and model drift in an automated fashion

The post Getting a Grip on Data and Model Drift with Azure Machine Learning appeared first on Towards Data Science.

]]>
By Natasha Savic and Andreas Kopp

Change is the only constant in life. In machine learning, it shows up as drift of data, model predictions, and decaying performance, if not managed carefully.

Photo by serjan midili on Unsplash

In this article, we discuss data and model drift and how they affect the performance of production models. You will learn methods to identify and mitigate drift, as well as MLOps best practices to transition from static models to evergreen AI services using Azure Machine Learning.

We also include a sample notebook if you want to try out the concepts in practical examples.


Understanding data and model drift

Many machine learning projects conclude after a phase of extensive data and feature engineering, modeling, training, and evaluation with a satisfactory model that is deployed to production. However, the longer a model is in operation, the more problems can creep in that might remain undetected for quite a long time.

Data drift and performance degradation due to model drift

Data Drift means that distributions of input data change over time. Drift can lead to a gap between what the model has initially learned from the training data and the inferencing observations during production. Let’s look at a few examples of data drift:

  • Real-world changes: an originally small demographic group increasingly appears in the labor market (e.g., war refugees); new regulatory frameworks come into play influencing user consent (e.g., GDPR)
  • Data acquisition problems: incorrect measurements due to a broken IoT sensor; an initially mandatory input field of a web form becomes optional for privacy reasons
  • Data engineering problems: unintended coding or scaling changes or swap of variables

Model drift is accompanied by a decrease in model performance over time (e.g., accuracy drop in a supervised classification use case). There are two main sources of model drift:

  • Real-world changes are also referred to as concept drift: The relationship between features and target variables has changed in the real world. Examples: the collapse of travel activity during a pandemic; rising inflation impacting buying behavior.
  • Data drift: The described drift of input data might also affect model quality. However, not every occurrence of data drift is necessarily a problem. When drift occurs on less important features the model might respond robustly, and performance is not affected. Let us assume that a demographic cohort (a specific combination of age, gender, and income) occurs more often during inferencing than seen during training. It won’t cause headaches if the model still predicts the outcomes for this cohort correctly. It is more problematic if the drift leads the model into less populated and/or more error-prone areas of the feature space.

Model drift typically stays undetected until new ground truth labels are available. The original test data is no longer a reliable benchmark because the real-world function has changed.

The following illustration summarizes the various kinds of drift:

Types of drift. Adapted from Data and concept drifts in machine learning | Towards Data Science

The transition from normal behavior to drift can take vastly different forms. Demographic changes in the real world typically lead to gradual data or model drift. However, a broken sensor might cause abrupt deviations from the normal range. Seasonal fluctuations in buying behavior (e.g., the Christmas season) manifest as recurring drift.

If we have timestamps for our observations (or the data points are at least arranged chronologically), several methods can be used to detect, analyze, and mitigate drift.

We will describe these methods in more detail below and experiment with them using a predictive maintenance case study.

From static to evergreen models


The options to analyze and mitigate data and model drift depend on the availability of current data over the Machine Learning model’s lifecycle.

Let us assume that a bank collected historical data to train a model to support credit lending decisions. The goal is to predict whether a loan application should be approved or rejected. Labeled training data was collected in the period from January to December 2020.

The bank’s data scientists have spent the first quarter of 2021 training and evaluating the model and decided to bring it to production in April 2021. Let us look at three options the team can use for collecting production data:

Good drift management depends on data availability

Scenario 1: Static model

Here, the team doesn’t collect any production data. Perhaps they did not consider this at all since their project scope only covered delivering the initial model. Another reason could be open data privacy questions (storing regulated personal data).

Obviously, there is not much that can be done to detect data or model drift beyond analyzing the historical training data. Drift may only be uncovered when model users start complaining that the predictions are becoming increasingly unsuitable for business decisions. However, since feedback is not systematically collected, gradual drift will likely remain undiscovered for a long time.

Interestingly, many production machine learning models run in this mode today. However, machine learning lifecycle management practices like MLOps are gaining traction to address exactly these issues.

The static model approach might be acceptable if the model is trained on representative data and the feature/target relationship is stable over time (e.g., biological phenomena which change at an evolutionary pace).

Scenario 2: Collecting production data

The team decides to collect observed input data (features) from the production phase together with the corresponding model predictions.

This approach is straightforward to implement if there are no data protection concerns or other organizational hurdles. By comparing the recent production data with original training observations, drift in features and predicted labels can be found. Significant shifts in key features (in terms of feature importance) can be used as a trigger for further investigation.

However, essential information is missing to find out if there is a problem with the model: we do not have new ground truth labels to evaluate the production predictions. This might lead to the following situations:

  • Virtual drift (false positive): We observe data drift, but the model still works as desired. This may get the team to acquire new labeled data for retraining although it is unnecessary (from a model drift perspective).
  • Concept drift (false negative): While there is no drift in the input data, the real-world function has moved away from what the model had learned. Hence, an increasingly outdated model leads to inaccurate business decisions.

Scenario 3: Evergreen model

In this scenario, the bank not only analyzes production input and predictions for potential drift but also collects labeled data. Depending on the business context, this can be done in one of the following ways:

  • Business units contribute newly labeled data points (as was done for the initial training)
  • Human-in-the-loop feedback: The model predictions from the production phase are systematically reviewed. False approvals and false rejections found by domain experts are collected for retraining, together with the corresponding features and the corrected labels.

Incorporating human-in-the-loop feedback requires adjustment of processes and systems (e.g., business users can overwrite or flag incorrect predictions in their applications).

The main advantages are that concept drift can be identified with high reliability and the model can regularly be refreshed by retraining.

Incorporating business feedback and regular retraining is an essential part of mature MLOps practices (see our reference architecture example for Azure Machine Learning below).

Data and model drift management in practice

It is essential to have a detection mechanism that measures drift systematically. Ideally, such a mechanism is part of an integrated MLOps workflow that compares training and inference distributions on a continuous basis. We have compiled several mechanisms that support data and model drift management.

We are using a predictive maintenance use case based on a synthetic dataset in our sample notebook. The goal is to predict equipment failure based on features like speed or heat deviations, operator, assembly line, days since the last service, etc.

To identify drift, we combine statistical techniques and distribution overlaps (data drift) as well as predictive techniques (model/concept drift). For both drift types, we will briefly introduce the method used.

Drift detection starts by partitioning a dataset of chronologically sorted observations into a reference and current window. The reference (or baseline) window represents older observations and is often identical to the initial training data. The current window typically reflects more recent data points seen in the production phase. This is not a strict 1:1 mapping as it might be needed to adjust the windows to better locate when drift occurred.
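As a minimal sketch, the windowing step could look as follows, assuming a chronologically sorted dataset with a `timestamp` column; the file name and split date are placeholders, not taken from the notebook.

```python
# Hypothetical partitioning of a chronological dataset into reference and current windows.
import pandas as pd

df = pd.read_csv("observations.csv", parse_dates=["timestamp"])  # placeholder file name
df = df.sort_values("timestamp")

split_date = pd.Timestamp("2021-04-01")            # e.g., the model's go-live date (assumption)
reference_df = df[df["timestamp"] < split_date]    # baseline window, often the training data
current_df = df[df["timestamp"] >= split_date]     # recent production observations
```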

Partitioning the chronological dataset into reference and current windows

We first need to differentiate between numerical and categorical/discrete data. For the statistical tests, both types of data will undergo distinct non-parametric tests that provide a p-value. We are handling different sample sizes and do not make assumptions about the actual distribution of our data. Therefore, non-parametric approaches are a handy way to test the similarity of two samples without needing to know the actual probability distribution.

Those tests allow us to reject (or fail to reject) the null hypothesis that both samples come from the same distribution, with a degree of confidence defined by the p-value. As such, you can control the sensitivity of the test by adjusting the p-value threshold. We recommend a conservative default such as 0.01. Keep in mind that the larger your sample gets, the more prone the test is to pick up on noise. Other commonly used methods to quantify the drift between distributions are the Wasserstein distance for continuous variables and the Jensen-Shannon divergence for probability distributions.
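A minimal sketch of such tests with SciPy, using the Kolmogorov-Smirnov test for numerical features and the chi-square test for categorical ones; the threshold and function names are assumptions rather than the notebook's exact code.

```python
# Hypothetical two-sample drift tests for a single feature.
import numpy as np
from scipy import stats

P_VALUE_THRESHOLD = 0.01  # conservative default, as recommended above

def numerical_drift(reference: np.ndarray, current: np.ndarray) -> bool:
    """Two-sample Kolmogorov-Smirnov test for a numerical feature."""
    _, p_value = stats.ks_2samp(reference, current)
    return p_value < P_VALUE_THRESHOLD  # True: reject H0, i.e., drift detected

def categorical_drift(reference, current) -> bool:
    """Chi-square test on the contingency table of a categorical feature."""
    categories = sorted(set(reference) | set(current))
    table = [[list(reference).count(c) for c in categories],
             [list(current).count(c) for c in categories]]
    _, p_value, _, _ = stats.chi2_contingency(table)
    return p_value < P_VALUE_THRESHOLD

# The magnitude of drift for a numerical feature can additionally be quantified with
# stats.wasserstein_distance(reference, current).
```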

Here are some best practices to limit the number of false alarms in drift detection:

  1. Scope the drift analyses to a shortlist of key features if you have many variables in your dataset
  2. Use a sub-sample instead of all data points if your dataset is large
  3. Reduce the p-value threshold further or select an alternative test for larger data volumes

While statistical tests are useful to identify drift, it is hard to interpret the magnitude of the drift as well as in which direction it occurs. Given a variable like age, did the sample get older or younger and how is the age spread? To answer those questions, it is useful to visualize the distributions. For this, we add another non-parametric method: the Kernel Density Estimation (KDE). Since we have two different data types, we will perform a pre-processing step on the categorical data to convert it into a pseudo-numerical notation by encoding the variables. The same ordinal encoder object is used for both the reference and the current distributions to ensure consistency:
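A sketch of this step with scikit-learn's `OrdinalEncoder`; the categorical column names are placeholders, and `reference_df`/`current_df` denote the two windows introduced above (the notebook's actual code may differ).

```python
# Hypothetical encoding of categorical features so they can be passed to the KDE functions.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

categorical_cols = ["operator", "assembly_line"]  # placeholder column names

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
# Fit once on the combined categories so both windows share the same mapping
encoder.fit(pd.concat([reference_df[categorical_cols], current_df[categorical_cols]]))

reference_encoded = reference_df.copy()
current_encoded = current_df.copy()
reference_encoded[categorical_cols] = encoder.transform(reference_df[categorical_cols])
current_encoded[categorical_cols] = encoder.transform(current_df[categorical_cols])
```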

Now that we have encoded our data, we can visually inspect either the entire dataset or selected variables of interest. We use the Kernel Density Estimation functions of the reference and current samples to compute their intersection in percent. The following steps were adapted from this sample; a simplified code sketch follows the list:

  1. Pass the data into a KDE function as in scipy.stats.gaussian_kde() with the bandwidth method "scott". The parameter determines the degree of smoothing of the distribution.
  2. Take the range (min and max of both distributions) and compute the intersection points of both KDE functions within this range by using the differential of both functions.
  3. Perform pre-processing for variables that have a constant value to avoid errors and align the scale of both distributions.
  4. Approximate the area under the intersection points using the composite trapezoidal rule as per numpy.trapz().
  5. Plot the reference and current distributions, intersection points as well as the area of intersection with the percentage of overlap.
  6. Add the respective statistical test (KS or Chi-Square) to the title and provide a drift indication (Drift vs. No Drift)
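As a simplified sketch of steps 1 through 4, the overlap can also be obtained by integrating the pointwise minimum of the two density curves, which yields the same area as intersecting the functions explicitly; all variable names below are illustrative, and constant-valued variables would still need the special handling mentioned in step 3.

```python
# Hypothetical KDE overlap computation for one numerical (or encoded categorical) variable.
import numpy as np
from scipy.stats import gaussian_kde

def kde_intersection(reference: np.ndarray, current: np.ndarray, n_points: int = 1000) -> float:
    """Approximate the overlap (0..1) between the KDE curves of two samples."""
    kde_ref = gaussian_kde(reference, bw_method="scott")
    kde_cur = gaussian_kde(current, bw_method="scott")

    # Evaluate both densities over the combined range of the two samples
    grid = np.linspace(min(reference.min(), current.min()),
                       max(reference.max(), current.max()), n_points)
    density_ref = kde_ref(grid)
    density_cur = kde_cur(grid)

    # Area shared by both curves, via the composite trapezoidal rule
    return float(np.trapz(np.minimum(density_ref, density_cur), grid))
```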

The steps can be understood in more depth by checking out the code samples. The result of the KDE intersections looks as follows:

Comparing KDE intersections to identify data drift

A brief inspection of the plots tells us:

  1. Which variables have significantly different distributions between the reference and current sample?
  2. What is the magnitude and direction of the drift?

If we look at the variable "operator", we can see a substantial change in which employees operate the machines between the time the model was fitted and today. For example, it could have happened that an operator has retired and, consequently, no longer operates any machines (operator 2). Conversely, we see that new operators have joined who were not present before (e.g., operator 6).

Now that we have seen how to uncover drift in features and labels, let us find out if the model is affected by data or concept drift.

Similar to before, we stack the historical (training) observations and recently labeled data points together in a chronological dataset. Then, we compare model performance based on the most recent observations, as the following overview illustrates.

Predictive model drift detection

At the core, we want to answer the question: Does a newer model perform better in predicting on most recent data than a model trained on older observations?

We probably have no idea in advance if and where drift has crept in.

Therefore, in our first attempt, we might use the original training data as the reference window and the inference observations as the current window. If we find drift this way, we can then try different reference and current windows to pinpoint exactly where it crept in.

A couple of options exist for predictive model drift detection:

Each option has specific advantages. As a result of the first alternative, you already have a trained candidate ready for deployment if model 2 outperforms model 1. The second option will likely reduce false positives, at the expense of being less sensitive to drift. Alternative 3 reduces unnecessary training cycles in cases where no drift is identified.

Let’s check out how to use option 1 to find drift in our predictive maintenance use case. The aggregated dataset consists of 45,000 timestamped observations, which we split into 20,000 reference, 20,000 current, and 5,000 most recent observations for the test.

We define a scikit-learn pipeline to preprocess numerical and categorical features and train two LightGBM classifiers for comparison:
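A hypothetical sketch of such a pipeline; the column names are placeholders loosely based on the features described earlier, and the hyperparameters are not taken from the notebook.

```python
# Hypothetical preprocessing + LightGBM pipeline, trained once per window.
from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_cols = ["speed_deviation", "heat_deviation", "days_since_service"]  # placeholders
categorical_cols = ["operator", "assembly_line"]                              # placeholders

def make_classifier() -> Pipeline:
    preprocessing = ColumnTransformer([
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([
        ("preprocessing", preprocessing),
        ("model", LGBMClassifier(random_state=42)),
    ])

# One classifier per window; both will be evaluated on the 5,000 most recent observations
reference_clf = make_classifier().fit(X_reference, y_reference)
current_clf = make_classifier().fit(X_current, y_current)
```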

We then compare the performance metrics of the reference and current classifiers on the most recent observations to find out if there is a noticeable gap between the models:
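For instance, the comparison could look roughly like this, with assumed metric choices and `X_test_recent`/`y_test_recent` standing in for the 5,000 most recent observations.

```python
# Hypothetical side-by-side evaluation of both classifiers on the most recent data.
from sklearn.metrics import f1_score, roc_auc_score

for name, clf in [("reference", reference_clf), ("current", current_clf)]:
    y_pred = clf.predict(X_test_recent)
    y_score = clf.predict_proba(X_test_recent)[:, 1]
    print(f"{name} model | F1: {f1_score(y_test_recent, y_pred):.3f} "
          f"| ROC AUC: {roc_auc_score(y_test_recent, y_score):.3f}")
```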

The current model outperforms the reference model by a large margin. Therefore, we can conclude that we indeed have identified model drift and that the current model is a promising candidate for replacing the production model. A visual way of inspection is to compare the distributions of confidence scores of both classifiers:

Model drift impact on predicted class probabilities (left) with KDE intersection (right)

The histograms on the left show a clear difference between the predicted class probabilities of the reference and current models and therefore also confirm the existence of model drift.

Finally, we reuse our KDE plots and statistical tests from the data drift section to measure the extent of the drift. The intersection between the KDE plots for both classifiers amounts to only 85%. Furthermore, the results of the KS test suggest that the distributions are not identical.

In this example, the results were to be expected because we intentionally built drift into our synthetic predictive maintenance dataset. With a real-world dataset, results won’t always be as obvious. Also, it might be necessary to try different reference and current window splits to reliably find model drift.

MLOps reference architecture for evergreen models

We will now focus on embedding drift detection and mitigation in an MLOps architecture with Azure Machine Learning. The following section leverages concepts such as Azure ML Datasets, Models, and Pipelines. The demo repository provides an example of an Azure ML pipeline for generating the data drift detection plots. To automate the re-training of models, we recommend using the latest Azure MLOps code samples and documentation. The following illustration provides a sample architecture including everything we learned about data and model drift so far.

MLOps architecture for evergreen models

By considering drift mitigation as part of an automated Azure MLOps workflow, we can maintain evergreen ML services with manageable effort. To do so, we perform the following steps:

  1. Ingest and version data in Azure Machine Learning: This step is crucial to maintain a lineage between the training data, machine learning experiments, and the resulting models. For automation, we use Azure Machine Learning pipelines which consume managed datasets. By specifying the version parameter (version="latest") you can ensure to obtain the most recent data (see the sketch after this list).
  2. Train model: In this step, the model is trained on the source data. This activity can also be part of an automated Azure Machine Learning pipeline. We recommend adding a few parameters like the dataset name and version to re-use the same pipeline object across multiple dataset versions. By doing so, the same pipeline can be triggered in case model drift is present. Once the training is finished, the model is registered in the Azure Machine Learning model registry.
  3. Evaluate model: Model evaluation is part of the training/re-training pipeline. Besides looking at performance metrics to see how good a model is, a thorough evaluation also includes reviewing explanations, checking for bias and fairness issues, looking at where the model makes mistakes, etc. It will often include human verification.
  4. Deploy model: This is where you deploy a specific version of the model. In the case of evergreen models, we would deploy the model that was fitted to the latest dataset.
  5. Monitor model: Collect telemetry about the deployed model. For example, an Azure AppInsights workbook can be used to collect the number of requests made to the model instance as well as service availability and other user-defined metrics.
  6. Collect inference data and labels: As part of a continuous improvement of the service, all the inferences that are made by the model should be saved into a repository (e.g., Azure Data Lake) alongside the ground truth (if available). This is a crucial step as it allows us to figure out the amount of drift between the inference and the reference data. Should the ground truth labels not be available, we can monitor data drift but not model drift.
  7. Measure data drift: Based on the previous step, we can kick off data drift detection by using the reference data and contrasting it against the current data with the methods introduced in the sections above.
  8. Measure model drift: In this step, we determine if the model is affected by data or concept drift. This is done using one of the methods introduced above.
  9. Trigger re-training: In case of model or concept drift, we can trigger a full re-training and deployment pipeline utilizing the same Azure ML pipeline we used for the initial training. This is the last step that closes the loop between a static and an evergreen model. The re-training trigger can either be automatic (compare performance between the reference and current model and deploy automatically if the current model performs better) or human-in-the-loop (inspect the data drift visualizations alongside the performance metrics of both models and deploy with a data scientist or model owner in the loop; suitable for highly regulated industries). This can be done using PowerApps, Azure DevOps pipelines, or GitHub Actions.
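As referenced in step 1, retrieving the latest dataset version could look roughly like this with the Azure ML Python SDK (v1); the workspace configuration and dataset name are placeholders, not values from the demo repository.

```python
# Hypothetical retrieval of the most recent dataset version for training (Azure ML SDK v1).
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()  # assumes a local config.json pointing to the workspace
dataset = Dataset.get_by_name(ws, name="predictive-maintenance-data", version="latest")
train_df = dataset.to_pandas_dataframe()
```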

Next steps

In this article, we have looked at practical concepts to find and mitigate the drift of data and machine learning models for tabular use cases. We have seen these concepts in action using our demo notebook with the predictive maintenance example. We encourage you to adapt these methods to your own use cases and appreciate any feedback.

Being able to systematically identify and manage the drift of production models is a big step toward mature MLOps practices. Therefore, we recommend integrating these concepts into an end-to-end production solution like the MLOps reference architecture introduced above. The last part of our notebook includes examples of how to implement drift detection using automated Azure Machine Learning Pipelines.

Data and model drift management is only one building block of a holistic MLOps vision and architecture. Feel free to check out the documentation for general information about MLOps and how to implement it using Azure Machine Learning.

In our examples, we have looked at tabular machine learning use cases. Drift mitigation approaches for unstructured data like images and natural language are also emerging. The Drift Detection in Medical Imaging AI repository provides a promising method of analyzing medical images in conjunction with metadata to detect model drift.

All images unless otherwise noted are by the authors.

The post Getting a Grip on Data and Model Drift with Azure Machine Learning appeared first on Towards Data Science.

]]>