The Increasing Importance of Monitoring Your AI Models

After the recent firing of Timnit Gebru, a highly respected AI ethics expert at Google, the topic of model training and retraining – and the compute power required to train and retrain the gargantuan AI/ML models that drive some of the world’s most visible tools – is on the minds of many in the tech industry. While the costs of retraining large language models at Google are significant, every company that has AI/ML models in production must also consider the costs associated with model upkeep: Retrain too early and you will incur added compute and personnel costs. Retrain too late and you incur the business costs of a poorly performing model and, potentially, the reputation costs of inaccurate predictions.

The truth is, it was past time to get serious about monitoring models five years ago. We already knew then that AI/ML models were a constant "work in progress" and needed to be continuously retrained to match changing realities. And yet, we still live in a world in which a biased model causes someone to be unjustly denied a loan, or a broken data pipeline causes a trading house to unknowingly trigger a selloff, or an old, degraded model incorrectly predicts a surge in medical supply demand in one hospital, which causes a shortage in another. Every time this happens, it erodes trust in AI.

As an industry, we can – and should – do better.

Here we are, at the edge of a future filled with opportunities for AI to make a real impact, but too many organizations continue to fail at monitoring and managing the risk associated with production models. One of the key reasons for this failure is a lack of understanding that monitoring data science models is vastly different from monitoring traditional software. Here are just a few of the differences:

Whereas traditional software is deterministic, models are probabilistic.

Whereas software can be written using waterfall and Agile frameworks with strict guardrails and processes, model development looks more like a team of life science researchers working on a new vaccine.

Whereas software requires code and system configurations to reproduce results, models depend on data, code, analytical package dependencies, and hardware configurations to reproduce results.

And most importantly, whereas software produces consistent results over time, model results change if the relationship between the incoming data and the predicted target drifts over time. Catching these changes requires a new class of monitoring systems. Instead of looking only at usage, latency, and cost metrics, these new systems must also consider data quality, data drift, and model quality – all indicators of a model’s "health."

Most of the problems that arise with model health are due to the differences between training an AI/ML model and putting it into production. A model is trained on past data; the goal of the training process is to find patterns in the data that can be exploited to predict an outcome. This training data is carefully selected by data scientists who curate it – sometimes in unusual ways – to get it to a state in which patterns can be exposed.

Once the model is in production, it receives raw data that it has never seen before. That data must be processed so that it has the same "look and feel" as the training data; only then can the model make accurate predictions and drive business value. Ensuring that this holds, in essence, is the goal of model monitoring systems.
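To make this concrete, here is a minimal sketch of that idea in a scikit-learn workflow: the preprocessing is fitted once on training data and the same fitted pipeline is reused on raw scoring data. The column names, target, and the train_df and scoring_df frames are hypothetical, not taken from any particular system.

```python
# Minimal sketch: fit preprocessing on training data once, then reuse the
# same fitted pipeline on raw scoring data so it matches the training "look
# and feel." Column names, target, and the two DataFrames are hypothetical.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["income", "loan_amount"]
categorical_cols = ["employment_type"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(train_df[numeric_cols + categorical_cols], train_df["defaulted"])

# At scoring time, raw production data flows through the same fitted
# transformers before it ever reaches the classifier.
scores = model.predict_proba(scoring_df[numeric_cols + categorical_cols])[:, 1]
```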

First, monitoring the process that prepares the new raw data for scoring can help ensure data quality by answering critical questions, such as: Is the process up? Is the frequency of the feed within acceptable bounds? Does the live scoring data have the expected data types?
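A sketch of what such checks might look like for a pandas-based batch feed follows; the expected schema, the thresholds, and the incoming batch are illustrative assumptions.

```python
# Minimal sketch of data-quality checks on an incoming scoring batch.
# Expected dtypes, thresholds, and the batch itself are illustrative.
import pandas as pd

EXPECTED_DTYPES = {"income": "float64", "loan_amount": "float64",
                   "employment_type": "object"}
MAX_FEED_GAP = pd.Timedelta(hours=1)   # acceptable gap between feeds
MAX_NULL_FRACTION = 0.05               # tolerated share of missing values

def data_quality_issues(batch: pd.DataFrame,
                        last_batch_time: pd.Timestamp) -> list:
    """Return a list of data-quality problems found in the latest batch.
    last_batch_time is assumed to be a timezone-aware UTC timestamp."""
    issues = []
    # Is the process up, and is the feed frequency within acceptable bounds?
    if pd.Timestamp.now(tz="UTC") - last_batch_time > MAX_FEED_GAP:
        issues.append("feed gap exceeds acceptable bound")
    # Does the live scoring data have the expected columns and data types?
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in batch.columns:
            issues.append(f"missing column: {col}")
        elif str(batch[col].dtype) != dtype:
            issues.append(f"unexpected dtype for {col}: {batch[col].dtype}")
        elif batch[col].isna().mean() > MAX_NULL_FRACTION:
            issues.append(f"too many nulls in {col}")
    return issues
```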

Second, monitoring the incoming scoring data streams to ensure they aren’t trending over time or undergoing dramatic shifts helps keep data drift in check. We all know the world is not static: Customer tastes change. Physical systems degrade. Geo-political impacts occur. Checking for distribution shifts and drifts in scoring data – for each field or input – is vital to maintaining model health.
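One common way to operationalize this is a per-feature comparison of a reference sample (drawn from the training data) against the latest scoring window. The sketch below uses a two-sample Kolmogorov–Smirnov test with an illustrative threshold; both are assumptions, not a prescription.

```python
# Minimal sketch of per-feature drift checks: compare reference (training)
# samples against the latest scoring window, field by field.
from scipy.stats import ks_2samp

def drift_flags(reference: dict, current: dict, p_threshold: float = 0.01) -> dict:
    """Flag each numeric feature whose scoring distribution appears shifted.
    reference and current map feature names to arrays of values."""
    flags = {}
    for feature, ref_values in reference.items():
        _, p_value = ks_2samp(ref_values, current[feature])
        flags[feature] = p_value < p_threshold  # True = drift suspected
    return flags
```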

Finally, because models are built on complex patterns, data quality and data drift may be completely in bounds and yet model quality may still erode over time due to small, hard-to-see changes in the incoming data – changes that can only be caught by assessing the accuracy of the models themselves. This requires ground truth: we need to know the final yield of the crop before we can compare it to the predicted yield; we need to know how many loans defaulted before we can compare against the predicted default rate. Getting ground truth can be difficult or nearly impossible in some business settings, but it is very doable with minimal effort in others. The net-net is that no model monitoring system is complete without a suite of model quality checks.
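A minimal sketch of such a check, assuming ground-truth labels can eventually be joined back to the stored predictions; the baseline value and tolerance are illustrative assumptions.

```python
# Minimal sketch of a model-quality check once ground truth arrives,
# e.g. actual loan defaults matched back to earlier predictions.
from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.82   # performance measured at deployment time (assumed)
TOLERANCE = 0.05      # degradation we are willing to accept before alerting

def quality_alert(y_true, y_scored) -> bool:
    """Return True if live performance has eroded past the tolerance."""
    return roc_auc_score(y_true, y_scored) < BASELINE_AUC - TOLERANCE
```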

As an industry, our collective goal should be to establish systems that properly manage production models by capturing their lineage, validating their readiness, and maintaining their health. In this article, we advocated the prioritization of such systems and discussed maintaining model health as the first step. Creating a sharable model lineage and establishing comprehensive readiness validation routines – the topic of my next article – are no less important and too often overlooked.

