Model Monitoring | Towards Data Science https://towardsdatascience.com/tag/model-monitoring/ The world’s leading publication for data science, AI, and ML professionals.

How to Spot and Prevent Model Drift Before it Impacts Your Business https://towardsdatascience.com/how-to-spot-and-prevent-model-drift-before-it-impacts-your-business/ Thu, 06 Mar 2025 19:22:22 +0000 3 essential methods to track model drift you should know

Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. 

I’ve seen firsthand how undetected drifts can result in significant costs — missed fraud detection, lost revenue, and suboptimal business outcomes, just to name a few. So, it’s crucial to have robust monitoring in place if your company has deployed or plans to deploy machine learning models into production.

Undetected Model Drift can lead to significant financial losses, operational inefficiencies, and even damage to a company’s reputation. To mitigate these risks, it’s important to have effective model monitoring, which involves:

  • Tracking model performance
  • Monitoring feature distributions
  • Detecting both univariate and multivariate drifts

A well-implemented monitoring system can help identify issues early, saving considerable time, money, and resources.

In this comprehensive guide, I’ll provide a framework for how to think about and implement effective Model Monitoring, helping you stay ahead of potential issues and ensure the stability and reliability of your models in production.

What’s the difference between feature drift and score drift?

Score drift refers to a gradual change in the distribution of model scores. If left unchecked, this could lead to a decline in model performance, making the model less accurate over time.

On the other hand, feature drift occurs when one or more features experience changes in the distribution. These changes in feature values can affect the underlying relationships that the model has learned, and ultimately lead to inaccurate model predictions.

Simulating score shifts

To model real-world fraud detection challenges, I created a synthetic dataset with five financial transaction features.

The reference dataset represents the original distribution, while the production dataset introduces shifts to simulate an increase in high-value transactions without PIN verification on newer accounts, indicating an increase in fraud.

Each feature has different underlying distributions:

  • Transaction Amount: Log-normal distribution (right-skewed with a long tail)
  • Account Age (months): Clipped normal distribution between 0 and 60 (assuming a 5-year-old company)
  • Time Since Last Transaction: Exponential distribution
  • Transaction Count: Poisson distribution
  • Entered PIN: Binomial distribution

To approximate model scores, I randomly assigned weights to these features and applied a sigmoid function to constrain predictions between 0 and 1. This mimics how a logistic regression fraud model generates risk scores.
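
As a rough illustration, here is a minimal sketch of how such a reference dataset and its approximate scores could be generated. The sample size, distribution parameters, weights, and column names below are illustrative assumptions rather than the exact values behind the plots in this post.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Reference dataset drawn from the distributions described above
ref_data = pd.DataFrame({
    "transaction_amount": rng.lognormal(mean=4, sigma=1, size=n),         # right-skewed with a long tail
    "account_age_in_months": np.clip(rng.normal(30, 12, size=n), 0, 60),  # clipped normal, 0-60 months
    "time_since_last_transaction": rng.exponential(scale=5, size=n),
    "transaction_count": rng.poisson(lam=3, size=n),
    "entered_pin": rng.binomial(n=1, p=0.7, size=n),
})

# Approximate model scores: random weights on standardized features, squashed by a sigmoid
weights = rng.normal(size=ref_data.shape[1])
standardized = (ref_data - ref_data.mean()) / ref_data.std()
ref_data["model_score"] = 1 / (1 + np.exp(-(standardized.values @ weights)))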

As shown in the plot below:

  • Drifted features: Transaction Amount, Account Age, Transaction Count, and Entered PIN all experienced shifts in distribution, scale, or relationships.
Distribution of drifted features (image by author)
  • Stable feature: Time Since Last Transaction remained unchanged.
Distribution of stable feature (image by author)
  • Drifted scores: As a result of the drifted features, the distribution in model scores has also changed.
Distribution of model scores (image by author)

This setup allows us to analyze how feature drift impacts model scores in production.

Detecting model score drift using PSI

To monitor model scores, I used the population stability index (PSI) to measure how much the model score distribution has shifted over time.

PSI works by binning continuous model scores and comparing the proportion of scores in each bin between the reference and production datasets. It combines the differences in those proportions with their logarithmic ratios into a single summary statistic that quantifies the drift.
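
Formally, if ref_i and prod_i are the proportions of scores falling into bin i for the reference and production datasets, then PSI = Σ (ref_i − prod_i) · ln(ref_i / prod_i), summed over all bins; this is exactly what the implementation below computes bin by bin.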

Python implementation:

import numpy as np

# Define function to calculate PSI given two datasets
def calculate_psi(reference, production, bins=10):
  # Discretize scores into bins
  min_val, max_val = 0, 1
  bin_edges = np.linspace(min_val, max_val, bins + 1)

  # Calculate proportions in each bin
  ref_counts, _ = np.histogram(reference, bins=bin_edges)
  prod_counts, _ = np.histogram(production, bins=bin_edges)

  ref_proportions = ref_counts / len(reference)
  prod_proportions = prod_counts / len(production)
  
  # Avoid division by zero
  ref_proportions = np.clip(ref_proportions, 1e-8, 1)
  prod_proportions = np.clip(prod_proportions, 1e-8, 1)

  # Calculate PSI for each bin
  psi = np.sum((ref_proportions - prod_proportions) * np.log(ref_proportions / prod_proportions))

  return psi
  
# Calculate PSI
psi_value = calculate_psi(ref_data['model_score'], prod_data['model_score'], bins=10)
print(f"PSI Value: {psi_value}")

Below is a summary of how to interpret PSI values:

  • PSI < 0.1: No drift, or very minor drift (distributions are almost identical).
  • 0.1 ≤ PSI < 0.25: Some drift. The distributions are somewhat different.
  • 0.25 ≤ PSI < 0.5: Moderate drift. A noticeable shift between the reference and production distributions.
  • PSI ≥ 0.5: Significant drift. There is a large shift, indicating that the distribution in production has changed substantially from the reference data.
Histogram of model score distributions (image by author)

The PSI value of 0.6374 suggests a significant drift between our reference and production datasets. This aligns with the histogram of model score distributions, which visually confirms the shift towards higher scores in production — indicating an increase in risky transactions.
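
These thresholds can be wrapped in a small helper so a monitoring job can label drift automatically; a minimal sketch based on the ranges above:

def interpret_psi(psi_value):
    """Map a PSI value to the drift buckets described above."""
    if psi_value < 0.1:
        return "no or very minor drift"
    elif psi_value < 0.25:
        return "some drift"
    elif psi_value < 0.5:
        return "moderate drift"
    return "significant drift"

print(interpret_psi(psi_value))  # "significant drift" for the 0.6374 value computed above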

Detecting feature drift

Kolmogorov-Smirnov test for numeric features

The Kolmogorov-Smirnov (K-S) test is my preferred method for detecting drift in numeric features because it is non-parametric, meaning it doesn’t assume any particular underlying distribution.

The test compares a feature’s distribution in the reference and production datasets by measuring the maximum difference between the empirical cumulative distribution functions (ECDFs). The resulting K-S statistic ranges from 0 to 1:

  • 0 indicates no difference between the two distributions.
  • Values closer to 1 suggest a greater shift.

Python implementation:

import pandas as pd
from scipy.stats import ks_2samp

# Create an empty dataframe to collect results
ks_results = pd.DataFrame(columns=['Feature', 'KS Statistic', 'p-value', 'Drift Detected'])

# Loop through all features and perform the K-S test
for col in numeric_cols:
    ks_stat, p_value = ks_2samp(ref_data[col], prod_data[col])
    drift_detected = p_value < 0.05

    # Store results in the dataframe
    ks_results = pd.concat([
        ks_results,
        pd.DataFrame({
            'Feature': [col],
            'KS Statistic': [ks_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

Below are ECDF charts of the four numeric features in our dataset:

ECDFs of four numeric features (image by author)

Let’s look at the account age feature as an example: the x-axis represents account age (0-50 months), while the y-axis shows the ECDF for both reference and production datasets. The production dataset skews towards newer accounts, as it has a larger proportion of observations with lower account ages.
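
If you want to reproduce charts like these yourself, a minimal sketch looks like the following, assuming the ref_data, prod_data, and numeric_cols objects from the earlier snippets:

import matplotlib.pyplot as plt
import numpy as np

def plot_ecdf(ref_values, prod_values, ax, title):
    # Plot the empirical CDF of one feature for the reference and production sets
    for values, label in [(ref_values, "reference"), (prod_values, "production")]:
        x = np.sort(values)
        y = np.arange(1, len(x) + 1) / len(x)
        ax.step(x, y, where="post", label=label)
    ax.set_title(title)
    ax.set_ylabel("ECDF")
    ax.legend()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.ravel(), numeric_cols):
    plot_ecdf(ref_data[col], prod_data[col], ax, col)
plt.tight_layout()
plt.show()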

Chi-Square test for categorical features

To detect shifts in categorical and boolean features, I like to use the Chi-Square test.

This test compares the frequency distribution of a categorical feature in the reference and production datasets, and returns two values:

  • Chi-Square statistic: A higher value indicates a greater shift between the reference and production datasets.
  • P-value: A p-value below 0.05 suggests that the difference between the reference and production datasets is statistically significant, indicating potential feature drift.

Python implementation:

from scipy.stats import chi2_contingency

# Create empty dataframe with corresponding column names
chi2_results = pd.DataFrame(columns=['Feature', 'Chi-Square Statistic', 'p-value', 'Drift Detected'])

for col in categorical_cols:
    # Get normalized value counts for both reference and production datasets
    ref_counts = ref_data[col].value_counts(normalize=True)
    prod_counts = prod_data[col].value_counts(normalize=True)

    # Ensure all categories are represented in both
    all_categories = set(ref_counts.index).union(set(prod_counts.index))
    ref_counts = ref_counts.reindex(all_categories, fill_value=0)
    prod_counts = prod_counts.reindex(all_categories, fill_value=0)

    # Create contingency table
    contingency_table = np.array([ref_counts * len(ref_data), prod_counts * len(prod_data)])

    # Perform Chi-Square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    drift_detected = p_value < 0.05

    # Store results in chi2_results dataframe
    chi2_results = pd.concat([
        chi2_results,
        pd.DataFrame({
            'Feature': [col],
            'Chi-Square Statistic': [chi2_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

The Chi-Square statistic of 57.31 with a p-value of 3.72e-14 confirms a large shift in our categorical feature, Entered PIN. This finding aligns with the histogram below, which visually illustrates the shift:

Distribution of categorical feature (image by author)

Detecting multivariate shifts

Spearman Correlation for shifts in pairwise interactions

In addition to monitoring individual feature shifts, it’s important to track shifts in relationships or interactions between features, known as multivariate shifts. Even if the distributions of individual features remain stable, multivariate shifts can signal meaningful differences in the data.

By default, Pandas’ .corr() function calculates Pearson correlation, which only captures linear relationships between variables. However, relationships between features are often non-linear yet still follow a consistent trend.

To capture this, we use Spearman correlation to measure monotonic relationships between features. It captures whether features change together in a consistent direction, even if their relationship isn’t strictly linear.

To assess shifts in feature relationships, we compare:

  • Reference correlation (ref_corr): Captures historical feature relationships in the reference dataset.
  • Production correlation (prod_corr): Captures new feature relationships in production.
  • Absolute difference in correlation: Measures how much feature relationships have shifted between the reference and production datasets. Higher values indicate more significant shifts.

Python implementation:

# Calculate correlation matrices
ref_corr = ref_data.corr(method='spearman')
prod_corr = prod_data.corr(method='spearman')

# Calculate correlation difference
corr_diff = abs(ref_corr - prod_corr)

Example: Change in correlation

Now, let’s look at the correlation between transaction_amount and account_age_in_months:

  • In ref_corr, the correlation is 0.00095, indicating essentially no relationship between the two features.
  • In prod_corr, the correlation is -0.0325, indicating a weak negative correlation.
  • Absolute difference in the Spearman correlation is 0.0335, which is a small but noticeable shift.

The absolute difference in correlation indicates a shift in the relationship between transaction_amount and account_age_in_months.

There used to be no relationship between these two features, but the production dataset indicates that there is now a weak negative correlation, meaning that newer accounts have higher transaction amounts. This is spot on!
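
To turn this into a monitorable signal rather than a manual inspection, you can rank all feature pairs by how much their Spearman correlation changed. A minimal sketch using the corr_diff matrix from above (the 0.1 threshold is an arbitrary illustration):

import numpy as np

# Keep only the upper triangle so each feature pair appears once, then rank by shift size
upper_triangle = np.triu(np.ones(corr_diff.shape, dtype=bool), k=1)
shifted_pairs = corr_diff.where(upper_triangle).stack().sort_values(ascending=False)

print(shifted_pairs.head(5))               # pairs with the largest correlation shifts
print(shifted_pairs[shifted_pairs > 0.1])  # or flag pairs above a chosen threshold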

Autoencoder for complex, high-dimensional multivariate shifts

In addition to monitoring pairwise interactions, we can also look for shifts across more dimensions in the data.

Autoencoders are powerful tools for detecting high-dimensional multivariate shifts, where multiple features collectively change in ways that may not be apparent from looking at individual feature distributions or pairwise correlations.

An autoencoder is a neural network that learns a compressed representation of data through two components:

  • Encoder: Compresses input data into a lower-dimensional representation.
  • Decoder: Reconstructs the original input from the compressed representation.

To detect shifts, we compare the reconstructed output to the original input and compute the reconstruction loss.

  • Low reconstruction loss → The autoencoder successfully reconstructs the data, meaning the new observations are similar to what it has seen and learned.
  • High reconstruction loss → The production data deviates significantly from the learned patterns, indicating potential drift.

Unlike traditional drift metrics that focus on individual features or pairwise relationships, autoencoders capture complex, non-linear dependencies across multiple variables simultaneously.

Python implementation:

# Assuming the TensorFlow Keras API and scikit-learn for scaling
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ref_features = ref_data[numeric_cols + categorical_cols]
prod_features = prod_data[numeric_cols + categorical_cols]

# Normalize the data
scaler = StandardScaler()
ref_scaled = scaler.fit_transform(ref_features)
prod_scaled = scaler.transform(prod_features)

# Split reference data into train and validation
np.random.shuffle(ref_scaled)
train_size = int(0.8 * len(ref_scaled))
train_data = ref_scaled[:train_size]
val_data = ref_scaled[train_size:]

# Build autoencoder
input_dim = ref_features.shape[1]
encoding_dim = 3 
# Input layer
input_layer = Input(shape=(input_dim, ))
# Encoder
encoded = Dense(8, activation="relu")(input_layer)
encoded = Dense(encoding_dim, activation="relu")(encoded)
# Decoder
decoded = Dense(8, activation="relu")(encoded)
decoded = Dense(input_dim, activation="linear")(decoded)
# Autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train autoencoder
history = autoencoder.fit(
    train_data, train_data,
    epochs=50,
    batch_size=64,
    shuffle=True,
    validation_data=(val_data, val_data),
    verbose=0
)

# Calculate reconstruction error
ref_pred = autoencoder.predict(ref_scaled, verbose=0)
prod_pred = autoencoder.predict(prod_scaled, verbose=0)

ref_mse = np.mean(np.power(ref_scaled - ref_pred, 2), axis=1)
prod_mse = np.mean(np.power(prod_scaled - prod_pred, 2), axis=1)

The charts below show the distribution of reconstruction loss for both datasets.

Distribution of reconstruction loss between actuals and predictions (image by author)

The production dataset has a higher mean reconstruction error than the reference dataset, indicating a shift in the overall data. This aligns with the changes introduced in the production dataset: a higher number of newer accounts with high-value transactions.
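
To make this comparison easier to monitor automatically, you can summarize the two reconstruction-error distributions and test whether they differ; a small sketch, reusing ks_2samp from the earlier import:

print(f"Mean reconstruction error (reference):  {ref_mse.mean():.4f}")
print(f"Mean reconstruction error (production): {prod_mse.mean():.4f}")

# Reuse the K-S test to check whether the reconstruction-error distributions differ significantly
ks_stat, p_value = ks_2samp(ref_mse, prod_mse)
print(f"K-S statistic: {ks_stat:.4f}, p-value: {p_value:.2e}")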

Summarizing

Model monitoring is an essential, yet often overlooked, responsibility for data scientists and machine learning engineers.

All the statistical methods led to the same conclusion, which aligns with the observed shifts in the data: they detected a trend in production towards newer accounts making higher-value transactions. This shift resulted in higher model scores, signaling an increase in potential fraud.

In this post, I covered techniques for detecting drift on three different levels:

  • Model score drift: Using Population Stability Index (PSI)
  • Individual feature drift: Using Kolmogorov-Smirnov test for numeric features and Chi-Square test for categorical features
  • Multivariate drift: Using Spearman correlation for pairwise interactions and autoencoders for high-dimensional, multivariate shifts.

These are just a few of the techniques I rely on for comprehensive monitoring — there are plenty of other equally valid statistical methods that can also detect drift effectively.

Detected shifts often point to underlying issues that warrant further investigation. The root cause could be as serious as a data collection bug, or as minor as a time change like a daylight saving time adjustment.

There are also fantastic Python packages, like evidently.ai, that automate many of these comparisons. However, I believe there’s significant value in deeply understanding the statistical techniques behind drift detection, rather than relying solely on these tools.
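
For reference, here is roughly what such an automated comparison can look like. This sketch assumes an Evidently 0.4-style Report API; the package's interface has changed across versions, so treat the exact imports and class names as an assumption to verify against the version you install.

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Compare the production dataset against the reference dataset using Evidently's drift preset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=prod_data)
report.save_html("drift_report.html")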

What’s the model monitoring process like at places you’ve worked?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, career advice for professionals in data.


Monitoring BERT Model Training with TensorBoard https://towardsdatascience.com/monitor-bert-model-training-with-tensorboard-2f4c42b373ea/ Fri, 24 Dec 2021 16:12:03 +0000 https://towardsdatascience.com/monitor-bert-model-training-with-tensorboard-2f4c42b373ea/ Gradient Flow and Update Ratios

Photo source: https://unsplash.com/@tobiastu

In the previous article, we explained all the building components of the BERT model. Now we are going to train the model while monitoring the training process in TensorBoard, looking at gradient flow, update-to-parameter ratios, loss, and evaluation metrics.

Why would we want to monitor gradient flow and update ratios instead of simply looking at the loss and evaluation metrics? When we start training a model on a large amount of data, we might run many iterations before the loss and evaluation metrics reveal that the model is not training. By looking at the gradient magnitudes and update ratios, we can immediately spot that something is wrong, which saves us time and money.


Data preparation

We will use the 20newsgroups dataset (License: Public Domain / Source: http://qwone.com/~jason/20Newsgroups/) from sklearn in this example, with 4 categories: alt.atheism, talk.religion.misc, comp.graphics and sci.space. We tokenize the data with BertTokenizer from the transformers library and wrap it in a BertDataset class, which inherits from torch.utils.data.Dataset, allowing us to batch and shuffle the data and conveniently load it into the model.

import pandas as pd
import torch
from sklearn.datasets import fetch_20newsgroups
from torch.utils.data import Dataset, DataLoader
from transformers import BertConfig, BertTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
X_train = pd.DataFrame(newsgroups_train['data'])
y_train = pd.Series(newsgroups_train['target'])
X_test = pd.DataFrame(newsgroups_test['data'])
y_test = pd.Series(newsgroups_test['target'])
BATCH_SIZE = 16
max_length = 256
config = BertConfig.from_pretrained("bert-base-uncased")
config.num_labels = len(y_train.unique())
config.max_position_embeddings = max_length
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_encodings = tokenizer(X_train[0].tolist(), truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(X_test[0].tolist(), truncation=True, padding=True, max_length=max_length)

class BertDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx]).to(device)
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = BertDataset(train_encodings, y_train)
test_dataset = BertDataset(test_encodings, y_test)
train_dataset_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataset_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)
for d in train_dataset_loader:
    print(d)
    break

# output : 
{'input_ids': tensor([[ 101, 2013, 1024,  ...,    0,    0,    0],
         [ 101, 2013, 1024,  ..., 1064, 1028,  102],
         [ 101, 2013, 1024,  ...,    0,    0,    0],
         ...,
         [ 101, 2013, 1024,  ..., 2620, 1011,  102],
         [ 101, 2013, 1024,  ..., 1012, 4012,  102],
         [ 101, 2013, 1024,  ..., 3849, 2053,  102]], device='cuda:0'),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0'),
 'labels': tensor([3, 0, 2, 1, 0, 2, 2, 1, 1, 0, 1, 3, 3, 0, 2, 1], device='cuda:0')}

TensorBoard usage

TensorBoard allows us to write and save different types of data for future analysis, including images and scalars. First of all, let’s install TensorBoard with pip:

pip install tensorboard

To write to TensorBoard we will be using the SummaryWriter from torch.utils.tensorboard

from torch.utils.tensorboard import SummaryWriter
# SummaryWriter takes log directory as argument
writer = SummaryWriter('tensorboard/runs/bert_experiment_1')

To write scalars, we use:

writer.add_scalar('loss/train', loss, counter_train)

The counter_train variable is needed to know the step number at which something was written to TensorBoard. To write an image, we will use the following:

writer.add_figure("gradients", myfig, 
        global_step=counter_train, close=True, walltime=None)

Model training

Now let’s look at our training function

def train_epoch(  model : BertForSequenceClassification,
                  data_loader : DataLoader,
                  optimizer : AdamW,
                  scheduler : get_linear_schedule_with_warmup,
                  n_examples : int,
                  out_tensorboard : bool = True,
                  out_every : int = 30,
                  step_eval : int = None,
                  test_data_loader : DataLoader= None,
                  len_test_dataset : int = None
                ) -> Tuple[float, float]:
    """
    out_every : every how many steps add gradients and ratios figures and train loss to the tensorboard
    out_tensorboard : write to tensorboard or not
    step_eval : every how many step evaluate the model on test data. If None is passed then we will evaluate only at the end of the epoch.
    """
    print(f"Overall number of steps for training : {len(data_loader) * EPOCHS}")
    print(f"Tensorboard will save {(len(data_loader) * EPOCHS) // out_every} figures")
    global counter_train
    SKIP_PROB = 0
    model = model.train()
    losses = []
    correct_predictions = 0

    tot_batches = len(data_loader) * EPOCHS
    steps_out_stats = list(np.arange(0, tot_batches, out_every))

    running_losses = []
    for d in tqdm(data_loader, desc="Train batch"):

        outputs = model(**d)
        preds = outputs.logits.argmax(1)
        loss = outputs.loss
        correct_predictions += torch.sum(preds == d['labels'])
        running_losses.append(loss.item())
        losses.append(loss.item())
        loss.backward()

        if counter_train in steps_out_stats and out_tensorboard:
            curr_params = copy.deepcopy(optimizer.param_groups[0]['params'])

            print("writing gradients and ratios..")
            # write gradients to tensorboard
            myfig = plot_grad_flow(model.named_parameters(), skip_prob=SKIP_PROB)
            writer.add_figure("gradients", myfig, global_step=counter_train, close=True, walltime=None)

            named_params = copy_model_params(model.named_parameters())

            writer.add_scalar('loss/train', np.mean(running_losses), counter_train)
            running_losses = []

            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()

            next_params = copy.deepcopy(optimizer.param_groups[0]['params'])
            ratios = compute_ratios(curr_params, next_params, named_params)
            optimizer.zero_grad()

            fig_ratio = plot_ratios(ratios, skip_prob=SKIP_PROB)
            writer.add_figure("gradient ratios", fig_ratio, global_step=counter_train, close=True, walltime=None)

        else:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        writer.add_scalar('learning_rate', scheduler.get_lr()[0], counter_train)

        if step_eval != None and counter_train % step_eval == 0 and counter_train > 1:
            print("evaluating the model..")
            val_acc, val_loss = eval_model(model, test_dataset_loader, len_test_dataset)
            writer.add_scalar('loss/test', val_loss, counter_train)
            writer.add_scalar('accuracy/test', val_acc, counter_train)
            model = model.train()

        counter_train += 1
    return correct_predictions.cpu().numpy() / n_examples, np.mean(losses)

The out_every parameter controls how often we write to TensorBoard, measured in number of steps. We can also evaluate more often than after each epoch using the step_eval parameter. The gradient-flow and update-ratio figures are returned by the plot_grad_flow and plot_ratios functions respectively (the copy_model_params and eval_model helpers used above are sketched just below).
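
Two helpers called in train_epoch, copy_model_params and eval_model, are not shown in this post (the full code is in the GitHub repo linked at the end). A minimal sketch consistent with how they are used above could look like this:

import numpy as np
import torch

def copy_model_params(named_parameters):
    # Snapshot the current parameter values as numpy arrays, keyed by layer name
    return {n: p.detach().cpu().numpy().copy() for n, p in named_parameters if p.requires_grad}

def eval_model(model, data_loader, n_examples):
    # Return (accuracy, mean loss) of the model on a held-out data loader
    model = model.eval()
    losses, correct = [], 0
    with torch.no_grad():
        for d in data_loader:
            outputs = model(**d)
            preds = outputs.logits.argmax(1)
            correct += torch.sum(preds == d['labels'])
            losses.append(outputs.loss.item())
    return correct.cpu().numpy() / n_examples, np.mean(losses)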

To plot_grad_flow we pass the model’s named parameters, and for better visualization we can decide to skip some of the layers with the skip_prob parameter. We write the mean, max, and standard deviation of the gradients of each layer, ignoring bias layers as they are less interesting; you can remove the bias check in the loop if you want to display bias gradients as well. We also display the percentage of zero gradients in each layer. It’s important to highlight that, as BERT uses the GELU rather than the ReLU activation function, this last plot might be less useful here; but if you are using a different model with ReLU, which suffers from the dying-neurons problem, displaying the percentage of zero gradients is actually helpful.

def plot_grad_flow(named_parameters : iter, skip_prob : float = 0.5, verbose : bool = False, seed : int = 0) -> plt.figure:
    '''Plots the gradients flowing through different layers in the net during training.
    Can be used for checking for possible gradient vanishing / exploding problems.

    skip_prob : skip some random layers for better visualization if there are a lot of layers
    seed : random seed to skip layers
    '''

    np.random.seed(seed)
    name_replace = {"encoder" : "enc", "layer" : "l"}
    plt.rcParams["figure.figsize"] = (20, 12)
    # plt.rcParams['figure.dpi']= 150

    mean_grads, max_grads, zero_grads, stds = [], [], [], []
    name_layers = []
    for (n, p) in (named_parameters):
        if (p.requires_grad) and ("bias" not in n):
            if np.random.rand() < skip_prob:
                if verbose:
                    print(f"skipped {n}")
                continue
            name_layers.append('.'.join([name_replace.get(el, el) for el in n.split(".") if el not in ['weight', 'bert', 'self']]))
            mean_grads.append(p.grad.abs().mean().detach().cpu().item())
            max_grads.append(p.grad.abs().max().detach().cpu().item())
            stds.append(p.grad.abs().std().detach().cpu().item())
            zero_grads.append(torch.sum(p.grad.abs() == 0.).detach().cpu().item() / p.grad.nelement())
        elif not (p.requires_grad):
            print(f"{n} does not require grad")

    sns.set_style("darkgrid", {"axes.facecolor": ".9"})
    sns.set(font_scale = 1.2)
    fig, ax = plt.subplots(3)
    sns.barplot(x=name_layers, y=max_grads, palette=['b']*len(name_layers), alpha=0.9, ax=ax[2], label="max grads",)
    sns.barplot(x=name_layers, y=mean_grads, palette=['r']*len(name_layers), alpha=0.9, ax=ax[2], label="mean grads",)

    sns.barplot(x=name_layers, y=zero_grads, palette=['k']*len(name_layers), alpha=0.9, ax=ax[0], label="percentage zero grads")

    sns.barplot(x=name_layers, y=stds, palette=['c']*len(name_layers), alpha=0.9, ax=ax[1], label="standard dev grads")

    ax[2].set_ylim([-0.005, 0.05])
    ax[1].set_ylim([-0.005, 0.1])
    ax[0].set_ylim([-0.005, 1.05])
    ax[2] = ax[2].set_xticklabels(ax[2].get_xticklabels(), rotation = 90)
    ax[0].set(xticklabels=[])
    ax[1].set(xticklabels=[])

    ax[0].legend()
    ax[1].legend()
    plt.legend()
    plt.close()
    return fig

The plot_ratios function displays the update/parameter ratio for each parameter. This is a standardized measure, since the size of each update is divided by the size of the parameter itself, and it helps you understand how your neural network is learning. As a rough heuristic, this value should be around 1e-3: if it is much lower, the learning rate might be too low; if much higher, the learning rate is likely too high. Here too, you can reduce the number of layers displayed through the skip_prob parameter.

def plot_ratios(ratios : dict,
                skip_prob : float = 0.5,
                verbose : bool = False,
                seed : int = 0) -> plt.figure:
    '''Plots the update/param ratio.
    Can be used for checking for possible gradient vanishing / exploding problems.

    This ratio should be around 1e-3.
    If it is lower than this then the learning rate might be too low.
    If it is higher then the learning rate is likely too high

    skip_prob : skip some random layers for better visualization if there are a lot of layers
    seed : random seed to skip layers
    '''
    np.random.seed(seed)
    name_replace = {"encoder" : "enc", "layer" : "l"}
    plt.rcParams["figure.figsize"] = (20, 12)
    # plt.rcParams['figure.dpi']= 150

    ratios_list = []
    name_layers = []
    for (n, r) in (ratios.items()):
        if np.random.rand() < skip_prob:
            if verbose:
                print(f"skipped {n}")
            continue
        name_layers.append('.'.join([name_replace.get(el, el) for el in n.split(".") if el not in ['weight', 'bert', 'self']]))
        ratios_list.append(r)

    sns.set_style("darkgrid", {"axes.facecolor": ".9"})
    sns.set(font_scale = 1.2)
    fig, ax = plt.subplots()
    sns.barplot(x=name_layers, y=ratios_list, palette=['b']*len(name_layers), ax=ax, label="ratio update/param",)

    ax.set_ylim([-0.0005, 0.005])
    ax = ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)

    plt.legend()
    plt.close()
    return fig

def compute_ratios(param_prev : dict ,
                   param_next : dict,
                   named_params : dict,
                   ignore_bias : bool = True) -> dict:
    """ Compute update/param ratio
    """
    updates = {}
    param_names = []
    for i, (name, val) in enumerate(named_params.items()):
        param_names.append(name)
        updates[name] = copy.deepcopy((param_next[i] - param_prev[i]).cpu().detach().numpy())

    ratio_updates = {}
    for n in param_names:
        if ignore_bias:
            if 'bias' in n: continue
        k = named_params[n]
        param_scale = np.linalg.norm(k.ravel())
        update = updates[n]
        update_scale = np.linalg.norm(update.ravel())
        ratio_updates[n] = update_scale/param_scale
    return ratio_updates

During the model training you can start TensorBoard using the following command :

tensorboard --samples_per_plugin images=100 --logdir bert_experiment_1

Otherwise, you can create a .bat file (on Windows) for a quicker launch. Create a new file, for example run_tensorboard.bat, copy-paste the command below, and modify path_to_anaconda_env, path_to_saved_results, and env_name accordingly. Then double-click the file to launch TensorBoard.

cmd /k "cd path_to_anaconda_envScripts &amp; activate env_name &amp; cd path_to_saved_resultstensorboardruns &amp; tensorboard - samples_per_plugin images=100 - logdir bert_experiment_1"

Setting --samples_per_plugin images=100 makes TensorBoard display all the images logged in this task; otherwise, by default, it will only display a sample of them.


Results

By monitoring these plots we can quickly spot if something is not going as expected. For example, if gradients are zero for many layers you might have a vanishing gradient problem. Similarly, if the ratios are very low or very high you might want to dig deeper immediately, without waiting until the end of training or for several epochs to pass.

For example, looking at the ratios and the gradients on step number 240 with the below configuration we can see that things look good and our training is proceeding well – we can expect good results at the end.

EPOCHS = 5
optimizer = AdamW(model.parameters(), lr=3e-5, correct_bias=False)
total_steps = len(train_dataset_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
                   optimizer,
                   num_warmup_steps= 0.7 * total_steps,
                   num_training_steps=total_steps
                                            )
Ratio updates, Image by Author
Gradients flow, Image by Author

Indeed, the final results are:

train/test accuracy, Image by Author

If instead we change the scheduler setting to this:

scheduler = get_linear_schedule_with_warmup(
 optimizer,
 num_warmup_steps= 0.1 * total_steps,
 num_training_steps=total_steps
 )

From the graphs we notice, at around step 180, that the model is not learning, and we can stop the training to investigate further. In this case, setting num_warmup_steps to 0.1 * total_steps makes the learning rate decrease and become very small shortly after the start of training, and we end up with vanishing gradients that no longer propagate back to the first layers of the network, effectively stopping the learning process.

Ratio updates, Image by Author
Gradients flow, Image by Author
train/test accuracy, Image by Author

You can find the full code in this GitHub repo to try and experiment yourself.


Conclusions

After reading this and the previous article, you should have all the tools and understanding needed to train a BERT model in your own projects! The things described here are some of what you want to monitor during training, but TensorBoard offers other functionality as well, such as the Embedding Projector that you can use to explore your embedding layer, and much more that you can find in the TensorBoard documentation.

References

https://cs231n.github.io/neural-networks-3/

Key Metrics for Data Science Team Success https://towardsdatascience.com/key-metrics-for-data-science-team-success-822da77f509c/ Fri, 29 Oct 2021 22:24:43 +0000 https://towardsdatascience.com/key-metrics-for-data-science-team-success-822da77f509c/ How data science team leaders can measure team performance and demonstrate success for the C-suite

As the field of Data Science continues to grow and mature, many data science leaders struggle when C-suite executives ask them to demonstrate consistent success. A team may have delivered substantial projects and models that produce tangible results, but data science is still science: it involves experimenting and learning, along with discovering actionable insights. Some projects won’t demonstrate new insights, and experimental programs may point the way to longer-term impact rather than useful results today. That’s OK at first, but you still need a process to show the C-suite ongoing growth and refinement.

A recent survey from Wakefield Research found that 71% of data executives say their company leadership expects revenue growth from their investment in data science. Senior leaders don’t just want incremental growth from these programs – 25% of data science leaders say their company leadership expects double-digit growth from data science, adding pressure to deliver quickly.

But a good data science program can take months or years to get a flywheel of innovation cranking away – so how do you show consistent results each quarter to demonstrate you’re not only discovering useful insights today, you’re also building an analytics machine that will add value for many years?

The trick is to move beyond metrics that just show validity and results for certain projects, and deploy metrics that show the overall performance of your complete data science program. You want to show how your team is getting faster, how they are delivering measurable results, and how the team is situated to keep growing. These are the key areas to evaluate for your data science KPIs, to show how your group is adding value to the broader organization.

Demonstrate velocity

One of the best ways to show that you’re not just flying by the seat of your pants (like many fledgling teams) is to track the velocity of your overall performance. When you start a new project, you have objectives and theories, but you don’t know how the project will turn out. Much of data science is a research process, and a team may try 99 experiments before the 100th yields an interesting result… and even "no insight" can be a valid outcome.

But "no insight" shouldn’t be the only output in those situations – you dedicated significant staff resources, and you learned something from that work, knowledge that you can successfully use in the future. You need a system to capture this work and catalog the modeling datasets, features used and initial results, so when you get a similar project down the road, you have a head start in terms of validated data, preliminary models, or effective approaches. Making your process repeatable and trackable is important to building a high-velocity data team so you can move from bespoke "artisan thinking" to reproducible and reusable "modular system" thinking.

The best performing teams move fastest when they build on the past. In terms of metrics, I’ve worked with one data scientist who tracks component reuse as one of the KPIs for their team. People on that team are recognized when they create a widely used component, like a great dataset diagnostic tool, and given credit for their contribution to the overall success.

Glenn Hofmann is the Chief Analytics Officer for New York Life, one of the world’s largest life insurance companies with a more than 175 year track record and one that operates in a heavily regulated industry. Hofmann was an early proponent of a more systemized approach to data science. Over the past five years, the Hofmann-led Center for Data Science and Artificial Intelligence (CDSAi) at New York Life grew from a team of seven people to nearly 50, and invested in infrastructure that captures business-critical results. In that time, the team has created comprehensive models for key business targets (such as customer retention and agent productivity) that can be versioned or expanded rapidly with a new idea from a business partner. CDSAi has also created an environment utilizing Python and R stacks, a data science workbench platform, and a Kubernetes cluster to automate procedures and speed up deployment.

"We’ve eliminated months of recoding work and can bring Python and R code directly to production. Our models can now be accessed from any production platform in the company via an API," Hofmann notes. By creating an API that others can use in the company, CDSAi and New York Life overall can quickly deploy new projects and support a wide range of business requests.

Leaders need to take a systematic approach to managing data science, and change the process from a complicated moonshot every time to just another lap around the racetrack. By reducing the time to iterate by standardizing templates, reusing datasets and saving software configurations, you can deploy a model in hours – and this helps your team iterate and refine at real-time speed.

Deliver ROI

The importance of data science has elevated the field, but it’s also created high expectations, especially with C-level executives who are unclear on what’s possible with data science.

The Wakefield Research study found that while company leadership may have double-digit revenue expectations for data science, today 82% of companies are making splashy short-term investments without recognizing the ongoing benefits of investing in data science. 46% of those executives say these short-sighted investments happen often or all the time.

If a model fails, the investment and budget may just disappear. 78% of data executives have seen their companies stop a data science initiative or cut back investment if a data model fails, including 26% who say it has happened several times. So you need to make sure to set expectations, and show results that improve over time.

One way to show direct ROI is by using control groups in the production environment. This is going to help you show the value across the company, especially with senior business leaders outside of the data science or IT organizations. One company I know created a ‘global holdout’ group from their customer segmentation and price elasticity models. A year later, they compared revenue from the holdouts to customers guided by the predictive models. By creating a before and after comparison, the team demonstrated more than a $1 billion lift in revenue, results that gave the data science team significant credibility and supported new hires.

One final point on ROI – don’t forget to show people the big picture. Make sure you socialize aggregate portfolio metrics. Even if it’s just an approximation, you want to show the impact of your whole portfolio of projects. You also want stakeholders to be aware how many projects are in-flight, in the pipeline, and on backlog. Executives may only focus on a couple of projects in their line of business or department, so showing them the big picture can be illuminating. This cadence of regular reporting also gives you a chance to demonstrate the collective achievement of the whole data science team along with individual contributions.

Grow your team

Beyond the daily work of finding insights, building a strong data science team is one of the biggest challenges I see in the field today. The Wakefield study found 48% of data executives complain of inadequate data skills among employees, and 44% say they’re not able to hire enough talent to scale data science in the first place.

Recruitment and retention will be ongoing issues, and if you want to build a foundation for significant results, you need a strong, consistent team. You have to show your staff that you take data science seriously and have a rigorous program. Then when reviewing your goals with the CEO, you can also report how many people you’ve hired, your turnover rate, and how quickly new hires are able to contribute meaningful work. By supporting your team with a robust program and clear procedures, the team can focus on their work.

To do this, you need to deploy technology that supports a decentralized team. With ongoing work from home programs, you may not have even met team members in person – even your direct reports! With this level of decentralization, you have to do more to connect new employees to the business, so they have a better sense of what the company needs to do. Make sure new people meet with Line of Business owners, marketing and sales teams and with the IT teams that will help them get work done.

As your team grows, the processes you put in place today will pay off as new team members start working with experts in particular data science tools or fields, building on the knowledge you’ve captured and experiments you’ve conducted. 39% of executives say one of the top obstacles to data science having a great impact at their organization is inconsistent standards and processes. Make it easy to digest your corpus of knowledge, and learn how the rest of the team gets work done so they can mirror this process.

One tactical idea is to hold "lunch-and-learns" for the data science team, and also for the whole company so that everyone learns what your team has done and where it’s going next. New York Life’s CDSAi not only conducts monthly lunch presentations on projects, methodologies, and data usages, the team also hosts an annual data science expo and regular forums featuring external guest speakers who educate and inspire New York Life’s broad data community.

From my experience at Domino Data Lab, I’ve seen many companies do things correctly, and I’ve seen many more who come to us after months of flailing around trying to gain traction. Let’s just say the head of data science programs at those less sophisticated companies don’t usually last long in the role. So I urge data leaders to create a sustainable, long-lasting program.

If you build a program that continues to accelerate with repeatable processes and reusable assets, keeps the C-suite informed and understanding of the big picture, and helps your team feel integrated and effective, then you can build a team that will make a major impact and deliver significant, measurable, results.

A six-point framework on how to maintain your AI/ML models https://towardsdatascience.com/a-six-point-framework-on-how-to-maintain-your-ai-ml-models-b466e926005c/ Mon, 21 Jun 2021 18:40:51 +0000 https://towardsdatascience.com/a-six-point-framework-on-how-to-maintain-your-ai-ml-models-b466e926005c/ As the pandemic has made big changes across our world, we can't always rely on historical data that we used to train and build our first...

Notes from Industry

As the pandemic has made big changes across our world, we can’t always rely on historical data that we used to train and build our first model versions. We all know – or we should realize by now – that these first versions will break somehow. It is just a matter of time. In our first article from last December, we discussed why you need model monitoring on your AI/ML models. Let’s broaden that discussion, and our vision, by considering a holistic framework to maintain our models. This is critical because models are living, functional tools that support our business decisions, drive revenue, reduce costs, and represent a significant investment by the company. Simply monitoring models is a good start but is not sufficient especially if you have a desire to scale beyond a handful of models in production.

A holistic framework should ensure that your model isn’t biased (we all recall the Amazon hiring model that was trained mostly on résumés from men). It should encompass explainability. It should cover all that we need for full reproducibility when it comes time to retrain. We break it down into six key points as follows.

  1. A well-documented purpose is the first step. Our models should align with our business goals and purposes, otherwise they will grow stale and lose potency. This seems obvious, but often is neglected because sometimes modelers get more involved in building them for their research or to satisfy their intellectual curiosity. As my colleague David Bloch said on this blog, "Part of the challenge [is] the difficulty in assessing the value of a good decision." Sometimes a model can be used to clarify these decisions and more closely map and quantify their value to the business. Having this purpose – and understanding the actual business objectives – moves the model from a Data Science project and makes it a legitimate part of the business. Part of the purposeful approach is to also think through your goals, KPIs and other metrics to assess ROI and fill out the details about the target end users and delivery mechanisms. Another part is in understanding how a model will be used both before and after it satisfies particular business criteria.
  2. Data lineage details. Every model comes with some built-in lineage of its underlying data. The trick is capturing these details and how the data was prepared with sufficient details to ensure that the model can be reproduced and trusted. This is also useful around audit times, so we don’t have to try to track down a model’s data origin story or have to onboard it from scratch.

As I mentioned in my earlier article, even the best models evolve because the underlying data and relationships change over time. Having this data lineage is key to tracking and hopefully preventing concept drift, where the world changes but the model doesn’t reflect these changes. This drift could be caused by changing data distributions, measurements, or the underlying user base that may be ignored by your model. How you document these changes is critical.

  3. A full lifecycle tracking system. Like the software development lifecycle, this is a process to link the model runs with specific data versions and is another way to document the various changes made to model elements that were part of the experimental build process. Think what GitHub does for tracking versions of program code, what Docker does to track system definitions and components, or what Kubernetes does for tracking and orchestrating compute versions. As we finish various model runs, we need to be documenting these elements so we can annotate our progress and show how we fixed various problems with our models. The evolution of our models is almost more important than the actual models themselves because we can better understand what we are modelling and why we chose not only to build them in the first place but also to adjust their data inputs and assumptions.
  4. A model registry that links to the lifecycle tracking system mentioned in the point above. The registry can also be used to track the model version history where each version is fully reproducible with the same elements as our experiments in changing data, code, software, and hardware platforms. The ideal situation is to have a central registry with a summary dashboard where you can browse the model versions and drill down into each one’s history.
  5. Validation routines that document code reviews, report on ethical and bias checks and their explanations, and obtain the stamp of approval from the model’s users. This is also a good place to report on its service level agreements and other functional tests we’ve done, and to comment on its general production readiness. I have seen many modelers who skip this step. Validation is key to making sure that the model actually does what you intended it to do. It is also the key to deciding when the useful life of a model is nearing its end and the model needs to be retired or rebuilt.
  6. The last point is having an open Model Monitoring system. This is what I discussed in my December post, and should be used to capture items such as data drift, a single ground truth, and measurement accuracy, and to provide drill-down capabilities to troubleshoot signals. The monitoring system should also be able to detect anomalies and automatically alert stakeholders when certain thresholds are exceeded.

As you expand your investment in data science and modeling, you will need to manage and maintain an ever increasing collection of models that your business depends on daily. Here are two ways you can get started. First, review your current model maintenance plan against each of the six objectives noted here. Second, create a task force for the effort, or consider getting external help. It will take an up-front investment of time and resources, but you’ll end up with better models that live longer, are safer, and play a larger role in guiding your business decisions. The businesses that get ahead of the curve with model care will be strongly positioned for competitive advantage for years to come. Think of this resource as a way to envision and guide a solid future for your business.

The Increasing Importance of Monitoring Your AI Models https://towardsdatascience.com/the-increasing-importance-of-monitoring-your-ai-models-fc32afbc70b5/ Wed, 16 Dec 2020 16:42:21 +0000 https://towardsdatascience.com/the-increasing-importance-of-monitoring-your-ai-models-fc32afbc70b5/ After the recent firing of Timnit Gebru, a highly respected AI ethics expert at Google, the topic of model training and retraining – and the compute power required to train and retrain the gargantuan AI/ML models that drive some of the world’s most visible tools – is on the minds of many in the tech […]


After the recent firing of Timnit Gebru, a highly respected AI ethics expert at Google, the topic of model training and retraining – and the compute power required to train and retrain the gargantuan AI/ML models that drive some of the world’s most visible tools – is on the minds of many in the tech industry. While the costs of retraining large language models at Google are significant, every company that has AI/ML models in production must also consider the costs associated with model upkeep: Retrain too early and you will incur added compute and personnel costs. Retrain too late and you incur the business costs of a poorly performing model and, potentially, the reputation costs of inaccurate predictions.

Truth is, it was past time to get serious about monitoring models five years ago. We already knew then that AI/ML models were a constant "work in progress" and needed to be continuously retrained to match changing realities. And yet, we still live in a world in which a biased model causes someone to unjustly be denied a loan, or a broken data pipeline causes a trading house to unknowingly trigger a selloff, or an old, degraded model incorrectly predicts a surge in medical supply demand in one hospital, which causes a shortage in another. Every time this happens, it erodes trust in AI.

As an industry, we can – and should – do better.

Here we are, at the edge of a future filled with opportunities for AI to make a real impact, but too many organizations continue to fail at monitoring and managing the risk associated with production models. One of the key reasons for this failure is a lack of understanding that monitoring data science models is vastly different from monitoring traditional software. Here are just a few of the differences:

Whereas traditional software is deterministic, models are probabilistic.

Whereas software can be written using waterfall and Agile frameworks with strict guardrails and processes, model development looks more like a team of life science researchers working on a new vaccine.

Whereas software requires code and system configurations to reproduce results, models depend on data, code, analytical package dependencies, and hardware configurations to reproduce results.

And most importantly, whereas software produces consistent results over time, model results change if the relationship between the incoming data and the predicted target drift apart over time. Catching these changes requires a new class of monitoring systems. Instead of looking only at usage, latency, and cost metrics, these new systems must also consider data quality, Data Drift, and model quality – all indicators of a model’s "health."

Most of the problems that arise with model health are due to the differences that exist between training an AI/ML model and putting it into production. A model is trained on past data; the goal of the training process is to find patterns in the data that can be exploited to predict an outcome. This training data is carefully selected by data scientists who curate it – sometimes in unusual ways – to get it to a state in which patterns can be exposed.

Once the model is in production, it receives raw data that it has never seen before. That data must be processed so that it has the same "look and feel" as the training data; only then can the model make accurate predictions and drive business value. This, in essence, is the goal of model monitoring systems.

First, monitoring the process that prepares the new raw data for scoring can help ensure data quality by answering critical questions, such as: Is the process up? Is the frequency of the feed within acceptable bounds? Does the live scoring data have the expected data types?

Second, monitoring the incoming scoring data streams to ensure they aren’t trending over time or undergoing dramatic shifts helps keep data drift in check. We all know the world is not static: Customer tastes change. Physical systems degrade. Geo-political impacts occur. Checking for distribution shifts and drifts in scoring data – for each field or input – is vital to maintaining model health.
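
One common way to operationalize this check is the population stability index (PSI), which compares each scoring feature's distribution against a training-time reference. The sketch below is a rough illustration, not a production recipe; the 0.2 threshold is only a commonly cited rule of thumb, and `training_df`/`scoring_df` are assumed DataFrames.

```python
import numpy as np

def population_stability_index(reference: np.ndarray,
                               current: np.ndarray,
                               bins: int = 10) -> float:
    """Compare a live feature's distribution against its training-time reference."""
    # Bin edges come from the reference sample; open-ended outer bins catch
    # values outside the range seen at training time.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    edges = np.unique(edges)  # discrete features can produce duplicate edges

    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)

    # Clip to avoid log(0) in sparse bins.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# A PSI above roughly 0.2 is often treated as a shift worth investigating.
# `training_df` and `scoring_df` are assumed DataFrames of numeric features:
# drift = {col: population_stability_index(training_df[col].to_numpy(),
#                                          scoring_df[col].to_numpy())
#          for col in training_df.columns}
```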

Finally, because models are built on complex and complicated patterns, data quality and data drift may be completely in bounds and yet model quality may erode over time due to small, hard-to-see changes in the incoming data; changes that can only be caught by assessing the accuracy of models. This requires ground truth: we need to know the final yield of the crop before we can compare it to the predicted yield; we need to know how many loans defaulted before we can compare against the predicted default rate. Getting ground truth can be difficult or near impossible in some business settings, but it is very doable with minimal effort in others. The net-net is that no Model Monitoring system is complete without a suite of model quality checks.
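
Once ground truth does arrive, the quality check itself can be simple. Below is a minimal sketch for a binary-outcome model such as loan default, assuming predictions and labels can be joined on a shared `entity_id` (a hypothetical key); a regression model, like the crop-yield example, would swap AUC for an error metric such as RMSE.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def model_quality_check(predictions: pd.DataFrame,
                        ground_truth: pd.DataFrame,
                        baseline_auc: float,
                        max_drop: float = 0.05) -> dict:
    """Join delayed ground truth back onto production scores and compare
    live AUC against the training-time baseline.

    `predictions` is assumed to have columns [entity_id, score] and
    `ground_truth` columns [entity_id, label]; both names are hypothetical.
    """
    joined = predictions.merge(ground_truth, on="entity_id", how="inner")
    live_auc = roc_auc_score(joined["label"], joined["score"])
    return {
        "live_auc": live_auc,
        "baseline_auc": baseline_auc,
        "degraded": live_auc < baseline_auc - max_drop,
    }
```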

As an industry, our collective goal should be to establish systems that properly manage production models by capturing their lineage, validating their readiness, and maintaining their health. In this article, we advocated the prioritization of such systems and discussed maintaining model health as the first step. Creating a sharable model lineage and establishing comprehensive readiness validation routines – the topic of my next article – are no less important and too often overlooked.

The post The Increasing Importance of Monitoring Your AI Models appeared first on Towards Data Science.

Explaining the “Why” behind consumer spending on Black Friday https://towardsdatascience.com/explaining-the-why-behind-consumer-spending-on-black-friday-34d5c207b471/ Fri, 27 Nov 2020 14:13:29 +0000

Photo by Ashkan Forouzani on Unsplash

In 2020, life as we know it changed overnight, and humanity as a collective is still figuring out what our new normal will be. Due to the pandemic, people in all walks of life became more familiar with data and insights. For most of the year, all of us were waking up to dashboards, whether about the spread of the virus or its economic ripple effects. As that behavior permeated the general population, it began influencing the enterprise as well.

If you’re a data science and analytics leader, the time is now – your entire organization is hungry for more insights around cash positions, supply chain, consumer behavior, and user sentiment.

Getting ahead with Explainable AI

With the world changing every day [5], data science and analytics leaders need the ability to capture and monitor new data and continuously improve their models in order to predict the immediate future. What we need now is a way to combine "Humans + Data + AI" for effective decision making.

Explainable AI is the next-generation AI that is revolutionizing business decision-making across industries [1, 4, 7, 9].

Explainable AI vs. Black-Box AI (image source: Fiddler.AI)

With Explainable AI, business and analytics leaders can make accurate decisions, understand the "why" behind model decisions, and see how the various factors influence model outputs [1]. The technology makes the overall decision process better informed and results in more accurate outcomes.

Black Friday Consumer Spend Analytics

Let us look at how a retail company, say "ABC LLC," can use this new technology to understand consumer purchase behavior at a level that was not possible in the past [3, 6]. Assuming I am a data scientist at "ABC LLC," my goal is to create Actionable Insights into consumers’ Black Friday purchasing decisions given their historical behavior, and to answer the following questions for my business teams:

  • What are the top drivers behind consumer purchasing decisions?
  • What are the factors that drive a particular female consumer segment to buy?
  • Which cities are likely to spend the most?
  • How many dollars is a consumer going to spend for this year’s Black Friday, and why?
  • Are women likely to drive more sales $ than men at this store?

To do that, there are 5 main steps that I need to take:

  1. Gather data and prepare it for training
  2. Perform exploratory analysis and build features
  3. Build a series of models that I can tune
  4. Analyze model predictions with explanations
  5. Operationalize the model for continuous insights

In fact, it is a cyclical process: we start with a dataset, prepare it for training, evaluate the model, compare it with other models we may have built, perform challenger/champion testing, and analyze the data with the model to gather insights, which can be fed back into the training process to train a better model.

Explainable AI Workflow (Image source: Fiddler.AI)

Data Gathering

For this blog, I will use a popular Kaggle Black Friday dataset [11], which is fairly representative of purchase transaction data collected from a retail store. The dataset contains the following variables about a transaction:

Feature descriptions of Black Friday dataset (Image Source: here)

Exploratory Data Analysis

We are ready to perform some basic data exploration and come up with some insights. To do this, I imported the dataset into Fiddler as a flat file and got the high-level data statistics.

Fiddler automatically generates data statistics such as feature distributions, feature correlations, and mutual information for us to get a high-level understanding of the data. For example, the following 3 insights can be gathered quickly:

  1. The majority of the transactions are coming from B-category cities
  2. Males are buying more than females
  3. The 26–35 age group is the dominant purchasing group.

Purchase distribution in the data (image source: Fiddler.AI)
Gender distribution in the data (image source: Fiddler.AI)
City Category Distribution (image source: Fiddler.AI)
Age Distribution (image source: Fiddler.AI)
Mutual Information between the features and the target (image source: Fiddler.AI)
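
For readers who want to reproduce this kind of exploratory summary outside the platform, here is a rough pandas/scikit-learn sketch; the file path is assumed, and the column names follow the public Kaggle Black Friday dataset rather than anything specific to Fiddler.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Path and column names follow the public Kaggle Black Friday dataset.
df = pd.read_csv("BlackFriday.csv")

# Quick distributional summaries behind the three insights above.
print(df["City_Category"].value_counts(normalize=True))  # share of A/B/C cities
print(df["Gender"].value_counts(normalize=True))          # male vs. female share
print(df["Age"].value_counts(normalize=True))             # dominant age bands

# Mutual information between (integer-encoded) features and the Purchase target.
features = ["Gender", "Age", "Occupation", "City_Category",
            "Stay_In_Current_City_Years", "Marital_Status", "Product_Category_1"]
X = df[features].astype("category").apply(lambda col: col.cat.codes)
mi = mutual_info_regression(X, df["Purchase"], discrete_features=True)
print(pd.Series(mi, index=features).sort_values(ascending=False))
```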

Model Building

To leverage Explainable AI, Fiddler offers two options.

  1. Import a custom pre-trained model
  2. Build an interpretable model

In this case, I used option #1 to train an XGBoost model and used Fiddler’s Python library to upload it into the Fiddler platform. The model was a regression model with a reasonably good R² of 0.68 on the training set and 0.69 on the test set, comparable to some of the other models [3, 6] trained on the same dataset on Kaggle.
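
The article does not share the training code, but a baseline XGBoost regressor on this dataset might look roughly like the sketch below; the file path, feature list, encoding choice, and hyperparameters are all assumptions, and the exact R² will depend on them.

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("BlackFriday.csv")  # path assumed, as in the EDA sketch

features = ["Gender", "Age", "Occupation", "City_Category",
            "Stay_In_Current_City_Years", "Marital_Status",
            "Product_Category_1", "Product_Category_2", "Product_Category_3"]
# Integer-encode categoricals; missing product categories become code -1.
X = df[features].astype("category").apply(lambda col: col.cat.codes)
y = df["Purchase"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

print("train R2:", round(r2_score(y_train, model.predict(X_train)), 2))
print("test  R2:", round(r2_score(y_test, model.predict(X_test)), 2))
```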

Fiddler also allows me to validate the model by looking at the actual-vs.-predicted scatter plot and the error distribution, which show that the model does a pretty good job of capturing the underlying data.

Comparing predictions with actuals (image source: Fiddler.AI)
Error distribution (image source: Fiddler.AI)

Given that this model is satisfactory, we can leverage the XAI capabilities to analyze the data and answer our questions.

Model Explainability

Let us start exploring the model along with the dataset to answer our questions.

Q1: What are the top drivers behind purchasing decisions?

Explainable AI offers a way to answer this question by analyzing the data through the model’s eyes and extracting the top drivers that influence purchasing decisions.

Fiddler’s Slice & Explain™ [10] toolkit allows us to find out very quickly that the top driver by far is Product Category 1, carrying 64% of the importance for consumers who are likely to buy during Black Friday at this store. It is followed by Product Category 2, Occupation Type, Age of the Consumer, and so on.

Top Features for Purchase Prediction (image source: Fiddler.AI)
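
Slice & Explain’s attributions are computed inside Fiddler, but a rough open-source stand-in is to average absolute SHAP values over a dataset, using the model from the training sketch above; the normalized shares below are an approximation and will not necessarily match the percentages reported in the figure.

```python
import numpy as np
import shap

# `model`, `X_test`, and `features` come from the training sketch above.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # shape: (n_rows, n_features)

# Global importance as normalized mean |SHAP| per feature.
importance = np.abs(shap_values).mean(axis=0)
importance = importance / importance.sum()

for name, share in sorted(zip(features, importance), key=lambda t: -t[1]):
    print(f"{name:30s} {share:.1%}")
```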

Q2: What are the factors that drive a particular female consumer segment to buy?

Let us say my business team wants to target a marketing campaign at a consumer segment of young married women in the 18–25 age group from city category A, and is interested in what drives their propensity to purchase. I can express this segment in the form of a SQL query in Fiddler and generate explainable insights.

As we can see from the chart below, "Product Category 1" influences this segment by only 47% (dark blue bar), compared with 63% for the general population (light blue bar). A few other features, namely ‘Occupation Type,’ ‘Years Stayed in Current City,’ and ‘Category of City,’ are much more important for this segment than for the general population.

Top Features for this segment (image source: Fiddler.AI)
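
Outside the platform, the same segment-versus-population comparison could be approximated by masking the rows that match the segment and comparing normalized mean absolute SHAP values. The mask below assumes the Kaggle dataset’s encodings (e.g., `Marital_Status == 1` meaning married) and reuses `df`, `X_test`, `features`, and `shap_values` from the earlier sketches.

```python
import numpy as np
import pandas as pd

# Segment from the article: married women, aged 18-25, in category-A cities.
# The raw (un-encoded) rows share the same index as the encoded test rows.
raw_test = df.loc[X_test.index]
segment = ((raw_test["Gender"] == "F") &
           (raw_test["Age"] == "18-25") &
           (raw_test["Marital_Status"] == 1) &   # assumed: 1 means married
           (raw_test["City_Category"] == "A"))

abs_shap = pd.DataFrame(np.abs(shap_values), columns=features, index=X_test.index)
comparison = pd.DataFrame({
    "segment": abs_shap[segment].mean() / abs_shap[segment].mean().sum(),
    "overall": abs_shap.mean() / abs_shap.mean().sum(),
}).sort_values("overall", ascending=False)
print(comparison)
```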

Q3: Which cities are likely to spend the most?

Again, I use Fiddler’s Slice & Explain toolkit, which gives me the ability to use SQL to slice and dice the data and explain it with my model. Fiddler allows me to express the query in SQL against a dataset and model pair; it scores the model against the dataset in real time and returns predictions and explanation attributions for visualization.

As shown in the query dialog below, the city category likely to generate the most $ sales is "City Category B," with a predicted $2,082 million.

Model Analysis (image source: Fiddler.AI)
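
A rough pandas equivalent of this slice query, run over the held-out test rows from the earlier sketches rather than the full dataset Fiddler scores, might look like this:

```python
import pandas as pd

# `model`, `X_test`, and `df` come from the earlier sketches.
raw_test = df.loc[X_test.index]
predicted = pd.Series(model.predict(X_test), index=X_test.index)

# Total predicted spend per city category, highest first.
print(predicted.groupby(raw_test["City_Category"]).sum().sort_values(ascending=False))
```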

Q4: How many dollars is a consumer going to spend and why?

Let us say our marketing team wants to do micro-targeting and personalize ads to a given user to drive her/his Black Friday purchases.

Using a group-by query similar to the earlier one, we find that user "1000869" is likely to buy items worth $8,290.53 on average during Black Friday. And here are the drivers that influence this user’s purchasing decisions.

The presence of Product Category 1 and Product Category 2 drives the purchasing decisions of this user by a whopping 70%.

Top features for the consumer 1000869 (image source: Fiddler.AI)

Q5: Are women more likely to drive more sales $ than men at this store?

We can run a query against all the model predictions and find out, as shown in the figure, that this is not the case: the average predicted $ transaction size for men ($9,498) is higher than that for women ($8,811).

Women vs. Men average predicted $ spend (image source: Fiddler.AI)

Operationalizing XAI with Monitoring

After we’re satisfied with the analysis, we can operationalize this model by connecting it to a data warehouse, continuously processing new transactional data, and providing explanations and predictions for business users. Once the model is live, we can monitor its performance in production and close the feedback loop. That way, we can track business KPIs and performance metrics, and set up alerts when something goes out of the ordinary. Fiddler’s Explainable Monitoring [8] features help data scientists and analysts keep track of the following:

  • Feature Attributions: Outputs of explainability algorithms allow further investigation, helping to understand which features are the most important causal drivers for model predictions within a given time frame.
  • Data Drift: Track the data coming from the data warehouse so that analysts and data scientists get visibility into any training-serving skew.
  • Outliers: Egregiously high or low purchase predictions are automatically detected as outliers in the prediction time series produced by the model.
Operational Dashboard with Explainable Insights (Image Source: Fiddler.AI)
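
Outside the platform, the drift piece can be tracked with distribution-comparison statistics such as PSI; for the outlier piece, a minimal sketch is to flag days whose average predicted purchase deviates sharply from a trailing baseline. The window and threshold below are arbitrary placeholders, and the daily score feed is hypothetical.

```python
import pandas as pd

def flag_prediction_outliers(daily_mean_score: pd.Series,
                             window: int = 30,
                             z_threshold: float = 3.0) -> pd.Series:
    """Flag days whose mean predicted purchase deviates sharply from a
    trailing baseline. `daily_mean_score` is a date-indexed series of mean
    predictions from the production scoring job (a hypothetical feed).
    """
    baseline = daily_mean_score.rolling(window, min_periods=window // 2).mean().shift(1)
    spread = daily_mean_score.rolling(window, min_periods=window // 2).std().shift(1)
    z = (daily_mean_score - baseline) / spread
    return z.abs() > z_threshold
```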

Fiddler works with a wide range of data warehouse and BI tools so that analysts and data scientists can create operational dashboards and reports and answer interactive queries using cutting-edge Explainable AI.

Fiddler Product (image source: Fiddler.AI)

Conclusion

In this blog, I described how an Explainable AI platform can help data scientists and analysts uncover deeper predictive insights from business datasets. I walked through a simple case study with the publicly available Kaggle Black Friday dataset [11]. If you’re interested in learning more about the platform or trying it out, please feel free to email info@fiddler.ai or fill out this form.

References

  1. Forbes: Explainable AI a game-changer for Business Analytics
  2. Explainable Churn Analytics using Fiddler and MemSQL
  3. Black Friday: How much will a customer spend?
  4. Explainable AI in Retail
  5. Our weird behavior during Corona is messing with AI models
  6. Black Friday Sales Analysis and Prediction
  7. Why Explainable AI is the future of marketing and e-commerce
  8. Explainable Monitoring: Stop flying blind and monitor your AI
  9. 9 videos about explainable AI in the industry
  10. Introducing Slice & Explain: Automated Insights for your AI Models
  11. Black Friday Kaggle Dataset

The post Explaining the “Why” behind consumer spending on Black Friday appeared first on Towards Data Science.
