How to Spot and Prevent Model Drift Before it Impacts Your Business

3 essential methods to track model drift you should know
Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. 

I’ve seen firsthand how undetected drifts can result in significant costs — missed fraud detection, lost revenue, and suboptimal business outcomes, just to name a few. So, it’s crucial to have robust monitoring in place if your company has deployed or plans to deploy machine learning models into production.

Undetected Model Drift can lead to significant financial losses, operational inefficiencies, and even damage to a company’s reputation. To mitigate these risks, it’s important to have effective model monitoring, which involves:

  • Tracking model performance
  • Monitoring feature distributions
  • Detecting both univariate and multivariate drifts

A well-implemented monitoring system can help identify issues early, saving considerable time, money, and resources.

In this comprehensive guide, I’ll provide a framework for thinking about and implementing effective model monitoring, helping you stay ahead of potential issues and ensure the stability and reliability of your models in production.

What’s the difference between feature drift and score drift?

Score drift refers to a gradual change in the distribution of model scores. If left unchecked, this could lead to a decline in model performance, making the model less accurate over time.

On the other hand, feature drift occurs when one or more features experience changes in their distributions. These changes in feature values can affect the underlying relationships that the model has learned, and ultimately lead to inaccurate model predictions.

Simulating score shifts

To model real-world fraud detection challenges, I created a synthetic dataset with five financial transaction features.

The reference dataset represents the original distribution, while the production dataset introduces shifts to simulate an increase in high-value transactions without PIN verification on newer accounts, indicating an increase in fraud.

Each feature has different underlying distributions:

  • Transaction Amount: Log-normal distribution (right-skewed with a long tail)
  • Account Age (months): clipped normal distribution between 0 and 60 (assuming a 5-year-old company)
  • Time Since Last Transaction: Exponential distribution
  • Transaction Count: Poisson distribution
  • Entered PIN: Binomial distribution.

To approximate model scores, I randomly assigned weights to these features and applied a sigmoid function to constrain predictions between 0 and 1. This mimics how a logistic regression fraud model generates risk scores.
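Below is a minimal sketch of how such a dataset could be generated. The feature names, weights, and shift parameters here are illustrative assumptions chosen so the later snippets have something to run against, not the exact values behind the plots in this post:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def make_dataset(n, amount_mean=4.0, account_age_mean=30, pin_rate=0.7):
    """Generate synthetic transaction features; the shift parameters are illustrative."""
    return pd.DataFrame({
        "transaction_amount": rng.lognormal(mean=amount_mean, sigma=0.8, size=n),
        "account_age_in_months": np.clip(rng.normal(account_age_mean, 12, size=n), 0, 60),
        "time_since_last_transaction": rng.exponential(scale=5.0, size=n),
        "transaction_count": rng.poisson(lam=3, size=n),
        "entered_pin": rng.binomial(n=1, p=pin_rate, size=n),
    })

# Production simulates higher amounts, newer accounts, and fewer PIN entries
ref_data = make_dataset(10_000)
prod_data = make_dataset(10_000, amount_mean=4.5, account_age_mean=20, pin_rate=0.5)

# Approximate model scores with fixed weights and a sigmoid, mimicking logistic regression
weights = np.array([0.002, -0.03, -0.05, 0.1, -1.0])
for df in (ref_data, prod_data):
    z = df.to_numpy() @ weights
    df["model_score"] = 1 / (1 + np.exp(-z))

numeric_cols = ["transaction_amount", "account_age_in_months",
                "time_since_last_transaction", "transaction_count"]
categorical_cols = ["entered_pin"]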

As shown in the plot below:

  • Drifted features: Transaction Amount, Account Age, Transaction Count, and Entered PIN all experienced shifts in distribution, scale, or relationships.
Distribution of drifted features (image by author)
  • Stable feature: Time Since Last Transaction remained unchanged.
Distribution of stable feature (image by author)
  • Drifted scores: As a result of the drifted features, the distribution in model scores has also changed.
Distribution of model scores (image by author)

This setup allows us to analyze how feature drift impacts model scores in production.

Detecting model score drift using PSI

To monitor model scores, I used the Population Stability Index (PSI) to measure how much the model score distribution has shifted over time.

PSI works by binning continuous model scores and comparing the proportion of scores in each bin between the reference and production datasets. It compares the differences in proportions and their logarithmic ratios to compute a single summary statistic to quantify the drift.
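Concretely, if refᵢ and prodᵢ denote the proportions of scores falling into bin i, the statistic is:

PSI = Σᵢ (refᵢ − prodᵢ) × ln(refᵢ / prodᵢ)

which is exactly what the implementation below computes.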

Python implementation:

import numpy as np

# Define function to calculate PSI given two datasets
def calculate_psi(reference, production, bins=10):
  # Discretize scores into bins
  min_val, max_val = 0, 1
  bin_edges = np.linspace(min_val, max_val, bins + 1)

  # Calculate proportions in each bin
  ref_counts, _ = np.histogram(reference, bins=bin_edges)
  prod_counts, _ = np.histogram(production, bins=bin_edges)

  ref_proportions = ref_counts / len(reference)
  prod_proportions = prod_counts / len(production)
  
  # Avoid division by zero
  ref_proportions = np.clip(ref_proportions, 1e-8, 1)
  prod_proportions = np.clip(prod_proportions, 1e-8, 1)

  # Calculate PSI for each bin
  psi = np.sum((ref_proportions - prod_proportions) * np.log(ref_proportions / prod_proportions))

  return psi
  
# Calculate PSI
psi_value = calculate_psi(ref_data['model_score'], prod_data['model_score'], bins=10)
print(f"PSI Value: {psi_value}")

Below is a summary of how to interpret PSI values:

  • PSI < 0.1: No drift, or very minor drift (distributions are almost identical).
  • 0.1 ≤ PSI < 0.25: Some drift. The distributions are somewhat different.
  • 0.25 ≤ PSI < 0.5: Moderate drift. A noticeable shift between the reference and production distributions.
  • PSI ≥ 0.5: Significant drift. There is a large shift, indicating that the distribution in production has changed substantially from the reference data.
Histogram of model score distributions (image by author)

The PSI value of 0.6374 suggests a significant drift between our reference and production datasets. This aligns with the histogram of model score distributions, which visually confirms the shift towards higher scores in production — indicating an increase in risky transactions.

Detecting feature drift

Kolmogorov-Smirnov test for numeric features

The Kolmogorov-Smirnov (K-S) test is my preferred method for detecting drift in numeric features, because it is non-parametric, meaning it doesn’t assume a normal distribution.

The test compares a feature’s distribution in the reference and production datasets by measuring the maximum difference between the empirical cumulative distribution functions (ECDFs). The resulting K-S statistic ranges from 0 to 1:

  • 0 indicates no difference between the two distributions.
  • Values closer to 1 suggest a greater shift.

Python implementation:

import pandas as pd
from scipy.stats import ks_2samp

# Create an empty dataframe to store results
ks_results = pd.DataFrame(columns=['Feature', 'KS Statistic', 'p-value', 'Drift Detected'])

# Loop through all features and perform the K-S test
for col in numeric_cols:
    ks_stat, p_value = ks_2samp(ref_data[col], prod_data[col])
    drift_detected = p_value < 0.05

    # Store results in the dataframe
    ks_results = pd.concat([
        ks_results,
        pd.DataFrame({
            'Feature': [col],
            'KS Statistic': [ks_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

Below are ECDF charts of the four numeric features in our dataset:

ECDFs of four numeric features (image by author)

Let’s look at the account age feature as an example: the x-axis represents account age (0-50 months), while the y-axis shows the ECDF for both reference and production datasets. The production dataset skews towards newer accounts, as it has a larger proportion of observations with lower account ages.
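If you want to reproduce an ECDF comparison like this yourself, here is a minimal matplotlib sketch (assuming the ref_data and prod_data DataFrames from earlier):

import numpy as np
import matplotlib.pyplot as plt

def plot_ecdf(reference, production, feature_name):
    # Plot the empirical CDFs of one feature for the reference and production sets
    for values, label in [(reference, "reference"), (production, "production")]:
        x = np.sort(values)
        y = np.arange(1, len(x) + 1) / len(x)  # cumulative proportion of observations
        plt.step(x, y, where="post", label=label)
    plt.xlabel(feature_name)
    plt.ylabel("ECDF")
    plt.legend()
    plt.show()

plot_ecdf(ref_data["account_age_in_months"], prod_data["account_age_in_months"], "account_age_in_months")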

Chi-Square test for categorical features

To detect shifts in categorical and boolean features, I like to use the Chi-Square test.

This test compares the frequency distribution of a categorical feature in the reference and production datasets, and returns two values:

  • Chi-Square statistic: A higher value indicates a greater shift between the reference and production datasets.
  • P-value: A p-value below 0.05 suggests that the difference between the reference and production datasets is statistically significant, indicating potential feature drift.

Python implementation:

from scipy.stats import chi2_contingency

# Create empty dataframe with corresponding column names
chi2_results = pd.DataFrame(columns=['Feature', 'Chi-Square Statistic', 'p-value', 'Drift Detected'])

for col in categorical_cols:
    # Get normalized value counts for both reference and production datasets
    ref_counts = ref_data[col].value_counts(normalize=True)
    prod_counts = prod_data[col].value_counts(normalize=True)

    # Ensure all categories are represented in both
    all_categories = set(ref_counts.index).union(set(prod_counts.index))
    ref_counts = ref_counts.reindex(all_categories, fill_value=0)
    prod_counts = prod_counts.reindex(all_categories, fill_value=0)

    # Create contingency table
    contingency_table = np.array([ref_counts * len(ref_data), prod_counts * len(prod_data)])

    # Perform Chi-Square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    drift_detected = p_value < 0.05

    # Store results in chi2_results dataframe
    chi2_results = pd.concat([
        chi2_results,
        pd.DataFrame({
            'Feature': [col],
            'Chi-Square Statistic': [chi2_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

The Chi-Square statistic of 57.31 with a p-value of 3.72e-14 confirms a large shift in our categorical feature, Entered PIN. This finding aligns with the histogram below, which visually illustrates the shift:

Distribution of categorical feature (image by author)

Detecting multivariate shifts

Spearman Correlation for shifts in pairwise interactions

In addition to monitoring individual feature shifts, it’s important to track shifts in relationships or interactions between features, known as multivariate shifts. Even if the distributions of individual features remain stable, multivariate shifts can signal meaningful differences in the data.

By default, Pandas’ .corr() function calculates Pearson correlation, which only captures linear relationships between variables. However, relationships between features are often non-linear yet still follow a consistent trend.

To capture this, we use Spearman correlation to measure monotonic relationships between features. It captures whether features change together in a consistent direction, even if their relationship isn’t strictly linear.

To assess shifts in feature relationships, we compare:

  • Reference correlation (ref_corr): Captures historical feature relationships in the reference dataset.
  • Production correlation (prod_corr): Captures new feature relationships in production.
  • Absolute difference in correlation: Measures how much feature relationships have shifted between the reference and production datasets. Higher values indicate more significant shifts.

Python implementation:

# Calculate correlation matrices
ref_corr = ref_data.corr(method='spearman')
prod_corr = prod_data.corr(method='spearman')

# Calculate correlation difference
corr_diff = abs(ref_corr - prod_corr)

Example: Change in correlation

Now, let’s look at the correlation between transaction_amount and account_age_in_months:

  • In ref_corr, the correlation is 0.00095, indicating a weak relationship between the two features.
  • In prod_corr, the correlation is -0.0325, indicating a weak negative correlation.
  • Absolute difference in the Spearman correlation is 0.0335, which is a small but noticeable shift.

The absolute difference in correlation indicates a shift in the relationship between transaction_amount and account_age_in_months.

There used to be no relationship between these two features, but the production dataset indicates that there is now a weak negative correlation, meaning that newer accounts have higher transaction amounts. This is spot on!
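To surface pairs like this automatically rather than checking them one by one, you can flatten corr_diff and rank the largest shifts. A small sketch, with the 0.1 cutoff chosen purely for illustration:

# Keep only the upper triangle so each feature pair appears once
mask = np.triu(np.ones(corr_diff.shape, dtype=bool), k=1)
pairwise_shifts = corr_diff.where(mask).stack().sort_values(ascending=False)

print(pairwise_shifts.head(5))                  # largest correlation shifts
print(pairwise_shifts[pairwise_shifts > 0.1])   # pairs above an illustrative threshold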

Autoencoder for complex, high-dimensional multivariate shifts

In addition to monitoring pairwise interactions, we can also look for shifts across more dimensions in the data.

Autoencoders are powerful tools for detecting high-dimensional multivariate shifts, where multiple features collectively change in ways that may not be apparent from looking at individual feature distributions or pairwise correlations.

An autoencoder is a neural network that learns a compressed representation of data through two components:

  • Encoder: Compresses input data into a lower-dimensional representation.
  • Decoder: Reconstructs the original input from the compressed representation.

To detect shifts, we compare the reconstructed output to the original input and compute the reconstruction loss.

  • Low reconstruction loss → The autoencoder successfully reconstructs the data, meaning the new observations are similar to what it has seen and learned.
  • High reconstruction loss → The production data deviates significantly from the learned patterns, indicating potential drift.

Unlike traditional drift metrics that focus on individual features or pairwise relationships, autoencoders capture complex, non-linear dependencies across multiple variables simultaneously.

Python implementation:

from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ref_features = ref_data[numeric_cols + categorical_cols]
prod_features = prod_data[numeric_cols + categorical_cols]

# Normalize the data
scaler = StandardScaler()
ref_scaled = scaler.fit_transform(ref_features)
prod_scaled = scaler.transform(prod_features)

# Split reference data into train and validation
np.random.shuffle(ref_scaled)
train_size = int(0.8 * len(ref_scaled))
train_data = ref_scaled[:train_size]
val_data = ref_scaled[train_size:]

# Build autoencoder
input_dim = ref_features.shape[1]
encoding_dim = 3 
# Input layer
input_layer = Input(shape=(input_dim, ))
# Encoder
encoded = Dense(8, activation="relu")(input_layer)
encoded = Dense(encoding_dim, activation="relu")(encoded)
# Decoder
decoded = Dense(8, activation="relu")(encoded)
decoded = Dense(input_dim, activation="linear")(decoded)
# Autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train autoencoder
history = autoencoder.fit(
    train_data, train_data,
    epochs=50,
    batch_size=64,
    shuffle=True,
    validation_data=(val_data, val_data),
    verbose=0
)

# Calculate reconstruction error
ref_pred = autoencoder.predict(ref_scaled, verbose=0)
prod_pred = autoencoder.predict(prod_scaled, verbose=0)

ref_mse = np.mean(np.power(ref_scaled - ref_pred, 2), axis=1)
prod_mse = np.mean(np.power(prod_scaled - prod_pred, 2), axis=1)

The charts below show the distribution of reconstruction loss for both datasets.

Distribution of reconstruction loss between actuals and predictions (image by author)

The production dataset has a higher mean reconstruction error than the reference dataset, indicating a shift in the overall data. This aligns with the changes introduced in the production dataset, which contains a larger share of newer accounts making high-value transactions.
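To turn this into a monitorable signal, you could compare summary statistics of the two error distributions or track the share of production rows whose error exceeds a reference-based threshold. A small sketch, with the 95th-percentile cutoff as an arbitrary choice:

print(f"Mean reconstruction error - reference: {ref_mse.mean():.4f}, production: {prod_mse.mean():.4f}")

# Flag production rows whose reconstruction error exceeds the 95th percentile of reference errors
threshold = np.percentile(ref_mse, 95)
outlier_rate = (prod_mse > threshold).mean()
print(f"Share of production rows above reference threshold: {outlier_rate:.1%}")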

Summarizing

Model monitoring is an essential, yet often overlooked, responsibility for data scientists and machine learning engineers.

All the statistical methods led to the same conclusion, which aligns with the observed shifts in the data: they detected a trend in production towards newer accounts making higher-value transactions. This shift resulted in higher model scores, signaling an increase in potential fraud.

In this post, I covered techniques for detecting drift on three different levels:

  • Model score drift: Using Population Stability Index (PSI)
  • Individual feature drift: Using Kolmogorov-Smirnov test for numeric features and Chi-Square test for categorical features
  • Multivariate drift: Using Spearman correlation for pairwise interactions and autoencoders for high-dimensional, multivariate shifts.

These are just a few of the techniques I rely on for comprehensive monitoring — there are plenty of other equally valid statistical methods that can also detect drift effectively.

Detected shifts often point to underlying issues that warrant further investigation. The root cause could be as serious as a data collection bug, or as minor as a time change like daylight saving time adjustments.

There are also fantastic Python packages, like evidently.ai, that automate many of these comparisons. However, I believe there’s significant value in deeply understanding the statistical techniques behind drift detection, rather than relying solely on these tools.

What’s the model monitoring process like at places you’ve worked?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, and career advice for data professionals.



How to Fine-Tune DistilBERT for Emotion Classification
At every company I’ve worked at, the customer support team was drowning in an overwhelming volume of customer inquiries. Have you had similar experiences?

What if I told you that you could use AI to automatically identify, categorize, and even resolve the most common issues?

By fine-tuning a transformer model like BERT, you can build an automated system that tags tickets by issue type and routes them to the right team.

In this tutorial, I’ll show you how to fine-tune a transformer model for emotion classification in five steps:

  1. Set Up Your Environment: Prepare your dataset and install necessary libraries.
  2. Load and Preprocess Data: Parse text files and organize your data.
  3. Fine-Tune DistilBERT: Train the model to classify emotions using your dataset.
  4. Evaluate Performance: Use metrics like accuracy, F1-score, and confusion matrices to measure model performance.
  5. Interpret Predictions: Visualize and understand predictions using SHAP (SHapley Additive exPlanations).

By the end, you’ll have a fine-tuned model that classifies emotions from text inputs with high accuracy, and you’ll also learn how to interpret these predictions using SHAP.

This same approach can be applied to real-world use cases beyond emotion classification, such as customer support automation, sentiment analysis, content moderation, and more.

Let’s dive in!

Choosing the Right Transformer Model

When selecting a transformer model for Text Classification, here’s a quick breakdown of the most common models:

  • BERT: Great for general NLP tasks, but computationally expensive for both training and inference.
  • DistilBERT: 60% faster than BERT while retaining 97% of its capabilities, making it ideal for real-time applications.
  • RoBERTa: A more robust version of BERT, but requires more resources.
  • XLM-RoBERTa: A multilingual variant of RoBERTa trained on 100 languages. It is perfect for multilingual tasks, but is quite resource-intensive.

For this tutorial, I chose to fine-tune DistilBERT because it offers the best balance between performance and efficiency.

Step 1: Setup and Installing Dependencies

Ensure you have the required libraries installed:

!pip install datasets transformers torch scikit-learn shap

Step 2: Load and Preprocess Data

I used the Emotions dataset for NLP by Praveen Govi, available on Kaggle and licensed for commercial use. It contains text labeled with emotions. The data comes in three .txt files: train, validation, and test.

Each line contains a sentence and its corresponding emotion label, separated by a semicolon:

i didnt feel humiliated;sadness
i am feeling grouchy;anger
im updating my blog because i feel shitty;sadness

Parsing the Dataset into a Pandas DataFrame

Let’s load the dataset:

import pandas as pd

def parse_emotion_file(file_path):
    """
    Parses a text file with each line in the format: {text; emotion}
    and returns a pandas DataFrame with 'text' and 'emotion' columns.

    Args:
    - file_path (str): Path to the .txt file to be parsed

    Returns:
    - df (pd.DataFrame): DataFrame containing 'text' and 'emotion' columns
    """
    texts = []
    emotions = []
   
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            try:
                # Split each line by the semicolon separator
                text, emotion = line.strip().split(';')
               
                # append text and emotion to separate lists
                texts.append(text)
                emotions.append(emotion)
            except ValueError:
                continue
   
    return pd.DataFrame({'text': texts, 'emotion': emotions})

# Parse text files and store as Pandas DataFrames
train_df = parse_emotion_file("train.txt")
val_df = parse_emotion_file("val.txt")
test_df = parse_emotion_file("test.txt")

Understanding the Label Distribution

This dataset contains 16k training examples and 2k examples each for validation and testing. Here’s the label distribution breakdown:

Image by author.

The bar chart above shows that the dataset is imbalanced, with the majority of samples labeled as joy and sadness.

For fine-tuning a production model, I would consider experimenting with different sampling techniques to address this class imbalance and improve the model’s performance.
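Since the bar chart isn’t reproduced here, a quick way to inspect the label distribution yourself (assuming the train_df from above):

# Proportion of each emotion label in the training set
print(train_df["emotion"].value_counts(normalize=True).round(3))

# Or as a quick bar chart
train_df["emotion"].value_counts().plot(kind="bar")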

Step 3: Tokenization and Data Preprocessing

Next, I loaded in DistilBERT’s tokenizer:

from transformers import AutoTokenizer

# Define the model path for DistilBERT
model_name = "distilbert-base-uncased"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Then, I used it to tokenize text data and transform the labels into numerical IDs:

# Tokenize data
def preprocess_function(df, label2id):
    """
    Tokenizes text data and transforms labels into numerical IDs.

    Args:
        df (dict or pandas.Series): A dictionary-like object containing "text" and "emotion" fields.
        label2id (dict): A mapping from emotion labels to numerical IDs.

    Returns:
        dict: A dictionary containing:
              - "input_ids": Encoded token sequences
              - "attention_mask": Mask to indicate padding tokens
              - "label": Numerical labels for classification

    Example usage:
        train_dataset = train_dataset.map(lambda x: preprocess_function(x, label2id), batched=True)
    """
    tokenized_inputs = tokenizer(
        df["text"],
        padding="longest",
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )

    tokenized_inputs["label"] = [label2id.get(emotion, -1) for emotion in df["emotion"]]
    return tokenized_inputs
   
# Convert the DataFrames to Hugging Face Dataset format
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df)

# Apply the 'preprocess_function' to tokenize text data and transform labels
train_dataset = train_dataset.map(lambda x: preprocess_function(x, label2id), batched=True)
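The validation and test splits, which are used later for evaluation, would be converted and tokenized in the same way (a sketch, assuming the val_df and test_df DataFrames from Step 2 and the label2id mapping defined in Step 4 below):

val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

val_dataset = val_dataset.map(lambda x: preprocess_function(x, label2id), batched=True)
test_dataset = test_dataset.map(lambda x: preprocess_function(x, label2id), batched=True)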

Step 4: Fine-Tuning Model

Next, I loaded a pre-trained DistilBERT model with a classification head for our text classification task. I also specified what the labels for this dataset look like:

# Get the unique emotion labels from the 'emotion' column in the training DataFrame
labels = train_df["emotion"].unique()

# Create label-to-id and id-to-label mappings
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for idx, label in enumerate(labels)}

# Initialize model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)

The pre-trained DistilBERT model consists of six transformer layers plus a classification head.

To prevent overfitting, I froze the first five transformer layers, preserving the knowledge learned during pre-training. This allows the model to retain general language understanding while fine-tuning only the final transformer layer and the classification head to adapt to my dataset. Here’s how I did this:

# Freeze all base model parameters
for name, param in model.base_model.named_parameters():
    param.requires_grad = False

# Keep the final transformer layer and the classification head trainable
for name, param in model.named_parameters():
    if "transformer.layer.5" in name or "classifier" in name:
        param.requires_grad = True
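As a quick sanity check, you can confirm which parameter tensors remain trainable after freezing:

# List the parameter tensors that will be updated during fine-tuning
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(f"{len(trainable)} trainable parameter tensors, e.g. {trainable[:3]}")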

Defining Metrics

Given the label imbalance, I thought accuracy may not be the most appropriate metric, so I chose to include other metrics suited for classification problems like precision, recall, F1-score, and AUC score.

I also used “weighted” averaging for F1-score, precision, and recall to address the class imbalance problem. This parameter ensures that all classes contribute proportionally to the metric and prevents any single class from dominating the results:

import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

def compute_metrics(p):
    """
    Computes accuracy, F1 score, precision, and recall metrics for multiclass classification.

    Args:
    p (tuple): Tuple containing predictions and labels.

    Returns:
    dict: Dictionary with accuracy, F1 score, precision, and recall metrics, using weighted averaging
          to account for class imbalance in multiclass classification tasks.
    """
    logits, labels = p
   
    # Convert logits to probabilities using softmax (PyTorch)
    softmax = torch.nn.Softmax(dim=1)
    probs = softmax(torch.tensor(logits))
   
    # Convert logits to predicted class labels
    preds = probs.argmax(axis=1)

    return {
        "accuracy": accuracy_score(labels, preds),  # Accuracy metric
        "f1_score": f1_score(labels, preds, average='weighted'),  # F1 score with weighted average for imbalanced data
        "precision": precision_score(labels, preds, average='weighted'),  # Precision score with weighted average
        "recall": recall_score(labels, preds, average='weighted'),  # Recall score with weighted average
        "auc_score": roc_auc_score(labels, probs, average="macro", multi_class="ovr")
    }

Let’s set up the training process:

# Define hyperparameters
lr = 2e-5
batch_size = 16
num_epochs = 3
weight_decay = 0.01

# Set up training arguments for fine-tuning models
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=500,
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=weight_decay,
    logging_dir="./logs",
    logging_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1_score",
    greater_is_better=True,
)

# Initialize the Trainer with the model, arguments, and datasets
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
print(f"Training {model_name}...")
trainer.train()

Step 5: Evaluating Model Performance

After training, I evaluated the model’s performance on the test set:

# Generate predictions on the test dataset with fine-tuned model
predictions_finetuned_model = trainer.predict(test_dataset)
preds_finetuned = predictions_finetuned_model.predictions.argmax(axis=1)

# Compute evaluation metrics (accuracy, precision, recall, and F1 score)
eval_results_finetuned_model = compute_metrics((predictions_finetuned_model.predictions, test_dataset["label"]))
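The comparison below also includes the pre-trained base model, i.e. DistilBERT with a freshly initialized, untrained classification head. One way to produce that baseline, as a sketch (not necessarily the exact code used for the chart):

# Load the pre-trained base model with an untrained classification head
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)

# Reuse the Trainer machinery purely for prediction
base_trainer = Trainer(model=base_model, args=training_args, tokenizer=tokenizer)
predictions_base_model = base_trainer.predict(test_dataset)
eval_results_base_model = compute_metrics((predictions_base_model.predictions, test_dataset["label"]))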

This is how the fine-tuned DistilBERT model did on the test set compared to the pre-trained base model:

Radar chart of fine-tuned DistilBERT model. Image by author.

Before fine-tuning, the pre-trained model performed poorly on our dataset because it hadn’t seen the specific emotion labels before. It was essentially guessing at random, as reflected in an AUC score of 0.5, which indicates performance no better than chance.

After fine-tuning, the model significantly improved across all metrics, achieving 83% accuracy in correctly identifying emotions. This demonstrates that the model has successfully learned meaningful patterns in the data, even with just 16k training samples.

That’s amazing!

Step 6: Interpreting Predictions with SHAP

I tested the fine-tuned model on three sentences and here are the emotions that it predicted:

  1. “The thought of speaking in front of a large crowd makes my heart race, and I start to feel overwhelmed with anxiety.” → fear 😱
  2. “I can’t believe how disrespectful they were! I worked so hard on this project, and they just dismissed it without even listening. It’s infuriating!” → anger 😡
  3. “I absolutely love this new phone! The camera quality is amazing, the battery lasts all day, and it’s so fast. I couldn’t be happier with my purchase, and I highly recommend it to anyone looking for a new phone.” → joy 😀

Impressive, right?!

To understand how the model made its predictions, I used SHAP (SHapley Additive exPlanations) to visualize feature importance.

I started by creating an explainer:

import shap
from transformers import pipeline

# Build a pipeline object for predictions
preds = pipeline(
    "text-classification",
    model=model_finetuned,
    tokenizer=tokenizer,
    return_all_scores=True,
)

# Create an explainer
explainer = shap.Explainer(preds)
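The example_texts passed to the explainer below are simply the three sentences from above; since the variable isn’t defined in the snippets, here is the assumed setup:

example_texts = [
    "The thought of speaking in front of a large crowd makes my heart race, and I start to feel overwhelmed with anxiety.",
    "I can't believe how disrespectful they were! I worked so hard on this project, and they just dismissed it without even listening. It's infuriating!",
    "I absolutely love this new phone! The camera quality is amazing, the battery lasts all day, and it's so fast. I couldn't be happier with my purchase, and I highly recommend it to anyone looking for a new phone."
]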

Then, I computed SHAP values using the explainer:

# Compute SHAP values using explainer
shap_values = explainer(example_texts)

# Make SHAP text plot
shap.plots.text(shap_values)

The plot below visualizes how each word in the input text contributes to the model’s output using SHAP values:

SHAP text plot. Image by author.

In this case, the plot shows that “anxiety” is the most important factor in predicting “fear” as the emotion.

The SHAP text plot is a nice, intuitive, and interactive way to understand predictions by breaking down how much each word influences the final prediction.

Summary

You’ve successfully learned to fine-tune DistilBERT for emotion classification from text data! (You can check out the model on Hugging Face here).

Transformer models can be fine-tuned for many real-world applications, including:

  • Tagging customer service tickets (as discussed in the introduction),
  • Flagging mental health risks in text-based conversations,
  • Detecting sentiment in product reviews.

Fine-tuning is an effective and efficient way to adapt powerful pre-trained models to specific tasks with a relatively small dataset.

What will you fine-tune next?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, and career advice for data professionals.


Resources

  • Jupyter notebook [HERE]
  • Model card on Hugging Face [HERE]
