Statistics | Towards Data Science
https://towardsdatascience.com/tag/statistics/
The world’s leading publication for data science, AI, and ML professionals.

How to Spot and Prevent Model Drift Before it Impacts Your Business
https://towardsdatascience.com/how-to-spot-and-prevent-model-drift-before-it-impacts-your-business/
3 essential methods to track model drift you should know

Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. 

I’ve seen firsthand how undetected drifts can result in significant costs — missed fraud detection, lost revenue, and suboptimal business outcomes, just to name a few. So, it’s crucial to have robust monitoring in place if your company has deployed or plans to deploy machine learning models into production.

Undetected model drift can lead to significant financial losses, operational inefficiencies, and even damage to a company’s reputation. To mitigate these risks, it’s important to have effective model monitoring, which involves:

  • Tracking model performance
  • Monitoring feature distributions
  • Detecting both univariate and multivariate drifts

A well-implemented monitoring system can help identify issues early, saving considerable time, money, and resources.

In this comprehensive guide, I’ll provide a framework for how to think about and implement effective model monitoring, helping you stay ahead of potential issues and ensure the stability and reliability of your models in production.

What’s the difference between feature drift and score drift?

Score drift refers to a gradual change in the distribution of model scores. If left unchecked, this could lead to a decline in model performance, making the model less accurate over time.

On the other hand, feature drift occurs when one or more features experience changes in the distribution. These changes in feature values can affect the underlying relationships that the model has learned, and ultimately lead to inaccurate model predictions.

Simulating score shifts

To model real-world fraud detection challenges, I created a synthetic dataset with five financial transaction features.

The reference dataset represents the original distribution, while the production dataset introduces shifts to simulate an increase in high-value transactions without PIN verification on newer accounts, indicating an increase in fraud.

Each feature has different underlying distributions:

  • Transaction Amount: Log-normal distribution (right-skewed with a long tail)
  • Account Age (months): Clipped normal distribution between 0 and 60 (assuming a 5-year-old company)
  • Time Since Last Transaction: Exponential distribution
  • Transaction Count: Poisson distribution
  • Entered PIN: Binomial distribution.

To approximate model scores, I randomly assigned weights to these features and applied a sigmoid function to constrain predictions between 0 and 1. This mimics how a logistic regression fraud model generates risk scores.

As shown in the plot below:

  • Drifted features: Transaction Amount, Account Age, Transaction Count, and Entered PIN all experienced shifts in distribution, scale, or relationships.
Distribution of drifted features (image by author)
  • Stable feature: Time Since Last Transaction remained unchanged.
Distribution of stable feature (image by author)
  • Drifted scores: As a result of the drifted features, the distribution in model scores has also changed.
Distribution of model scores (image by author)

This setup allows us to analyze how feature drift impacts model scores in production.

Detecting model score drift using PSI

To monitor model scores, I used the population stability index (PSI) to measure how much the model score distribution has shifted over time.

PSI works by binning continuous model scores and comparing the proportion of scores in each bin between the reference and production datasets. It then combines the differences in proportions with their logarithmic ratios into a single summary statistic that quantifies the drift.

Python implementation:

import numpy as np

# Define function to calculate PSI given two datasets
def calculate_psi(reference, production, bins=10):
  # Discretize scores into bins
  min_val, max_val = 0, 1
  bin_edges = np.linspace(min_val, max_val, bins + 1)

  # Calculate proportions in each bin
  ref_counts, _ = np.histogram(reference, bins=bin_edges)
  prod_counts, _ = np.histogram(production, bins=bin_edges)

  ref_proportions = ref_counts / len(reference)
  prod_proportions = prod_counts / len(production)
  
  # Avoid division by zero
  ref_proportions = np.clip(ref_proportions, 1e-8, 1)
  prod_proportions = np.clip(prod_proportions, 1e-8, 1)

  # Calculate PSI for each bin
  psi = np.sum((ref_proportions - prod_proportions) * np.log(ref_proportions / prod_proportions))

  return psi
  
# Calculate PSI
psi_value = calculate_psi(ref_data['model_score'], prod_data['model_score'], bins=10)
print(f"PSI Value: {psi_value}")

Below is a summary of how to interpret PSI values:

  • PSI < 0.1: No drift, or very minor drift (distributions are almost identical).
  • 0.1 ≤ PSI < 0.25: Some drift. The distributions are somewhat different.
  • 0.25 ≤ PSI < 0.5: Moderate drift. A noticeable shift between the reference and production distributions.
  • PSI ≥ 0.5: Significant drift. There is a large shift, indicating that the distribution in production has changed substantially from the reference data.
Histogram of model score distributions (image by author)

The PSI value of 0.6374 suggests a significant drift between our reference and production datasets. This aligns with the histogram of model score distributions, which visually confirms the shift towards higher scores in production — indicating an increase in risky transactions.

Detecting feature drift

Kolmogorov-Smirnov test for numeric features

The Kolmogorov-Smirnov (K-S) test is my preferred method for detecting drift in numeric features, because it is non-parametric, meaning it doesn’t assume a normal distribution.

The test compares a feature’s distribution in the reference and production datasets by measuring the maximum difference between the empirical cumulative distribution functions (ECDFs). The resulting K-S statistic ranges from 0 to 1:

  • 0 indicates no difference between the two distributions.
  • Values closer to 1 suggest a greater shift.

Python implementation:

import pandas as pd
from scipy.stats import ks_2samp

# Create an empty dataframe to store the results
ks_results = pd.DataFrame(columns=['Feature', 'KS Statistic', 'p-value', 'Drift Detected'])

# Loop through all numeric features and perform the K-S test
for col in numeric_cols:
    ks_stat, p_value = ks_2samp(ref_data[col], prod_data[col])
    drift_detected = p_value < 0.05

    # Store results in the dataframe
    ks_results = pd.concat([
        ks_results,
        pd.DataFrame({
            'Feature': [col],
            'KS Statistic': [ks_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

Below are ECDF charts of the four numeric features in our dataset:

ECDFs of four numeric features (image by author)

Let’s look at the account age feature as an example: the x-axis represents account age (0-50 months), while the y-axis shows the ECDF for both reference and production datasets. The production dataset skews towards newer accounts, as it has a larger proportion of observations with lower account ages.

Chi-Square test for categorical features

To detect shifts in categorical and boolean features, I like to use the Chi-Square test.

This test compares the frequency distribution of a categorical feature in the reference and production datasets, and returns two values:

  • Chi-Square statistic: A higher value indicates a greater shift between the reference and production datasets.
  • P-value: A p-value below 0.05 suggests that the difference between the reference and production datasets is statistically significant, indicating potential feature drift.

Python implementation:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Create an empty dataframe with corresponding column names
chi2_results = pd.DataFrame(columns=['Feature', 'Chi-Square Statistic', 'p-value', 'Drift Detected'])

for col in categorical_cols:
    # Get normalized value counts for both reference and production datasets
    ref_counts = ref_data[col].value_counts(normalize=True)
    prod_counts = prod_data[col].value_counts(normalize=True)

    # Ensure all categories are represented in both
    all_categories = set(ref_counts.index).union(set(prod_counts.index))
    ref_counts = ref_counts.reindex(all_categories, fill_value=0)
    prod_counts = prod_counts.reindex(all_categories, fill_value=0)

    # Create contingency table
    contingency_table = np.array([ref_counts * len(ref_data), prod_counts * len(prod_data)])

    # Perform Chi-Square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    drift_detected = p_value < 0.05

    # Store results in chi2_results dataframe
    chi2_results = pd.concat([
        chi2_results,
        pd.DataFrame({
            'Feature': [col],
            'Chi-Square Statistic': [chi2_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

The Chi-Square statistic of 57.31 with a p-value of 3.72e-14 confirms a large shift in our categorical feature, Entered PIN. This finding aligns with the histogram below, which visually illustrates the shift:

Distribution of categorical feature (image by author)

Detecting multivariate shifts

Spearman Correlation for shifts in pairwise interactions

In addition to monitoring individual feature shifts, it’s important to track shifts in relationships or interactions between features, known as multivariate shifts. Even if the distributions of individual features remain stable, multivariate shifts can signal meaningful differences in the data.

By default, Pandas’ .corr() function calculates Pearson correlation, which only captures linear relationships between variables. However, relationships between features are often non-linear yet still follow a consistent trend.

To capture this, we use Spearman correlation to measure monotonic relationships between features. It captures whether features change together in a consistent direction, even if their relationship isn’t strictly linear.

To assess shifts in feature relationships, we compare:

  • Reference correlation (ref_corr): Captures historical feature relationships in the reference dataset.
  • Production correlation (prod_corr): Captures new feature relationships in production.
  • Absolute difference in correlation: Measures how much feature relationships have shifted between the reference and production datasets. Higher values indicate more significant shifts.

Python implementation:

# Calculate correlation matrices
ref_corr = ref_data.corr(method='spearman')
prod_corr = prod_data.corr(method='spearman')

# Calculate correlation difference
corr_diff = abs(ref_corr - prod_corr)
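Since corr_diff is a symmetric matrix, a quick way to surface the relationships that moved the most is to rank its upper triangle. The short sketch below builds on the corr_diff variable computed above (the ranking step itself is just an illustrative addition):

import numpy as np

# Keep only the upper triangle so each feature pair appears once
upper_mask = np.triu(np.ones(corr_diff.shape, dtype=bool), k=1)

# Rank feature pairs by how much their Spearman correlation changed
top_shifts = corr_diff.where(upper_mask).stack().sort_values(ascending=False)
print(top_shifts.head(5))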

Example: Change in correlation

Now, let’s look at the correlation between transaction_amount and account_age_in_months:

  • In ref_corr, the correlation is 0.00095, indicating a weak relationship between the two features.
  • In prod_corr, the correlation is -0.0325, indicating a weak negative correlation.
  • Absolute difference in the Spearman correlation is 0.0335, which is a small but noticeable shift.

The absolute difference in correlation indicates a shift in the relationship between transaction_amount and account_age_in_months.

There used to be no relationship between these two features, but the production dataset indicates that there is now a weak negative correlation, meaning that newer accounts have higher transaction amounts. This is spot on!

Autoencoder for complex, high-dimensional multivariate shifts

In addition to monitoring pairwise interactions, we can also look for shifts across more dimensions in the data.

Autoencoders are powerful tools for detecting high-dimensional multivariate shifts, where multiple features collectively change in ways that may not be apparent from looking at individual feature distributions or pairwise correlations.

An autoencoder is a neural network that learns a compressed representation of data through two components:

  • Encoder: Compresses input data into a lower-dimensional representation.
  • Decoder: Reconstructs the original input from the compressed representation.

To detect shifts, we compare the reconstructed output to the original input and compute the reconstruction loss.

  • Low reconstruction loss → The autoencoder successfully reconstructs the data, meaning the new observations are similar to what it has seen and learned.
  • High reconstruction loss → The production data deviates significantly from the learned patterns, indicating potential drift.

Unlike traditional drift metrics that focus on individual features or pairwise relationships, autoencoders capture complex, non-linear dependencies across multiple variables simultaneously.

Python implementation:

import numpy as np
from sklearn.preprocessing import StandardScaler
# Keras (via TensorFlow) is assumed here for the model-building layers
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ref_features = ref_data[numeric_cols + categorical_cols]
prod_features = prod_data[numeric_cols + categorical_cols]

# Normalize the data
scaler = StandardScaler()
ref_scaled = scaler.fit_transform(ref_features)
prod_scaled = scaler.transform(prod_features)

# Split reference data into train and validation
np.random.shuffle(ref_scaled)
train_size = int(0.8 * len(ref_scaled))
train_data = ref_scaled[:train_size]
val_data = ref_scaled[train_size:]

# Build autoencoder
input_dim = ref_features.shape[1]
encoding_dim = 3 
# Input layer
input_layer = Input(shape=(input_dim, ))
# Encoder
encoded = Dense(8, activation="relu")(input_layer)
encoded = Dense(encoding_dim, activation="relu")(encoded)
# Decoder
decoded = Dense(8, activation="relu")(encoded)
decoded = Dense(input_dim, activation="linear")(decoded)
# Autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train autoencoder
history = autoencoder.fit(
    train_data, train_data,
    epochs=50,
    batch_size=64,
    shuffle=True,
    validation_data=(val_data, val_data),
    verbose=0
)

# Calculate reconstruction error
ref_pred = autoencoder.predict(ref_scaled, verbose=0)
prod_pred = autoencoder.predict(prod_scaled, verbose=0)

ref_mse = np.mean(np.power(ref_scaled - ref_pred, 2), axis=1)
prod_mse = np.mean(np.power(prod_scaled - prod_pred, 2), axis=1)

The charts below show the distribution of reconstruction loss between both datasets.

Distribution of reconstruction loss between actuals and predictions (image by author)

The production dataset has a higher mean reconstruction error than the reference dataset, indicating a shift in the overall data. This aligns with the changes introduced in the production dataset, which contains a larger number of newer accounts with high-value transactions.
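One simple way to turn this comparison into an ongoing monitoring signal (an illustrative extension, not part of the setup above) is to derive an alert threshold from the reference reconstruction errors, for example the 95th percentile, and track how often production rows exceed it:

# Threshold based on the reference data's reconstruction errors
threshold = np.percentile(ref_mse, 95)

# Share of production rows that reconstruct worse than 95% of reference rows
flag_rate = np.mean(prod_mse > threshold)

print(f"Mean reconstruction error - reference: {ref_mse.mean():.4f}, production: {prod_mse.mean():.4f}")
print(f"Share of production rows above the reference 95th percentile: {flag_rate:.1%}")

If this share creeps well above the expected 5%, it is a hint that the production data no longer resembles what the autoencoder learned.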

Summarizing

Model monitoring is an essential, yet often overlooked, responsibility for data scientists and machine learning engineers.

All the statistical methods led to the same conclusion, which aligns with the observed shifts in the data: they detected a trend in production towards newer accounts making higher-value transactions. This shift resulted in higher model scores, signaling an increase in potential fraud.

In this post, I covered techniques for detecting drift on three different levels:

  • Model score drift: Using Population Stability Index (PSI)
  • Individual feature drift: Using Kolmogorov-Smirnov test for numeric features and Chi-Square test for categorical features
  • Multivariate drift: Using Spearman correlation for pairwise interactions and autoencoders for high-dimensional, multivariate shifts.

These are just a few of the techniques I rely on for comprehensive monitoring — there are plenty of other equally valid statistical methods that can also detect drift effectively.

Detected shifts often point to underlying issues that warrant further investigation. The root cause could be as serious as a data collection bug, or as minor as a time change like a daylight saving time adjustment.

There are also fantastic Python packages, like evidently.ai, that automate many of these comparisons. However, I believe there’s significant value in deeply understanding the statistical techniques behind drift detection, rather than relying solely on these tools.

What’s the model monitoring process like at places you’ve worked?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, career advice for professionals in data.


One-Tailed Vs. Two-Tailed Tests
https://towardsdatascience.com/one-tailed-vs-two-tailed-tests/
Choosing between one- and two-tailed hypotheses affects every stage of A/B testing. Learn why the hypothesis direction matters and explore the pros and cons of each approach.

Introduction

If you’ve ever analyzed data using built-in t-test functions, such as those in R or SciPy, here’s a question for you: have you ever adjusted the default setting for the alternative hypothesis? If your answer is no—or if you’re not even sure what this means—then this blog post is for you!

The alternative hypothesis parameter, commonly referred to as “one-tailed” versus “two-tailed” in statistics, defines the expected direction of the difference between control and treatment groups. In a two-tailed test, we assess whether there is any difference in mean values between the groups, without specifying a direction. A one-tailed test, on the other hand, posits a specific direction—whether the control group’s mean is either less than or greater than that of the treatment group.
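As a concrete illustration (a minimal sketch with made-up data, not taken from any real experiment), SciPy exposes this choice through the alternative argument of ttest_ind, which defaults to a two-tailed test:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=10.2, scale=2.0, size=500)

# Default: two-tailed test (a difference in either direction)
print(ttest_ind(treatment, control, alternative='two-sided'))

# One-tailed test: the treatment mean is hypothesized to be greater than the control mean
print(ttest_ind(treatment, control, alternative='greater'))

R users will find the same switch in t.test via its alternative argument.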

Choosing between one- and two-tailed hypotheses might seem like a minor detail, but it affects every stage of A/B testing: from test planning to data analysis and results interpretation. This article builds a theoretical foundation on why the hypothesis direction matters and explores the pros and cons of each approach.

One-tailed vs. two-tailed hypothesis testing: Understanding the difference

To understand the importance of choosing between one-tailed and two-tailed hypotheses, let’s briefly review the basics of the t-test, the commonly used method in A/B testing. Like other hypothesis testing methods, the t-test begins with a conservative assumption: there is no difference between the two groups (the null hypothesis). Only if we find strong evidence against this assumption can we reject the null hypothesis and conclude that the treatment has had an effect.

But what qualifies as “strong evidence”? To that end, a rejection region is determined under the null hypothesis and all results that fall within this region are deemed so unlikely that we take them as evidence against the feasibility of the null hypothesis. The size of this rejection region is based on a predetermined probability, known as alpha (α), which represents the likelihood of incorrectly rejecting the null hypothesis. 

What does this have to do with the direction of the alternative hypothesis? Quite a bit, actually. While the alpha level determines the size of the rejection region, the alternative hypothesis dictates its placement. In a one-tailed test, where we hypothesize a specific direction of difference, the rejection region is situated in only one tail of the distribution. For a hypothesized positive effect (e.g., that the treatment group mean is higher than the control group mean), the rejection region would lie in the right tail, creating a right-tailed test. Conversely, if we hypothesize a negative effect (e.g., that the treatment group mean is less than the control group mean), the rejection region would be placed in the left tail, resulting in a left-tailed test.

In contrast, a two-tailed test allows for the detection of a difference in either direction, so the rejection region is split between both tails of the distribution. This accommodates the possibility of observing extreme values in either direction, whether the effect is positive or negative.

To build intuition, let’s visualize how the rejection regions appear under the different hypotheses. Recall that according to the null hypothesis, the difference between the two groups should center around zero. Thanks to the central limit theorem, we also know this distribution approximates a normal distribution. Consequently, the rejection regions corresponding to the different alternative hypotheses look like this:
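To put numbers on that picture, here is a minimal sketch (an illustrative addition) that computes where the rejection regions begin for a standard normal test statistic at alpha = 0.05:

from scipy.stats import norm

alpha = 0.05

# Two-tailed: alpha is split across both tails
two_tailed_cutoffs = norm.ppf([alpha / 2, 1 - alpha / 2])

# Right-tailed: all of alpha sits in the right tail
right_tailed_cutoff = norm.ppf(1 - alpha)

print(f"Two-tailed rejection regions: z < {two_tailed_cutoffs[0]:.2f} or z > {two_tailed_cutoffs[1]:.2f}")
print(f"Right-tailed rejection region: z > {right_tailed_cutoff:.2f}")

The one-tailed cutoff (about 1.64) is less extreme than the two-tailed one (about 1.96), which is exactly why a one-tailed test has more power in its hypothesized direction.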

Why does it make a difference?

The choice of direction for the alternative hypothesis impacts the entire A/B testing process, starting with the planning phase—specifically, in determining the sample size. Sample size is calculated based on the desired power of the test, which is the probability of detecting a true difference between the two groups when one exists. To compute power, we examine the area under the alternative hypothesis that corresponds to the rejection region (since power reflects the ability to reject the null hypothesis when the alternative hypothesis is true).

Since the direction of the hypothesis affects the size of this rejection region, power is generally lower for a two-tailed hypothesis. This is due to the rejection region being divided across both tails, making it more challenging to detect an effect in any one direction. The following graph illustrates the comparison between the two types of hypotheses. Note that the purple area is larger for the one-tailed hypothesis, compared to the two-tailed hypothesis:

In practice, to maintain the desired power level, we compensate for the reduced power of a two-tailed hypothesis by increasing the sample size (Increasing sample size raises power, though the mechanics of this can be a topic for a separate article). Thus, the choice between one- and two-tailed hypotheses directly influences the required sample size for your test. 
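To see this effect on sample size directly, statsmodels’ power calculators accept the hypothesis direction as a parameter. The sketch below is illustrative only, assuming a small effect size (Cohen’s d = 0.2), 80% power, and an alpha of 5%:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required sample size per group for each hypothesis direction
n_two_tailed = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8, alternative='two-sided')
n_one_tailed = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8, alternative='larger')

print(f"Two-tailed: ~{n_two_tailed:.0f} per group, one-tailed: ~{n_one_tailed:.0f} per group")

In this configuration, the one-tailed design needs roughly a fifth fewer observations per group.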

Beyond the planning phase, the choice of alternative hypothesis directly impacts the analysis and interpretation of results. There are cases where a test may reach significance with a one-tailed approach but not with a two-tailed one, and vice versa. Reviewing the previous graph can help illustrate this: for example, a result in the left tail might be significant under a two-tailed hypothesis but not under a right one-tailed hypothesis. Conversely, certain results might fall within the rejection region of a right one-tailed test but lie outside the rejection area in a two-tailed test.

How to decide between a one-tailed and two-tailed hypothesis

Let’s start with the bottom line: there’s no absolute right or wrong choice here. Both approaches are valid, and the primary consideration should be your specific business needs. To help you decide which option best suits your company, we’ll outline the key pros and cons of each.

At first glance, a one-tailed alternative may appear to be the clear choice, as it often aligns better with business objectives. In industry applications, the focus is typically on improving specific metrics rather than exploring a treatment’s impact in both directions. This is especially relevant in A/B testing, where the goal is often to optimize conversion rates or enhance revenue. If the treatment doesn’t lead to a significant improvement, the examined change won’t be implemented.

Beyond this conceptual advantage, we have already mentioned one key benefit of a one-tailed hypothesis: it requires a smaller sample size. Thus, choosing a one-tailed alternative can save both time and resources. To illustrate this advantage, the following graphs show the required sample sizes for one- and two-tailed hypotheses with different power levels (alpha is set at 5%).

In this context, the decision between one- and two-tailed hypotheses becomes particularly important in sequential testing—a method that allows for ongoing data analysis without inflating the alpha level. Here, selecting a one-tailed test can significantly reduce the duration of the test, enabling faster decision-making, which is especially valuable in dynamic business environments where prompt responses are essential.

However, don’t be too quick to dismiss the two-tailed hypothesis! It has its own advantages. In some business contexts, the ability to detect “negative significant results” is a major benefit. As one client once shared, he preferred negative significant results over inconclusive ones because they offer valuable learning opportunities. Even if the outcome wasn’t as expected, he could conclude that the treatment had a negative effect and gain insights into the product.

Another benefit of two-tailed tests is their straightforward interpretation using confidence intervals (CIs). In two-tailed tests, a CI that doesn’t include zero directly indicates significance, making it easier for practitioners to interpret results at a glance. This clarity is particularly appealing since CIs are widely used in A/B testing platforms. Conversely, with one-tailed tests, a significant result might still include zero in the CI, potentially leading to confusion or mistrust in the findings. Although one-sided confidence intervals can be employed with one-tailed tests, this practice is less common.
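To see that correspondence in action, here is a small sketch with simulated data (purely illustrative) that builds a two-sided 95% confidence interval for the difference in means using a Welch-style standard error; if the interval excludes zero, the two-tailed test at alpha = 0.05 is significant:

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
control = rng.normal(10.0, 2.0, 400)
treatment = rng.normal(10.4, 2.0, 400)

diff = treatment.mean() - control.mean()
var_t, var_c = treatment.var(ddof=1), control.var(ddof=1)
se = np.sqrt(var_t / len(treatment) + var_c / len(control))

# Welch-Satterthwaite degrees of freedom
df = se**4 / ((var_t / len(treatment))**2 / (len(treatment) - 1)
              + (var_c / len(control))**2 / (len(control) - 1))

lower = diff - t.ppf(0.975, df) * se
upper = diff + t.ppf(0.975, df) * se
print(f"Difference in means: {diff:.3f}, 95% CI: ({lower:.3f}, {upper:.3f})")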

Conclusions

By adjusting a single parameter, you can significantly impact your A/B testing: specifically, the sample size you need to collect and the interpretation of the results. When deciding between one- and two-tailed hypotheses, consider factors such as the available sample size, the advantages of detecting negative effects, and the convenience of aligning confidence intervals (CIs) with hypothesis testing. Ultimately, this decision should be made thoughtfully, taking into account what best fits your business needs.

(Note: all the images in this post were created by the author)

I Won’t Change Unless You Do
https://towardsdatascience.com/i-wont-change-unless-you-do/
Game Theory 101: The Nash equilibrium

In game theory, how can players ever reach a final decision if there might still be a better option to switch to? Maybe one player still wants to change their decision. But if they do, maybe the other player wants to change too. How can they ever hope to escape from this vicious circle? The concept of a Nash equilibrium, which I will explain in this article, is game theory’s fundamental answer to this problem.

This article is the second part of a four-chapter series on game theory. If you haven’t checked out the first chapter yet, I’d encourage you to do that to get familiar with the main terms and concepts of game theory. If you did so, you are prepared for the next steps of our journey through game theory. Let’s go!

Finding the solution

Finding a solution to a game in game theory can be tricky sometimes. Photo by Mel Poole on Unsplash

We will now try to find a solution for a game in game theory. A solution is a set of actions where each player maximizes their utility and therefore behaves rationally. That does not necessarily mean that each player wins the game, but that they do the best they can do, given that they don’t know what the other players will do. Let’s consider the following game:

If you are unfamiliar with this matrix-notation, you might want to take a look back at Chapter 1 and refresh your memory. Do you remember that this matrix gives you the reward for each player given a specific pair of actions? For example, if player 1 chooses action Y and player 2 chooses action B, player 1 will get a reward of 1 and player 2 will get a reward of 3. 

Okay, what actions should the players decide on now? Player 1 does not know what player 2 will do, but they can still try to find out what the best action would be depending on player 2’s choice. If we compare the utilities of actions Y and Z (indicated by the blue and red boxes in the next figure), we notice something interesting: if player 2 chooses action A (first column of the matrix), player 1 will get a reward of 3 if they choose action Y and a reward of 2 if they choose action Z, so action Y is better in that case. But what happens if player 2 decides on action B (second column)? In that case, action Y gives a reward of 1 and action Z gives a reward of 0, so Y is better than Z again. And if player 2 chooses action C (third column), Y is still better than Z (reward of 2 vs. reward of 1). That means that player 1 should never use action Z, because action Y is always better.

We compare the rewards for player 1 for actions Y and Z.

With the aforementioned considerations, player 2 can anticipate that player 1 would never use action Z, and hence player 2 doesn’t have to care about the rewards that belong to action Z. This makes the game much smaller, because now there are only two options left for player 1, and this also helps player 2 decide on their action.

We found out that for player 1, Y is always better than Z, so we don’t consider Z anymore.

If we look at the truncated game, we see that for player 2, action B is always better than action A. If player 1 chooses X, action B (with a reward of 2) is better than action A (with a reward of 1), and the same applies if player 1 chooses action Y. Note that this would not be the case if action Z were still in the game. However, we already saw that action Z will never be played by player 1 anyway.

We compare the rewards for player 2 for actions A and B.

As a consequence, player 2 would never use action A. Now if player 1 anticipates that player 2 never uses action A, the game becomes smaller again and fewer options have to be considered.

We saw that for player 2, action B is always better than action A, so we don’t have to consider A anymore.

We can continue in a similar fashion and see that for player 1, X is now always better than Y (2>1 and 4>2). Finally, if player 1 chooses action X, player 2 will choose action B, which is better than C (2>0). In the end, only the action X (for player 1) and B (for player 2) are left. That is the solution of our game:

In the end, only one option remains, namely player 1 using X and player 2 using B.

It would be rational for player 1 to choose action X and for player 2 to choose action B. Note that we came to that conclusion without exactly knowing what the other player would do. We just anticipated that some actions would never be taken, because they are always worse than other actions. Such actions are called strictly dominated. For example, action Z is strictly dominated by action Y, because Y is always better than Z. 

The best answer

Scrabble is one of those games, where searching for the best answer can take ages. Photo by Freysteinn G. Jonsson on Unsplash

Such strictly dominated actions do not always exist, but there is a similar concept that is of importance for us, called a best answer. Say we know which action the other player chooses. In that case, deciding on an action becomes very easy: we just take the action that has the highest reward. If player 1 knew that player 2 chose action A, the best answer for player 1 would be Y, because Y has the highest reward in that column. Do you see how we always searched for the best answers before? For each possible action of the other player, we searched for the best answer if the other player chose that action. More formally, player i’s best answer to a given set of actions of all other players is the action of player i that maximizes the utility given the other players’ actions. Also be aware that a strictly dominated action can never be a best answer.

Let us come back to a game we introduced in the first chapter: The prisoners’ dilemma. What are the best answers here?

Prisoners’ dilemma

How should player 1 decide if player 2 confesses or denies? If player 2 confesses, player 1 should confess as well, because a reward of -3 is better than a reward of -6. And what happens if player 2 denies? In that case, confessing is better again, because it would give a reward of 0, which is better than a reward of -1 for denying. That means that for player 1, confessing is the best answer for both actions of player 2. Player 1 doesn’t have to worry about the other player’s actions at all but should always confess. Because of the game’s symmetry, the same applies to player 2. For them, confessing is also the best answer, no matter what player 1 does.

The Nash Equilibrium

The Nash equilibrium is somewhat like the master key that allows us to solve game-theoretic problems. Researchers were very happy when they found it. Photo by rc.xyz NFT gallery on Unsplash

If all players play their best answer, we have reached a solution of the game that is called a Nash equilibrium. This is a key concept in game theory, because of an important property: in a Nash equilibrium, no player has any reason to change their action, unless any other player does. That means all players are as happy as they can be in the situation, and they wouldn’t change, even if they could. Consider the prisoners’ dilemma from above: the Nash equilibrium is reached when both confess. In this case, no player would change their action without the other. They could both be better off if they changed their actions and decided to deny, but since they can’t communicate, they don’t expect any change from the other player, and so they don’t change themselves either.

You may wonder if there is always a single Nash equilibrium for each game. Let me tell you there can also be multiple ones, as in the Bach vs. Stravinsky game that we already got to know in Chapter 1:

Bach vs. Stravinsky

This game has two Nash equilibria: (Bach, Bach) and (Stravinsky, Stravinsky). In both scenarios, you can easily imagine that there is no reason for any player to change their action in isolation. If you sit in the Bach concert with your friend, you would not leave your seat to go to the Stravinsky concert alone, even if you favor Stravinsky over Bach. Likewise, the Bach fan wouldn’t leave the Stravinsky concert if that meant leaving their friend alone. In the remaining two scenarios, you would think differently though: if you were at the Stravinsky concert alone, you would want to get out of there and join your friend at the Bach concert. That is, you would change your action even if the other player doesn’t change theirs. This tells you that the scenario you were in was not a Nash equilibrium.

However, there can also be games that have no Nash equilibrium at all. Imagine you are a soccer goalkeeper facing a penalty kick. For simplicity, we assume you can jump to the left or to the right. The player of the opposing team can also shoot into the left or right corner, and we assume that you catch the ball if you choose the same corner as they do and that you don’t catch it if you choose opposite corners. We can display this game as follows:

A game matrix for a penalty shot.

You won’t find any Nash equilibrium here. Each scenario has a clear winner (reward 1) and a clear loser (reward -1), and hence one of the players will always want to change. If you jump to the right and catch the ball, your opponent will wish to change to the left corner. But then you again will want to change your decision, which will make your opponent choose the other corner again and so on.
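To tie these ideas together computationally, here is a small sketch (an illustrative addition, not from the original chapter) that brute-forces pure-strategy Nash equilibria by checking, for every action pair, whether either player could gain by deviating on their own. It uses the prisoners’ dilemma payoffs from above and the penalty-shot game just described:

def pure_nash_equilibria(payoffs):
    """Return all action pairs where no player gains by deviating unilaterally."""
    rows = sorted({a1 for a1, _ in payoffs})
    cols = sorted({a2 for _, a2 in payoffs})
    equilibria = []
    for a1 in rows:
        for a2 in cols:
            u1, u2 = payoffs[(a1, a2)]
            # Best payoff player 1 could get if player 2 stays at a2
            best_u1 = max(payoffs[(alt, a2)][0] for alt in rows)
            # Best payoff player 2 could get if player 1 stays at a1
            best_u2 = max(payoffs[(a1, alt)][1] for alt in cols)
            if u1 == best_u1 and u2 == best_u2:
                equilibria.append((a1, a2))
    return equilibria

# Prisoners' dilemma: (player 1 reward, player 2 reward)
prisoners_dilemma = {
    ('confess', 'confess'): (-3, -3),
    ('confess', 'deny'): (0, -6),
    ('deny', 'confess'): (-6, 0),
    ('deny', 'deny'): (-1, -1),
}

# Penalty shot: keeper (player 1) vs shooter (player 2); the keeper catches only if both pick the same corner
penalty_shot = {
    ('left', 'left'): (1, -1),
    ('left', 'right'): (-1, 1),
    ('right', 'left'): (-1, 1),
    ('right', 'right'): (1, -1),
}

print(pure_nash_equilibria(prisoners_dilemma))  # [('confess', 'confess')]
print(pure_nash_equilibria(penalty_shot))       # [] -- no pure-strategy equilibrium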

Summary

We learned about finding a point of balance, where nobody wants to change anymore. That is a Nash equilibrium. Photo by Eran Menashri on Unsplash

This chapter showed how to find solutions for games by using the concept of a Nash equilibrium. Let us summarize, what we have learned so far: 

  • A solution of a game in game theory maximizes every player’s utility or reward. 
  • An action is called strictly dominated if there is another action that is always better. In this case, it would be irrational to ever play the strictly dominated action.
  • The action that yields the highest reward given the actions taken by the other players is called a best answer.
  • A Nash equilibrium is a state where every player plays their best answer.
  • In a Nash equilibrium, no player wants to change their action unless any other player does. In that sense, Nash equilibria are optimal states.
  • Some games have multiple Nash equilibria and some games have none.

If you were saddened by the fact that there is no Nash equilibrium in some games, don’t despair! In the next chapter, we will introduce probabilities of actions and this will allow us to find more equilibria. Stay tuned!

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in English language could be this one:

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines
https://towardsdatascience.com/the-dangers-of-deceptive-data-confusing-charts-and-misleading-headlines/
A deep dive into the ways data can be used to misinform the masses

“You don’t have to be an expert to deceive someone, though you might need some expertise to reliably recognize when you are being deceived.”

When my co-instructor and I start our quarterly lesson on deceptive visualizations for the data visualization course we teach at the University of Washington, he emphasizes the point above to our students. With the advent of modern technology, developing pretty and convincing claims about data is easier than ever. Anyone can make something that seems passable, but contains oversights that render it inaccurate and even harmful. Furthermore, there are also malicious actors who actively want to deceive you, and who have studied some of the best ways to do it.

I often start this lecture with a bit of a quip, looking seriously at my students and asking two questions:

  1. “Is it a good thing if someone is gaslighting you?”
  2. After the general murmur of confusion followed by agreement that gaslighting is indeed bad, I ask the second question: “What’s the best way to ensure no one ever gaslights you?”

The students generally ponder that second question for a bit longer, before chuckling a bit and realizing the answer: It’s to learn how people gaslight in the first place. Not so you can take advantage of others, but so you can prevent others from taking advantage of you.

The same applies in the realm of misinformation and disinformation. People who want to mislead with data are empowered with a host of tools, from high-speed internet to social media to, most recently, generative AI and large language models. To protect yourself from being misled, you need to learn their tricks.

In this article, I’ve taken the key ideas from my data visualization course’s unit on deception–drawn from Alberto Cairo’s excellent book How Charts Lie–and broadened them into some general principles about deception and data. My hope is that you read it, internalize it, and take it with you to arm yourself against the onslaught of lies perpetuated by ill-intentioned people powered with data.

Humans Cannot Interpret Area

At least, not as well as we interpret other visual cues. Let’s illustrate this with an example. Say we have an extremely simple numerical data set; it’s one dimensional and consists of just two values: 50 and 100. One way to represent this visually is via the length of bars, as follows:

This is true to the underlying data. Length is a one-dimensional quantity, and we have doubled it in order to indicate a doubling of value. But what happens if we want to represent the same data with circles? Well, circles aren’t really defined by a length or width. One option is to double the radius:

Hmm. The first circle has a radius of 100 pixels, and the second has a radius of 50 pixels–so this is technically correct if we wanted to double the radius. However, because of the way that area is calculated (πr²), we’ve way more than doubled the area. So what if we tried just doing that, since it seems more visually accurate? Here is a revised version:

Now we have a different problem. The larger circle is mathematically twice the area of the smaller one, but it no longer looks that way. In other words, even though it is a visually accurate comparison of a doubled quantity, human eyes have difficulty perceiving it.

The issue here is trying to use area as a visual marker in the first place. It’s not necessarily wrong, but it is confusing. We’re increasing a one-dimensional value, but area is a two-dimensional quantity. To the human eye, it’s always going to be difficult to interpret accurately, especially when compared with a more natural visual representation like bars.
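The mismatch is easy to quantify with a quick back-of-the-envelope check:

import math

r_small = 50
area_small = math.pi * r_small**2

# Doubling the radius quadruples the area
print(math.pi * (2 * r_small)**2 / area_small)   # 4.0

# To merely double the area, the radius should only grow by a factor of sqrt(2)
r_double_area = r_small * math.sqrt(2)
print(r_double_area)                             # ~70.7
print(math.pi * r_double_area**2 / area_small)   # ~2.0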

Now, this may seem like it’s not a huge deal–but let’s take a look at what happens when you extend this to an actual data set. Below, I’ve pasted two images of charts I made in Altair (a Python-based visualization package). Each chart shows the maximum temperature (in Celsius) during the first week of 2012 in Seattle, USA. The first one uses bar lengths to make the comparison, and the second uses circle areas.

Which one makes it easier to see the differences? The legend helps in the second one, but if we’re being honest, it’s a lost cause. It is much easier to make precise comparisons with the bars, even in a setting where we have such limited data.
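For readers who want to recreate this comparison, here is a rough sketch of how such charts might be built in Altair. It assumes the standard seattle_weather sample dataset from vega_datasets, which may differ in details from the exact data behind the figures above:

import altair as alt
from vega_datasets import data

weather = data.seattle_weather()
first_week = weather[(weather['date'] >= '2012-01-01') & (weather['date'] <= '2012-01-07')]

# Comparison via bar length
bars = alt.Chart(first_week).mark_bar().encode(
    x='date:T',
    y='temp_max:Q',
)

# Comparison via circle area
circles = alt.Chart(first_week).mark_circle().encode(
    x='date:T',
    size='temp_max:Q',
)

bars | circles  # render the two charts side by side (e.g., in a notebook)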

Remember that the point of a visualization is to clarify data–to make hidden trends easier to see for the average person. To achieve this goal, it’s best to use visual cues that simplify the process of making that distinction.

Beware Political Headlines (In Any Direction)

There is a small trick question I sometimes ask my students on a homework assignment around the fourth week of class. The assignment mostly involves generating visualizations in Python–but for the last question, I give them a chart I myself generated accompanied by a single question:

Question: There is one thing egregiously wrong with the chart above, an unforgivable error in Data Visualization. What is it?

Most think it has something to do with the axes, marks, or some other visual aspect, often suggesting improvements like filling in the circles or making the axis labels more informative. Those are fine suggestions, but not the most pressing.

The most flawed trait (or lack thereof, rather) in the chart above is the missing title. A title is crucial to an effective data visualization. Without it, how are we supposed to know what this visualization is even about? As of now, we can only ascertain that it must vaguely have something to do with carbon dioxide levels across a span of years. That isn’t much.

Many folks, feeling this requirement is too stringent, argue that a visualization is often meant to be understood in context, as part of a larger article or press release or other accompanying piece of text. Unfortunately, this line of thinking is far too idealistic; in reality, a visualization must stand alone, because it will often be the only thing people look at–and in social media blow-up cases, the only thing that gets shared widely. As a result, it should have a title to explain itself.

Of course, the title of this very subsection tells you to be wary of such headlines. That is true. While they are necessary, they are a double-edged sword. Since visualization designers know viewers will pay attention to the title, ill-meaning ones can also use it to sway people in less-than-accurate directions. Let’s look at an example:

The above is a picture shared by the White House’s public Twitter account in 2017. The picture is also referenced by Alberto Cairo in his book, which emphasizes many of the points I will now make.

First things first. The word “chain migration,” referring to what is formally known as family-based migration (where an immigrant may sponsor family members to come to the United States), has been criticized by many who argue that it is needlessly aggressive and makes legal immigrants sound threatening for no reason.

Of course, politics is by its very nature divisive, and it is possible for any side to make a heated argument. The primary issue here is actually a data-related one–specifically, what the use of the word “chain” implies in the context of the chart shared with the tweet. “Chain” migration seems to indicate that people can immigrate one after the other, in a seemingly endless stream, uninhibited and unperturbed by the distance of family relations. The reality, of course, is that a single immigrant can mostly just sponsor immediate family members, and even that takes quite a bit of time. But when one reads the phrase “chain migration” and then immediately looks at a seemingly sensible chart depicting it, it is easy to believe that an individual can in fact spawn additional immigrants at a base-3 exponential growth rate.

That is the issue with any kind of political headline–it makes it far too easy to conceal dishonest, inaccurate workings with actual data processing, analysis, and visualization.

There is no data underlying the chart above. None. Zero. It is completely random, and that is not okay for a chart that is purposefully made to appear as if it is showing something meaningful and quantitative.

As a fun little rabbit hole to go down which highlights the dangers of political headlining within data, here is a link to FloorCharts, a Twitter account that posts the most absurd graphics shown on the U.S. Congress floor.

Don’t Use 3D. Please.

I’ll end this article on a slightly lighter topic–but still an important one. Under no circumstances–none at all–should you ever utilize a 3D chart. And if you’re in the shoes of the viewer–that is, if you’re looking at a 3D pie chart made by someone else–don’t trust it.

The reason for this is simple, and connects back to what I discussed with circles and rectangles: a third dimension severely distorts the actuality behind what are usually one-dimensional measures. Area was already hard to interpret–how well do you really think the human eye does with volume?

Here is a 3D pie chart I generated with random numbers:

Now, here is the exact same pie chart, but in two dimensions:

Notice how the blue is not quite as dominant as the 3D version seems to suggest, and that the red and orange are closer to one another in size than originally portrayed. I also removed the percentage labels intentionally (technically bad practice) in order to emphasize how even with the labels present in the first one, our eyes automatically pay more attention to the more drastic visual differences. If you’re reading this article with an analytical eye, perhaps you think it doesn’t make that much of a difference. But the fact is, you’ll often see such charts in the news or on social media, and a quick glance is all they’ll ever get.

It is important to ensure that the story told by that quick glance is a truthful one.

Final Thoughts

Data science is often touted as the perfect synthesis of statistics, computing, and society, a way to obtain and share deep and meaningful insights about an information-heavy world. This is true–but as the capacity to widely share such insights expands, so must our general ability to interpret them accurately. It is my hope that in light of that, you have found this primer to be helpful.

Stay tuned for Part 2, in which I’ll talk about a few deceptive techniques a bit more involved in nature–including base proportions, (un)trustworthy statistical measures, and measures of correlation.

In the meantime, try not to get deceived.

Do European M&Ms Actually Taste Better than American M&Ms?
https://towardsdatascience.com/do-european-mms-actually-taste-better-than-american-mms/
An overly-enthusiastic application of science and data visualization to a question we’ve all been asking

(Oh, I am the only one who’s been asking this question…? Hm. Well, if you have a minute, please enjoy this exploratory data analysis — featuring experimental design, statistics, and interactive visualization — applied a bit too earnestly to resolve an international debate.)

1. Introduction

1.1 Background and motivation

Chocolate is enjoyed around the world. From ancient practices harvesting organic cacao in the Amazon basin, to chocolatiers sculpting edible art in the mountains of Switzerland, and enormous factories in Hershey, Pennsylvania churning out 70 million Kisses per day, the nuanced forms and flavors of chocolate have been integrated into many cultures and their customs. While quality can vary greatly across chocolate products, M&Ms are a well-known, shelf-stable, easily shareable form of chocolate. Readily found by convenience store check-out counters and in hotel vending machines, the brightly colored pellets are a popular treat whose packaging is re-branded to fit nearly any commercializable American holiday.

While living in Denmark in 2022, I heard a concerning claim: M&Ms manufactured in Europe taste different, and arguably “better,” than M&Ms produced in the United States. While I recognized that fancy European chocolate is indeed quite tasty and often superior to American chocolate, it was unclear to me if the same claim should hold for M&Ms. I learned that many Europeans perceive an “unpleasant” or “tangy” taste in American chocolate, which is largely attributed to butyric acid, a compound resulting from differences in how milk is treated before incorporation into milk chocolate.

But honestly, how much of a difference could this make for M&Ms? M&Ms!? I imagined M&Ms would retain a relatively processed/mass-produced/cheap candy flavor wherever they were manufactured. As the lone American visiting a diverse lab of international scientists pursuing cutting-edge research in biosustainability, I was inspired to break out my data science toolbox and investigate this M&M flavor phenomenon.

1.2 Previous work

To quote a European woman, who shall remain anonymous, after she tasted an American M&M while traveling in New York:

“They taste so gross. Like vomit. I don’t understand how people can eat this. I threw the rest of the bag away.”

Vomit? Really? In my experience, children raised in the United States had no qualms about eating M&Ms. Growing up, I was accustomed to bowls of M&Ms strategically placed in high traffic areas around my house to provide readily available sugar. Clearly American M&Ms are edible. But are they significantly different and/or inferior to their European equivalent?

In response to the anonymous European woman’s scathing report, myself and two other Americans visiting Denmark sampled M&Ms purchased locally in the Lyngby Storcenter Føtex. We hoped to experience the incredible improvement in M&M flavor that was apparently hidden from us throughout our youths. But curiously, we detected no obvious flavor improvements.

Unfortunately, neither preliminary study was able to conduct a side-by-side taste test with proper controls and randomized M&M sampling. Thus, we turn to science.

1.3 Study Goals

This study seeks to remedy the previous lack of thoroughness and investigate the following questions:

  1. Is there a global consensus that European M&Ms are in fact better than American M&Ms?
  2. Can Europeans actually detect a difference between M&Ms purchased in the US vs in Europe when they don’t know which one they are eating? Or is this a grand, coordinated lie amongst Europeans to make Americans feel embarrassed?
  3. Are Americans actually taste-blind to American vs European M&Ms? Or can they taste a difference but simply don’t describe this difference as “an improvement” in flavor?
  4. Can these alleged taste differences be perceived by citizens of other continents? If so, do they find one flavor obviously superior?

2. Methods

2.1 Experimental design and data collection

Participants were recruited by luring — er, inviting them to a social gathering (with the promise of free food) that was conveniently co-located with the testing site. Once a participant agreed to pause socializing and join the study, they were positioned at a testing station with a trained experimenter who guided them through the following steps:

  • Participants sat at a table and received two cups: 1 empty and 1 full of water. With one cup in each hand, the participant was asked to close their eyes, and keep them closed through the remainder of the experiment.
  • The experimenter randomly extracted one M&M with a spoon, delivered it to the participant’s empty cup, and the participant was asked to eat the M&M (eyes still closed).
  • After eating each M&M, the experimenter collected the taste response by asking the participant to report if they thought the M&M tasted: Especially Good, Especially Bad, or Normal.
  • Each participant received a total of 10 M&Ms (5 European, 5 American), one at a time, in a random sequence determined by random.org.
  • Between eating each M&M, the participant was asked to take a sip of water to help “cleanse their palate.”
  • Data collected: for each participant, the experimenter recorded the participant’s continent of origin (if this was ambiguous, the participant was asked to list the continent on which they have the strongest memories of eating candy as a child). For each of the 10 M&Ms delivered, the experimenter recorded the M&M origin (“Denmark” or “USA”), the M&M color, and the participant’s taste response. Experimenters were also encouraged to jot down any amusing phrases uttered by the participant during the test, recorded under notes (data available here).

2.2 Sourcing materials and recruiting participants

Two bags of M&Ms were purchased for this study. The American-sourced M&Ms (“USA M&M”) were acquired at the SFO airport and delivered by the author’s parents, who visited her in Denmark. The European-sourced M&Ms (“Denmark M&M”) were purchased at a local Føtex grocery store in Lyngby, a little north of Copenhagen.

Experiments were conducted at two main time points. The first 14 participants were tested in Lyngby, Denmark in August 2022. They mostly consisted of friends and housemates the author met at the Novo Nordisk Foundation Center for Biosustainability at the Technical University of Denmark (DTU) who came to a “going away party” into which the experimental procedure was inserted. A few additional friends and family who visited Denmark were also tested during their travels (e.g. on the train).

The remaining 37 participants were tested in Seattle, WA, USA in October 2022, primarily during a “TGIF happy hour” hosted by graduate students in the computer science PhD program at the University of Washington. This second batch mostly consisted of students and staff of the Paul G. Allen School of Computer Science & Engineering (UW CSE) who responded to the weekly Friday summoning to the Allen Center atrium for free snacks and drinks.

Figure 1. Distribution of participants recruited to the study. In the first sampling event in Lyngby, participants primarily hailed from North America and Europe, and a few additionally came from Asia, South America, or Australia. Our second sampling event in Seattle greatly increased participants, primarily from North America and Asia, and a few more from Europe. Neither event recruited participants from Africa. Figure made with Altair.

While this study set out to analyze global trends, unfortunately data was only collected from 51 participants the author was able to lure to the study sites and is neither well-balanced nor representative of the 6 inhabited continents of Earth (Figure 1). We hope to improve our recruitment tactics in future work. For now, our analytical power with this dataset is limited to response trends for individuals from North America, Europe, and Asia, highly biased by subcommunities the author happened to engage with in late 2022.

2.3 Risks

While we did not acquire formal approval for experimentation with human test subjects, there were minor risks associated with this experiment: participants were warned that they may be subjected to increased levels of sugar and possible “unpleasant flavors” as a result of participating in this study. No other risks were anticipated.

After the experiment however, we unfortunately observed several cases of deflated pride when a participant learned their taste response was skewed more positively towards the M&M type they were not expecting. This pride deflation seemed most severe among European participants who learned their own or their fiancé’s preference skewed towards USA M&Ms, though this was not quantitatively measured and cannot be confirmed beyond anecdotal evidence.

3. Results & Discussion

3.1 Overall response to “USA M&Ms” vs “Denmark M&Ms”

3.1.1 Categorical response analysis — entire dataset

In our first analysis, we count the total number of “Bad”, “Normal”, and “Good” taste responses and report the percentage of each response received by each M&M type. M&Ms from Denmark more frequently received “Good” responses than USA M&Ms but also more frequently received “Bad” responses. M&Ms from the USA were most frequently reported to taste “Normal” (Figure 2). This may result from the elevated number of participants hailing from North America, where the USA M&M is the default and thus more “Normal,” while the Denmark M&M was more often perceived as better or worse than the baseline.

Figure 2. Qualitative taste response distribution across the whole dataset. The percentage of taste responses for “Bad”, “Normal” or “Good” was calculated for each type of M&M. Figure made with Altair.

Now let’s break out some Statistics, such as a chi-squared (X2) test to compare our observed distributions of categorical taste responses. Using the scipy.stats chi2_contingency function, we built contingency tables of the observed counts of “Good,” “Normal,” and “Bad” responses to each M&M type. Using the X2 test to evaluate the null hypothesis that there is no difference between the two M&Ms, we found the p-value for the test statistic to be 0.0185, which is significant at the common p-value cut off of 0.05, but not at 0.01. So a solid “maybe,” depending on whether you’d like this result to be significant or not.
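A sketch of how this contingency-table test looks in code (the counts below are placeholders, not the real tallies, which live in the dataset linked in Section 2.1):

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are M&M types, columns are Bad/Normal/Good counts.
# The real counts come from the dataset linked in Section 2.1; these values are made up.
observed = [
    [60, 120, 75],   # Denmark M&Ms: Bad, Normal, Good
    [45, 150, 60],   # USA M&Ms:     Bad, Normal, Good
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"X2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```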

3.1.2 Quantitative response analysis — entire dataset.

The X2 test helps evaluate if there is a difference in categorical responses, but next, we want to determine a relative taste ranking between the two M&M types. To do this, we converted taste responses to a quantitative distribution and calculated a taste score. Briefly, “Bad” = 1, “Normal” = 2, “Good” = 3. For each participant, we averaged the taste scores across the 5 M&Ms they tasted of each type, maintaining separate taste scores for each M&M type.
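For readers following along in code, the conversion and per-participant averaging might look like this (a minimal pandas sketch; the column names are assumptions, not necessarily those used in the published dataset):

```python
import pandas as pd

# Hypothetical long-format records: one row per M&M tasted (column names are assumptions)
df = pd.DataFrame({
    "participant_id": [1, 1, 1, 1, 2, 2],
    "mnm_origin":     ["USA", "Denmark", "USA", "Denmark", "USA", "Denmark"],
    "taste_response": ["Normal", "Good", "Good", "Bad", "Normal", "Good"],
})

# Map categorical responses to scores, then average per participant and M&M type
score_map = {"Bad": 1, "Normal": 2, "Good": 3}
df["taste_score"] = df["taste_response"].map(score_map)

avg_scores = (
    df.groupby(["participant_id", "mnm_origin"])["taste_score"]
      .mean()
      .unstack("mnm_origin")
)
print(avg_scores)
```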

Figure 3. Quantitative taste score distributions across the whole dataset. Kernel density estimation of the average taste score calculated for each participant for each M&M type. Figure made with Seaborn.

With the average taste score for each M&M type in hand, we turn to scipy.stats ttest_ind (“T-test”) to evaluate if the means of the USA and Denmark M&M taste scores are different (the null hypothesis being that the means are identical). If the means are significantly different, it would provide evidence that one M&M is perceived as significantly tastier than the other.
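A sketch of that comparison (the per-participant averages below are made up; the real ones come from the scoring step above):

```python
from scipy.stats import ttest_ind

# Hypothetical per-participant average taste scores for each M&M type (placeholder values)
usa_avg_scores     = [2.0, 2.2, 1.8, 2.4, 2.0, 2.6, 1.6]
denmark_avg_scores = [2.2, 1.8, 2.6, 2.0, 2.4, 1.4, 2.8]

t_stat, p_value = ttest_ind(usa_avg_scores, denmark_avg_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```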

We found the average taste scores for USA M&Ms and Denmark M&Ms to be quite close (Figure 3), and not significantly different (T-test: p = 0.721). Thus, across all participants, we do not observe a difference between the perceived taste of the two M&M types (or if you enjoy parsing triple negatives: “we cannot reject the null hypothesis that there is not a difference”).

But does this change if we separate participants by continent of origin?

3.2 Continent-specific responses to “USA M&Ms” vs “Denmark M&Ms”

We repeated the above X2 and T-test analyses after grouping participants by their continents of origin. The Australia and South America groups were combined as a minimal attempt to preserve data privacy. Due to the relatively small sample size of even the combined Australia/South America group (n=3), we will refrain from analyzing trends for this group but include the data in several figures for completeness and enjoyment of the participants who may eventually read this.

3.2.1 Categorical response analysis — by continent

In Figure 4, we display both the taste response counts (upper panel, note the interactive legend) and the response percentages (lower panel) for each continent group. Both North America and Asia follow a similar trend to the whole population dataset: participants report Denmark M&Ms as “Good” more frequently than USA M&Ms, but also report Denmark M&Ms as “Bad” more frequently. USA M&Ms were most frequently reported as “Normal” (Figure 4).

On the contrary, European participants report USA M&Ms as “Bad” nearly 50% of the time and “Good” only 18% of the time, which is the most negative and least positive response pattern, respectively (when excluding the under-sampled Australia/South America group).

Figure 4. Qualitative taste response distribution by continent. Upper panel: counts of taste responses — click the legend to interactively filter! Lower panel: percentage of taste responses for each type of M&M. Figure made with Altair.

This appeared striking in bar chart form; however, only North America had a significant X2 p-value (p = 0.0058) when evaluating each continent’s difference in taste response profile between the two M&M types. The European p-value is perhaps “approaching significance” in some circles, but we’re about to accumulate several more hypothesis tests and should be mindful of multiple hypothesis testing (Table 1). A false positive result here would be devastating.

When comparing the taste response profiles between two continents for the same M&M type, there are a couple interesting notes. First, we observed no major taste discrepancies between all pairs of continents when evaluating Denmark M&Ms — the world seems generally consistent in their range of feelings about M&Ms sourced from Europe (right column X2 p-values, Table 2). To visualize this comparison more easily, we reorganize the bars in Figure 4 to group them by M&M type (Figure 5).

Figure 5. Qualitative taste response distribution by M&M type, reported as percentages. (Same data as Figure 4 but re-arranged). Figure made with Altair.

However, when comparing continents to each other in response to USA M&Ms, we see larger discrepancies. We found one pairing to be significantly different: European and North American participants evaluated USA M&Ms very differently (p = 0.000007) (Table 2). It seems very unlikely that this observed difference is by random chance (left column, Table 2).

3.2.2 Quantitative response analysis — by continent

We again convert the categorical profiles to quantitative distributions to assess continents’ relative preference of M&M types. For North America, we see that the taste score means of the two M&M types are actually quite similar, but there is a higher density around “Normal” scores for USA M&Ms (Figure 6A). The European distributions maintain a bit more of a separation in their means (though not quite significantly so), with USA M&Ms scoring lower (Figure 6B). The taste score distributions of Asian participants are the most similar (Figure 6C).

Reorienting to compare the quantitative means between continents’ taste scores for the same M&M type, only the comparison between North American and European participants on USA M&Ms is significantly different based on a T-test (p = 0.001) (Figure 6D), though now we really are in danger of multiple hypothesis testing! Be cautious if you are taking this analysis at all seriously.

Figure 6. Quantitative taste score distributions by continent. Kernel density estimation of the average taste score calculated for each continent for each M&M type. A. Comparison of North America responses to each M&M. B. Comparison of Europe responses to each M&M. C. Comparison of Asia responses to each M&M. D. Comparison of continents for USA M&Ms. E. Comparison of continents for Denmark M&Ms. Figure made with Seaborn.

At this point, I feel myself considering that maybe Europeans are not just making this up. I’m not saying it’s as dramatic as some of them claim, but perhaps a difference does indeed exist… To some degree, North American participants also perceive a difference, but the evaluation of Europe-sourced M&Ms is not consistently positive or negative.

3.3 M&M taste alignment chart

In our analyses thus far, we did not account for the baseline differences in M&M appreciation between participants. For example, say Person 1 scored all Denmark M&Ms as “Good” and all USA M&Ms as “Normal”, while Person 2 scored all Denmark M&Ms as “Normal” and all USA M&Ms as “Bad.” They would have the same relative preference for Denmark M&Ms over USA M&Ms, but Person 2 perhaps just does not enjoy M&Ms as much as Person 1, and the relative preference signal is muddled by averaging the raw scores.

Inspired by the Lawful/Chaotic x Good/Evil alignment chart used in tabletop role playing games like Dungeons & Dragons©™, in Figure 7, we establish an M&M alignment chart to help determine the distribution of participants across M&M enjoyment classes.

Figure 7. M&M enjoyment alignment chart. The x-axis represents a participant’s average taste score for USA M&Ms; the y-axis is a participant’s average taste score for Denmark M&Ms. Figure made with Altair.

Notably, the upper right quadrant where both M&M types are perceived as “Good” to “Normal” is mostly occupied by North American participants and a few Asian participants. All European participants land in the left half of the figure where USA M&Ms are “Normal” to “Bad”, but Europeans are somewhat split between the upper and lower halves, where perceptions of Denmark M&Ms range from “Good” to “Bad.”

An interactive version of Figure 7 is provided below for the reader to explore the counts of various M&M alignment regions.

Figure 7 (interactive): click and brush your mouse over the scatter plot to see the counts of continents in different M&M enjoyment regions. Figure made with Altair.

3.4 Participant taste response ratio

Next, to factor out baseline M&M enjoyment and focus on participants’ relative preference between the two M&M types, we took the log ratio of each person’s USA M&M taste score average divided by their Denmark M&M taste score average.

Equation 1: Equation to calculate each participant’s overall M&M preference ratio.
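Written out (a reconstruction from the description in the preceding paragraph, so the notation may differ from the original equation image):

\[
\text{preference ratio} = \log\left(\frac{\overline{\text{USA taste score}}}{\overline{\text{Denmark taste score}}}\right)
\]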

As such, positive scores indicate a preference towards USA M&Ms while negative scores indicate a preference towards Denmark M&Ms.

On average, European participants had the strongest preference towards Denmark M&Ms, with Asians also exhibiting a slight preference towards Denmark M&Ms (Figure 8). To the two Europeans who exhibited deflated pride upon learning their slight preference towards USA M&Ms, fear not: you did not think USA M&Ms were “Good,” but simply ranked them as less bad than Denmark M&Ms (see participant_id 4 and 17 in the interactive version of Figure 7). If you assert that M&Ms are a bad American invention not worth replicating and return to consuming artisanal European chocolate, your honor can likely be restored.

Figure 8. Distribution of participant M&M preference ratios by continent. Preference ratios are calculated as in Equation 1. Positive numbers indicate a relative preference for USA M&Ms, while negative indicate a relative preference for Denmark M&Ms. Figure made with Seaborn.

North American participants are pretty split in their preference ratios: some fall quite neutrally around 0, others strongly prefer the familiar USA M&M, while a handful moderately prefer Denmark M&Ms. Anecdotally, North Americans who learned their preference skewed towards European M&Ms displayed signals of inflated pride, as if their results signaled posh refinement.

Overall, a T-test comparing the distributions of M&M preference ratios shows a possibly significant difference in the means between European and North American participants (p = 0.049), but come on, this is like the 20th p-value I’ve reported — this one is probably too close to call.

3.5 Taste inconsistency and “Perfect Classifiers”

For each participant, we assessed their taste score consistency by averaging the standard deviations of their responses to each M&M type, and plotting that against their preference ratio (Figure 9).

Figure 9. Participant taste consistency by preference ratio. The x-axis is a participant’s relative M&M preference ratio. The y-axis is the average of the standard deviation of their USA M&M scores and the standard deviation of their Denmark M&M scores. A value of 0 on the y-axis indicates perfect consistency in responses, while higher values indicate more inconsistent responses. Figure made with Altair.
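A sketch of this consistency metric, continuing the hypothetical column names and 1/2/3 scoring from the earlier sketches:

```python
import pandas as pd

# Hypothetical long-format data with taste scores already mapped (Bad=1, Normal=2, Good=3)
df = pd.DataFrame({
    "participant_id": [1] * 10 + [2] * 10,
    "mnm_origin":     (["USA"] * 5 + ["Denmark"] * 5) * 2,
    "taste_score":    [2, 2, 3, 2, 2, 3, 1, 3, 2, 3,
                       2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
})

# Standard deviation of scores per participant and M&M type, averaged over the two types
consistency = (
    df.groupby(["participant_id", "mnm_origin"])["taste_score"]
      .std()
      .groupby(level="participant_id")
      .mean()
)
print(consistency)  # 0.0 means perfectly consistent responses to both M&M types
```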

Most participants were somewhat inconsistent in their ratings, ranking the same M&M type differently across the 5 samples. This would be expected if the taste difference between European-sourced and American-sourced M&Ms is not actually all that perceptible. Most inconsistent were participants who gave the same M&M type “Good”, “Normal”, and “Bad” responses (e.g., points high on the y-axis, with wider standard deviations of taste scores), indicating lower taste perception abilities.

Intriguingly, four participants — one from each continent group — were perfectly consistent: they reported the same taste response for each of the 5 M&Ms from each M&M type, resulting in an average standard deviation of 0.0 (bottom of Figure 9). Excluding the one of the four who simply rated all 10 M&Ms as “Normal”, the other three appeared to be “Perfect Classifiers” — either rating all M&Ms of one type “Good” and the other “Normal”, or rating all M&Ms of one type “Normal” and the other “Bad.” Perhaps these folks are “super tasters.”

3.6 M&M color

Another possible explanation for the inconsistency in individual taste responses is that there exists a perceptible taste difference based on the M&M color. Visually, the USA M&Ms were noticeably more smooth and vibrant than the Denmark M&Ms, which were somewhat more “splotchy” in appearance (Figure 10A). M&M color was recorded during the experiment, and although balanced sampling was not formally built into the experimental design, colors seemed to be sampled roughly evenly, with the exception of Blue USA M&Ms, which were oversampled (Figure 10B).

Figure 10. M&M colors. A. Photo of each M&M color of each type. It’s perhaps a bit hard to perceive on screen in my unprofessionally lit photo, but with the naked eye, USA M&Ms seemed to be brighter and more uniformly colored while Denmark M&Ms have a duller and more mottled color. Is it just me, or can you already hear the Europeans saying “They are brighter because of all those extra chemicals you put in your food that we ban here!” B. Distribution of M&Ms of each color sampled over the course of the experiment. The Blue USA M&Ms were not intentionally oversampled — they must be especially bright/tempting to experimenters. Figure made with Altair.

We briefly visualized possible differences in taste responses based on color (Figure 11), however we do not believe there are enough data to support firm conclusions. After all, on average each participant would likely only taste 5 of the 6 M&M colors once, and 1 color not at all. We leave further M&M color investigations to future work.

Figure 11. Taste response profiles for M&Ms of each color and type. Profiles are reported as percentages of “Bad”, “Normal”, and “Good” responses, though not all M&Ms were sampled exactly evenly. Figure made with Altair.

3.7 Colorful commentary

We assured each participant that there was no “right” answer in this experiment and that all feelings are valid. While some participants took this to heart and occasionally spent over a minute deeply savoring each M&M and evaluating it as if they were a sommelier, many participants seemed to view the experiment as a competition (which occasionally led to deflated or inflated pride). Experimenters wrote down quotes and notes in conjunction with M&M responses, some of which were a bit “colorful.” We provide a hastily rendered word cloud for each M&M type for entertainment purposes (Figure 12) though we caution against reading too far into them without diligent sentiment analysis.

Figure 12. A simple word cloud generated from the notes column of each M&M type. Fair warning — these have not been properly analyzed for sentiment and some inappropriate language was recorded. Figure made with WordCloud.

4. Conclusion

Overall, there does not appear to be a “global consensus” that European M&Ms are better than American M&Ms. However, European participants tended to more strongly express negative reactions to USA M&Ms while North American participants seemed relatively split on whether they preferred M&Ms sourced from the USA vs from Europe. The preference trends of Asian participants often fell somewhere between the North Americans and Europeans.

Therefore, I’ll admit that it’s probable that Europeans are not engaged in a grand coordinated lie about M&Ms. The skew of most European participants towards Denmark M&Ms is compelling, especially since I was the experimenter who personally collected much of the taste response data. If they found a way to cheat, it was done well enough to exceed my own passive perception such that I didn’t notice. However, based on this study, it would appear that a strongly negative “vomit flavor” is not universally perceived and does not become apparent to non-Europeans when tasting both M&Ms types side by side.

We hope this study has been illuminating! We would look forward to extensions of this work with improved participant sampling, additional M&M types sourced from other continents, and deeper investigations into possible taste differences due to color.

Thank you to everyone who participated and ate M&Ms in the name of science!

Figures and analysis can be found on github: https://github.com/erinhwilson/mnm-taste-test

Article by Erin H. Wilson, Ph.D.[1,2,3] who decided the time between defending her dissertation and starting her next job would be best spent on this highly valuable analysis. Hopefully it is clear that this article is intended to be comedic— I do not actually harbor any negative feelings towards Europeans who don’t like American M&Ms, but enjoyed the chance to be sassy and poke fun at our lively debates with overly-enthusiastic data analysis.

Shout out to Matt, Galen, Ameya, and Gian-Marco for assisting in data collection!

[1] Former Ph.D. student in the Paul G. Allen School of Computer Science and Engineering at the University of Washington

[2] Former visiting Ph.D. student at the Novo Nordisk Foundation Center for Biosustainability at the Technical University of Denmark

[3] Future data scientist at LanzaTech

The post Do European M&Ms Actually Taste Better than American M&Ms? appeared first on Towards Data Science.

]]>
Talking about Games https://towardsdatascience.com/talking-about-games/ Fri, 21 Feb 2025 19:14:25 +0000 https://towardsdatascience.com/?p=598289 Game Theory 101: terms and concepts

The post Talking about Games appeared first on Towards Data Science.

]]>
Game theory is a field of research that is quite prominent in Economics but rather unpopular in other scientific disciplines. However, the concepts used in game theory can be of interest to a wider audience, including data scientists, statisticians, computer scientists or psychologists, to name just a few. This article is the opener to a four-chapter tutorial series on the fundamentals of game theory, so stay tuned for the upcoming articles. 

In this article, I will explain the kinds of problems Game Theory deals with and introduce the main terms and concepts used to describe a game. We will see some examples of games that are typically analysed within game theory and lay the foundation for deeper insights into the capabilities of game theory in the later chapters. But before we go into the details, I want to introduce you to some applications of game theory, that show the multitude of areas game-theoretic concepts can be applied to. 

Applications of game theory

Even french fries can be an application of game theory. Photo by engin akyurt on Unsplash

Does it make sense to vote for a small party in an election if this party may not have a chance to win anyway? Is it worth starting a price war with your competitor who offers the same goods as you? Do you gain anything if you reduce your catch rate of overfished areas if your competitors simply carry on as before? Should you take out insurance if you believe that the government will pay for the reconstruction after the next hurricane anyway? And how should you behave in the next auction where you are about to bid on your favourite Picasso painting? 

All these questions (and many more) live within the area of applications that can be modelled with game theory. Whenever a situation includes strategic decisions in interaction with others, game-theoretic concepts can be applied to describe this situation formally and search for decisions that are not made intuitively but that are backed by a notion of rationality. Key to all the situations above is that your decisions depend on other people’s behaviour. If everybody agrees to conserve the overfished areas, you want to play along to preserve nature, but if you think that everybody else will continue fishing, why should you be the only one to stop? Likewise, your voting behaviour in an election might heavily depend on your assumptions about other people’s votes. If nobody votes for that candidate, your vote will be wasted, but if everybody thinks so, the candidate doesn’t have a chance at all. Maybe there are many people who say “I would vote for him if others vote for him too”.

Similar situations can happen in very different situations. Have you ever thought about having food delivered and everybody said “You don’t have to order anything because of me, but if you order anyway, I’d take some french fries”? All these examples can be applications of game theory, so let’s start understanding what game theory is all about. 

Understanding the game

Before playing, you need to understand the components of the game. Photo by Laine Cooper on Unsplash

When you hear the word game, you might think of video games such as Minecraft, board games such as Monopoly, or card games such as poker. There are some common principles to all these games: We always have some players who are allowed to do certain things determined by the game’s rules. For example, in poker, you can raise, check or give up. In Monopoly, you can buy a property you land on or don’t buy it. What we also have is some notion of how to win the game. In poker, you have to get the best hand to win and in Monopoly, you have to be the last person standing after everybody went bankrupt. That also means that some actions are better than others in some scenarios. If you have two aces on the hand, staying in the game is better than giving up. 

When we look at games from the perspective of game theory, we use the same concepts, just more formally.

A game in game theory consists of n players, where each player has a strategy set and a utility function.

A game consists of a set of players I = {1, …, n}, where each player has a set of strategies S and a utility function u_i(s_1, s_2, …, s_n). The set of strategies is determined by the rules of the game. For example, it could be S = {check, raise, give-up} and the player would have to decide which of these actions they want to use. The utility function u (also called reward) describes how valuable a certain action of a player would be, given the actions of the other players. Every player wants to maximize their utility, but now comes the tricky part: The utility of an action of yours depends on the other players’ actions. But for them, the same applies: Their actions’ utilities depend on the actions of the other players (including yours). 

Let’s consider a well-known game to illustrate this point. In rock-paper-scissors, we have n=2 players and each player can choose between three actions, hence the strategy set is S={rock, paper, scissors} for each player. But the utility of an action depends on what the other player does. If your opponent chooses rock, the utility of paper is high (1), because paper beats rock. But if your opponent chooses scissors, the utility of paper is low (-1), because you would lose. Finally, if your opponent chooses paper as well, you reach a draw and the utility is 0. 

Utility values for player one choosing paper, for the three choices of the opponent’s strategy.

Instead of writing down the utility function for each case individually, it is common to display games in a matrix like this:

The first player selects the row of the matrix with his action and the second player selects the column. For example, if player 1 chooses paper and player 2 chooses scissors, we end up in the cell in the third column and second row. The values in this cell are the utilities for both players, where the first value corresponds to player 1 and the second value corresponds to player 2. (-1,1) means that player 1 has a utility of -1 and player 2 has a utility of 1. Scissors beat paper. 
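One way to sketch this in code (a minimal NumPy example; the action ordering is an arbitrary choice):

```python
import numpy as np

actions = ["rock", "paper", "scissors"]

# Player 1 picks the row, player 2 picks the column; u1[i, j] is player 1's utility
# when player 1 plays actions[i] and player 2 plays actions[j].
u1 = np.array([
    [ 0, -1,  1],   # rock     vs rock, paper, scissors
    [ 1,  0, -1],   # paper    vs rock, paper, scissors
    [-1,  1,  0],   # scissors vs rock, paper, scissors
])
u2 = -u1  # rock-paper-scissors is zero-sum: one player's gain is the other's loss

i, j = actions.index("paper"), actions.index("scissors")
print(u1[i, j], u2[i, j])   # -1 1: player 1's paper loses to player 2's scissors
```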

Some more details

Now we have understood the main components of a game in game theory. Let me add a few more hints on what game theory is about and what assumptions it uses to describe its scenarios. 

  • We often assume that the players select their actions at the same time (like in rock-paper-scissors). We call such games static games. There are also dynamic games in which players take turns deciding on their actions (like in chess). We will consider these cases in a later chapter of this tutorial. 
  • In game theory, it is typically assumed that the players can not communicate with each other so they can’t come to an agreement before deciding on their actions. In rock-paper-scissors, you wouldn’t want to do that anyway, but there are other games where communication would make it easier to choose an action. However, we will always assume that communication is not possible. 
  • Game theory is considered a normative theory, not a descriptive one. That means we will analyse games concerning the question “What would be the rational solution?” This may not always be what people do in a likewise situation in reality. Such descriptions of real human behaviour are part of the research field of behavioural economics, which is located on the border between Psychology and economics. 

The prisoner’s dilemma

The prisoner’s dilemma is all about not ending up here. Photo by De an Sun on Unsplash

Let us become more familiar with the main concepts of game theory by looking at some typical games that are often analyzed. Often, such games are derived from a story or scenario that may happen in the real world and requires people to decide between some actions. One such story could be as follows: 

Say we have two criminals who are suspected of having committed a crime. The police have some circumstantial evidence, but no actual proof for their guilt. Hence they question the two criminals, who now have to decide if they want to confess or deny the crime. If you are in the situation of one of the criminals, you might think that denying is always better than confessing, but now comes the tricky part: The police propose a deal to you. If you confess while your partner denies, you are considered a crown witness and will not be punished. In this case, you are free to go but your partner will go to jail for six years. Sounds like a good deal, but be aware, that the outcome also depends on your partner’s action. If you both confess, there is no crown witness anymore and you both go to jail for three years. If you both deny, the police can only use circumstantial evidence against you, which will lead to one year in prison for both you and your partner. But be aware, that your partner is offered the same deal. If you deny and he confesses, he is the crown witness and you go to jail for six years. How do you decide?

The prisoner’s dilemma.

The game derived from this story is called the prisoner’s dilemma and is a typical example of a game in game theory. We can visualize it as a matrix just like we did with rock-paper-scissors before and in this matrix, we easily see the dilemma the players are in. If both deny, they receive a rather low punishment. But if you assume that your partner denies, you might be tempted to confess, which would prevent you from going to jail. But your partner might think the same, and if you both confess, you both go to jail for longer. Such a game can easily make you go round in circles. We will talk about solutions to this problem in the next chapter of this tutorial. First, let’s consider some more examples. 
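To make the payoffs concrete, here is a small sketch that encodes the dilemma with utilities written as negative years in prison (the matrix in the figure may use a different but equivalent convention):

```python
import numpy as np

actions = ["deny", "confess"]

# Utility = negative years in prison. Rows: your action, columns: your partner's action.
u_you = np.array([
    [-1, -6],   # you deny:    both deny -> 1 year each; partner confesses -> 6 years for you
    [ 0, -3],   # you confess: partner denies -> you walk free; both confess -> 3 years each
])
u_partner = u_you.T  # the game is symmetric, so your partner's matrix is the transpose

for my_action in actions:
    for partner_action in actions:
        i, j = actions.index(my_action), actions.index(partner_action)
        print(f"you {my_action}, partner {partner_action}: "
              f"utilities ({u_you[i, j]}, {u_partner[i, j]})")
```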

Bach vs. Stravinsky

Who do you prefer, Bach or Stravinsky? Photo by Sigmund on Unsplash

You and your friend want to go to a concert together. You are a fan of Bach’s music but your friend favors the Russian 20th. century composer Igor Stravinsky. However, you both want to avoid being alone in any concert. Although you prefer Bach over Stravinsky, you would rather go to the Stravinsky concert with your friend than go to the Bach concert alone. We can create a matrix for this game: 

Bach vs. Stravinsky

You choose the row by going to the Bach or Stravinsky concert, and your friend chooses the column by going to one of the concerts as well. For you, it would be best if you both chose Bach. Your reward would be 2 and your friend would get a reward of 1, which is still better for him than being in the Stravinsky concert all by himself. However, he would be even happier if you were in the Stravinsky concert together. 

Do you remember, that we said players are not allowed to communicate before making their decision? This example illustrates why. If you could just call each other and decide where to go, this would not be a game to investigate with game theory anymore. But you can’t call each other so you just have to go to any of the concerts and hope you will meet your friend there. What do you do? 

Arm or disarm?

Make love, not war. Photo by Artem Beliaikin on Unsplash

A third example brings us to the realm of international politics. The world would be a much happier place with fewer firearms, wouldn’t it? However, if nations think about disarmament, they also have to consider the choices other nations make. If the USA disarms, the Soviet Union might want to rearm, to be able to attack the USA — that was the thinking during the Cold War, at least. Such a scenario could be described with the following matrix: 

The matrix for the disarm vs. upgrade game.

As you see, when both nations disarm, they get the highest reward (3 each), because there are fewer firearms in the world and the risk of war is minimized. However, if you disarm, while the opponent upgrades, your opponent is in the better position and gets a reward of 2, while you only get 0. Then again, it might have been better to upgrade yourself, which gives a reward of 1 for both players. That is better than being the only one who disarms, but not as good as it would get if both nations disarmed. 

The solution?

All these examples have one thing in common: There is no single option that is always the best. Instead, the utility of an action for one player always depends on the other player’s action, which, in turn, depends on the first player’s action and so on. Game theory is now interested in finding the optimal solution and deciding what would be the rational action; that is, the action that maximizes the expected reward. Different ideas about what exactly such a solution looks like will be part of the next chapter in this series. 

Summary

Learning about game theory is as much fun as playing a game, don’t you think? Photo by Christopher Paul High on Unsplash

Before continuing with finding solutions in the next chapter, let us recap what we have learned so far. 

  • A game consists of players who decide on actions, which have a utility or reward
  • The utility/reward of an action depends on the other players’ actions. 
  • In static games, players decide on their actions simultaneously. In dynamic games, they take turns. 
  • The prisoner’s dilemma is a very popular example of a game in game theory.
  • Games become increasingly interesting if there is no single action that is better than any other. 

Now that you are familiar with how games are described in game theory, you can check out the next chapter to learn how to find solutions for games in game theory. 

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though: 

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in English language could be this one: 

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one: 

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

The post Talking about Games appeared first on Towards Data Science.

]]>
Unraveling Spatially Variable Genes: A Statistical Perspective on Spatial Transcriptomics https://towardsdatascience.com/unraveling-spatially-variable-genes-a-statistical-perspective-on-spatial-transcriptomics/ Fri, 21 Feb 2025 06:06:04 +0000 https://towardsdatascience.com/?p=598259 The article was written by Guanao Yan, Ph.D. student of Statistics and Data Science at UCLA. Guanao is the first author of the Nature Communications review article [1]. Spatially resolved transcriptomics (SRT) is revolutionizing genomics by enabling the high-throughput measurement of gene expression while preserving spatial context. Unlike single-cell RNA sequencing (scRNA-seq), which captures transcriptomes […]

The post Unraveling Spatially Variable Genes: A Statistical Perspective on Spatial Transcriptomics appeared first on Towards Data Science.

]]>

The article was written by Guanao Yan, Ph.D. student of Statistics and Data Science at UCLA. Guanao is the first author of the Nature Communications review article [1].

Spatially resolved transcriptomics (SRT) is revolutionizing Genomics by enabling the high-throughput measurement of gene expression while preserving spatial context. Unlike single-cell RNA sequencing (scRNA-seq), which captures transcriptomes without spatial location information, SRT allows researchers to map gene expression to precise locations within a tissue, providing insights into tissue organization, cellular interactions, and spatially coordinated gene activity. The increasing volume and complexity of SRT data necessitate the development of robust statistical and computational methods, making this field highly relevant to data scientists, statisticians, and machine learning (ML) professionals. Techniques such as spatial statistics, graph-based models, and deep learning have been applied to extract meaningful biological insights from these data.

A key step in SRT analysis is the detection of spatially variable genes (SVGs)—genes whose expression varies non-randomly across spatial locations. Identifying SVGs is crucial for characterizing tissue architecture, functional gene modules, and cellular heterogeneity. However, despite the rapid development of computational methods for SVG detection, these methods vary widely in their definitions and statistical frameworks, leading to inconsistent results and challenges in interpretation.

In our recent review published in Nature Communications [1], we systematically examined 34 peer-reviewed SVG detection methods and introduced a classification framework that clarifies the biological significance of different SVG types. This article provides an overview of our findings, focusing on the three major categories of SVGs and the statistical principles underlying their detection.

SVG detection methods aim to uncover genes whose spatial expression reflects biological patterns rather than technical noise. Based on our review of 34 peer-reviewed methods, we categorize SVGs into three groups: Overall SVGs, Cell-Type-Specific SVGs, and Spatial-Domain-Marker SVGs (Figure 2).

Image created by the authors, adapted from [1]. Publication timeline of 34 SVG detection methods. Colors represent three SVG categories: overall SVGs (green), cell-type-specific SVGs (red), and spatial-domain-marker SVGs (purple).

Methods for detecting the three SVG categories serve different purposes (Fig. 3). First, the detection of overall SVGs screens informative genes for downstream analyses, including the identification of spatial domains and functional gene modules. Second, detecting cell-type-specific SVGs aims to reveal spatial variation within a cell type and help identify distinct cell subpopulations or states within cell types. Third, spatial-domain-marker SVG detection is used to find marker genes to annotate and interpret spatial domains already detected. These markers help understand the molecular mechanisms underlying spatial domains and assist in annotating tissue layers in other datasets.

Image created by the authors, adapted from [1]. Conceptual visualization of three SVG categories: overall SVGs, cell-type-specific SVGs, and spatial-domain-marker SVGs. The left column shows a tissue slice with two cell types and three spatial domains. The right column shows exemplar genes with colors representing the expression levels shown for an overall SVG, a cell-type-specific SVG, and a spatial-domain-marker SVG, respectively.

The relationship among the three SVG categories depends on the detection methods, particularly the null and alternative hypotheses they employ. If an overall SVG detection method uses the null hypothesis that a non-SVG’s expression is independent of spatial location and the alternative hypothesis that any deviation from this independence indicates an SVG, then its SVGs should theoretically include both cell-type-specific SVGs and spatial-domain-marker SVGs. For example, DESpace [2] is a method that detects both overall SVGs and spatial-domain-marker SVGs, and its detected overall SVGs must be marker genes for some spatial domains. This inclusion relationship holds true except in extreme scenarios, such as when a gene exhibits opposite cell-type-specific spatial patterns that effectively cancel each other out. However, if an overall SVG detection method’s alternative hypothesis is defined for a specific spatial expression pattern, then its SVGs may not include some cell-type-specific SVGs or spatial-domain-marker SVGs.

To understand how SVGs are detected, we categorized the statistical approaches into three major types of hypothesis tests: 

  1. Dependence Test – Examines the dependence between a gene’s expression level and the spatial location. 
  2. Regression Fixed-Effect Test – Examines whether some or all of the fixed-effect covariates, for instance, spatial location, contribute to the mean of the response variable, i.e., a gene’s expression. 
  3. Regression Random-Effect Test (Variance Component Test) – Examines whether the random-effect covariates, for instance, spatial location, contribute to the variance of the response variable, i.e., a gene’s expression.

To further explain how these tests are used for SVG detection, we denote \( Y \) as a gene’s expression level and \( S \) as the spatial locations. The dependence test is the most general hypothesis test for SVG detection. For a given gene, it decides whether the gene’s expression level \( Y \) is independent of the spatial location \( S \), i.e., the null hypothesis is that \( Y \) and \( S \) are independent:

\[ H_0: Y \perp S. \]

There are two types of regression tests: fixed-effect tests, where the effect of the spatial location is assumed to be fixed, and random-effect tests, which assume the effect of the spatial location is random. To explain these two types of tests, we use a linear mixed model for a given gene as an example:

\[ Y_i = \beta_0 + x_i^\top \beta + z_i^\top \gamma + \epsilon_i, \]

where the response variable \( Y_i \) is the gene’s expression level at spot \( i \), \( x_i \in R^p \) indicates the fixed-effect covariates of spot \( i \), \( z_i \in R^q \) denotes the random-effect covariates of spot \( i \), and \( \epsilon_i \) is the random measurement error at spot \( i \) with zero mean. In the model parameters, \( \beta_0 \) is the (fixed) intercept, \( \beta \in R^p \) indicates the fixed effects, and \( \gamma \in R^q \) denotes the random effects with zero means and covariance matrix \( \mathrm{Var}(\gamma) = \Sigma \).

In this linear mixed model, independence is assumed between the random effects and the random errors, and among the random errors.

Fixed-effect tests examine whether some or all of the fixed-effect covariates \( x_i \) (dependent on spatial locations S) contribute to the mean of the response variable. If all fixed-effect covariates make no contribution, then \( \beta = 0 \). The null hypothesis

\[ H_0: \beta = 0 \]

implies \( \mathrm{E}(Y_i) = \beta_0 \), i.e., the mean expression does not depend on the spatial location.

Random-effect tests examine whether the random-effect covariates \( z_i \) (dependent on spatial locations S) contribute to the variance of the response variable \( \mathrm{Var}(Y_i) \), focusing on the decomposition

\[ \mathrm{Var}(Y_i) = z_i^\top \Sigma z_i + \mathrm{Var}(\epsilon_i) \]

and testing if the contribution of the random-effect covariates is zero. The null hypothesis

\[ H_0: \Sigma = 0 \]

implies \( \mathrm{Var}(Y_i) = \mathrm{Var}(\epsilon_i) \), i.e., the spatial random effects contribute nothing to the variance.

Among the 23 methods that use frequentist hypothesis tests, dependence tests and random-effect regression tests have been primarily applied to detect overall SVGs, whereas fixed-effect regression tests have been used across all three SVG categories. Understanding these distinctions is key to selecting the right method for specific research questions.
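As an illustration of the fixed-effect flavor, here is a minimal sketch on synthetic data (not a reimplementation of any of the 34 reviewed methods): regress a gene's expression on its spatial coordinates and jointly test the spatial coefficients.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
coords = rng.uniform(0, 1, size=(n, 2))                   # spatial locations S (x, y)
expr = 0.8 * coords[:, 0] + rng.normal(0, 0.5, size=n)    # a gene with a spatial trend in x

X = sm.add_constant(coords)            # intercept plus the two spatial covariates
fit = sm.OLS(expr, X).fit()

# Fixed-effect test: do the spatial covariates jointly contribute to the mean expression?
print(fit.f_test(np.eye(3)[1:]))       # H0: both spatial coefficients are zero
```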

Improving SVG detection methods requires balancing detection power, specificity, and scalability while addressing key challenges in spatial transcriptomics analysis. Future developments should focus on adapting methods to different SRT technologies and tissue types, as well as extending support for multi-sample SRT data to enhance biological insights. Additionally, strengthening statistical rigor and validation frameworks will be crucial for ensuring the reliability of SVG detection. Benchmarking studies also need refinement, with clearer evaluation metrics and standardized datasets to provide robust method comparisons.

References

[1] Yan, G., Hua, S.H. & Li, J.J. (2025). Categorization of 34 computational methods to detect spatially variable genes from spatially resolved transcriptomics data. Nature Communication, 16, 1141. https://doi.org/10.1038/s41467-025-56080-w

[2] Cai, P., Robinson, M. D., & Tiberi, S. (2024). DESpace: spatially variable gene detection via differential expression testing of spatial clusters. Bioinformatics, 40(2). https://doi.org/10.1093/bioinformatics/btae027


The post Unraveling Spatially Variable Genes: A Statistical Perspective on Spatial Transcriptomics appeared first on Towards Data Science.

]]>
Honestly Uncertain https://towardsdatascience.com/honestly-uncertain/ Tue, 18 Feb 2025 18:53:02 +0000 https://towardsdatascience.com/?p=598068 Ethical issues aside, should you be honest when asked how certain you are about some belief? Of course, it depends. In this blog post, you’ll learn on what. A probabilistic quiz game David Spiegelhalter’s new (as of 2025) fantastic book, “The Art of Uncertainty” – a must-read for everyone who deals with probabilities and their communication […]

The post Honestly Uncertain appeared first on Towards Data Science.

]]>
Ethical issues aside, should you be honest when asked how certain you are about some belief? Of course, it depends. In this blog post, you’ll learn on what.

  • Different ways of evaluating probabilistic predictions come with dramatically different degrees of “optimal honesty”.
  • Perhaps surprisingly, the linear function that assigns +1 to true and fully confident statements, 0 to admitted ignorance and -1 to wrong but fully confident statements incentivizes exaggerated, dishonest boldness. If you rate forecasts that way, you’ll be surrounded by self-important fools and suffer from badly calibrated machine forecasts.
  • If you want people (or machines) to give their truly unbiased and honest assessment, your scoring function should penalize confident but wrong convictions more strongly than it rewards confident correct ones.

A probabilistic quiz game

David Spiegelhalter’s new (as of 2025) fantastic book, “The Art of Uncertainty” – a must-read for everyone who deals with probabilities and their communication – features a short section on scoring rules. Spiegelhalter walks the reader through the quadratic scoring rule, and briefly mentions that a linear scoring rule will lead to dishonest behavior. I elaborate on that interesting point in this blog post.

Let’s set the stage: Just like in so many other scenarios and paradoxes, you find yourself in a TV show (yes, what an old-fashioned way to start). You have the opportunity to answer questions on common knowledge and win some cash. You are asked yes/no-questions that are expressed in a binary fashion, such as: Is the area of France larger than the area of Spain? Was Marie Curie born earlier than Albert Einstein? Is Montreal’s population larger than Kyoto’s?

Depending on your background, these questions might be obvious for you, or they might be difficult. In any case, you will have a subjective “best guess” in mind, and some degree of certainty. For example, I feel comfortable answering the first, slightly less for the second, and I already forgot the answer to the third, even though I looked it up to build the example. You might experience a similar level of confidence, or a very different one. Degrees of certainty are, of course, subjective.

The twist of the quiz: You are not supposed to give a binary yes/no-answer as in a multiple-choice test, but to honestly communicate your degree of conviction, that is, to produce the probability that you personally assign to the true answer being “yes”. The number 0 then means “definitely not”, 1 expresses “definitely yes”, and 0.5 reflects the degree of uncertainty corresponding to the toss of a fair coin — you then have absolutely no idea. Let’s call P(A) your true subjective conviction that statement A is true. That probability can take any value between 0 and 1, whereas A is bound to be either 0 or 1. You can then communicate that number, but you don’t have to, so we’ll call Q(A) the probability that you eventually express in that quiz.

In general, not every probabilistic expression Q is met with the same excitement, because humans generally dislike uncertainty. We are much happier with the expert who gives us “99.99%” or “0.01%” probabilities for something to be or not to be the case, and we favor them considerably over the experts producing “25%” and “75%” maybe-ish assessments. From a rational perspective, more informative probabilities (“sharp predictions”, close to 0 or close to 1) are preferable to uninformative ones (“unsharp predictions”, close to 0.5). However, a modest but truthful prediction is still worth more than a bold but unreliable one that would make you go all-in. We should therefore ensure that people do not lie about their degree of conviction, so that really 99% of the “99%-sure” predictions are actually true, 12% of the “12%-sure”, and so on. How can the quiz master ensure that?

The Linear Scoring Rule

The most straightforward way that one might come up with to judge probabilistic statements is to use a linear scoring rule: In the best case, you are very confident and right, which means Q(A)=P(A)=1 and A is true, or Q(A)=P(A)=0 and A is false. We then add the score +1=r(Q=1, A=1)=r(Q=0, A=0) to the balance. In the worst case, you were very sure of yourself, but wrong; that is, Q(A)=P(A)=1 while A is false, or Q(A)=P(A)=0 while A is true. In that unfortunate case, we subtract –1=r(Q=1, A=0)=r(Q=0, A=1) from the score. Between these extreme cases, we draw a straight line. When you express maximal uncertainty via Q(A)=0.5, we have 0=r(Q=0.5, A=1)=r(Q=0.5, A=0), and neither add nor subtract anything.

The functional form of this linear reward function is not particularly spectacular, but its visualization will come handy in the following:

Linear scoring function: You are rewarded +1 for being very sure about your true belief, subtracted -1 when being equally sure about a wrong belief, you don’t get any reward nor punishment when you are openly ignorant with Q=0.5. Image by the author.

No surprise here: If A is true, the best thing you could have done is to communicate “Q=1”, if A is false, the best strategy would have been to produce “Q=0”. That’s what is visualized by the black dots: They point to the largest value that the reward function can attain for the particular value of the truth. That’s a good start.

But you typically do not know with absolute certainty whether the answer is “yes, A is true” or “no, A is false”, you only have a subjective gut feeling. So what should you do? Should you just be honest and communicate your true belief, e.g. P=0.7 or P=0.1?

Let’s set ethics aside, and consider the reward that we want to maximize. It then turns out that you should not be honest. When evaluated via the linear scoring rule, you should lie, and communicate Q(A)=0 when P(A)<0.5 and Q(A)=1 when P(A)>0.5.

To see this surprising result, let’s compute the expectation value of the reward function, assuming that your belief is, on average, correct (cognitive psychology teaches us that this is an unrealistically optimistic assumption in the first place, we’ll come back to that below). That is, we assume that in about 70% of the cases when you say P=0.7, the true answer is “yes, A is true”, in about 75% of the cases when you say P=0.25, the true answer is “no, A is false”. The expected reward R(P, Q) is then a function of both the honest subjective probability P and of the communicated probability Q, namely the weighted sum of the reward r(Q, A=1) and r(Q, A=0):

R(P, Q) = P * r(Q, A=1) + (1-P) * r(Q, A=0)

Here come the resulting R(P,Q) for four different values of the honest subjective probability P:

Expected reward as a function of honest and communicated probabilities P and Q. Image by the author.

The maximally attainable reward in the long run is not always 1 anymore, but it’s bounded by 2|P-0.5| — ignorance comes at a cost. Clearly, the best strategy is to confidently communicate Q=1 as long as P>0.5, and to communicate an equally confident Q=0 when P<0.5 — see where the black dots lie in the figure.

Under a linear scoring rule, when it is more likely than not that the event occurs — pretend you are absolutely certain that it will occur. When it’s marginally more likely that it does not occur — be bold and proclaim “that can never happen”. You will be wrong sometimes, but, on average, it’s more profitable to be bold than to be honest.
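A few lines of Python make this concrete (a sketch, writing the linear rule as r = 1 − 2·|A − Q|, which matches the endpoints and the midpoint described above):

```python
import numpy as np

def linear_reward(q, a):
    # +1 for confidently right, -1 for confidently wrong, 0 at Q=0.5, straight line in between
    return 1 - 2 * np.abs(a - q)

def expected_reward(p, q, reward):
    # p is the honest belief that A is true; average the reward over both possible outcomes
    return p * reward(q, 1) + (1 - p) * reward(q, 0)

qs = np.linspace(0, 1, 1001)
for p in [0.2, 0.4, 0.6, 0.8]:
    best_q = qs[np.argmax(expected_reward(p, qs, linear_reward))]
    print(f"honest P = {p:.1f} -> optimal communicated Q = {best_q:.2f}")
# The optimum always jumps to 0 or 1: the linear rule pays you to go all-in.
```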

Even worse: What happens when you have absolutely no clue, no idea about the outcome, and your subjective belief is P=0.5? Then you can play safe and communicate that, or you can take the chance and communicate Q=1 or Q=0 — the expectation value is the same.

I find this a disturbing result: A linear reward function makes people go all-in! There is no way, as a forecast consumer, to distinguish a slight tendency of 51% from a “quite likely” conviction of 95% or from an almost-certain 99.9999999%. In that quiz, the smart players will always go all-in.

Worse, many situations in life reward unsupported confidence more than thoughtful and careful assessments. Cautiously said, not many people are being heavily sanctioned for making clearly exaggerated claims…

A quiz show is one thing, but, obviously, it’s quite a problem when people (or machines…) are pushed to not communicate their true degree of conviction when it comes to estimating the risk of serious and dramatic events such as earthquakes, war and catastrophes.

How can we make them to be honest (in the case of people) or calibrated (in the case of machines)?

Punishing confident wrongness: The Quadratic Scoring Rule

If the probability for something to happen is estimated to be P=55% by some expert, I want that expert to communicate Q=55%, and not Q=100%. For probabilities to have any value for our decisions, they should reflect the true level of conviction, and not an opportunistically optimized value.

This reasonable ask has been formalized by statisticians as proper scoring rules: A proper scoring rule is one that incentivizes the forecaster to communicate their true degree of conviction; it is maximized when the communicated probabilities are calibrated, i.e. when predicted events are realized with the predicted frequency. At first, the question might arise whether such a scoring rule can exist at all. Thankfully, it can!

One proper scoring rule is the quadratic scoring rule, also known as the Brier score. For extreme communicated probabilities (Q=1, Q=0), the values are the very same as for the linear scoring rule, but instead of drawing a straight line between them, we draw a parabola. By doing that, we reward honest ignorance: +0.5 is awarded for a communicated probability of Q=0.5.

Quadratic reward as a function of outcome A and communicated probability Q. Image by the author.

This reward function is asymmetric: When you increase your confidence from Q=0.95 to Q=0.98 (and A is true), the reward function only increases marginally. On the other hand, when A is false, that same increase of confidence leaning towards the wrong outcome is pushing down the reward considerably. Clearly, the quadratic reward thereby nudges one to be more cautious than the linear reward. But will it suffice to make people honest?

To see that, let’s compute the expectation value of the quadratic reward as a function of both the true honest probability P and the communicated one Q, just like we did in the linear case:

R(P, Q) = P * r(Q, A=1) + (1-P) * r(Q, A=0)

The resulting expected reward, for different values of the honest probability P, is shown in the next figure:

Image by the author.

Now, the maxima of the curves lie exactly at the point for which Q=P, which makes the correct strategy communicating honestly one’s own probability P. Both exaggerated confidence and excessive caution are penalized. Of course, by knowing more in the first place, you’ll be able to make sharper and more confident statements (more predictions Q=P that are either close to 1 or close to 0). But honest ignorance is now rewarded with +0.5. Better be safe than sorry.
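The same numerical check as before, now writing the quadratic rule as r = 1 − 2·(A − Q)² (which passes through the three values described above: −1, +0.5 and +1), confirms that honesty is optimal:

```python
import numpy as np

def quadratic_reward(q, a):
    # Same endpoints as the linear rule, but a parabola: honest ignorance (Q=0.5) earns +0.5
    return 1 - 2 * (a - q) ** 2

def expected_reward(p, q, reward):
    return p * reward(q, 1) + (1 - p) * reward(q, 0)

qs = np.linspace(0, 1, 1001)
for p in [0.2, 0.4, 0.6, 0.8]:
    best_q = qs[np.argmax(expected_reward(p, qs, quadratic_reward))]
    print(f"honest P = {p:.1f} -> optimal communicated Q = {best_q:.2f}")
# The optimum now lies exactly at Q = P: the quadratic rule is a proper scoring rule.
```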

What do we learn from that? The reward that is maximized by honestly communicated probabilities sanctions “surprises” (Q<0.5 and the event is actually true, or Q>0.5 and the event is actually false) quite strongly. You lose more when you are wrong with your tendency (Q>0.5 or Q<0.5) than you would win when you are correct. At the same time, not knowing and being honest about it is rewarded a non-negligible value.

Logarithmic reward

The quadratic reward function is not the only one that rewards honesty (there are infinitely many proper scoring rules): The logarithmic reward penalizes being confidently wrong (Q=0, but the truth is “yes, A is true”; Q=1, yet the truth is “no, A is false”) with an unassailable -infinity: The score is simply the logarithm of the probability that had been predicted for the event that eventually occurred — the plot is cut off on the y-axis for that reason:

Logarithmic reward as a function of the communicated probability. Image by the author.

The logarithmic reward breaks the symmetry between “having communicated a slightly too-high” and “having expressed a slightly too-low” probability: Towards uninformative Q=0.5, the penalty is weaker than towards informative Q=0 or Q=1, which we see in the expectation values:

Image by the author.

The logarithmic scoring rule heavily penalizes the assignment of a probability of 0 to something that then very surprisingly happened: Somebody who has to admit “I really thought it was absolutely impossible” after having assigned Q=0 won’t be invited to provide predictions ever again…

Incentivizing sandbagging: The Cubic Scoring Rule

Scoring rules can push forecasters to be over-confident (see the linear scoring rule), or they can be proper (see the quadratic and logarithmic scoring rules), but they can also punish “being boldly wrong” so thoroughly that forecasters would rather pretend not to know, even when they do. A cubic scoring rule would lead to such excessive caution:

Image by the author.

The expected reward now makes people communicate values that are less informative (closer to 0.5) than their true convictions: instead of an honest Q=P=0.2, the optimum is at Q=0.333; instead of an honest Q=P=0.4, the optimum is at Q=0.4495.

Image by the author.

In other words, if you want to be provided with honest judgements, don’t exaggerate the punishment of strong but eventually wrong convictions either; otherwise you’ll be surrounded by indecisive and hesitant cowards…
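For completeness, here is a small R sketch of the cubic rule. The parameterization (the same construction as the quadratic rule, with exponent 3) is my assumption, but it reproduces the sandbagging optima quoted above:

```r
# Cubic reward; assumed parameterization analogous to the quadratic rule.
reward_cubic <- function(Q, A) ifelse(A == 1, 1 - 2 * (1 - Q)^3, 1 - 2 * Q^3)

expected_reward <- function(P, Q) P * reward_cubic(Q, 1) + (1 - P) * reward_cubic(Q, 0)

# Optimal communicated probability Q for a given honest probability P:
best_Q <- function(P) optimize(function(Q) expected_reward(P, Q),
                               interval = c(0, 1), maximum = TRUE)$maximum
best_Q(0.2)  # ~0.333, less informative than the honest P = 0.2
best_Q(0.4)  # ~0.4495, nudged towards the uninformative 0.5
```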

Honest and communicated probabilities

The following plot recapitulates the argument by showing the optimal communicated probability Q as a function of the true belief P. For a linear reward (Exponent 1), you will either communicate Q=0 or Q=1, and not disclose any information about your true degree of conviction. The quadratic reward (Exponent 2) makes you honest (Q=P), while the cubic reward (Exponent 3) pushes you towards overly cautious Q values.

Optimally communicated probability Q as a function of the true conviction P, for different reward functions. A proper scoring rule ensures Q=P. Image by the author.

In reality, our choices are often binary, and, depending on the “false positive” and “false negative” costs and the “true positive” and “true negative” rewards, we will set the threshold on our subjective probability for taking or not taking a certain action to different values. It is not at all irrational to plan thoroughly for a catastrophe with probability P=0.01=1%.

If probabilities are subjective, how can they be “wrong”?

Scoring rules have two main applications: On a technical level, when training a probabilistic statistical or machine learning model on data, optimizing a proper scoring rule will yield calibrated and as-sharp-as-possible probabilistic forecasts. In a more informal setting, when several experts estimate the probability for something (typically dramatic) to happen, one wants to make sure that the experts are honest and don’t try to overplay or downplay their subjective uncertainty (beware of group dynamics!). Super-forecasters indeed use quadratic scoring rules to help reflect on their degree of confidence and to train themselves to become more calibrated.

Back to our initial quiz game. Before answering, you should definitely ask how you will be evaluated. The evaluation procedure does matter, even if you are told it does not. Similarly, when you are given a multiple-choice test, be sure to understand whether it might be worthwhile to check a box even if you are only marginally certain about its correctness.

But how can a quiz involving subjective probabilities be evaluated at all in an objective fashion? According to Bruno De Finetti, “probability does not exist”, so how can we then judge the probabilities that people express? We don’t judge people’s taste either! David Spiegelhalter emphasizes in “The Art of Uncertainty” that uncertainty is not “a property of the world, but of our relationship with the world”.

However, subjective does not mean unfalsifiable.

I might be 99% sure that France is larger than Spain, 75% sure that Marie Curie was born before Albert Einstein, and 55% sure that Montreal is larger than Kyoto. The numbers that you assign to these statements will probably (pun intended) be different. Your relationship to the world is a different one than mine. That’s OK.

We can be both right in the sense that we express calibrated probabilities, even if we assign different probabilities to the same events.

A more commonplace setting: When I enter a supermarket, I can assign quite informative (quite high or quite low) probabilities to buying certain products, since I typically know well what I intend to buy. The data scientist working at the supermarket does not know my personal shopping list, even after having collected considerable personal data. The probability that they assign to me buying a bottle of orange juice will be quite different from the one that I assign to it myself, and both probabilities can be “correct” in the sense that they are calibrated in the long run.

Subjectivity does not mean arbitrariness: we can aggregate predictions and outcomes, and evaluate to which extent the predictions are calibrated. Scoring rules help us precisely with that task, because they simultaneously grade honesty and information: each forecaster can be evaluated separately on their predicted probabilities. The one that is most informed (producing close-to-1 and close-to-0 probabilities) while being honest at the same time will win the quiz. Different scoring rules may then rank strong-but-slightly-uncalibrated predictions against weaker-but-calibrated ones differently.
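To illustrate, here is a toy example in R with made-up forecasts and outcomes, grading two forecasters by their average quadratic reward (same assumed parameterization as in the sketch above):

```r
# Made-up outcomes and forecasts, purely for illustration.
outcomes   <- c(1, 0, 1, 1, 0)
q_informed <- c(0.9, 0.1, 0.8, 0.7, 0.2)  # sharp and well-calibrated
q_cautious <- c(0.6, 0.4, 0.6, 0.5, 0.5)  # honest but less informed

avg_score <- function(Q, A) mean(ifelse(A == 1, 1 - 2 * (1 - Q)^2, 1 - 2 * Q^2))
avg_score(q_informed, outcomes)  # ~0.92: rewarded for sharp, correct forecasts
avg_score(q_cautious, outcomes)  # ~0.61: still positive, but clearly lower
```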

As mentioned above, honesty and calibration are not equivalent in practice. We might honestly believe, for each of 100 separate events, that it will occur with probability 20%, yet the actual number of occurrences might differ significantly from 20. We might be honest about our belief and express Q=P, but that belief itself is typically uncalibrated! Kahneman and Tversky have studied the cognitive biases that typically make us more confident than we should be. In a way, we often behave as if a linear scoring rule judged our predictions, making us lean towards the bold side.

The post Honestly Uncertain appeared first on Towards Data Science.

]]>
How Does Luck Influence the Board Game Sequence? https://towardsdatascience.com/how-does-luck-influence-the-board-game-sequence/ Fri, 14 Feb 2025 20:02:15 +0000 https://towardsdatascience.com/?p=597921 Losing in board games is rarely fun. And certainly not when luck decides the game. But how much influence does luck actually have in popular board games like Sequence? I played the game and looked at the numbers. This analysis can not be generalized to other players, not to mention Sequence variants with more than […]

The post How Does Luck Influence the Board Game Sequence? appeared first on Towards Data Science.

]]>
Losing in board games is rarely fun. And certainly not when luck decides the game. But how much influence does luck actually have in popular board games like Sequence? I played the game and looked at the numbers. This analysis cannot be generalized to other players, let alone to Sequence variants with more than 2 players, as it is based only on matches between me and my partner.

What Is the Game About?

If you’ve never played Sequence, here’s a quick introduction:  

The game board is a 10 by 10 grid, where each space corresponds to a card (Figure 1). The game is played with two decks with a total of 104 cards. Each player is dealt a number of cards and takes turns playing one card and placing a token on the corresponding space on the board, then draws a new card from the deck. There are also some lucky cards:  

  • Two-eyed Jack: Allows you to place a token anywhere. 4 cards in total. 
  • One-eyed Jack: Removes one of your opponent’s tokens. 4 cards in total.

In this analysis, we focus on two-player Sequence, where a player wins by being the first to make two sequences of five tokens in a row, horizontally, vertically, or diagonally.

Figure 1: Sequence board game before (left) and after blue wins (right). Image by author.

Let us play

In the name of Statistics, I played 51 games with my partner. Going forward, we will refer to ourselves as Player 1 and Player 2. We recorded several variables, including the number of rounds, the winner, and the number of one-eyed and two-eyed jacks drawn in each game, excluding those drawn in the last two rounds unless they were played.

Distribution of jacks

Our first task on this meaningless journey was to answer the question: Did we make some lucky draws?

To that end, let N, X1eye, and X2eye be the random variables describing the number of rounds and the number of one-eyed and two-eyed jacks seen by one player, respectively. If we assume a game lasts N = n rounds, we can ask what the expected number of one- or two-eyed jacks in such a game is. Assuming the deck of 104 cards is randomly shuffled, and each player is dealt cards interchangeably, this question amounts to asking for the expected number of successes in n draws without replacement from a population of size 104. This is the well-known hypergeometric distribution, and since there are 4 one-eyed and 4 two-eyed jacks, it follows:

E[X1eye | N = n] = E[X2eye | N = n] = n · 4/104

Using the same argument, we find:

E[X | N = n] = n · 8/104,

where X is the number of jacks (of any type) seen by a player. The variance of the hypergeometric distribution is:

V[X | N = n] = n · (8/104) · (1 − 8/104) · (104 − n)/(104 − 1)

Looking at Figure 2, we can verify that the jacks drawn follow this distribution nicely, with only 6.25% of cases being extreme (i.e. outside the 95% prediction interval), which also reassures us that the cards were properly shuffled.
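As a quick sanity check of this model, R’s built-in hypergeometric functions can be used; the value n = 19 is roughly the average number of rounds we observed (see just below), and the exact numbers are only illustrative:

```r
# Hypergeometric model: 104 cards, 8 of them jacks, n cards seen without replacement.
n <- 19
n * 8 / 104                                    # expected number of jacks seen, ~1.46
qhyper(c(0.025, 0.975), m = 8, n = 96, k = n)  # rough 95% prediction interval
dhyper(0:4, m = 8, n = 96, k = n)              # probabilities of seeing 0..4 jacks
```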

Our recordings provide estimates for the mean number of rounds, E[N] ≈ 19, and its variance, V[N] ≈ 24. Using the tower property of conditional expectation, we find:

E[X] = E[E[X | N]] = (8/104) · E[N] ≈ 1.46,

and from the law of total variance, we find:

V[X] = E[V[X | N]] + V[E[X | N]]

Empirically, we found:

for Player 1, and:

for Player 2, suggesting that we might have been luckier than average in drawing jacks.

What is the real value of jacks?

One thing is getting lucky cards; another is how much a lucky card (a jack) actually impacts the game. To investigate, we performed a simple logistic regression, with the aim of measuring the effect of one-eyed and two-eyed jacks on the winning chance, denoted p:

log(p / (1 − p)) = ɣ + ⍺ · X1eye + β · X2eye,

where ⍺ is the effect on winning of drawing a one-eyed jack and β is the effect on winning of drawing a two-eyed jack. The parameter ɣ captures all other contributions. We fit the model in R using the function glm. The estimates [p-values] are:

The only significant effect is from the two-eyed jacks, which also have the largest effect on the winning outcome. Specifically, we find that the odds of winning with a two-eyed jack are

exp(β − ⍺) ≈ 2.27

times those with a one-eyed jack.
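For reference, here is a sketch of how such a fit could look with glm; the data frame games and its columns (win, one_eyed, two_eyed) are hypothetical placeholders for our recordings, with one row per game and player:

```r
# Logistic regression of winning on the number of one-eyed and two-eyed jacks drawn.
# `games`, `win`, `one_eyed`, `two_eyed` are placeholder names, not the actual data.
fit <- glm(win ~ one_eyed + two_eyed, family = binomial, data = games)
summary(fit)  # coefficient estimates and p-values

# Odds ratio of a two-eyed jack relative to a one-eyed jack, ~2.27 in our games:
exp(coef(fit)[["two_eyed"]] - coef(fit)[["one_eyed"]])
```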

Metric of luck

Based on our regression, the values of a one-eyed jack and a two-eyed jack, written v1 and v2 respectively, were determined by the equations:

For each player and each game i, we defined a luck score Hi as the total value of the player’s jacks. A score of Hi = 0 meant no luck (no jacks), while Hi = 1 meant maximum luck (all 8 jacks, i.e. 4 × v1 + 4 × v2 = 1).

We could further calculate the difference in luck between the players, ∆Hi = Hi(Player 1) − Hi(Player 2). A score of ∆Hi = 0 meant an equal amount of luck (measured by the total value of the jacks), while ∆Hi = 1 meant that Player 1 drew all 8 jacks, and ∆Hi = -1 meant that Player 2 drew all 8 jacks. We agreed on some (semi-)arbitrary bounds:

  • ∆Hi = -0.5 corresponds to very lucky (Player 2)
  • ∆Hi = -0.25 corresponds to lucky (Player 2)
  • ∆Hi = 0 corresponds to no luck (either player)
  • ∆Hi = 0.25 corresponds to lucky (Player 1)
  • ∆Hi = 0.5 corresponds to very lucky (Player 1)
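A minimal sketch of how these scores can be computed, assuming v1 and v2 are the per-jack values from the (omitted) equations above, normalized so that 4 · v1 + 4 · v2 = 1:

```r
# Luck score of one player in one game: total value of the jacks drawn.
# v1 and v2 are assumed to come from the regression-based definitions above.
luck_score <- function(n_one_eyed, n_two_eyed, v1, v2) {
  n_one_eyed * v1 + n_two_eyed * v2
}

# Difference in luck in a game (positive values favour Player 1), e.g.:
# delta_H <- luck_score(p1_one, p1_two, v1, v2) - luck_score(p2_one, p2_two, v1, v2)
```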

We found that when a player experienced luck, the chance of winning increased dramatically. Only in 7 out of 51 games did the unlucky player manage to win, as shown in Figure 3.

Metric of skill

Now, it would be controversial to claim that two-player Sequence is no different from a coin toss. So how much is skill? Here we defined a skill score Ei for each game i using the following formula:

The expression 1 + Hi represents the number of skill points we want to award the player. However, we only award points if the player has not been lucky, which is described by the indicator function ↿{∆Hi ≤ 0} for Player 1 and ↿{∆Hi ≥ 0} for Player 2. Also, we only add skill points if the player actually wins, described by the indicator function ↿{Winner = Player 1} for Player 1 and ↿{Winner = Player 2} for Player 2. Thus, assuming a player wins, the skill score is zero if the player was lucky, 1 if both players were equally lucky, and 1 + Hi otherwise. The skill score is also zero if the player loses.

Our best estimate of a player’s skill in Sequence is the average skill score Ē over all games where Ei ≥ 1 (i.e. where the player has been skillful). For both Player 1 and Player 2, there are m = 7 games where the player has been skillful, as seen in Figure 3. The score is then:

In Figure 4, we used a technique called bootstrap resampling to create 5 million replicas of the average skill score Ē for both Player 1 and Player 2, based on our games. We then computed 5 million differences ∆Ē = Ē(Player 1) − Ē(Player 2) and plotted the histogram, along with a 95% (bootstrap) confidence interval and the mean difference.
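A sketch of this bootstrap in R; e1 and e2 are assumed to hold the skill scores Ei of the m = 7 skillful games for Player 1 and Player 2, respectively:

```r
# Bootstrap the difference in average skill score between the two players.
# `e1` and `e2` are placeholders for the recorded skill scores (not shown here).
set.seed(42)
B <- 5e6  # the article uses 5 million resamples; a smaller B illustrates the idea just as well
boot_diff <- replicate(B, mean(sample(e1, replace = TRUE)) - mean(sample(e2, replace = TRUE)))

quantile(boot_diff, c(0.025, 0.975))  # 95% bootstrap confidence interval
mean(boot_diff)                       # mean difference in average skill
hist(boot_diff, breaks = 100)         # histogram as in Figure 4
```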

The figure shows significant uncertainty about which player performs better. In particular, the confidence interval contains 0, which means that at a 5% significance level, we cannot reject the hypothesis that the two players are equally skilled.

A game of chance?

We have demonstrated that luck plays a major role in our games of Sequence. When either my partner or I drew enough jacks, victory was nearly certain. To evaluate the ability to navigate the game strategically when luck was equal or unfavourable, we developed a skill score. The analysis of 51 games showed no significant difference in skill between my partner and me. Other luck factors, such as card order, are not accounted for in our model, meaning that the games used to calculate the skill score may themselves be influenced by elements of luck. If this is the case, the game is more affected by chance than our analysis suggests.

Our study is based on 51 games, which is a relatively low number. These games were played by me and my partner and cannot be generalized. Other players with greater or lesser strategic sense will likely affect the outcome differently. Still, our analysis revealed that two-eyed jacks had a significant and substantial effect on the chance of winning, suggesting that they play an important role regardless of the players. Although one-eyed jacks were not statistically significant, we still considered them to be lucky cards, as they are logically expected to have some influence. Ultimately, while our findings suggest that the strongest predictor of victory is the number of two-eyed jacks, further analysis over many more games with a broader range of players would obviously be needed to fully understand the dynamics at play. For example, it might be hypothesized that equally skilled players profit significantly more from luck than players of uneven skill.

The post How Does Luck Influence the Board Game Sequence? appeared first on Towards Data Science.

]]>