One-Tailed Vs. Two-Tailed Tests https://towardsdatascience.com/one-tailed-vs-two-tailed-tests/ Thu, 06 Mar 2025 04:22:42 +0000 Choosing between one- and two-tailed hypotheses affects every stage of A/B testing. Learn why the hypothesis direction matters and explore the pros and cons of each approach.

Introduction

If you’ve ever analyzed data using built-in t-test functions, such as those in R or SciPy, here’s a question for you: have you ever adjusted the default setting for the alternative hypothesis? If your answer is no—or if you’re not even sure what this means—then this blog post is for you!

The alternative hypothesis parameter, commonly referred to as “one-tailed” versus “two-tailed” in statistics, defines the expected direction of the difference between control and treatment groups. In a two-tailed test, we assess whether there is any difference in mean values between the groups, without specifying a direction. A one-tailed test, on the other hand, posits a specific direction—whether the control group’s mean is either less than or greater than that of the treatment group.
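
In SciPy, for example, this choice corresponds to the alternative argument of scipy.stats.ttest_ind (available in recent SciPy versions). Here is a minimal sketch with made-up data, just to show where the setting lives:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, size=200)    # placeholder data
treatment = rng.normal(10.4, 2.0, size=200)  # placeholder data

# Two-tailed (the default): is there any difference between the means?
print(stats.ttest_ind(control, treatment, alternative="two-sided"))

# One-tailed: is the control mean less than the treatment mean?
print(stats.ttest_ind(control, treatment, alternative="less"))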

Choosing between one- and two-tailed hypotheses might seem like a minor detail, but it affects every stage of A/B testing: from test planning to Data Analysis and results interpretation. This article builds a theoretical foundation on why the hypothesis direction matters and explores the pros and cons of each approach.

One-tailed vs. two-tailed hypothesis testing: Understanding the difference

To understand the importance of choosing between one-tailed and two-tailed hypotheses, let’s briefly review the basics of the t-test, the commonly used method in A/B testing. Like other Hypothesis Testing methods, the t-test begins with a conservative assumption: there is no difference between the two groups (the null hypothesis). Only if we find strong evidence against this assumption can we reject the null hypothesis and conclude that the treatment has had an effect.

But what qualifies as “strong evidence”? To that end, a rejection region is determined under the null hypothesis and all results that fall within this region are deemed so unlikely that we take them as evidence against the feasibility of the null hypothesis. The size of this rejection region is based on a predetermined probability, known as alpha (α), which represents the likelihood of incorrectly rejecting the null hypothesis. 

What does this have to do with the direction of the alternative hypothesis? Quite a bit, actually. While the alpha level determines the size of the rejection region, the alternative hypothesis dictates its placement. In a one-tailed test, where we hypothesize a specific direction of difference, the rejection region is situated in only one tail of the distribution. For a hypothesized positive effect (e.g., that the treatment group mean is higher than the control group mean), the rejection region would lie in the right tail, creating a right-tailed test. Conversely, if we hypothesize a negative effect (e.g., that the treatment group mean is less than the control group mean), the rejection region would be placed in the left tail, resulting in a left-tailed test.

In contrast, a two-tailed test allows for the detection of a difference in either direction, so the rejection region is split between both tails of the distribution. This accommodates the possibility of observing extreme values in either direction, whether the effect is positive or negative.

To build intuition, let’s visualize how the rejection regions appear under the different hypotheses. Recall that according to the null hypothesis, the difference between the two groups should center around zero. Thanks to the central limit theorem, we also know this distribution approximates a normal distribution. Consequently, the rejection areas corresponding to the different alternative hypotheses look like this:
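
As a complement, the cut-off values that define these rejection regions can be computed directly; a small sketch assuming a normal approximation and alpha = 0.05:

from scipy.stats import norm

alpha = 0.05
print("Two-tailed cut-offs:", norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2))  # about -1.96 and +1.96
print("Right-tailed cut-off:", norm.ppf(1 - alpha))                          # about +1.645
print("Left-tailed cut-off:", norm.ppf(alpha))                               # about -1.645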

Why does it make a difference?

The choice of direction for the alternative hypothesis impacts the entire A/B testing process, starting with the planning phase—specifically, in determining the sample size. Sample size is calculated based on the desired power of the test, which is the probability of detecting a true difference between the two groups when one exists. To compute power, we examine the area under the alternative hypothesis that corresponds to the rejection region (since power reflects the ability to reject the null hypothesis when the alternative hypothesis is true).

Since the direction of the hypothesis affects the size of this rejection region, power is generally lower for a two-tailed hypothesis. This is due to the rejection region being divided across both tails, making it more challenging to detect an effect in any one direction. The following graph illustrates the comparison between the two types of hypotheses. Note that the purple area is larger for the one-tailed hypothesis, compared to the two-tailed hypothesis:

In practice, to maintain the desired power level, we compensate for the reduced power of a two-tailed hypothesis by increasing the sample size (Increasing sample size raises power, though the mechanics of this can be a topic for a separate article). Thus, the choice between one- and two-tailed hypotheses directly influences the required sample size for your test. 

Beyond the planning phase, the choice of alternative hypothesis directly impacts the analysis and interpretation of results. There are cases where a test may reach significance with a one-tailed approach but not with a two-tailed one, and vice versa. Reviewing the previous graph can help illustrate this: for example, a result in the left tail might be significant under a two-tailed hypothesis but not under a right one-tailed hypothesis. Conversely, certain results might fall within the rejection region of a right one-tailed test but lie outside the rejection area in a two-tailed test.

How to decide between a one-tailed and two-tailed hypothesis

Let’s start with the bottom line: there’s no absolute right or wrong choice here. Both approaches are valid, and the primary consideration should be your specific business needs. To help you decide which option best suits your company, we’ll outline the key pros and cons of each.

At first glance, a one-tailed alternative may appear to be the clear choice, as it often aligns better with business objectives. In industry applications, the focus is typically on improving specific metrics rather than exploring a treatment’s impact in both directions. This is especially relevant in A/B testing, where the goal is often to optimize conversion rates or enhance revenue. If the treatment doesn’t lead to a significant improvement, the examined change won’t be implemented.

Beyond this conceptual advantage, we have already mentioned one key benefit of a one-tailed hypothesis: it requires a smaller sample size. Thus, choosing a one-tailed alternative can save both time and resources. To illustrate this advantage, the following graphs show the required sample sizes for one- and two-tailed hypotheses with different power levels (alpha is set at 5%).
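
To get a rough sense of the numbers, a power solver can reproduce this kind of comparison. The sketch below uses the TTestIndPower class from statsmodels; the effect size is an arbitrary assumption chosen only for illustration:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alt in ["two-sided", "larger"]:
    n = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8, alternative=alt)
    print(f"{alt}: about {n:.0f} observations per group")

For the same effect size, power, and alpha, the one-tailed ("larger") version needs noticeably fewer observations per group than the two-sided one.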

In this context, the decision between one- and two-tailed hypotheses becomes particularly important in sequential testing—a method that allows for ongoing data analysis without inflating the alpha level. Here, selecting a one-tailed test can significantly reduce the duration of the test, enabling faster decision-making, which is especially valuable in dynamic business environments where prompt responses are essential.

However, don’t be too quick to dismiss the two-tailed hypothesis! It has its own advantages. In some business contexts, the ability to detect “negative significant results” is a major benefit. As one client once shared, he preferred negative significant results over inconclusive ones because they offer valuable learning opportunities. Even if the outcome wasn’t as expected, he could conclude that the treatment had a negative effect and gain insights into the product.

Another benefit of two-tailed tests is their straightforward interpretation using confidence intervals (CIs). In two-tailed tests, a CI that doesn’t include zero directly indicates significance, making it easier for practitioners to interpret results at a glance. This clarity is particularly appealing since CIs are widely used in A/B testing platforms. Conversely, with one-tailed tests, a significant result might still include zero in the CI, potentially leading to confusion or mistrust in the findings. Although one-sided confidence intervals can be employed with one-tailed tests, this practice is less common.
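
To make this concrete, here is a toy illustration with made-up numbers: a result that clears the one-tailed 5% threshold while its two-sided 95% CI still includes zero.

from scipy.stats import norm

diff, se = 1.7, 1.0                    # hypothetical estimate and standard error
z = diff / se
p_one_tailed = 1 - norm.cdf(z)         # about 0.045 -> significant in a right-tailed test
ci_low = diff - norm.ppf(0.975) * se   # about -0.26 -> the two-sided CI crosses zero
ci_high = diff + norm.ppf(0.975) * se  # about 3.66
print(p_one_tailed, (ci_low, ci_high))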

Conclusions

By adjusting a single parameter, you can significantly impact your A/B testing: specifically, the sample size you need to collect and the interpretation of the results. When deciding between one- and two-tailed hypotheses, consider factors such as the available sample size, the advantages of detecting negative effects, and the convenience of aligning confidence intervals (CIs) with hypothesis testing. Ultimately, this decision should be made thoughtfully, taking into account what best fits your business needs.

(Note: all the images in this post were created by the author)

Synthetic Control Sample for Before and After A/B Test https://towardsdatascience.com/synthetic-control-sample-for-before-and-after-a-b-test-683bac36ffc1/ Thu, 19 Dec 2024 19:32:30 +0000 Learn a simple way to use linear regression to create a synthetic control sample for your A/B test


Introduction

A/B Testing is very powerful. I like this kind of experiment because it gives us the power to compare outcomes and determine whether one version performs better than another.

A/B Testing has a specific type that adds a time component: the Before and After A/B Test. In that test, the comparison is between the situation of a given subject before and after an intervention.

Let us translate that previous sentence to a real-world example.

A company wants to know whether an advertisement would drive a sales increase, so they show that ad to a treatment group and compare the results to a control group that did not see the ad. The difference before and after the ad would indicate whether the intervention was effective or not.

Now, sometimes it is not possible to plan ahead and make that split of control and treatment groups before the intervention.

That is when the Synthetic Control sample becomes useful. Using some statistics and machine learning, it is possible to simulate what would have happened to a sample if the intervention had not occurred.

This is what we are going to learn in this post.


You can learn more about A/B testing in this post linked next.

My Easy Guide to Pre vs. Post Treatment Tests

Problem Description

We are working with a dataset from a retail chain of stores. To generate it, I used the Rossmann Stores dataset from Kaggle as my baseline to get a sense of sales distribution by store, so it would be closer to reality. Additionally, I selected control stores that are similar to the tested store. Therefore, the data was completely recreated and modified and has different store numbers and sales numbers.

The variables are:

  • Store : numbered from 1 to 10. The test store is the #10.
  • Date : daily dates from 2013 to 2015.
  • Sales : Amount of sales on that date in dollars.

Intervention:

  • A competitor opened a new store next to our Store #10, in March of 2014.
  • The control stores (1–9) don’t have a competitor around.

We intend to understand the impact caused by this competitor on Store #10’s sales to take any necessary actions and fight the competition, if applicable.

Plotting the sales trends from control and treatment stores, we get the next figure.

Control stores sales vs. Treatment Store sales. Image by the author.

Notice that the control stores keep following a steady pattern, with a slight growth even. On the other hand, the treatment store has declined considerably after the opening of the competitor.

Could this be related to the competition? Let’s keep digging.

Synthetic Control Sample

We just saw that there was a considerable decrease in sales for test Store #10 after the treatment event (the opening of a competitor in March 2014).

Calculating the mean weekly sales before the intervention date gives 68,199.28; after the intervention, it is 51,283.35. The average reduction is ~25%.

Now we could get the control stores and compare their performance against the test store like a regular Before and After A/B Test. After all, I pre-selected similar control and test stores when I was creating this data frame.

However, our intention is to compare the current performance of Store #10 with a hypothetical performance of Store #10 if no competition had been opened. To accomplish that, we need to simulate a performance without competition, thus we use the data from the control stores.

Let’s think for a minute:

  • The control stores don’t have competition around, so the assumption is that Store #10 would have a similar behavior after March 2014 without competition.
  • The Before versus After A/B Test is a comparison of two points in time of the same subject. Group A Before versus Group A after. Group B before versus Group B after.
  • We have Group B before and after, which is the test Store #10 as is, with competition.
  • We need a simulation of the behavior of Store #10 without competition opening.

So, here is what we are going to do:

  • Use Linear Regression to create a synthetic sales series for Store #10 without competition around.
  • That synthetic sample is fitted on the control stores – and that is the reason why the control stores must be similar to what we want to simulate.

Let’s do that.

Code

The libraries and modules used in this exercise.

# Data manipulation
import pandas as pd
import numpy as np

# DataViz
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Stats
import scipy.stats as scs
import pingouin as pg

# Modeling
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error

We start separating control and test stores.

# Select the control group
control = df2.query('Store != "10"')

# Treatment Group
treat = df2.query('Store == "10"')

Next, we must resample our control data from daily sales to weekly sales (sum every 7 days). Here, we can already get the series of sales by date for the test store – y_sales, given that it is a single store.

# Create dataset of control stores sales resampled from daily to weekly (7 days) sales
unit_control = control.pivot(index='Date', columns='Store', values='Sales').resample('7D').sum()

# Aggregating sales for Store #10 by each 7 days
y_sales = treat.set_index('Date')['Sales'].resample('7D').sum()

Then we will filter the control stores’ dates to the dates available for the test store, aligning both datasets. If any value is NA, we fill it with the median, and then we take the mean of all the control stores, transforming the data into a series of sales indexed by date.

# Filter the control sample with only the dates contained in the treatment dataset
aligned_dates = unit_control.index.intersection(y_sales.index)

# Get the control data filtered with dates from the treatment series
X_sales = unit_control.loc[aligned_dates].fillna(unit_control.median())

# Transform the control data into a single series aggregated as the mean of sales by week
X_sales_mean = X_sales.mean(axis=1)

Here is a view of the X_sales_mean data.

Series of control stores sales indexed by date. Image by the author.

Let’s plot both series y_sales and X_sales_mean.

# Set Date of the intervention
intervention_date = pd.to_datetime("2014-03-01")

# Plot
plt.figure(figsize=(12, 6))
plt.plot(y_sales.index, y_sales, label="Store 10", color='green', lw=2)
plt.plot(X_sales_mean.index, X_sales_mean, label="No Comp Stores", color='gray', linestyle='--')
plt.axvline(x=intervention_date, color="red", linestyle=":", lw=2, label="Opened Competitor for Store 10 ")
plt.xlabel("D A T E")
plt.ylabel("S A L E S")
plt.title("Sales GAP Store 10 vs Stores W/o Competition")
plt.legend()
plt.show()
STORE #10 versus the mean of the control stores (no competition). Image by the author.

We can observe that the results of Store #10 are consistently higher than the average of the control stores before the competition. But once the competitor opened, sales fell consistently below the average of the control stores.

Now we should fit the regression model to create the synthetic Store #10 without competition.

  • X: The sales of the control stores will be the predictor variables
  • y : The sales of the test store will be the target.
  • X_before and y_before are the training sets.
  • X_after and y_after are the test sets.
# X & y sets
X_before = X_sales[:intervention_date]
X_after = X_sales[intervention_date:]
y_before = y_sales[:intervention_date]
y_after = y_sales[intervention_date:]

# Linear Regression fit
lm = LinearRegression(fit_intercept=False)
lm.fit(X_before, y_before)

# Predictions before and after intervention
preds_before = lm.predict(X_before)
preds_after = lm.predict(X_after)

# Synthetic Control Series
synthetic_series = pd.Series(
    np.concatenate([preds_before, preds_after]),
    index= X_sales.index
)

If we plot again both series, we are comparing Store #10 with competition versus Store #10 without competition (synthetic sample).

# Plot Synthetic (No competition) vs 10
plt.figure(figsize=(12, 6))
plt.plot(y_sales.index, y_sales, label="Store 10", color='green', lw=2)
plt.plot(synthetic_series.index, synthetic_series, label="10 W/O Competition", color='gray', linestyle='--')
plt.axvline(x=intervention_date, color="red", linestyle=":", lw=2, label="Opened Competitor for Store 10 ")
plt.xlabel("D A T E")
plt.ylabel("S  A  L  E  S")
plt.title("Comparison: Store 10 vs Synthetic 10 No Comp")
plt.legend()
plt.show()

Here’s the graphic.

STORE #10 actuals versus synthetic STORE #10 w/o competition. Image by the author.

We can notice that the fit is pretty good for the initial part of the series, before the intervention date. After that, the lines open a considerable gap, where the actual Store #10 underperforms its synthetic pair, leading us to conclude that the cause of the drop in performance is, in fact, the opening of the competitor.

To quantify this gap, let’s calculate the Mean Absolute Percentage Error (MAPE) of the predictions versus the real data.

# MAPE Store 10 versus Synthetic Store 10
mape_after = mean_absolute_percentage_error(y_after, preds_after)
print(f'MAPE after intervention: {mape_after:.2f}')

--------- OUT -----------
MAPE after intervention: 0.32

The calculation points to a gap of roughly 32% between Store #10’s actual sales and its synthetic (no-competition) counterpart after the opening of the competitor.

Let’s go one step further and perform a Before vs. After A/B Test using the actual values for the control and test stores, so we can check if the results are consistent.

Before and After A/B Test

A/B Test | Image generated by AI. Meta Llama, 2024. https://meta.ai

The Before and After Test compares the performance of the control stores against the performance of the test store pre and post-intervention, which is the opening of a competitor around store 10.

Let’s set the intervention date for the test.

# Intervention date
intervention_date = pd.to_datetime("2014-03-01")

Next, we must resample our data from daily sales to sales aggregated by 7 days.

df_sales7d = (
    df2
    .pivot(index='Date', columns='Store', values='Sales') #pivot for resampling
    .resample('7D') #aggregate sales by 7 days
    .sum() #sum sales
    .reset_index() #reset to make Date as column again
    .melt(id_vars='Date', var_name='Store', value_name='Sales') # unpivot
    .assign(after= lambda x: np.select([x.Date <= intervention_date, x.Date > intervention_date],
                                        [0, 1])) # add after Yes or No
    .assign(group= lambda x: np.select([x.Store == "10", x.Store != "10"],
                                        ["treatment", "control"])) # add group treatment or control
    )

# View
df_sales7d.head(2)

Here is the view of the data.

Sales aggregated 7 days. Image by the author.

Next, we calculate the averages and standard deviations of the samples.

# Calculating averages and standard errors
ab_means = (df_sales7d
 .groupby(['group', 'after'])
 .agg({'Sales':['mean', 'std']})
 .round(2)
 )
Avg and Std. error of the samples. Image by the author.

Moving forward, we have to define a couple of functions:

  • std_error_two_samples: This function calculates a single standard error for the control and test samples. This will be important for the confidence interval calculation of the difference between the Before and After groups.
  • ab_test: a function to perform the A/B test and calculate the confidence interval for the difference between the groups.
  • Note: Both functions can be found in the GitHub repository, here.
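
For reference, here is a minimal sketch of what those two helpers might look like; this is an assumption reconstructed from the printed output below, not the exact code from the repository.

import numpy as np
import scipy.stats as scs

def std_error_two_samples(a, b):
    # Standard error of the difference between two sample means
    return np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def ab_test(data, group_col, target_col, conf=0.95):
    # Assumes group_col has exactly two levels; the second (sorted) level is "Group B"
    g1, g2 = sorted(data[group_col].unique())
    a = data.loc[data[group_col] == g1, target_col]
    b = data.loc[data[group_col] == g2, target_col]
    se = std_error_two_samples(a, b)
    diff = b.mean() - a.mean()
    p_value = 2 * (1 - scs.norm.cdf(abs(diff) / se))
    z_crit = scs.norm.ppf(1 - (1 - conf) / 2)
    print(f'The calculated standard error is {se}')
    print(f'The difference in means Group B - A : {diff}')
    print(f'P-Value (two tails): {p_value}')
    print(f'confidence interval with {int(conf * 100)}% confidence is [{diff - z_crit * se:.2f} {diff + z_crit * se:.2f}]')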
print('Control Before vs After:')
ab_test(data=  df_sales7d.query('group == "control"').rename(columns={'group':'grp','after':'group'}),
        group_col= 'group',
        target_col = 'Sales')

print('---------------------------------------------------------------------')

print('Treatment Before vs After:')
ab_test(data=  df_sales7d.query('group == "treatment"').rename(columns={'group':'grp','after':'group'}),
        group_col= 'group',
        target_col = 'Sales')

Resulting in the following output:

Control Before vs After:
The calculated standard error is 1635.206108090877
The difference in means Group B - A : 1118.971107665042
P-Value (two tails): 0.49378591381056647
confidence interval with 95% confidence is [-2085.97  4323.92]
---------------------------------------------------------------------
Treatment Before vs After:
The calculated standard error is 1695.1522766982898
The difference in means Group B - A : -16915.930695613657
P-Value (two tails): 0.0
confidence interval with 95% confidence is [-20238.37 -13593.49]

For the control samples, at a significance level of 5%, we cannot reject the null hypothesis of equal means. So the stores are statistically behaving similarly in both periods. The difference between both sample means will fluctuate between -$2,085 and $4,323 a week.

For the test Store #10, the p-value suggests that the sample means are statistically different. So that store was impacted by the competition and dropped about $17k (-25%) in sales a week, with this difference varying from -$20k (-29%) to -$13.5k (-19%), both numbers well below zero.

But, hey… I am a visual guy. I like to see the difference in a graphic.

plt.figure(figsize=(18, 5))

# Control Samples Before vs. After
# Creating Normal Distribution of both Control samples
plt.subplot(1, 2, 1)
plot_a = np.random.normal(loc= ab_means.iloc[0,0], scale= ab_means.iloc[0,1], size=10000)
plot_b = np.random.normal(loc= ab_means.iloc[1,0], scale= ab_means.iloc[1,1], size=10000)
plot = pd.DataFrame({'group': ['Control_Before']*10000 + ['Control After']*10000,
                     'mu': np.concatenate([plot_a, plot_b])})
# Vertical lines at each sample mean
plt.axvline(x=ab_means.iloc[0,0], color="royalblue", linestyle="--", lw=1)
plt.axvline(x=ab_means.iloc[1,0], color="darkorange", lw=1)
sns.kdeplot(plot, x='mu', hue='group')
plt.title('Control Before vs After');

plt.subplot(1, 2, 2)
# Treatment Samples Before vs. After
# Creating Normal Distribution of both Treatment samples
plot_a = np.random.normal(loc= ab_means.iloc[2,0], scale= ab_means.iloc[2,1], size=10000)
plot_b = np.random.normal(loc= ab_means.iloc[3,0], scale= ab_means.iloc[3,1], size=10000)
plot = pd.DataFrame({'group': ['Treatment Before']*10000 + ['Treatment After']*10000,
                     'mu': np.concatenate([plot_a, plot_b])})
# Vertical lines at each sample mean
plt.axvline(x=ab_means.iloc[2,0], color="royalblue", linestyle="--", lw=1)
plt.axvline(x=ab_means.iloc[3,0], color="darkorange", lw=1)
sns.kdeplot(plot, x='mu', hue='group')
plt.title('Treatment Before vs After');

And here it is.

Before vs After A/B Test | Control vs Treatment samples. Image by the author.

I believe we have eliminated any doubts about the impact on sales caused by the new competitor.

Before You Go

Causal Inference has been growing lately, and it is more and more common to see new content about the topic. After all, as the saying goes, "correlation does not imply causation", so when a client or the leadership at work asks us what the impact (or cause) of some effect is, we must have tools to assess that.

Some nice Python packages, like DoWhy and CausalImpact, are out there to help us infer causality from the results.

The exercise we performed in this post is very similar to what the package causalimpact does. If we run some code on it, the result should look familiar to you now.

# !pip install causalimpact --quiet
from causalimpact import CausalImpact

# Create dataset with mean of the control stores (x) and the test store (y)
df4 = (pd.concat([X_sales_mean.round(2), y_sales], axis=1)
       .rename(columns={0: 'x', 'Sales': 'y'})
       .reindex(columns=['y', 'x'])
       )

# Run Causal Impact
impact = CausalImpact(data=df4,
                      pre_period=[pd.to_datetime("2013-01-01"), pd.to_datetime("2014-03-04")],
                      post_period=[pd.to_datetime("2014-03-11"), pd.to_datetime("2015-07-28")])

impact.run()

# Plot result
impact.plot()

# Print Summary
impact.summary()

# Print Report
impact.summary(output="report")

This is the result. Very similar to our previous comparison.

Causal Impact library output. Image by the author.

I recommend you take a look, but be aware of some conflicts with newer versions of Pandas.


That’s it. If you liked this content, follow me for more.

Follow Me | Blog & Website

Gustavo R Santos – Medium

Gustavo R Santos

Complete code GitHub

Before-and-After-Testing/Python/Causal Inference at main · gurezende/Before-and-After-Testing

References

Controle Sintético: Como Aplicar Inferência Causal na Prática (Synthetic Control: How to Apply Causal Inference in Practice)

4 Python Packages to Learn Causal Analysis

Notebook on nbviewer

Sequential Testing: The Secret Sauce for Low-Volume A/B Tests https://towardsdatascience.com/sequential-testing-the-secret-sauce-for-low-volume-a-b-tests-fe62bdf9627b/ Thu, 29 Aug 2024 06:32:17 +0000 How to Accelerate Decision-Making with Low Volume Data

How to accelerate decision-making and improve accuracy when dealing with limited data
Image generated by OpenAI’s ChatGPT

What is A/B Testing and Why is it Hard?

A/B testing is a simple way to reduce uncertainty in decision-making by providing a data-driven way to determine which version of a product is more effective. The concept of A/B testing is simple.

  • Imagine you are at a friend’s birthday party. You’ve been painstakingly working on perfecting your cookie recipe. You think you’ve perfected it, but you don’t know if people will prefer the cookie with or without oats. In your opinion, oats give the cookie a nice chewy texture. However, you’re not sure if this is a mass opinion or just your individual preference.
  • You end up showing up to the party with two different versions of the cookie: cookie A has oats and cookie B doesn’t. You randomly give half of your friends cookie A, and the other half gets cookie B.
  • You decide that the cookie that gets more "yums" is the better cookie.
  • Once everyone has tasted the cookie, you find that cookie B got more "yums" and conclude that is the better cookie.

This process of randomly distributing cookies to party guests and monitoring their feedback is an example of an A/B test.

In the world of technology, A/B testing provides a data driven way to determine which version of a product is more effective. By randomly routing users to different versions of an experience, you can empirically measure the impact of different product versions on key performance metrics. This allows you to validate changes, and iteratively optimize product offerings.

In my role as a senior Data Science manager, we most commonly use A/B testing to test different pricing models to see which model leads to the most purchases. Consider two pricing strategies – one where the product is priced at $19.99 and the other is priced at $24.99 with a 20% discount. These two pricing strategies lead to the same price, but are customers more likely to purchase if they see a 20% discount? We can test this using A/B testing!

Traditional A/B tests typically require a certain amount of samples before you can conclude that one version of the product or model is better than others. In other words, traditional A/B tests require enough samples so that the test itself can be considered statistically significant. The number of samples required to achieve statistical significance on an A/B test is set before the experiment begins, and then you wait. This is referred to as fixed sample size A/B testing.

Fixed sample size A/B testing is problematic for a plethora of reasons.

  1. Time Intensive: In large companies with huge volumes, you may reach your desired sample size quickly. However, if you’re like me, and work in a small startup where volume isn’t as large – waiting for the test to finish can be time intensive. Recently, my team designed an A/B test only to realize that it would take us 2 years to reach the desired sample size!
  2. Inflexibility: Once you’ve established the required sample size for your A/B test, you’re locked into that decision. If external factors change, you can’t easily adjust the test without compromising its validity.

What is sequential Testing and why is it (maybe) Easier?

Sequential testing is a version of A/B testing that allows for continuous monitoring of data as it is collected, enabling decisions to be made earlier than in traditional fixed-sample tests. By using predefined stopping rules, you can stop the test as soon as sufficient evidence is gathered.

Sequential testing is an alternative to fixed sample size testing. It’s commonly used in situations where there are:

  • Low volumes: When you have limited data coming in and need to make decisions quickly, sequential testing allows you to draw conclusions without waiting for a large sample size.
  • Cost or time constraints: If the cost or time to collect data is high, sequential testing can help reduce the number of samples needed by allowing the test to stop as soon as a clear result is observed.
  • Adaptive factors: When conditions or user behavior might change over time, sequential testing allows for more flexible decision-making and adaptation as new data is collected.

How does Sequential Testing Work?

Implementing sequential testing relies on the Sequential Probability Ratio Test ("SPRT"). This ratio is used to test two competing hypotheses:

  1. Null Hypothesis (H₀): The parameter of interest (like a conversion rate) is equal to a specified value, often the status quo or baseline: 𝑝 = 𝑝₀
  2. Alternative Hypothesis (H₁): The desired change in the parameter of interest: 𝑝 = 𝑝₀ + Δ

Once you have defined the null and alternative hypothesis, you need to set up decision boundaries.

  • SPRT uses two boundaries (upper and lower) to decide whether to accept H₀, accept H₁, or continue collecting data.

These boundaries are determined based on desired error rates

  • Type 1 error (⍺): A type 1 error occurs when you conclude that there is a meaningful difference in the A and B groups, when in reality there isn’t a difference. This is also known as a false positive.
  • Type 2 error (β): A type 2 error occurs when you conclude there is no meaningful difference in the A and B groups, when in reality there is a difference. This is also known as a false negative.

In sequential testing, ⍺ and β are commonly set at 0.05 and 0.20. However, these need to be set appropriately to reflect your experiment. Once the desired error rates have been set, you use them to set the relevant boundaries.

  • Upper boundary (U) = (1- β)/⍺
  • Lower boundary (L) = β / (1 -⍺)

For each new observation that comes in, we update the likelihood ratio as LRₙ = LRₙ₋₁ × P(dataₙ | H₁) / P(dataₙ | H₀). This link has a conditional probability refresher.

Each time this likelihood ratio is updated, it’s compared against the boundaries we set previously:

  • If ℒ > U, reject H₀ and accept H₁
  • If ℒ < L , reject H₁ and accept H₀
  • If L ≤ ℒ ≤ U, continue the test and collect more data

In the section below, we will walk through an illustrative example.

Sequential Testing Example

Imagine you are a data scientist responsible for figuring out if the model you’ve developed recently results in more conversions than the current production model. You decide that you will randomly route a portion of potential customers to your new model, while the remainder will continue to use the model currently in production.

The existing production model has an associated conversion rate of 5%. We hope that our new model will increase this conversion rate to 7%, but we’re not sure. For that reason, we develop a sequential A/B test to test this.

First, we define our hypotheses.

  • H₀: 𝑝 = 0.05; baseline conversion rate
  • H₁: 𝑝 = 0.07; desired conversion rate

Next, we will set up our decision boundaries. We will use commonly used error rates to set our boundaries (⍺ = 0.05, β = 0.2).

  • Upper boundary (U) = (1- β)/⍺ = (1–0.2)/.05 = 0.8/.05 = 16
  • Lower boundary (L) = β / (1 – ⍺) = 0.2 / (1-.05) = 0.2/0.95 ≅ 0.211

At this point, we will collect some data. For each observation, whether it is a success (conversion) or a failure (no conversion), we will update the likelihood ratio consistently.

  • In the case of a success, we will always multiply the current likelihood ratio by P(success|H₁)/P(success|H₀) = 0.07/0.05 = 1.4.
  • In the case of a failure, we will always multiply the current likelihood ratio by P(failure|H₁)/P(failure|H₀) = (1–0.07)/(1–0.05) = 0.93/0.95 ≅ 0.98.

Below, I’ve simulated some observations and the associated changes to our likelihood ratio. As a disclaimer, I made sure we got a lot of successes early so the table wasn’t thousands of observations long (based on our actual conversion rate, this is unlikely).

Image by author

After 11 observations, we find that our model isn’t just good, it’s great! We’re converting nearly everything. We reject H₀ and accept H₁. Obviously, this is a simplified example but this is generally how sequential testing works.
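
Here is a short sketch of that update loop in Python; the numbers mirror the example above, and the simulated data is purely illustrative:

import numpy as np

p0, p1 = 0.05, 0.07          # baseline and hoped-for conversion rates
alpha, beta = 0.05, 0.20     # desired error rates
upper = (1 - beta) / alpha   # 16
lower = beta / (1 - alpha)   # ~0.211

rng = np.random.default_rng(7)
lr = 1.0
for n in range(1, 100_001):
    converted = rng.random() < p1              # simulate traffic on the new model
    lr *= (p1 / p0) if converted else (1 - p1) / (1 - p0)
    if lr >= upper:
        print(f"Reject H0 (accept H1) after {n} observations")
        break
    if lr <= lower:
        print(f"Accept H0 after {n} observations")
        break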

Is Sequential Testing Risk Free?

In this article, we’ve explored the idea of sequential testing as an alternative to fixed sample size A/B testing. Sequential testing offers advantages, such as the potential for faster decision-making and greater adaptability to evolving market conditions. These benefits can lead to more efficient experimentation, particularly in low-volume environments where you might not have the time to wait to accumulate a certain sample size. I’ve painted a rather blissful picture for sequential testing, but sequential testing isn’t without its own set of risks.

  • Early Data Can Lead You Astray: Since the test is checked as each new observation comes in, the chance of incorrectly rejecting the null hypothesis increases, especially on early data. Early data might show a strong effect that diminishes as more data is collected, leading to premature conclusions. In my example above, we had 9 conversions in 11 observations, despite a baseline conversion rate of 5%. These results could be considered an outlier, leading us to reject the null hypothesis too quickly only to revert to the baseline at a later date.
  • Complexity in Interpretation: Statistical interpretation of sequential tests can be more complex. Although we don’t cover it in this article, to maintain the validity of the results, sequential tests often require the use of advanced statistical methods, such as alpha spending functions or other corrections. These methods ensure that the overall type I error rate remains controlled throughout the multiple testing stages. However, the added complexity can make it more challenging to correctly interpret the results, potentially leading to misinformed decisions if not properly understood.

Hopefully by now, you know what sequential testing is and have an idea about the pros and cons of sequential testing. Although sequential testing can offer many benefits, there are scenarios where the predictability and simplicity provided by fixed sample size A/B testing may be more appropriate for your experiment. The choice between sequential and fixed sample size testing should be guided by the specific goals, constraints, and context of your experiment.

Help me grow my page!

All claps and comments are appreciated. It’s how Medium knows if I’m doing a good job!

  • Follow me on Medium too
  • Subscribe to my newsletter below

Better A/B Testing with Survival Analysis https://towardsdatascience.com/better-a-b-testing-with-survival-analysis-6205c3a2dc51/ Wed, 31 Jul 2024 11:08:19 +0000 When running experiments - don't forget to bring your survival kit

When running experiments – don’t forget to bring your survival kit
Image by author, using DALL-E 3

I’ve already made the case in several blog posts (Here, [here](https://www.linkedin.com/pulse/better-churn-prediction-part-3-iyar-lin-ov5af/) and here) that using survival analysis can improve churn prediction.

In this blog post I’ll show another use case where Survival Analysis can improve on common practices: A/B testing!

The problems with common A/B testing practices

Usually when running an A/B test analysts assign users randomly to variants over time and measure conversion rate as the ratio between the number of conversions and the number of users in each variant. Users who just entered the test and those who are in the test for 2 weeks get the same weight.

This can be enough for cases where a conversion either happens or not within a short time frame after assignment to a variant (e.g. Finishing an on-boarding flow).

There are however many instances where conversions are spread over a longer time frame. One example would be first order after visiting a site landing page. Such conversions may happen within minutes, but a large portion could still happen days after the first visit.

In such cases the business KPIs are usually "bounded" to a certain period – e.g. "conversion within 7 days" or "churn within 1 month".

In those instances measuring conversions without considering their timing has 2 major flaws:

  1. It makes the statistic we’re measuring unintelligible – Average conversions at any point in time does not translate to any bounded metric. In fact as the test keeps running – conversion rates will increase just because users get more time to convert. The experiment results will be thus hard to relate to the business KPIs.
  2. It discards the timing information which could lead to reduced power compared with methods that do take conversion timing into account.

To demonstrate point #2 we’ll run a small simulation study

We’ll have users join the experiment randomly over a 30 day period. Users’ time to convert will be simulated from a Weibull distribution with scale 𝜎=30,000, 𝛼_ctrl=0.18 for the control group and 𝛼_trt=0.157 for the treatment group.

Below are the corresponding survival curves:

# Setup assumed from the original post: the survival package and a small
# percentage-formatting helper are used throughout
library(survival)
format_pct <- function(x) paste0(round(100 * x, 1), "%")

alpha_ctrl <- 0.18
alpha_trt <- 0.157
sigma <- 30000
conv_7d_ctrl <- format_pct(pweibull(7, alpha_ctrl, sigma))
conv_7d_trt <- format_pct(pweibull(7, alpha_trt, sigma))
t <- seq(0, 7, 0.1)
surv_ctrl <- 1 - pweibull(t, alpha_ctrl, sigma)
surv_trt <- 1 - pweibull(t, alpha_trt, sigma)
plot(t, surv_trt, type = "l", col = "red", ylab = "S(t)", xlab = "t (days)",
  ylim = c(0.7, 1))
lines(t, surv_ctrl, col = "black")
legend("topright", col = c("black", "red"),
  legend = c("Control", "Treatment"), lty = 1, title = "Variant")
Image by author

Assuming we’re interested with conversions within 7 days, the true (unknown) conversion rate in the control group is 19.9% and in the treatment is 23.6%.

Below is the function which will generate the simulation data:

n <- 2000 
test_duration <- 30 
gen_surv_data <- function(m, alpha){ 
  set.seed(m) 
  tstart <- runif(n, 0, test_duration) 
  tconvert <- rweibull(n, alpha, sigma) 
  status <- as.integer(tstart + tconvert < test_duration) 
  tstatus <- ifelse(status == 0, test_duration - tstart, tconvert) 
  return(data.frame(tstatus=tstatus, status=status)) 
}

To demonstrate the benefits of using survival in A/B testing we’ll compare the power of 3 test statistics:

  1. T-test on conversions (the common procedure)
  2. T-test on 7 day conversion (estimated using a Kaplan-Meier curve)
  3. Peto & Peto modification of the Gehan-Wilcoxon test

Below is the code that implements the above:

run_simulation <- function(m, alpha1, alpha2){ 
  data_1 <- gen_surv_data(m, alpha1) 
  data_2 <- gen_surv_data(m+1, alpha2) 
  # T-test on conversions (the common procedure): 
  p1_hat <- mean(data_1$status) 
  p1_var <- p1_hat*(1-p1_hat)/length(data_1$status) 
  p2_hat <- mean(data_2$status) 
  p2_var <- p2_hat*(1-p2_hat)/length(data_2$status) 
  stat <- abs(p2_hat - p1_hat)/sqrt(p1_var + p2_var) 
  ans1 <- pnorm(stat, lower.tail = F)*2 
  # T-test on 7 day conversion (estimated using a Kaplan-Meier curve): 
  data_1$variant <- "control" 
  data_2$variant <- "treatment" 
  surv_data <- rbind(data_1, data_2) 
  surv_model <- summary(survfit(Surv(tstatus, status)~variant, 
    data = surv_data), times = 7, extend = T) 
  p1_hat <- 1 - surv_model$surv[1] 
  p1_var <- surv_model$std.err[1]^2 
  p2_hat <- 1 - surv_model$surv[2] 
  p2_var <- surv_model$std.err[2]^2 
  stat <- abs(p2_hat - p1_hat)/sqrt(p1_var + p2_var) 
  ans2 <- pnorm(stat, lower.tail = F)*2 
  # Peto & Peto modification of the Gehan-Wilcoxon test: 
  mgw_test <- survdiff(Surv(tstatus, status)~variant, data = surv_data, 
    rho = 1) 
  ans3 <- mgw_test$pvalue 
  return(data.frame(
    `T-test conversions` = ans1, 
    `T-test KM 7 day conversion` = ans2, 
    `Modified Gehan-Wilcoxon test` = ans3, check.names = F)) 
}

Before measuring power let’s verify our statistics satisfy the desired false positive rate 𝛼=0.05 (5%) when both variants have the same conversion rates:

alpha <- 0.05 
M <- 500 
res <- Reduce("rbind", lapply(1:M, function(m) 
  run_simulation(m, alpha_ctrl, alpha_ctrl))) 
res <- data.frame(Statistic = names(res), 
  `False positive rate` = format_pct(sapply(res, function(x) mean(x<=alpha))), 
  check.names = F, row.names = NULL) 
knitr::kable(res, align = "c")
Image by author

Next, let’s examine power:

M <- 2000 
res <- Reduce("rbind", lapply(1:M, function(m) 
  run_simulation(m, alpha_ctrl, alpha_trt))) 
res <- data.frame(
  Statistic = names(res), 
  Power = sapply(res, function(x) mean(x<=alpha)), 
  check.names = F, row.names = NULL) 
uplift_logrank <- format_pct((res[3,2] - res[1,2])/res[1,2]) 
uplift_km <- format_pct((res[2,2] - res[1,2])/res[1,2]) 
res$Power <- format_pct(res$Power)
knitr::kable(res, align = "c")
Image by author

While the T-test on KM 7 day conversion relates better to business KPIs than the T-test on conversions (the common procedure), it is only marginally more powerful.

The modified Gehan-Wilcoxon statistic on the other hand yields a substantial uplift in power, while only weakly relating to the business KPIs like the regular conversions T-test.

It should be noted that the power gains vary somewhat according to the point compared on the survival curve, the actual survival curve shape, experiment duration etc.

In a future post I hope to further explore this topic over a wider set of scenarios and test statistics (The R ComparisonSurv package looks promising).

When doing A/B testing in scenarios where time to convert varies, it’s often useful to apply survival analysis to take advantage of the time dimension. Either compare a point of interest on the survival curve to make the result relate directly to the business KPIs, or use the modified Gehan-Wilcoxon statistic for improved power.
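
For readers who work in Python rather than R, a rough equivalent of the Kaplan-Meier comparison at day 7 and the (unweighted) log-rank test could look like the sketch below, assuming the lifelines package and a data frame with the same tstatus, status and variant columns used above:

import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def conversion_within(df, variant, t=7):
    sub = df[df["variant"] == variant]
    kmf = KaplanMeierFitter()
    kmf.fit(sub["tstatus"], event_observed=sub["status"])
    return 1 - kmf.survival_function_at_times(t).iloc[0]  # conversion = 1 - S(t)

def compare_variants(df):
    ctrl = df[df["variant"] == "control"]
    trt = df[df["variant"] == "treatment"]
    res = logrank_test(ctrl["tstatus"], trt["tstatus"],
                       event_observed_A=ctrl["status"],
                       event_observed_B=trt["status"])
    return conversion_within(df, "control"), conversion_within(df, "treatment"), res.p_value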


Originally published at https://www.linkedin.com.

Product Quasi-Experimentation: Statistical Techniques When Standard A/B Testing Is Not Possible https://towardsdatascience.com/product-quasi-experimentation-statistical-techniques-when-standard-a-b-testing-is-not-possible-68d516a59b1c/ Fri, 19 Jul 2024 17:53:47 +0000 A guide to the most popular techniques when randomized A/B testing is not possible

Randomized Controlled Trials (RCTs) are the most classical form of product A/B Testing. In Tech, companies widely use A/B testing as a way to measure the effect of an algorithmic change on user behavior or the impact of a new user interface on user engagement.

Randomization of the unit ensures that the results of the experiment are uncorrelated with the treatment assignment, eliminating selection bias and hence enabling us to rely upon assumptions of statistical theory to draw conclusions from what is observed.

However, random assignment is not always possible, i.e. subjects of an experiment cannot be randomly assigned to the control and treatment groups. There are cases where targeting a specific user is impractical due to spillover effects or unethical, hence experiments need to happen on the city/country level, or cases where you cannot practically enforce the user to be in the treatment group, like when testing a software update. In those cases, statistical techniques need to be applied since the basic assumptions of statistical theory are no longer valid once the randomization is violated.

Let’s see some of the most commonly used techniques, how they work in simple terms and when they are applied.

Statistical Techniques

Difference in Differences (DiD)

This method is usually used when the subject of the experiment is aggregated at the group level. The most common case is when the subject of the experiment is a city or a country: for example, when a company tests a new feature by launching it only in a specific city or country (treatment group) and then compares the outcome to the rest of the cities/countries (control group). Note that in this case, cities or countries are often selected based on their product-market fit, rather than being randomly assigned. This approach helps ensure that the test results are relevant and generalizable to the target market.

DiD measures the change in the difference in the average outcome between the control and treatment groups over the pre- and post-intervention periods. If the treatment has no effect on the subjects, you would expect to see a constant difference between the treatment and control groups over time. This means that the trends in both groups would be similar, with no significant changes or deviations after the intervention.

Therefore, DiD compares the average outcome in the treatment vs. control groups post-treatment and tests for statistical significance under the null hypothesis (H₀) that the treatment and control groups had parallel trends pre-treatment and that those trends remain parallel post-treatment. If a treatment has no impact, the treatment and control groups will show similar patterns over time. However, if the treatment is effective, the patterns will diverge after the intervention, with the treatment group showing a significant change in direction, slope, or level compared to the control group.

If the assumption of parallel trends is met, DiD can provide a credible estimate of the treatment effect. However, if the trends are not parallel, the results may be biased, and alternative methods (such as the Synthetic Control methods discussed below) or adjustments may be necessary to obtain a reliable estimate of the treatment effect.

DiD Application

Let’s see how DiD is applied in practice by looking at Card and Krueger study (1993) that used the DiD approach to analyze the impact of a minimum wage increase on employment. The study analyzed 410 fast-food restaurants in New Jersey and Pennsylvania following the increase in New Jersey’s minimum wage from $4.25 to $5.05 per hour. Full-time equivalent employment in New Jersey was compared against Pennsylvania’s before and after the rise of the minimum wage. New Jersey, in this natural experiment, becomes the treatment group and Pennsylvania the control group.

By using this dataset from the study, I tried to replicate the DiD analysis.

import pandas as pd
import statsmodels.formula.api as smf 

df = pd.read_csv('njmin3.csv')
df.head()
Dataset as obtained using this dataset

In the data, column "nj" is 1 if it is New Jersey, column "d" is 1 if it is after the NJ min wage increase and column "d_nj" is the nj × d interaction.

Based on the basic DiD regression equation, full-time-equivalent employment (fte) is modeled as

fte_it = α + β·nj_i + γ·d_t + δ·(nj_i × d_t) + ϵ_it

where ϵ_it is the error term.

model = smf.ols(formula = "fte ~ d_nj + d + nj", data = df).fit()
print(model.summary())

The key parameter of interest is the nj × d interaction (i.e., "d_nj"), which estimates the average treatment effect of the intervention. The result of the regression shows that "d_nj" is not statistically significant (the p-value is 0.103 > 0.05), meaning we find no statistically significant impact of the minimum wage increase on employment.

Synthetic Controls

Synthetic control methods compare the unit of interest (city/country in treatment) to a weighted average of the unaffected units (cities/countries in control), where the weights are selected in a way that the synthetic control unit best matches the treatment unit pre-treatment behavior.

The post-treatment outcome of the treatment unit is then compared to the synthetic unit, which serves as a counterfactual estimate of what would have happened if the treatment unit had not received the treatment. By using a weighted average of control units, synthetic control methods can create a more accurate and personalized counterfactual scenario, reducing bias and improving the estimates of the treatment effect.
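
A minimal sketch of the weight-fitting idea (an illustration under simplifying assumptions, not the full synthetic control machinery): non-negative weights that sum to 1 are chosen so that the weighted control units track the treated unit before the intervention.

import numpy as np
from scipy.optimize import minimize

def synthetic_control_weights(X_controls_pre, y_treated_pre):
    # X_controls_pre: (T_pre, J) pre-treatment outcomes of J control units
    # y_treated_pre:  (T_pre,)   pre-treatment outcomes of the treated unit
    J = X_controls_pre.shape[1]
    loss = lambda w: np.sum((y_treated_pre - X_controls_pre @ w) ** 2)
    res = minimize(loss, np.full(J, 1 / J), method="SLSQP",
                   bounds=[(0, 1)] * J,
                   constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1}])
    return res.x

# The counterfactual for the post-period is then X_controls_post @ weights.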

For more detailed explanation on how the Synthetic Control methods work by means of an example, I found the Understanding the Synthetic Control Methods particularly helpful.

Propensity Score Matching (PSM)

Think about designing an experiment to evaluate, let’s say, the impact of a Prime subscription on revenue per customer. There is no way you can randomly assign users to subscribe or not. Instead, you can use propensity score matching to find non-Prime users (control group) that are similar to Prime users (treatment group) based on characteristics like age, demographics, and behavior.

The propensity score used in the matching is basically the probability of a unit receiving a particular treatment given a set of observed characteristics and it is calculated using logistic regression or other statistical methods. Once the propensity score is calculated, units in the treatment and control group are matched based on these scores, creating a synthetic control group that is statistically similar to the treatment. This way, you can create a comparable control group to estimate the effect of the Prime subscription.

Similarly, when studying the effect of a new feature or intervention on teenagers and parents, you can use PSM to create a control group that resembles the treatment group, ensuring a more accurate estimate of the treatment effect. These methods help mitigate confounding variables and bias, allowing for a more reliable evaluation of the treatment effect in non-randomized settings.
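
A simplified sketch of that workflow (column names, the 1:1 nearest-neighbour choice and the lack of a caliper are all assumptions made for illustration):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_score_match(df, treatment_col, covariate_cols):
    # 1. Estimate propensity scores with a logistic regression
    lr = LogisticRegression(max_iter=1000)
    lr.fit(df[covariate_cols], df[treatment_col])
    df = df.assign(pscore=lr.predict_proba(df[covariate_cols])[:, 1])

    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]

    # 2. Match each treated unit to its nearest control on the propensity score
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    _, idx = nn.kneighbors(treated[["pscore"]])
    return treated, control.iloc[idx.flatten()]

# Example (hypothetical columns):
# treated, matched = propensity_score_match(users, "is_prime", ["age", "tenure", "visits"])
# effect = treated["revenue"].mean() - matched["revenue"].mean()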

Takeaway

When standard A/B testing and randomization of the units are not possible, we can no longer rely upon assumptions of statistical theory to draw conclusions from what is observed. Statistical techniques, like DiD, Synthetic Controls, and PSM, need to be applied once randomization is violated.

On top of those, there are more popular techniques in addition to the ones discussed here, such as Instrumental Variables (IV), Bayesian Structural Time Series (BSTS), and Regression Discontinuity Design (RDD), that are used to estimate the treatment effect when randomization is not possible or when there is no control group at all.

Chi-Squared Test: Revealing Hidden Patterns in Your Data https://towardsdatascience.com/chi-squared-test-revealing-hidden-patterns-in-your-data-d939df2dda71/ Tue, 25 Jun 2024 04:32:41 +0000 Unlock Hidden Patterns in Your Data with the Chi-Squared Test in Python.

Unlock hidden patterns in your data with the chi-squared test in Python.
Cover Photo by Sulthan Auliya on Unsplash

Part 1: What is Chi-Squared Test?

When discussing Hypothesis Testing, there are many approaches we can take, depending on the particular case. Common tests like the z-test and t-test are the go-to methods for testing our hypotheses (null and alternative hypotheses). The metric we want to test differs depending on the problem. Usually, in generating hypotheses, we use the population mean or population proportion as the metric to state them. Let’s say we want to test whether the population proportion of students who scored at least 75 on a math test is more than 80%. Let the null hypothesis be denoted by H0 and the alternative hypothesis by H1; we generate the hypotheses as follows:

Figure 1: Example of generating hypotheses by Author

After that, we should look at our data to see whether the population variance is known or unknown, in order to decide which test statistic formula to use. In this case, we use the z-statistic formula for a proportion. To calculate the test statistic from our sample, we first estimate the population proportion by dividing the number of students who got at least 75 by the total number of students who took the test. We then plug the estimated proportion into the test statistic formula. Finally, we decide whether to reject or fail to reject the null hypothesis by comparing the test statistic with the rejection region or by using the p-value.

But what if we want to test different cases? What if we want to make inferences about the proportions across groups of students (e.g., classes A, B, C, etc.) in our dataset? What if we want to test whether there is any association between the group a student belongs to and their preparation before the exam (whether or not they took extra courses outside school), i.e., whether the two variables are independent? Whenever we want to test categorical data and make inferences about the population, we use the chi-squared test.

The chi-squared test is crafted to help us draw conclusions about categorical data that fall into different categories. It compares each category's observed frequencies (counts) to the expected frequencies under the null hypothesis. Denoted as X², the test statistic follows the chi-squared distribution, allowing us to determine the significance of the observed deviations from the expected values.

Figure 2: Chi-Squared Distribution made in Matplotlib by Author

The plot shows the continuous chi-squared distribution for several degrees of freedom. In the chi-squared test, to decide whether to reject or fail to reject the null hypothesis, we don't use the z or t table, but the chi-squared table, which lists critical values for selected significance levels and degrees of freedom. There are two types of chi-squared tests: the chi-squared goodness-of-fit test and the chi-squared test of a contingency table. Each of these has a different purpose in hypothesis testing. Alongside the theory behind each test, I'll show you how to run both in practical examples.

Part 2: Chi-squared goodness-of-fit test

This is the first type of chi-squared test. It analyzes categorical data from a single categorical variable with k categories and is used specifically to describe the proportion of observations in each category within the population. For example, we surveyed 1000 students who got at least 75 on their math test and observed the following distribution across 5 groups of students (Class A to E):

Figure 3: Dummy data generated randomly by Author (observed counts – Class A: 157, B: 191, C: 186, D: 163, E: 303)

We will do it in both manual and Python ways. Let’s start with the manual one.

Form Hypotheses

As we know, we have already surveyed 1000 students. I want to test whether the population proportions in each class are equal. The hypotheses will be:

Figure 4: Hypotheses of students who got at least 75 from 5 classes by Author (H0: p1 = p2 = p3 = p4 = p5 = 20%; H1: at least one proportion differs)

Test Statistic

The test statistic formula for the chi-squared goodness-of-fit test is like this:

Figure 5: The chi-squared goodness-of-fit test statistic by Author: χ² = Σᵢ (fᵢ − eᵢ)² / eᵢ, summing over the k categories

Where:

  • k: number of categories
  • fi: observed counts
  • ei: expected counts

We already have the number of categories (5 from Class A to E) and the observed counts, but we don’t have the expected counts yet. To calculate that, we should reflect on our hypotheses. In this case, I assume that all class proportions are the same, which is 20%. We will make another column in the dataset named Expected. We calculate it by multiplying the total number of observations by the proportion we choose:

Figure 6: Calculate expected counts by Author (expected count per class = 1000 × 0.20 = 200)

Now we plug in the formula like this for each observed and expected value:

Figure 7: Calculate test statistic of goodness-of-fit test by Author: χ² = (157−200)²/200 + (191−200)²/200 + (186−200)²/200 + (163−200)²/200 + (303−200)²/200 ≈ 70.52

We already have the test statistic result. But how do we decide whether it will reject or fail to reject the null hypothesis?

Decision Rule

As mentioned above, we'll compare the test statistic with the chi-squared table. Remember that a small test statistic supports the null hypothesis, whereas a large test statistic supports the alternative hypothesis. So, we should reject the null hypothesis when the test statistic is large (meaning this is an upper-tailed test). Because we do this manually, we use the rejection region to decide whether to reject or fail to reject the null hypothesis. The rejection region is defined as below:

Figure 8: Rejection region of goodness-of-fit test by Author: reject H0 if χ² > χ²(α, k−1)

Where:

  • α: Significance Level
  • k: number of categories

The rule of thumb is: if our test statistic is larger than the chi-squared table value we look up, we reject the null hypothesis. We'll use a significance level of 5% and look at the chi-squared table. For a 5% significance level and 4 degrees of freedom (five categories minus 1), we get 9.49. Because our test statistic is much larger than the chi-squared table value (70.52 > 9.49), we reject the null hypothesis at the 5% significance level. Now, you already know how to perform the chi-squared goodness-of-fit test!

Python Approach

This is the Python approach to the chi-squared goodness-of-fit test using SciPy:

import pandas as pd
from scipy.stats import chisquare

# Define the student data
data = {
    'Class': ['A', 'B', 'C', 'D', 'E'],
    'Observed': [157, 191, 186, 163, 303]
}

# Transform dictionary into dataframe
df = pd.DataFrame(data)

# Define the null and alternative hypotheses
null_hypothesis = "p1 = 20%, p2 = 20%, p3 = 20%, p4 = 20%, p5 = 20%"
alternative_hypothesis = "The population proportions do not match the given proportions"

# Calculate the total number of observations and the expected count for each category
total_count = df['Observed'].sum()
expected_count = total_count / len(df)  # As there are 5 categories

# Create a list of observed and expected counts
observed_list = df['Observed'].tolist()
expected_list = [expected_count] * len(df)

# Perform the Chi-Squared goodness-of-fit test
chi2_stat, p_val = chisquare(f_obs=observed_list, f_exp=expected_list)

# Print the results
print(f"nChi2 Statistic: {chi2_stat:.2f}")
print(f"P-value: {p_val:.4f}")

# Print the conclusion
if p_val < 0.05:
    print("Reject the null hypothesis: The population proportions do not match the given proportions.")
else:
    print("Fail to reject the null hypothesis: The population proportions match the given proportions.")

Using the p-value, we also got the same result. We reject the null hypothesis at a 5% significance level.

Figure 9: Result of goodness-of-fit test using Python by Author

Part 3: Chi-squared test of a contingency table

We already know how to make inferences about the proportion of one categorical variable. But what if I want to test whether two categorical variables are independent?

To test that, we use the chi-squared test of a contingency table. We will utilize the contingency table to calculate the test statistic value. A contingency table is a cross-tabulation that summarizes the joint distribution of counts for two categorical variables, each having a finite number of categories. From this table, you can determine whether the distribution of one categorical variable is consistent across all categories of the other categorical variable.
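
In practice, you usually build the contingency table from raw observation-level data. A minimal pandas sketch (with hypothetical column names) could look like this:

import pandas as pd

# hypothetical raw data: one row per student
raw_df = pd.DataFrame({
    'class': ['A', 'A', 'B', 'C', 'C', 'E'],
    'took_course': ['Taken', 'Not Taken', 'Taken', 'Taken', 'Not Taken', 'Taken']
})

# cross-tabulate the two categorical variables into a contingency table
contingency_table = pd.crosstab(raw_df['class'], raw_df['took_course'])
print(contingency_table)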

I will explain how to do it manually and using Python. In this example, we sampled 1000 students who got at least 75 on their math test. I want to test whether the class a student belongs to and whether they took a supplementary course outside school before the test (Taken or Not Taken) are independent. The distribution is like this:

Figure 10: Dummy data of contingency table generated randomly by Author (Taken / Not Taken per class – A: 91/66, B: 131/60, C: 117/69, D: 75/88, E: 197/106)

Form Hypotheses

Generating these hypotheses is very simple. We define them as:

Figure 11: Hypotheses of contingency table test by Author (H0: the two variables are independent; H1: the two variables are not independent)

Test Statistic

This is the hardest part. In handling real data, I suggest you use Python or other statistical software directly because the calculation is too complicated if we do it manually. But because we want to know the approach from the formula, let’s do the manual calculation. The test statistic of this test is:

Figure 12: The chi-squared contingency table formula by Author: χ² = Σᵢ Σⱼ (fᵢⱼ − eᵢⱼ)² / eᵢⱼ, summing over the r rows and c columns

Where:

  • r: number of rows
  • c: number of columns
  • fij: the observed counts
  • eij: (i-th row total × j-th column total) / sample size

Recall Figure 10; those values are just the observed counts. Before we use the test statistic formula, we should calculate the expected counts. We do that by:

Figure 13: Expected counts of the contingency table by Author

Now we get the observed and expected counts. After that, we will calculate the test statistic by:

Figure 14: Calculate test statistic of contingency table test by Author (χ² ≈ 22.98)

Decision Rule

We already have the test statistic; now we compare it with the rejection region. The rejection region for the contingency table test is defined by:

Figure 15: Rejection region of contingency table test by Author: reject H0 if χ² > χ²(α, (r−1)(c−1))

Where:

  • α: Significance Level
  • r = number of rows
  • c = number of columns

The rule of thumb is the same as for the goodness-of-fit test: if our test statistic is larger than the chi-squared table value we look up, we reject the null hypothesis. We will use a significance level of 5%. Because the table has 5 rows and 2 columns, we look up the chi-squared value for a 5% significance level and (5–1) * (2–1) = 4 degrees of freedom, which gives 9.49. Because the test statistic is greater than the chi-squared table value (22.9758 > 9.49), we reject the null hypothesis at the 5% significance level.
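
As a side note (not part of the original walkthrough), you don't actually need a printed table: the same critical values can be obtained from SciPy's chi2 distribution.

from scipy.stats import chi2

# critical value for the goodness-of-fit test: alpha = 0.05, df = k - 1 = 4
print(chi2.ppf(0.95, df=4))  # ~9.49

# critical value for the contingency table test: df = (r - 1) * (c - 1) = (5 - 1) * (2 - 1) = 4
print(chi2.ppf(0.95, df=(5 - 1) * (2 - 1)))  # ~9.49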

Python Approach

This is the Python approach to the chi-squared contingency table test using SciPy:

import pandas as pd
from scipy.stats import chi2_contingency

# Create the dataset
data = {
    'Class': ['group A', 'group B', 'group C', 'group D', 'group E'],
    'Taken Course': [91, 131, 117, 75, 197],
    'Not Taken Course': [66, 60, 69, 88, 106]
}

# Create a DataFrame
df = pd.DataFrame(data)
df.set_index('Class', inplace=True)

# Perform the Chi-Squared test for independence
chi2_stat, p_val, dof, expected = chi2_contingency(df)

# Print the results
print("Expected Counts:")
print(pd.DataFrame(expected, index=df.index, columns=df.columns))
print(f"nChi2 Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_val:.4f}")

# Print the conclusion
if p_val < 0.05:
    print("nReject the null hypothesis: The variables are not independent")
else:
    print("nFail to reject the null hypothesis: The variables are independent")

Using the p-value, we also got the same result. We reject the null hypothesis at a 5% significance level.

Figure 16: Result of contingency table test using Python by Author

Now that you understand how to conduct hypothesis tests using the chi-square test method, it’s time to apply this knowledge to your own data. Happy experimenting!

Part 4: Conclusion

The chi-squared test is a powerful statistical method that helps us understand the relationships and distributions within categorical data. Forming the problem and proper hypotheses before jumping into the test itself is crucial. A large sample is also vital in conducting a chi-squared test; for instance, it works well for sizes down to 5,000 (Bergh, 2015), as small sample sizes can lead to inaccurate results. To interpret results correctly, choose the right significance level and compare the chi-square statistic to the critical value from the chi-square distribution table or the p-value.

Reference

  • G. Keller, Statistics for Management and Economics, 11th ed., Chapter 15, Cengage Learning (2017).
  • D. Bergh, "Chi-Squared Test of Fit and Sample Size: A Comparison Between a Random Sample Approach and a Chi-Square Value Adjustment Method," Journal of Applied Measurement, 16(2), 204–217 (2015).

The post Chi-Squared Test: Revealing Hidden Patterns in Your Data appeared first on Towards Data Science.

]]>
Practical Computer Simulations for Product Analysts https://towardsdatascience.com/practical-computer-simulations-for-product-analysts-4d3a17957f64/ Tue, 30 Apr 2024 06:57:11 +0000 https://towardsdatascience.com/practical-computer-simulations-for-product-analysts-4d3a17957f64/ Part 2: Using bootstrap for observations and A/B tests

The post Practical Computer Simulations for Product Analysts appeared first on Towards Data Science.

]]>
In the first part of this series, we discussed the basic ideas of computer simulations and how you can leverage them to answer "what-if" questions. It's impossible to talk about simulations without mentioning bootstrap.

Bootstrap in statistics is a practical computational method for estimating statistics of probability distributions. It is based on repeatedly generating samples from an existing sample using the Monte Carlo method. This approach allows for simple and fast estimation of various statistics (such as confidence intervals, variance, correlation, etc.) even for complex models.

When I learned about bootstrap in the statistics course, it felt a bit hacky. Instead of learning multiple formulas and criteria for different cases, you can just write a couple of lines of code and get confidence interval estimations for any custom and complicated use case. It sounds like magic.

And it really is. Now, when even your laptop can run thousands of simulations in minutes or even seconds, bootstrap is a powerful tool in your analytical toolkit that can help you in many situations. So, I believe that it’s worth learning or refreshing your knowledge about it.

In this article, we will talk about the idea behind bootstrap, understand when you should use it, learn how to get confidence intervals for different metrics and analyse the results of A/B tests.

What is bootstrap?

Actually, bootstrap is exceptionally straightforward. We need to run simulations drawing elements from our sample distribution with replacement, and then we can make conclusions based on this distribution.

Let’s look at the simple example when we have four elements: 1, 2, 3 and 4. Then, we can simulate many other collections of 4 elements where each element might be 1, 2, 3 or 4 with equal probabilities and use these simulations to understand, for example, how the mean value might change.

The statistical meaning behind bootstrap is that we consider that the actual population has precisely the same distribution as our sample (or the population consists of an infinite number of our sample copies). Then, we just assume that we know the general population and use it to understand the variability in our data.

Usually, when using a classical statistical approach, we assume that our variable follows some known distribution (for example, normal). However, we don’t need to make any assumptions regarding the nature of the distribution in Bootstrap. It’s pretty handy and helps to analyse even very complex custom metrics.

It’s almost impossible to mess up the bootstrap estimations. So, in many cases, I would prefer it to the classical statistical methods. The only drawback is computational time. If you’re working with big data, simulations might take hours, while you can get classical statistics estimations within seconds.

However, there are cases when it’s pretty challenging to get estimations without bootstrap. Let’s discuss the best use cases for bootstrap:

  • if you have outliers or influential points in your data;
  • if your sample is relatively small (roughly less than 100 cases);
  • if your data distribution is quite far from normal or other theoretical distribution, for example, it has several modes;
  • if you’re working with custom metrics (for example, the share of cases closed within SLA or percentiles).

Bootstrap is a wonderful and powerful statistical concept. Let’s try to use it for descriptive statistics.

Working with observational data

First, let’s start with the observational data and work with a synthetic dataset. Imagine we are helping a fitness club to set up a new fitness program that will help clients prepare for the London Marathon. We got the first trial group of 12 customers and measured their results.

Here is the data we have.

We collected just three fields for each of the 12 customers:

  • races_before – numbers of races customers had before our program,
  • kms_during_program – kilometres clients run during our program,
  • finished_marathon – whether the program was successful and a customer has finished the London Marathon.

We aim to set up a goal-focused fair program that incentivises our clients to train with us more and achieve better results. So, we would like to return the money if the client has run at least 150 kilometres during the preparation but couldn’t complete the marathon. However, before launching this program, we would like to make some estimations: what distance clients cover during preparation and the estimated share of refunds. We need it to ensure that our business is profitable and sustainable.

Estimating average

Let’s start with estimating the average distance. We can try to leverage our knowledge of mathematical statistics and use formulas for confidence intervals.

To do so, we need to make an assumption about the distribution of this variable. The most commonly used is a normal distribution. Let’s try it.

import numpy as np
from scipy.stats import norm, t

def get_normal_confidence_interval(data, confidence=0.95):
    # Calculate sample mean and standard deviation
    sample_mean = np.mean(data)
    sample_std = np.std(data, ddof=1)  
    n = len(data)

    # Calculate the critical value (z) based on the confidence level
    z = norm.ppf((1 + confidence) / 2)

    # Calculate the margin of error using standard error
    margin_of_error = z * sample_std / np.sqrt(n)

    # Calculate the confidence interval
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    return lower_bound, upper_bound

get_normal_confidence_interval(df.kms_during_program.values)
# (111.86, 260.55)

The other option often used with real-life data is the t-distribution, which gives a broader confidence interval (since it assumes fatter tails than the normal distribution).

def get_ttest_confidence_interval(data, confidence=0.95):
    # Calculate sample mean and standard deviation
    sample_mean = np.mean(data)
    sample_std = np.std(data, ddof=1)  
    n = len(data)

    # Calculate the critical value (t) based on the confidence level
    t_crit = t.ppf((1 + confidence) / 2, df=len(data) - 1)

    # Calculate the margin of error using standard error
    margin_of_error = t_crit * sample_std / np.sqrt(n)

    # Calculate the confidence interval
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    return lower_bound, upper_bound

get_ttest_confidence_interval(df.kms_during_program.values)
# (102.72, 269.69)

We have only a few observations in our sample. Also, there's an outlier: a client with 12 previous races who managed to run almost 600 km preparing for the marathon, while most other clients ran less than 200 km.

So, it’s an excellent case to use the bootstrap technique to understand the distribution and confidence interval better.

Let’s create a function to calculate and visualise the confidence interval:

  • We run num_batches simulations, doing samples with replacement, and calculating the average distance.
  • Then, based on these variables, we can get a 95% confidence interval: 2.5% and 97.5% percentiles of this distribution.
  • Finally, we can visualise the distribution on a chart.

import pandas as pd
import tqdm
import matplotlib.pyplot as plt

def get_kms_confidence_interval(num_batches, confidence = 0.95):
    # Running simulations
    tmp = []
    for i in tqdm.tqdm(range(num_batches)):
        tmp_df = df.sample(df.shape[0], replace = True)
        tmp.append(
            {
                'iteration': i,
                'mean_kms': tmp_df.kms_during_program.mean()
            }
        )
    # Saving data
    bootstrap_df = pd.DataFrame(tmp)

    # Calculating confidence interval
    lower_bound = bootstrap_df.mean_kms.quantile((1 - confidence)/2)
    upper_bound = bootstrap_df.mean_kms.quantile(1 - (1 - confidence)/2)

    # Creating a chart
    ax = bootstrap_df.mean_kms.hist(bins = 50, alpha = 0.6, 
        color = 'purple')
    ax.set_title('Average kms during the program, iterations = %d' % num_batches)

    plt.axvline(x=lower_bound, color='navy', linestyle='--', 
        label='lower bound = %.2f' % lower_bound)

    plt.axvline(x=upper_bound, color='navy', linestyle='--', 
        label='upper bound = %.2f' % upper_bound)

    ax.annotate('CI lower bound: %.2f' % lower_bound, 
                xy=(lower_bound, ax.get_ylim()[1]), 
                xytext=(-10, -20), 
                textcoords='offset points',  
                ha='center', va='top',  
                color='navy', rotation=90) 
    ax.annotate('CI upper bound: %.2f' % upper_bound, 
                xy=(upper_bound, ax.get_ylim()[1]), 
                xytext=(-10, -20), 
                textcoords='offset points',  
                ha='center', va='top',  
                color='navy', rotation=90) 
    plt.xlim(ax.get_xlim()[0] - 20, ax.get_xlim()[1] + 20)
    plt.show()

Let’s start with a small number of batches to see the first results quickly.

get_kms_confidence_interval(100)

With bootstrap, we got a slightly narrower confidence interval that is skewed to the right, which is in line with our actual distribution: (139.31, 297.99) vs (102.72, 269.69).

However, with 100 bootstrap simulations, the distribution is not very clear. Let's try to add more iterations. We can see that our distribution consists of multiple modes, corresponding to samples with one occurrence of the outlier, two occurrences, three, etc.

With more iterations, we can see more modes (since more occurrences of the outlier are rarer), but all the confidence intervals are pretty close.

In the case of bootstrap, adding more iterations doesn’t lead to overfitting (because each iteration is independent). I would think about it as increasing the resolution of your image.

Since our sample is small, running many simulations doesn’t take much time. Even 1 million bootstrap iterations take around 1 minute.

Estimating custom metrics

As we discussed, bootstrap is handy when working with metrics that are not as straightforward as averages. For example, you might want to estimate the median or share of tasks closed within SLA.

You might even use bootstrap for something more unusual. Imagine you want to give customers discounts if your delivery is late: 5% discount for 15 minutes delay, 10% – for 1 hour delay and 20% – for 3 hours delay.

Getting a confidence interval for such cases theoretically using plain statistics might be challenging, so bootstrap will be extremely valuable.
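
For instance, a bootstrap confidence interval for the average discount per order in the delivery example above could be sketched like this (the data here is synthetic and the delay_minutes column is hypothetical):

import numpy as np
import pandas as pd

# synthetic toy data for illustration: 500 orders with random delays in minutes
rng = np.random.default_rng(1)
orders_df = pd.DataFrame({'delay_minutes': rng.exponential(scale=30, size=500)})

def expected_discount(delays):
    # 5% for 15+ min, 10% for 1+ hour, 20% for 3+ hours of delay
    return np.select(
        [delays >= 180, delays >= 60, delays >= 15],
        [0.20, 0.10, 0.05],
        default=0.0
    ).mean()

# bootstrap the average discount share and get a 95% confidence interval
boot_estimates = [
    expected_discount(orders_df.delay_minutes.sample(frac=1, replace=True).values)
    for _ in range(1000)
]
print(np.percentile(boot_estimates, [2.5, 97.5]))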

Let’s return to our running program and estimate the share of refunds (when a customer ran 150 km but didn’t manage to finish the marathon). We will use a similar function but will calculate the refund share for each iteration instead of the mean value.

import pandas as pd
import tqdm
import matplotlib.pyplot as plt

def get_refund_share_confidence_interval(num_batches, confidence = 0.95):
    # Running simulations
    tmp = []
    for i in tqdm.tqdm(range(num_batches)):
        tmp_df = df.sample(df.shape[0], replace = True)
        tmp_df['refund'] = list(map(
            lambda kms, passed: 1 if (kms >= 150) and (passed == 0) else 0,
            tmp_df.kms_during_program,
            tmp_df.finished_marathon
        ))

        tmp.append(
            {
                'iteration': i,
                'refund_share': tmp_df.refund.mean()
            }
        )

    # Saving data
    bootstrap_df = pd.DataFrame(tmp)

    # Calculating confident interval
    lower_bound = bootstrap_df.refund_share.quantile((1 - confidence)/2)
    upper_bound = bootstrap_df.refund_share.quantile(1 - (1 - confidence)/2)

    # Creating a chart
    ax = bootstrap_df.refund_share.hist(bins = 50, alpha = 0.6, 
        color = 'purple')
    ax.set_title('Share of refunds, iterations = %d' % num_batches)
    plt.axvline(x=lower_bound, color='navy', linestyle='--',
        label='lower bound = %.2f' % lower_bound)
    plt.axvline(x=upper_bound, color='navy', linestyle='--', 
        label='upper bound = %.2f' % upper_bound)
    ax.annotate('CI lower bound: %.2f' % lower_bound, 
                xy=(lower_bound, ax.get_ylim()[1]), 
                xytext=(-10, -20), 
                textcoords='offset points',  
                ha='center', va='top',  
                color='navy', rotation=90) 
    ax.annotate('CI upper bound: %.2f' % upper_bound, 
                xy=(upper_bound, ax.get_ylim()[1]), 
                xytext=(-10, -20), 
                textcoords='offset points',  
                ha='center', va='top',  
                color='navy', rotation=90) 
    plt.xlim(-0.1, 1)
    plt.show()

Even with 12 examples, we got a 2+ times smaller confidence interval. We can conclude with 95% confidence that less than 42% of customers will be eligible for a refund.

That’s a good result with such a small amount of data. However, we can go even further and try to get an estimation of causal effects.

Estimation of effects

We have data about the previous races before this marathon, and we can see how this value is correlated with the expected distance. We can use bootstrap for this as well. We only need to add the linear regression step to our current process.

import statsmodels.formula.api as smf

def get_races_coef_confidence_interval(num_batches, confidence = 0.95):
    # Running simulations
    tmp = []
    for i in tqdm.tqdm(range(num_batches)):
        tmp_df = df.sample(df.shape[0], replace = True)
        # Linear regression model
        model = smf.ols('kms_during_program ~ races_before', data = tmp_df).fit()

        tmp.append(
            {
                'iteration': i,
                'races_coef': model.params['races_before']
            }
        )

    # Saving data
    bootstrap_df = pd.DataFrame(tmp)

    # Calculating confident interval
    lower_bound = bootstrap_df.races_coef.quantile((1 - confidence)/2)
    upper_bound = bootstrap_df.races_coef.quantile(1 - (1 - confidence)/2)

    # Creating a chart
    ax = bootstrap_df.races_coef.hist(bins = 50, alpha = 0.6, color = 'purple')
    ax.set_title('Coefficient between kms during the program and previous races, iterations = %d' % num_batches)
    plt.axvline(x=lower_bound, color='navy', linestyle='--', label='lower bound = %.2f' % lower_bound)
    plt.axvline(x=upper_bound, color='navy', linestyle='--', label='upper bound = %.2f' % upper_bound)
    ax.annotate('CI lower bound: %.2f' % lower_bound, 
                xy=(lower_bound, ax.get_ylim()[1]), 
                xytext=(-10, -20), 
                textcoords='offset points',  
                ha='center', va='top',  
                color='navy', rotation=90) 
    ax.annotate('CI upper bound: %.2f' % upper_bound, 
                xy=(upper_bound, ax.get_ylim()[1]), 
                xytext=(10, -20), 
                textcoords='offset points',  
                ha='center', va='top',  
                color='navy', rotation=90) 
    # plt.legend() 
    plt.xlim(ax.get_xlim()[0] - 5, ax.get_xlim()[1] + 5)
    plt.show()

    return bootstrap_df

We can look at the distribution. The confidence interval is above 0, so we can say there’s an effect with 95% confidence.

You can spot that the distribution is bimodal, and each mode corresponds to one of the scenarios:

  • The component around 12 is related to samples without an outlier – it’s an estimation of the effect of previous races on the expected distance during the program if we disregard the outlier.
  • The second component corresponds to the samples when one or several outliers were in the dataset.

So, it’s super cool that we can make even estimations for different scenarios if we look at the bootstrap distribution.

We’ve learned how to use bootstrap with observational data, but its bread and butter is A/B testing. So, let’s move on to our second example.

Simulations for A/B testing

The other everyday use case for bootstrap is designing and analysing A/B tests. Let’s look at the example. It will also be based on a synthetic dataset that shows the effect of the discount on customer retention. Imagine we are working on an e-grocery product and want to test whether our marketing campaign with a 20 EUR discount will affect customers’ spending.

For each customer, we know their country of residence, the number of family members living with them, the average annual salary in their country, and how much money they spend on products in our store.

Power analysis

First, we need to design the experiment and understand how many clients we need in each experiment group to make conclusions confidently. This step is called power analysis.

Let’s quickly recap the basic statistical theory about A/B tests and main metrics. Every test is based on the null hypothesis (which is the current status quo). In our case, the null hypothesis is "discount does not affect customers’ spending on our product". Then, we need to collect data on customers’ spending for control and experiment groups and estimate the probability of seeing such or more extreme results if the null hypothesis is valid. This probability is called the p-value, and if it’s small enough, we can conclude that we have enough data to reject the null hypothesis and say that treatment affects customers’ spending or retention.

In this approach, there are three main metrics:

  • effect size – the minimal change in our metric we would like to be able to detect,
  • statistical significance equals the false positive rate (probability of rejecting the null hypothesis when there was no effect). The most commonly used significance is 5%. However, you might choose other values depending on your false-positive tolerance. For example, if implementing the change is expensive, you might want to use a lower significance threshold.
  • statistical power shows the probability of rejecting the null hypothesis given that we actually had an effect equal to or higher than the effect size. People often use an 80% threshold, but in some cases (i.e. you want to be more confident that there are no negative effects), you might use 90% or even 99%.

We need all these values to estimate the number of clients in the experiment. Let’s try to define them in our case to understand their meaning better.

We will start with effect size:

  • we expect the retention rate to change by at least 3% points as a result of our campaign,
  • we would like to spot changes in customers’ spending by 20 or more EUR.

For statistical significance, I will use the default 5% threshold (so if we see the effect as a result of A/B test analysis, we can be confident with 95% that the effect is present). Let’s target a 90% statistical power threshold so that if there’s an actual effect equal to or bigger than the effect size, we will spot this change in 90% of cases.

Let’s start with statistical formulas that will allow us to get estimations quickly. Statistical formulas imply that our variable has a particular distribution, but they can usually help you estimate the magnitude of the number of samples. Later, we will use bootstrap to get more accurate results.

For retention, we can use the standard test of proportions. We need to know the actual value to estimate the normed effect size. We can get it from the historical data before the experiment.

import statsmodels.stats.power as stat_power
import statsmodels.stats.proportion as stat_prop

base_retention = before_df.retention.mean()
ret_effect_size = stat_prop.proportion_effectsize(base_retention + 0.03, 
    base_retention)

sample_size = 2*stat_power.tt_ind_solve_power(
    effect_size = ret_effect_size,
    alpha = 0.05, power = 0.9,
    nobs1 = None, # we specified nobs1 as None to get an estimation for it
    alternative='larger'
)

# ret_effect_size = 0.0632, sample_size = 8573.86

We used a one-sided test because there’s no difference in whether there’s a negative or no effect from the business perspective since we won’t implement this change. Using a one-sided instead of a two-sided test increases the statistical power.

We can similarly estimate the sample size for the customer value, assuming the normal distribution. However, the distribution is not normal actually, so we should expect more precise results from bootstrap.

Let’s write code.

val_effect_size = 20/before_df.customer_value.std()

sample_size = 2*stat_power.tt_ind_solve_power(
    effect_size = val_effect_size,
    alpha = 0.05, power = 0.9, 
    nobs1 = None, 
    alternative='larger'
)
# val_effect_size = 0.0527, sample_size = 12324.13

We got estimations for the needed sample sizes for each test. However, there are cases when you have a limited number of clients and want to understand the statistical power you can get.

Suppose we have only 5K customers (2.5K in each group). Then, we will be able to achieve 72.2% statistical power for retention analysis and 58.7% – for customer value (given the desired statistical significance and effect sizes).

The only difference in the code is that this time, we’ve specified nobs1 = 2500 and left power as None.

stat_power.tt_ind_solve_power(
    effect_size = ret_effect_size,
    alpha = 0.05, power = None,
    nobs1 = 2500, 
    alternative='larger'
)
# 0.7223

stat_power.tt_ind_solve_power(
    effect_size = val_effect_size,
    alpha = 0.05, power = None,
    nobs1 = 2500, 
    alternative='larger'
)
# 0.5867

Now, it’s time to use bootstrap for the power analysis, and we will start with the customer value test since it’s easier to implement.

Let’s discuss the basic idea and steps of power analysis using bootstrap. First, we need to define our goal clearly. We want to estimate the statistical power depending on the sample size. If we put it in more practical terms, we want to know the percentage of cases when there was an increase in customer spending by 20 or more EUR, and we were able to reject the null hypothesis and implement this change in production. So, we need to simulate a bunch of such experiments and calculate the share of cases when we can see statistically significant changes in our metric.

Let’s look at one experiment and break it into steps. The first step is to generate the experimental data. For that, we need to get a random subset from the population equal to the sample size, randomly split these customers into control and experiment groups and add an effect equal to the effect size for the treatment group. All this logic is implemented in get_sample_for_value function below.

def get_sample_for_value(pop_df, sample_size, effect_size):
  # getting sample of needed size
  sample_df = pop_df.sample(sample_size)

  # randomly assign treatment
  sample_df['treatment'] = sample_df.index.map(
    lambda x: 1 if np.random.uniform() > 0.5 else 0)

  # add effect for the treatment group
  sample_df['predicted_value'] = (sample_df['customer_value']
    + effect_size * sample_df.treatment)

  return sample_df

Now, we can treat this synthetic experiment data as we usually do with A/B test analysis, run a bunch of bootstrap simulations, estimate effects, and then get a confidence interval for this effect.

We will be using linear regression to estimate the effect of treatment. As discussed in the previous article, it's worth adding features that explain the outcome variable (customers' spending) to the linear regression. We will add the number of family members and average salary to the regression since they are positively correlated with spending.

import statsmodels.formula.api as smf
val_model = smf.ols('customer_value ~ num_family_members + country_avg_annual_earning', 
    data = before_df).fit(disp = 0)
val_model.summary().tables[1]

We will put all the logic of doing multiple bootstrap simulations and estimating treatment effects into the get_ci_for_value function.

def get_ci_for_value(df, boot_iters, confidence_level):
    tmp_data = []

    for iter in range(boot_iters):
        sample_df = df.sample(df.shape[0], replace = True)
        val_model = smf.ols('predicted_value ~ treatment + num_family_members + country_avg_annual_earning', 
          data = sample_df).fit(disp = 0)
        tmp_data.append(
            {
                'iteration': iter,
                'coef': val_model.params['treatment']
            }
        )

    coef_df = pd.DataFrame(tmp_data)
    return (coef_df.coef.quantile((1 - confidence_level)/2),
        coef_df.coef.quantile(1 - (1 - confidence_level)/2))

The next step is to put this logic together, run a bunch of such synthetic experiments, and save results.

def run_simulations_for_value(pop_df, sample_size, effect_size, 
    boot_iters, confidence_level, num_simulations):

    tmp_data = []

    for sim in tqdm.tqdm(range(num_simulations)):
        sample_df = get_sample_for_value(pop_df, sample_size, effect_size)
        num_users_treatment = sample_df[sample_df.treatment == 1].shape[0]
        value_treatment = sample_df[sample_df.treatment == 1].predicted_value.mean()
        num_users_control = sample_df[sample_df.treatment == 0].shape[0]
        value_control = sample_df[sample_df.treatment == 0].predicted_value.mean()

        ci_lower, ci_upper = get_ci_for_value(sample_df, boot_iters, confidence_level)

        tmp_data.append(
            {
                'experiment_id': sim,
                'num_users_treatment': num_users_treatment,
                'value_treatment': value_treatment,
                'num_users_control': num_users_control,
                'value_control': value_control,
                'sample_size': sample_size,
                'effect_size': effect_size,
                'boot_iters': boot_iters,
                'confidence_level': confidence_level,
                'ci_lower': ci_lower,
                'ci_upper': ci_upper
            }
        )

    return pd.DataFrame(tmp_data)

Let’s run this simulation for sample_size = 100 and see the results.

val_sim_df = run_simulations_for_value(before_df, sample_size = 100, 
    effect_size = 20, boot_iters = 1000, confidence_level = 0.95, 
    num_simulations = 20)
val_sim_df.set_index('experiment_id')[['sample_size', 'ci_lower', 'ci_upper']].head()

We’ve got the following data for 20 simulated experiments. We know the confidence interval for each experiment, and now we can estimate the power.

We would have rejected the null hypothesis if the lower bound of the confidence interval was above zero, so let’s calculate the share of such experiments.

val_sim_df['successful_experiment'] = val_sim_df.ci_lower.map(
  lambda x: 1 if x > 0 else 0)

val_sim_df.groupby(['sample_size', 'effect_size']).aggregate(
    {
        'successful_experiment': 'mean',
        'experiment_id': 'count'
    }
)

We’ve started with just 20 simulated experiments and 1000 bootstrap simulations to estimate their confidence interval. Such a few simulations can help us get a low-resolution picture quite quickly. Keeping in mind the estimation we got from the classic statistics, we should expect that numbers around 10K will give us the desired statistical power.

tmp_dfs = []
for sample_size in [100, 250, 500, 1000, 2500, 5000, 10000, 25000]:
    print('Simulation for sample size = %d' % sample_size)
    tmp_dfs.append(
        run_simulations_for_value(before_df, sample_size = sample_size, effect_size = 20,
                              boot_iters = 1000, confidence_level = 0.95, num_simulations = 20)
    )

val_lowres_sim_df = pd.concat(tmp_dfs)

We got results similar to those of our theoretical estimations. Let’s try to run estimations with more simulated experiments (100 and 500 experiments). We can see that 12.5K clients will be enough to achieve 90% statistical power.

I’ve added all the power analysis results to the chart so that we can see the relation clearly.

In that case, you might already see that bootstrap can take a significant amount of time. For example, accurately estimating power with 500 experiment simulations for just 3 sample sizes took me almost 2 hours.

Now, we can estimate the relationship between effect size and power for a 12.5K sample size.

tmp_dfs = []
for effect_size in [1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100]:
    print('Simulation for effect size = %d' % effect_size)
    tmp_dfs.append(
        run_simulations_for_value(before_df, sample_size = 12500, effect_size = effect_size,
                              boot_iters = 1000, confidence_level = 0.95, num_simulations = 100)
    )

val_effect_size_sim_df = pd.concat(tmp_dfs)

We can see that if the actual effect on customers’ spending is higher than 20 EUR, we will get even higher statistical power, and we will be able to reject the null hypothesis in more than 90% of cases. But we will be able to spot the 10 EUR effect in less than 50% of cases.

Let’s move on and conduct power analysis for retention as well. The complete code is structured similarly to the customer spending analysis. We will discuss nuances in detail below.

import tqdm

def get_sample_for_retention(pop_df, sample_size, effect_size):
    base_ret_model = smf.logit('retention ~ num_family_members', data = pop_df).fit(disp = 0)
    tmp_pop_df = pop_df.copy()
    tmp_pop_df['predicted_retention_proba'] = base_ret_model.predict()
    sample_df = tmp_pop_df.sample(sample_size)
    sample_df['treatment'] = sample_df.index.map(lambda x: 1 if np.random.uniform() > 0.5 else 0)
    sample_df['predicted_retention_proba'] = sample_df['predicted_retention_proba'] + effect_size * sample_df.treatment
    sample_df['retention'] = sample_df.predicted_retention_proba.map(lambda x: 1 if x >= np.random.uniform() else 0)
    return sample_df

def get_ci_for_retention(df, boot_iters, confidence_level):
    tmp_data = []

    for iter in range(boot_iters):
        sample_df = df.sample(df.shape[0], replace = True)
        ret_model = smf.logit('retention ~ treatment + num_family_members', data = sample_df).fit(disp = 0)
        tmp_data.append(
            {
                'iteration': iter,
                'coef': ret_model.params['treatment']
            }
        )

    coef_df = pd.DataFrame(tmp_data)
    return coef_df.coef.quantile((1 - confidence_level)/2), coef_df.coef.quantile(1 - (1 - confidence_level)/2)

def run_simulations_for_retention(pop_df, sample_size, effect_size, 
                                  boot_iters, confidence_level, num_simulations):
    tmp_data = []

    for sim in tqdm.tqdm(range(num_simulations)):
        sample_df = get_sample_for_retention(pop_df, sample_size, effect_size)
        num_users_treatment = sample_df[sample_df.treatment == 1].shape[0]
        retention_treatment = sample_df[sample_df.treatment == 1].retention.mean()
        num_users_control = sample_df[sample_df.treatment == 0].shape[0]
        retention_control = sample_df[sample_df.treatment == 0].retention.mean()

        ci_lower, ci_upper = get_ci_for_retention(sample_df, boot_iters, confidence_level)

        tmp_data.append(
            {
                'experiment_id': sim,
                'num_users_treatment': num_users_treatment,
                'retention_treatment': retention_treatment,
                'num_users_control': num_users_control,
                'retention_control': retention_control,
                'sample_size': sample_size,
                'effect_size': effect_size,
                'boot_iters': boot_iters,
                'confidence_level': confidence_level,
                'ci_lower': ci_lower,
                'ci_upper': ci_upper
            }
        )

    return pd.DataFrame(tmp_data)

First, since we have a binary outcome for retention (whether the customer returns next month or not), we will use a logistic regression model instead of linear regression. We can see that retention is correlated with the size of the family. It might be the case that when you buy many different types of products for family members, it’s more difficult to find another service that will cover all your needs.

base_ret_model = smf.logit('retention ~ num_family_members', data = before_df).fit(disp = 0)
base_ret_model.summary().tables[1]

Also, the function get_sample_for_retention has slightly trickier logic to adjust results for the treatment group. Let's look at it step by step.

First, we fit a logistic regression on the whole population data and use this model to predict each customer's probability of retaining.

base_ret_model = smf.logit('retention ~ num_family_members', data = pop_df).fit(disp = 0)
tmp_pop_df = pop_df.copy()
tmp_pop_df['predicted_retention_proba'] = base_ret_model.predict()

Then, we take a random sample of the needed size and split it into control and test groups.

sample_df = tmp_pop_df.sample(sample_size)
sample_df['treatment'] = sample_df.index.map(
  lambda x: 1 if np.random.uniform() > 0.5 else 0)

For the treatment group, we increase the probability of retaining by the expected effect size.

sample_df['predicted_retention_proba'] = (sample_df['predicted_retention_proba']
    + effect_size * sample_df.treatment)

The last step is to define, based on probability, whether the customer is retained or not. We used uniform distribution (random number between 0 and 1) for that:

  • if a random value from a uniform distribution is below probability, then a customer is retained (it happens with specified probability),
  • otherwise, the customer has churned.

sample_df['retention'] = sample_df.predicted_retention_proba.map(
    lambda x: 1 if x > np.random.uniform() else 0)

You can run a few simulations to ensure our sampling function works as intended. For example, with this call, we can see that retention in the control group equals 64%, as in the population, while it's 93.7% in the experiment group (as expected with effect_size = 0.3):

(get_sample_for_retention(before_df, 10000, 0.3)
  .groupby('treatment', as_index = False).retention.mean())

# |    |   treatment |   retention |
# |---:|------------:|------------:|
# |  0 |           0 |    0.640057 |
# |  1 |           1 |    0.937648 |

Now, we can also run simulations to see the optimal number of samples to reach 90% of statistical power for retention. We can see that the 12.5K sample size also will be good enough for retention.
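
The driver loop mirrors the one we used for customer value; here is a sketch using the functions defined above (the list of sample sizes is illustrative):

tmp_dfs = []
for sample_size in [2500, 5000, 10000, 12500, 25000]:
    print('Simulation for sample size = %d' % sample_size)
    tmp_dfs.append(
        run_simulations_for_retention(before_df, sample_size = sample_size, effect_size = 0.03,
                                      boot_iters = 1000, confidence_level = 0.95, num_simulations = 100)
    )

ret_sim_df = pd.concat(tmp_dfs)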

Analysing results

We can use linear or logistic regression to analyse results or leverage the functions we already have for bootstrap CI.

value_model = smf.ols(
  'customer_value ~ treatment + num_family_members + country_avg_annual_earning', 
  data = experiment_df).fit(disp = 0)
value_model.summary().tables[1]

So, we got a statistically significant result: the effect on customer spending equals 25.84 EUR with a 95% confidence interval of (16.82, 34.87).

With the bootstrap function, the CI will be pretty close.

get_ci_for_value(experiment_df.rename(
    columns = {'customer_value': 'predicted_value'}), 1000, 0.95)
# (16.28, 34.63)

Similarly, we can use logistic regression for retention analysis.

retention_model = smf.logit('retention ~ treatment + num_family_members',
    data = experiment_df).fit(disp = 0)
retention_model.summary().tables[1]

Again, the bootstrap approach gives close estimations for CI.

get_ci_for_retention(experiment_df, 1000, 0.95)
# (0.072, 0.187)

With logistic regression, it might be tricky to interpret the coefficient. However, we can use a hacky approach: for each customer in our dataset, use the model to calculate the retention probability as if the customer were in the control group and as if they were in the treatment group, and then look at the average difference between these probabilities.

experiment_df['treatment_eq_1'] = 1
experiment_df['treatment_eq_0'] = 0

experiment_df['retention_proba_treatment'] = retention_model.predict(
    experiment_df[['retention', 'treatment_eq_1', 'num_family_members']]
        .rename(columns = {'treatment_eq_1': 'treatment'}))

experiment_df['retention_proba_control'] = retention_model.predict(
    experiment_df[['retention', 'treatment_eq_0', 'num_family_members']]
      .rename(columns = {'treatment_eq_0': 'treatment'}))

experiment_df['proba_diff'] = (experiment_df.retention_proba_treatment
    - experiment_df.retention_proba_control)

experiment_df.proba_diff.mean()
# 0.0281

So, we can estimate the effect on retention to be 2.8%.

Congratulations! We’ve finally finished the full A/B test analysis and were able to estimate the effect both on average customer spending and retention. Our experiment is successful, so in real life, we would start thinking about rolling it to production.

You can find the full code for this example on GitHub.

Summary

Let me quickly recap what we’ve discussed today:

  • The main idea of bootstrap is repeated sampling with replacement from your sample, assuming that the general population has the same distribution as the data we have.
  • Bootstrap shines in cases when you have few data points, your data has outliers or is far from any theoretical distribution. Bootstrap can also help you estimate custom metrics.
  • You can use bootstrap to work with observational data, for example, to get confidence intervals for your values.
  • Also, bootstrap is broadly used for A/B testing analysis – both to estimate the impact of treatment and do a power analysis to design an experiment.

Thank you a lot for reading this article. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

All the images are produced by the author unless otherwise stated.

This article was inspired by the book "Behavioral Data Analysis with R and Python" by Florent Buisson.

The post Practical Computer Simulations for Product Analysts appeared first on Towards Data Science.

]]>
Linear Regressions for Causal Conclusions https://towardsdatascience.com/linear-regressions-for-causal-conclusions-34c6317c5a11/ Wed, 10 Apr 2024 04:45:23 +0000 https://towardsdatascience.com/linear-regressions-for-causal-conclusions-34c6317c5a11/ An easy and yet powerful tool for decision making

The post Linear Regressions for Causal Conclusions appeared first on Towards Data Science.

]]>
Image by DALL-E

I suppose most of us have heard the statement "correlation doesn’t imply causation" multiple times. It often becomes a problem for analysts since we frequently can see only correlations but still want to make causal conclusions.

Let’s discuss a couple of examples to understand the difference better. I would like to start with a case from everyday life rather than the digital world.

In 1975, a vast population study was launched in Denmark. It’s called the Copenhagen City Heart Study (CCHS). Researchers gathered information about 20K men and women and have been monitoring these people for decades. The initial goal of this research was to find ways to prevent cardiovascular diseases and strokes. One of the conclusions from this study is that people who reported regularly playing tennis have 9.7 years higher life expectancy.

Let’s think about how we could interpret this information. Does it mean that if a person starts playing tennis weekly today, they will increase their life expectancy by ten years? Unfortunately, not exactly. Since it’s an observational study, we should be cautious about making causal inferences. There might be some other effects. For example, tennis players are likely to be wealthier, and we know that higher wealth correlates with greater longevity. Or there could be a correlation that people who regularly do sports also care more about their health and, because of it, do all checkups regularly. So, observational research might overestimate the effect of tennis on longevity since it doesn’t control other factors.

Let’s move on to the examples closer to product analytics and our day-to-day job. The number of Customer Support contacts for a client will likely be positively correlated with the probability of churn. If customers had to contact our support ten times, they would likely be annoyed and stop using our product, while customers who never had problems and are happy with the service might never reach out with any questions.

Does it mean that if we reduce the number of CS contacts, we will increase customer retention? I’m ready to bet that if we hide contact info and significantly reduce the number of CS contacts, we won’t be able to decrease churn because the actual root cause of churn is not CS contact but customers’ dissatisfaction with the product, which leads to both customers contacting us and stopping using our product.

I hope that with these examples, you can gain some intuition about the correlation vs. causation problem.

In this article, I would like to share approaches for driving causal conclusions from the data. Surprisingly, we will be able to use the most basic tool – just a Linear Regression.

If we use the same linear regression for causal inference, you might wonder, what is the difference between our usual approach and causal analytics? That’s a good question. Let’s start our causal journey by understanding the differences between approaches.

Predictive vs. causal analytics

Predictive analytics helps to make forecasts and answer questions like "How many customers will we have in a year if nothing changes?" or "What is the probability for this customer to make a purchase within the next seven days?".

Causal analytics tries to understand the root causes of the process. It might help you to answer "what if" questions like "How many customers will churn if we increase our subscription fee?" or "How many customers would have signed up for our subscription if we didn’t launch this Saint Valentine’s promo?".

Causal questions seem way more complicated than just predictive ones. However, these two approaches often leverage the same tools, such as linear or logistic regressions. Even though tools are the same, they have absolutely different goals:

  • For predictive analytics, we try our best to predict a value in the future based on information we know. So, the main KPI is an error in the prediction.
  • Building a regression model for the causal analysis, we focus on the relationships between our target value and other factors. The model’s main output is coefficients rather than forecasts.

Let’s look at a simple example. Suppose we would like to forecast the number of active customers.

  • In the predictive approach, we are talking about baseline forecast (given that the situation will stay pretty much the same). We can use ARIMA (Autoregressive Integrated Moving Average) and base our projections on previous values. ARIMA works well for predictions but can’t tell you anything about the factors affecting your KPI and how to improve your product.
  • In the case of causal analytics, our goal is to find causal relationships in our data, so we will build a regression and identify factors that can impact our KPI, such as subscription fees, marketing campaigns, seasonality, etc. In that case, we will get not only the BAU (business as usual) forecast but also be able to estimate different "what if" scenarios for the future.

Now, it’s time to dive into causal theory and learn basic terms.

Correlation doesn’t imply causation

Let’s consider the following example for our discussion. Imagine you sent a discount coupon to loyal customers of your product, and now you want to understand how it affected their value (money spent on the product) and retention.

One of the most basic causal terms is treatment. It sounds like something related to medicine, but actually, it's just an intervention. In our case, it's a discount. We usually define treatment at the unit level (in our case, customer) as an indicator: T equals 1 if the customer received the treatment and 0 otherwise.

The other crucial term is outcome Y, our variable of interest. In our example, it’s the customer’s value.

The fundamental problem of causal inference is that we can’t observe both outcomes for the same customers. So, if a customer received the discount, we will never know what value or retention he would have had without a coupon. It makes causal inference tricky.

That’s why we need to introduce another concept – potential outcomes. The outcome that happened is usually called factual, and the one that didn’t is counterfactual. We will use the following notation for it.

The main goal of causal analysis is to measure the relationship between treatment and outcome. We can use the following metrics to quantify it:

  • ATE – average treatment effect,
  • ATT – average treatment effect on treated (customers with the treatment)

They are both equal to expected values of the differences between potential outcomes either for all units (customers in our case) or only for treated ones.
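
In this notation, the two metrics can be sketched as

\text{ATE} = E[Y_1 - Y_0], \qquad \text{ATT} = E[Y_1 - Y_0 \mid T = 1]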

That’s an actual causal effect, and unfortunately, we won’t be able to calculate it. But cheer up; we can still get some estimations. We can observe the difference between values for treated and not treated customers (correlation effect). Let’s try to interpret this value.

Using a couple of simple mathematical transformations (i.e. adding and subtracting the same value), we can conclude that the average difference in values between treated and not treated customers equals the sum of the ATT (average treatment effect on treated) and a bias term. The bias equals the difference between the treatment and control groups in the absence of treatment.
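
A sketch of this decomposition (a standard identity obtained by adding and subtracting E[Y_0 | T = 1]):

E[Y \mid T = 1] - E[Y \mid T = 0] = \underbrace{E[Y_1 - Y_0 \mid T = 1]}_{\text{ATT}} + \underbrace{E[Y_0 \mid T = 1] - E[Y_0 \mid T = 0]}_{\text{BIAS}}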

If we return to our case, the bias will be equal to the difference between the expected customer value for the treatment group if they hadn't received the discount (the counterfactual outcome) and the control group (the factual outcome).

In our example, the average value from customers who received a discount will likely be much higher than for those who didn't. Could we attribute all this effect to our treatment (the discount coupon)? Unfortunately not. Since we sent the discount to loyal customers who were already spending a lot of money on our product, they would likely have a higher value than the control group even without the treatment. So, there's a bias, and we can't say that the difference in value between the two segments equals the ATT.

Let’s think about how to overcome this obstacle. We can do an A/B test: randomly split our loyal customers into two groups and send discount coupons only to half of them. Then, we can estimate the discount’s effect as the average difference between these two groups since we’ve eliminated bias (without treatment, there’s no difference between these groups except for discount).

We’ve covered the basic theory of causal inference and have learned the most crucial concept of bias. So, we are ready to move on to practice. We will start by analysing the A/B test results.

Use case: A/B test

A randomised controlled trial (RCT), often called an A/B test, is a powerful tool for drawing causal conclusions from data. This approach assumes that we assign treatment randomly, and it helps us eliminate bias (since the groups are equal without the treatment).

To practice solving such tasks, we will look at the example based on synthetic data. Suppose we’ve built an LLM-based tool that helps customer support agents answer questions more quickly. To measure the effect, we introduced this tool to half of the agents, and we would like to measure how our treatment (LLM-based tool) affects the outcome (time the agent spends answering a customer’s question).

Let’s have a quick look at the data we have.

Here is a description of the parameters we logged:

  • case_id – unique ID for the case.
  • agent_id – unique ID for the agent.
  • treatment – equals 1 if the agent was in the experiment group and had a chance to use the LLM tool, 0 otherwise.
  • time_spent_mins – minutes spent answering the customer’s question.
  • cs_center – customer support centre. We are working with several CS centres. We launched this experiment in some of them because it’s easier to implement. Such an approach also helps us to avoid contamination (when agents from experiment and control groups interact and can affect each other).
  • complexity equals low, medium or high. This feature is based on the category of the customer’s question and defines how much time an agent is supposed to spend solving this case.
  • tenure – number of months since the agent started working.
  • passed_training – whether the agent passed LLM training. This value can be equal to True only for the treatment group since this training wasn’t offered to the agents from the control group.
  • within_sla equals 1 if the agent was able to answer the question within SLA (15 minutes).

As usual, let’s start with a high-level overview of the data. We have quite a lot of data points, so we will likely be able to get statistically significant results. Also, we can see way lower average response times for the treatment group, so we can hope that the LLM tool really helps.

I also usually look at the actual distributions since average statistics might be misleading. In this case, we can see two unimodal distributions without distinctive outliers.

Image by author

Classic statistical approach

The classic approach to analysing A/B tests is to use statistical formulas. Using the scipy package, we can calculate the confidence interval for the difference between the two means.

# defining samples
control_values = df[df.treatment == 0].time_spent_mins.values
exp_values = df[df.treatment == 1].time_spent_mins.values

# calculating p-values
from scipy.stats import ttest_ind

ttest_ind(exp_values, control_values)
# Output: TtestResult(statistic=-70.2769283935386, pvalue=0.0, df=89742.0)

We got a p-value below 1%. So, we can reject the null hypothesis and conclude that there’s a difference in average time spent per case in the control and test groups. To understand the effect size, we can also calculate the confidence interval.

from scipy import stats
import numpy as np

# Calculate sample statistics
mean1, mean2 = np.mean(exp_values), np.mean(control_values)
std1, std2 = np.std(exp_values, ddof=1), np.std(control_values, ddof=1)
n1, n2 = len(exp_values), len(control_values)
pooled_std = np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2))
degrees_of_freedom = n1 + n2 - 2
confidence_level = 0.95

# Calculate margin of error
margin_of_error = stats.t.ppf((1 + confidence_level) / 2, degrees_of_freedom) * pooled_std * np.sqrt(1 / n1 + 1 / n2)

# Calculate confidence interval
mean_difference = mean1 - mean2
conf_interval = (mean_difference - margin_of_error, 
    mean_difference + margin_of_error)

print("Confidence Interval:", list(map(lambda x: round(x, 3), conf_interval)))
# Output: Confidence Interval: [-1.918, -1.814]

As expected, since the p-value is below 5%, our confidence interval doesn't include 0.

The traditional approach works. However, we can get the same results with linear regression, which will also allow us to do more advanced analysis later. So, let’s discuss this method.

Linear regression basics

As we already discussed, observing both potential outcomes (with and without treatment) for the same object is impossible. Since we won't be able to estimate the impact on each object individually, we need a model. Let's assume a constant treatment effect.

Then, we can write down the relation between the outcome (time spent on a request) and the treatment in the following way (see the sketch after the list below), where

  • baseline is a constant that shows the basic level of outcome,
  • residual represents other potential relationships we don’t care about right now (for example, the agent’s maturity or complexity of the case).
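
A minimal sketch of this model, using the terms from the list above:

\text{time\_spent}_i = \text{baseline} + \text{impact} \cdot \text{treatment}_i + \text{residual}_i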

It’s a linear equation, and we can get the estimation of the impact variable using linear regression. We will use OLS (Ordinary Least Squares) function from statsmodels package.

import statsmodels.formula.api as smf
model = smf.ols('time_spent_mins ~ treatment', data=df).fit()
model.summary().tables[1]

As a result, we get all the needed info: the estimate of the effect (the coefficient for the treatment variable), its p-value and its confidence interval.

Since the p-value is negligible (definitely below 1%), we can consider the effect significant and say that our LLM-based tool helps to reduce the time spent on a case by 1.866 minutes with a 95% confidence interval (1.814, 1.918). You can notice that we got exactly the same result as with statistical formulas before.

Adding more variables

As promised, we can make a more complex analysis with linear regression and take into account more factors, so let’s do it. In our initial approach, we used only one regressor – treatment flag. However, we can add more variables (for example, complexity).

In this case, the impact coefficient will show the estimated effect after accounting for all the other variables in the model (in our case – task complexity). Let's estimate it. Adding more variables to the regression model is straightforward – we just need to add another component to the equation.

import statsmodels.formula.api as smf
model = smf.ols('time_spent_mins ~ treatment + complexity', data=df).fit()
model.summary().tables[1]

Now, we see a bit higher estimation of the effect – 1.91 vs 1.87 minutes. Also, the error has decreased (0.015 vs 0.027), and the confidence interval has narrowed.

You can also notice that since complexity is a categorical variable, it was automatically converted into a bunch of dummy variables. So, we got estimations of -9.8 minutes for low-complexity tasks and -4.7 minutes for medium ones.

Let’s try to understand why we got a more confident result after adding complexity. Time spent on a customer case significantly depends on the complexity of the tasks. So, complexity is responsible for a significant amount of our variable’s variability.

Image by author

As I mentioned before, the coefficient for treatment estimates the impact after accounting for all the other factors in the equation. When we added complexity to our linear regression, it reduced the variance of the residuals, and that's why we got a narrower confidence interval for the treatment effect.

Let’s double-check that complexity explains a significant proportion of variance. We can see a considerable decrease: time spent has a variance equal to 16.6, but when we account for complexity, it reduces to just 5.9.

time_model = smf.ols('time_spent_mins ~ complexity', data=df).fit()

print('Initial variance: %.2f' % (df.time_spent_mins.var()))
print('Residual variance after accounting for complexity: %.2f' 
  % (time_model.resid.var()))

# Output: 
# Initial variance: 16.63
# Residual variance after accounting for complexity: 5.94

So, we can see that adding a factor that can predict the outcome variable to a linear regression can improve your effect size estimations. Also, it’s worth noting that the variable is not correlated with treatment assignment (the tasks of each complexity have equal chances to be in the control or test group).

Traditionally, causal graphs are used to show the relationships between the variables. Let’s draw such a graph to represent our current situation.

Image by author

Non-linear relationships

So far, we’ve looked only at linear relationships, but sometimes, it’s not enough to model our situation.

Let’s look at the data on LLM training that agents from the experiment group were supposed to pass. Only half of them have passed the LLM training and learned how to use the new tool effectively.

We can see a significant difference in average time spent for the treatment group who passed training vs. those who didn’t.

Image by author

So, we should expect different impacts from the treatment for these two groups. We can use non-linearity to express such relationships in formulas and add a treatment * passed_training interaction term to our equation.

model = smf.ols('time_spent_mins ~ treatment * passed_training + complexity', 
    data=df).fit()
model.summary().tables[1]

The treatment and passed_training factors will also be automatically added to the regression. So, we will be optimising the following formula.
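
A sketch of the specification being fitted (the coefficient names are illustrative):

\text{time\_spent}_i = \beta_0 + \beta_1 \cdot \text{treatment}_i + \beta_2 \cdot \text{passed\_training}_i + \beta_3 \cdot \text{treatment}_i \cdot \text{passed\_training}_i + (\text{complexity terms}) + \text{residual}_i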

We got the following results from the linear regression.

No statistically significant effect is associated with passed training since the p-value is above 5%, while other coefficients differ from zero.

Let’s put down all the different scenarios and estimate the effects using the coefficients we got from the linear regression.

So, we’ve got new treatment estimations: 2.5 minutes improvement per case for the agents who have passed the training and 1.3 minutes – for those who didn’t.

Confounders

Before jumping to conclusions, it's worth double-checking some assumptions we made – for example, random assignment. We've discussed that we launched the experiment in some CS centres. Let's check whether agents in the different centres are similar, so that our control and test groups are unbiased.

We know that agents differ by experience, which might significantly affect their performance. Our day-to-day intuition tells us that more experienced agents will spend less time on tasks. We can see in the data that this is indeed the case.

Image by author

Let’s see whether our experiment and control have the same level of agents’ experience. The easiest way to do it is to look at distributions.

Image by author

Apparently, agents in the treatment group have much more experience than the ones in the control group. Overall, it makes sense that the product team decided to launch the experiment, starting with the more trained agents. However, it breaks our assumption about random assignment. Since the control and test groups are different even without treatment, we are overestimating the effect of our LLM tool on the agents’ performance.

Let’s return to our causal graph. The agent’s experience affects both treatment assignment and output variable (time spent). Such variables are called confounders.

Image by author

Don’t worry. We can solve this issue effortlessly – we just need to include confounders in our equation to control for it. When we add it to the linear regression, we start to estimate the treatment effect with fixed experience, eliminating the bias. Let’s try to do it.

model = smf.ols('time_spent_mins ~ treatment * passed_training + complexity + tenure', data=df).fit()
model.summary().tables[1]

With added tenure, we got the following results:

  • There is no statistically significant effect of passed training or treatment alone since the p-value is above 5%. So, we can conclude that an LLM helper does not affect agents’ performance unless they have passed the training. In the previous iteration, we saw a statistically significant effect, but it was due to tenure confounding bias.
  • The only statistically significant effect is for the treatment group with passed training. It equals 1.07 minutes with a 95% confidence interval (1.02, 1.11).
  • Each month of tenure is associated with 0.05 minutes less time spent on the task.

We are working with synthetic data so we can easily compare our estimations with actual effects. The LLM tool reduces the time spent per task by 1 minute if the agent has passed the training, so our estimations are pretty accurate.

Bad controls

Machine learning tasks are often straightforward: you gather data with all possible features you can get, try to fit some models, compare their performance and pick the best one. In contrast, causal inference requires some art and a deep understanding of the process you're working with. One of the essential questions is which features are worth including in the regression and which ones will spoil your results.

So far, all the additional variables we've added to the linear regression have improved the accuracy. So, you might think that adding all your features to the regression is the best strategy. Unfortunately, it's not that easy in causal inference. In this section, we will look at a couple of cases where additional variables decrease the accuracy of our estimates.

For example, we have the CS centre in our data. We've assigned treatment based on the CS centre, so including it in the regression might sound reasonable. Let's try.

model = smf.ols('time_spent_mins ~ treatment + complexity + tenure + cs_center', 
    data=df[df.treatment == df.passed_training]).fit()
model.summary().tables[1]

For simplicity, I’ve removed non-linearity from our dataset and equation, filtering out cases where the agents from the treatment groups didn’t pass the LLM training.

If we include the CS centre in the linear regression, we will get a ridiculously high estimate of the effect (around billions) without statistical significance. So, this variable is more harmful than helpful.

Let’s update a causal chart and try to understand why it doesn’t work. CS centre is a predictor for our treatment but has no relationship with the output variable (so it’s not a confounder). Adding a treatment predictor leads to multicollinearity (like in our case) or reduces the treatment variance (it’s challenging to estimate the effect of treatment on the output variable since treatment doesn’t change much). So, it’s a bad practice to add such variables to the equation.

Image by author

Let’s move on to another example. We have a within_sla variable showing whether the agents finished the task within 15 minutes. Can this variable improve the quality of our effect estimations? Let’s see.

model = smf.ols('time_spent_mins ~ treatment + complexity + tenure + within_sla', 
    data=df[df.treatment == df.passed_training]).fit()
model.summary().tables[1]

The new effect estimation is way lower: 0.8 vs 1.1 minutes. So, it poses a question: which one is more accurate? We’ve added more parameters to this model, so it’s more complex. Should it give more precise results, then? Unfortunately, it’s not always like that. Let’s dig deeper into it.

In this case, the within_sla flag shows whether the agent solved the problem within 15 minutes or whether the question took more time. So, if we return to our causal chart, the within_sla flag is an outcome of our output variable (time spent on the task).

Image by author

When we add the within_sla flag to the regression and control for it, we start to estimate the effect of the treatment at a fixed value of within_sla. So, we will have two cases: within_sla = 1 and within_sla = 0. Let's look at the bias for each of them.
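
In the potential-outcomes notation used earlier, a sketch of the bias term for each case:

\text{BIAS}_s = E[Y_0 \mid T = 1, \text{within\_sla} = s] - E[Y_0 \mid T = 0, \text{within\_sla} = s], \qquad s \in \{0, 1\}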

In both cases, the bias is not equal to 0, which means our estimate is biased. At first glance, it might look a bit counterintuitive, so let me explain the logic behind it.

  • In the first equation, we compare cases where agents finished the tasks within 15 minutes with the help of the LLM tool and without. The previous analysis shows that the LLM tool (our treatment) tends to speed up agents’ work. So, if we compare the expected time spent on tasks without treatments (when agents work independently without the LLM tool), we should expect quicker responses from the second group.
  • Similarly, for the second equation, we are comparing agents who haven’t completed tasks within 15 minutes, even with the help of LLM and those who did it on their own. Again, we should expect longer response times from the first group without treatment.

It’s an example of selection bias – a case when we control for a variable on the path from treatment to output variable or outcome of the output variable. Controlling for such variables in a linear regression also leads to biased estimations, so don’t do it.

Grouped data

In some cases, you might not have granular data. In our example, we might not know the time spent on each task individually, but know the averages. It’s easier to track aggregated numbers for agents. For example, "within two hours, an agent closed 15 medium tasks". We can aggregate our raw data to get such statistics.

agents_df = df.groupby(['agent_id', 'treatment', 'complexity', 'tenure', 
  'passed_training'], as_index = False).aggregate(
    {'case_id': 'nunique', 'time_spent_mins': 'mean'}
)

It’s not a problem for linear regression to deal with agent-level data. We just need to specify weights for each agent (equal to the number of cases).


# Note: statsmodels' OLS does not apply the weights argument, so we use WLS
# (weighted least squares) to actually weight each agent by their number of cases
model = smf.wls('time_spent_mins ~ treatment + complexity + tenure', 
    data = agents_df[agents_df.treatment == agents_df.passed_training],
    weights = agents_df[agents_df.treatment == agents_df.passed_training]['case_id']).fit()
model.summary().tables[1]

With aggregated data, we have roughly the same results for the effect of treatment. So, there’s no problem if you have only average numbers.

Use case: observational data

We’ve looked at the A/B test examples for causal inference in detail. However, in many cases, we can’t conduct a proper randomised trial. Here are some examples:

  • Some experiments are unethical. For example, you can’t push students to drink alcohol or smoke to see how it affects their performance at university.
  • In some cases, you might be unable to conduct an A/B test because of legal limitations. For example, you can’t charge different prices for the same product.
  • Sometimes, it’s just impossible. For example, if you are working on an extensive rebranding, you will have to launch it globally one day with a big PR announcement.

In such cases, you have to use just observations to make conclusions. Let’s see how our approach works in such a case. We will use the Student Performance data set from the UC Irvine Machine Learning Repository.

Let’s use this real-life data to investigate how willingness to take higher education affects the math class’s final score. We will start with a trivial model and a causal chart.

Image by author
import pandas as pd

df = pd.read_csv('student-mat.csv', sep = ';')
model = smf.ols('G3 ~ higher', data=df).fit()
model.summary().tables[1]

We can see that willingness to continue education statistically significantly increases the final grade for the course by 3.8 points.

However, there might be some confounders that we have to control for. For example, the parents' education can affect both the treatment (children are more likely to plan to take higher education if their parents have it) and the outcome (educated parents are more likely to help their children, so they have higher grades). Let's add the mother's and father's education levels to the model.

Image by author
model = smf.ols('G3 ~ higher + Medu + Fedu', data=df).fit()
model.summary().tables[1]

We can see a statistically significant effect from the mother’s education. We likely improved the accuracy of our estimation.

However, we should treat any causal conclusions based on observational data with a pinch of salt. We can’t be sure that we’ve taken into account all confounders and that the estimation we’ve got is entirely unbiased.

Also, it might be tricky to interpret the direction of the relation. We are sure there’s a correlation between willingness to continue education and final grade. However, we can interpret it in multiple ways:

  • Students who want to continue their education are more motivated, so they have higher final grades.
  • Students with higher final grades are inspired by their success in studying, and that’s why they want to continue their education.

With observational data, we can only use our common sense to choose one option or the other. There’s no way to infer this conclusion from data.

Despite the limitations, we can still use this tool to try our best to come to some conclusions about the world. As I mentioned, causal inference is based significantly on domain knowledge and common sense, so it’s worth spending time near the whiteboard to think deeply about the process you’re modelling. It will help you to achieve excellent results.

You can find complete code for these examples on GitHub.

Summary

We’ve discussed quite a broad topic of causal inference, so let me recap what we’ve learned:

  • The main goal of predictive analytics is to get accurate forecasts. Causal inference is focused on understanding relationships, so we care more about the coefficients in the model than about the actual predictions.
  • We can leverage linear regression to get causal conclusions.
  • Understanding what features we should add to the linear regression is an art, but here is some guidance.

    • You must include confounders (features that affect both treatment and outcome).
    • Adding a feature that predicts the output variable and explains its variability can help you to get more confident estimations.
    • Avoid adding features that either affect only treatment or are the outcome of the output variable.
  • You can use this approach for both A/B tests and observational data. However, with observations, we should treat our causal conclusions with a pinch of salt because we can never be sure that we accounted for all confounders.

Thank you very much for reading this article. If you have any follow-up questions or comments, please leave them in the comments section.

Dataset

Cortez, Paulo. (2014). Student Performance. UCI Machine Learning Repository (CC BY 4.0). https://doi.org/10.24432/C5TG7T

Reference

All the images are produced by the author unless otherwise stated.

This article is inspired by the book Causal Inference for the Brave and True that gives a wonderful overview on the causal inference basics.

The post Linear Regressions for Causal Conclusions appeared first on Towards Data Science.

Bayesian AB Testing with Pyro https://towardsdatascience.com/bayesian-ab-testing-with-pyro-cd4daff686e1/ Wed, 15 Nov 2023 10:29:01 +0000 https://towardsdatascience.com/bayesian-ab-testing-with-pyro-cd4daff686e1/ A primer in Bayesian thinking and AB testing using Pyro

The post Bayesian AB Testing with Pyro appeared first on Towards Data Science.

]]>
This article is an introduction to AB testing using the Python probabilistic programming language (PPL) Pyro, an alternative to PyMC. The motivation for writing this article was to further my understanding of Bayesian statistical inference using the Pyro framework and to help others in the process. As such, feedback is welcomed and encouraged.


Introduction

My previous experience with Bayesian modelling in Python was with PyMC. However, I have found it troublesome to work with some of the latest versions. My research into other probabilistic programming languages led me to discover Pyro, a universal PPL built by Uber and supported by PyTorch on the backend. The documentation for both Pyro and PyTorch is linked below.

Pyro

PyTorch documentation – PyTorch 2.1 documentation

Whilst exploring Pyro, I found it difficult to find end-to-end tutorials for AB testing using the package. This article aims to fill that gap.

This article has five sections. In the first section, I will give a primer on Bayesian thinking, a background on the philosophy underpinning the methodology. In the second, I will give a brief technical background on Pyro and the Bayesian methods used to make our statistical inferences. Next, I will perform an AB test using Pyro and discuss the results. I will then explain the case for Bayesian AB testing in a business setting. The final section will summarise.


A Primer in Bayesian Thinking

The Bayesian process is relatively simple at a high level. First, you have a variable that you are interested in. We state our current understanding of that variable, which we express as a probability distribution (everything in Bayes’ world is a probability distribution). This is called a prior. Probability in Bayesian inference is epistemic, meaning it arises through our degree of knowledge. Thus, the probability distribution we state as our prior is as much of a statement of uncertainty, as it is understanding. We then observe events from the real world. These observations are then used to update our understanding of the variable we are interested in. The new state of our understanding is called the posterior. We also characterise the posterior with a probability distribution, the probability of the variable we are interested in, given the data we have observed. This process is illustrated in the flow diagram below, where theta is the variable we are interested in and data is the outcome of the events we have observed.

Bayesian Process

The Bayesian process is actually cyclical, because we then repeat this process, starting again but using our new-found knowledge after observing the world. Our old posterior becomes our new prior, we observe new events and once again update our understanding, becoming ‘less and less and less wrong’ as Nate Silver put it in his book The Signal and the Noise. We continuously reduce our uncertainty concerning the variable, converging towards a state of certainty (but we can never be certain). Mathematically, this way of thinking is encapsulated by Bayes’ theorem.

Bayes’ Theorem
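
For reference, a sketch of the theorem in the theta and data notation used above:

P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \, P(\theta)}{P(\text{data})}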

In practice, Bayes’ theorem quickly becomes computationally intractable for large models across higher dimensions. There are numerous approaches to solve this issue, but the method we will use today is called Markov Chain Monte Carlo, or MCMC. MCMC is a class of algorithms (we will be using one called Hamiltonian Monte Carlo), which draws samples from a probability distribution intelligently. It is beyond the scope of this article to dive deep into this, but I will provide some helpful references at the end for further reading. For now, it is sufficient to know that Bayes’ theorem is too complex to be able to calculate, so we leverage the power of advanced algorithms, such as MCMC to help. This article will focus on MCMC using Pyro. However, Pyro does emphasise the use of Stochastic Variational Inference (SVI), an approximation-based approach, which will not be covered here.


AB Testing Using Pyro

Consider a company that has designed a new website landing page and wants to understand the impact this will have on conversion, i.e. do visitors continue their web session on the website after landing on the page? In test group A, website visitors will be shown the current landing page. In test group B, website visitors will be shown the new landing page. In the rest of the article, I will refer to test group A as the control group, and group B as the treatment group. The business is sceptical about the change and has opted for an 80/20 split in session traffic. The total number of visitors and the total number of page conversions for each test group are summarised below.

Test Observations
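
For reference, the observed counts, taken from the tensors defined in the code later in this article (the first element of each tensor is the control group, the second is the treatment group):

  • Control (group A): 5,523 visitors, 2,926 conversions
  • Treatment (group B): 1,379 visitors, 759 conversions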

The null hypothesis of the AB test is that there will be no change in page conversion for the two test groups. Under the frequentist framework, this would be expressed as the following for a two-sided test, where r_c and r_t are the page conversion rates in the control and treatment groups, respectively.

Null and Alternative Hypotheses
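
A sketch of these hypotheses in the notation just introduced:

H_0: r_c = r_t, \qquad H_1: r_c \neq r_t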

A significance test would then seek to either reject or fail to reject the null hypothesis. Under the Bayesian framework, we express the null hypothesis slightly differently by asserting the same prior for each of the test groups.

Let’s pause and outline exactly what is happening during our test. The variable we are interested in is the page conversion rate. This is simply calculated by taking the number of distinct converted visitors over the total number of visitors. The event that generates this rate is whether the visitor clicks through the page. There are only two possible outcomes here for each visitor, either the visitor clicks through the page and converts, or does not. Some of you might recognise that for each distinct visitor, this is an example of a Bernoulli trial; there is one trial and two possible outcomes. Now, when we collect a set of these Bernoulli trials, we have a binomial distribution. When the random variable X has a binomial distribution, we give it the following notation:

Binomial Distribution Notation
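
A sketch of the standard notation:

X \sim \text{Binomial}(n, p)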

Where n is the number of visitors (or the number of Bernoulli trials), and p is the probability of the event on each trial. p is what we are interested in here: we want to understand what the probability of a visitor converting on the page is in each test group. We have observed some data, but as mentioned in the previous section, we first need to define our prior. As always in Bayesian Statistics, we need to define this prior as a probability distribution. As mentioned before, this probability distribution is a characterisation of our uncertainty. Beta distributions are commonly used for modelling probabilities, as they are defined on the interval [0, 1]. Furthermore, using a beta distribution as our prior for a binomial likelihood function gives us the helpful property of conjugacy, which means our posterior will be generated from the same distribution as our prior. We say that the beta distribution is a conjugate prior. A beta distribution is defined by two parameters, alpha, and confusingly, beta.

Beta Distribution Notation
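
A sketch of the notation; thanks to conjugacy, observing k conversions out of n visitors updates a Beta(alpha, beta) prior into another beta distribution:

X \sim \text{Beta}(\alpha, \beta), \qquad \text{posterior} = \text{Beta}(\alpha + k, \; \beta + n - k)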

With access to historical data, we can assert an informed prior. We do not necessarily need historical data; we could use our intuition to inform our understanding, but for now let’s assume we have neither (later in this tutorial we will use informed priors, but to demonstrate the impact, I will start with the uninformed). Let’s assume we have no understanding of the conversion rate on the company’s site, and therefore define our prior as Beta(1,1). This is called a flat prior. The probability distribution of this function looks like the graph below, the same as a uniform distribution defined on the interval [0, 1]. By asserting a Beta(1,1) prior, we say that all possible values of the page conversion rate are equally probable.

Credit: Author

We now have all the information we need: the priors and the data. Let’s jump into the code. The code provided here gives a framework to get started with AB testing using Pyro; it therefore neglects some features of the package. To help optimise your code further and take full advantage of Pyro’s capabilities, I recommend referring to the official documentation.

First, we need to import our packages. The final line is good practice, particularly when working in notebooks, clearing the store of parameters we have built up.

import pyro
import pyro.distributions as dist
from pyro.infer import NUTS, MCMC
import torch
from torch import tensor
import matplotlib.pyplot as plt
import seaborn as sns
from functools import partial
import pandas as pd

pyro.clear_param_store()

Models in Pyro are defined as regular Python functions. This is helpful as it makes it intuitive to follow.

def model(beta_alpha, beta_beta):
    def _model_(traffic: tensor, number_of_conversions: tensor):
        # Define Stochastic Primitives
        prior_c = pyro.sample('prior_c', dist.Beta(beta_alpha, beta_beta))
        prior_t = pyro.sample('prior_t', dist.Beta(beta_alpha, beta_beta))
        priors = torch.stack([prior_c, prior_t])
        # Define the Observed Stochastic Primitives
        with pyro.plate('data'):
            observations = pyro.sample('obs', dist.Binomial(traffic, priors),
                             obs = number_of_conversions)
    return partial(_model_)

A few things to break down and explain here. First, we have a function wrapped inside an outer function, and the outer function returns the partial function of the inner function. This allows us to change our priors without needing to change the code. I have referred to the variables defined in the inner function as primitives; think of primitives as variables in the model. We have two types of primitives in the model: stochastic and observed stochastic. In Pyro, we do not have to explicitly define the difference; we simply add the obs argument to the sample method when it is an observed primitive, and Pyro interprets it accordingly. Observed primitives are contained within the context manager pyro.plate(), which is best practice and makes our code look cleaner. Our stochastic primitives are our two priors, characterised by Beta distributions, governed by the alpha and beta parameters that we pass in from the outer function. As previously mentioned, we assert the null hypothesis by defining these as equal. We then stack these two primitives together using torch.stack(), which performs an operation akin to concatenating a NumPy array. This will return a tensor, the data structure required for inference in Pyro. We have defined our model; now let’s move on to the inference stage.

As previously mentioned, this tutorial will use MCMC. The function below takes the model that we have defined above and the number of samples we wish to use to generate our posterior distribution as parameters. We also pass our data into the function, as we did for the model.

def run_inference(model, number_of_samples, traffic, number_of_conversions):
    kernel = NUTS(model)

    mcmc = MCMC(kernel, num_samples = number_of_samples, warmup_steps = 200)

    mcmc.run(traffic, number_of_conversions)

    return mcmc

The first line inside this function defines our kernel. We use the NUTS class to define our kernel, which stands for No-U-Turn Sampler, an autotuning version of Hamiltonian Monte Carlo. This tells Pyro how to sample from the posterior probability space. Again, it is beyond the scope of this article to dive deeper into this topic, but for now, it is sufficient to know that NUTS allows us to sample from the probability space intelligently. The kernel is then used to initialise the MCMC class on the second line, specifying it to use NUTS. We pass the number_of_samples argument in the MCMC class which is the number of samples used to generate the posterior distribution. We assign the initialised MCMC class to the mcmc variable and call the run() method, passing our data as parameters. The function returns the mcmc variable.

This is all we need; the following code defines our data and calls the functions we have just made using the Beta(1,1) prior.

traffic = torch.tensor([5523., 1379.])
conversions = torch.tensor([2926., 759.])
inference = run_inference(model(1, 1), number_of_samples = 1000, 
               traffic = traffic, number_of_conversions = conversions)

The first element of the traffic and conversions tensors are the counts for the control group, and the second element in each tensor is the counts for the treatment group. We pass the model function, with the parameters to govern our prior distribution, alongside the tensors we have defined. Running this code will generate our posterior samples. We run the following code to extract the posterior samples and pass them to a Pandas dataframe.

posterior_samples = inference.get_samples()
posterior_samples_df = pd.DataFrame(posterior_samples)

Notice that the column names of this dataframe are the strings we passed when we defined our primitives in the model function. Each row in our dataframe contains samples drawn from the posterior distribution, and each of these samples represents an estimate of the page conversion rate, the probability value p that governs our Binomial distribution. Now that we have returned the samples, we can plot our posterior distributions.

Results

An insightful way to visualise the results of the AB test with two test groups is by a joint kernel density plot. It allows us to visualise the density of samples in the probability space across both distributions. The graph below can be produced from the dataframe we have just built.
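
A minimal sketch of how such a joint KDE plot could be produced from the dataframe (the exact plotting parameters are an assumption):

import seaborn as sns
import matplotlib.pyplot as plt

# joint kernel density of the posterior samples for the two conversion rates
sns.jointplot(data=posterior_samples_df, x='prior_c', y='prior_t', kind='kde')
plt.show()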

Credit: Author

The probability space contained in the graph above can be divided across its diagonal: anything above the line would indicate regions where the estimate of the conversion rate is higher in the treatment group than in the control, and vice versa. As illustrated in the plot, the samples drawn from the posterior are densely populated in the region which would indicate the conversion rate is higher in the treatment group. It is important to highlight that the posterior distribution for the treatment group is wider than that of the control group, reflecting a higher degree of uncertainty. This is a result of observing less data in the treatment group. Nevertheless, the plot strongly indicates that the treatment group has outperformed the control group. By collecting an array of samples from the posterior and taking the element-wise difference, we can say that the probability that the treatment group outperforms the control group is 90.4%. This figure suggests that 90.4% of the samples drawn from the posterior will be populated above the diagonal in the joint density plot above.
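
A minimal sketch of this calculation, using the posterior_samples_df dataframe built above:

# share of posterior draws where the treatment conversion rate exceeds the control's
prob_treatment_better = (posterior_samples_df['prior_t'] > posterior_samples_df['prior_c']).mean()
print(f'P(treatment > control) = {prob_treatment_better:.1%}')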

These results were achieved by using a flat (uninformed) prior. The use of an informed prior may help improve the model, particularly when the availability of observed data is limited. A helpful exercise is to explore the effects of using different priors. The plot below shows the Beta(2,2) probability density function and the joint plot it produces when we rerun the model. We can see that using the Beta(2,2) prior produces a very similar posterior distribution for both test groups.

Credit: Author

The samples drawn from the posterior suggest there is a 91.5% probability that the treatment group performs better than the control. Therefore, with this prior we believe with a slightly higher degree of certainty that the treatment group is better than the control than we did when using a flat prior. However, in this example the difference is negligible.

There is one other thing I would like to highlight about these results. When we ran the inference, we told Pyro to generate 1000 samples from the posterior. This is an arbitrary number; selecting a different number of samples can change the results. To highlight the effect of increasing the number of samples, I ran an AB test where the observations from the control and treatment groups were the same, each with an overall conversion rate of 50%. Using a Beta(2,2) prior generates the following posterior distributions as we incrementally increase the number of samples.

Credit: Author

When we run our inference with just 10 samples, the posterior distributions for the control and treatment groups are relatively wide and adopt different shapes. As the number of samples that we draw increases, the distributions converge, eventually generating nearly identical distributions. Furthermore, we observe two properties of statistical distributions: the central limit theorem and the law of large numbers. The central limit theorem states that the distribution of sample means converges towards a normal distribution as the number of samples increases, and we can see that in the plot above. Additionally, the law of large numbers states that as the sample size grows, the sample mean converges towards the population mean. We can see that the mean of the distributions in the bottom right tile is approximately 0.5, the conversion rate observed in each of the test samples.


The Business Case for Bayesian AB Testing

Bayesian AB testing can help elevate a business’s test and learn culture. Bayesian statistical inference allows for small uplifts in the test to be detected quickly as it is not reliant on long-run probabilities to draw conclusions. Test conclusions can be reached faster and therefore the rate of learning increases. Bayesian AB testing also allows for early stoppage of a test, if results gained through ‘peeking’ indicate test groups are performing significantly worse than the control then the test can be stopped. Therefore, the opportunity cost of testing can be significantly reduced. This is a major benefit of Bayesian AB testing; results can be constantly monitored, and our posteriors constantly updated. Conversely, the early detection of an uplift in the test subject can help businesses implement changes faster, reducing the latency of implementing revenue-improving changes. Customer-facing businesses must be able to quickly implement and analyse test results, which is facilitated by the Bayesian AB testing framework.


Summary

This article has given a brief background on Bayesian thinking and explored the results of Bayesian AB testing using Pyro. I hope you have found this article insightful. As mentioned in the introduction, feedback is welcomed and encouraged. As promised, I have linked some further reading material below.

Recommended Material

The following books provide good insight into Bayesian Inference:

  • The Signal and the Noise: The Art and Science of Prediction – Nate Silver
  • The Book of Why: The New Science of Cause and Effect – Judea Pearl and Dana Mackenzie. Although this book largely focuses on causality, chapter 3 is a worthwhile read on Bayes’.
  • Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference – Cameron Davidson-Pilon. This book is also available on GitHub; I have linked it below.

GitHub – CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers: aka "Bayesian…

These Medium articles provide detailed explanations of MCMC.

Monte Carlo Markov Chain (MCMC) explained

Bayesian inference problem, MCMC and variational inference

The post Bayesian AB Testing with Pyro appeared first on Towards Data Science.
