
The statistical magic behind the bootstrap

How to use the bootstrap for tests or confidence intervals and why it works

Photo by Artem Maltsev on Unsplash

It is often claimed that AI, data science and machine learning are all just glorified statistics. While this may take it a step too far, data scientists have undoubtedly borrowed a lot of useful tools from statistics. But there is one extremely useful tool that has been widely overlooked – the Bootstrap. The Bootstrap is an algorithm that allows you to determine the distribution of a test statistic without doing any theory.

Statistical inference provides the mathematical theory that describes which conclusions can legitimately be drawn from data.

In data science, statistical inference usually comes up in A/B testing or when models are used for analytics. Non-statisticians often struggle when it comes to testing and inference. And to be fair – I (as a statistician) sometimes struggle, too.

When we have to do a test in practice, we define the hypothesis we want to test and most likely start googling to figure out the right test statistic and distribution to use. If we have a lot of data, that is usually not a problem – especially if we want to test something simple like a hypothesis on the mean of our A/B test result or the coefficient in a linear regression model. Then a good old t-test will do.

But what if the data set is small and not normally distributed at all? Are all the assumptions of the respective test fulfilled? What if the hypothesis we want to test is a non-linear function of the features in the model? Or if we want to test a hypothesis on the ratio of two KPIs measured in the test? The statistics quickly become very complicated. This is where the Bootstrap comes to the rescue.

Left: histogram of the sample for which we want to test whether the expected value is zero. Right: simulated distribution of the t-statistic for a sample simulated from the same distribution as the sample on the left with normal density superimposed. Image by author.

In the figure above, we can see an example of a situation where a normal t-test reaches its limit. The 50 observations in the sample on the left are drawn from an exponential distribution, squared and then centered. On the right, one can see what happens if we want to use a t-test for the (true) null hypothesis that the sample comes from a distribution with zero mean. The histogram shows the distribution that is obtained if the test statistic is actually calculated for 5000 simulated examples constructed in the same way. The superimposed density is that of the asymptotic normal distribution of a t-test.

Clearly, the normal distribution is a very bad approximation to the true distribution in this case, so the conclusions we would draw from a normal t-test here would not be correct.
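
To make this concrete, here is a minimal sketch of such a simulation in Python using NumPy. The exact parameters (a rate-1 exponential, centered at its true mean E[X²] = 2) are my assumptions and not taken from the figure:

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_sim = 50, 5000

# Each sample: n draws from a rate-1 exponential, squared and centered at the
# true mean E[X^2] = 2, so the null hypothesis of zero mean holds by construction.
def draw_sample():
    return rng.exponential(scale=1.0, size=n) ** 2 - 2.0

# Simulate the finite-sample distribution of the one-sample t-statistic.
t_stats = np.empty(n_sim)
for i in range(n_sim):
    x = draw_sample()
    t_stats[i] = np.sqrt(n) * x.mean() / x.std(ddof=1)

# A standard normal distribution would have skewness close to zero;
# the simulated t-statistics are clearly skewed instead.
skew = np.mean((t_stats - t_stats.mean()) ** 3) / t_stats.std() ** 3
print(f"mean {t_stats.mean():.2f}, std {t_stats.std():.2f}, skewness {skew:.2f}")
```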

How the Bootstrap works

The Bootstrap allows you to determine the distribution of a statistic T – for example, the click-through rate (CTR) in an A/B test – very easily with a procedure similar to a simulation. The procedure is as follows:

  1. Draw a sample of the same size as your data from your data. Draw with replacement so that some of the original data points might be missing and some might be in the sample multiple times. This sample is called a bootstrap sample.
  2. Calculate your quantity of interest T* based on the bootstrap sample and save the result.
  3. Repeat the previous two steps m times.
  4. Use the saved values T*₁,…,T*ₘ from the previous steps to calculate the properties of the distribution you are interested in – be it p-values or quantiles for confidence intervals.

This is all. It works surprisingly well – often even in relatively small samples. It may cost some compute, but in many applications this is not an obstacle because the data sets are sufficiently small. It is statistical magic!
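
A minimal sketch of these four steps in Python could look like this (NumPy only; the function name and the example data are mine, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_statistics(data, statistic, m=5000):
    """Return m bootstrap replicates of `statistic` evaluated on resamples of `data`."""
    data = np.asarray(data)
    n = len(data)
    reps = np.empty(m)
    for i in range(m):
        # Step 1: draw n points from the data, with replacement.
        sample = rng.choice(data, size=n, replace=True)
        # Step 2: calculate the quantity of interest on the bootstrap sample.
        reps[i] = statistic(sample)
    # Steps 3 and 4: the loop repeats m times; the caller works with the saved values.
    return reps

# Example: bootstrap the mean of a small, skewed sample.
data = rng.exponential(size=50) ** 2
t_star = bootstrap_statistics(data, np.mean)
```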

Top: original sample with empirical cumulative distribution function superimposed. Bottom: two examples of bootstrap samples. Image by the author.

To calculate confidence intervals around the sampled statistic, one can simply utilize the respective quantiles from the observed distribution of T*₁,…,T*ₘ.
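
For instance, a 95% percentile interval could be read off the bootstrap replicates like this (a sketch; the data and the choice of the mean as statistic are placeholders of my own):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(size=50) ** 2   # a small, skewed example sample

# Bootstrap replicates of the sample mean.
t_star = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(5000)
])

# 95% percentile confidence interval: the 2.5% and 97.5% quantiles of T*_1, ..., T*_m.
lower, upper = np.quantile(t_star, [0.025, 0.975])
print(f"95% bootstrap CI for the mean: [{lower:.3f}, {upper:.3f}]")
```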

For hypothesis testing one has to be a bit careful, because the bootstrap statistics T*₁,…,T*ₘ are centered around the statistic T from the original sample – not around the expected value of that test statistic. Mathematically, we write E[T*ₘ] = T ≠ E[T]. Therefore, if we want to test whether T is significantly different from zero, critical values or p-values have to be derived from the distribution of T*ₘ − T.
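
A sketch of such a test for the mean, again with placeholder data of my own; the bootstrap replicates are centered at T before computing the two-sided p-value:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(size=50) ** 2 - 2   # centered so that H0: E[X] = 0 is true
T = data.mean()                            # statistic on the original sample

# Bootstrap replicates of the same statistic.
t_star = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(5000)
])

# The replicates scatter around T, not around 0, so we center them at T and use
# T* - T as a stand-in for the distribution of T under the null hypothesis.
centered = t_star - T

# Two-sided p-value for H0: E[T] = 0.
p_value = np.mean(np.abs(centered) >= np.abs(T))
print(f"bootstrap p-value: {p_value:.3f}")
```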

Distribution of the t-statistic from the example above compared to asymptotic normal distribution and bootstrap distribution. Image by the author.

The figure above compares the distribution of the test statistic obtained using the Bootstrap to the actual simulated distribution and the normal distribution. It is easy to see that the bootstrapped distribution comes much closer to the true one.

What you need to know to understand the Bootstrap

When I first learned about the Bootstrap, I learned how to carry out the steps described above. But it really bugged me that I could not figure out why this magic would work. Drawing new samples from a sample seemed ludicrous to me. What statistical argument would justify that?

It turns out that it is actually surprisingly simple to understand the statistical reason why the Bootstrap works. You just need a clear understanding of two things:

1. Statistical Testing

The first thing you need in order to understand the Bootstrap is a clear picture of how statistical tests work. When we do a test, we have a sample and some hypothesis about it. The data in our sample x₁,…,xₙ comes from some distribution that is usually unknown. We then calculate a test statistic T(x₁,…,xₙ), which is just some function of our data. Since we do not know the distribution of our sample, we cannot know the distribution Fₙ of T either.

What statisticians do to overcome this is to construct T in such a way that they know what distribution F it would have if we had an infinite amount of data and the null hypothesis were true. If the null hypothesis is false, T diverges to values that are extremely unlikely under the asymptotic distribution F. We then compute T and compare it to F – ignoring the fact that the actual sample is finite. If the sample is large enough, this is a reasonable approximation. If it is small, the result can be very wrong. In the t-test example from above, the asymptotic distribution F is the normal distribution, but the finite-sample distribution Fₙ of the test statistic shown in the histogram on the right is very different.
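
As a small illustration of this workflow (my own example, assuming SciPy is available): compute the one-sample t-statistic on the data and compare it to its asymptotic normal distribution F.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.exponential(size=50) ** 2 - 2    # small, skewed sample with true mean zero

# Test statistic T(x_1, ..., x_n) for H0: E[X] = 0.
n = len(x)
T = np.sqrt(n) * x.mean() / x.std(ddof=1)

# Compare T to the asymptotic distribution F (standard normal), ignoring that
# the finite-sample distribution F_n may look quite different.
p_value = 2 * stats.norm.sf(abs(T))
print(f"T = {T:.3f}, asymptotic p-value = {p_value:.3f}")
```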

2. Empirical cumulative distribution functions

The second thing you need in order to understand the Bootstrap is the empirical cumulative distribution function. In case you feel a little rusty on your statistics foundations, let me give you a quick reminder.

When we think of statistical distributions, we typically tend to think of the density function f(x) – for example, the famous bell curve in the case of the normal distribution. Another way to describe a distribution is via its cumulative distribution function.

F(x) = P(X ≤ x) = E[𝟙(X ≤ x)]

For every value x, it gives us the probability that a random variable X drawn from that distribution is less than or equal to that value x. For a continuous distribution, F(x) is simply the integral of the density from minus infinity to x. So the cumulative distribution function is just another way to describe a probability distribution.

Distribution function and density function of the standard normal distribution. Image by the author.

In practice, we usually do not know the cumulative distribution function of the data we are working with – but we can estimate it via its empirical counterpart, the empirical cumulative distribution function or ECDF, which is given by

F̂(x) = (1/n) ∑ᵢ₌₁ⁿ 𝟙(xᵢ ≤ x)

As you can see from the formula, the ECDF at point x is simply the proportion of the data that is less than or equal to the value x. By its nature, the ECDF is a step function that stays constant between the data points that were actually observed in the sample. As the size n of the data set increases, F̂(x) becomes more and more similar to the true cumulative distribution function F(x).
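
In code, the ECDF is just a counting exercise. A minimal sketch (the small example sample is made up):

```python
import numpy as np

def ecdf(data):
    """Return a function mapping x to the share of data points less than or equal to x."""
    data = np.sort(np.asarray(data))
    n = len(data)
    def F_hat(x):
        # searchsorted with side="right" counts how many sorted points are <= x.
        return np.searchsorted(data, x, side="right") / n
    return F_hat

sample = np.array([0.3, 1.2, 0.7, 2.5, 0.9])
F_hat = ecdf(sample)
print(F_hat(1.0))   # 0.6, since three of the five points are <= 1.0
```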

Empirical cumulative distribution function and theoretical cumulative distribution function. Image by the author.

Key insight

What is most important for the Bootstrap is the fact that any ECDF itself is a valid distribution function. This distribution is discrete and the possible values that the random variable can take are those that were observed in the original sample. The ECDF assigns the same probability 1/n to each of these possible outcomes.

Why the Bootstrap works

So why does this bootstrap magic work? Recall that our test statistic T(x₁,…,xₙ) follows the unknown distribution Fₙ, and we want to know Fₙ to calculate confidence intervals, critical values or p-values. We calculate m test statistics T*₁,…,T*ₘ based on bootstrap samples drawn from the original sample and then utilize the empirical distribution F̂* of these test statistics in place of the unknown distribution Fₙ.

The key insight is that drawing with replacement from the original sample means drawing from the distribution specified by the ECDF of the sample. As we just discussed, the ECDF is the distribution that assigns probability 1/n to each of the data points in the original sample. Drawing with replacement means drawing each of these points with probability 1/n. Therefore, if we draw from the sample, we draw from the ECDF.

Consequently, if the sample size n is large enough, the distribution F̂* of the bootstrap statistics T*₁,…,T*ₘ will be a good estimate of Fₙ, because F̂(x) will be a good estimate of F(x).
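
A quick sanity check of this equivalence (a toy example of my own): drawing with replacement gives each original data point a relative frequency close to the ECDF weight 1/n.

```python
import numpy as np

rng = np.random.default_rng(3)
sample = np.array([0.3, 1.2, 0.7, 2.5, 0.9])
n = len(sample)

# Drawing with replacement from the sample selects each original point with
# probability 1/n, i.e. it samples from the discrete distribution defined by the ECDF.
draws = rng.choice(sample, size=200_000, replace=True)
values, counts = np.unique(draws, return_counts=True)
for v, c in zip(values, counts):
    print(f"{v}: relative frequency {c / len(draws):.3f} (ECDF weight {1 / n:.3f})")
```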

Why is it called the Bootstrap?

The inventor of the Bootstrap is in his mid-eighties today. His name is Bradley Efron and it turns out that he is a pretty humorous guy. The few interviews he has given are full of edgy quotes, jokes and funny anecdotes. As a student he was actually suspended from Stanford for 6 months. He had become the editor of Stanford’s humor magazine The Chaparral and published an issue that was a parody of Playboy. This parody may have taken it a bit too far for the year 1961.

Photo by Glenn Carstens-Peters on Unsplash

The inspiration to pick the name "Bootstrap" for his procedure comes from a story that is much better known in Germany and the United Kingdom than in the rest of the world. Baron Munchausen is a fictional German nobleman from the 18th century who likes to tell unbelievable stories about his adventures. In one of these stories, he gets stuck in a swamp with his horse and manages to free himself by pulling himself and his horse out by his hair – or, in some versions, by his bootstraps.

You can see why Efron thought this was a nice analogy to resampling from your own sample!

Where and when to use the bootstrap

The bootstrap can be used for testing, confidence intervals, etc. whenever the real distribution is unknown or hard to determine.

The logic behind the bootstrap is based on an asymptotic approximation. The bootstrap distribution will be off from the true distribution to the extent that the ECDF differs from the real cumulative distribution function. Therefore, it is not clear that bootstrapping would be preferable to a test based on an asymptotic normal distribution. Nevertheless, the bootstrap often performs surprisingly well in relatively small samples.

A main downside to bootstrapping is the amount of compute that is required. This can be an issue if the underlying data set is very large, or if the amount of compute needed to generate T(x₁,…,xₙ) is large relative to the data. Nevertheless, in most areas where you might want to apply the Bootstrap, compute is not a limiting factor.

In the end, everything depends on the use case. The Bootstrap can be a great tool for data scientists who want to produce reliable results fast – without worrying about theory for days. But be aware that the reason why the Bootstrap works is an asymptotic approximation and there is no guarantee on its performance in finite samples.

If you like this article, follow me on LinkedIn, where I regularly post about statistics, data science and machine learning. Also, check out some of my other posts on Medium:

Are you interpreting your logistic regression correctly?

How to choose your loss function – where I disagree with Cassie Kozyrkov

