The Capture-ReCapture Method

When you capture your individuals, make sure not to harm them, as you have to set them free again later. Photo by Anne Nygård on Unsplash

In this article, I want to present a statistical method for estimating the size of a population without counting it fully: the Capture-ReCapture method. Although it originates in biology, the procedure can also be applied to many other fields and scenarios that are of interest to data scientists and related professions.

I will first demonstrate the procedure on a biological example before I talk about its statistical background and the conditions under which it can be used. After that, I will present examples from different domains to show what the Capture-ReCapture method can do in different scenarios.

How many snails are in my garden?

Many people don’t like snails, but I still think they are adorable. Let’s count them without hurting them. Photo by Krzysztof Niewolny on Unsplash

Say I want to know how many snails are living in my garden. I could try to count all of them, but how will I know when I’m finished with that? Even if I don’t find any more snails, I can never be sure that there are none left. Instead, there is a different method I can use.

On the first day, I dedicate half an hour to collecting snails and counting them. Additionally, I mark each one with a dot of paint before I release it into my garden again. Say I have collected 21 snails. Can I already give an estimate of the total number of snails in my garden? No, not yet (apart from the fact that there must be at least 21 snails), but I am not finished.

A day later, I go to my garden again and count snails for another half an hour. Some of the snails I find that day already have a dot of paint on their shell, i.e. I already found them yesterday, while others don’t, i.e. I didn’t find those particular snails yesterday. Say I count 28 snails that day, 9 of which are already marked with a dot of paint. Now I can give an estimate of the total number of snails. Let’s do the math.

On the second day, 9 of the 28 snails I found had already been caught the day before. That ratio should be roughly equal to the ratio of the snails I marked on the first day to the total number of snails, i.e. 21/N = 9/28, where N is the total number of snails. Rearranging gives N = (21*28)/9 ≈ 65.

Why is that? By the second day, a certain proportion p of the population has a certain property (namely being marked). If I draw a random sample from the population, I expect roughly the same proportion p of my sample to have that property as well. That is very intuitive: if you randomly sample from the population of your city, you would also expect the ratio of genders in your sample to reflect the ratio of genders in total, right? On the first day, I didn’t know which fraction of all snails I had painted. The second sample gives me an estimate of that fraction: p is about 9/28 = 32%. From that, it is easy to derive the total number of snails: if the 21 snails I painted make up roughly 32% of the population, there are roughly 65 snails in total (21 being roughly 32% of 65).
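
The calculation above can be written as a small helper. This is only a minimal sketch: the function name and its argument names are my own choice for illustration. The ratio estimate itself is commonly known as the Lincoln-Petersen estimator.

```python
def estimate_population(first_sample: int, second_sample: int, recaptured: int) -> float:
    """Estimate the population size N from 21/N = 9/28 style counts.

    first_sample  -- individuals caught and marked on the first occasion
    second_sample -- individuals caught on the second occasion
    recaptured    -- individuals in the second sample that were already marked
    """
    if recaptured <= 0:
        # Without any overlap the ratio is undefined (division by zero).
        raise ValueError("at least one recaptured individual is required")
    if recaptured > min(first_sample, second_sample):
        raise ValueError("recaptured cannot exceed either sample size")
    return first_sample * second_sample / recaptured


# The snail example: 21 marked, 28 caught on day two, 9 of them marked.
snails = estimate_population(21, 28, 9)
print(round(snails, 1))  # 65.3
```

The result is an estimate, not a count, so rounding it to 65 snails is as precise as the method allows here.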

Conditions for recapturing

Before using the Capture-ReCapture method, make sure that the required conditions are fulfilled. Photo by Sung Jin Cho on Unsplash

Besides counting the number of snails in your garden, there are many other scenarios where you can apply the procedure described above. As you can imagine, the time between the two sampling steps doesn’t have to be a day, and the marking doesn’t have to be literal either. You may also just keep a list of the individuals you have drawn in the first round, as long as you can easily determine whether an individual you find in the second round is already on the list. However, for the Capture-ReCapture method to be applicable, the following conditions have to be fulfilled:

At both points of data collection, the population has to be the same. In particular, no individuals may be added or removed between the two points in time.
At both points of data collection, one has to draw randomly and independently from the population, i.e. each individual must have the same likelihood of being caught. In particular, being marked or not should not make a difference to the likelihood of being drawn on the other occasion.
The number of individuals drawn on each occasion has to be large enough to create a meaningful overlap. You can easily imagine that randomly sampling 100 books twice from your local library, where the number of books is in the millions, will most likely create no overlap at all and hence doesn’t help your estimation.
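
To get a feeling for the last condition, one can simulate the procedure on a population whose size is known. The following is a sketch under idealized assumptions (a closed population and perfectly random, independent draws); the function name and the chosen numbers are made up for illustration.

```python
import random


def simulate_estimate(population_size, first_sample, second_sample, rng):
    """Run one simulated capture-recapture experiment on a closed population.

    Returns the estimated population size, or None if no marked individual
    was recaptured (the estimate is undefined in that case).
    """
    individuals = range(population_size)
    # First occasion: every individual is equally likely to be caught and marked.
    marked = set(rng.sample(individuals, first_sample))
    # Second occasion: an independent random draw from the same population.
    second = rng.sample(individuals, second_sample)
    recaptured = sum(1 for individual in second if individual in marked)
    if recaptured == 0:
        return None
    return first_sample * second_sample / recaptured


rng = random.Random(42)  # fixed seed so the run is repeatable

# Samples that are reasonably large relative to the population.
print(simulate_estimate(1_000, 200, 200, rng))

# The library example: 100 books on each occasion out of one million.
print(simulate_estimate(1_000_000, 100, 100, rng))
```

With samples as small as in the library example, the second call will most likely print None, because the two samples rarely share even a single book.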

Example use cases

Spoiler: Medicine is a domain where variants of the Capture-ReCapture method are used a lot. Photo by Ksenia Yakovleva on Unsplash

Now that we have understood the Capture-ReCapture method, let’s take a look at some examples of where to use it. It comes in handy whenever we want to determine the size of a population without being able to count it fully. However, in each scenario the method’s prerequisites can be violated in different ways, and these pitfalls have to be taken into consideration.

Counting the number of guests at a party

At the next party you attend, you can take five minutes to mark some individuals (either by literally marking them or by keeping a list of them), and some minutes later you draw random individuals again. However, make sure that you really draw randomly and independently. That is, you should catch people from all over the place and not be biased towards people you know or don’t know. Also, make sure that the time between the two points of data collection is not too long; otherwise, your estimate might be biased by the fact that people have left the party in the meantime.

Capturing from two independent lists

A variant of the Capture-ReCapture method doesn’t recapture at a different point in time but instead uses two independent data sources (drawn from the same population) and their overlap. The method is often used in this way in medical scenarios, so let’s take a look at an example where we estimate the prevalence of a disease.

Say I have a list of patients from a hospital listing 142 persons having a certain disease, and I have another list coming from the National Health Service that lists 442 persons having that disease. Say 71 people appear on both lists. Then we can use the formula from above and obtain our result (142*442)/71 = 884. That is, 884 people are estimated to suffer from the disease.

The most important requirement for this variant is that the two lists are indeed independent, i.e. the likelihood of an individual appearing on one list should not depend on whether that individual appears on the other list.
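
With two lists, the overlap can be computed directly from the identifiers. The sketch below uses made-up integer patient IDs that are constructed to reproduce the numbers from the example (142, 442, and 71); the function name is hypothetical, and it assumes that the same person carries the same identifier on both lists, which real-world data often does not guarantee.

```python
def estimate_from_lists(list_a, list_b):
    """Estimate the population size from two lists of identifiers."""
    ids_a, ids_b = set(list_a), set(list_b)  # sets also drop duplicates
    overlap = len(ids_a & ids_b)             # individuals found on both lists
    if overlap == 0:
        raise ValueError("the lists share no individuals; no estimate possible")
    return len(ids_a) * len(ids_b) / overlap


# Made-up IDs: 142 hospital patients and 442 registry patients, 71 shared.
hospital_ids = range(0, 142)   # IDs 0..141
registry_ids = range(71, 513)  # IDs 71..512, so 71..141 are on both lists

print(estimate_from_lists(hospital_ids, registry_ids))  # 884.0
```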

Estimate the number of potential customers

Say you have a website to sell your breathtaking new product. On one day you record all visitors to your website (e.g. by tracking their IP addresses), and you do the same some days later. From the overlap between the two days, you can estimate the number of potential customers for your product. However, you should be aware that this scenario can easily violate an important assumption, namely that the draws on both occasions are independent. In particular, one could argue that visiting the website on day one increases the likelihood of visiting it again.
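
How much such a dependence can distort the result can be sketched with a back-of-the-envelope calculation. The function below plugs expected counts into the estimation formula; all numbers and parameter names (including return_boost, the factor by which a day-one visitor is more likely to come back) are assumptions made up for illustration, not measurements.

```python
def expected_estimate(population, p_first, p_second, return_boost):
    """Plug expected counts into N = n1 * n2 / m.

    population   -- true number of potential customers
    p_first      -- chance that a given person visits on day one
    p_second     -- chance that a day-one non-visitor visits on day two
    return_boost -- how much more likely day-one visitors are to return
                    (1.0 means the two days are independent)
    """
    n1 = population * p_first                     # expected visitors on day one
    recaptured = n1 * p_second * return_boost     # expected returning visitors
    new_visitors = (population - n1) * p_second   # expected first-time visitors
    n2 = recaptured + new_visitors                # expected visitors on day two
    return n1 * n2 / recaptured


# Independent days: the formula recovers the true size of 10,000.
print(round(expected_estimate(10_000, 0.1, 0.05, 1.0)))  # 10000

# Day-one visitors are twice as likely to return: the size is underestimated.
print(round(expected_estimate(10_000, 0.1, 0.05, 2.0)))  # 5500
```

In this sketch, returning visitors inflate the overlap, so the estimate drops well below the true population size.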

Summary

We have now seen some examples of the Capture-ReCapture method, which allows us to estimate the size of a population without counting it fully. Instead of counting each individual in the population, the method requires drawing two independent samples from the population (either at different points in time or from different sources) and using their overlap to estimate the population size. This can be used in a variety of domains, whenever a full observation of the population is not feasible.