Markov Chain Monte Carlo: Made Simple Once and For All

I recently posted an article where I used Bayesian Inference and Markov chain Monte Carlo (MCMC) to predict the CL round of 16 winners. There, I explained Bayesian Statistics in relative depth, but I didn't say much about MCMC to avoid making that article excessively long.
So I decided to dedicate a full post to introducing Markov Chain Monte Carlo methods, for anyone interested in learning how they work mathematically and when they prove useful.
To tackle this post, I'll adopt a divide-and-conquer strategy: break the name down into its simplest components, explain each one individually, and then put the big picture together. So this is what we'll go through:
- Monte Carlo methods
- Stochastic processes
- Markov chains
- Markov Chain Monte Carlo (MCMC)
Monte Carlo Methods
A Monte Carlo method, or simulation, is a type of computational algorithm that relies on repeated random sampling to obtain numerical results, typically in the form of the likelihood of a range of outcomes occurring.
In other words, a Monte Carlo simulation is used to estimate or approximate the possible outcomes or distribution of an uncertain event.
A simple example to illustrate this is rolling two dice and adding their values. We could easily compute the probability of each outcome analytically, but we could also use Monte Carlo methods to simulate 5,000 dice rolls (or more) and approximate the underlying distribution.
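As a quick illustration, here's a minimal sketch of that simulation in Python with NumPy (the seed, the variable names and the 5,000 figure are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_simulations = 5_000

# Roll two dice n_simulations times and add their faces
rolls = rng.integers(1, 7, size=(n_simulations, 2)).sum(axis=1)

# Estimate the probability of each possible sum (2 through 12)
values, counts = np.unique(rolls, return_counts=True)
for value, count in zip(values, counts):
    print(f"P(sum = {value}) ≈ {count / n_simulations:.3f}")
```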
Stochastic Processes
Wikipedia’s definition is "A stochastic or random process can be defined as a collection of random variables that is indexed by some mathematical set"[1].
In more readable terms: "it’s any mathematical process that can be modeled with a family of random variables".[2]
Let's use a simple example to understand the concept. Imagine you put a video camera in your favorite store to check, once every 2 minutes, how many visitors there are. We define X(0) as the initial state: the number of visitors seen at t=0. Two minutes later we observe X(1), and so on.
The state space is the set of values our random variables X(i) can take, which here ranges from 0 to the store's maximum capacity.
One of the properties of a stochastic process is that whatever happens at a specific moment is conditioned on what has happened in the preceding moments. Keeping up with our example, if we have 100 visitors at t=0, the probability of having 100 ± 20 at t=1 is greater than the probability of the count dropping to 10 (assuming no unexpected event happens). Therefore, these X variables aren't independent.
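Just to make the idea concrete, here's a tiny sketch that simulates such a process; the store capacity, the number of observations and the ±20 fluctuation are made-up values for illustration only:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
capacity = 200      # hypothetical maximum store capacity
n_steps = 30        # one observation every 2 minutes -> 1 hour of footage

visitors = [100]    # X(0): visitors seen at t = 0
for t in range(1, n_steps):
    # The next count fluctuates around the current one (within ±20 here),
    # so X(t) is clearly not independent of X(t-1)
    change = rng.integers(-20, 21)
    visitors.append(int(np.clip(visitors[-1] + change, 0, capacity)))

print(visitors)
```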
Markov Chains
A Markov chain is a sequence of random variables where each value depends only on the value immediately before it.
So it's a stochastic process with one peculiarity: knowing the current state is as good as knowing the entire history. In mathematical terms, we say that a stochastic process is Markovian if X(t+1), conditioned on x(1), x(2), …, x(t), only depends on x(t):
P(X(t+1) = x(t+1) | X(1) = x(1), X(2) = x(2), …, X(t) = x(t)) = P(X(t+1) = x(t+1) | X(t) = x(t))
Keeping up with our example, for it to be considered a Markov chain, the number of visitors at a given time t would have to depend only on the number of visitors we saw at the previous instant, t-1. That's not true in real life, but let's imagine it is. We then define the transition probability as the probability of going from state i to state j at a specific instant:
p(i, j) = P(X(t+1) = j | X(t) = i)
And, if that probability is time-independent, we say it’s stationary.
With this transition probability, we now define the transition matrix, which is just a matrix with all transition probabilities:
    | p(1,1)  p(1,2)  …  p(1,M) |
P = | p(2,1)  p(2,2)  …  p(2,M) |
    |   …       …     …    …    |
    | p(M,1)  p(M,2)  …  p(M,M) |

where M is the number of states.
This matrix comes in handy when we want to compute the probability of transitioning from one state to another in n steps, which is achieved mathematically by raising the matrix to the n-th power:
P(n) = P^n

That is, the entry in row i and column j of P^n is the probability of going from state i to state j in exactly n steps.
Let's now define a new – admittedly silly – example, in which we assume a striker's probability of scoring a goal in a football (soccer) match depends only on whether he/she scored in the previous game. Because we also suppose it's time-independent – when the match is played doesn't matter – we are working with stationary transition probabilities.
Concretely, if the player scored in the previous match, we assume the probability of scoring again in the next game is 70% (the player is hypermotivated to keep the streak going). If the player didn't score, this probability drops to 40%.
Let’s put that into the transition matrix:
              Goal   No goal
Goal        |  0.7     0.3  |
No goal     |  0.4     0.6  |
The proper way to read it: we have two possible outcomes (goal or no goal). Row 1 gives the next-game probabilities when the player has scored; row 2 does the same when he/she hasn't. Columns are read similarly: the first relates to the probability of scoring and the second to the probability of not scoring.
So, for example, 0.7 is the probability of scoring after having scored in the previous game.
Now, what are the chances that the player scores two games from now (n = 2), knowing that he/she hasn't scored today?
P^2 = | 0.7  0.3 |^2 = | 0.61  0.39 |
      | 0.4  0.6 |     | 0.52  0.48 |
If the player hasn't scored today, we have to focus on the second row. As we're interested in the chances of scoring, we focus on the first column. Where these two intersect we find 0.52: the probability of scoring two games from now is 52%.
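We can double-check this with a couple of lines of NumPy; a minimal sketch using the matrix defined above:

```python
import numpy as np

# Transition matrix: rows/columns ordered as [goal, no goal]
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

P2 = np.linalg.matrix_power(P, 2)
print(P2)        # ≈ [[0.61 0.39]
                 #    [0.52 0.48]]
print(P2[1, 0])  # row "no goal", column "goal" -> ≈ 0.52
```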
We might also want to work out the marginal distribution of each X(t), and we can do it by using the initial conditions with which the chain was initialized: X(0).
Keeping up with the example, the question would now be: knowing that the player has a 50-50% chance of scoring in the first game, what are the chances that he/she scores two games after that one?
p(2) = p(0) · P^2 = [0.5  0.5] · | 0.61  0.39 | = [0.565  0.435]
                                 | 0.52  0.48 |
The answer is 0.565, or 56.5%.
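Again, a quick sketch to verify the numbers (same transition matrix, with p0 = [0.5, 0.5] as stated above):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
p0 = np.array([0.5, 0.5])  # 50-50 chance of scoring in the first game

# Marginal distribution of X(2): p0 times P squared
p2 = p0 @ np.linalg.matrix_power(P, 2)
print(p2)  # ≈ [0.565 0.435]
```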
What's curious about Markov chains is that, regardless of which values we choose for p0, we can end up with the same distribution after a certain number of iterations. That's called a stationary distribution, and it is key for MCMC.
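Here's a minimal sketch that shows this behavior empirically for our transition matrix (the two starting distributions are arbitrary choices):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Two very different initial distributions converge to the same place
for p0 in (np.array([1.0, 0.0]), np.array([0.1, 0.9])):
    pn = p0 @ np.linalg.matrix_power(P, 50)
    print(p0, "->", pn)  # both print ≈ [0.571 0.429], the stationary distribution
```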
Markov Chain Monte Carlo (MCMC)
Now it's time to combine both concepts.
MCMC methods are Monte Carlo simulations in which the samples are drawn by running a Markov chain whose stationary distribution is the probability distribution we're interested in. In the case of Bayesian modeling, this stationary distribution will be the posterior distribution.
Simulating the chain for a number of steps and discarding the initial ones (what's called the burn-in phase) gets us to the desired distribution. The resulting samples are dependent on each other, but if we keep only one out of every few iterations and discard the rest (thinning), we make them almost independent.
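In code, burn-in and thinning usually amount to simple slicing of the stored chain; a small sketch with arbitrary numbers:

```python
import numpy as np

# Suppose `chain` holds all the raw MCMC draws, one per iteration
chain = np.random.default_rng(0).normal(size=10_000)  # placeholder draws

burn_in = 1_000  # discard the first iterations (burn-in phase)
thin = 10        # keep only every 10th draw (thinning)

samples = chain[burn_in::thin]
print(samples.shape)  # (900,)
```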
MCMC comes in handy when we want to perform inference for probability distributions where independent samples from the distribution cannot be easily drawn.
Regarding the different MCMC algorithms that exist, we'll focus on the two most common ones:
- Gibbs Sampling: this algorithm samples from the full conditional distributions. We sample each variable from its distribution conditioned on the current values of the other variables and repeat this process iteratively (a code sketch follows this list). For example, with 3 variables, for each t in 1…N iterations we would sample:
x1(t) ~ p(x1 | x2(t-1), x3(t-1))
x2(t) ~ p(x2 | x1(t), x3(t-1))
x3(t) ~ p(x3 | x1(t), x2(t))
- Metropolis-Hastings: this is usually the alternative to Gibbs when sampling from the complete conditionals isn't possible (i.e. when we cannot sample a variable conditioned on all the other ones). It works by proposing a candidate for the next step in the Markov chain – x(cand) – by sampling from a simple proposal distribution – q – built around x(t-1). Then we accept or reject the candidate with a certain probability (if it's rejected, the chain doesn't change); we'll sketch this one in code in the conclusion below. This acceptance probability is defined by:
α = min(1, [p(x(cand)) · q(x(t-1) | x(cand))] / [p(x(t-1)) · q(x(cand) | x(t-1))])
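To make Gibbs sampling concrete, here's a minimal sketch for a standard bivariate normal with correlation rho, a case where both full conditionals are known normal distributions. The target, the parameter values and the function name are my own choices for illustration:

```python
import numpy as np

def gibbs_bivariate_normal(n_iter=5_000, rho=0.8, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                  # initial values
    samples = np.empty((n_iter, 2))
    cond_std = np.sqrt(1 - rho ** 2)   # std of each full conditional
    for t in range(n_iter):
        # Sample each variable conditioned on the current value of the other
        x1 = rng.normal(rho * x2, cond_std)
        x2 = rng.normal(rho * x1, cond_std)
        samples[t] = x1, x2
    return samples

samples = gibbs_bivariate_normal()
print(samples.mean(axis=0))          # ≈ [0, 0]
print(np.corrcoef(samples.T)[0, 1])  # ≈ 0.8
```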
Conclusion
In short, MCMC methods consist of drawing random samples that are conditioned only on the previous value/step, and potentially deciding whether we keep them or not, repeating the process many times until the chains are formed.
To schematize, let’s define the algorithm in a set of steps:
- Get/Assign the initial values.
- For each iteration: a) sample the candidate from a distribution that only depends on the previous value (Markov chain); b) if we're using the Metropolis-Hastings algorithm, decide whether to accept or reject the candidate by computing the acceptance probability; c) update/store the new values.
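As an illustration of these steps, here's a minimal random-walk Metropolis-Hastings sketch. The target density (a standard normal, up to a constant), the proposal, and every parameter value are assumptions made for this example:

```python
import numpy as np

def metropolis_hastings(log_target, n_iter=10_000, x0=0.0, proposal_std=1.0, seed=0):
    """Random-walk Metropolis-Hastings sampler for a 1D target density."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    x = x0                                    # step 1: initial value
    for t in range(n_iter):                   # step 2: for each iteration...
        # a) propose a candidate that depends only on the previous value
        x_cand = rng.normal(x, proposal_std)
        # b) accept with probability min(1, p(x_cand) / p(x));
        #    the symmetric proposal q cancels out of the ratio here
        if np.log(rng.uniform()) < log_target(x_cand) - log_target(x):
            x = x_cand                        # accepted: the chain moves
        chain[t] = x                          # c) store the (possibly unchanged) value
    return chain

chain = metropolis_hastings(lambda x: -0.5 * x ** 2)  # log of a standard normal density
samples = chain[1_000::10]                    # burn-in + thinning, as described earlier
print(samples.mean(), samples.std())          # ≈ 0 and ≈ 1
```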
Thanks for reading the post!
I really hope you enjoyed it and found it insightful. There's a lot more to come, especially more AI-based posts I'm preparing.
Follow me and subscribe to my mailing list for more content like this one; it helps a lot!
@polmarin
Resources
[1] Wikipedia contributors. (2024, January 8). Stochastic process. In Wikipedia, The Free Encyclopedia. Retrieved 19:43, February 24, 2024, from https://en.wikipedia.org/w/index.php?title=Stochastic_process&oldid=1194369849
[2] Christopher Kazakis. (2021, January 8). See the Future with Stochastic Processes. Retrieved 19:43, February 24, 2024, from https://towardsdatascience.com/stochastic-processes-a-beginners-guide-3f42fa9941b5