Sports Analytics

Oh, the Champions League. Possibly the competition that attracts the most fans regardless of the team they support. It’s the best against the best. The show is almost guaranteed… And the outcome is almost impossible to predict.
But that shouldn’t stop us from trying.
The other day I was going through old college assignments and found one that inspired me to write this post, where we’ll use Bayesian Inference to create a model that tries to predict the next Champions League matches: the first leg of the Round of 16 (though, to be honest, it can be used for any match in any round).
The aim is to build a model through Bayesian Statistics applied to a real-case scenario that I find to be interesting and entertaining.
Whether you know about Bayesian Inference or not, this post is for you. Even if you already knew about everything I’m about to share, the final predictions will at least serve you to either congratulate or laugh at me after the first leg is played.
Here’s what we’ll go through today:
- Bayesian Inference & Modeling
- The Data
- Building the Model
- Validating the Model
- Predictions
- Conclusion, Improvements, and Future Steps
The full code is on my GitHub page which you’ll find in the resources section[1].
Bayesian Inference & Modeling
If you’re new to this, you’ll need a proper intro with the basic math fundamentals, which is what’s coming next. If you want to go straight to the point, feel free to skip this section.
As per Wikipedia, "Bayesian Inference is a method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available."[2]
That’s the formal definition, but there’s a lot of information here. I think it’s better if we dissect this sentence into different parts:
- "Bayesian Inference is a method of statistical inference" – Statistical inference is the process of using data analysis and statistics to infer properties of a population, by hypothesis testing and deriving estimates.
- "in which Bayes’ theorem is used to update the probability for a hypothesis" – Bayes’ theorem[3] describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
- "as more evidence or information becomes available" – Different from frequentist statistics, the data (here, evidence or information) is not the only factor we take into account, it is only used to update our prior beliefs.
To summarize: Bayesian Inference is the process of inferring properties from a certain population based on Bayes’ theorem by using the prior knowledge we possess (as a distribution) and the data to get the final distribution.
Lots of concepts, and many more still missing, but I won’t go in-depth today. However, we can’t ignore the Bayes’ theorem formula, as it’s the core of the whole process:
P(A|B) = P(B|A) × P(A) / P(B)
The formula itself is really simple and chances are you’ve already seen it before. If you haven’t, just take into account that P(A|B) is read as the probability of the event "A" happening given the event "B".
For example, we could read P(rain|clouds) as the probability of rain given that there are clouds in the sky.
Keeping up with the example, where A is "rain" and B is "clouds", the formula would be:
P(rain|clouds) = P(clouds|rain) × P(rain) / P(clouds)
Using this formula we could compute the probability of experiencing rain given that the sky is cloudy. For that, we’d have to know the probability of having a cloudy sky when it’s raining (P(clouds|rain)), the probability of rain at that given moment (P(rain)), and the probability of seeing clouds in the sky (P(clouds)).
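To make this concrete, here’s a tiny numerical sketch with made-up probabilities (they’re purely illustrative, not real weather data):

# Hypothetical values, just to see Bayes' theorem in action
p_clouds_given_rain <- 0.90   # P(clouds | rain)
p_rain              <- 0.10   # P(rain)
p_clouds            <- 0.40   # P(clouds)

p_clouds_given_rain * p_rain / p_clouds   # P(rain | clouds) = 0.225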
But this equation can also be expressed in terms of posterior, likelihood, prior and evidence:
- Posterior: P(rain | clouds) – is what we’re trying to infer.
- Likelihood: P(clouds | rain) – is the probability of observing the data that has been observed (clouds in the sky) assuming that the data came from a specific scenario (rain).
- Prior: P(rain) – The previous information we have about the event we’re interested in.
- Evidence: P(clouds) – It’s the data itself.
posterior = (likelihood × prior) / evidence
The evidence only normalizes the density so that it becomes a valid probability, and it’s just a constant. For that reason it’s often omitted: most of the time we’re only interested in the shape of the posterior distribution, not its exact probability values. So the final formula we’ll work with is
P(rain|clouds) ∝ P(clouds|rain) × P(rain)
meaning that the posterior distribution is proportional to the likelihood times the prior.
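To see that proportionality at work, here’s a toy grid-approximation sketch (my own illustrative example, unrelated to the model we’ll build later): we evaluate likelihood times prior over a grid of candidate values for a Bernoulli success probability and only normalize at the very end.

# Toy example: posterior for a win probability p after observing 7 wins in 10 matches
p_grid     <- seq(0, 1, length.out = 1000)
prior      <- dbeta(p_grid, 2, 2)                  # mild prior belief centered around 0.5
likelihood <- dbinom(7, size = 10, prob = p_grid)  # probability of the observed data for each p
posterior  <- prior * likelihood                   # posterior is proportional to this product
posterior  <- posterior / sum(posterior)           # normalizing plays the role of the evidence
p_grid[which.max(posterior)]                       # most plausible value of p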
Modeling, on its own, refers to the process of finding the statistical model that governs the behavior of any real process.
The Data
We’ve managed to survive the math; now it’s time to dig into the practical case and the data. We’re trying to predict the winner of a given match, so we’ll define θ as our hypothesis: a number that can either be 1 (the team won) or 0 (the team didn’t win).
So, we can assume that θ follows a Bernoulli distribution.[4] We don’t know the "p" parameter, and that’s what we’ll try to infer using Bayes’ theorem.
Note that θ doesn’t follow a linear model. Its distribution belongs, in fact, to the exponential family and, when this happens, we say we’re dealing with a generalized linear model (GLM). In our case, where the variable follows a Bernoulli distribution, the corresponding GLM is logistic regression.
Logistic regression is a model in which a certain variable can only be either 0 or 1, meaning success or failure, where π is the success probability. So
Y_i ~ Bernoulli(π_i)
where i = 1, …, n indexes our observations: all the matches we have, times 2 (one row per team).
As the distribution isn’t linear, we need a link function that provides the relationship between the linear predictor and the mean of the distribution function. It depends on the distribution we’re dealing with but the link function for our case is the logit:
logit(π_i) = log( π_i / (1 − π_i) )
We can now provide a linear shape to this logit function:
logit(π_i) = β1 + β2·x1,i + β3·x2,i + β4·x3,i
And these "x" are the data we’re going to use. So the Bernoulli parameter has now become a linear function of our data, which is amazing.
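If you want to play with this in R, the logit and its inverse are built in (a quick sketch, independent of the JAGS code we’ll write later):

p <- 0.75
qlogis(p)           # logit: log(p / (1 - p)), roughly 1.0986
plogis(qlogis(p))   # the inverse logit takes us back to 0.75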
Our model today will be based on the premise of simplicity. The data we’ll be using will only consist of three parameters: the number of goals the team scores, whether they play at home or not, and the average player ratings.
Keeping up with the simplicity, we’ll only be using the team’s data to predict its chances of winning, without caring about the rival. You’ll see why in the next section.
The catch is that we’re only going to use data from the current Champions League campaign (2023–24) – from the group phase. Obtaining open-source football data can be tricky nowadays so we’re going to use only a very limited set of matches to build our model, and we’ll compensate for that with Markov Chain Monte Carlo (MCMC) approximations.
Explaining what MCMC is falls outside this post’s scope, but I’ll put a link in the resources section to an article that explains it incredibly well: "A Gentle Introduction to Markov Chain Monte Carlo for Probability"[5].
Building The Model
Enough intro and theoretical comments. Let’s go straight to the code.
I had two options: Python or R. I chose the latter because I especially like the rjags library for the task we’re about to go through. However, you could easily do the same with Python either from scratch or using any other library you’d like.
JAGS[6] comes from Just Another Gibbs Sampler, and it’s a program for simulation from Bayesian hierarchical models using MCMC. Perfect for our scenario.
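One practical note before we build anything: the snippets below assume that JAGS itself is installed on your machine and that the rjags, coda, and ggplot2 packages are loaded. A minimal setup sketch:

install.packages(c("rjags", "coda", "ggplot2"))  # only needed once; rjags requires a local JAGS install
library(rjags)    # R interface to JAGS
library(coda)     # handling and diagnosing MCMC samples
library(ggplot2)  # plotting the final distributions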
So what we’ll do first is build the model:
modelString = "
model{
  # Likelihood
  for(i in 1:n){
    y[i] ~ dbern(pi[i])
    logit(pi[i]) <- beta[1] + beta[2]*goals[i] + beta[3]*is_home[i] + beta[4]*avg_rating[i]
  }
  # Prior distributions
  for (j in 1:4) {
    beta[j] ~ dnorm(0.0, 0.0001)
  }
}"

writeLines(modelString, con = "cl_model.txt")
In the model, which we store as a string and then into a TXT file, what we do is define the likelihood and the prior distributions:
- Likelihood: in a for loop that goes through all the data rows we have (192, which comes from 96 matches times 2 teams per match), we assign the Bernoulli distribution to the outcome variable of each sample. We then compute the logit function which, as you’ll recall, is a linear equation using the three stats we’re predicting with (plus an intercept).
- Prior distributions: we don’t know anything about the linear coefficients, so we just give them a normal distribution with mean 0 and a standard deviation of 100. Note that JAGS parameterizes the normal with a precision (the inverse of the variance) rather than a standard deviation, which is why the code reads dnorm(0.0, 0.0001); see the small conversion sketch below.
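Since the precision parameterization trips people up, here’s the tiny conversion the prior relies on:

# JAGS's dnorm() takes a precision (1/variance), not a standard deviation
sigma <- 100
tau   <- 1 / sigma^2   # 0.0001, the second argument of dnorm() in the model above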
What we do next is initialize the beta values using a normal distribution:
init_values <- function(){
  list(beta = rnorm(4, 0, 3))
}
init_values()

Time to import the data. If you’ve read my articles before, you know I’m a DuckDB fan[7], and that’s where I have all the data stored. As the SQL query itself isn’t relevant – only the data we extract is – I’ll omit this step and show the data straightaway:

We’re seeing one row per team and game, with each team’s stats and whether they won or not. For simplicity, the avg_rating feature is rounded.
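Since I’m not showing the query, here’s a hedged sketch of what loading it could look like with the duckdb R package; the file name, table name, and query are hypothetical placeholders, but the resulting columns are the ones we use below:

library(DBI)
library(duckdb)

# Hypothetical database file and table; the real query lives in my DuckDB setup
con  <- dbConnect(duckdb(), dbdir = "football.duckdb", read_only = TRUE)
data <- dbGetQuery(con, "SELECT team, win, goals, is_home, avg_rating FROM cl_group_stage")
dbDisconnect(con, shutdown = TRUE)

head(data)   # one row per team and match: win, goals, is_home, avg_rating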
To create the JAGS model, we’ll need to provide the data: not just the covariates used to predict but also some variables we use in the model, like the outcome column y or the number of observations n. To do so, we just create a list of values like this:
jags_data <- list(y = data$win,
                  goals = data$goals,
                  is_home = data$is_home,
                  avg_rating = data$avg_rating,
                  n = nrow(data))
We have defined our model, prepared the function that initializes values (and executed it), and put the data in the proper JAGS format. It’s now time to run the model and find the proper beta values.
JAGS uses different MCMC samplers to approximate the posterior distribution and automatically chooses the proper one, but it needs a prior "adaptation" period. Here’s how we do it:
jagsModel <- jags.model(file = "cl_model.txt",
                        data = jags_data,
                        inits = init_values,
                        n.chains = 3,
                        n.adapt = 1000)
This jags.model function needs the model file, the data we just prepared, and the function that initializes values. Then, we set the number of Markov chains to 3 and the number of iterations for adaptation to 1000.
It then shows some info relevant to the model and its nodes, but we don’t have to do anything with it.
To ensure convergence, we’ll apply what’s called "burn-in", which is just a way of saying that we discard a number of initial iterations because we don’t know whether they already fall within the posterior region. It’s as easy as running the next function:
update(jagsModel, n.iter=10000)
Lastly, we produce the inference by telling JAGS to generate MCMC samples (these samples will then be used to generate the posterior distribution). To do so, we use coda.samples() as follows:
samples <- coda.samples(jagsModel,
                        variable.names = c("beta"),
                        n.iter = 10000,
                        thin = 10)
We pass the model as the first parameter, then specify the parameters whose values we want to record (beta), and set the number of iterations for each chain to 10,000.
The last parameter, thin, is crucial here. As we’re working with Markov chains, we expect the samples to be autocorrelated. For that reason, it’s common to use thinning and keep only 1 out of every X iterations in the final samples. In this case, I’m using X = 10.
Now run the code and let it produce the samples.
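As a quick sanity check on the thinning, coda can report the effective sample size and the residual autocorrelation of the retained samples (a small sketch, not part of the original workflow):

effectiveSize(samples)   # ideally close to the 3,000 retained draws (3 chains x 1,000 each)
autocorr.diag(samples)   # leftover autocorrelation at a few lags, per beta coefficient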
Validating the Model
We’re almost there!
We’ve already got our samples. Let’s now check that everything’s fine by examining the chains. Remember we decided to run 3 of them instead of 1? That was purely so we can compare them.
We’re going to use here an R module created by Kruschke, J. K. (2015) called DBDA2E-utilities.R[8]. From it, we’ll first take advantage of the diagMCMC() function by passing the samples as the first parameter and then the parameter name as the second. We’ll try, for example, with the second one (beta[2]):
diagMCMC(codaObject = samples, parName = "beta[2]")

This is what we see in each of these graphs:
- Top-left: Trace plot, to check that the chains converge and that none of them wanders off on its own.
- Bottom-left: How the Gelman-Rubin statistic evolves with each iteration. If you’re not used to it, just keep in mind that we want it as close to 1 as possible (there’s a numerical check right after this list).
- Top-right: Autocorrelation. Since we want as little of it as possible, we want to see the lines drop quickly.
- Bottom-right: Density plot, to check whether the three chains are well superimposed.
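If you prefer a number over the bottom-left plot, coda computes the Gelman-Rubin statistic (the potential scale reduction factor) directly; a quick sketch:

gelman.diag(samples)   # point estimates close to 1 mean the three chains agree
gelman.plot(samples)   # the same statistic plotted as it evolves over iterations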
We’re now confident that our chains look fine. Let’s see the actual posterior distributions, shall we?
From the same file module we just used, we’ll now take advantage of the plotPost() function:
plotPost( samples[,"beta[2]"])


For now, everything looks reasonably fine. We’re ready to start predicting.
Predictions
Suppose we want to predict Napoli vs Barcelona. To compute the distributions of each team’s winning probabilities, we would need to know in advance the match’s predicted goals and the predicted average player ratings. There’s no way to know these before the match gets played, but we could use two strategies:
- Build different models to predict these variables (neural networks, for example).
- Use the last-X matches average from these stats.
Today, we’ll go with the second one, and we’ll use the 5-game average to predict the winning chance distributions. This adds value to our predictions because it takes into account the team’s current form.
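I won’t show the exact query I used to build those averages, but a hedged dplyr sketch of the idea could look like this, assuming a hypothetical matches data frame with team, date, goals, and rating columns (one row per team and game):

library(dplyr)

# Hypothetical input: matches(team, date, goals, rating), one row per team and game
avg_data <- matches %>%
  group_by(team) %>%
  arrange(desc(date)) %>%                    # most recent games first
  slice_head(n = 5) %>%                      # keep each team's last 5 matches
  summarise(goals_avg  = mean(goals),        # the columns used for prediction below
            rating_avg = mean(rating))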
This is the DF we’re now going to use, which now contains data from all competitions, not just the Champions League:

As our model predicts team by team, we’ll have to perform two predictions and overlap both histograms into one single plot. Let’s start with a real example: RB Leipzig vs Real Madrid.
I’m finishing this post as of February 15th and this game was played two days ago, so using it as an example is a good way to check whether the predictions are accurate. To be clear: data from the matches already played this week isn’t used for these predictions.
We need to filter the data from this DF and get each team’s:
# Home team data
home_team <- 'RBL'
home_pred_home <- 1
goals_pred_home <- avg_data[avg_data$team == home_team, 'goals_avg']
avg_rating_pred_home <- avg_data[avg_data$team == home_team, 'rating_avg']

# Away team data
away_team <- 'Real Madrid'
home_pred_away <- 0
goals_pred_away <- avg_data[avg_data$team == away_team, 'goals_avg']
avg_rating_pred_away <- avg_data[avg_data$team == away_team, 'rating_avg']
Now that we have valid values for each variable, we need to run the predictions. Recall the logit function allowed us to convert this problem into a linear one, and we found the coefficients through JAGS. Now it’s time to use these and the averages to get the linear predictor values:
predictor_pred_home <- samples[[1]][,1] +
  samples[[1]][,2]*goals_pred_home +
  samples[[1]][,3]*home_pred_home +
  samples[[1]][,4]*avg_rating_pred_home

predictor_pred_away <- samples[[1]][,1] +
  samples[[1]][,2]*goals_pred_away +
  samples[[1]][,3]*home_pred_away +
  samples[[1]][,4]*avg_rating_pred_away
And finally, we compute the inverse of the logit function to find the final pi estimate:
pi_pred_home <- as.numeric(exp(predictor_pred_home)/(1+exp(predictor_pred_home)))
pi_pred_away <- as.numeric(exp(predictor_pred_away)/(1+exp(predictor_pred_away)))
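As a side note, R’s built-in plogis() computes exactly this inverse logit, so the same two lines could also be written as:

pi_pred_home <- as.numeric(plogis(predictor_pred_home))
pi_pred_away <- as.numeric(plogis(predictor_pred_away))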
To grasp the full value of these predictions, we’ll now plot them into a histogram:
preds <- data.frame(pi_pred_home, pi_pred_away)
names(preds) <- c(home_team, away_team)

ggplot(preds) +
  geom_histogram(aes(x = pi_pred_home, y = ..density..,
                     color = home_team, fill = home_team),
                 bins = 30, alpha = 0.5) +
  geom_histogram(aes(x = pi_pred_away, y = ..density..,
                     color = away_team, fill = away_team),
                 bins = 30, alpha = 0.5) +
  theme_light() +
  xlim(c(0,1)) +
  xlab(expression(pi)) +
  theme(axis.text = element_text(size = 15),
        axis.title = element_text(size = 16))
And the result:

How do we interpret this? The most direct way is to take the median of both teams’ predictions (computed from the posterior samples, as sketched right after this list) and compare them against each other:
- For RB Leipzig, the median is at 37.07%.
- For Real Madrid, the median is at 58.84%.
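A quick sketch of how those numbers, plus a head-to-head summary the plot doesn’t show directly, come out of the samples:

median(pi_pred_home)                # the 37.07% reported above for RB Leipzig
median(pi_pred_away)                # the 58.84% reported above for Real Madrid
mean(pi_pred_away > pi_pred_home)   # fraction of samples in which Madrid's chance is the higher one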
From these numbers, we could conclude that both teams have been doing quite well recently, but Madrid seems to be better, though not by much. So the model leans towards Madrid, but without much certainty.
Taking into account that the match ended 0–1, we can be extremely proud of our model.
Copenhagen vs City (1–3), Lazio vs Bayern (0–1), and PSG vs Real Sociedad (2–0) have already been played, so we can use these to assess our model’s accuracy as well:

This is impressive. The model not only predicted these games with 100% accuracy, but it also gave Lazio a higher winning chance against Bayern even though they were the underdogs and the odds were clearly against them.
Almost everyone, myself included, would have bet on Bayern as the favorite, but the model didn’t, and Lazio ended up winning 1–0.
So far, 4/4 correctly predicted. So let’s predict the 4 remaining matches being played next week!

These look reasonable again:
- Inter vs Atlético de Madrid: Inter is the clear favorite (personally not so sure about it).
- PSV vs Dortmund: No clear favorite but Dortmund seems to be in better form and its winning chances seem higher… But a draw wouldn’t be unexpected.
- Porto vs Arsenal: Arsenal is, without a doubt, the model’s favorite. Arsenal’s distribution is thin and tall, meaning the confidence in that prediction is higher, while Porto’s is quite wide, meaning the model is less sure about what it’s predicting and places the Portuguese team’s chance of winning somewhere between 25% and 70%.
- Napoli vs Barcelona: Similar to the case above, the model doesn’t see Napoli winning the game, while for Barça it can’t pin down the winning probability with confidence, placing it somewhere between 50% and 80% (being a Barça fan myself, that seems too high to me).
Now we just have to wait, enjoy the matches, and come back to see how many predictions our Bayesian model got right!
Conclusions, Future Steps, and Improvements
What we built here was rather simple… Yet we can say the predictions were reasonable. Probably more than reasonable, given that the model was accurate on every match already played and even called one underdog’s win.
However, this model is too simple to expect great predictions from it. For example, the Napoli vs Barcelona prediction favors Barça because the model doesn’t take into account goals conceded and that’s the team’s major flaw this season.
That’s why I wanted to dedicate a final section to suggest just a few potential improvements to turn this model into a high-class one:
- Expand the feature set: a proper feature analysis should be performed to assess which features we’ve left behind would be great parameters for our model. For example: win%, possession, shots, tackles, xG…
- Include rival data: the current model predicts the team’s winning distribution based on the team’s stats only, without taking into account the rivals they face. If we added the other team’s info, the model would be more complete.
- Use more data: Even if we used MCMC, another solution would have been to use a lot more data from past CL seasons to create the model and make it (potentially) more robust.
- Create a machine learning model to predict the team’s features instead of using simple averages.
Many others could be added to this list. Feel free to implement these on your own!
Thanks for reading the post!
I really hope you enjoyed it and found it insightful. There's a lot more to come, especially more AI-based posts I'm preparing.
Follow me and subscribe to my mail list for more content like this one; it helps a lot!
@polmarin
Resources
[1] Pol Marín (2024). Bayesian Inference in Champions League. https://github.com/polmarin/bayesian_inference_champions_league
[2] Wikipedia contributors. (2024, January 31). Bayesian inference. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Bayesian_inference&oldid=1201216717
[3] Wikipedia contributors. (2024, January 12). Bayes’ theorem. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Bayes%27_theorem&oldid=1195202672
[4] Wikipedia contributors. (2024, January 25). Bernoulli distribution. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Bernoulli_distribution&oldid=1199096378
[5] Jason Brownlee (2019, September 25). A Gentle Introduction to Markov Chain Monte Carlo for Probability. In Machine Learning Mastery. https://machinelearningmastery.com/markov-chain-monte-carlo-for-probability/
[6] JAGS – Just Another Gibbs Sampler. https://mcmc-jags.sourceforge.io/
[7] Pol Marín (2023, March 16). Forget about SQLite, Use DuckDB Instead – And Thank Me Later. In Medium. https://towardsdatascience.com/forget-about-sqlite-use-duckdb-instead-and-thank-me-later-df76ee9bb777
[8] Kruschke, J. K. (2015). DBDA2E-utilities. GitHub. https://github.com/kyusque/DBDA2E-utilities/