
Football and Geometry – Passing Networks

Analyzing Bayer Leverkusen's Passing Networks from Last Season

Football Analytics

Understanding networks through the analysis of Bayer Leverkusen’s passing networks

Photo by Clint Adair on Unsplash

Long time no see… But for a good reason.

After some months I’m back on Medium and today we’re merging two exciting worlds: football and geometry.

Concretely, we’ll touch upon the topic of networks but, as always, through a practical case. We’ll study football passing networks focusing on last year’s Bayer Leverkusen matches.

The Bundesliga winners had an amazing season playing outstanding football under Xabi Alonso. I’m curious to investigate how that translates to mathematical terms and understand their playing style and most relevant players through their passing networks.

Networks are a well-established tool for studying the connections between nodes, and their application to football is no different. It’s basic stuff, in fact, but it’s worth dedicating a post to it for anyone who hasn’t seen one yet.

StatsBomb[1] has high-quality data and, luckily for us, they’ve made all of Bayer Leverkusen’s games from last season freely available to everyone.

Here’s what we will go through today:

  1. Introduction to passing networks
  2. Building the network
  3. Metrics and Analysis
  4. Conclusion

Before going on, I want to share that all the code in this post is mine but has been inspired by the amazing Soccermatics course David Sumpter created. You can find the link to the course in the Resources section at the end of this post[2].

Introduction to Passing Networks

Passing networks are graphical representations of how players interact with each other during a football match, visualizing the flow of passes between teammates. They contain two main elements:

  • Nodes – represent players and are located in the average position where the player either passed or was passed a ball.
  • Edges (or lines) – represent the passes being made between the two players (nodes) connected by that line.

The thickness of the edges usually represents the frequency of passes and the size of the nodes the number of passes made by a player.
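To make the two elements concrete, here is a minimal sketch that turns a handful of passes into nodes and edges using only the standard library. The pass list and its coordinates are made up for illustration and are not taken from the StatsBomb data. Node size is based on involvements (passes made or received), which is one reasonable choice among several.

```python
from collections import Counter, defaultdict

# Hypothetical passes: (passer, recipient, start_x, start_y, end_x, end_y)
toy_passes = [
    ("Xhaka", "Wirtz", 50.0, 40.0, 70.0, 30.0),
    ("Wirtz", "Xhaka", 72.0, 32.0, 52.0, 42.0),
    ("Xhaka", "Grimaldo", 48.0, 38.0, 60.0, 10.0),
]

positions = defaultdict(list)   # every location where a player passed or received
edge_counts = Counter()         # undirected player pair -> number of passes

for passer, recipient, x, y, end_x, end_y in toy_passes:
    positions[passer].append((x, y))
    positions[recipient].append((end_x, end_y))
    # Sorting the pair makes A->B and B->A count towards the same edge
    edge_counts[tuple(sorted((passer, recipient)))] += 1

# Node = average location plus number of involvements (passes made or received)
nodes = {
    player: {
        "x": sum(loc[0] for loc in locs) / len(locs),
        "y": sum(loc[1] for loc in locs) / len(locs),
        "involvements": len(locs),
    }
    for player, locs in positions.items()
}

print(nodes["Xhaka"])
print(edge_counts)
```

In this toy case, Xhaka ends up as the biggest node (three involvements) and the Wirtz-Xhaka edge is the thickest (two passes).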

This visual and analytical tool is increasingly used to assess team shape, player involvement, and tactical patterns in modern football. If we, as data scientists, use additional math to compute other metrics, we get even more advanced insights about the team’s passing characteristics.

Here are some of the specific situations where passing networks can come in handy:

  • Understand Team Structure. As it contains the average positions where these players were during these passing events, we can get an idea of the team’s structure and, therefore, how they play. For example, a compact and dense network might indicate that the team prefers short passes and possession-based football, while a more spread-out network could suggest a direct or counter-attacking approach.
  • Identifying Key Players. A key metric in passing networks is centrality, which is used to assess whether the network relies mostly on a small subset of players or not. If we see, for example, Pirlo being key to Italy’s passing network, we might want to put a great defense on him the next time we play against the Italians.
  • Tactical Analysis. Coaches and analysts can spot tactical strengths and weaknesses. For instance, if a team’s passing network shows strong connections down the left side, it might indicate a reliance on attacking through that flank. Also, that can be used to compare different matches and analyze how it changed from one opponent to the other.

Building the network

We’ll use StatsBomb’s free data and access it through the statsbombpy[3] Python library. Let’s start by importing the libraries and retrieving all the events from the 2023/24 Bundesliga season (as we’re using free data only, we’ll only receive events from Bayer Leverkusen’s matches):

from collections import Counter

from mplsoccer import Pitch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsbombpy import sb

events_df = sb.competition_events(
    country="Germany",
    division="1. Bundesliga",
    season="2023/2024",
    gender="male"
).sort_values(['match_id', 'minute', 'second'])

We now have a lot of events and columns in this data frame, so we should filter out unneeded rows. Additionally, we’re going to perform a little bit of feature engineering by splitting the location and pass_end_location columns into two separate columns each (one for the x-axis and the other for the y-axis):

passes_df = events_df[
    (events_df['type'] == 'Pass') 
    & (events_df['team'] == 'Bayer Leverkusen')
    & (events_df['pass_outcome'].isna()
)].reset_index(drop=True)[['match_id', 'location', 'pass_end_location', 'player', 'pass_recipient']]

# Define start and end positions
ini_locs_df = pd.DataFrame(passes_df["location"].to_list(), columns=['x', 'y'])
end_locs_df = pd.DataFrame(passes_df["pass_end_location"].to_list(), columns=['end_x', 'end_y'])
passes_df = pd.concat([passes_df, ini_locs_df, end_locs_df], axis=1)

# Reshape and rename columns
passes_df.drop(columns=['location', 'pass_end_location'], inplace=True)
passes_df.columns = ['match_id', 'player_name', 'pass_recipient_name', 'x', 'y', 'end_x', 'end_y']

After keeping only the columns we’re interested in, some decisions should be made now. Bayer Leverkusen, like any other team, doesn’t only have 11 players in their squad. If we want to show a passing network, we only want to show 11 (as if it were a match lineup). But a decision has to be made on the method to exclude players.

A good approach would probably have been to take the team’s most-used formation and choose, for each position, the player who played the most minutes there. But that seemed too complex for this post, so I decided to keep it simple: use the most-used lineup (that is, the most frequent set of 11 players starting games).

Here’s the code that handles this, after first shortening player names to surnames only:

# Show surname only
passes_df["player_name"] = passes_df["player_name"].apply(lambda x: str(x).split()[-1])
passes_df["pass_recipient_name"] = passes_df["pass_recipient_name"].apply(lambda x: str(x).split()[-1])
passes_df.loc[:, ["player_name", "pass_recipient_name"]] = passes_df.loc[:, ["player_name", "pass_recipient_name"]].replace('García', 'Grimaldo')

# Select most-used lineup and keep those players
used_lineups = []
for match_id in passes_df.match_id.unique():
    match_lineup = sb.lineups(
        match_id=match_id
    )['Bayer Leverkusen']

    match_lineup['starter'] = match_lineup['positions'].apply(
        lambda x: x[0]['start_reason'] == 'Starting XI' if x else False
    )

    match_lineup["player_name"] = match_lineup["player_name"].apply(lambda x: str(x).split()[-1])
    match_lineup.loc[:, "player_name"] = match_lineup.loc[:, "player_name"].replace('García', 'Grimaldo')

    starters = sorted(match_lineup[match_lineup['starter']==True].player_name.tolist())
    used_lineups.append(starters)

most_used_lineup_players = Counter([', '.join(c) for c in used_lineups]).most_common()[0][0].split(", ")
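The counting step can be sketched in isolation. The version below is a small variation on the snippet above, not the original code: it uses sorted tuples as Counter keys instead of joining names into a string and splitting them again. The lineups are hypothetical and cut down to three players each for readability.

```python
from collections import Counter

# Hypothetical starting lineups, shortened to three players per match
toy_lineups = [
    ["Xhaka", "Wirtz", "Grimaldo"],
    ["Wirtz", "Grimaldo", "Xhaka"],   # same players, different order
    ["Xhaka", "Wirtz", "Hofmann"],
]

# Sorting each lineup into a tuple makes the order of names irrelevant,
# so identical sets of starters collapse onto the same key
lineup_counts = Counter(tuple(sorted(lineup)) for lineup in toy_lineups)
most_used, times_used = lineup_counts.most_common(1)[0]

print(most_used, times_used)
```

Tuple keys also sidestep any trouble with names that happen to contain the separator string.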

I ended up reducing the DF’s dimensionality by keeping only the columns we need:

# Player names were already shortened and corrected in the previous snippet,
# so only the column selection is left
passes_df = passes_df[['x', 'y', 'end_x', 'end_y', "player_name", "pass_recipient_name"]]

This is what the DF looks like at this point:

Top 5 rows in the passes_df data frame - image by the author

We now need to get the number of passes per player and their location (node size and location) and the number of passes between pairs of players (edge thickness). Let’s start with the first one by creating a new DF:

# Stack passers and recipients into one long DF: one row per player involvement
long_df = pd.concat([
    passes_df[["player_name", 'x', 'y']], 
    passes_df[["pass_recipient_name", 'end_x', 'end_y']].rename(columns={'pass_recipient_name': 'player_name', 'end_x': 'x', 'end_y': 'y'})
])

# Create DF with average player positions
nodes_df = long_df.groupby('player_name').mean().reset_index()

nodes_df = nodes_df[nodes_df['player_name'].isin(most_used_lineup_players)]

# Add number of passes in which each player participated (made or received)
nodes_df = nodes_df.merge(long_df.groupby('player_name').agg(passing_participation=('x', 'count')).reset_index())

# Add marker_size column to have it relative to the passing participation
nodes_df['marker_size'] = (nodes_df['passing_participation'] / nodes_df['passing_participation'].max() * 1500)

We’re first creating a DF that contains only three columns (player_name, x, y), stacking passers and recipients, and then grouping by the player name to compute the average x and y. After that, we filter out the players who aren’t in the most common lineup and merge the result with a per-player count of the passes in which each player participated (made or received). This way, we end up with a DF with one row per player, his average pitch location, and his passing participation.

The last step adds the marker size, which will be used later. This is what it looks like:

Top 5 rows of nodes_df - image by the author

Let’s keep on with the edges DF. We’ll create a new column in a copy of passes_df that identifies the pair of players involved in each pass. As we don’t care about directionality, we sort each pair alphabetically, after filtering out the players we don’t need:

edges_df = passes_df.copy()
edges_df = edges_df[edges_df['player_name'].isin(most_used_lineup_players)]

edges_df['player_pair'] = edges_df.apply(
    lambda x: "-".join(sorted([x["player_name"], x["pass_recipient_name"]])), 
    axis=1)

This will be used as a key for a groupby next:

edges_df = (
    edges_df.groupby(["player_pair"])
            .agg(passes_made=('x', 'count'))
            .reset_index()
)
# Keep only pairs with more than 238 passes (7 per game over 34 games)
filtered_edges_df = edges_df[edges_df['passes_made'] > 238]

To create edges_df, I’m grouping by the new column we just created and counting the number of passes between each pair of players. Then we keep only the pairs with more than 238 passes, which corresponds to an average of more than 7 passes per game over the 34-game season (7 × 34 = 238).

This is not accurate because not all players played 34 games so their averages might be higher and still be left out… But we want to keep things simple.
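To illustrate that caveat, here is a small sketch comparing the season-total cutoff with a per-shared-match rate. The pair labels, pass counts, and shared-match counts are hypothetical, and the per-match variant is only one possible alternative, not what this post uses.

```python
GAMES_IN_SEASON = 34          # Bundesliga matches per team
MIN_PASSES_PER_GAME = 7

# Season-total cutoff: 7 passes per game over 34 games
threshold = GAMES_IN_SEASON * MIN_PASSES_PER_GAME

# Hypothetical pairs: (passes exchanged, matches in which both players featured)
toy_pairs = {
    "A-B": (300, 34),
    "C-D": (200, 20),   # fewer total passes, but 10 per shared match
}

# Filter on the season total, as done in this post
kept_by_total = [
    pair for pair, (passes, _) in toy_pairs.items() if passes > threshold
]

# Alternative: filter on passes per match the pair actually shared
kept_by_rate = [
    pair for pair, (passes, shared) in toy_pairs.items()
    if passes / shared > MIN_PASSES_PER_GAME
]

print(threshold, kept_by_total, kept_by_rate)
```

The pair C-D is dropped by the season-total filter even though it exchanges more passes per shared match than A-B, which is exactly the situation described above.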

Anyway, let’s see it:

Top 5 rows of player-pairs and passes made - image by the author

We’re now ready to plot and we’ll use mplsoccer[4] to display a field to draw upon. The full code taking care of the visualization is shown next:

pitch = Pitch(line_color='grey')
fig, ax = pitch.grid(grid_height=0.9, title_height=0.06, axis=False,
                     endnote_height=0.04, title_space=0, endnote_space=0)
pitch.scatter(nodes_df.x, nodes_df.y, s=nodes_df.marker_size, color='rosybrown', edgecolors='lightcoral', linewidth=1, alpha=1, ax=ax["pitch"], zorder = 3)

for i, row in nodes_df.iterrows():
    pitch.annotate(row.player_name, xy=(row.x, row.y), c='black', va='center', ha='center', weight = "bold", size=16, ax=ax["pitch"], zorder = 4)

all_players = nodes_df["player_name"].tolist()
for i, row in filtered_edges_df.iterrows():
    player1 = row["player_pair"].split('-')[0]
    player2 = row['player_pair'].split('-')[1]

    if player1 not in all_players or player2 not in all_players:
        continue

    player1_x, player1_y = nodes_df.loc[nodes_df["player_name"] == player1, ['x', 'y']].values[0]
    player2_x, player2_y = nodes_df.loc[nodes_df["player_name"] == player2, ['x', 'y']].values[0]

    # Scale line width relative to the most frequent pair (max width of 10)
    line_width = (row["passes_made"] / edges_df['passes_made'].max() * 10)

    pitch.lines(player1_x, player1_y, player2_x, player2_y,
                    alpha=1, lw=line_width, zorder=2, color="lightcoral", ax = ax["pitch"])

fig.suptitle("Bayer Leverkusen's Passing Network (2023/24)", fontsize = 25)
plt.show()

And, finally, let’s see the resulting passing network:

Bayer Leverkusen’s passing network using the most-used lineup and showing only those connections with more than 238 passes – image by the author

Quite nice, right?

Metrics and Analysis

As data scientists, we can’t stop here. Developing the code and creating the visualization is key, but we actually need to derive some insights from it. Otherwise, it’s useless.

So let’s define some extra metrics such as network centrality and passing rate.

  • The passing rate refers to the number of successful passes per minute of possession.
  • Network centrality has already been explained before but it measures how influential a player is within the team’s passing structure.

Thomas Grund[5] has already done the job for us by inspecting how the passing rate relates to a team’s goal scoring. In short, he found that teams with a passing rate of 5 successful passes per minute of possession scored 20% more goals than teams with a rate of 3.

So it is a good proxy to measure attacking performance (or, at least, goal-scoring probabilities). Let’s compute that for today’s case study:

events_df['total_minutes'] = (events_df['second']/60) + events_df['minute']
events_df['event_duration'] = events_df.groupby('match_id')['total_minutes'].diff().shift(-1)

# Group by team in possession and sum the event durations
possession_minutes = events_df.groupby(['possession_team'])['event_duration'].sum()['Bayer Leverkusen']
passing_rate = len(passes_df)/possession_minutes

Here’s what the previous snippet does:

  1. Create a new column with the event time in minutes (total_minutes).
  2. Create a new column with the minutes that go by between one event and the next (as a way to measure each event’s duration).
  3. Compute the total minutes Bayer Leverkusen had the ball throughout the season.
  4. Compute the passing rate as the number of successful passes made by the team (contained in passes_df) divided by the minutes in possession.

And the result is: 10.87.
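The same four steps can be followed by hand on a tiny, made-up event log. This is a stdlib-only sketch with hypothetical teams and timestamps, so its result has nothing to do with Leverkusen’s 10.87. It only shows how event durations and possession minutes add up.

```python
# Hypothetical event log for one match:
# (minute, second, team in possession, is a completed pass)
toy_events = [
    (0, 0,  "Home", True),
    (0, 20, "Home", True),
    (0, 30, "Away", False),
    (1, 0,  "Home", True),
    (1, 30, "Home", False),   # last event: nothing follows, so no duration
]

possession_minutes = 0.0
completed_passes = 0

for i, (minute, second, team, is_pass) in enumerate(toy_events):
    if team != "Home":
        continue
    completed_passes += is_pass
    if i + 1 < len(toy_events):
        # An event is assumed to last until the next event starts
        next_minute, next_second, _, _ = toy_events[i + 1]
        start = minute + second / 60
        end = next_minute + next_second / 60
        possession_minutes += end - start

passing_rate = completed_passes / possession_minutes
print(round(passing_rate, 2))
```

Here the home team holds the ball for one minute in total (20 + 10 + 30 seconds) and completes three passes, so its rate is 3 passes per minute of possession.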

So, if going from 3 to 5 successful passes per minute of possession translated to 20% more goals, imagine if a team has a rate of 10.87. This is how good Leverkusen were.

To put it in even more context, do you remember Spain’s 2012 UEFA Euro team? Busquets, Xavi, Iniesta, Xabi Alonso… Their passing rate in the game against Croatia was 9.65. Leverkusen’s sustained average of 10.87 over the whole season is impressive, considering they managed better numbers than a team that many consider the best to have ever played the game.

Xabi Alonso tried to replicate (and succeeded) what he had done already as a player, now as a coach, and with a less-talented team (still a very good team though).

Moving on to network centrality, this metric is going to be key as well to know if the game relied on a small subset of players or not. There are many ways to compute it such as using the Degree Centrality metric, the Betweenness Centrality, or the Eigenvector Centrality (to name a few).

For simplicity, we’ll stick to the way David Sumpter computes it in Soccermatics[2] but slightly changed: We sum the difference between the max number of successful passes made/received by one player and the number of successful passes made/received by each player, and divide it by the sum of all passes multiplied by the number of nodes in a network minus 1:

# Find the highest passing participation of any player
max_passing_participations = nodes_df['passing_participation'].max()
# Denominator: (number of nodes - 1) * total passes; each pass is counted twice
# in passing_participation (passer and recipient), hence the division by 2
denominator = 10*(nodes_df['passing_participation'].sum()/2)
# Numerator: summed gap between the top player and every player
numerator = (max_passing_participations - nodes_df['passing_participation']).sum()
# Centralization index
centralization_index = numerator/denominator
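Written as a formula, the snippet above computes the following, where the symbols are my own notation rather than Soccermatics’: p_i is the passing participation of player i, p_max is the highest participation in the team, N is the number of nodes (11 here, so N - 1 = 10), and P is the total number of passes.

```latex
C = \frac{\sum_{i=1}^{N} \left( p_{\max} - p_i \right)}{(N - 1)\, P},
\qquad
P = \frac{1}{2} \sum_{i=1}^{N} p_i
```

The factor 1/2 is there because every pass adds one to the passer’s participation and one to the recipient’s. Since participation also includes passes exchanged with teammates outside the selected eleven, P should be read as an approximation of the passes within the network.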

Using this approach, we get a centralization index of 0.1988, or 19.88%. The closer to 0, the less centralized it is, so around 20% can still be considered low and that’s correlated with an 8% increase in the probability of scoring.

To add even more context, in the same game mentioned above between Spain and Croatia, the Spanish had a centralization index of 14.6%. So Leverkusen’s metric, aggregated throughout the season (not just one game like the Euro’s), is pretty close to that 14.6%.
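To get a feel for the scale of the index, the sketch below applies the same calculation to two extreme, hypothetical networks of 11 players: one where everyone is equally involved and one where every pass goes through a single hub. The function name and the participation counts are illustrative assumptions.

```python
def centralization_index(participation):
    """Centralization as computed in this post; one participation count per player."""
    n_players = len(participation)
    # Each pass is counted once for the passer and once for the recipient
    total_passes = sum(participation) / 2
    top = max(participation)
    numerator = sum(top - p for p in participation)
    return numerator / ((n_players - 1) * total_passes)

# Hypothetical networks of 11 players
even_network = [20] * 11             # everyone equally involved
star_network = [100] + [10] * 10     # every pass involves the same hub player

print(centralization_index(even_network))
print(centralization_index(star_network))
```

The even network scores 0 and the hub-and-spoke network scores 0.9, so with this definition Leverkusen’s 0.1988 sits much closer to the evenly shared end of the range.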

Conclusion

Today we learned the core components of a network – nodes and edges – and also what their properties mean (size and width). Additionally, we learned the concept of centrality in a network.

We did all this through a real-case scenario, using Bayer Leverkusen’s 2023/24 season data by leveraging their passing networks and analyzing them mathematically.

To extract some insights, we found their network was that of a team that plays strong possession football, and both metrics we computed were similar to the ones seen in Spain when they faced Croatia in the 2012 Euros.

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more AI-based posts I'm preparing.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] StatsBomb. (n.d.). Home. StatsBomb. Retrieved September 14, 2024, from https://statsbomb.com

[2] Soccermatics. (n.d.). Soccermatics documentation. Retrieved September 13, 2024, from https://soccermatics.readthedocs.io

[3] StatsBombPy. GitHub. Retrieved September 14, 2024, from https://github.com/statsbomb/statsbombpy

[4] mplsoccer. (n.d.). mplsoccer documentation. Retrieved September 14, 2024, from https://mplsoccer.readthedocs.io/en/latest/index.html

[5] Grund, T. U. (2012). Network structure and team performance: The case of English Premier League soccer teams. Social Networks, 34(4), 682–690. https://doi.org/10.1016/j.socnet.2012.08.004

