Pol Marin, Author at Towards Data Science – https://towardsdatascience.com/author/polmarin/

Efficient Data Handling in Python with Arrow (Tue, 25 Feb 2025)
Introducing Arrow to those who are still unaware of its power

1. Introduction

We’re all used to working with CSVs, JSON files… With traditional libraries and large datasets, these can be extremely slow to read, write, and operate on, leading to performance bottlenecks (been there). It’s precisely with big amounts of data that handling it efficiently becomes crucial for our data science/analytics workflow, and this is exactly where Apache Arrow comes into play.

Why? The main reason resides in how the data is stored in memory. While JSON and CSVs, for example, are text-based formats, Arrow is a columnar in-memory data format (and that allows for fast data interchange between different data processing tools). Arrow is therefore designed to optimize performance by enabling zero-copy reads, reducing memory usage, and supporting efficient compression. 

Moreover, Apache Arrow is open-source and optimized for analytics. It is designed to accelerate big data processing while maintaining interoperability with various data tools, such as Pandas, Spark, and Dask. By storing data in a columnar format, Arrow enables faster read/write operations and efficient memory usage, making it ideal for analytical workloads.

Sounds great, right? What’s best is that this is all the introduction to Arrow I’ll provide. Enough theory; we want to see it in action. So, in this post, we’ll explore how to use Arrow in Python and how to make the most out of it.

2. Arrow in Python

To get started, you need to install the necessary libraries: pandas and pyarrow.

pip install pyarrow pandas

Then, as always, import them in your Python script:

import pyarrow as pa
import pandas as pd

Nothing new yet, just necessary steps to do what follows. Let’s start by performing some simple operations.

2.1. Creating and Storing a Table

The simplest thing we can do is hardcode our table’s data. Let’s create a two-column table with football data:

teams = pa.array(['Barcelona', 'Real Madrid', 'Rayo Vallecano', 'Athletic Club', 'Real Betis'], type=pa.string())
goals = pa.array([30, 23, 9, 24, 12], type=pa.int8())

team_goals_table = pa.table([teams, goals], names=['Team', 'Goals'])

The format is pyarrow.table, but we can easily convert it to pandas if we want:

df = team_goals_table.to_pandas()

And restore it back to arrow using:

team_goals_table = pa.Table.from_pandas(df)

And we’ll finally store the table in a file. We could use different formats, like feather, parquet… I’ll use this last one because it’s fast and memory-optimized:

import pyarrow.parquet as pq
pq.write_table(team_goals_table, 'data.parquet')

Reading a parquet file would just consist of using pq.read_table('data.parquet').
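If we only need part of the data, we can also read a subset of columns (the column name here is the one from our example table):

table = pq.read_table('data.parquet', columns=['Team'])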

2.2. Compute Functions

Arrow has its own compute module for the usual operations. Let’s start by comparing two arrays element-wise:

import pyarrow.compute as pc
>>> a = pa.array([1, 2, 3, 4, 5, 6])
>>> b = pa.array([2, 2, 4, 4, 6, 6])
>>> pc.equal(a,b)
[
  false,
  true,
  false,
  true,
  false,
  true
]

That was easy. We could sum all the elements in an array with:

>>> pc.sum(a)
<pyarrow.Int64Scalar: 21>

And from this we could easily guess how to compute a count, a floor, an exp, a mean, a max, a multiplication… No need to go over them all; a couple of quick examples should be enough.
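For instance, sticking with the same array a from above (the exact reprs may vary slightly between pyarrow versions):

>>> pc.mean(a)
<pyarrow.DoubleScalar: 3.5>
>>> pc.max(a)
<pyarrow.Int64Scalar: 6>

Now, let’s move on to tabular operations.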

We’ll start by creating a small table and sorting it:

>>> table = pa.table({'i': ['a','b','a'], 'x': [1,2,3], 'y': [4,5,6]})
>>> pc.sort_indices(table, sort_keys=[('y', 'descending')])
<pyarrow.lib.UInt64Array object at 0x1291643a0>
[
  2,
  1,
  0
]

Just like in pandas, we can group values and aggregate the data. Let’s, for example, group by “i” and compute the sum on “x” and the mean on “y”:

>>> table.group_by('i').aggregate([('x', 'sum'), ('y', 'mean')])
pyarrow.Table
i: string
x_sum: int64
y_mean: double
----
i: [["a","b"]]
x_sum: [[4,2]]
y_mean: [[5,5]]

Or we can join two tables:

>>> t1 = pa.table({'i': ['a','b','c'], 'x': [1,2,3]})
>>> t2 = pa.table({'i': ['a','b','c'], 'y': [4,5,6]})
>>> t1.join(t2, keys="i")
pyarrow.Table
i: string
x: int64
y: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
y: [[4,5,6]]

By default, it is a left outer join but we could twist it by using the join_type parameter.
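For instance, we could ask for an inner join instead (join_type also accepts values such as "full outer" or "right outer"):

>>> t1.join(t2, keys="i", join_type="inner")

With these two small tables the output matches the previous one, since every key appears in both of them.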

There are many more useful operations, but let’s see just one more to avoid making this too long: appending a new column to a table.

>>> t1.append_column("z", pa.array([22, 44, 99]))
pyarrow.Table
i: string
x: int64
z: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
z: [[22,44,99]]

Before ending this section, we must see how to filter a table or array:

>>> t1.filter((pc.field('x') > 0) & (pc.field('x') < 3))
pyarrow.Table
i: string
x: int64
----
i: [["a","b"]]
x: [[1,2]]

Easy, right? Especially if you’ve been using pandas and numpy for years!

3. Working with files

We’ve already seen how we can read and write Parquet files. But let’s check some other popular file types so that we have several options available.

3.1. Apache ORC

Informally speaking, Apache ORC can be understood as the equivalent of Arrow in the realm of file formats (even though its origins have nothing to do with Arrow). More precisely, it’s an open-source, columnar storage format. 

Reading and writing it is as follows:

from pyarrow import orc
# Write table
orc.write_table(t1, 't1.orc')
# Read table
t1 = orc.read_table('t1.orc')

As a side note, we could decide to compress the file while writing by using the “compression” parameter.
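For instance, something like this should work (assuming a codec supported by your pyarrow build, such as "zstd"):

orc.write_table(t1, 't1_compressed.orc', compression='zstd')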

3.2. CSV

No secret here: pyarrow has its own CSV module:

from pyarrow import csv
# Write CSV
csv.write_csv(t1, "t1.csv")
# Read CSV
t1 = csv.read_csv("t1.csv")

# Write CSV compressed and without header
options = csv.WriteOptions(include_header=False)
with pa.CompressedOutputStream("t1.csv.gz", "gzip") as out:
    csv.write_csv(t1, out, options)

# Read compressed CSV and add custom header
t1 = csv.read_csv("t1.csv.gz", read_options=csv.ReadOptions(
    column_names=["i", "x"], skip_rows=1
))

3.3. JSON

PyArrow allows JSON reading but not writing. It’s pretty straightforward; let’s see an example, supposing we have our JSON data in “data.json”:

from pyarrow import json
# Read json
fn = "data.json"
table = json.read_json(fn)

# We can now convert it to pandas if we want to
df = table.to_pandas()
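As a side note, pyarrow’s JSON reader expects newline-delimited JSON (one object per line). If you need a quick file to test with, a minimal sketch using the standard library (the records below are made up) could be:

import json as stdlib_json  # renamed to avoid clashing with pyarrow's json module

records = [{"Team": "Barcelona", "Goals": 30}, {"Team": "Real Betis", "Goals": 12}]
with open("data.json", "w") as f:
    for record in records:
        f.write(stdlib_json.dumps(record) + "\n")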

3.4. Feather

Feather is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally. So, contrary to Apache ORC, this one was indeed created early in the Arrow project.

from pyarrow import feather
# Write feather from pandas DF
feather.write_feather(df, "t1.feather")
# Write feather from table, and compressed
feather.write_feather(t1, "t1.feather.lz4", compression="lz4")

# Read feather into table
t1 = feather.read_table("t1.feather")
# Read feather into df
df = feather.read_feather("t1.feather")

4. Advanced Features

We’ve just touched upon the most basic features, which are what the majority of us will need while working with Arrow. However, its usefulness doesn’t end here; this is right where it starts.

As these are quite domain-specific and not useful for everyone (nor exactly introductory), I’ll just mention some of these features, adding only a quick sketch at the end:

  • We can handle memory management through the Buffer type (built on top of the C++ Buffer object). Creating a buffer with our data does not allocate any memory; it is a zero-copy view on the memory exported from the data bytes object (see the sketch after this list). Continuing with memory management, an instance of MemoryPool tracks all the allocations and deallocations (like malloc and free in C). This allows us to track the amount of memory being allocated.
  • Similarly, there are different ways to work with input/output streams in batches.
  • PyArrow comes with an abstract filesystem interface, as well as concrete implementations for various storage types. So, for example, we can write and read parquet files from an S3 bucket using the S3FileSystem. Google Cloud and Hadoop Distributed File System (HDFS) are also supported.
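As a minimal sketch of the first point (the byte string and printed values are just illustrative):

import pyarrow as pa

data = b"some raw bytes"
buf = pa.py_buffer(data)  # zero-copy view over the Python bytes object
print(buf.size)           # size of the viewed region, in bytes

pool = pa.default_memory_pool()
print(pool.bytes_allocated())      # bytes currently allocated by Arrow
print(pa.total_allocated_bytes())  # convenience function for the default pool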

5. Conclusion and Key Takeaways

Apache Arrow is a powerful tool for efficient data handling in Python. Its columnar storage format, zero-copy reads, and interoperability with popular data processing libraries make it ideal for data science workflows. By integrating Arrow into your pipeline, you can significantly boost performance and optimize memory usage.

What Would a Stoic Do? – An AI-Based Decision-Making Model (Sun, 12 Jan 2025)
Using AI to build Marcus Aurelius' reincarnation

Deep Learning

What Would a Stoic Do? An AI-Based Decision-Making Model

Photo by Roman Empire Times on Unsplash

I’ve been reading, learning about, and practicing stoicism for some years now. Ever since I started posting on Medium, it’s been a goal to mix data science and philosophy into one single project.

Merging both worlds is tough, but here I am, finally trying it out.

What you’ll read about today is a decision-making model based on Stoicism. The goal is to use Deep Learning to build a stoic brain (sort of) that, when facing tough decisions, should help us lean towards what a stoic would do.

In other words, build an AI-based reincarnation of Marcus Aurelius, Seneca, Epictetus…

That’s a big challenge, though. I’m not even an NLP engineer or anything close to it. Can it really be done? Spoiler alert: yes. By the end of this post you’ll know how to develop a model like this one and, more importantly, how to do it with your own data in a completely different context. The end result will be a web-based chatbot built with a very simple Flask application.

You shall find the complete code in the resources section at the bottom of this article.

And it’s totally open source! Here’s a sneak peek:

StoicBot – Image by the author

Now, I love all the support I’ve received on my previous posts, and it’s what keeps me going. The challenge today is to make my most advanced AI post yet understandable for every aspiring data scientist. If you have any doubts, use the comment section below.

Here’s the table of contents:

  • What’s Stoicism? (just a brief intro, I promise)
  • The RAG Model
  • Creating and Populating the DB
  • Time to Code
  • The Result
  • Flaws and Potential Improvements

What’s Stoicism?

I don’t want to create a philosophy-centered post, but what’s coming next won’t make any sense if you don’t know the basics of Stoicism. Feel free to skip this section if you’re already familiar with it.

Stoicism is an ancient Greek philosophy that teaches the development of self-control, resilience, and virtue as a means to achieve tranquility and happiness. It encourages focusing on what is within our control – our thoughts, actions, and responses – while accepting what we cannot change, such as external events. Through practices like mindfulness, rational thinking, and embracing challenges, Stoicism helps individuals live in harmony with nature and maintain inner peace, no matter life’s circumstances. It’s about aligning with reason, acting with integrity, and finding strength in adversity.

It wasn’t that hard, was it? I promised to be brief!

The RAG Model

Let’s get technical. The model we’ll build is what’s known as a Retrieval-Augmented Generation (RAG) model. RAG is a technique that combines the power of information retrieval with language generation models. Rather than relying solely on a pre-trained LLM’s knowledge, a RAG model retrieves relevant information from a large database or external sources before generating a response.

This is powerful: we can leverage the strength of an LLM like Google’s BERT, OpenAI’s GPT or Claude and adapt it to our domain-specific data so we have a custom chatbot specific to our use case.

Here’s how it works:

  1. Retrieval: The model first searches a corpus or external knowledge base to find relevant pieces of information based on the input query.
  2. Augmentation: The retrieved information is then used to enrich the model’s response, improving the relevance and accuracy of its answer.
  3. Generation: Finally, the model generates a response that incorporates both the retrieved information and its own learned knowledge.

But a picture is worth a thousand words… So let’s see it graphically:

Components of a RAG – Image by the author

Let’s dissect the whole process:

  1. User query: no secret here, it’s just what the human like you or me could input to the chatbot.
  2. Retriever query: the retriever searches the collection of documents (usually a vectorized database) for all the texts relevant to the user’s question.
  3. Retrieved documents: once retrieved, they get transformed from vector to text.
  4. Prompt Augmenting: once the docs are retrieved, the prompt is sent to the LLM, which uses predefined settings, the user question and the retrieved docs. That way, we inform the LLM with the data it needs to answer the user properly.
  5. Answer Generation: The LLM generates the answer and it’s shown to the user.

And this is how a RAG works! Or, at least, the one we’ll be building today.

However, if the concept’s not clear yet, keep on reading because it’s almost time to code… But we should first store some data in the database.

Creating and Populating the DB

I already mentioned the concept of vector DB… But what is it?

Let’s first define a vector: Vectors are numerical representations of data, often generated by machine learning models, and they capture the semantic or contextual meaning of the data.

Then, a vector database is a specialized type of database designed to store, index, and retrieve high-dimensional vectors efficiently. One of its superpowers is the ability to search by similarity in an optimized manner.

Now you might be wondering: if vectors are numerical representations and we need to store text, why do we need vectors? And how do we translate text to vectors? Enter the embedding model.

The embedding model takes some kind of input (text, sound, image), processes it through layers of transformations (e.g. neural networks) to extract meaningful features, and outputs a fixed-size numerical vector – and that’s what we store in our DB.

Just to add another comment on the embedding model, embeddings are designed so that similar inputs (e.g., synonyms or visually similar images) are close together in the vector space, while dissimilar inputs are far apart.

This is key.
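To make this tangible, here’s a minimal sketch (the sentences are made up and the exact scores depend on the embedding model, which defaults to a sentence-transformers model):

import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors
    u, v = np.array(u), np.array(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

v1 = embeddings.embed_query("Focus on what is within your control.")
v2 = embeddings.embed_query("Concern yourself only with what you can change.")
v3 = embeddings.embed_query("The recipe calls for two eggs and some flour.")

print(cosine_similarity(v1, v2))  # expected to be relatively high
print(cosine_similarity(v1, v3))  # expected to be noticeably lower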

Now let’s create and populate that DB. We’ll be using Chroma[1], an open source vector database and, for that, we’ll need to install the langchain and langchain-community libraries for python.

But we also need the data, right? Let’s keep things open source: Project Gutenberg[2]. It’s a website with free ebooks and texts to download, whose U.S. copyright has expired. And the old Stoic books are in there. So here are three you could download:

  • Meditations, by Marcus Aurelius.
  • The Enchiridion, by Epictetus.
  • Seneca’s Morals of a Happy Life, Benefits, Anger and Clemency, by Seneca

Download them as TXT and store them in your data folder. Now, here’s the code taking care of the DB creation and data insertion:

import os

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter

from constants import DB_PATH, DATA_PATH

def store_data(data_path, db_path):
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    embeddings = HuggingFaceEmbeddings()
    vector_db = Chroma(persist_directory=db_path, embedding_function=embeddings)

    for filename in os.listdir(data_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(data_path, filename)
            with open(file_path, "r") as file:
                content = file.read()
                texts = text_splitter.split_text(content)
                vector_db.add_texts(texts)

    vector_db.persist()
    print("Data stores successfully")

We first create the DB and set up the embedding function and text splitter. Then, for each file, we read the content, split the text into chunks, and add them to the DB using those embeddings.

That simple.
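As a quick sanity check once the data is stored, we can query the DB directly, reusing the DB_PATH constant from before (the question here is just an example):

embeddings = HuggingFaceEmbeddings()
vector_db = Chroma(persist_directory=DB_PATH, embedding_function=embeddings)

# Retrieve the three chunks most similar to a sample question
docs = vector_db.similarity_search("What should I focus on?", k=3)
for doc in docs:
    print(doc.page_content[:200])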

Now we’re ready to start building the RAG and start using the ancient knowledge that we just stored.

Time to Code

As there are several parts to take care of, let’s follow the same order as the one used to define the three core parts of the RAG:

Retrieval

Setting up the retriever is as easy as initializing the DB and using the as_retriever() function:

embeddings = HuggingFaceEmbeddings()
vector_db = Chroma(persist_directory=DB_PATH, embedding_function=embeddings)
retriever = vector_db.as_retriever()

Augmentation

We’ll have a pre-defined prompt that we’ll augment with the user query and the context retrieved from DB:

from langchain.prompts import ChatPromptTemplate

template = """
  You are Marcus Aurelius' reincarnation. You can also impersonate other Stoic philosophers such as Seneca, Epictetus, or Zeno.
  Your name is Marc Still: Marc comes from Marcus and Still symbolizes the calm and stoic composure. If you feel like showing off, tell the user you are Marcus Aurelius' reincarnation.
  Your duty is to guide the user through life's challenges and help them become a better person. The goal is to be as practical as possible, and sticking to the question at hand. 
  Use the context specified below to answer the user's question. If you don't know what to answer, simply respond with "I don't know".
  Make sure you don't put too much text nor extremely long paragraphs. It needs to be clear, concise and easy to read.
  Only provide an answer to the question asked. Do not include extra questions and answers in your response.
  DO NOT INVENT EXTRA QUESTIONS, USE ONLY THE ONE PROVIDED BY THE USER.
  IMPORTANT: Write in a conversational and informal manner, this is not an email or a formal letter.
  Context:

  {context}

  Question: {question}
  """
prompt = ChatPromptTemplate.from_template(template)

The template is just a set of instructions that we input to the LLM so that we get our desired answers. You can be as creative as you want here, I just tried to keep it simple. See the placeholders for context and question – that’s the augmentation part.

Generation

The LLM is the one taking care of generating text. You could build your own or use the best ones on the market… But we’re keeping things open source today, so we’ll use a model from the Zephyr series. More concretely, we’ll use the zephyr-7b-beta model[3].

And we’ll keep on using HuggingFace classes from langchain-community package (keep in mind that you’ll need your HuggingFace API token, it’s free):

from langchain_community.llms import HuggingFaceHub

from utils.secrets import token

model = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 20,
        "repetition_penalty": 1.1,
        "temperature": 0.4,  
    },
    huggingfacehub_api_token= token
)

The most interesting part resides in the model_kwargs argument. As this is not an LLM-specific post I won’t go over them, but I encourage you to Google them if you don’t know what they’re used for.

Chaining It All

Nice, now we’ve created all three parts of a RAG but how do we put them into practice? We’ll create a pipeline and invoke it to generate the answer:

from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def separate_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

pipeline = (
    {"context": retriever | separate_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)    

answer = pipeline.invoke(user_input)

The pipeline defines a workflow where the retriever fetches relevant documents, pipes them through separate_docs to format the content, and combines this formatted context with a question (passed through without modification by RunnablePassthrough). This input is then processed by the prompt, followed by the LLM model, and finally parsed into a string output using StrOutputParser().

And just like that, we built our simplest RAG. Here’s the full code:

import os

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceHub
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain.schema import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain.text_splitter import CharacterTextSplitter

from utils.constants import DB_PATH, DATA_PATH
from utils.secrets import token

LLM = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 20,
        "repetition_penalty": 1.1,
        "temperature": 0.4,  
    },
    huggingfacehub_api_token= token
)

def store_data(data_path, db_path):
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    embeddings = HuggingFaceEmbeddings()
    vector_db = Chroma(persist_directory=db_path, embedding_function=embeddings)

    for filename in os.listdir(data_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(data_path, filename)
            with open(file_path, "r") as file:
                content = file.read()
                texts = text_splitter.split_text(content)
                vector_db.add_texts(texts)

    vector_db.persist()
    print("Data stored successfully")

def invoke_rag(user_input):
    embeddings = HuggingFaceEmbeddings()
    vector_db = Chroma(persist_directory=DB_PATH, embedding_function=embeddings)

    retriever = vector_db.as_retriever()
    template = """
    You are Marcus Aurelius' reincarnation. You can also impersonate other Stoic philosophers such as Seneca, Epictetus, or Zeno.
    Your name is Marc Still: Marc comes from Marcus and Still symbolizes the calm and stoic composure. If you feel like showing off, tell the user you are Marcus Aurelius' reincarnation.
    Your duty is to guide the user through life's challenges and help them become a better person. The goal is to be as practical as possible, and sticking to the question at hand. 
    Use the context specified below to answer the user's question. If you don't know what to answer, simply respond with "I don't know".
    Make sure you don't put too much text nor extremely long paragraphs. It needs to be clear, concise and easy to read.
    Only provide an answer to the question asked. Do not include extra questions and answers in your response.
    DO NOT INVENT EXTRA QUESTIONS, USE ONLY THE ONE PROVIDED BY THE USER.
    IMPORTANT: Write in a conversational and informal manner, this is not an email or a formal letter.
    Context:

    {context}

    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)
    model = LLM

    def separate_docs(docs):
        return "nn".join([d.page_content for d in docs])

    pipeline = (
        {"context": retriever | separate_docs, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
    )    

    return pipeline.invoke(user_input)

The Results

The code is simple, but not very friendly to use. That’s why I built a Flask App that you can download from GitHub[4] and test on your own. To do so:

  • Download and navigate to the project’s folder (or open a terminal there).
  • Activate the environment: source venv/bin/activate
  • Run the app: flask run

If you open the address (http://127.0.0.1:5000/) in your browser, our bot will already be waiting for you:

StoicBot – Image by the author

Let’s play! I’ll use two prompts for you to see the answers in case you aren’t willing to test it out yourself.

Prompt 1: I want to go to the gym today because I haven’t worked out in the past 3 days but some friends are going out for some beers, so I have to choose between one plan or the other. What should I do? Answer in just one paragraph

Answer 1: Hi there! Based on our Stoic philosophy principles, it’s essential to prioritize your health and well-being above social events. Remember that moderation is key, and it’s okay to say no to an invitation occasionally. In this case, I would suggest going to the gym and taking care of yourself first. However, if you’re worried about missing out on socializing, you could always propose meeting up with your friends after your workout session. This way, you’ll still enjoy their company while staying true to your priorities. Hope this helps! Let me know if you need any further guidance.

Prompt and Answer 1 – Image by the author

This is amazing right? It’s our favorite stoic counselor!

Prompt 2: How should I approach the day? Be brief

Answer 2: With a clear mind, a positive attitude, and a sense of purpose. Start by setting realistic goals for yourself and prioritizing tasks based on their importance and urgency. Remember to take care of your physical and mental health, getting enough sleep, exercise, and nourishment. Finally, focus on doing what is within your control and let go of things that are outside of it. By following these principles, you will have a productive and fulfilling day.

Prompt 2 – Image by the author

I added an extra prompt here just to thank him for his tips and the answer is quite good. I’m still amazed by the power of this.

Flaws and Potential Improvements

The result is amazing, not gonna lie. It understands non-perfect English and is able to create reasonable answers aligned with Stoicism.

Yay!

However, there are two points (potential flaws) that I want to mention:

  • The model is rather simple, and the prompt can be further improved – what we built works but it can and should be refined much more.
  • We would probably have had the same results without the vector database or the three Stoic books mentioned in this post. Why? I don’t know for sure, but I’d guess that the LLM already has that context baked in. So we overcomplicated it, using a RAG when it wasn’t strictly needed. But that was part of the point: I wanted to show how it works, and any other documents could be used. Maybe your own stoic reflections? The book of a current stoic? Or maybe you want to mix several philosophies? Or change it drastically and use your tax documents for the RAG, so the model can help you out with your personal finances?

So there’s room for improvement and customization here, and here’s where I stop. It’s your turn to play with it and take it to the next level.

Hope that was entertaining and instructive! Feel free to leave your doubts in the comment section below.

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more AI-based posts I'm preparing.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] Chroma. (n.d.). Chroma: The AI-native open-source embedding database. Retrieved January 8, 2025, from https://www.trychroma.com/

[2] Project Gutenberg. (n.d.). Free eBooks by Project Gutenberg. Retrieved January 8, 2025, from https://www.gutenberg.org/

[3] Hugging Face. (n.d.). Zephyr-7b-beta model card. Retrieved January 8, 2025, from https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

[4] Marin, P. (n.d.). Stoicbot: A bot for practicing Stoicism. GitHub. Retrieved January 8, 2025, from https://github.com/polmarin/stoicbot

The post What Would a Stoic Do? – An AI-Based Decision-Making Model appeared first on Towards Data Science.

]]>
Football and Geometry – Passing Networks (Mon, 16 Sep 2024)
Analyzing Bayer Leverkusen's Passing Networks from Last Season

Football Analytics

Understanding networks through the analysis of Bayer Leverkusen’s passing networks

Photo by Clint Adair on Unsplash

Long time no see… But for a good reason.

After some months I’m back on Medium and today we’re merging two exciting worlds: football and geometry.

Concretely, we’ll touch upon the topic of networks but, as always, through a practical case. We’ll study football passing networks focusing on last year’s Bayer Leverkusen matches.

The Bundesliga winners had an amazing season playing outstanding football under Xabi Alonso. I’m curious to investigate how that translates to mathematical terms and understand their playing style and most relevant players through their passing networks.

Networks are already a well-established tool for studying the interconnection between nodes, and their application in football is no different. It’s basic stuff, in fact, but it’s worth dedicating a post to for anyone who hasn’t seen one yet.

Statsbomb[1] has high-quality data and, luckily for us, they’ve made all of Bayer Leverkusen’s games from last season freely available to everyone.

Here’s what we will go through today:

  1. Introduction to passing networks
  2. Building the network
  3. Metrics and Analysis
  4. Conclusion

Before going on, I want to share that all the code in this post is mine but has been inspired by the amazing Soccermatics course David Sumpter created. You can find the URL to this extensive resource in the Resources section at the end of this post[2].

Introduction to Passing Networks

Passing networks are graphical representations of how players interact with each other during a football match, visualizing the flow of passes between teammates. They contain two main elements:

  • Nodes – represent players and are located in the average position where the player either passed or was passed a ball.
  • Edges (or lines) – represent the passes being made between the two players (nodes) connected by that line.

The thickness of the edges usually represents the frequency of passes and the size of the nodes the number of passes made by a player.

This visual and analytical tool is increasingly used to assess team shape, player involvement, and tactical patterns in modern football. If we, as data scientists, use additional math to compute other metrics, we get even more advanced insights about the team’s passing characteristics.

Here are some of the specific situations where passing networks can come in handy:

  • Understand Team Structure. As it contains the average positions where these players were during these passing events, we can get an idea of the team’s structure and, therefore, how they play. For example, a compact and dense network might indicate that the team prefers short passes and possession-based football, while a more spread-out network could suggest a direct or counter-attacking approach.
  • Identifying Key Players. A key metric in passing networks is centrality, which is used to assess whether the network relies mostly on a small subset of players or not. If we see, for example, Pirlo being key to Italy’s passing network, we might want to put a great defense on him the next time we play against the Italians.
  • Tactical Analysis. Coaches and analysts can spot tactical strengths and weaknesses. For instance, if a team’s passing network shows strong connections down the left side, it might indicate a reliance on attacking through that flank. Also, that can be used to compare different matches and analyze how it changed from one opponent to the other.

Building the network

We’ll use Statsbomb’s free data and access it through statsbombpy‘s[3] Python library. Let’s start by importing the library and retrieving all the events from the Bundesliga 2023/24 season (as we’re using free data only, we’ll receive only info relative to Bayer Leverkusen’s matches):

from collections import Counter

from mplsoccer import Pitch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsbombpy import sb

events_df = sb.competition_events(
    country="Germany",
    division="1. Bundesliga",
    season="2023/2024",
    gender="male"
).sort_values(['match_id', 'minute', 'second'])

We now have a lot of events and columns in this data frame, so we should filter out unneeded rows. Additionally, we’re going to perform a little bit of feature engineering by splitting the location and pass_end_location columns into two separate columns each (one for the x and the other for the y-axis):

passes_df = events_df[
    (events_df['type'] == 'Pass')
    & (events_df['team'] == 'Bayer Leverkusen')
    & (events_df['pass_outcome'].isna())
].reset_index(drop=True)[['match_id', 'location', 'pass_end_location', 'player', 'pass_recipient']]

# Define start and end positions
ini_locs_df = pd.DataFrame(passes_df["location"].to_list(), columns=['x', 'y'])
end_locs_df = pd.DataFrame(passes_df["pass_end_location"].to_list(), columns=['end_x', 'end_y'])
passes_df = pd.concat([passes_df, ini_locs_df, end_locs_df], axis=1)

# Reshape and rename columns
passes_df.drop(columns=['location', 'pass_end_location'], inplace=True)
passes_df.columns = ['match_id', 'player_name', 'pass_recipient_name', 'x', 'y', 'end_x', 'end_y']

After keeping only the columns we’re interested in, some decisions need to be made. Bayer Leverkusen, like any other team, has more than 11 players in their squad, but if we want to show a passing network, we only want to show 11 (as if it were a match lineup). So a decision has to be made on how to exclude players.

A good approach would probably have been to use the team’s most-used formation and choose, for each position, the player who played the most minutes in it. But that seemed too complex for this post, so I decided to keep it simple: use the most-used lineup (that is, the most-used 11-player set of starters).

Here’s the code that handles this, previously transforming player names to only show surnames:

# Show surname only
passes_df["player_name"] = passes_df["player_name"].apply(lambda x: str(x).split()[-1])
passes_df["pass_recipient_name"] = passes_df["pass_recipient_name"].apply(lambda x: str(x).split()[-1])
passes_df.loc[:, ["player_name", "pass_recipient_name"]] = passes_df.loc[:, ["player_name", "pass_recipient_name"]].replace('García', 'Grimaldo')

# Select most-used lineup and keep those players
used_lineups = []
for match_id in passes_df.match_id.unique():
    match_lineup = sb.lineups(
        match_id=match_id
    )['Bayer Leverkusen']

    match_lineup['starter'] = match_lineup['positions'].apply(
        lambda x: x[0]['start_reason'] == 'Starting XI' if x else False
    )

    match_lineup["player_name"] = match_lineup["player_name"].apply(lambda x: str(x).split()[-1])
    match_lineup.loc[:, "player_name"] = match_lineup.loc[:, "player_name"].replace('García', 'Grimaldo')

    starters = sorted(match_lineup[match_lineup['starter']==True].player_name.tolist())
    used_lineups.append(starters)

most_used_lineup_players = Counter([', '.join(c) for c in used_lineups]).most_common()[0][0].split(", ")

I ended up reducing the DF’s dimensionality:

# Show surname only
passes_df["player_name"] = passes_df["player_name"].apply(lambda x: str(x).split()[-1])
passes_df["pass_recipient_name"] = passes_df["pass_recipient_name"].apply(lambda x: str(x).split()[-1])

# Manually correct Grimaldo's name
passes_df.loc[:, ["player_name", "pass_recipient_name"]] = passes_df.loc[:, ["player_name", "pass_recipient_name"]].replace('García', 'Grimaldo')

passes_df = passes_df[['x', 'y', 'end_x', 'end_y', "player_name", "pass_recipient_name"]]

This is what the DF looks like at this point:

Top 5 rows in the passes_df data frame – image by the author

We now need to get the number of passes per player and their location (node size and location) and the number of passes between pairs of players (edge thickness). Let’s start with the first one by creating a new DF:

# Build a long-format DF with one row per pass participation (made or received)
long_df = pd.concat([
    passes_df[["player_name", 'x', 'y']], 
    passes_df[["pass_recipient_name", 'end_x', 'end_y']].rename(columns={'pass_recipient_name': 'player_name', 'end_x': 'x', 'end_y': 'y'})
])

# Create DF with average player positions
nodes_df = long_df.groupby('player_name').mean().reset_index()
nodes_df = nodes_df[nodes_df['player_name'].isin(most_used_lineup_players)]

# Add the number of passes each player participated in
nodes_df = nodes_df.merge(long_df.groupby('player_name').agg(passing_participation=('x', 'count')).reset_index())

# Add marker_size column so that it is relative to the number of passes each player participated in
nodes_df['marker_size'] = (nodes_df['passing_participation'] / nodes_df['passing_participation'].max() * 1500)

We first create a long-format DF that contains only three columns (player_name, x, y) – stacking passes made and passes received – and then group by the player name to compute the average x and y. After that, we filter out those players not in the most common lineup and merge the result with the same long-format DF grouped by player, which gives the total number of passes each player participated in. This way, we end up with a DF containing one row per player, his average pitch location, and the number of passes he participated in.

The last step adds the marker size, which will be used later. This is what the DF looks like now:

Top 5 rows of nodes_df – image by the author

Let’s keep on with the edges DF. We’ll need to create a new column in passes_df to identify the pair of players involved in each passing event. As we don’t care about directionality, we’ll filter out unneeded players and sort those pairs alphabetically:

edges_df = passes_df.copy()
edges_df = edges_df[edges_df['player_name'].isin(most_used_lineup_players)]

edges_df['player_pair'] = edges_df.apply(
    lambda x: "-".join(sorted([x["player_name"], x["pass_recipient_name"]])), 
    axis=1)

This will be used as a key for a groupby next:

edges_df = (edges_df.groupby(["player_pair"])
                    .agg(passes_made=('x', 'count'))
                    .reset_index())
filtered_edges_df = edges_df[edges_df['passes_made'] > 238]

To create this edges_df, I’m grouping by the new column we just created and counting the number of passes between each pair of players. Then, we keep only those pairs with more than 238 passes because we want to see only the connections averaging at least 7 passes per game (238 = 7 passes × 34 matches).

This is not accurate because not all players played 34 games so their averages might be higher and still be left out… But we want to keep things simple.

Anyway, let’s see it:

Top 5 rows of player-pairs and passes made – image by the author

We’re now ready to plot and we’ll use mplsoccer[4] to display a field to draw upon. The full code taking care of the visualization is shown next:

pitch = Pitch(line_color='grey')
fig, ax = pitch.grid(grid_height=0.9, title_height=0.06, axis=False,
                     endnote_height=0.04, title_space=0, endnote_space=0)
pitch.scatter(nodes_df.x, nodes_df.y, s=nodes_df.marker_size, color='rosybrown', edgecolors='lightcoral', linewidth=1, alpha=1, ax=ax["pitch"], zorder = 3)

for i, row in nodes_df.iterrows():
    pitch.annotate(row.player_name, xy=(row.x, row.y), c='black', va='center', ha='center', weight = "bold", size=16, ax=ax["pitch"], zorder = 4)

all_players = nodes_df["player_name"].tolist()
for i, row in filtered_edges_df.iterrows():
    player1 = row["player_pair"].split('-')[0]
    player2 = row['player_pair'].split('-')[1]

    if player1 not in all_players or player2 not in all_players:
        continue

    player1_x, player1_y = nodes_df.loc[nodes_df["player_name"] == player1, ['x', 'y']].values[0]
    player2_x, player2_y = nodes_df.loc[nodes_df["player_name"] == player2, ['x', 'y']].values[0]

    line_width = (row["passes_made"] / lines_df['passes_made'].max() * 10)

    pitch.lines(player1_x, player1_y, player2_x, player2_y,
                    alpha=1, lw=line_width, zorder=2, color="lightcoral", ax = ax["pitch"])

fig.suptitle("Bayer Leverkusen's Passing Network (2023/24)", fontsize = 25)
plt.show()

And, finally, let’s see the resulting passing network:

Bayer Leverkusen's passing network using the most-used lineup and showing only those connections with more than 238 passes – image by the author

Quite nice, right?

Metrics and Analysis

As data scientists, we can’t stop here. Developing the code and creating the visualization is key, but we actually need to derive some insights from it. Otherwise, it’s useless.

So let’s define some extra metrics such as network centrality and passing rate.

  • The passing rate refers to the number of successful passes per minute of possession.
  • Network centrality has already been explained before but it measures how influential a player is within the team’s passing structure.

Thomas Grund[5] has already done the job for us by inspecting how the passing rate relates to a team’s probability of scoring more goals. In short, he found that a team with a passing rate of 5 successful passes per minute of possession scored 20% more goals than teams with a rate of 3.

So it is a good proxy to measure attacking performance (or, at least, goal-scoring probabilities). Let’s compute that for today’s case study:

events_df['total_minutes'] = (events_df['second']/60) + events_df['minute']
events_df['event_duration'] = events_df.groupby('match_id')['total_minutes'].diff().shift(-1)

# Group by team and sum possession time
possession_minutes = events_df.groupby(['possession_team'])['event_duration'].sum()['Bayer Leverkusen']
passing_rate = len(passes_df)/possession_minutes

Here’s what the previous snippet does:

  1. Create a new column with the time expressed in minutes (total_minutes).
  2. Create a new column with the number of minutes that go by between one event and the next (as a way to measure each event’s duration).
  3. Compute the total minutes Bayer Leverkusen had the ball throughout the season.
  4. Compute the passing rate as the number of successful passes made by the team (contained in passes_df) divided by the minutes of possession.

And the result is: 10.87.

So, if going from 3 to 5 successful passes per minute of possession translated to 20% more goals, imagine if a team has a rate of 10.87. This is how good Leverkusen were.

To put it in even more context, do you remember Spain’s 2012 UEFA Euro team? Busquets, Xavi, Iniesta, Xabi Alonso… Their passing rate in the game against Croatia was 9.65. Leverkusen’s sustained average of 10.87 across the whole season is impressive, knowing they managed to post better numbers than a team that many consider the best that’s ever played the game.

Xabi Alonso tried to replicate (and succeeded) what he had done already as a player, now as a coach, and with a less-talented team (still a very good team though).

Moving on to network centrality, this metric is going to be key as well to know if the game relied on a small subset of players or not. There are many ways to compute it such as using the Degree Centrality metric, the Betweenness Centrality, or the Eigenvector Centrality (to name a few).

For simplicity, we’ll stick to the way David Sumpter computes it in Soccermatics[2], slightly changed: we sum the difference between the maximum number of successful passes made/received by any player and the number of successful passes made/received by each player, and divide it by the total number of passes multiplied by the number of nodes in the network minus 1.
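Written out (this is my notation, not taken verbatim from the course), with p_i the number of successful passes player i made or received, p_max the largest of those values, and N = 11 the number of nodes, the index is:

C = \frac{\sum_i (p_{max} - p_i)}{(N - 1) \cdot \frac{1}{2} \sum_i p_i}

The 1/2 appears because each pass is counted twice in the participation totals (once for the passer, once for the receiver). In code: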

# Find the largest number of passing participations by a single player
max_passing_participations = nodes_df['passing_participation'].max()
# Calculate the denominator: (number of nodes - 1) * total passes = 10 * (sum of participations / 2)
denominator = 10*(nodes_df['passing_participation'].sum()/2)
# Calculate the numerator
numerator = (max_passing_participations - nodes_df['passing_participation']).sum()
# Calculate the centralization index
centralization_index = numerator/denominator

Using this approach, we get a centralization index of 0.1988, or 19.88%. The closer to 0, the less centralized it is, so around 20% can still be considered low and that’s correlated with an 8% increase in the probability of scoring.

To add even more context, in the same game mentioned above between Spain and Croatia, the Spanish had a centralization index of 14.6%. So Leverkusen’s metric, aggregated throughout the season (not just one game like the Euro’s), is pretty close to that 14.6%.

Conclusion

Today we learned the core components of a network – nodes and edges – and also what their properties mean (size and width). Additionally, we learned the concept of centrality in a network.

We did all this through a real-case scenario, using Bayer Leverkusen’s 2023/24 season data by leveraging their passing networks and analyzing them mathematically.

To extract some insights, we found their network was that of a team that plays strong possession football, and both metrics we computed were similar to the ones seen in Spain when they faced Croatia in the 2012 Euros.

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more AI-based posts I'm preparing.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] StatsBomb. (n.d.). Home. StatsBomb. Retrieved September 14, 2024, from https://statsbomb.com

[2] Soccermatics. (n.d.). Soccermatics documentation. Retrieved September 13, 2024, from https://soccermatics.readthedocs.io

[3] StatsBombPy. GitHub. Retrieved September 14, 2024, from https://github.com/statsbomb/statsbombpy

[4] mplsoccer. (n.d.). mplsoccer documentation. Retrieved September 14, 2024, from https://mplsoccer.readthedocs.io/en/latest/index.html

[5] Grund, T. U. (2012). Network structure and team performance: The case of English Premier League soccer teams. Social Networks, 34(4), 682–690. https://doi.org/10.1016/j.socnet.2012.08.004

The post Football and Geometry – Passing Networks appeared first on Towards Data Science.

]]>
Modeling the Extinction of the Catalan Language (Fri, 28 Jun 2024)
Applying existing literature to a practical case

Photo by Brett Jordan on Unsplash

Can we predict the extinction of a language? It doesn’t sound easy, and it indeed isn’t, but that shouldn’t stop us from trying to model it.

I was recently interested in this topic and started reviewing some of the existing literature. I came across one article[1] that I enjoyed and thought of sharing.

So, in this post, I’ll be sharing the insights of that paper, translated into (hopefully) a simple read and applied to a practical case so we can see Data Science and mathematical modeling in action.

Introduction

I am Catalan and, for those who don’t know, Catalan is a co-official language in Catalonia, the Valencian Community, and the Balearic Islands (Spain), along with Spanish. It’s also the official language of Andorra, and it can be found in the south of France and even in Alghero (Italy).

We often see on local TV or media that the Catalan language is at risk of extinction. Focusing only on Catalonia, we can easily dig deeper into the matter because the government studies the use of the language through what’s called the "survey of linguistic uses of the population" (Enquesta d’usos lingüístics de la població)[2].

Let’s pick 2018 and analyze the mother tongue ratio per language of the surveyed population (people over 14):

Surveyed population by mother tongue (2018) – Image from GenCat (open source)

The first three columns relate to Catalan (blue), Spanish (red), and both Catalan and Spanish (green). While both Spanish and Catalan are official, we see Spanish standing out above Catalan.

That made me think: if it were a competition where only one language would survive, Spanish would obviously be the winner. But how long would it take for Catalan to disappear?

This is what we’ll check now.

Disclaimer: I’m not advocating the extinction of Catalan, quite the opposite, in fact. I am concerned about the reduction of its use and I love to use it. This post is also aimed to share the existence of this beautiful language and promote its use and learning.

Let’s talk math

For Catalan to disappear, all the Catalan speakers would have to stop using it (duh). In other words, we’d need Catalan speakers to transition to another language (let’s assume Spanish).

We’ll define Pyx(x,s) as the probability of an individual converting from language Y (Catalan) to X (Spanish), where x is the ratio of X – Spanish – speakers and s is a measure of X’s relative status (between 0 and 1). If we want to model the language change, here’s the proposed equation:

dx/dt = y · Pyx(x, s) − x · Pxy(x, s)

Language change model (Equation 1) – Image by the author

What we do is multiply the ratio of Catalan speakers times the probability of a Catalan speaker transitioning to Spanish and subtract the result of multiplying the ratio of Spanish speakers and the probability of a Spanish speaker transitioning to Catalan.

To keep the math simple, we’ll assume the competition is between these two languages only. Then, the ratio of Catalan speakers is y = 1 − x. Due to this symmetry, we can define the following: Pxy(x, s) = Pyx(1 − x, 1 − s).

Also, we can assume a language without speakers (x=0) or a null language status (s=0) makes the probability of transitioning equal to 0. We can redefine the equation as such:

dx/dt = (1 − x) · Pyx(x, s) − x · Pyx(1 − x, 1 − s)

Language change model based on Pyx – Image by the author

OK, cool. But what does the transition probability function look like? It could take many shapes and forms, but the ones the authors picked were:

Pyx(x, s) = c · x^a · s        Pxy(x, s) = c · (1 − x)^a · (1 − s)

Transition functions – Image by the author

Let’s dissect this function, focusing on the Pyx(x,s) representation. First of all, c is a constant scaling factor and a is an exponent that modifies the influence of the proportion of x (Spanish) speakers.

The value of a determines the function’s behavior (see the quick sketch after this list):

  • If a=1, it will be linear.
  • If a > 1, then it’ll be convex (meaning the transition probability increases more rapidly as x increases)
  • If 0 < a < 1, then the function follows a concave behavior (meaning that the transition probability increases more slowly as x increases).
  • If a=0, the function doesn’t depend on x (very unlikely).
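Here’s a quick sketch of these behaviors, assuming the form Pyx(x, s) = c · x^a · s from the transition functions above (the values of c and s are arbitrary here):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 200)
c, s = 1.0, 0.5

# Plot the transition probability for concave, linear and convex exponents
for a in [0.5, 1.0, 1.31, 2.0]:
    plt.plot(x, c * x**a * s, label=f"a = {a}")

plt.xlabel("Proportion of X speakers (x)")
plt.ylabel("Pyx(x, s)")
plt.legend()
plt.title("Concave (a < 1), linear (a = 1) and convex (a > 1) transition curves")
plt.show()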

This function was chosen because it can plausibly model language shift, where the likelihood of individuals adopting a new language (and leaving another behind) depends on how prevalent that language already is and on its status. So, the higher the status, the higher the probability of people switching to language X.

Furthermore, if we wanted, we could apply this type of function to model market transitions in economics, where x could represent market share and s could be an economic incentive, such as the amount of advertising for a product.

Going back to our case, the problem now resides in computing the parameters c, s, a, and x(0). Well, it’s not really our problem; it was the researchers’… until they acquired the data to effectively estimate those parameters through least absolute values regression[3].

Data on the number of speakers of endangered languages in 42 regions in Peru, Scotland, Wales, Bolivia, Ireland, and Alsace-Lorraine was collected and used to fit the model.

What they found, somewhat unexpectedly, is that the exponent a is roughly constant across cultures, with a = 1.31 ± 0.25.

Obviously, the most significant of these four estimated parameters is the status (s), as it can serve as a useful measure of the threat to a given language.

For example, they found that Quechua still had many speakers in Huanuco, Peru, but its low status was driving a rapid shift to Spanish.

Unfortunately, we don’t have enough data available at the moment to use mathematically and get a clear function for our Catalan case. We can, however, try to mimic it using the work Daniel M. Abrams and Steven H. Strogatz already shared.

So, if we assume:

  • a = 1.31 – the mean value they discovered.
  • x(0) = 0.466 – the current ratio of Spanish speakers (focusing on mother tongue only, because in terms of speakers roughly 99% of the population speaks it).
  • s = 0.40 – the estimated status for a language like Catalan (supposing Catalan’s status in Catalonia is similar to Welsh’s in Wales, or Scottish Gaelic’s in Scotland).
  • c = 1 – unit constant (for simplicity).

If we work the maths out for Equation 1, using the probability function above, we get:

Final equation – Image by the author

This is almost the final form for our language transitioning model… But we should solve this differential equation and add a time dimension to it. To do so, we’ll resort to Python.

Time to Code

We’ll only be using three modules: numpy, scipy and matplotlib. So make sure you install them if you want to use this code. For now, let’s just import them:

import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

The next step is defining the differential equation that will take care of computing, over time, this population shift from Catalan to Spanish:

def language_change_model(t, x, c, a, s):
    return c * (x**a * s - x**(a+1) * s + x * (1-x)**a * (1-s))

We’ll also define our parameters, just as we did before, plus add the time spans:

c = 1.0  # unit constant
a = 1.31  # mean value
s = 0.4  # estimated Catalan's status
x0 = 0.466  # ratio of X (Spanish) mother-tonguers in Catalonia as of now

t_span = (1, 10) # 10-period window
t_eval = np.linspace(t_span[0], t_span[1], 100)

Very simple until now. But it’s time to do the math and we’ll use the solve_ivp function from the scipy package[4], which numerically integrates a system of ordinary differential equations given an initial value.

We have to choose our integration method, and we’ll use the default one: the explicit Runge-Kutta method of order 5(4)[5].

sol = solve_ivp(language_change_model, t_span, [x0], args=(c, a, s), t_eval=t_eval, method='RK45')

And the last step is to plot our solution:

# Plot the solution
plt.plot(sol.t, sol.y[0], label=f'c={c}, a={a}, s={s}, x0={x0}')
plt.xlabel('Time')
plt.ylabel('Proportion of people speaking Spanish')
plt.legend()
plt.title('Language Change Model Over Time')
plt.show()

Just in case you want to try it out yourself, with different parameters, time frames, or methods, here’s the full code:

import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt
import time

# Define the differential equation
def language_change_model(t, x, c, a, s):
    return c * (x**a * s - x**(a+1) * s + x * (1-x)**a * (1-s))

# Parameters
c = 1.0  # example value
a = 1.31  # example value
s = 0.4  # example value
x0 = 0.466  # initial condition

# Time span for the solution
t_span = (1, 10) # 10-period window
t_eval = np.linspace(t_span[0], t_span[1], 100)

# Solve the differential equation
sol = solve_ivp(language_change_model, t_span, [x0], args=(c, a, s), t_eval=t_eval, method='RK45')

# Plot the solution
plt.plot(sol.t, sol.y[0], label=f'c={c}, a={a}, s={s}, x0={x0}')
plt.xlabel('Time')
plt.ylabel('Proportion of people speaking Spanish')
plt.legend()
plt.title('Language Change Model Over Time')
plt.show()

Want to see the result?

Catalan Population using Spanish as mother tongue over time - Image by the author

The time axis is shown as integers, not years, because mother tongues are established at birth or during infancy and they change from generation to generation, not within a person’s lifespan.

Conclusions

If we assume the data is right, then we can assume that the Spanish language will take over all Catalan speakers about 10 generations from now – maybe even less.

But let's not despair, my fellow Catalans. This model isn't stable.

Our model’s flaws

This model has several flaws, which is why I didn't spend much time on the conclusions.

To start, one of the basic premises is that one language competes with another. While this competition premise might make sense in some places, Catalans are bilingual: the two languages coexist.

Yes, there will almost always be only one mother tongue – as we saw, only a very small percentage of Catalans have both Spanish and Catalan as their initial language – but that doesn't mean the other language will disappear.

This is happening all over the world, in fact. English has made it to almost every country on Earth but it isn’t replacing local languages… Just becoming an extra asset for their population.

Another problem with this model is that it's based on research performed over 20 years ago. Yes, the math is correct, but there might be more advanced and modern literature to go through to hopefully build a more stable version.

Also, we used estimated data: we don’t know the real a, s, and c parameters.

But don't let these flaws take away from the relevance of the topic or the results. Catalan, like many other minority languages, is being spoken less and less as time goes by. And maybe not in 8 generations as we predicted, but these languages could at some point disappear.

Unless we do something… And that includes politicians and governments, who should design strategies to boost its usage, increase its number of speakers, and overcome the danger of extinction.

We should all aim to use as many languages as possible, not the opposite.

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more machine learning-based posts I'm preparing.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] Abrams, D., Strogatz, S. Modelling the dynamics of language death. Nature 424, 900 (2003). https://doi.org/10.1038/424900a

[2] Enquesta d’usos lingüístics de la població de Catalunya. GenCat. https://llengua.gencat.cat/ca/serveis/dades_i_estudis/poblacio/Enquesta-EULP/

[3] Wikipedia contributors. (2023, November 29). Least absolute deviations. In Wikipedia, The Free Encyclopedia. Retrieved 07:05, June 28, 2024, from https://en.wikipedia.org/w/index.php?title=Least_absolute_deviations&oldid=1187408818

[4] solve_ivp – SciPy v1.14.0 Manual

[5] J. R. Dormand, P. J. Prince, "A family of embedded Runge-Kutta formulae", Journal of Computational and Applied Mathematics, Vol. 6, №1, pp. 19–26, 1980.

The post Modeling the Extinction of the Catalan Language appeared first on Towards Data Science.

]]>
Using Clustering Algorithms for Player Recruitment https://towardsdatascience.com/using-clustering-algorithms-for-player-recruitment-98208d3a6cb4/ Mon, 15 Apr 2024 17:37:06 +0000 https://towardsdatascience.com/using-clustering-algorithms-for-player-recruitment-98208d3a6cb4/ Which players could help Fulham overcome their major flaws?

The post Using Clustering Algorithms for Player Recruitment appeared first on Towards Data Science.

]]>
Sports Analytics
Photo by Mario Klassen on Unsplash

Some days ago, I was fortunate to be able to participate in a football analytics hackathon that was organized by xfb Analytics[1], Transfermarkt[2], and Football Forum Hungary[3].

As we recently received permissions to share our work, I decided to write a blog post about the approach I used.

The goal was to pick a Premier League team, analyze their playing style, highlight two flaws, and prepare two lists of 5 players each that could help the team improve. The premise was that we had to look to fill two different positions (hence the "two lists of 5 players each").

Then, from those two lists, we had to pick the top target in each and further explain why they were the best fit for their respective positions.

The final result had to be realistic and the sum of both players’ prices had to be below 60M (we were given their Transfermarkt valuations).

Now that you know what it was about, I want to talk about my approach. I’m a data science guy who loves football so I had to perform some sort of technical analysis or modeling with Python.

Here’s how I’ll structure this post:

  • Introductory Analysis
  • Player Clustering
  • Picking the Defensive Midfielder
  • Picking the Striker
  • Conclusions

Take into account that, as said, this was a hackathon. The time to do it was limited and the resources were quite scarce. With proper time and enough data, the results would have been even better.

Introductory Analysis

When it comes to player recruitment, data is probably our best friend. It doesn’t guarantee anything in the future, but it allows us to understand the past and present of a player in a purely objective manner: his playing style, his profile, advantages and disadvantages…

For that reason, I wanted this project to be 90% based on data, and let common sense reign over the remaining 10%.

The team that I chose was Fulham. Why Fulham? First, I didn't want any top team (i.e. Liverpool, Man City, Arsenal…). Among the remaining 15 teams, I didn't have a preference and knew very little about any of them. All had to have their flaws and strengths, so all were possible options.

I did what someone in my position would do: let another person choose for me. It was my girlfriend and she picked Fulham because they were sitting in the 12th position, our number.

Having never watched Fulham, I started to dig in. I was surprised to see that I actually knew some of their players: Adama Traoré, Willian, Raúl Jiménez, Bernd Leno, and João Palinha. Not bad, right?

After reading some posts and watching some videos, here are some takeaways from their playing style:

  • They play in a 4–2–3–1 formation that turns into a 4–4–2 (or 4–4–1–1) on defense.
  • They like to play in a possession-based style, building from the back.
  • A lot of their attacking play is focused on wide play and getting crosses into the box. On counterattacks, they also try to move the ball towards the flanks.
  • You rarely see Fulham pressing high. They tend to let the opponent come out before facing their mid-block.

Okay, as in any data science project, getting to know the context of the data is crucial. At this point, I feel like we do. It’s now time to find the problems we need to solve (i.e. discover two of their flaws).

Fulham has had major struggles this season in two different areas:

  • Build-up: They’ve lost a lot of balls this season trying to build from the back. The way other teams press them forces them to mostly rely on Palinha with the other defensive midfielder (DM) not being an asset for this possession-based football.
  • Attack: Last season Mitrović was their key player. A traditional striker with lots of goals and quality. He left last summer and the team hasn't been able to find a proper replacement. Muniz is filling that position but he's only scored 7 goals this season, roughly the same number as Jiménez, Iwobi, Bobby Reid, or Willian.

What about the team’s transfer history? We could end this analysis by proposing Haaland and Rodri for the striker (ST) and defensive midfielder (DM) positions but remember we left a 10% for common sense? This is where we apply it.

We want to find players that fit within the team’s budget, philosophy, and status. Seeing Haaland in a Fulham jersey is almost impossible. So, we have to familiarize ourselves with the team’s past expenses:

Player expenses by season for Fulham, as per Transfermarkt[2] - Image by the author

We see a huge shift in the summer of 2022. The team can now afford to spend around 70M and has invested in more expensive players. Therefore, a 60M budget fits well within the team’s position, and spending 30M or 40M on a single player would not be disproportionate.

Player Clustering

Let machine learning enter the conversation.

My goal was to create clusters of players with similar characteristics and find the best possible targets from those clusters that had what we needed.

So, with already cleaned data, I scaled it:

import numpy as np
from sklearn.preprocessing import StandardScaler

# 'data' is the already-cleaned players DataFrame
num_vars = ['minutes_played', 'goals_cnt', 'shots_total_cnt',
       'shots_on_target_cnt', 'off_deff_aerial_duels_won_cnt',
       'fouls_suffered_cnt', 'offsides_cnt', 'tackles_total_cnt',
       'interceptions_total_cnt', 'clearances_total_cnt',
       'fouls_committed_cnt', 'passes_total_cnt', 'passes_accurate_cnt',
       'long_passes_total_cnt', 'long_passes_accurate_cnt']

onehot_vars = [
    'AM', 'D', 'DM', 'FW', 'GK', 'M', 'Sub'
]
num_x = data[num_vars].to_numpy()
onehot_x = data[onehot_vars].to_numpy()

# Scale numerical vars
scaler = StandardScaler()
scaled_x = scaler.fit_transform(num_x)

# Concatenate scaled_x and onehot_x
x = np.concatenate((scaled_x, onehot_x), axis=1)

Here's where the lack of time and resources might hurt the most: I used the set of numerical variables that was available. I would have liked to use more and better features.

However, what I did was pretty simple: separated those numerical variables from the one-hot encodings and scaled the first group using scikit-learn’s StandardScaler[4]. Then, I created the x array by concatenating the scaled numerical features and the one-hot encoding ones.

Those that I one-hot encoded were simply player-position data. I simplified them to have only goalkeepers (GK), defenders (D), defensive midfielders (DM), central midfielders (M), attacking midfielders (AM), forwards (FW), and Subs (Sub).
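
The position columns themselves aren't shown above, but assuming a position column already simplified to those seven labels, the one-hot columns can be built with something as simple as this (the column and player names here are hypothetical, not the hackathon dataset):

import pandas as pd

# Hypothetical example: 'position' already simplified to the seven labels above
df = pd.DataFrame({
    'player': ['Haaland', 'Rodri', 'Leno'],
    'position': ['FW', 'DM', 'GK'],
})

# One 0/1 column per position label, concatenated back to the original frame
onehot = pd.get_dummies(df['position']).astype(int)
data_example = pd.concat([df, onehot], axis=1)
print(data_example)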

Once prepared, it was now time to create the clusters. I didn’t have any labels so I had to go with an unsupervised learning algorithm. The selected one was K-Means.

For those who don’t know, K-Means is a method that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (that’s called the centroid).

We know n is the number of players in our x array (2,306), but what about the number of clusters (k)? The most common approach to determine the optimal value of k is the Elbow method: a graphical method in which the error curve resembles an elbow, and it's at that elbow point that we pick our k.

"Silhouette coefficients" would have been another option to use, but I kept it simple. Here’s how I plotted it:

from sklearn.cluster import KMeans
import seaborn as sns

sse = []
clusters = 50
for k in range(1, clusters):
    kmeans = KMeans(
        init="random",
        n_clusters=k,
        n_init=10,
        max_iter=300,
        random_state=42
    )
    kmeans.fit(x)
    sse.append(kmeans.inertia_)

ax = sns.lineplot(x=range(1, clusters), y=sse)
ax.set_xlabel("Number of Clusters")
ax.set_ylabel("SSE")
Elbow Plot - Image by the author

The optimal value isn't really clear at first sight… It could be anywhere from around 8 to 15. But another way to understand that optimal point is as "the point at which the value begins to decrease almost linearly". So 15 made sense.

# I'll use 15 clusters
k = 15
kmeans = KMeans(
    init="random",
    n_clusters=k,
    random_state=42
)
kmeans.fit(x)
data['cluster'] = kmeans.predict(x)

To analyze those clusters, I’ll plot the parallel coordinates of the cluster centroids:

centroids_df = pd.DataFrame(kmeans.cluster_centers_, columns=num_vars+onehot_vars)
centroids_df['cluster'] = centroids_df.index

fig, ax = plt.subplots(figsize=(12,6))
pd.plotting.parallel_coordinates(centroids_df, 'cluster', ax=ax, colormap='tab20')
fig.autofmt_xdate()
Parallel coordinates for all cluster centroids - Image by the author

This plot alone tells us a lot about each cluster. For example:

  • Cluster 1 has players who score and shoot a lot. They also produce a lot of offsides… So most of these are probably strikers.
  • Cluster 14 has players who play a lot of minutes but "underperform" in all stats except for long passes, in which they are way above the rest. These are probably goalkeepers.

Now that we have our clusters, the goal is to find those in which we want to look for our players. And the approach I used is fairly simple: I just looked where some of the top STs and top DMs were.

  • For strikers, I looked for Lewandowski, Haaland, and Kane among others. They mostly belonged to cluster 1 (as we had already guessed).
  • For defensive midfielders, I looked for Rodri, Kimmich, and Zubimendi (among others). They mostly belonged to cluster 5.

So all that was left was to find player pairs within these two clusters that, combined, had a market value below 60M, with neither player's individual value above 40M. Luckily for me, my dataset also had an "Average rating" column, and I used that to sort those player combinations and pick the best ones.
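
Just to make that last step tangible, here's a rough sketch of how that pairing search could look (the cluster numbers match the ones above, but the market_value and avg_rating column names, and the exact filters, are assumptions rather than the hackathon code):

import itertools
import pandas as pd

# Candidate DMs (cluster 5) and STs (cluster 1), each below the 40M individual cap
dms = data[(data['cluster'] == 5) & (data['market_value'] <= 40)]
sts = data[(data['cluster'] == 1) & (data['market_value'] <= 40)]

# All DM + ST pairs whose combined value stays below the 60M budget
pairs = [
    (dm['player'], st['player'],
     dm['market_value'] + st['market_value'],
     dm['avg_rating'] + st['avg_rating'])
    for (_, dm), (_, st) in itertools.product(dms.iterrows(), sts.iterrows())
    if dm['market_value'] + st['market_value'] <= 60
]

# Rank the pairs by combined average rating and keep the best ones
ranking = pd.DataFrame(pairs, columns=['dm', 'st', 'total_value', 'total_rating'])
print(ranking.sort_values('total_rating', ascending=False).head())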

Considering these filters and removing older players, here are the 5 best DMs based on the cluster:

  • João Neves (Benfica)
  • Johnny Cardoso (Betis)
  • Pierre Lees-Melou (Brest)
  • Anton Stach (Hoffenheim)
  • Stephen Eustaquio (Porto)

While João Neves' level is probably beyond the others', he's not a realistic target for Fulham. Neves is likely going to be signed by a top European club this summer and our team cannot compete against Man United, for example.

So, after applying that 10% of common sense and briefly analyzing their player profiles, the chosen one was Anton Stach. His playing style is extremely balanced, with most stats way above his peers in the same position, and with amazing possession characteristics (which is what we’re looking for).

He even offers added value compared to Palinha – his final third involvement:

Anton Stach's final third vs box involvement - Image from Cube[5] with permission

As for the striker, here are our preferred options after searching in the desired cluster:

  • Serhou Guirassy (Stuttgart)
  • Ivan Toney (Brentford)
  • Victor Boniface (Leverkusen)
  • Jonas Wind (Wolfsburg)
  • Evanilson (Porto)

Here, the decision wasn't as easy. Boniface would probably have been the best target but, again, he might be a top-team target this summer (FYI, he's currently the top scorer of the best team in the Bundesliga).

So the chosen striker was Serhou Guirassy, from Stuttgart. He's a very complete striker, not only with good attacking characteristics but also off-the-charts possession stats compared to other strikers.

What's even more impressive is the number of shots he takes and their quality, making him a player who can score from almost anywhere (and has done so). The two plots below visually explain what I just said:

Shooting plots for Serhou Guirassy - Images from Cube[5] with permission

Conclusion

In this post, I shared the project I built for the Football Analytics Hackathon organized by xfb Analytics, Transfermarkt, and Football Forum Hungary.

I leveraged my data science skills to create a very simple K-Means model to create player clusters, find those with world-class players who play in the positions I needed to fill, and get other, more realistic targets from those clusters for Fulham.

We used the K-means algorithm without a predetermined number of clusters, relying on the Elbow method to determine k. With 15 clusters, we analyzed each of them, found the ones containing top-class players in the positions we needed to fill, and searched there for our potential signings.

I ended up suggesting Anton Stach (15M) for the defensive midfielder position to play alongside Palinha, to be able to build from the back, prioritizing possession. As a striker, I went for Serhou Guirassy (40M) who has been having an amazing season and who I believe could be a great alternative after losing Mitrović.

While this approach might seem extremely niche, it's not. K-means can be used in many situations outside of football or sports: understanding an audience, segmenting clients, or grouping animal species. This unsupervised algorithm can be a powerful tool for data professionals thanks to its simplicity and potential (there are more complex and better approaches too).

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more AI-based posts I'm preparing.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] xfb Analytics

[2] Football transfers, rumours, market values and news | Transfermarkt

[3] Football Forum Hungary 2024

[4] sklearn.preprocessing.StandardScaler

[5] Cube – Football, In depth

The post Using Clustering Algorithms for Player Recruitment appeared first on Towards Data Science.

]]>
Markov Chain Monte Carlo: Made Simple Once and For All https://towardsdatascience.com/markov-chain-monte-carlo-made-simple-once-and-for-all-e86e8384186c/ Fri, 01 Mar 2024 04:55:27 +0000 https://towardsdatascience.com/markov-chain-monte-carlo-made-simple-once-and-for-all-e86e8384186c/ Introduction to MCMC, dividing it into its simplest terms

The post Markov Chain Monte Carlo: Made Simple Once and For All appeared first on Towards Data Science.

]]>
Data Science

Markov Chain Monte Carlo: Made Simple Once and For All

Photo by Karine Avetisyan on Unsplash

I recently posted an article where I used Bayesian Inference and Markov chain Monte Carlo (MCMC) to predict the CL round of 16 winners. There, I tried to explain Bayesian Statistics in relative depth but I didn't say much about MCMC to avoid making the post excessively long. The post:

Using Bayesian Modeling to Predict The Champions League

So I decided to dedicate a full post to introduce Markov Chain Monte Carlo methods for anyone interested in learning how they work mathematically and when they prove to be useful.

To tackle this post, I'll adopt the divide-and-conquer strategy: divide the concept into its simplest parts and explain them individually, then put the big picture back together. So this is what we'll go through:

  • Monte Carlo methods
  • Stochastic processes
  • Markov Chain
  • MCMC

Monte Carlo Methods

A Monte Carlo method or simulation is a type of computational algorithm that relies on repeated random sampling to obtain numerical results, typically the likelihood of a range of possible outcomes.

In other words, a Monte Carlo simulation is used to estimate or approximate the possible outcomes or distribution of an uncertain event.

A simple example to illustrate this is rolling two dice and adding their values. We could easily compute the probability of each outcome analytically, but we could also use Monte Carlo methods to simulate 5,000 dice rolls (or more) and get the underlying distribution.
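
As a quick illustration of that idea (my own snippet, not part of the original example), here's how such a simulation could look with NumPy:

import numpy as np

rng = np.random.default_rng(42)

# Simulate 5,000 rolls of two dice and sum each pair
rolls = rng.integers(1, 7, size=(5000, 2)).sum(axis=1)

# Estimated probability of each possible sum (2 to 12)
values, counts = np.unique(rolls, return_counts=True)
for value, count in zip(values, counts):
    print(f"P({value}) ≈ {count / len(rolls):.3f}")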

Stochastic Processes

Wikipedia’s definition is "A stochastic or random process can be defined as a collection of random variables that is indexed by some mathematical set"[1].

In more readable terms: "it’s any mathematical process that can be modeled with a family of random variables".[2]

Let's use a simple example to understand the concept. Imagine you set up a video camera in your favorite store to check, once every 2 minutes, how many visitors there are. We define X(0) as the initial state, showing the number of visitors seen at t=0. Then, 2 minutes later, we observe X(1), and so on.

The state space is the set of values that our random variables (X(i)) can adopt, and here it goes from 1 up to the maximum store capacity.

One of the properties of a stochastic process is that whatever happens at a specific moment is conditioned on what has happened in the preceding moments. Keeping up with our example, if we have 100 visitors at t=0, the probability of having 100 ± 20 at t=1 is greater than seeing the count drop to 10, for example (if no unexpected event happens). Therefore, these X variables aren't independent.

Markov Chains

A Markov chain is a sequence of random values where each one depends only on the previous value of the sequence.

So it's a stochastic process with one peculiarity: knowing the current state is as good as knowing the entire history. In mathematical terms, we say that a stochastic process is Markovian if X(t+1) conditioned on x(1), x(2),…,x(t) only depends on x(t):

Mathematical expression - Image by the author

Keeping up with our example, for it to be considered a Markov chain we would need the number of visitors at a given time – t – to depend only on the number of visitors we saw at the previous instant – t-1. That's not true in real life, but let's imagine it is. We then define the transition probability as the probability of going from state i to state j at a specific instant:

Transition probability in a Markov chain - Image by the author

And, if that probability is time-independent, we say it’s stationary.

With this transition probability, we now define the transition matrix, which is just a matrix with all transition probabilities:

Markovian transition matrix - Image by the author

This matrix comes in handy when we want to compute the probabilities of transitioning from one state to another in n steps, which is achieved mathematically with power operations on the matrix:

Power operation on the matrix to get transition probability after n steps - Image by the author

Let's now define a new – and admittedly silly – example, in which we consider that a striker's probability of scoring a goal in a football (soccer) match depends only on whether he/she scored in the previous game or not. Because we also suppose it's time-independent – when the match is played doesn't matter – we are working with stationary transition probabilities.

Concretely, if a player scored in the previous match, we assume the probability of scoring again in the next game is 70% (the player is hypermotivated to keep the streak going). If the player doesn’t score, this probability drops to 40%.

Let’s put that into the transition matrix:

Transition matrix for our example - Image by the author

The proper way to read it is: we have two possible outcomes (goal or no goal). Row 1 defines the next game probabilities for the case in which the player has scored; row 2 does the same but for the case in which he/she hasn’t scored. Columns are read similarly: the first one relates to the probabilities of scoring and the second to the probabilities of not scoring.

So, for example, 0.7 is the probability of scoring after having scored in the previous game.

Now, what are the chances that a certain player scores in the n = 2 game knowing that he hasn’t scored today?

Transition matrix for n=2 - Image by the author

If the player hasn't scored today, we focus on the second row. As we're interested in the chances of scoring, we focus on the first column. Where the two intersect we have 0.52 – the probability of scoring two games from now is 52%.

We might also want to work out the marginal distribution of each X(t), and we can do so by using the initial conditions with which the chain was initialized: X(0).

Keeping up with the example, the question would now be: knowing that the player has a 50–50 chance of scoring in the first game, what are the chances that he/she scores two games later?

Transition matrix with marginal distributions - Image by the author

The answer is 0.565, or 56.5%.
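
These numbers come from the images above, but they're easy to double-check with a few lines of NumPy (this snippet is mine, not part of the original example):

import numpy as np

# Transition matrix: row 1 = scored in the previous game, row 2 = didn't score
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Two-step transition probabilities
P2 = np.linalg.matrix_power(P, 2)
print(P2[1, 0])  # P(goal in two games | no goal today) = 0.52

# Marginal distribution two steps ahead, starting from a 50-50 initial state
p0 = np.array([0.5, 0.5])
print(p0 @ P2)  # [0.565, 0.435]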

What's curious about Markov Chains is that, independently of which values we choose for p0, we might end up with the same distribution after a certain number of iterations. That's called a stationary distribution, and this is key for MCMC.

Markov Chain Monte Carlo (MCMC)

Now it’s time to combine both methods together.

MCMC methods constitute Monte Carlo simulations where the samples are drawn from random Markov chain sequences to form a probability distribution. In the case of Bayesian modeling, this stationary distribution will be the posterior distribution.

Simulating the chain for a given number of steps (and discarding the initial ones, what's called the burn-in phase) will get us to the desired distribution. These samples are dependent on each other but, if we only keep one every certain number of iterations (thinning), we make sure the samples we retain are almost independent.

MCMC comes in handy when we want to perform inference for probability distributions where independent samples from the distribution cannot be easily drawn.

Regarding the different MCMC algorithms that exist, we’ll focus on the two more common ones:

  • Gibbs Sampling: this algorithm draws samples from the full conditional distributions. Here, we sample each variable from its distribution conditional on the other variables and iteratively repeat this process. For example, in a case where we have 3 variables, we would simulate the first one by sampling, for each t in 1…N iterations:
Variable update using conditional distribution - Image by the author
  • Metropolis-Hastings: this is usually the alternative to Gibbs when simulating the complete conditionals isn't possible (i.e. when we cannot sample a variable conditioned on all the other ones). It works by proposing a candidate for the next step in the Markov chain – x(cand) – sampled from a simple distribution – q – built around x(t-1). Then we choose whether or not to accept the candidate with a determined probability (if it's not accepted, the chain doesn't change). This probability is defined by:
Acceptance probability in Metropolis-Hastings - Image by the author

Conclusion

In short, MCMC methods consist of drawing random samples conditioned only on the previous value/step and potentially deciding whether we keep them or not. We repeat this multiple times until we form the chains.

To schematize, let’s define the algorithm in a set of steps:

  1. Get/Assign the initial values.
  2. For each iteration: a) Sample the candidates from a distribution that only depends on the previous value (Markov Chain). b) If we’re using the Metropolis-Hastings algorithm, decide whether we accept or reject the candidates by computing and using the acceptance probability. c) Update/Store the new values.
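
To make those steps concrete, here's a minimal random-walk Metropolis-Hastings sketch of my own (not taken from any library): it targets a standard normal distribution, which of course doesn't need MCMC, but it keeps the example short.

import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Log-density of the target distribution (standard normal, up to a constant)
    return -0.5 * x**2

def metropolis_hastings(n_iter=10_000, burn_in=1_000, thin=10, step=1.0):
    x = 0.0                              # 1. initial value
    samples = []
    for t in range(n_iter):
        # 2a. propose a candidate from a simple distribution built around x(t-1)
        x_cand = x + rng.normal(0, step)
        # 2b. accept or reject it using the acceptance probability
        log_alpha = log_target(x_cand) - log_target(x)
        if np.log(rng.uniform()) < log_alpha:
            x = x_cand                   # accepted: the chain moves
        # 2c. store the value (after burn-in, keeping 1 every `thin` iterations)
        if t >= burn_in and t % thin == 0:
            samples.append(x)
    return np.array(samples)

draws = metropolis_hastings()
print(draws.mean(), draws.std())  # should be close to 0 and 1, respectively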

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more AI-based posts I'm preparing.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] Wikipedia contributors. (2024, January 8). Stochastic process. In Wikipedia, The Free Encyclopedia. Retrieved 19:43, February 24, 2024, from https://en.wikipedia.org/w/index.php?title=Stochastic_process&oldid=1194369849

[2] Christopher Kazakis. (2021, January 8th). See the Future with Stochastic Processes. Retrieved 19:43, February 24, 2024, from https://towardsdatascience.com/stochastic-processes-a-beginners-guide-3f42fa9941b5.

The post Markov Chain Monte Carlo: Made Simple Once and For All appeared first on Towards Data Science.

]]>
Using Bayesian Modeling to Predict The Champions League https://towardsdatascience.com/using-bayesian-modeling-to-predict-the-champions-league-8ebb069006ba/ Tue, 20 Feb 2024 15:11:32 +0000 https://towardsdatascience.com/using-bayesian-modeling-to-predict-the-champions-league-8ebb069006ba/ Bayesian Inference Applied to Real-World Examples

The post Using Bayesian Modeling to Predict The Champions League appeared first on Towards Data Science.

]]>
Sports Analytics
Photo by Anders Krøgh Jørgensen on Unsplash

Oh, the Champions League. Possibly the competition that attracts the most fans regardless of the team they support. It’s the best against the best. The show is almost guaranteed… And the outcome is almost impossible to predict.

But that shouldn’t stop us from trying.

The other day I was going through old college assignments and found one that inspired me to write this post, where we'll use Bayesian Inference to create a model that tries to predict the next Champions League matches: the first leg of the Round of 16 (well, it can be used for any match from any round, to be honest).

The aim is to build a model through Bayesian Statistics applied to a real-case scenario that I find to be interesting and entertaining.

Whether you know about Bayesian Inference or not, this post is for you. Even if you already knew about everything I’m about to share, the final predictions will at least serve you to either congratulate or laugh at me after the first leg is played.

Here’s what we’ll go through today:

  • Bayesian Inference & Modeling
  • The Data
  • Building the Model
  • Validating the Model
  • Predictions
  • Conclusion, Improvements, and Future Steps

The full code is on my GitHub page which you’ll find in the resources section[1].

Bayesian Inference & Modeling

If you’re new to this, you’ll need a proper intro with the basic math fundamentals, which is what’s coming next. If you want to go straight to the point, feel free to skip this section.

As per Wikipedia, "Bayesian Inference is a method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available."[2]

That’s the formal definition, but there’s a lot of information here. I think it’s better if we dissect this sentence into different parts:

  • "Bayesian Inference is a method of statistical inference" – Statistical inference is the process of using data analysis and statistics to infer properties of a population, by hypothesis testing and deriving estimates.
  • "in which Bayes’ theorem is used to update the probability for a hypothesis" – Bayes’ theorem[3] describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
  • "as more evidence or information becomes available" – Different from frequentist statistics, the data (here, evidence or information) is not the only factor we take into account, it is only used to update our prior beliefs.

To summarize: Bayesian Inference is the process of inferring properties from a certain population based on Bayes’ theorem by using the prior knowledge we possess (as a distribution) and the data to get the final distribution.

Lots of concepts, and many more still missing, but I won’t go in-depth today. However, we can’t ignore the Bayes’ theorem formula, as it’s the core of the whole process:

Bayes' theorem - Image generated by the author

The formula itself is really simple and chances are you’ve already seen it before. If you haven’t, just take into account that P(A|B) is read as the probability of the event "A" happening given the event "B".

For example, we could read P(rain|clouds) as the probability of rain given that there are clouds in the sky.

Keeping up with the example, where A is "rain" and B is "clouds", the formula would be:

Bayes' theorem using rain/cloud examples - Image by the author

Using this formula we could compute the probability of experiencing rain given that the sky is cloudy. For that, we’d have to know the probability of having a cloudy sky when it’s raining (P(clouds|rain)), the probability of rain at that given moment (P(rain)), and the probability of seeing clouds in the sky (P(clouds)).
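
To make that concrete with some made-up numbers (purely illustrative, not real weather data): if P(clouds|rain) = 0.95, P(rain) = 0.10 and P(clouds) = 0.40, then

$$P(\text{rain}\mid\text{clouds}) = \frac{0.95 \times 0.10}{0.40} \approx 0.24$$

so a cloudy sky would imply roughly a 24% chance of rain.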

But this equation can also be expressed in terms of posterior, likelihood, prior and evidence:

  • Posterior: P(rain | clouds) – is what we’re trying to infer.
  • Likelihood: P(clouds | rain) – is the probability of observing the data that has been observed (clouds in the sky) assuming that the data came from a specific scenario (rain).
  • Prior: P(rain) – The previous information we have about the event we’re interested in.
  • Evidence: P(clouds) – It’s the data itself.
Bayes' theorem in other terms - Image by the author

The evidence is only used to normalize the density and make it a valid probability; it's just a constant. For that reason, it's sometimes omitted, because most of the time we're only interested in the shape of the posterior distribution, not its exact probability density. And the final formula we'll see is

Bayes' theorem in proportional terms - image by the author

meaning that the posterior distribution is proportional to the likelihood times the prior.

Modeling, on its own, refers to the process of finding the statistical model that governs the behavior of any real process.

The Data

We’ve managed to survive the math, now it’s time to dig into the practical case and data. We’re trying to predict the winner of a given match so we’ll define θ as our hypothesis. We know it’s a number and it can either be 1 (the team won) or 0 (the team didn’t win).

So, we can assume that θ follows a Bernoulli distribution.[4] We don’t know the "p" parameter, and that’s what we’ll try to infer using Bayes’ theorem.

Note that θ isn't modeled linearly. It follows, in fact, a distribution from the exponential family and, when this happens, we say we are dealing with a generalized linear model (GLM). In our case, where our variable follows a Bernoulli distribution, we'll use logistic regression.

Logistic regression is a model in which the response variable can only be either 0 or 1, meaning success or failure, where "p" is the success probability. So

Our variable's distribution - Image by the author

where i = 1,…,n are all the matches we have times 2.

As the distribution isn’t linear, we need a link function that provides the relationship between the linear predictor and the mean of the distribution function. It depends on the distribution we’re dealing with but the link function for our case is the logit:

Link function for our Bernoulli variable - Image by the author

We can now provide a linear shape to this logit function:

Modelized p - Image by the author
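
In case the image doesn't render, that linear form of the logit is (with four β coefficients, matching the JAGS model we'll define later):

$$\operatorname{logit}(p_i) = \log\frac{p_i}{1-p_i} = \beta_1 + \beta_2 x_{1i} + \beta_3 x_{2i} + \beta_4 x_{3i}$$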

And these "x" are the data we’re going to use. So Bernoulli’s variable has now become a sort of linear one, which is amazing.

Our model today will be based on the premise of simplicity. The data we’ll be using will only consist of three parameters: the number of goals the team scores, whether they play at home or not, and the average player ratings.

Keeping up with the simplicity, we’ll only be using the team’s data to predict its chances of winning, without caring about the rival. You’ll see why in the next section.

The catch is that we’re only going to use data from the current Champions League campaign (2023–24) – from the group phase. Obtaining open-source football data can be tricky nowadays so we’re going to use only a very limited set of matches to build our model, and we’ll compensate for that with Markov Chain Monte Carlo (MCMC) approximations.

Explaining what MCMC is falls outside of this post's scope. But I'll put a link in the resources section to an article that explains it incredibly well, named "A Gentle Introduction to Markov Chain Monte Carlo for Probability"[5].

Building The Model

Enough intro and theoretical comments. Let’s go straight to the code.

I had two options: Python or R. I chose the latter because I especially like the rjags library for the task we’re about to go through. However, you could easily do the same with Python either from scratch or using any other library you’d like.

JAGS[6] comes from Just Another Gibbs Sampler, and it’s a program for simulation from Bayesian hierarchical models using MCMC. Perfect for our scenario.

So what we’ll do first is build the model:

modelString = "
model{
   # Likelihood
   for(i in 1:n){
       y[i] ~ dbern(pi[i])
       logit(pi[i]) <- beta[1] + beta[2]*goals[i] + beta[3]*is_home[i] + 
                       beta[4]*avg_rating[i]
   }

   # Prior distributions
   for (j in 1:4) {
       beta[j] ~ dnorm(0.0, 0.0001)
   }
}"
writeLines( modelString , con="cl_model.txt" )

In the model, which we store as a string and then write to a TXT file, what we do is define the likelihood and the prior distributions:

  • Likelihood: in a for loop that goes through all the data rows we have (192, which comes from 96 matches times 2 teams per match), we assign the Bernoulli distribution to the response variable for that sample. We then compute the logit function which, as you'll recall, is a linear equation using the three stats we're using to predict.
  • Prior distributions: we don’t know anything about the linear parameters so we’re just making them follow a normal distribution with a standard deviation of 100 and mean 0.

What we do next is initialize the beta values using a normal distribution:

init_values <- function(){
  list(beta = rnorm(4, 0, 3))
}
init_values()
Initialized beta values - Image by the author

Time to import the data. If you've read my articles before, you know I'm a DuckDB fan[7], and that's where I have all the data stored. As the SQL query itself isn't relevant – only the data we extract – I'll omit this step and show the data straightaway:

First 10 rows of data - Screenshot by the author

We’re seeing one row per team and game, with each team’s stats and whether they won or not. For simplicity, the avg_rating feature is rounded.

To create the JAGS model, we’ll need to provide the data. Not just the covariables used to predict but also some variables we’re using in the model like the outcome column y or the static number of observations n. To do so, we just create a list of values like this:

jags_data <- list(y = data$win,
                  goals = data$goals,
                  is_home = data$is_home,
                  avg_rating = data$avg_rating,
                  n = nrow(data))

We have our model defined, prepared the function to initialize values (and executed it) and we already have the data in the proper JAGS format. It’s now time to run the model and find the proper beta values.

JAGS uses different MCMC samplers to approximate the posterior distribution and automatically chooses the proper one, but it needs a prior "adaptation" period. Here’s how we do it:

jagsModel <- jags.model(file = "cl_model.txt",
                        data = jags_data,
                        inits = init_values,
                        n.chains = 3,
                        n.adapt = 1000)

This jags.model function needs the model file, the data we just prepared, and the function that initializes values. Then, we set the number of Markov chains to 3 and the number of iterations for adaptation to 1000.

It then shows some info relevant to the model and its nodes, but we don’t have to do anything with it.

To ensure convergence, we'll apply what's called "burn-in", which is just a way of saying that we discard a certain number of iterations because we don't know whether they fall within the posterior region or not. It's as easy as running the next function:

update(jagsModel, n.iter=10000)

Lastly, we produce the inference by telling JAGS to generate MCMC samples (these samples will then be used to generate the posterior distribution). To do so, we use coda.samples() as follows:

samples <- coda.samples(jagsModel,
                        variable.names = c("beta"),
                        n.iter = 10000,
                        thin = 10)

We're basically passing the model as the first parameter; we then specify the parameters whose values we want to record (beta) and set the number of steps for each chain to 10,000.

The last parameter, thin, is crucial here. As we’re working with Markov Chains, we expect to have autocorrelated samples. For that reason, it’s common to use thinning to only keep 1 every X iterations for our final samples. In this case, I’m using X=10.

Now run the code and let it produce the samples.

Validating the Model

We’re almost there!

We’ve already got our samples. Let’s now check if everything’s fine. To do so we’ll examine the chains. Do you remember that we decided to use 3 and not 1? It’s only for pure comparison purposes.

We’re going to use here an R module created by Kruschke, J. K. (2015) called DBDA2E-utilities.R[8]. From it, we’ll first take advantage of the diagMCMC() function by passing the samples as the first parameter and then the parameter name as the second. We’ll try, for example, with the second one (beta[2]):

diagMCMC(codaObject = samples, parName = "beta[2]")
Chain examinations - Screenshot by the author

This is what we see in each of these graphs:

  • Top-left: Trace plot to see if the chains converge and none is left orphan.
  • Bottom-left: We see how the Gelman-Rubin statistic evolves with each iteration. If you’re not used to it, just take into account that we want it as close to 1 as possible.
  • Top-right: Autocorrelation is shown here and, as we want as little of it as possible, we want to see the lines going down as much as possible.
  • Bottom-right: Density plot to see if the three chains are well superimposed or not.

We’re now confident that our chains look fine. Let’s see the actual posterior distributions, shall we?

From the same file module we just used, we’ll now take advantage of the plotPost() function:

plotPost( samples[,"beta[2]"])
plotPost() output - Screenshot by the author

For now, everything looks reasonably fine. We’re ready to start predicting.

Predictions

Suppose we want to predict Napoli vs Barcelona. To compute the distributions of each team’s winning probabilities, we would need to know in advance the match’s predicted goals and the predicted average player ratings. There’s no way to know these before the match gets played, but we could use two strategies:

  1. Build different models to predict these variables (neural networks, for example).
  2. Use the last-X matches average from these stats.

Today, we’ll go with the second one, and we’ll use the 5-game average to predict the winning chance distributions. This adds value to our predictions because it takes into account the team’s current form.

This is the DF we're going to use, which contains data from all competitions, not just the Champions League:

Random sample of averaged data - Screenshot by the author

As our model predicts team by team, we’ll have to perform two predictions and overlap both histograms into one single plot. Let’s start with a real example: RB Leipzig vs Real Madrid.

I’m finishing this post as of February 15th and this game was just played two days ago, so using it as an example is good to see if the predictions are accurate or not. To be clear: data from yesterday’s matches isn’t used for these predictions.

We need to filter this DF and get each team's values:

# Home team data
home_team = 'RBL'
home_pred_home <- 1
goals_pred_home <- avg_data[avg_data$team == home_team, 'goals_avg']
avg_rating_pred_home <- avg_data[avg_data$team == home_team, 'rating_avg']

# Away team data
away_team = 'Real Madrid'
home_pred_away <- 0 
goals_pred_away <- avg_data[avg_data$team == away_team, 'goals_avg']
avg_rating_pred_away <- avg_data[avg_data$team == away_team, 'rating_avg']

Now that we have valid values for each variable, we need to run the predictions. Recall the logit function allowed us to convert this problem into a linear one, and we found the coefficients through JAGS. Now it’s time to use these and the averages to get the linear predictor values:

predictor_pred_home <- samples[[1]][,1] +
  samples[[1]][,2]*goals_pred_home +
  samples[[1]][,3]*home_pred_home  +
  samples[[1]][,4]*avg_rating_pred_home

predictor_pred_away <- samples[[1]][,1] +
  samples[[1]][,2]*goals_pred_away +
  samples[[1]][,3]*home_pred_away  +
  samples[[1]][,4]*avg_rating_pred_away

And finally, we compute the inverse of the logit function to find the final pi estimate:

pi_pred_home <- as.numeric(exp(predictor_pred_home)/(1+exp(predictor_pred_home)))
pi_pred_away <- as.numeric(exp(predictor_pred_away)/(1+exp(predictor_pred_away)))

To grasp the full value of these predictions, we’ll now plot them into a histogram:

preds <- data.frame(pi_pred_home, pi_pred_away)
names(preds) <- c(home_team, away_team)

ggplot(preds) +
  geom_histogram(aes(x = pi_pred_home, y = ..density.., 
                     color = home_team, fill=home_team),
                 bins = 30, alpha=0.5) +
  geom_histogram(aes(x = pi_pred_away, y = ..density..,     
                     color = away_team, fill=away_team),
                 bins = 30, alpha=0.5) +
  theme_light() +
  xlim(c(0,1)) +
  xlab(expression(pi)) +
  theme(axis.text = element_text(size=15),
        axis.title = element_text(size=16))

And the result:

Leipzig vs Madrid predictions - Screenshot by the author

How do we interpret this? The most direct way to do it is by using the median of both teams’ predictions and comparing them both relatively:

  • For RB Leipzig, the median is at 37.07%.
  • For Real Madrid, the median is at 58.84%.

From these numbers, we could conclude that both teams have been doing quite well recently but Madrid seems to be better though not by much. So the model is inclined towards Madrid but without much certainty.

Taking into account that the match ended 0–1 yesterday, we can be extremely proud of our model.

Copenhagen vs City (1–3), Lazio vs Bayern (0–1), and PSG vs Real Sociedad (2–0) have already been played, so we can use these to assess our model’s accuracy as well:

Already-played games predictions - Screenshot by the author

This is impressive. The model not only predicted these games with 100% accuracy, but it was also able to predict Lazio's higher winning chances vs Bayern, even though they were the underdog and the odds were clearly against them.

Almost everyone would have bet on Bayern as the favorite, even myself, but the model didn't, and Lazio ended up winning 1–0.

So far, 4/4 correctly predicted. So let’s predict the 4 remaining matches being played next week!

Next week's predictions - Screenshot by the author

These look reasonable again:

  • Inter vs Atlético de Madrid: Inter is the clear favorite (personally not so sure about it).
  • PSV vs Dortmund: No clear favorite but Dortmund seems to be in better form and its winning chances seem higher… But a draw wouldn’t be unexpected.
  • Porto vs Arsenal: Arsenal is, without a doubt, the model's favorite. Arsenal's distribution is thin and tall, meaning the confidence in that prediction is higher, while Porto's is quite wide, meaning the model isn't so sure about what it's predicting and the true value lies between a 25% and a 70% chance of winning for the Portuguese team.
  • Napoli vs Barcelona: Similar to the case above, the model doesn't see Napoli winning the game, while for Barça it can't confidently pin down the chances of winning, placing them between 50% and 80% (being a Barça fan myself, that's too high in my opinion).

Now we just have to wait, enjoy the matches, and come back to see how many predictions our Bayesian model got right!

Conclusions, Future Steps, and Improvements

What we built here was rather simple… Yet we can say the predictions were reasonable. More than reasonable, probably, having seen that the model was accurate on all the matches played so far and was even able to predict one underdog's win.

However, this model is too simple to expect great predictions from it. For example, the Napoli vs Barcelona prediction favors Barça because the model doesn’t take into account goals conceded and that’s the team’s major flaw this season.

That’s why I wanted to dedicate a final section to suggest just a few potential improvements to turn this model into a high-class one:

  • Expand the feature set: a proper feature analysis should be performed to assess which of the features we've left out would be great parameters for our model. For example: win%, possession, shots, tackles, xG…
  • Include rival data: the current model predicts the team’s winning distribution based on the team’s stats only, without taking into account the rivals they face. If we added the other team’s info, the model would be more complete.
  • Use more data: Even if we used MCMC, another solution would have been to use a lot more data from past CL seasons to create the model and make it (potentially) more robust.
  • Create a machine learning model to predict the team’s features instead of using simple averages.

Many others could be added to this list. Feel free to implement these on your own!

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more AI-based posts I'm preparing.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] Pol Marín (2024). Bayesian Inference in Champions League. https://github.com/polmarin/bayesian_inference_champions_league

[2] Wikipedia contributors. (2024, January 31). Bayesian inference. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Bayesian_inference&oldid=1201216717

[3] Wikipedia contributors. (2024, January 12). Bayes’ theorem. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Bayes%27_theorem&oldid=1195202672

[4] Wikipedia contributors. (2024, January 25). Bernoulli distribution. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Bernoulli_distribution&oldid=1199096378

[5] Jason Brownlee (2019, September 25). A Gentle Introduction to Markov Chain Monte Carlo for Probability. In Machine Learning Mastery. https://machinelearningmastery.com/markov-chain-monte-carlo-for-probability/

[6] JAGS – Just Another Gibbs Sampler. https://mcmc-jags.sourceforge.io/

[7] Pol Marín (2023, March 16). Forget about SQLite, Use DuckDB Instead – And Thank Me Later. In Medium. https://towardsdatascience.com/forget-about-sqlite-use-duckdb-instead-and-thank-me-later-df76ee9bb777

[8] Kruschke J. K. (2015). GitHub –https://github.com/kyusque/DBDA2E-utilities/

The post Using Bayesian Modeling to Predict The Champions League appeared first on Towards Data Science.

]]>
Your Data Science Visualizations Will Never Be The Same – Plotly & Dash https://towardsdatascience.com/your-data-science-visualizations-will-never-be-the-same-plotly-dash-6327d07d9efb/ Tue, 24 Oct 2023 06:52:39 +0000 https://towardsdatascience.com/your-data-science-visualizations-will-never-be-the-same-plotly-dash-6327d07d9efb/ Using Plotly and Dash to create interactive dashboards

The post Your Data Science Visualizations Will Never Be The Same – Plotly & Dash appeared first on Towards Data Science.

]]>
Data Viz

Your Data Science Visualizations Will Never Be The Same – Plotly & Dash

Photo by Isaac Smith on Unsplash

Not so long ago, I wrote a simple intro to four Python Data Visualization libraries where I showcased their pros and cons, and used practical examples to show what they are capable of.

As we're going to go deeper into the ones I like the most, I highly encourage you to check that article first, as this one expands on what was shown there:

Building Interactive Data Visualizations with Python – The Art of Storytelling

Today we’ll focus on Plotly[1] and Dash[2]. Why two? Because they go hand-in-hand. As I stated in the article linked above, "Dash isn’t a plotting library per se. It’s an amazing framework used to generate dashboards."

So Plotly is the library we use to plot, and Dash is the framework we use to generate cool, interactive dashboards from those plots.

Here’s the set of steps we’ll follow to build today’s dashboard:

  • Setup and installation – to get us in the proper state.
  • Some simple use cases – to show how Plotly works
  • Building a dashboard with Dash – to create the best dashboards.
  • Conclusions – to wrap up the story and see the results.

Before going deeper, we need to talk about the data. We need some sort of data to be able to visualize it, right? Keeping up with most of my latest Medium content, I’ll be focusing on sports and, more concretely, football (soccer).

I’ll be using Statsbomb’s free data[3] from the 2015–16 LaLiga campaign.

There's a lot of data from that season, but I want to visualize Futbol Club Barcelona's players' performance, focusing mostly on attacking metrics: shots, goals, assists…

The purpose might differ based on the analyst’s position: are you a Real Madrid analyst? Then I’m sure you’ll want to decipher how your team can stop Leo Messi (spoiler: you can’t).

But if you work within the Barça organization, you might want to just check your player’s numbers and see where some players perform better than others.

Whatever it is, always make sure you define your goals before creating any dashboard – there’s so much info you can visualize that you have to purposely pick the plots you want to look at.

And always aim for simplicity; non-technical people will have to draw conclusions from your dashboards.

Setup and Installation

I like to keep things ordered and structured. So the first thing we’ll do is create a new directory in whatever path you want your app to be hosted in. I’ll create it on my Desktop, for simplicity. Here are the two commands I run on a terminal:

$ cd ~/Desktop
$ mkdir plotly-dash

Now, the next natural step is to create a new Python environment within the new directory. I’ll use pipenv [4] but you can use your virtualenv management tool of preference (or none).

If you haven’t got pipenv installed in your machine, then run this command first:

$ pip install --user pipenv

Then, create the environment:

$ cd plotly-dash
$ pipenv shell

This will create a new environment and automatically activate it. Anything you install now from that terminal is going to be installed on the environment only.

So let’s start installing libraries using pip:

(plotly-dash) $ pip install dash pandas statsbombpy

Yep, by installing these three we’ll have more than enough. They all have their own dependencies and we’re going to take advantage of some of them like Plotly or NumPy.

With everything set up, we’re now ready to start exploring Plotly.

Visualizing Data with Plotly

My recommendation here is to test it from a jupyter notebook, as it will make your development phase more fluid. In this case, you should also install it – I promise it’s the last installation we run – and we’ll also open it:

(plotly-dash) $ pip install notebook
... (installation outputs)

(plotly-dash) $ jupyter notebook

As always, we’ll need to prepare the data and we’ll create a new notebook called plotly.ipynb. To avoid extremely large notebooks and files, I like to modularize my code. For that reason, I created a src directory within the project folder and added two new files there: functions.py and classes.py. The structure now looks like:

- plotly-dash/
    - src/
        - classes.py
        - functions.py
    - plotly.ipynb

The first function I’ll create is going to be called prepare_team_data() and will return events, shots, and assist data from the specified team (in our case, Barcelona).

As the function itself is not useful for today’s purposes – we want to focus on plotting and creating dashboards – I won’t include its code here. But you have the link to the whole code in the Resources section[5].
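
Still, to give you a rough idea, here’s a minimal sketch of what such a function might do with statsbombpy. The competition/season IDs, the column names, and the extra columns the real function derives (x, y, float_time, minutes…) are assumptions on my part, so treat it as orientation rather than the repo’s actual implementation.

# Hypothetical sketch of prepare_team_data() - not the repo's implementation
import pandas as pd
from statsbombpy import sb


def prepare_team_data(team, competition_id=11, season_id=27):
    """Return events, shots and assists data frames for the given team.

    competition_id=11 / season_id=27 are assumed to be LaLiga 2015/16.
    """
    matches = sb.matches(competition_id=competition_id, season_id=season_id)

    # Concatenate the events of every match in the season
    events = pd.concat(
        [sb.events(match_id=m) for m in matches['match_id']],
        ignore_index=True
    )
    events = events[events['team'] == team]

    # Shots, flagging which ones ended up as goals
    shots = events[events['type'] == 'Shot'].copy()
    shots['goal'] = shots['shot_outcome'] == 'Goal'

    # Assists (passes flagged as goal assists), if the column exists
    if 'pass_goal_assist' in events.columns:
        assists = events[events['pass_goal_assist'] == True].copy()
    else:
        assists = events.iloc[0:0].copy()

    return events, shots, assists

With something like that in place, calling it from the notebook is as simple as: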

# Third-party libraries
import pandas as pd
from statsbombpy import sb

# Custom modules
from src.functions import prepare_team_data

events, shots, assists = prepare_team_data('Barcelona')
shots.head()

And here’s a snapshot of what the shots DF looks like.

shots DF screenshot – image by the author

Good, let’s start with shots then. I want to plot a player’s shot distribution, to see where he shoots from most often and where his goals come from. To do this, I’ve created a FootballPitch class in the classes.py module.

This class allows us to plot a complete football pitch, half of it (the attacking half, that is), or even a heatmap, as we’ll be doing.

Again, you can find the code in the GitHub link[5] in the Resources section at the bottom of this article. But we’re going to inspect it a little bit because here’s where we’ve used a lot of Plotly’s gifts.

The class has basically two methods: plot_pitch() and plot_heatmap(). As we’re first interested in displaying the player’s shots, let’s start with the first one, dividing it into little code chunks.

Note that you will see some variables and class attributes that we haven’t assigned any value to. These are either function parameters or attributes initialized when the object is created.

First things first: let’s declare the essential variables the function will use.

# Fig to update
fig = go.Figure()

# Internal variables
self.height_px = self.pitch_width*10*zoom_ratio
self.width_px = self.pitch_length*10*zoom_ratio

pitch_length_half = self.pitch_length/2 if not self.half else 0
pitch_width_half = self.pitch_width/2
corner_arc_radius = 1

centre_circle_radius = 9.15

goal = 7.32
goal_area_width = goal + (5.5*2)
goal_area_length = 5.5
penalty_area_width = goal_area_width + (11*2)
penalty_area_length = goal_area_length + 11
penalty_spot_dist = 11
penalty_circle_radius = 9.15

Now that we have the figure declared, what we’ll do over and over again is add traces or shapes into it to customize it as we want. So, for example, the first thing the function does is plot a rectangular shape, being it the pitch itself:

fig.add_trace(
    go.Scatter(
        x=[0, self.pitch_length, self.pitch_length, 0, 0], 
        y=[0, 0, self.pitch_width, self.pitch_width, 0], 
        mode='lines',
        hoverinfo='skip',
        marker_color=line_color,
        showlegend=False,
        fill="toself",
        fillcolor=bg_color
    )
)

Here, we add a trace, which is a scatter plot with mode lines – meaning we want a line, not a real scatter plot with independent dots. The parameters are pretty self-explanatory, such as x and y (the data we want to plot), the colors… The hoverinfo parameter determines what we show when we hover the mouse over these lines. As the pitch is part of the background and doesn’t tell us anything about the data we want to analyze, I’m setting it to skip.

Then we set some extra configurations into the figure’s layout:

fig.update_layout(
    yaxis_range=[-self._vertical_margin, self.pitch_width + self._vertical_margin], 
    xaxis_range=[-self._horizontal_margin, self.pitch_length + self._horizontal_margin],
    height=self.height_px,
    width=self.width_px,
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis=dict(showgrid=False, visible=False),
    yaxis=dict(showgrid=False, visible=False)
)

That gives us the following result:

Football pitch (grass only) – image by the author

And we now have our pitch plotted. Not really meaningful… Yet.

Plotting in Plotly is really this easy! By adding some more traces and shapes into the plot, here’s what my pitch background ends up looking like:

Football pitch – image by the author

Now, you might not be interested in displaying a football pitch – that’s why I didn’t put all the code here. But great dashboards are the result of creativity and skill, and plotting a pitch is a great way to display football events that happen on it (when we care about location).

So let’s get going and start displaying real data!

As we want to display shots – and goals – a scatter plot looks like a fair option to use. Remember that we already have the data prepared, we just have to filter it and display it.

Let’s plot Leo Messi’s shots and goals:

import plotly.graph_objects as go
from src.classes import FootballPitch
from src.functions import get_player_shots  # shot-filtering helper (defined in src/functions.py in the repo)

player = 'Leo Messi'

pitch = FootballPitch(half=True)
fig = pitch.plot_pitch(False, bg_color='#C1E1C1') 

player_shots = get_player_shots(player, shots.copy(), pitch)
scatter_colors = ["#E7E657", "#57C8E7"]

for i, group in enumerate([True, False]):
    fig.add_trace(go.Scatter(
        x=player_shots[player_shots['goal'] == group]['x'],
        y=player_shots[player_shots['goal'] == group]['y'],
        mode="markers",
        name='Goal' if group else 'No Goal',
        marker=dict(
            color=scatter_colors[i],
            size=8,
            line=dict(
                color='black',
                width=1
            )
        ),
    ))

fig.update_layout(
    title='Shot distribution'
)

The first part is self-explanatory: we declare variables, instantiate the pitch, store the figure in the fig variable, and run a function that filters the shots data frame to return only that player’s shots.

Then, in a two-iteration loop, we add a scatter trace twice: one for shots that resulted in a goal and one for shots that didn’t (displayed in blue). The result:

Leo Messi’s shot and goal distribution in 2015/16 – image by the author

And what makes Plotly amazing is that this plot is fully interactive. We can play around, hovering the mouse to see the exact shot locations, or hide the non-goal markers to inspect just the scoring shots…

Let’s go ahead now and build a line plot. It will be interactive of course, and we’ll use it to inspect the player’s shots by quarter and also to compare it with his teammates’ and the team’s average.

To do so, we’ll start by grouping shots in a quarterly manner (in 15-minute chunks) for each player. The next part will be plotting the values themselves and playing with line opacity to highlight the current player (Messi).

from plotly.subplots import make_subplots

player = 'Leo Messi'
max_shots = 0
fig = make_subplots()

for p in shots.player.unique():
    player_shots = get_player_shots(p, shots)

    xy = 15 * (player_shots[['float_time', 'minutes']]/15).round()
    xy = xy.groupby(['float_time']).count()[['minutes']]

    max_shots = xy.minutes.max() if xy.minutes.max() > max_shots else max_shots

    fig.add_trace(
        go.Scatter(
            name=p,
            x = xy.index, 
            y = xy.minutes,
            mode='lines',
            opacity=1 if p == player else 0.2
        )
    )

Now that we have all players ready, we’ll add the team’s average as a dashed line. The code does exactly the same as the snippet above but uses team-level data.

# Add team's avg
xy = 15 * (shots[['float_time', 'minutes']]/15).round()
xy = xy.groupby(['float_time']).count()[['minutes']]/len(shots.player.unique())

fig.add_trace(
    go.Scatter(
        name="Team's Average",
        x = xy.index, 
        y = xy.minutes,
        line = go.scatter.Line(dash='dash'),
        marker=None,
        mode='lines'
    )
)

And we’ll end up adding some styling to the layout:

fig.update_xaxes(range=[0, 91])
fig.update_layout(
    #title='Shots by Quarter',
    margin=dict(l=20, r=20, t=5, b=20),
    xaxis = dict(
        tickmode = 'array',
        tickvals = xy.index.values
    ),
    height=200,
    plot_bgcolor="#F9F9F9", 
    paper_bgcolor="#F9F9F9",
    yaxis_range=[-3,max_shots+5]
)

The result:

The green, highlighted line is Leo Messi’s data (as the label shows when hovering over the 60th-minute shot count). For some reason, maybe fatigue, Messi’s shots decreased during the 60–75 minute stretch but increased again in the last minutes of the game.

We see that most of the team’s shot counts decrease during the last 15 minutes, but Leo goes the other way. This says a lot about his impact on the team and his desire to win.

Anyway, enough for the intro. We’ve managed to build two different plots and an appealing pitch background for them. I think we’ve covered more than Plotly’s basics.

Creating the Dashboard

A dashboard is just a combination of plots displayed in an ordered and attractive way. And we already have the plots created – we did it in the last section – so we just need to display them.

Now, it isn’t quite that straightforward. We’ll have to make some changes to the code snippets shared above, but I promise they’ll be tiny.

To complete the dashboard, I’ll add some more plots and functionalities to make it fully interactive.

Having Dash already installed, I’ll create a new file called app.py:

- plotly-dash/
    - src/
        - classes.py
        - functions.py
    - plotly.ipynb
    - app.py

And the file’s template will start being this simple:

from dash import html, Dash, dcc, Input, Output, callback

app = Dash(__name__) 

if __name__ == '__main__':
    app.run(debug=True)

If you went on and executed the file (python app.py), you’d get a message in your terminal like the next one:

(plotly-dash) $ python app.py
Dash is running on http://127.0.0.1:8050/

 * Serving Flask app 'app'
 * Debug mode: on

Go ahead and navigate to http://127.0.0.1:8050/. You’ll see a blank page, but that’s actually your dashboard.

Let’s start adding stuff, shall we? Take a look at the next code.

@callback(
    Output('shot_distribution', 'figure'),
    Input('player_dropdown', 'value')
)
def create_shot_distribution(player):
    pitch = FootballPitch(half=True)
    fig = pitch.plot_pitch(False, bg_color='#C1E1C1', zoom_ratio=0.8) 

    player_shots = get_player_shots(player, SHOTS.copy(), pitch)

    scatter_colors = ["#E7E657", "#57C8E7"]

    for i, group in enumerate([True, False]):
        fig.add_trace(go.Scatter(
            x=player_shots[player_shots['goal'] == group]['x'],
            y=player_shots[player_shots['goal'] == group]['y'],
            mode="markers",
            name='Goal' if group else 'No Goal',
            marker=dict(
                color=scatter_colors[i],
                size=8,
                line=dict(
                    color='black',
                    width=1
                )
            ),
        ))

    fig.update_layout(
        margin=dict(l=20, r=20, t=5, b=20),
    )

    return fig

By now, it should sound familiar. It’s exactly the same code we used to display Messi’s shots… But now, instead of hard-coding the player to be Leo Messi, it comes from the function argument.

And where does this argument come from? Just above the function declaration, we have the callback decorator. These callbacks are what make Dash’s dashboards interactive.

We use them to declare the inputs and outputs of the associated app components. In this case, we’re saying that the function needs the player parameter, which will come from the element with the id player_dropdown (which we haven’t defined yet).

As for the output, we made the function return the fig. Thanks to the callback decorator, the app knows that this will be the figure being used in the shot_distribution element from our dashboard.
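
By the way, this callback relies on some module-level setup that the snippet doesn’t show: the plotting imports, the helpers from src, and a SHOTS data frame loaded once at startup. Here’s a minimal sketch of what that part of app.py might contain, in addition to the dash imports from the template above (the exact layout is an assumption on my part; check the repo for the real thing):

# Hypothetical module-level setup for app.py
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from src.classes import FootballPitch
from src.functions import prepare_team_data, get_player_shots

# Load the data once at startup so every callback can reuse it
EVENTS, SHOTS, ASSISTS = prepare_team_data('Barcelona')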

You probably have many questions right now. How do I define a dropdown or any other interactive component? How do I actually get the shot_distribution element onto the page?

Let’s start with the first question: the dropdown. Dash has its own core components (dcc) and the dropdown is one of them. Creating it is as simple as:

dcc.Dropdown(
    PLAYER_OPTIONS,
    'All players', 
    id='player_dropdown', 
    style={'width': '200px', 'margin': '20px auto', 'text-align': 'left'}
)

This will create a dropdown with all player names as possible options and All players as the default value. But the most important part is the id: here’s where we tell Dash that this dropdown is the one associated with the previous function’s input callback.

In other words, the value this dropdown has will be the player being shown on the shot distribution plot.
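
The PLAYER_OPTIONS list used above is just the player names plus the default entry. One simple way to build it at module level (a sketch based on the player column we saw in the shots data frame) could be:

# All unique shooters plus the default option shown in the dropdown
PLAYER_OPTIONS = ['All players'] + sorted(SHOTS.player.unique().tolist())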

But we still need to place both of these components into our dashboard. The page remains blank.

You’ll need some HTML knowledge now, but basic knowledge will be more than enough (though it can get as complex as you want).

We need to place these components within an HTML code. Dash, again, makes it extremely easy for us to do so. In the case of the dropdown, it can be done by simply wrapping the code with an html.Div component, basically placing the dropdown within a <div></div> HTML element:

filter = html.Div(
    [
        dcc.Dropdown(
            PLAYER_OPTIONS,
            'All players', 
            id='player_dropdown', 
            style={
                'width': '200px', 
                'margin': '20px auto', 
                'text-align': 'left'
            }
        )
    ],
    style={'display': 'inline-block'}
)

The way this works is that html.Div can have many child elements (hence the list), and we can set the element’s CSS style using the style attribute, which is a dictionary. Easy, right?

In the case of the shot distribution graph, here’s the equivalent:

shot_distribution_graph = html.Div(
    [
        html.H2('Shot Distribution'),
        dcc.Graph(id='shot_distribution', figure={})
    ], 
    style={
        'padding': '2%',
        'display': 'inline-block'
    }
)

Same structure, but to display graphs we use the dcc.Graph component and, as you probably guessed, the id attribute is key here too. It links this particular component with the output callback of the function we declared. So whatever is computed there will be displayed here.

We have now wrapped the components with HTML code. But they aren’t being displayed yet. We need to add them to the dashboard’s layout:

app.layout = html.Div([
    shot_distribution_graph, filter
], style={
    'width': '1650px', 
    'margin': 'auto'
})

No secret here; the structure is the same but on a higher level. We’re placing the previous <div></div> elements into a bigger one (the whole page container) and providing some extra styling. Now, if you refresh the page or restart the app, you’ll see your first results:

Shot distribution plot – image by the author

Amazing what we’ve built already, right? This interactivity is powerful.

To finish this section, let’s do the same but with the other plot we built. This time, I’ll paste the whole new code here so you can check it all at once:

# Functions
@callback(
    Output('shots_by_quarter', 'figure'),
    Input('player_dropdown', 'value')
)
def create_shots_by_quarter(player):
    fig = make_subplots()

    max_shots = 0

    for p in SHOTS.player.unique():
        player_shots = get_player_shots(p, SHOTS)

        xy = 15 * (player_shots[['float_time', 'minutes']]/15).round()
        xy = xy.groupby(['float_time']).count()[['minutes']]

        max_shots = xy.minutes.max() if xy.minutes.max() > max_shots else max_shots

        fig.add_trace(
            go.Scatter(
                name=p,
                x = xy.index, 
                y = xy.minutes,
                mode='lines',
                opacity=1 if p == player else 0.2
            )
        )

    # Add team's avg
    xy = 15 * (SHOTS[['float_time', 'minutes']]/15).round()
    xy = xy.groupby(['float_time']).count()[['minutes']]/len(SHOTS.player.unique())

    fig.add_trace(
        go.Scatter(
            name="Team's Average",
            x = xy.index, 
            y = xy.minutes,
            line = go.scatter.Line(dash='dash'),
            marker=None,
            mode='lines'
        )
    )

    fig.update_xaxes(range=[0, 91])
    fig.update_layout(
        margin=dict(l=20, r=20, t=5, b=20),
        xaxis = dict(
            tickmode = 'array',
            tickvals = xy.index.values
        ),
        height=200,
        plot_bgcolor="#F9F9F9", 
        paper_bgcolor="#F9F9F9",
        yaxis_range=[-3,max_shots+5]
    )

    return fig

# Dashboard's layout components
shots_by_quarter = html.Div(
    [
        html.H2('Shots By Quarter', style={'margin-top': '20px'}),
        dcc.Graph(id='shots_by_quarter', figure={})
    ],
    style={
        'padding': '2%'
    }
)

# Create layout
app = Dash(__name__)
app.layout = html.Div([
    shot_distribution_graph, filter, shots_by_quarter
], style={'width': '1650px', 'margin': 'auto'})

# Run app
if __name__ == '__main__':
    app.run(debug=True)

Resulting dashboard with two plots – image by the author

Now, this is functional. But it isn’t really attractive… HTML and CSS will be our tools for making it more visually appealing (even though I’m not good at design).

However, that is outside our scope. Our goal was to create a dashboard, and we’ve done it. This one is really simple, but if you understood everything we did, the final dashboard I shared at the beginning – and will share again in the next section – will hold no secrets for you (again, the code is freely available at the bottom of this article).

Wrapping Up

Today we built a dashboard with two plots and one dropdown, but we can scale it as needed. For example, knowing how to place a dropdown, we know how to place a slider. And what about two?

Everything we learned today can be applied to any data you want to visualize, from economic reports to medical results or ad campaign insights. I chose to apply it to football because I’m deeply passionate about it, but please generalize the knowledge and apply it anywhere.

Knowing how to place two plots, we can create many, many more – and different ones: one showing assists, another showing the player’s influence on the pitch, a comparison between his goals and his expected goals… And with all this plus a little bit of HTML and CSS, we get the final dashboard:

Final dashboard – image by the author

I really hope you can see how good this tool is.

Dash and Plotly must be in any data analyst’s skillset. They are amazing libraries we can use to share our data and insights in a way that’s highly customized – i.e. adapted to your needs – and easy to comprehend.

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] Plotly: Low-code Data App Development

[2] Dash Documentation and User Guide – Plotly

[3] Free Data | StatsBomb

[4] Pipenv: Python Dev Workflow for Humans

[5] Plotly & Dash Project Code – GitHub

The post Your Data Science Visualizations Will Never Be The Same – Plotly & Dash appeared first on Towards Data Science.

]]>
AI Search Algorithms - A Deep Dive Into The Most Popular Ones https://towardsdatascience.com/ai-search-algorithms-a-deep-dive-into-the-most-popular-ones-2760679368bb/ Wed, 23 Aug 2023 19:32:30 +0000 https://towardsdatascience.com/ai-search-algorithms-a-deep-dive-into-the-most-popular-ones-2760679368bb/ Going through 4 of the most used search algorithms in AI

The post AI Search Algorithms - A Deep Dive Into The Most Popular Ones appeared first on Towards Data Science.

]]>
AI Search Algorithms: A Deep Dive into the Most Popular Ones

Going through four of the most-used search algorithms in AI

It’s as if we didn’t have enough humans on Earth: we’ve been trying for years to create machines that behave like us. We create mathematical models, or agents, that act rationally, so we don’t have to rely on other human beings’ decisions.

Search algorithms were the go-to approach for a long time but, with the rise of machine and deep learning, they’ve taken a bit of a back seat. However, I think all data scientists should know about them because they are an amazing toolset that proves useful in many situations.

They can be applied in many situations, but the most representative ones are games: tic-tac-toe, mazes, even chess… And we’ll use these to explain the algorithms we’ll be visiting today.

We’ll be introducing four of the most famous ones and we’ll be expanding a little bit on them, using some practical and visual examples.

As always, refer to the Resources section at the bottom of this article for more info and code.

But before that, we need to introduce some definitions to understand some key terms.

Terminology

  • Agent: it’s the human, model, or algorithm that interacts with its environment.
  • State: a particular environment or set of elements that represent the problem at a given moment.
  • State space: the set of all possible reachable states.
  • Goal state: the final state, where certain conditions are met.
  • Action: a function or decision the agent must make in a given state to move to the next one.

We could define more terms, but these are the most basic ones and enough for today’s topic.

Representation

AI Search Algorithms are usually explained using graphs and we won’t change the status quo today.

Here’s a sample graph:

Sample graph we’ll use to explain the concepts – image by the author

In a graph, each node is a state. We always have the initial state (A in this case, in green) and the goal state (which I’ve decided will be E, in red). It’s through the actions we defined in the previous section that our agent moves from one state to the next, repeating the process until it reaches the goal state.

All the nodes, from A to F, define the state space.

What differs then is the algorithm we use to get from A to E. And that’s what we’ll start exploring next.
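
Before jumping into the algorithms, it’s worth seeing how a graph like this can be represented in code. A common choice is an adjacency dictionary; the edges below are read off the figure, so treat them as an assumption of the sketch (we’ll reuse it in the snippets that follow):

# Sample graph as an adjacency dictionary (edges read off the figure above)
GRAPH = {
    'A': ['B'],
    'B': ['C', 'F'],
    'C': ['E'],
    'D': ['E'],
    'E': [],
    'F': ['D', 'E'],
}

START, GOAL = 'A', 'E'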

Depth First Search

The Depth First Search (DFS)[1] algorithm is one of the uninformed search algorithms, in which the only information we have is the one provided in the problem definition.

In other words, in an uninformed search algorithm, we know where we want to go – we want to reach the goal state E – but we don’t know how close or far we are from it.

The Depth First Search in particular keeps on moving forward until it reaches a leaf node. At each bifurcation it might encounter, it randomly selects one direction and keeps on moving forward.

If the leaf node we reach is the goal state, then we’re done. If it’s not, the algorithm backtracks to the last bifurcation and explores the next unvisited branch. This process is repeated recursively until we get to the goal state.

Let’s see it now:

The first action applied – image by the author

From the initial state A, we can only move towards B. So, after the action takes place, our agent will be in that second state. Then comes the bifurcation, and we need to choose. As DFS is an uninformed type of algorithm, it chooses randomly because we know nothing about which one is best. Let’s suppose it chose F instead of C.

DFS second decision – image by the author

As the algorithm goes deep before exploring other branches, it again has to decide between D and E. If it chose E, we’d be done. But we’re unlucky this time and it randomly goes to D.

Possible DFS implementation – image by the author

At this point, the algorithm has visited A, B, F, and D. Being at D, there’s only one direction possible and it takes us to E. There, we’re done.

This algorithm is simple yet effective in many cases. It follows a last-in, first-out strategy, which is why it can be implemented using a stack.

In our example, the optimal solution gets us from A to E in 3 steps, but DFS doesn’t guarantee finding it. In fact, the path we followed took 4 actions.
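
To make the stack idea concrete, here’s a minimal iterative sketch of DFS using the GRAPH dictionary from before. Note that, unlike the walkthrough, it simply follows the order in which neighbors are stored instead of choosing at random:

def dfs(graph, start, goal):
    """Depth First Search: a stack makes the frontier last-in, first-out."""
    frontier = [(start, [start])]          # stack of (node, path so far)
    visited = set()

    while frontier:
        node, path = frontier.pop()        # pop the most recently added node
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour in graph[node]:
            frontier.append((neighbour, path + [neighbour]))

    return None                            # no path found


print(dfs(GRAPH, START, GOAL))             # ['A', 'B', 'F', 'E'] with the GRAPH above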

Breadth-First Search

Breadth First Search or BFS[2] is another type of uninformed search algorithm and we could say it’s the opposite of DFS: instead of going deep branch by branch, it doesn’t reach the next depth level until all nodes from the previous level have been explored.

While we used a stack in DFS, here we use a queue: it follows the first-in-first-out strategy.

Again, let’s traverse the graph to see it visually. It goes from A to B as we saw previously; we have no alternative here. Then we arrive at the bifurcation at B and, as it’s an uninformed algorithm, we simply choose randomly between C and F. We’ll basically be exploring both; it’s just a matter of which one we explore first.

Second step of BFS: we’ve explored A, B, C and F – image by the author

We’ve explored level 1 (node A), level 2 (node B), and level 3 (nodes C and F). We now move on to level 4, where we explore the final leaves – E and D.

Again, whether we explore E before D or vice versa is a completely arbitrary decision:

  • If C went first in the B bifurcation, then we would reach E right after that.
  • But if F went first, then a random decision would be made as to whether we explore D or E first.
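
In code, the only real change with respect to the DFS sketch is the frontier: a FIFO queue (a deque here) instead of a stack. Again, a minimal sketch reusing the same GRAPH dictionary:

from collections import deque


def bfs(graph, start, goal):
    """Breadth First Search: a queue makes the frontier first-in, first-out."""
    frontier = deque([(start, [start])])   # queue of (node, path so far)
    visited = {start}

    while frontier:
        node, path = frontier.popleft()    # pop the oldest node instead
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, path + [neighbour]))

    return None


print(bfs(GRAPH, START, GOAL))             # ['A', 'B', 'C', 'E'] with the GRAPH above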

Both BFS and DFS have their merits, but they also have their flaws: each will be best in some cases and inefficient in others. These are simple algorithms and, while still useful, we can do better.

Greedy Best-First Search

This is the first informed search algorithm we’ll be visiting today. In this kind of algorithm, we do have information about the goal state, and our duty is to use that information to choose wisely before each step. We define the term heuristic as this piece of information that tells us how close or far we are from the goal.

The Greedy Best-First Search (GBFS)[3] algorithm is simple: we always move toward the node that appears closest to the goal, and the heuristic h(x) is what measures that closeness.

In the typical maze game, the closeness could be expressed by the number of pixels/tiles away we are from it, both in the vertical and horizontal direction.

Let’s consider an example in which we want to go from A to B, and we’ve already played up to a tricky decision point. If we were using BFS or DFS, the selection would be random. But we’re using informed algorithms now.

Check the following maze, in which each tile has a number defining how many tiles away we are from the goal. In other words, the heuristic is the number of steps needed to go from the current tile to the goal, assuming a maze without walls in which we can only move horizontally or vertically:

Sample maze with closeness heuristic values – image by the author

If we use GBFS, the closeness heuristic suggests we go up: going up decreases the number of steps needed to reach B from 2 to 1, while going down increases it from 2 to 3. In other words, going up gets us closer to the goal and going down moves us further away.

This isn’t perfect, because we’re only taking into account how close we are to B, but not the actual walls we have between us and the goal state. That’s something we just cannot control with GBFS.
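
In code, that tile-based closeness is just the Manhattan distance between the current tile and the goal tile (a tiny sketch, assuming (row, column) coordinates):

def manhattan(tile, goal):
    """Number of horizontal plus vertical tiles between two positions."""
    return abs(tile[0] - goal[0]) + abs(tile[1] - goal[1])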

For consistency, let me go back to the graph representation we’ve been using all along, only this time it will be a whole new graph. We still want to get from A to E but, this time, we have information at each node about how close we are to the end state.

We’ll use the letter h to give each node an imaginary heuristic value.

Sample graph with heuristics – image by the author

The proposed example works the same way the maze does: every action that moves us from state to state gets us either one unit closer to the goal, one unit further away, or keeps us at the same distance. The optimal path is clear: from A to C to E. It’s just a matter of moving toward the node with the smaller h(x) at each bifurcation.

But it could have been different. Imagine the heuristic didn’t differ by at most one unit from node to node but depended on several environmental factors. In that case, the optimal path wouldn’t change in our example, because it’s a really simple one, but it would be a totally different story in a more complex graph.
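
As a sketch, GBFS looks a lot like the previous algorithms, except the frontier is now a priority queue ordered by h(x) alone. The heuristic values below are made up just to run the example on the first sample graph, not the ones from this new figure:

import heapq


def gbfs(graph, h, start, goal):
    """Greedy Best-First Search: always expand the node with the lowest h(x)."""
    frontier = [(h[start], start, [start])]    # priority queue ordered by h
    visited = set()

    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour in graph[node]:
            if neighbour not in visited:
                heapq.heappush(frontier, (h[neighbour], neighbour, path + [neighbour]))

    return None


# Made-up heuristic values for the first sample graph, just to run the sketch
H = {'A': 3, 'B': 2, 'C': 1, 'D': 1, 'E': 0, 'F': 2}
print(gbfs(GRAPH, H, START, GOAL))             # ['A', 'B', 'C', 'E'] with these values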

Greedy Best-First Search is a powerful algorithm but it isn’t perfect. While it considers how close we are to the goal, it doesn’t take into account the actual costs of having gone that far.

Believe it or not, the cost of getting to a state is as important as the estimated cost of reaching the goal from it.

Get ready for the next algorithm.

A* Search

The A* Search algorithm[4] is a variation of GBFS but now, as stated, we also take into account the cost of getting to each node. In other words, traversing different edges might not have the same cost.

We’ll define:

  • Forward cost – h(x). It’s the same heuristic used in GBFS: the distance between the current state and the goal state.
  • Backward cost – g(x). It’s the cost of going from the initial state to the current state. It’s a cumulative cost, in the sense that it sums all the previous costs.

The goal of A* Search then is to find a path where the sum of both costs is the least. In mathematical terms, the overall cost of the node x is given by f(x) = g(x) + h(x).

To illustrate this, I’ll be using the initial graph we’ve already used. We still want to get from A to E and the costs are shown now:

New graph to perform the A* Search in – image by the author

Take a moment to think about what the optimal path could be here. The numbers next to the nodes are the forward costs, and the numbers next to the edges are the step costs that accumulate into the backward cost.

Let’s try to decipher it now, going step by step:

  1. We have no option other than going from A to B. The cost of reaching B is f(B) = 2 + 2 = 4.
  2. From B, we can either go to C or F. Here we pause and check which one is lower. We see that f(C) = 7 + 1 = 8 and f(F) = 3 + 1 = 4, so we determine F to be the better option.
  3. From F, we again have two options: we can go to E and reach the goal state or go towards D. Most would choose E because we wouldn’t need any extra steps, but we’re smart and we’ll let the math do the work: f(D) = 6 + 1 = 7 and f(E) = 13 + 0 = 13. D is therefore a better option than E. However, remember that we still have an open path at B, where we could go towards C. But f(C) is 8, still above 7, so we keep moving in the F->D direction.
  4. From D we can only go to E (or move backward to F, which doesn’t make sense). The cost is f(E) = 7 + 0 = 7. As it is still lower than f(C) = 8, we’re done here.

So the optimal path is finally A->B->F->D->E.

Optimal path using A* Search – image by the author
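
A sketch of A* is one small change away from the GBFS one: the priority becomes f(x) = g(x) + h(x), so edges now need weights. The weighted graph and heuristic values below are made up so the sketch runs; they’re not the exact numbers from the figure:

import heapq


def a_star(graph, h, start, goal):
    """A* Search: expand the node with the lowest f(x) = g(x) + h(x)."""
    # Note: no closed/visited set here - fine for this small acyclic example
    frontier = [(h[start], 0, start, [start])]   # (f, g, node, path)

    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for neighbour, step_cost in graph[node]:
            new_g = g + step_cost                # backward cost so far
            new_f = new_g + h[neighbour]         # plus the forward (heuristic) cost
            heapq.heappush(frontier, (new_f, new_g, neighbour, path + [neighbour]))

    return None, float('inf')


# Made-up weighted version of the sample graph: node -> [(neighbour, edge cost)]
WEIGHTED = {
    'A': [('B', 2)],
    'B': [('C', 6), ('F', 1)],
    'C': [('E', 4)],
    'D': [('E', 1)],
    'E': [],
    'F': [('D', 1), ('E', 10)],
}
H2 = {'A': 4, 'B': 3, 'C': 3, 'D': 1, 'E': 0, 'F': 2}

print(a_star(WEIGHTED, H2, 'A', 'E'))            # (['A', 'B', 'F', 'D', 'E'], 5) with these values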

Conclusion

Put this knowledge into practice.

Aside from the small sketches above, I haven’t walked through a full implementation here, because I believe building it on your own is the best way to understand how these algorithms work. However, in the Resources section below you’ll find a link to a GitHub repo[5] I’ve created where I’ve coded three of the four algorithms we’ve seen today.

My implementations are simple and not the most efficient. My goal in providing the code was to let you play with it and improve it your way by making an iterative function recursive, by creating more complex classes…

Go ahead and try to implement your own maze solver using DFS, BFS, GBFS, or A* Search. Don’t just copy my code; use it as a template (you might find some surprises there).

Be creative and resourceful, and don’t stop here. Try other algorithms we haven’t covered here like minimax or uniform cost search. Google is your friend.

Today I’ve covered the basics of these four algorithms showing how they work with visual examples and following a set of steps. I used simple examples but the process is the same on larger graphs.

The key here is, as data scientists, to understand them, know how to use them, and, most importantly, identify situations in which they could be useful for us.

Search algorithms are just a part of AI and they’re still key even though they’ve been around for a while.

Go ahead and try them out!

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

If you’d like to support me further, consider subscribing to Medium’s Membership through the link you find below: it won’t cost you any extra penny but will help me through this process.

Join Medium with my referral link – Pol Marin

Resources

[1] Depth-First Search – Wikipedia

[2] Breadth-First Search – Wikipedia

[3] Best-First Search – Wikipedia

[4] A* Search Algorithm – Wikipedia

[5] AI Search Algorithms Repo – GitHub


The post AI Search Algorithms - A Deep Dive Into The Most Popular Ones appeared first on Towards Data Science.

]]>