Feature Engineering with Microsoft Fabric and PySpark

Fabric Madness part 2

A huge thanks to Martim Chaves, who co-authored this post and developed the example scripts.

In our previous post we took a high-level view of how to train a machine learning model in Microsoft Fabric. In this post, we wanted to dive deeper into the process of feature engineering.

Feature engineering is a crucial part of the development lifecycle for any Machine Learning (ML) system. It is the step in the development cycle where raw data is processed to better represent its underlying structure and to provide additional information that enhances our ML models. Feature engineering is both an art and a science. Even though there are specific steps that we can take to create good features, sometimes it is only through experimentation that good results are achieved. Good features are crucial in guaranteeing good system performance.

As datasets grow exponentially, traditional feature engineering may struggle with the size of very large datasets. This is where PySpark can help – as it is a scalable and efficient processing platform for massive datasets. A great thing about Fabric is that it makes using PySpark easy!

In this post, we’ll be going over:

  • How does PySpark Work?
  • Basics of PySpark
  • Feature Engineering in Action

By the end of this post, hopefully you’ll feel comfortable carrying out feature engineering with PySpark in Fabric. Let’s get started!

How does PySpark work?

Spark is a distributed computing system that allows for the processing of large datasets with speed and efficiency across a cluster of machines. It is built around the concept of a Resilient Distributed Dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs are the fundamental data structure of Spark, and they allow for the distribution of data across a cluster of machines.

PySpark is the Python API for Spark. It allows for the creation of Spark DataFrames, which are similar to Pandas DataFrames, but with the added benefit of being distributed across a cluster of machines. PySpark DataFrames are the core data structure in PySpark, and they allow for the manipulation of large datasets in a distributed manner.

At the core of PySpark is the SparkSession object, which is what fundamentally interacts with Spark. This SparkSession is what allows for the creation of DataFrames, and other functionalities. Note that, when running a Notebook in Fabric, a SparkSession is automatically created for you, so you don’t have to worry about that.

Now that we have a rough idea of how PySpark works, let's get to the basics.

Basics of PySpark

Although Spark DataFrames may remind us of Pandas DataFrames due to their similarities, the syntax when using PySpark can be a bit different. In this section, we’ll go over some of the basics of PySpark, such as reading data, combining DataFrames, selecting columns, grouping data, joining DataFrames, and using functions.

The Data

The data we are looking at is from the 2024 US college basketball tournaments. It was obtained from the ongoing March Machine Learning Mania 2024 Kaggle competition, the details of which can be found here, and is licensed under CC BY 4.0 [1].

Reading data

As mentioned in the previous post of this series, the first step is usually to create a Lakehouse and upload some data. Then, when creating a Notebook, we can attach it to the created Lakehouse, and we’ll have access to the data stored there.

PySpark DataFrames can read various data formats, such as CSV, JSON, Parquet, and others. Our data is stored in CSV format, so we'll be using that, as in the following code snippet:

# Read women's data
w_data = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv(f"Files/WNCAATourneyDetailedResults.csv")
    .cache()
)

In this code snippet, we’re reading the detailed results data set of the final women’s basketball college tournament matches. Note that the "header" option being true means that the names of the columns will be derived from the first row of the CSV file. The inferSchema option tells Spark to guess the data types of the columns – otherwise they would all be read as strings. .cache() is used to keep the DataFrame in memory.

If you’re coming from Pandas, you may be wondering what the equivalent of df.head() is for PySpark – it’s df.show(5). The default for .show() is the top 20 rows, hence the need to specifically select 5.
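For instance, to take a quick look at the women's results DataFrame we just read:

# Display the first 5 rows (the PySpark counterpart of df.head(5) in Pandas)
w_data.show(5)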

Combining DataFrames

Combining DataFrames can be done in multiple ways. The first we will look at is a union, where the columns are the same for both DataFrames:

# Read women's data
...

# Read men's data
m_data = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv(f"Files/MNCAATourneyDetailedResults.csv")
    .cache()
)

# Combine (union) the DataFrames
combined_results = m_data.unionByName(w_data)

Here, unionByName joins the two DataFrames by matching the names of the columns. Since both the women’s and the men’s detailed match results have the same columns, this is a good approach. Alternatively, there’s also union, which combines two DataFrames, matching column positions.

Selecting Columns

Selecting columns from a DataFrame in PySpark can be done using the .select() method. We just have to indicate the name or names of the columns that are relevant as a parameter.

# Selecting a single column
w_scores = w_data.select("WScore")

# Selecting multiple columns
teamid_w_scores = w_data.select("WTeamID", "WScore")

Here's the output for w_scores.show(5), a single column with the winning team's score for each match (the values shown here are illustrative):

+------+
|WScore|
+------+
|    68|
|    82|
|    75|
|    61|
|    90|
+------+
only showing top 5 rows

The columns can also be renamed when being selected using the .alias() method:

winners = w_data.select(
    w_data.WTeamID.alias("TeamID"),
    w_data.WScore.alias("Score")
)

Grouping Data

Grouping allows us to carry out certain operations for the groups that exist within the data and is usually combined with an aggregation function. We can use .groupBy() for this:

# Grouping and aggregating
winners_average_scores = winners.groupBy("TeamID").avg("Score")

In this example, we are grouping by "TeamID", meaning we’re considering the groups of rows that have a distinct value for "TeamID". For each of those groups, we’re calculating the average of the "Score". This way, we get the average score for each team.

Here’s the output of winners_average_scores.show(5), showing the average score of each team:

+------+-----------------+
|TeamID|       avg(Score)|
+------+-----------------+
|  3125|             68.5|
|  3345|             74.2|
|  3346|79.66666666666667|
|  3376|73.58333333333333|
|  3107|             61.0|
+------+-----------------+

Joining Data

Joining two DataFrames can be done using the .join() method. Joining is essentially extending the DataFrame by adding the columns of one DataFrame to another.

# Joining on Season and TeamID
final_df = matches_df.join(stats_df, on=['Season', 'TeamID'], how='left')

In this example, both stats_df and matches_df were using Season and TeamID as unique identifiers for each row. Besides Season and TeamID, stats_df has other columns, such as statistics for each team during each season, whereas matches_df has information about the matches, such as date and location. This operation allows us to add those interesting statistics to the matches information!

Functions

There are several functions that PySpark provides that help us transform DataFrames. You can find the full list here.

Here’s an example of a simple function:

from pyspark.sql import functions as F

# "winners" has a "Score" column (renamed from "WScore" above)
winners = winners.withColumn("HighScore", F.when(F.col("Score") > 80, "Yes").otherwise("No"))

In the code snippet above, a "HighScore" column is created based on the value of the "Score" column. For each row, the "Score" value (accessed via the .col() function) is checked by the .when() function: if it is larger than 80, "HighScore" is set to "Yes"; .otherwise(), it is set to "No".

Feature Engineering in Action

Now that we have a basic understanding of PySpark and how it can be used, let’s go over how the regular season statistics features were created. These features were then used as inputs into our machine learning model to try to predict the outcome of the final tournament games.

The starting point was a DataFrame, regular_data, that contained match-by-match statistics for the regular seasons. The regular season is the United States college basketball season that runs from November to March each year.

Each row in this DataFrame contained the season, the day the match was held, the ID of team 1, the ID of team 2, and other information such as the location of the match. Importantly, it also contained statistics for each team for that specific match, such as "T1_FGM", meaning the Field Goals Made (FGM) for team 1, or "T2_OR", meaning the Offensive Rebounds (OR) of team 2.

The first step was selecting which columns would be used. These were columns that strictly contained in-game statistics.

# Columns that we'll want to get statistics from
boxscore_cols = [
    'T1_FGM', 'T1_FGA', 'T1_FGM3', 'T1_FGA3', 'T1_OR', 'T1_DR', 'T1_Ast', 'T1_Stl', 'T1_PF', 
    'T2_FGM', 'T2_FGA', 'T2_FGM3', 'T2_FGA3', 'T2_OR', 'T2_DR', 'T2_Ast', 'T2_Stl', 'T2_PF'
]

If you’re interested, here’s what each statistic’s code means:

  • FGM: Field Goals Made
  • FGA: Field Goals Attempted
  • FGM3: Field Goals Made from beyond the 3-point line
  • FGA3: Field Goals Attempted from beyond the 3-point line
  • OR: Offensive Rebounds. A rebound happens when a shot misses and the ball bounces off the rim or backboard. If the team that attempted the shot regains possession, it's an "Offensive" rebound; otherwise, it's a "Defensive" rebound.
  • DR: Defensive Rebounds
  • Ast: Assists, passes that lead directly to a made basket
  • Stl: Steals, when possession of the ball is taken away from the opponent
  • PF: Personal Fouls committed by a player

From there, a dictionary of aggregation expressions was created. For each column name in the previous list, an expression was stored that calculates the mean of that column and renames it by adding a "mean" suffix.

from pyspark.sql import functions as F

# Map each box-score column to an expression that computes its mean
# and renames the result with a "mean" suffix
agg_exprs = {col: F.mean(col).alias(col + 'mean') for col in boxscore_cols}

Then, the data was grouped by "Season" and "T1_TeamID", and the aggregation functions of the previously created dictionary were used as the argument for .agg().

season_statistics = regular_data.groupBy(["Season", "T1_TeamID"]).agg(*agg_exprs.values())

Note that the grouping was done by season and the ID of team 1 – this means that "T2_FGAmean", for example, will actually be the mean of the Field Goals Attempted by the opponents of T1, not of any specific team. So we need to rename columns like "T2_FGAmean" to something like "T1_opponent_FGAmean".

# Rename columns: keep the "T1_" prefix for team 1's own stats,
# and use "T1_opponent_" for the stats of team 1's opponents
for col in boxscore_cols:
    if col.startswith('T1_'):
        season_statistics = season_statistics.withColumnRenamed(col + 'mean', 'T1_' + col[3:] + 'mean')
    else:
        season_statistics = season_statistics.withColumnRenamed(col + 'mean', 'T1_opponent_' + col[3:] + 'mean')

At this point, it's important to mention that the regular_data DataFrame actually has two rows for each match that occurred. This is so that both teams can appear as "T1" and as "T2" for each match. This little "trick" is what makes these statistics useful.

Note that we "only" have the statistics for "T1". We "need" the statistics for "T2" as well – "need" in quotations because there are no new statistics being calculated. We just need the same data, but with the columns having different names, so that for a match between "T1" and "T2", we have statistics for both T1 and T2. So, we created a mirror DataFrame where, instead of "T1_…mean" and "T1_opponent_…mean" columns, we have "T2_…mean" and "T2_opponent_…mean" columns. This is important because, later on, when we join these regular season statistics to tournament matches, we'll be able to have statistics for both team 1 and team 2.

season_statistics_T2 = season_statistics.select(
    *[F.col(col).alias(col.replace('T1_opponent_', 'T2_opponent_').replace('T1_', 'T2_')) if col not in ['Season'] else F.col(col) for col in season_statistics.columns]
)

Now, there are two DataFrames, with season statistics for "both" T1 and T2. Since the final DataFrame will contain the "Season", the "T1_TeamID" and the "T2_TeamID", we can add these newly created features with a join!

tourney_df = tourney_df.join(season_statistics, on=['Season', 'T1_TeamID'], how='left')
tourney_df = tourney_df.join(season_statistics_T2, on=['Season', 'T2_TeamID'], how='left')

Elo Ratings

First created by Arpad Elo, Elo is a rating system for zero-sum games (games where one player wins and the other loses), like basketball. With the Elo rating system, each team has an Elo rating, a value that generally conveys the team’s quality. At first, every team has the same Elo, and whenever they win, their Elo increases, and when they lose, their Elo decreases. A key characteristic of this system is that this value increases more with a win against a strong opponent than with a win against a weak opponent. Thus, it can be a very useful feature to have!

We wanted to capture the Elo rating of a team at the end of the regular season and use that as a feature for the tournament. To do this, we calculated the Elo for each team on a per-match basis. For this particular feature, we found it more straightforward to use Pandas.

Central to Elo is calculating the expected score for each team. It can be described in code like so:

# Function to calculate expected score
def expected_score(ra, rb):
    # ra = rating (Elo) team A
    # rb = rating (Elo) team B
    # Elo function
    return 1 / (1 + 10 ** ((rb - ra) / 400))

Considering a team A and a team B, this function computes the expected score of team A against team B.
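As a quick illustration with made-up ratings, a 200-point Elo advantage corresponds to roughly a 76% expected score:

# Illustrative values only: team A rated 1600, team B rated 1400
print(expected_score(1600, 1400))  # ~0.76
print(expected_score(1400, 1600))  # ~0.24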

For each match, we would update the teams’ Elos. Note that the location of the match also played a part – winning at home was considered less impressive than winning away.

# Function to update Elo ratings, keeping T1 and T2 terminology
def update_elo(t1_elo, t2_elo, location, T1_Score, T2_Score):
    expected_t1 = expected_score(t1_elo, t2_elo)
    expected_t2 = expected_score(t2_elo, t1_elo)

    actual_t1 = 1 if T1_Score > T2_Score else 0
    actual_t2 = 1 - actual_t1

    # Determine K based on game location
    # The larger the K, the bigger the impact
    # team1 winning at home (location=1) less impressive than winning away (location = -1)
    if actual_t1 == 1:  # team1 won
        if location == 1:
            k = 20
        elif location == 0:
            k = 30
        else:  # location = -1
            k = 40
    else:  # team2 won
        if location == 1:
            k = 40
        elif location == 0:
            k = 30
        else:  # location = -1
            k = 20

    new_t1_elo = t1_elo + k * (actual_t1 - expected_t1)
    new_t2_elo = t2_elo + k * (actual_t2 - expected_t2)

    return new_t1_elo, new_t2_elo

To apply the Elo rating system, we iterated through each season’s matches, initializing teams with a base rating and updating their ratings match by match. The final Elo available for each team in each season will, hopefully, be a good descriptor of the team’s quality.

import pandas as pd

def calculate_elo_through_seasons(regular_data):

    # For this feature, using Pandas
    regular_data = regular_data.toPandas()

    # Set value of initial elo
    initial_elo = 1500

    # List to collect the final Elo rating of every team in every season
    final_elo_list = []

    for season in sorted(regular_data['Season'].unique()):
        print(f"Processing Season: {season}")
        # Initialize Elo ratings dictionary
        elo_ratings = {}

        # Get the teams that played in the season
        season_teams = set(regular_data[regular_data['Season'] == season]['T1_TeamID']).union(set(regular_data[regular_data['Season'] == season]['T2_TeamID']))

        # Initialize season teams' Elo ratings
        for team in season_teams:
            if (season, team) not in elo_ratings:
                elo_ratings[(season, team)] = initial_elo

        # Update Elo ratings per game
        season_games = regular_data[regular_data['Season'] == season]
        for _, row in season_games.iterrows():
            t1_elo = elo_ratings[(season, row['T1_TeamID'])]
            t2_elo = elo_ratings[(season, row['T2_TeamID'])]

            new_t1_elo, new_t2_elo = update_elo(t1_elo, t2_elo, row['location'], row['T1_Score'], row['T2_Score'])

            # Overwrite so that only each team's latest (end-of-season) rating is kept
            elo_ratings[(season, row['T1_TeamID'])] = new_t1_elo
            elo_ratings[(season, row['T2_TeamID'])] = new_t2_elo

        # Collect final Elo ratings for the season
        for team in season_teams:
            final_elo_list.append({'Season': season, 'TeamID': team, 'Elo': elo_ratings[(season, team)]})

    # Convert list to DataFrame
    final_elo_df = pd.DataFrame(final_elo_list)

    # Separate DataFrames for T1 and T2
    final_elo_t1_df = final_elo_df.copy().rename(columns={'TeamID': 'T1_TeamID', 'Elo': 'T1_Elo'})
    final_elo_t2_df = final_elo_df.copy().rename(columns={'TeamID': 'T2_TeamID', 'Elo': 'T2_Elo'})

    # Convert the pandas DataFrames back to Spark DataFrames
    final_elo_t1_df = spark.createDataFrame(final_elo_t1_df)
    final_elo_t2_df = spark.createDataFrame(final_elo_t2_df)

    return final_elo_t1_df, final_elo_t2_df

Ideally, we wouldn't have to iterate through the matches one by one to determine each team's final Elo for the season. However, we couldn't come up with a better approach. Do you have any ideas? If so, let us know!

Value Added

The Feature Engineering steps demonstrated show how we can transform raw data – regular season statistics – into valuable information with predictive power. It is reasonable to assume that a team’s performance during the regular season is indicative of its potential performance in the final tournaments. By calculating the mean of observed match-by-match statistics for both the teams and their opponents, along with each team’s Elo rating in their final match, we were able to create a dataset suitable for modelling. Then, models were trained to predict the outcome of tournament matches using these features, among others developed in a similar way. With these models, we only need the two team IDs to look up the mean of their regular season statistics and their Elos to feed into the model and predict a score!

Conclusion

In this post, we looked at some of the theory behind Spark and PySpark, how that can be applied, and a concrete practical example. We explored how feature engineering can be done in the case of sports data, creating regular season statistics to use as features for final tournament games. Hopefully you’ve found this interesting and helpful – happy feature engineering!

The full source code for this post and others in the series can be found here.


Originally published at https://nobledynamic.com on April 8, 2024.

References

[1] Jeff Sonas, Ryan Holbrook, Addison Howard, Anju Kandru. (2024). March Machine Learning Mania 2024. Kaggle. https://kaggle.com/competitions/march-machine-learning-mania-2024

Fabric Madness

Predicting basketball games with Microsoft Fabric

Image by author and ChatGPT. "Design an illustration, focusing on a basketball player in action, the design integrates sports and data analytics themes in a graphic novel style" prompt. ChatGPT, 4, OpenAI, 28 March 2024. https://chat.openai.com.

A huge thanks to Martim Chaves, who co-authored this post and developed the example scripts.

At the time of writing, it's basketball season in the United States, and there is a lot of excitement around the men's and women's college basketball tournaments. The format is single elimination, so over the course of several rounds teams are eliminated, until eventually we get a champion. This tournament is not only a showcase of upcoming basketball talent but, more importantly, a fertile ground for data enthusiasts like us to analyse trends and predict outcomes.

One of the great things about sports is that there is lots of data available, and we at Noble Dynamic wanted to take a crack at it πŸ€“.

In this series of posts titled Fabric Madness, we’re going to be diving deep into some of the most interesting features of Microsoft Fabric, for an end-to-end demonstration of how to train and use a machine learning model.

In this first blog post, we’ll be going over:

  • A first look at the data using Data Wrangler.
  • Exploratory Data Analysis (EDA) and Feature Engineering
  • Tracking the performance of different Machine Learning (ML) Models using Experiments
  • Selecting the best performing model using the ML Model functionality

The Data

The data used was obtained from the ongoing March Machine Learning Mania 2024 Kaggle competition, the details of which can be found here, and is licensed under CC BY 4.0 [1].

Among all of the interesting data available, our focus for this case study was on the match-by-match statistics. This data was available for both the regular seasons and the tournaments, going all the way back to 2003. For each match, besides the date, the teams that were playing, and their scores, other relevant features were made available, such as field goals made and personal fouls by each team.

Loading the Data

The first step was creating a Fabric Workspace. Workspaces in Fabric are one of the fundamental building blocks of the platform, and are used for grouping together related items and for collaboration.

After downloading all of the CSV files available, a Lakehouse was created. A Lakehouse, in simple terms, is a mix between a Database of Tables (structured) and a Data Lake of Files (unstructured). The big benefit of a Lakehouse is that data is available for every tool in the workspace.

Uploading the files was done using the UI:

Fig. 1 – Uploading Files. Image by Martim Chaves

Now that we have a Lakehouse with the CSV files, it was time to dig in, and get a first look at the data. To do that, we created a Notebook, using the UI, and attached the previously created Lakehouse.

Fig. 2 – Adding Lakehouse to Notebook. Image by Martim Chaves

First Look

After some quick data wrangling, it was found that, as expected with data from Kaggle, the quality was great, with no duplicates or missing values.

For this task we used Data Wrangler, a tool built into Microsoft Fabric notebooks. Once an initial DataFrame has been created (Spark or Pandas supported), Data Wrangler becomes available to use and can attach to any DataFrame in the Notebook. What’s great is that it allows for easy analysis of loaded DataFrames.

In a Notebook, after reading the files into PySpark DataFrames, we selected "Transform DataFrame in Data Wrangler" from the "Data" section and, from there, explored the several DataFrames. Specific DataFrames can be chosen for a careful inspection.

Fig. 3 – Opening Data Wrangler. Image by Martim Chaves
Fig. 4 – Analysing the DataFrame with Data Wrangler. Image by Martim Chaves

In the centre, we have access to all of the rows of the loaded DataFrame. On the right, a Summary tab shows that there are indeed no duplicates or missing values. Clicking on a column shows the summary statistics of that column.

On the left, in the Operations tab, there are several pre-built operations that can be applied to the DataFrame. These cover many of the most common data wrangling tasks, such as filtering, sorting, and grouping, and are a quick way to generate boilerplate code for those tasks.

In our case, the data was already in good shape, so we moved on to the EDA stage.

Exploratory Data Analysis

A short Exploratory Data Analysis (EDA) followed, with the goal of getting a general idea of the data. Charts were plotted to get a sense of the distribution of the data and if there were any statistics that could be problematic due to, for example, very long tails.

Fig. 5 – Histogram of field goals made. Image by Martim Chaves

At a quick glance, it was found that the data available from the regular season had normal distributions, suitable to use in the creation of features. Knowing the importance that good features have in creating solid predictive systems, the next sensible step was to carry out feature engineering to extract relevant information from the data.

The goal was to create a dataset where each sample's input would be a set of features for a game, containing information about both teams, for example, both teams' average field goals made during the regular season. The target for each sample, the desired output, would be 1 if Team 1 won the game, or 0 if Team 2 won the game (derived from the difference in the scores).

Feature Engineering

The first feature that we decided to explore was win rate. Not only would it be an interesting feature to explore, but it would also provide a baseline score. This initial approach employed a simple rule: the team with the higher win rate would be predicted as the winner. This method provides a fundamental baseline against which the performance of more sophisticated predictive systems can be compared.

To evaluate the accuracy of our predictions across different models, we adopted the Brier score. The Brier score is the mean of the square of the difference between the predicted probability (p) and the actual outcome (o) for each sample, and can be described by the following formula:

Brier score = (1/N) * Σ (pᵢ – oᵢ)²

The predicted probability will vary between 0 and 1, and the actual outcome will either be 0 or 1. Thus, the Brier score will always be between 0 and 1. As we want the predicted probability to be as close to the actual outcome as possible, the lower the Brier score, the better, with 0 being the perfect score, and 1 the worst.

For the baseline, the previously mentioned dataset structure was followed. Each sample of the dataset was a match, containing the win rates for the regular season for Team 1 and Team 2. The actual outcome was considered 1 if Team 1 won, or 0 if Team 2 won. To simulate a probability, the prediction was a normalised difference between T1’s win rate and T2’s win rate. For the maximum value of the difference between the win rates, the prediction would be 1. For the minimum value, the prediction would be 0.
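To make the baseline concrete, here is a minimal sketch of how such a Brier score could be computed. The column names (T1_win_rate, T1_Score, and so on) are assumptions for illustration; the original implementation isn't shown here.

import numpy as np

def baseline_brier(df):
    # df is assumed to hold one row per tournament match with
    # T1_win_rate, T2_win_rate, T1_Score and T2_Score columns
    diff = df["T1_win_rate"] - df["T2_win_rate"]

    # Normalise the difference to [0, 1] to act as a pseudo-probability
    pred = (diff - diff.min()) / (diff.max() - diff.min())

    # Actual outcome: 1 if Team 1 won, 0 if Team 2 won
    outcome = (df["T1_Score"] > df["T2_Score"]).astype(int)

    return np.mean((pred - outcome) ** 2)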

After calculating the win rate, and then using it to predict the outcomes, we got a Brier score of 0.23. Considering that guessing at random leads to a Brier score of 0.25, it’s clear that this feature alone is not very good 😬 .

By starting with a simple baseline, it became clear that more complex patterns were at play. We went ahead and developed another 42 features, in preparation for utilising more complex algorithms, namely machine learning models, that might have a better chance.

It was then time to create machine learning models!

Models & Machine Learning Experiments

For the models, we opted for simple Neural Networks (NN). To determine which level of complexity would be best, we created three different NNs, with an increasing number of layers and hyper-parameters. Here’s an example of a small NN, one that was used:

Fig. 6 – Diagram of a Neural Network. Image by Martim Chaves using draw.io

If you’re familiar with NNs, feel free to skip to the Experiments! If you’re unfamiliar with NNs think of them as a set of layers, where each layer acts as a filter for relevant information. Data passes through successive layers, in a step-by-step fashion, where each layer has inputs and outputs. Data moves through the network in one direction, from the first layer (the model’s input) to the last layer (the model’s output), without looping back, hence the Sequential function.

Each layer is made up of several neurons, which can be described as nodes. The model's input, the first layer, will contain as many neurons as there are features available, and each neuron will hold the value of a feature. The model's output, the last layer, in binary problems such as the one we're tackling, will only have 1 neuron. The value held by this neuron should be 1 if the model is processing a match where Team 1 won, or 0 if Team 2 won. The intermediate layers have an ad hoc number of neurons; in the example sketched below, 64 neurons are used.

In a Dense layer, as is the case here, each neuron in the layer is connected to every neuron in the preceding layer. Fundamentally, each neuron processes the information provided by the neurons from the previous layer.

The processing of the previous layer’s information requires an activation function. There are many types of activation functions – ReLU, standing for Rectified Linear Unit, is one of them. It allows only positive values to pass and sets negative values to zero, making it effective for many types of data.

Note that the final activation function is a sigmoid function – this converts the output to a number between 0 and 1. This is crucial for binary classification tasks, where you need the model to express its output as a probability.
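The original code snippet isn't reproduced on this page, but a small network like the one described could be sketched in Keras along these lines (the number of input features and the training configuration are assumptions; the actual model used in the post may differ):

from tensorflow import keras
from tensorflow.keras import layers

num_features = 42  # placeholder: number of input features in the dataset

model = keras.Sequential([
    keras.Input(shape=(num_features,)),
    layers.Dense(64, activation="relu"),    # intermediate Dense layer with 64 neurons
    layers.Dense(1, activation="sigmoid"),  # single output neuron as a probability
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])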

Besides these small models, medium and large models were created, with an increasing number of layers and parameters. The size of a model affects its ability to capture complex patterns in the data, with larger models generally being more capable in this regard. However, larger models also require more data to learn effectively – if there’s not enough data, issues may occur. Finding the right size is sometimes only possible through experimentation, by training different models and comparing their performance to identify the most effective configuration.

The next step was running the experiments βš—!

What is an Experiment?

In Fabric, an Experiment can be seen as a group of related runs, where a run is an execution of a code snippet. In this context, a run is a training of a model. For each run, a model will be trained with a different set of hyper-parameters. The set of hyper-parameters, along with the final model score, is logged, and this information is available for each run. Once enough runs have been completed, the final model scores can be compared, so that the best version of each model can be selected.

Creating an Experiment in Fabric can be done via the UI or directly from a Notebook. The Experiment is essentially a wrapper for MLFlow Experiments. One of the great things about using Experiments in Fabric is that the results can be shared with others. This makes it possible to collaborate and allow others to participate in experiments, either writing code to run experiments, or analysing the results.
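Because Fabric Experiments wrap MLflow, logging a run from a Notebook can be done with the standard MLflow API. The experiment name, parameters, and metric values below are purely illustrative:

import mlflow

mlflow.set_experiment("fabric-madness")  # hypothetical experiment name

with mlflow.start_run():
    # Log the hyper-parameters used for this run
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_layers", 3)
    # Log the final score of the trained model
    mlflow.log_metric("brier_score", 0.20)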

Creating an Experiment

To create an Experiment using the UI, simply select Experiment from the + New button and choose a name.

Fig. 7 – Creating an Experiment using the UI. Image by Martim Chaves

When training each of the models, the hyper-parameters are logged with the experiment, as well as the final score. Once completed we can see the results in the UI, and compare the different runs to see which model performed best.

Fig. 8 – Comparing different runs. Image by Martim Chaves

After that we can select the best model and use it to make the final prediction. When comparing the three models, the best Brier score was 0.20, a slight improvement πŸŽ‰ !

Conclusion

After loading and analysing data from this year’s US major college basketball tournament, and creating a dataset with relevant features, we were able to predict the outcome of the games using a simple Neural Network. Experiments were used to compare the performance of different models. Finally, the best performing model was selected to carry out the final prediction.

In the next post we will go into detail on how we created the features using PySpark. Stay tuned for more! 👋

The full source code for this post can be found here.


Originally published at https://nobledynamic.com on April 1, 2024.

References

[1] Jeff Sonas, Ryan Holbrook, Addison Howard, Anju Kandru. (2024). March Machine Learning Mania 2024. Kaggle. https://kaggle.com/competitions/march-machine-learning-mania-2024

College Basketball’s NET Rankings, Explained

How data science drives March Madness

Photo by Jacob Rice on Unsplash

If you are a college basketball fan, you are starting to salivate because March Madness is just around the corner. If you are new to the college basketball scene, March Madness is the name for the NCAA tournament crowning the champion of Division I men’s basketball. Whether you are a burgeoning fan or a 50-year veteran spectator, data science is playing a larger role than ever before in how you experience the game. The teams comprising the tournament are chosen in large part by a data-driven algorithm called NET rankings.

As a data scientist or machine learning engineer, it is important to understand how the field can impact different industries, including sports and entertainment. While college basketball is a late adopter of a growing trend, the NET rankings are a prime example of how the way we shape our algorithms can influence outcomes and incentivize behaviors. If you work in sports analytics, understanding NET rankings is an absolute must, but regardless of your industry, the college basketball world has laid out an important case study in using data science to improve their product and grow their revenue.

Quick introduction to March Madness

For those of you who have never heard of March Madness, this blog requires some additional context: March Madness is a 68-team men's college basketball tournament that runs from mid-March to early April every year. The winner of the tournament is crowned the National Champion. To start the tournament, there are four play-in games called the "First Four". After these four games, the remaining 64 teams are then divided into four regions of 16 teams ranked 1–16. The champion of each region makes it to the semi-finals called the "Final Four".

Much of the discussion throughout the season revolves around March Madness. There is broad interest in the tournament, as it is often an excellent excuse to wager money among friends or in Las Vegas. Of the 68 teams that make up the field, 31 are conference champions and 37 receive "at-large" bids [3]. The study and chatter of how these teams will be organized in the tournament is called "bracketology." Bracketology is more of an art than a science, however. Deciding who receives the "at-large" bids is a topic of constant debate. This is where NET rankings come into play.

Introduction to NET Rankings

Back in 2018, the NCAA first released a new ranking system called the NCAA Evaluation Tool or NET [1]. The ranking system is a collaboration with Google Cloud Professional Services aimed at providing a data-driven indicator of the quality of a given college basketball team. When the rankings were first released, they relied on five different metrics: Team Value Index, Net Efficiency, win percentage, adjusted win percentage, and scoring margin [2]. However, since then, the rankings have been adjusted to include only Team Value Index and Net Efficiency [1].

There is certainly a debate as to whether this is the best system for determining the quality of teams among sports writers and basketball fans. Regardless of the various opinions of NET rankings, it is used as the basis for decision making by the NCAA selection committee to determine which teams receive an "at-large" bid and how to assign rankings within a region (these rankings are called seeds). All of these decisions can affect the outcome of the tournament. Thus, you can start to see how Data Science underlies the bedrock of March Madness.

Calculating NET Rankings

NET rankings are driven by data science. The NCAA tweeted this graphic in 2018 to explain the metric:

According to that graphic, the Team Value Index is a function of the game result, opponent, and location. The algorithm to calculate the Team Value Index is not published and is therefore a black box, but we know for sure that an important component of the Team Value Index is opponent quality. The NET rankings subdivide opponent quality into four quadrants, aptly named Quad 1, Quad 2, Quad 3, and Quad 4. According to [4], here is how the quadrants are defined (a small code sketch classifying a single game follows the list):

  • Quad 1: "Home games vs. opponents with NET ranking of 1–30, Neutral games vs. opponents with NET ranking of 1–50, Away games vs. opponents with NET ranking of 1–75" [4]
  • Quad 2: "Home games vs. opponents with NET ranking of 31–75, Neutral games vs. opponents with NET ranking of 51–100, Away games vs. opponents with NET ranking of 76–135" [4]
  • Quad 3: "Home games vs. opponents with NET ranking of 76–160, Neutral games vs. opponents with NET ranking of 101–200, Away games vs. opponents with NET ranking of 135–240" [4]
  • Quad 4: "Home games vs. opponents with NET ranking of 161–363, Neutral games vs. opponents with NET ranking of 201–363, Away games vs. opponents with NET ranking of 241–363" [4]
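As a sketch, a single game's quadrant could be derived from the opponent's NET ranking and the game location like this. The thresholds mirror the definitions quoted above; this is for illustration and not an official implementation:

def game_quadrant(opponent_net_rank, location):
    """Classify a game as Quad 1-4, given the opponent's NET ranking and
    the game location from the team's perspective: 'home', 'neutral' or 'away'."""
    # Upper NET-rank cutoffs for Quad 1, 2 and 3 at each location
    cutoffs = {
        'home':    (30, 75, 160),
        'neutral': (50, 100, 200),
        'away':    (75, 135, 240),
    }
    q1, q2, q3 = cutoffs[location]
    if opponent_net_rank <= q1:
        return 1
    if opponent_net_rank <= q2:
        return 2
    if opponent_net_rank <= q3:
        return 3
    return 4

# Example: a road game against the NET #40 team counts as a Quad 1 game
print(game_quadrant(40, 'away'))  # 1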

The Quad system inherently captures features of opponent strength and location. Therefore, regardless of the output Team Value Index, the selection committee focuses heavily on Quad 1 wins and Quad 4 losses when assigning "at-large" bids and tournament seeds.

Net Efficiency, on the other hand, is extremely transparent. Net Efficiency is a function of offensive and defensive efficiency [2]. Offensive efficiency is calculated as:

O = PF/(FGA – OREB+TO+.475*FTA)

Where O is offensive efficiency, PF is points for (total points scored), FGA is field goal attempts (number of shots), OREB is offensive rebounds, TO is turnovers, and FTA is free throw attempts [2].

Defensive efficiency is calculated as:

D = PA/(Opp_FGA – Opp_OREB+Opp_TO+.475*Opp_FTA)

Where D is defensive efficiency, PA is points against, Opp_FGA is opponent’s field goal attempts, Opp_OREB is opponent’s offensive rebounds, Opp_TO is opponent’s turnovers, and Opp_FTA is opponent’s free throw attempts [2].

Net efficiency is simply the difference between offensive and defensive efficiency, or NE = O – D [2]. Net efficiency is a dense metric and captures a team's performance relative to their opponent in a holistic manner.
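For concreteness, here is a small sketch of these efficiency formulas in code; the season totals in the example call are made up for illustration:

def offensive_efficiency(pf, fga, oreb, to, fta):
    # Points scored per estimated possession
    return pf / (fga - oreb + to + 0.475 * fta)

def defensive_efficiency(pa, opp_fga, opp_oreb, opp_to, opp_fta):
    # Points allowed per estimated opponent possession
    return pa / (opp_fga - opp_oreb + opp_to + 0.475 * opp_fta)

def net_efficiency(o, d):
    # Net Efficiency is the difference between the two
    return o - d

# Illustrative totals only
o = offensive_efficiency(pf=2200, fga=1800, oreb=300, to=350, fta=600)
d = defensive_efficiency(pa=2100, opp_fga=1850, opp_oreb=280, opp_to=330, opp_fta=580)
print(net_efficiency(o, d))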

Example NET Team Sheet

So, how does this all come together for the NCAA selection committee? The answer is not completely clear. Obviously, they will have access to the NET rankings. In addition, they will have access to a report on each team in the form of a NET sheet. Each team’s NET sheet is split up into several sections. Across the top of the sheet, there is the NET rank, information on the team record, strength of schedule, opponent average NET rankings, other result-based and predictive rankings, and the win-loss record broken down by opponent quadrant and game location. The bottom half of the sheet is a game-by-game breakdown of team performance divided into sections by opponent NET ranking/quadrant. I encourage you to look at some example NET team sheets such as the one I have below from [5].

Photo by NET – Nitty Gritty Report with Team Sheets for NCAA Men's College Basketball | WarrenNolan.com, used with permission.

While it isn't the most beautiful data visualization ever created, there is a lot of information packed into a tight space. On the team sheets, the Quad 1 and Quad 2 games are further divided into upper and lower halves. Also notice that non-conference games are highlighted in blue and delineated in the metrics above. Losses are highlighted in red so as to easily point out bad (Quad 4) losses or great (Quad 1) wins. As you can tell, the quadrant system plays a key role in the presentation of the data.

Limitations of NET Rankings

I know there are many basketball fans out there who are critical of the NET ranking system. No model is perfect, and NET is no exception. However, I will try to highlight some limitations that I see in the NET ranking system from the perspective of a data scientist (who is also a college basketball fan).

The biggest limitation I see with the NET ranking system is that it does not take recency into account [1]. While it's true that consistency over an entire season is valuable and laudable, there is something to be said for peaking at the right time in the season. Whether it is conditioning, chemistry, or confidence, everything needs to align perfectly to have success during March Madness. In basketball speak, these are the "intangibles". They are not easily measured (although some have tried), but they change over time and do affect outcomes. In econometrician speak, this is the "heterogeneity" inherent to the model.

Another curiosity of the NET rankings that I will categorize as a limitation is the delay in their initial release. The NET rankings are updated daily but not until early December – after most teams have played between 5 and 10 games. I think this likely signifies that there is a highly uncertain initialization state for the NET rankings. It would be interesting to know whether each team begins the season in a specific ranking or quadrant based on historical data, subjective intuition, or a random distribution. If we were able to see the initial state of the NET rankings before the first tip-off of the season, I think we could gain some very valuable insight into how the algorithm works. Is it completely naive or is there an element of transfer learning from seasons prior or other polls’ preseason rankings?

To tabulate a final NET ranking, I assume that there is some manner in which Team Value Index is converted to a numerical value and it is combined with Net Efficiency to calculate a weighted metric of team quality. I will admit that NET rankings could very well be a heuristic or other non-AI algorithm. Certainly, the manner in which the Net Efficiency statistic is calculated would suggest that the NCAA would be open to a heuristic-type approach. Moreover, my experience with providing data science insight into a non-technical realm, such as health policy, has shown me that sometimes less is more. More understandable models can sometimes be more attractive to decision makers.

Nonetheless, my third and final limitation relies on the assumption that this is a supervised learning algorithm. If NET rankings are derivative of a supervised learning algorithm, then I wonder where the training data might come from. What would be the baseline truth? How is accuracy measured? What truly distinguishes team #232 from team #233? Even when comparing the same team to itself year over year, you could be comparing wildly different rosters. It would be hard to find meaning in an error metric like root mean squared error.

Hypothesizing the underlying algorithm

So, how does the NET ranking system come together? Perhaps we should try to re-create it? We do know a couple of things for certain:

  1. The former gold standard statistical model for college basketball rankings, the RPI ranking system, was an elegant but simple heuristic algorithm. Institutions like the NCAA are not necessarily known for innovation, and I doubt the college basketball community wants to feel that its crown jewel tournament is driven by a non-interpretable AI algorithm. So, my best guess is that there is limited, if any, machine learning at play. Harkening back to the third limitation I mentioned earlier, a supervised learning approach is probably more trouble than it is worth.
  2. The NET rankings are in some sense recursive. The NET ranking of a team is dependent on the NET rankings of its opponents which are dependent on the NET rankings of its opponents, and so on and so forth. NET rankings could be driven by a Bayesian approach whereby there is an initial naive distribution assumed for each team, and after each game, that distribution is updated.
  3. Google Cloud Professional Services are involved. This might be a great example of cognitive bias or clever marketing, but I want to believe that whatever Google touches uses cutting-edge methodology. While not necessarily true, partnering with Google gives the NCAA access to massive computational resources and the ability to develop methods beyond the traditional sports analytics realm. Even if the algorithm is interpretable, perhaps the structure is complex and potentially even counterintuitive.
  4. The historical NET rankings are difficult to find. After about an hour of searching the web, it was hard to find any sources that publish the NET rankings each day. This makes me skeptical enough to directly contradict my supposition in point 3. Perhaps, the algorithm is simple enough, that it could be easily re-engineered with access to a season’s worth of data and NET rankings. Perhaps, we could fit a simple linear regression to produce a score value for each team and the NET rankings are a sorted list of the resulting scores.

Given that there are many possible underlying methods to producing the ultimate NET rankings, I believe the most likely scenario is that the NCAA is using ensemble learning, such as voting. This means that they could be taking multiple approaches to producing a NET ranking as a function of Team Value Index and Net Efficiency. Then, they combine the results of these methods to develop a final NET ranking that gets published each week.
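Purely as a thought experiment, and not a description of the NCAA's actual method, the re-engineering idea from point 4 could be sketched like this, with made-up per-team inputs standing in for a real season's worth of data:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: one row per team, columns = [team_value_index, net_efficiency]
X = np.array([[0.80, 0.12],
              [0.55, 0.05],
              [0.30, -0.02],
              [0.10, -0.10]])
published_ranks = np.array([1, 2, 3, 4])  # made-up published NET ranks

# Fit a simple linear model mapping the two components to the published rank
model = LinearRegression().fit(X, published_ranks)
scores = model.predict(X)

# A candidate re-engineered ranking: team indices ordered best to worst
order = np.argsort(scores)
print(order)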

Follow the Money

Photo by Giorgio Trovato on Unsplash
Photo by Giorgio Trovato on Unsplash

While it is a fun exercise to attempt to surmise the possible methods to produce the NET ranking, my opinion is that the results are not the most important outcome of NET ranking. Of course, the NCAA wants to identify the best teams to make March Madness as competitive and meritorious as possible. So, the rankings need to appear to be reasonable and likely will align closely with the AP and Coaches Polls.

However, the NET rankings do something important for the business of college basketball that no other ranking system explicitly does: NET rankings incentivize high quality non-conference games.

This point cannot be overstated. This is the other side of the power of data. By publishing the components of the NET rankings, the NCAA is proclaiming very publicly the metrics to which all teams ought to perform. From a business strategy perspective, the Team Value Index is genius. Like the Net Efficiency metric, it strives to quantify the quality of a college basketball team in an understandable way. Unlike the Net Efficiency metric, however, Team Value Index pushes athletic directors, coaches, and all parties involved in scheduling to ensure that the non-conference slate of games is flush with Quad 1 games. Compared to the conference schedule, non-conference schedules are flexible and versatile. While the importance of non-conference competition has been increasing even before the NET rankings debuted, the Team Value Index and emphasis on the quadrant system in NET rankings formalizes this importance and rewards teams with tougher schedules.

In a sports media and entertainment landscape where streaming is becoming increasingly important, content is king. The greater the number of high quality (Quad 1) games that NCAA basketball can produce during the season, the greater the value of the media content that is college basketball. As more streaming services dip their toes into college athletics, improved in-season games mean more fans for the sport. More fans mean more intrigue during both the regular season and post-season. More intrigue means more revenue all season long and much more revenue for the already lucrative March Madness.

If you doubt that NET rankings are intended to incentivize higher quality non-conference play, please refer back to the image of the NET team sheet. The non-conference strength of schedule and records get their own lines at the top while the non-conference games are highlighted in bright cyan. I cannot guarantee that this is the version of the team sheet that the selection committee sees, but it nonetheless supports my claim.

Conclusions

If you are like me, by the end of this blog, you might have more questions than answers. The components of the NET ranking are so simple, but the algorithm that brings those components together into one coherent ranking is shrouded in secrecy. There are many possible methods and models that could be used to produce the NET rankings, but whatever the method is, it is backed by data.

Data is driving March Madness from both ends. The NET rankings inform the selection committee on how to structure the tournament while they explicitly incentivize the teams to improve their non-conference strength of schedule. Overall, I think the NET rankings are good for the sport of basketball and the tournament. They can help reduce bias in selecting and seeding teams during March Madness and improve the quality of the sport during the regular season. So, whether you have never missed a First Four or never heard of the Final Four, now you know how data science is behind it all.

References

[1] College basketball dictionary: 51 terms defined | NCAA.com

[2] College basketball’s NET rankings, explained | NCAA.com

[3] The First Four of the NCAA tournament, explained | NCAA.com

[4] College basketball NET rankings, explained: How Quad 1 wins impact NCAA tournament teams | Sporting News

[5] NET – Nitty Gritty Report with Team Sheets for NCAA Men’s College Basketball | WarrenNolan.com


Interested in my content? Please consider following me on Medium.

Follow me on Twitter: @malloy_giovanni

What algorithm do you think is behind the NET rankings? Do you prefer another system? Please comment below with your thoughts or experiences to keep the discussion rolling!

Predicting the Probability of Scoring a Basket in the NBA using Gradient Boosted Trees

Introduction

In my first blog on Machine Learning in the NBA, I wrote about how professional sports generate extensive amounts of data, opening a wide avenue of possibilities. We can look beyond the perception of spectators and deeper into the inner workings of the game of basketball.

With as much data as we have available, data scientists can be creative in both the models they create and the data used to create them. That brings us to this blog post. To start us off, let’s pose a bizarre question – and use Machine Learning to see if we can get close to a rational answer:

You’re putting $1 million on the line for LeBron James to make a shot in a basketball game. How could you be certain that you will win the payout?

Check out the companion video for this blog as well: https://youtu.be/Lxfsvw7rHgU

Photo by Alexander Schimmeck on Unsplash

The Stats

We want to use information that will give us a strong indication as to whether or not James will score a basket. At the time of writing, LeBron James has a 50.4% career shooting percentage. Loosely speaking, that means if we look at his entire career dating back to 2003, every time he’s attempted a shot, the odds of him scoring were about the same as flipping a coin. Not great if we have $1 million on the line. We need to tip the scale in our favor.

Talent comes in all shapes and sizes in the NBA, and one example of this is seeing how different players have different play styles and shooting preferences. Some instances of this include:

  • Shooting right-handed versus left-handed
  • Shooting long-range versus mid-range
  • Performing well under pressure (e.g. in the final seconds of a close game)

Believe it or not, this information, and more, is captured in NBA stats data. We can leverage this data and exploit it to see how we can find the right basket attempt to place a bet on. While we’re on this note, I’ll mention that all of the data referenced henceforth in this blog comes from https://www.basketball-reference.com/.

Without overcomplicating things, we’ll stick with just two criteria to build our model: location of the shot attempt and the stage of the game. Our model will use these features from LeBron James’ career shooting data:

  • XY coordinates of shot attempt (i.e. the location on the court)
  • Distance from the player to the basket
  • Time remaining in the quarter when the shot took place
  • Whether or not the shot was attempted towards the end of a half
  • Whether or not the shot was in the fourth quarter

The last two features are included to add further context to the stage of the game. In close matchups especially, those last-minute or last-second baskets can influence a game’s outcome significantly, and we want to capture that. This means we’re assuming that baskets scored as the game clock runs out are more valuable, and a player will be more incentivized to score a basket.
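A sketch of how those last two flags could be derived is shown below. The DataFrame, the column names, and the 30-second "end of half" threshold are all assumptions for illustration; the actual definitions used for the model may differ.

# 'shots' is a hypothetical DataFrame of shot attempts, with the quarter
# and the seconds remaining in that quarter for each attempt
shots["end_of_half"] = (
    shots["quarter"].isin([2, 4]) & (shots["seconds_left_in_quarter"] <= 30)
).astype(int)
shots["fourth_quarter"] = (shots["quarter"] == 4).astype(int)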

Gradient Boosted Trees

We're going to try answering this question using Gradient Boosted Trees. These are an ensemble of decision trees used in supervised Machine Learning.

Decision trees are a widely-used family of algorithms where a "mathematical flowchart" is learned from the input features. Gradient Boosted Trees are a variant where decision trees are built in series, and each tree tries to correct the mistakes of the previous one. One tree could look something like this:

One of many decision trees that can be used to determine the likelihood of scoring. Visualized with GraphViz.

Above, we can see a tree with its values learned from the input features. It shows us how each value gets us closer to the decision of the tree. The "samples" field tells us how many of the total samples (or basket attempts) fell in this category. The "value" field (as well as the darkness of the shade of color) tells us which indicators are the more significant ones in this tree. The total number of trees used is set by the n_estimators parameter described below.

Learn more about Gradient Boosting here: Gradient Boosting explained [demonstration] (arogozhnikov.github.io)

There are two hyperparameters we’ll spend time tuning in our trees:

  • n_estimators: the number of boosting stages to perform (in other words, the number of trees in sequence; we "boost" or learn from one tree to the next)
  • learning_rate: how strongly each tree learns from the last

In this experiment, we expect the algorithm to learn "sweet spots" on the court to shoot from as well as a window marking the best time to take a shot. It may look something like "short-range to the hoop with about 3:26 left will yield a 73% chance at scoring a basket."

What I like about decision trees is that they are simpler to understand in concept as well as in analysis. We’re looking for the highest probability of making a field goal attempt. This will be at some node of the tree, and there exists some combination of our feature set that will navigate us there.

Results and Case Analysis

We’ll spend more time discussing the approach and the results in this blog rather than the code setup. If you’re interested in the source code, I have a GitHub repository linked at the end of this blog.

Train-Test Split

LeBron James has more field goal attempts than any active player in the 2021–22 season. This is great news for us in that we can use all of his scored baskets and all of his missed baskets to train a model to understand where and when he performs best. The features we’re using are known for each shot attempt, alongside whether or not the attempt was successful. For training the following models, I took a standard train-test split with 75% of the data used to train and 25% used for testing. It’s important to shuffle the data before splitting it to avoid biasing our model to learn from only the first 3/4 of James’ career.
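As a rough sketch of what this step might look like in code, here is a minimal example using scikit-learn. The file name and column names (loc_x, loc_y, shot_distance, seconds_left, end_of_half, fourth_quarter, shot_made) are hypothetical stand-ins for the features described above, not the exact names used in the original repository.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# One row per field goal attempt; column names are illustrative placeholders.
shots = pd.read_csv("lebron_shots.csv")

features = ["loc_x", "loc_y", "shot_distance", "seconds_left",
            "end_of_half", "fourth_quarter"]
X, y = shots[features], shots["shot_made"]

# shuffle=True avoids training only on the earlier portion of the career
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)

model = GradientBoostingClassifier(learning_rate=0.02, n_estimators=100)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```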

Hyperparameter Tuning

Next, to maximize the likelihood of success, I ran a brief experiment to look at the influence of learning_rate and n_estimators. This helps select the best model.

I used learning rates between 0.01 and 0.1 with a step size of 0.01. Accuracy ramped up quickly and gradually tapered off, all while hovering within a 2% range. The best accuracy was 64.619% with a learning rate of 0.02.

Tuning the learning_rate (lr) hyper parameter

Next, I tuned the n_estimators parameter. Accuracy peaked marginally higher at 64.733%, found with learning_rate fixed at 0.02.

Tuning the n_estimators hyper parameter
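A hedged sketch of this kind of sweep is shown below; it continues from the earlier split sketch (it assumes X_train and y_train already exist), and the grid values are illustrative rather than the exact ones used in the experiment.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": np.arange(0.01, 0.11, 0.01),  # 0.01 to 0.10, step 0.01
    "n_estimators": [50, 100, 200, 400],           # hypothetical values
}

# Cross-validated search over both hyperparameters at once
search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      scoring="accuracy", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # e.g. a learning_rate close to 0.02
print(search.best_score_)    # cross-validated accuracy of the best model
```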

Model Testing

With our model ready for action, we can feed it test data to examine the probability of LeBron making a basket given his court position and the time remaining on the game clock. I should note that the roughly 64% prediction accuracy is an average rate across all of LeBron’s shot scenarios, not a guarantee for any particular attempt. This means there are specific shooting scenarios where the model predicts whether or not a basket is scored with even greater accuracy. We can exploit those scenarios to win our cash prize.

With thousands of samples to select from, what I found to be the most insightful way to look at this was to look at the upper percentiles for his shooting. In other words, we can plot out the shot scenarios that LeBron is at least 70% likely to land a field goal:

70-percentile shooting by LeBron James (axes are X-Y court coordinates)
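Under the same assumptions as the earlier sketches (a fitted model and the hypothetical column names), pulling out these high-confidence scenarios might look like this:

```python
# Score every attempt in the test set and keep the ones the model rates
# at 70% or better; these are the scenarios worth betting on.
proba = model.predict_proba(X_test)[:, 1]   # P(shot is made)

high_confidence = X_test.assign(make_probability=proba)
high_confidence = high_confidence[high_confidence["make_probability"] >= 0.70]

print(len(high_confidence), "shot scenarios at 70%+ probability")
print(high_confidence[["loc_x", "loc_y", "seconds_left"]].describe())
```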

Very little surprise here. LeBron stands at a tall 6’9", weighing in at 250 pounds. Throughout his career, he’s had few issues driving straight to the basket, knocking down anything from clean layups to high-flying dunks. His most reliable shots should be right under the rim. That blue blob comprises 609 unique shots. Interestingly enough, these shots exclusively take place with between 1:30 and 5:07 left on the game clock in any quarter.

Therefore, we now have an answer: despite LeBron James’ 50.4% shooting average, if we were to put $1 million on the line for LeBron James to score a basket, we’d have a 70% chance at winning on an attempt that’s right under the hoop with between 1:30 and 5:07 left on the game clock.

What about other players?

We can apply this algorithm to any player in the league, ideally those with plenty of scoring experience. Let’s briefly take a look at two other prolific shooters: Kevin Durant and Stephen Curry.

70-percentile shooting by Kevin Durant

Kevin Durant has a similar stature to James, and therefore makes baskets in the paint (under the rim) with ease. He is, however, also known as "@EasyMoneySniper" for hitting long-range shots. If you were to place this same bet on Kevin Durant, you would have a few more locations to choose from, and you’ll want his shot to take place with between 2.8 seconds and 5:33 left on the game clock. Talk about clutch!

70-percentile shooting by Stephen Curry (4th quarter shots in orange)

If you happen to skip over the text in this blog and look only at these plots, you’d still see that Steph Curry is in a league of his own when it comes to shooting. Plotted above are 613 shots taken by Curry. To my surprise, Steph Curry has a lower career shooting percentage than LeBron James, at 47.6%. So why can we select a shot from Steph Curry from virtually anywhere?

Because Steph Curry can shoot from anywhere.

His shots are not concentrated like James’. He simply has the versatility to land a shot from nearly anywhere on the court. If you want to place your bet on Curry instead, make sure it’s between 0.8 seconds and 11:53 remaining. That’s effectively any time throughout the game! I’ve also color-coded fourth-quarter shots in orange. Given that his fourth-quarter performance still covers the entire court, Curry is solidified as one of the most reliable shooters the game has ever seen.

Model Limitations

With every great machine learning model comes its limitations. Given that we’re talking about professional basketball games, we should acknowledge that there are many quantified and unquantified variables not accounted for in our model that could influence the outcome of a game or of a shot attempt. Some examples include:

  • Contested shots: in a game, a player can attempt a shot completely unguarded, or have up to 5 defenders guarding them. Obviously, more defenders implies a tougher shot and a lower probability of scoring.
  • Injury history: players coming off of hand or foot injuries may be reluctant to shoot from one side or the other, even if it’s their dominant side. They may also choose to avoid heavy contact, which is likely to happen under the rim (some still prevail, however).
  • Regular season vs. playoffs: While the model described in this blog accounts for shots made in both the regular season and the playoffs, no explicit distinction is made in the data itself. Some players are known to elevate their game in the NBA post-season or play more minutes overall, which can affect their in-game performance for better or worse.

Further studies would be needed to understand a player’s mentality when taking shots under pressure. We could additionally consider player physiology, such as how high they jump, the angle of their wrist and shoulders, or their view of the basket. This data, however, is much scarcer.

Bias, Overfitting, and the Context of NBA Basketball

In the proof-of-concept stages of this experiment, I shared a video with my Twitter followers showing that LeBron James could, in fact, have as high as a 98% likelihood of making a basket. Purely from a numbers standpoint, it is possible for a probability to be that high. One scenario could look something like this:

Shaquille O’Neal is notorious for almost never taking a three-point shot throughout his 19 years in the league. In 22 total attempts, he scored once. Say that, instead, he had scored all 22 of those attempts and his 3-point shooting percentage was near 100%. Looking at all of Shaq’s attempts, he attempted over 19,000 shots, most of which were two-point attempts made at a rate of 58.3%. If his short-range, two-point baskets are successful at a rate of 58.3% and his long-range, three-point attempts are successful at a rate of 100%, you would expect our model to treat long-range three-pointers as an effectively guaranteed shot. This is model bias, as we are overestimating the true likelihood of making a three-point attempt, which is widely agreed to be a more difficult shot.

The other scenario, which was the case in my video, is the result of overfitting to the training data. This highlights the importance of tuning hyperparameters to select an adequate model that can perform in the context of professional sports. The competitive nature of the NBA would likely mean a probability that high would never see the light of day.

Conclusion

To recap this blog, we covered a space-time analysis of how NBA players have varying likelihoods of scoring a basket. Using Gradient Boosted Trees, I created a model that accounts for a player’s position on the court as well as the remaining time in the quarter to predict the likelihood that a given combination would yield a scoring shot attempt. In our analysis, we found that a player like LeBron James most reliably scores close to the rim, and players like Steph Curry are even better than we thought, able to reliably land a basket at almost any time, from virtually anywhere.

I hope you learned something new and useful in this blog. If you’re interested in checking out the source code, you can find my GitHub repo here:

ChristopheBrown/nba-ml: Main Repository for all NBA Machine Learning demos (github.com)

The post Predicting the Probability of Scoring a Basket in the NBA using Gradient Boosted Trees appeared first on Towards Data Science.

]]>
Basketball Analytics, Part 2: Shot Quality https://towardsdatascience.com/part-2-shot-quality-5ab27fd63f5e/ Thu, 15 Jul 2021 22:13:11 +0000 https://towardsdatascience.com/part-2-shot-quality-5ab27fd63f5e/ Yes 3's are better than long 2's, but that's not the whole story.

The post Basketball Analytics, Part 2: Shot Quality appeared first on Towards Data Science.

]]>
From Wikimedia Commons

So we’ve reached the point that you’ve been waiting for. In this post, I’m going to explain why no player should ever shoot a midrange jumper ever.

Of course I’m not really going to do that. Very few things in sports are that cut and dry. But we are going to discuss one of the important points of contention for those who are skeptical of analytics. By the end of this, you should have some pretty good answers as to why the midrange shot has fallen so far out of favor. But our analysis will also raise at least as many questions as it answers.

What does the data say?

Let’s look at this in the simplest way possible. If we take all field goals from the 2020–2021 NBA regular season, how do these shots from different locations stack up to one another?

Highlight Table created with Tableau | Image by Author

There is a clear winner in terms of optimizing shot location. Shots that are very close to the basket are good. But the surprising thing is how quickly efficiency falls as we move away from the basket. There is a huge difference in efficiency between midrange jumpers and shots at the rim (essentially layups and dunks).

Although 3-pointers are slightly more difficult than midrange jumpers (they have a lower FG%), they are more efficient. And this isn’t a small difference either. The average 3-pointer is .16 points per shot (PPS) more efficient than the average midrange jumper. Most NBA teams play at a pace of around 100 possessions per 48 minutes. So over the course of a game, we would expect a team that shoots only 3-pointers to outscore a team that shoots only midrange jumpers by 16 points! That’s a difference too big to ignore.
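As a rough sketch of how points per shot by location could be computed from shot-level data (the file name, the zone labels, and the column names here are hypothetical placeholders, not the actual dataset schema):

```python
import pandas as pd

# Points per shot (PPS) = FG% * point value of the shot.
shots = pd.read_csv("nba_2020_21_shots.csv")
shots["points"] = shots["shot_made"] * shots["shot_value"]  # shot_value is 2 or 3

by_zone = shots.groupby("zone").agg(
    fg_pct=("shot_made", "mean"),
    pps=("points", "mean"),
    attempts=("shot_made", "size"),
)
print(by_zone.sort_values("pps", ascending=False))

# A 0.16 PPS edge over roughly 100 possessions is about a 16-point swing per game.
```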


Are there any exceptions?

The next step we might take to examine this question is to see if there are some players who buck the trend. In statistics, we like to aggregate things together and take averages. But we can benefit from getting more precise. After all, there are some players that we consider to be midrange specialists. So let’s see if these players tend to be more efficient from the midrange than from 3.

Pie Chart created with Tableau | Image by Author

I looked to see how many players were actually more efficient on midrange attempts than on 3-pointers in the 2021 season. My sample only included players with at least 100 total field goals and at least 20 attempts from both midrange and 3-point range. Out of these 274 players, only 21 were more efficient on midrange shots. That’s fewer than 8% of all players. Every other player was more efficient from 3-point range than from the midrange. And yes, this includes elite midrange shooters like Chris Paul and Kawhi Leonard.

Text Table created with Tableau | Image by Author
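A hedged sketch of how this kind of filter might be implemented with pandas (the file name and column names are stand-ins for illustration, not the real schema):

```python
import pandas as pd

# Hypothetical per-player table with attempt counts and FG% by zone.
players = pd.read_csv("player_shooting_2021.csv")

# Thresholds mirror the sample criteria described above.
eligible = players[
    (players["total_fga"] >= 100)
    & (players["midrange_fga"] >= 20)
    & (players["three_fga"] >= 20)
]

# Compare efficiency in points per shot rather than raw FG%.
eligible = eligible.assign(
    midrange_pps=2 * eligible["midrange_fg_pct"],
    three_pps=3 * eligible["three_fg_pct"],
)
better_from_midrange = eligible[eligible["midrange_pps"] > eligible["three_pps"]]

print(len(eligible), "qualifying players")
print(len(better_from_midrange), "of them are more efficient from midrange")
```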

So even though it is important for us to treat each player differently, it’s not a good idea to ignore the overwhelming evidence based on only a few players.


What this all means

The end goal of all this analysis is to inform decision making. So what takeaways can we draw from this data? The obvious answer would be to take fewer midrange jumpers and more layups and 3-pointers. But it’s not that simple. There are two teams in a basketball game, and the other team wants to prevent you from executing your game plan. This means that the "best shot" on a given play might change depending on the context. What’s the opposing team’s defensive scheme? Who’s on the court? How much time is left in the game? What’s the score? These are all relevant questions when players decide which shots to take or not to take.

One way to look at the shot selection question is to realize that efficiency does not exist in a vacuum. Joe Harris led the league in 3-point percentage with 47% on 6.4 attempts per game. So one might say: "The Nets should just let him shoot a 3-pointer every single time down the court." While that would be interesting to see, every basketball fan understands intuitively why this wouldn’t work out. One of the reasons why Joe Harris is so efficient is that he tends to take good shots and leave bad shots alone. Unless the team made some drastic shift in their overall strategy, the only way for Joe Harris to drastically increase the number of shots he takes would be to take more bad shots. That in turn would decrease his overall efficiency.

All that is to illustrate the point that there’s only so much we can take from this simple analysis. We can’t tell NBA offenses exactly how they should run, but we can make some general suggestions based on the data we saw.

Don’t take long contested 2-pointers

This one should make sense if you’ve read this far. Long 2’s are the worst type of shot, and the worst type of long 2 is one with a defender in your face. With the exception of late clock and late game situations, these shots are better left alone (Unless your name is Kevin Durant, in which case every shot is open).

Perimeter players should cut out long catch-and-shoot 2’s

A big part of the 3-point shooting revolution has been an emphasis on the concept of spacing. Teams want to put good shooters on the floor and have them stand outside the three-point line when they don’t have the ball rather than inside of it. Why? Well the first reason is essentially what our whole analysis was about. We’d rather go for three points than two.

But there’s another reason. In general, you want players to be as spread out as possible. If players are all crowded inside the three-point line, that makes it easier for the other team to play help defense. This makes it more difficult for ball handlers and post players to get open shots. But this also makes it harder for shooters to get open and stay open. It’s much easier to close out on a 18-ft jumper than a 22-ft jumper.

Coaches should emphasize schemes that result in 3’s and layups

This one is pretty obvious. Call plays that get you good shots.


Here’s a list of what we cannot conclude based on this data:

  • Players should never shoot midrange jumpers.
  • Players shouldn’t practice shooting midrange pull up jumpers.
  • Bad shooters should shoot more 3-pointers.
  • Teams should keep shooting from deep even when they’re cold.

A couple of these may be questions that are worth studying, but they are not a result of the analysis that we’ve done.


Three-pointers have done a lot to change the NBA game. They’ve also helped offenses improve a ton. There obviously isn’t a perfect relationship between 3’s and efficiency. However, the dramatic increase in 3-point attempts has coincided with a dramatic increase in offensive efficiency for NBA teams.

Line Chart created with Tableau | Image by Author

Player tracking data will undoubtedly give teams more and better tools to study the topic in more depth. Also, Thinking Basketball recently put out a great video which discusses how and why certain stars still can make great use of the midrange shot.

If you’d like to see the Python code used for this analysis, check out this Jupyter notebook. All data is from Basketball-Reference.com.

Thanks for reading!

The post Basketball Analytics, Part 2: Shot Quality appeared first on Towards Data Science.

]]>
Can we see an upset before it happens?- Predicting the madness of March https://towardsdatascience.com/can-we-see-an-upset-before-it-happens-predicting-the-madness-of-march-b16e89d972ec/ Sun, 28 Mar 2021 03:16:00 +0000 https://towardsdatascience.com/can-we-see-an-upset-before-it-happens-predicting-the-madness-of-march-b16e89d972ec/ A new method of visualizing the uncertainty around predicting the winner of a matchup can help explain surprising upsets.

The post Can we see an upset before it happens?- Predicting the madness of March appeared first on Towards Data Science.

]]>
Can we see an upset before it happens?

Predicting the Madness of March

Image by Izuddin Helmi Adnan (@izuddinhelmi) | Unsplash Photo Community

You’ve seen it before. A team with a higher seed puts together a solid game on offense and defense, defeating a lower seeded team everyone thought would win for sure. Your bracket of picks is destroyed and Cinderella moves on. Who could have predicted that upset? Maybe you, next year, after learning a new way to look inside predictive models in this article.

The internet is full of predictions for the tournament, giving you expected probabilities of one team beating another. They’ll tell you about the simulations they ran, the huge dataset feeding the model, and the sophistication of their model. But they give you only a single probability for each game. This simplicity is nice, but it is misleading. It is also the downfall of many brackets using these probabilities to choose a winner.

A probability derived from a model is a single point estimate. On its own, it does not give any information about how confident the model is in the prediction. Without an understanding of that confidence, we can’t know how much trust to place in the estimate.

In statistics, point estimates are scary things without a snug confidence interval wrapped around them.

I believe the solution is to look for the uncertainty in these models and find a way to visualize it. State-of-the-art models might incorporate deep-learning algorithms or Bayesian methods, but under their fancy hoods they all tend to rely upon the same basic inputs: estimates of a team’s offensive efficiency, defensive efficiency, and overall strength.

Models use different algorithms (math) to match these parameters to outcomes (i.e. winning), but in the end, what they are trying to do is estimate the difference between the true strengths of the teams. I believe I’ve thought of an easy way to visualize this difference, one which also gives us an estimate of the certainty around those point estimates we commonly come across. And all with some very simple math using frequentist statistics.

A New Model – Focusing on Visualizing Differences

To visualize estimates of a team’s true strength, I’ve gathered seven variables designed to estimate different parameters which are proxies for a team’s true ability. Four of these, two for offense, two for defense, estimate a team’s efficiency on both ends of the floor – Ken Pomeroy‘s Adjusted Offense/Defense and Massey’s Offense/Defense ratings. The other three take a more holistic view of the team, giving an estimate of overall strength or power – FiveThirtyEight’s ELO, Massey’s Power, and Massey’s Rating.

Each of these variables is a rating, not a ranking, so they reflect true distances between teams on each measure (ordinal measures like rankings are messy estimates of a team’s capabilities). By standardizing each measure we get a z-score for each team on each measure, which can then be directly compared. In other words, we have different estimates of a team’s strength which can now be averaged together into a single estimate on a unified scale. Because these measures can be directly compared, we can also calculate confidence intervals around this composite score for each team, giving us the Holy Grail – a single estimate of each team’s strength along with an idea of our certainty about that estimate. Voila. Now we have an idea of each team’s strength which also includes a range of possible values indicating how confident we are that we’ve captured the true ability of the team.

To accomplish this task, my model simply calculates the average of these seven standardized scores, plots this value as a point estimate of overall team ability and adds 95% confidence intervals around this estimate using the variability in each team’s data. In the figures below outlining the efficacy of this model on 2021 NCAA Tournament data, zero indicates an average team (among all 350+ Division I teams), with positive/negative values indicating above/below average. The values [-3,3] on the x-axis represent standard deviations above/below the mean. The data and code to create the analysis and plots are available on my GitHub. But for now, let’s see if thinking about uncertainty would help us find those upsets in the future by looking at the (recent) past.
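A minimal sketch of this computation, assuming a pandas DataFrame with one row per team and the seven rating columns (the file name is hypothetical, and the exact confidence interval construction in the original analysis may differ):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Seven rating columns per team (KenPom O/D, Massey O/D, ELO, Power, Rating).
ratings = pd.read_csv("team_ratings_2021.csv", index_col="team")

z = (ratings - ratings.mean()) / ratings.std()      # z-score each measure
composite = z.mean(axis=1)                          # point estimate per team

n = z.shape[1]                                      # 7 measures per team
sem = z.std(axis=1) / np.sqrt(n)                    # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)               # two-sided 95%, 6 degrees of freedom

ci_low, ci_high = composite - t_crit * sem, composite + t_crit * sem
print(pd.DataFrame({"strength": composite, "lo": ci_low, "hi": ci_high}))
```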

Visualizing Uncertainty Around Predictions: Two Seeds versus Fifteen Seeds – Round One

Consider the four teams given a 2 seed in the 2021 Men’s NCAA Tournament. Models, such as FiveThirtyEight’s forecast, gave similar probabilities of victory to each of the number two seeds: Alabama (95.19%), Houston (96.40%), Iowa (94.39%), and Ohio State (94.38%). These roughly correspond to the historical average of a two seed winning (they win 94.28% of the time). These probabilities also closely match the share of people picking Ohio State to win (95.2% of nearly 15 million brackets on ESPN.com). Yet Oral Roberts pulled off the dramatic upset in overtime. I wouldn’t argue these models are ‘wrong’, but perhaps their simplistic output hides relevant uncertainty which we should be looking at when making our picks.

While each sophisticated model gave similar estimates for each 2 seed to win, my estimates, with a sense of uncertainty, reveal a different picture. Instead of clear differences between each team, only two games look like runaway victories (no overlap either in the confidence intervals or the ratings inside them). However, Iowa seems to have a below average defense (0 is an average defense), which came back to haunt them in the next round, and the Ohio State game looks interesting. In the Ohio State matchup it looks like Ohio State has a below average defense and they are matched with Oral Roberts, which has an average or above average offense. Note also that the upper limit of Oral Roberts’ confidence interval is very close to the lower limit of Ohio State’s. This indicates that the two teams may in fact be very similar in ability. For perspective, Alabama is estimated to be nearly two standard deviations better than Iona, indicating their probability of winning should be much higher than that given to Ohio State. Visualizing a point estimate of strength, with confidence intervals, reveals that, while we could not be certain of Oral Roberts’ victory, estimates of Ohio State at close to 100% were clearly too high. If anyone in this batch looked ripe for an upset, it was Ohio State.

But maybe this is just a coincidence. Does this model of team strength and its uncertainty hold up to other games? Let’s check it against two other blocks of picks (3 and 4 seeds) and the second round of games.

Visualizing Uncertainty Around Predictions: Three Seeds versus Fourteen Seeds – Round One

This year, the surprise in these matchups was Texas losing to Abilene Christian. The team strength indicators agree closely on their point estimate of Texas (all are located within a tight range slightly above average in every category). While Abilene Christian is likely a below average team (the full confidence interval sits below zero), they appear to have a high quality defense, which Texas can now attest was an issue after 23 turnovers against the Wildcats. Again, this form of visualizing/modeling teams would not predict Abilene Christian as the winner of this matchup, but it does appear to indicate the 84.59% probability of a Texas victory was too high. Kansas and Arkansas each faced a team with a potent offense, but in this visualization it is easy to see that both had very strong defenses to counter those offenses.

Visualizing Uncertainty Around Predictions: Four Seeds versus Thirteen Seeds – Round One

The 4 seeds this year produced two upsets. Let’s see if there was an indication of this vulnerability beforehand. In the case of (4) Virginia and (13) Ohio there is a clear indication Virginia would not likely run away with this game, as there is an overlap between the confidence intervals (indicating both teams may be equivalent). Additionally, one measure of Virginia’s offense rates it a full standard deviation below average. An upset definitely looked plausible here. Further, (4) Purdue and (13) North Texas looks a lot like (3) Texas and (14) Abilene Christian. The estimates of Purdue’s true ability are tightly arranged just above average, while North Texas is average or below, but the defense of North Texas is nearly a standard deviation above average. Neither of the other matchups has as strong an overlap or a prominent ability on offense or defense which might pose a hazard to a specific opponent. If you were to pick upsets here, it would have been those two games, another indication that this method of comparing teams has merit.

Visualizing Uncertainty Around Predictions: Second Round

In the second round the blowout of (3) Kansas by (6) USC should not have been a surprise, according to this model. Some of the other upsets, such as (8) Loyola-Chicago over (1) Illinois, (11) Syracuse over (3) West Virginia, or (7) Oregon over (2) Iowa, also look less surprising with these visualizations as these teams appear very evenly matched when considering the confidence intervals around their estimated strengths. Some underdogs even have a clear strength which is better than the favored opponent (Loyola-Chicago’s defense, Oral Roberts’ offense) or the favorite has a clear weakness (Iowa’s defense). Once again, this form of visualizing/modeling the data gives additional insight into predicting a team’s chance of winning beyond today’s standard models.

Concluding Thoughts

Understanding the certainty, or uncertainty, around predictions is important. Without knowledge of how confident you are in a prediction you may be surprised, and unsure of how to explain unexpected outcomes (like your broken bracket). For the NCAA Tournament, visualizing team differences with a standardized, aggregated point estimate with 95% confidence interval provides important additional knowledge about team matchups which can be used to identify favorites that may be more vulnerable than state-of-the-art models indicate.

The post Can we see an upset before it happens?- Predicting the madness of March appeared first on Towards Data Science.

]]>
Why Should We Use Viziball to Analyze and Compare Basketball Players’ Performance? https://towardsdatascience.com/why-should-we-use-viziball-to-analyze-and-compare-basketball-players-performance-f03fbf498dd8/ Mon, 08 Feb 2021 10:18:41 +0000 https://towardsdatascience.com/why-should-we-use-viziball-to-analyze-and-compare-basketball-players-performance-f03fbf498dd8/ One look at a chart can help us draw conclusions. That's the beauty of data visualization.

The post Why Should We Use Viziball to Analyze and Compare Basketball Players’ Performance? appeared first on Towards Data Science.

]]>
In a previous article, I discussed how we use PIE (Player Impact Estimate) within Viziball, a basketball analytics website. This general measure proposes to aggregate statistical outputs in order to obtain a quick summary of how a player influences the game.

Despite all it gives us, this metric has limitations (as with any indicator). The first is that we only see visible game facts, and there are ways to influence a game other than producing stats. The second limitation is that the PIE measure gives a general assessment, so it is difficult to trace which aspects of the game a player affected.

It’s this second limitation that we try to deal with. In order to provide more elaborate insights, we propose a segmentation of the PIE measure around 8 axes: shooting, scoring, offensive aggressiveness, defensive aggressiveness, go-to guy, catch & score, play-making, and clutchness.

For each of these axes, we calculate a score (between 0 and 100) based on various formulas that I will discuss next. But before I go any further, I would like to introduce our visual feature, which is the reason why we have to put all the axes on the same scale: the spider chart.

Damian Lillard's career performance overview. Chart by Viziball.

Spider charts (or radar charts) are useful for comparing data in an attractive way. They are very effective for seeing which variables score high or low within a dataset, making them ideal for displaying performance statistics. Moreover, spider charts are easily understood by a large audience and are already quite widespread (which is not the case for all data visualization graphics). As a result, the chart almost speaks for itself and can be easily shared.
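For readers who want to reproduce the idea, here is a minimal matplotlib sketch of a spider chart over the eight axes (the indicator values are made up purely for illustration; this is not Viziball’s plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

axes = ["Shooting", "Scoring", "Off. aggressiveness", "Def. aggressiveness",
        "Go-to guy", "Catch & score", "Play-making", "Clutchness"]
values = [78, 85, 60, 55, 90, 40, 82, 70]          # made-up 0-100 scores

angles = np.linspace(0, 2 * np.pi, len(axes), endpoint=False).tolist()
values_closed = values + values[:1]                 # close the polygon
angles_closed = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles_closed, values_closed, linewidth=2)
ax.fill(angles_closed, values_closed, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(axes)
ax.set_ylim(0, 100)
plt.show()
```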

Analysts say that this brings a lot of insight very quickly. How does a player control and distribute the ball (play-making indicator)? Where does he sit in the team hierarchy (go-to guy indicator)? Is he good during money time (clutch indicator)? What is his offensive profile (offensive aggressiveness)? How does he impact the defense (defensive aggressiveness)? Is he creating his own buckets, or is he more of a catch-and-shoot player (catch & score)?

To give you more details on how Viziball tries to provide some answers, here is how the indicators are established. Some are based on basic stats, while others are based on more advanced formulas or play-by-play analysis.

  1. Play-making: Based on the Assist to Turnover Ratio. This indicator measures ball control and the ability to create efficient plays.
  2. Shooting: Mixing Effective Field Goal Percentage (EFG%) and True Shooting Percentage (TS%) to measure the ability to score from any distance.
  3. Scoring: Number of Points.
  4. Go-to guy: Based on Minutes Played and the Usage indicator (percentage of possessions ending in a player’s hands while on the court).
  5. Catch & Score: Ability to score following an assist (based on Assisted FG%). A low score doesn’t imply that the player misses his catch & shoots, but it means that most of his shots are coming from a personal action (no assist from a teammate).
  6. Offensive Aggressiveness: Drawn Fouls and Offensive Rebounds.
  7. Defensive Aggressiveness: Steals, Blocks, and Personal Fouls.
  8. Clutchness: this combines several basic statistics to produce a single value. Unlike PIE or Game Score, here we add a coefficient for each stat, depending on the remaining playing time and the point differential at the time the action occurred.
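As a purely illustrative example of the kind of time- and score-dependent weighting described in the last item (this is not Viziball’s actual formula, just a sketch of the concept):

```python
# Purely illustrative weighting, NOT Viziball's formula: actions that happen
# with little time left and a small point differential count for more.
def clutch_weight(seconds_remaining: float, point_diff: int) -> float:
    time_factor = 1.0 + (1.0 - min(seconds_remaining, 720) / 720)   # 12-minute quarter
    closeness_factor = 1.0 + max(0.0, (10 - abs(point_diff)) / 10)  # within 10 points
    return time_factor * closeness_factor

# A basket with 30 seconds left in a 2-point game weighs roughly 3.5x a
# "neutral" basket under this toy scheme.
print(round(clutch_weight(30, 2), 2))
```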

As you can see, all indicators are based on statistical outputs recorded during games. I do not give the normalization process that brings all indicators to the same scale. I agree that this process might sound arbitrary. But what we have to keep in mind here is that we are trying to compare players’ performances, and at the end of the process all performances sit on the same spider chart scale, which is exactly what we need.

It is also important to notice that no subjective opinions are introduced in our calculation, like, for example, saying that Rudy Gobert had a great night on defense after watching the game. Of course, it would have been convenient to have a dashboard where we could configure and moderate values; that way, one could imagine many more axes. But this is not the philosophy of Viziball. We avoid introducing biases into the workflow. What we do is bring tools to analysts and insiders, not the opposite. They are then free to draw their own conclusions.

To make one last comment, this is not supposed to represent the intrinsic value of the player, as a video game might try to do. For various reasons, a player’s performance can be altered by the context: coaching choices, referees’ decisions, or even injuries.

Our positioning could be discussed for a long time (I also encourage you to send me feedback and help us enrich our model), but I think it’s time to see how we use the spider chart.

Post-game report

Every day and for all available games, Viziball establishes a post-game report in which each player’s spider chart is available.

For example, below is the spider chart of Bam Adebayo after his game against the Brooklyn Nets on 01/23/2021. Very quickly, and before knowing anything about his statistics during that game, we can understand that he had a great night in various aspects.

Bam Adebayo's performance overview on 01/23/2021. Post-game report by Viziball.

In the same game, Kyrie Irving had a clutch night: 3-pointers, driving layups, and key free throws to help lift the Nets in the 4th quarter. This can be read from his spider chart, where both the Shooting and Clutch indicators are high.

Kyrie Irving's performance overview on 01/23/2021. Post-game report by Viziball.

Viziball also provides a heat map in the same post-game report, showing the players’ impact per sequence. Here, this impact is represented by the intensity of the bricks: the darker a brick is, the higher the impact. Below, the last 4 bricks, corresponding to the final quarter, look to be the most intense of Kyrie’s game. We can also observe how this relates to his points line (blue), which rises at the end of the game.

Progression of Kyrie Irving's performance during his game against Miami. Post-game report by Viziball.

Player’s profile page

Every player recorded in the Viziball database has his own profile page, which includes a spider chart among various other widgets. Using this feature, we can observe a player’s performance over a specific period, which can be his whole career, a particular season, or any custom duration.

This web page can be highly customized (like showing or hiding the spider chart’s axes) and can be shared easily either using the personalized URL or the embedded mode.

At the bottom of the image is the date picker module. The yellow line shows how general performance (PIE) evolves through the years.

LeBron James’ career performance overview. Chart by Viziball.

This page also includes a timeline module called Career path, displaying the player’s career-high records and his first game with each team he has played for.

LeBron James’ career path. Timeline by Viziball.

Comparing players

In my opinion, where the spider chart brings the most important insights is when we overlay several players. Viziball’s comparison feature makes it possible to visualize two players at the same time. This enables various scenarios: comparing opponents, comparing players from the same team to understand how they might fit together, or comparing the same player at different moments of his career.

For example, below is a comparative spider chart for Jayson Tatum (green) and Bradley Beal (red) over the first few games of the current season (2020–2021). We can see their similarities on the catch & score, play-making, and shooting indicators, but also some significant disparities elsewhere, with an advantage to Bradley Beal in scoring and in offensive & defensive aggressiveness.

On the other hand, Tatum seems to be brilliant in the 4th quarter during the selected period, as he gets an 84 clutchness indicator (vs. 49 for Bradley Beal).

Jayson Tatum (green) vs. Bradley Beal (red) performance over their first games of the 2020–2021 season. Spider chart by Viziball.

Beal seems to be the first option on offense, with a go-to guy indicator of 93 (vs. 78 for Jayson Tatum). To see how this correlates with traditional figures, let’s dig a bit deeper. Beal records 35.6 minutes per game and a 37.8% Usage. Tatum has 34.3 minutes and 31.2% Usage.

Jayson Tatum & Bradley Beal average figures. Chart by Viziball.

This means that Jayson Tatum is sharing the ball a bit more. To go further, we can observe how usage is distributed across both teams. We can see below that Beal has a high usage compared with all his teammates (except Russell Westbrook, who sits at 31.42%), whereas the Celtics seem to share offensive duties among three go-to guys: Jayson Tatum, Jaylen Brown, and Kemba Walker, with 31.2%, 35.75%, and 32.35% Usage, respectively.

Usage distribution among the two teams : Wizards vs. Celtics. Chart by Viziball.

Observing performance evolution

Finding the future Most Improved Player (or the most declining players), spotting how a player evolves when he changes teams, or just observing a player’s evolution on a particular aspect of the game: all of this can be done with this feature.

Personally, I chose to illustrate this feature with Nikola Jokic’s comparative spider chart:

  1. In blue are his stats from the 2019–2020 season (including the bubble and playoffs).
  2. In yellow are his stats from the first games of the current season.

Nikola Jokic’s comparative spider chart over two periods: 2019–2020 (dark blue) vs. first games of 2020–2021 (yellow). Spider chart by Viziball.

This makes it possible to better appreciate his impressive evolution. We can see that he is improving on almost all axes without necessarily extending his go-to guy indicator (which only goes from 68 to 72, i.e., +2.7 minutes per game and +1.8% usage compared to the previous season).

Diving into a little more detail, we can understand how the Joker manages to increase his impact without playing much longer. First, he improves on the shooting axis, which directly increases his scoring (almost +5 points per game). He also draws more fouls, which brings him to the free-throw line more regularly. Finally, he seems to have expanded his ball control and distribution skills, with a play-making indicator that goes from 75 to 88 (i.e., +2.5 assists per game and -1.1 turnovers).

Finally, we can observe his huge impact in fourth quarters. His clutchness indicator is 100 for the beginning of this season! And it was already very high the previous season. Everyone (starting with the Jazz and the Clippers) remembers last year’s playoffs and Denver’s ability to perform during money time.

Going further

As you can see, this feature is highly configurable. Obviously, there are things that are not yet possible (like comparing more than two players on the same spider chart), but the tool can already accommodate a large number of combinations, which could help fans, analysts, gamblers, or even basketball professionals optimize their investigation process. You are therefore free to express your analyses through this tool!

I would like to point out that Viziball is a 100% free service that is accessible to everyone without any restrictions. Launched at the end of 2019, this research project, led by the French company Data Nostra, an expert in R&D in the field of data analysis in general (not exclusively in sport), aims to identify innovative methods for processing sports data. The site is under constant development at different levels. We have to deal with the classic issues of running a website (UX design, SEO), but also with the basketball analytics aspects. This is why we try to communicate our work as much as possible. To go further, we would particularly welcome informed feedback from professional sports analysts.

Today, the project runs exclusively on the company’s own funds, with the aim of sustaining the Viziball service. To that end, we are open to any partnership proposal, whether with clubs, leagues, or specialized media.

We have just launched a Twitter account, where we regularly share Viziball-based content. This makes it possible to follow basketball news from a different statistical angle. I therefore invite you to follow us, share our content, and experience the tool yourself!

The post Why Should We Use Viziball to Analyze and Compare Basketball Players’ Performance? appeared first on Towards Data Science.

]]>
Developing Rebound Probability Predicting App using ML and Flask https://towardsdatascience.com/developing-rebound-probability-predicting-app-using-ml-and-flask-803a6990fd56/ Wed, 13 Jan 2021 23:44:16 +0000 https://towardsdatascience.com/developing-rebound-probability-predicting-app-using-ml-and-flask-803a6990fd56/ In this article, I am going to introduce how I develop an ML-based web application that can predict the odds of individual players and...

The post Developing Rebound Probability Predicting App using ML and Flask appeared first on Towards Data Science.

]]>
A Full-Stack Machine Learning Web App that Predicts Rebounding Probabilities

In this article, I am going to walk through how I developed an ML-based web application that can predict the odds of individual players and teams securing a rebound on the court.


First of all, let me show what the app eventually looks like:

Go check out http://okc-thunder-rebounds.herokuapp.com/ (image by author)

As the image above indicates, users are able to get the probability of each player or side getting the rebound, which is useful to basketball staff when planning rebounding tactics. Some of the takeaways from the GIF above are:

  1. We need an interface in the frontend that allows users to place the players on the court by dragging the items;
  2. we need a kernel in the backend that can predict each player’s probability of grabbing the rebound (and the summed probability for a team) based on their locations on the court;
  3. we need to have a server that hosts the website 24/7.

In the rest of the article, I’m going to describe my ideas in that order.


Frontend design

There are two main difficulties we need to tackle during the development:

  1. how can we enable users to place the players by dragging the elements on the panel?
  2. how should I transport the data from the frontend to the backend so that our model uses it for prediction?

For Difficulty #1, I fortunately found this useful link, which makes any element on the web page draggable. What I changed is that I confined the div elements representing the players within the panel:

For difficulty #2, my workaround is to insert an invisible