Sport Analytics - NBA 2K ratings prediction

What will the next 2K22 ratings be? Are some teams overrated by the publisher? Which statistics impact the ratings the most? Let’s find out.

As a Data Scientist passionate about sport, especially basketball, I am writing a series of articles related to Sport Analytics. It will go from classic regression or classification problems to more advanced computer vision applications.

NBA 2K is a series of basketball simulation video games released annually since 1999. The latest edition is 2K21, released on September 4, 2020. The games are now published by 2K Sports.

In each release, all active players in the NBA and some legends are individually rated on a 99-point scale. Those ratings always lead to discussion, debate, reactions… even from the players themselves.

The objective of this post is to build a model to predict this 99-point scale rating for each player, using features related to the player himself and his game statistics from the previous season. Since we are predicting a number, this is a regression problem.

Data Sources

When it comes to NBA data, a large variety of open data exists, and statistics are widely used by the league itself. Data is part of the NBA culture.

NBA 2K ratings are available at https://hoopshype.com/nba2k/ from 2K14 to 2K21. These 8 editions are the scope of our project.

We observe that the distributions of the ratings have been quite similar since 2015, with a median in the range [74.5, 76.9] and fairly constant quartiles.
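To make the comparison concrete, here is a minimal sketch of how the median and quartiles could be computed per edition. The ratings below are made-up placeholder values, not the real HoopsHype data, and the `summarize` helper is a hypothetical name used only for illustration.

```python
import statistics

# Made-up ratings per 2K edition (illustrative values only, not real data).
ratings_by_edition = {
    "2K14": [62, 66, 70, 72, 75, 80, 93],
    "2K15": [68, 72, 74, 76, 78, 82, 95],
}


def summarize(ratings):
    """Return (Q1, median, Q3) for a list of ratings."""
    # method="inclusive" treats the list as the full population of players.
    q1, median, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    return q1, median, q3


summary = {edition: summarize(r) for edition, r in ratings_by_edition.items()}
for edition, (q1, median, q3) in summary.items():
    print(f"{edition}: Q1={q1}, median={median}, Q3={q3}")
```

Comparing these three numbers edition by edition is enough to spot a release whose rating scale differs from the others.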

In 2K14, the scores were spread over a broader range with a lower median (71.4). This shift may be due to a change in the rating strategy by 2K Sports. This is important to note, as it could introduce bias and worsen model performance.

Regarding NBA player statistics, the data can be retrieved from https://www.basketball-reference.com/leagues. We will only scrape the seasons 2013–2014 to 2020–2021 in order to align with the period covered by the 2K data.

Many views are available on the website; here we focus only on the Totals and Advanced tabs.

Type of features in the dataset:

Age, position and team
Number of games played and started, % of win, minutes played
Field goals (3-points, 2-points, free throws…)
Rebounds, assists, fouls, blocks, steals…

To enrich our dataset, we perform feature engineering by combining variables, in particular normalizing by minutes and games played, and by one-hot encoding position and team.
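As a rough sketch of this step, assuming the scraped totals sit in a pandas DataFrame: the column names (G, GS, MP, PTS, Pos, Tm) and the derived feature names (PTSPM, MPPG, PCT_GS) are assumptions chosen for illustration, and the rows are made-up numbers. The actual project may define its features differently.

```python
import pandas as pd

# Toy rows with made-up numbers; column names are assumed, not taken from the project.
stats = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Pos": ["C", "PG", "PG"],
    "Tm": ["UTA", "GSW", "LAC"],
    "G": [70, 5, 60],           # games played
    "GS": [70, 5, 3],           # games started
    "MP": [2400, 140, 1700],    # minutes played
    "PTS": [1050, 100, 1100],   # total points
})


def engineer_features(df):
    """Add per-minute and per-game ratios, then one-hot encode position and team."""
    out = df.copy()
    # Replace zero denominators with NaN so the ratios do not become infinite.
    minutes = out["MP"].where(out["MP"] > 0)
    games = out["G"].where(out["G"] > 0)
    out["PTSPM"] = out["PTS"] / minutes   # points per minute
    out["MPPG"] = out["MP"] / games       # minutes per game
    out["PCT_GS"] = out["GS"] / games     # share of games started
    # One indicator column per position and per team.
    return pd.get_dummies(out, columns=["Pos", "Tm"], dtype=int)


features = engineer_features(stats)
print(features.filter(regex="PTSPM|MPPG|PCT_GS|Pos_|Tm_"))
```

Normalizing by minutes makes a bench player and a starter comparable, while the share of games started keeps the information about the player's role.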

Modeling

The dataset is split into training, test and validation sets along the time axis to avoid leakage in the model. Note that we removed 2K14, as the distribution of its ratings was too different from the following years.
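A time-based split can be sketched as follows. The article only states that the validation set corresponds to 2K21; using 2K20 as the test set and earlier editions for training is my assumption, as are the `edition` column and the `temporal_split` helper. The ratings are placeholder values.

```python
import pandas as pd

# Made-up frame: one row per player-season, "edition" is the 2K release year.
data = pd.DataFrame({
    "edition": [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
    "rating": [71, 75, 76, 74, 77, 75, 76, 78],
})


def temporal_split(df, first_edition=2015, test_edition=2020, validation_edition=2021):
    """Split by edition so that no later season leaks into the training set."""
    df = df[df["edition"] >= first_edition]          # drops 2K14
    train = df[df["edition"] < test_edition]
    test = df[df["edition"] == test_edition]
    validation = df[df["edition"] == validation_edition]
    return train, test, validation


train, test, validation = temporal_split(data)
print(len(train), len(test), len(validation))
```

Unlike a random split, this keeps every row of a given edition in a single set, so the model is always evaluated on seasons it has never seen.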

Many algorithms can be used to perform regression: from basic Linear Regression to advanced tree methods.

After comparing the performance of different algorithms, we choose XGBoost and tune its hyperparameters with grid search.

On the validation set (which corresponds to the 2K21 ratings), we achieve an MSE of 1.75 and an MAE of 1.33, with the following residuals:
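For reference, both metrics and the residuals can be computed in a few lines. The ratings below are invented for the example and are unrelated to the scores reported above.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up actual and predicted ratings for four players.
actual = [96, 88, 75, 70]
predicted = [94, 89, 75, 73]

# A positive residual means the model underestimated the 2K rating.
residuals = [t - p for t, p in zip(actual, predicted)]

print(residuals)
print(mse(actual, predicted), mae(actual, predicted))
```

The MAE reads directly in rating points, while the MSE penalizes large misses more heavily, which is why a few outliers can move it a lot.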

The color of each dot represents the number of games played. We observe that the main outliers are players who did not play enough games during the previous season (e.g. Stephen Curry in the 2019/20 season). In that case, the statistics may not be representative of the player's real value, and 2K corrects for this bias.

Model interpretation

Now that we have a model with a reasonable performance, we can open the black box and move to interpretable Machine Learning.

The objective is to analyze what is behind the model and which features have the highest impact on the rating prediction. In other words, we can highlight which player attributes affect their 2K rating the most, either positively or negatively.

The goal of SHAP (SHapley Additive exPlanations) is to explain the prediction of an instance x by computing the contribution of each feature to the prediction.
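To illustrate what "contribution of each feature" means, here is a brute-force computation of exact Shapley values for a toy rating function with two features. Everything in it is hypothetical: the `toy_rating` formula, the feature names, the single baseline instance and the numbers. The SHAP library relies on much more efficient algorithms and on background data rather than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `instance` relative to a single `baseline` instance."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the instance value, the others the baseline value.
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_rating(f):
    # Hypothetical model with an interaction between playing time and starts.
    return 50 + 0.5 * f["mppg"] + 10 * f["pct_gs"] + 0.5 * f["mppg"] * f["pct_gs"]


player = {"mppg": 30, "pct_gs": 1.0}      # made-up starter
reference = {"mppg": 20, "pct_gs": 0.5}   # made-up baseline player

phi = shapley_values(toy_rating, player, reference)
print(phi)
# The contributions add up to the gap between the two predictions.
print(sum(phi.values()), toy_rating(player) - toy_rating(reference))
```

The enumeration grows exponentially with the number of features, so it only works for toy cases, but it shows the property that makes these explanations additive: the contributions sum to the difference between the prediction and the baseline prediction.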

Interpretable Machine Learning

Using the SHAP library, we can plot the top 10 most important features (left) and the combination of feature importance with feature effects (right).

The top 2 features reflect a player's importance within his team: his average playing time and the percentage of games in which he is in the starting five. We also notice that overall team performance (% of win share) has a positive influence on individual player ratings.

No position (guard, center…) gives an advantage for higher ratings, and the model does not seem to favor a specific team either (the first franchise that appears in the feature importance plot is the Golden State Warriors, in 27th position).

Finally, we can explore SHAP force plots for two specific players to visualize which features had a positive impact (increasing the rating, in red) and a negative impact (decreasing the rating, in blue).

Rudy Gobert, center for the Utah Jazz and two-time Defensive Player of the Year, has a high impact on his team (PER, %GS, %MPPG, DWS), but his low scoring decreases his rating by 0.44.

Montrezl Harrell won the 2020 NBA Sixth Man of the Year award. We observe that %GS decreases his score, which is offset by the time he spent on the court and his high offensive impact (FGPM, PTSPM, PER).

Conclusion

Using player statistics, I built an XGBoost model that predicts 2K21 ratings with an MAE of 1.33. This suggests that 2K ratings are broadly impartial and fairly reflect the reality on the court. Some exceptions can be found, for example when a player has been injured during the season.

Manipulating basketball data was a lot of fun, and model interpretation is really useful for properly understanding each prediction and diving deep into player attributes.

2K rating prediction is a very specific use case but Sport Analytics is an infinite playground with a lot of under-exploited data that I will continue to explore in my Data Science journey.

All the code is available on GitHub.