Predicting NBA MVP Players Using Machine Learning

Ibrahim Kiceci
8 min read · Jan 12, 2024


First, some brief background on this award, because it helps in understanding the dataset. The NBA has named a most valuable player of the regular season since the 1955-56 season. The NBA also gives an award for the most valuable player of the NBA Finals, but the regular-season award is generally considered much more significant. That’s why I focused on the MVP award that is decided on performance before the playoffs.

Photo by Ryan on Unsplash

How is an MVP player chosen?

Until 1980, the MVP was chosen by a vote of NBA players. After 1980, the voting system changed, and the most valuable player is now determined by the votes of sportswriters and broadcasters in the USA and Canada. Each voter ranks 5 choices for MVP. A first-place vote is worth 10 points, second place 7 points, third place 5 points, fourth place 3 points, and fifth place 1 point. Expressed as a share of the maximum possible points:

Share = MVP Voting Points / Total Possible MVP Points

The player with the highest share, that is, the most voting points, wins the award.
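As a small worked example of the share formula, here is a minimal sketch in Python. The function name `award_share`, the number of voters, and the ballot counts are all made up for illustration; only the 10-7-5-3-1 point scale comes from the rules above.

```python
# Points awarded for 1st through 5th place on each ballot (from the rules above).
POINTS_PER_PLACE = [10, 7, 5, 3, 1]


def award_share(place_counts, num_voters):
    """Return a player's share of the maximum possible MVP points.

    place_counts: number of 1st, 2nd, 3rd, 4th and 5th place votes received.
    num_voters:   total number of ballots cast (hypothetical value below).
    """
    points_won = sum(p * c for p, c in zip(POINTS_PER_PLACE, place_counts))
    # The maximum is reached when every voter ranks the player first.
    max_points = POINTS_PER_PLACE[0] * num_voters
    return points_won / max_points


# Made-up ballot counts for illustration only, not real voting results.
share = award_share(place_counts=[60, 30, 10, 0, 0], num_voters=100)
print(f"Award share: {share:.3f}")
```

With these invented numbers the player collects 860 of 1000 possible points, so the share is 0.86. A player ranked first on every ballot would have a share of exactly 1.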

SET UP THE MODEL

You can access the code and the datasets on my GitHub account. Here, I will only discuss the methods I applied and the results.

I went through the following 5 steps when building the project:

  1. Finding and Collecting Data
  2. Data Analysis and Preparation
  3. Feature Selection
  4. Model Selection
  5. Prediction

DECIDING ON THE DATASET

We know that MVP awards in the NBA are given based on player performance and team performance. Therefore, when collecting data, I tried to gather information related to both the player’s performance and the team’s performance. The second important point is that, as mentioned above, I did not include data from before 1980 due to the change in the voting system. Data from after 1980 should be sufficient for now.

Data Sources

  1. Almost all advanced statistics are available at https://www.basketball-reference.com/.
  2. All the team data and advanced player statistics that I needed can be found at https://www.kaggle.com/datasets/robertsunderhaft/nba-player-season-statistics-with-mvp-win-share/data.

Both websites provide the required datasets.

A Look at the Dataset and the Feature Glossary

The dataset has 55 features. Among them, ‘award_share’ represents the player’s share of votes for the MVP award, which is our dependent variable. The remaining features are our independent variables. I train a model on the independent variables to predict the dependent variable. Because award_share is a continuous value, predicting it is a regression problem.

You can access the explanations about the features from: https://www.basketball-reference.com/about/glossary.html

DATA ANALYSIS AND PREPARATION

After cleaning the dataset, we still have 51 features, which is quite a lot. Having too many features can cause problems during model training, so I need to apply a feature selection method. First, let’s look at the correlation between the dependent variable and the independent variables in the correlation matrix. I haven’t shared the entire correlation matrix here, but you can find it on my GitHub account. One thing that stands out is the relationship between the VORP variable and award_share: their correlation coefficient is 0.47. VORP (Value Over Replacement Player) indicates how valuable a player is compared to a replacement-level player, so a high VORP means the player contributes more to the team. This is worth remembering when evaluating the results in the prediction part.
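One way to read the target column of a correlation matrix is sketched below. This is only an illustration under stated assumptions: the DataFrame is synthetic random data, not the NBA dataset, and only the column names `vorp`, `pts_per_g` and `award_share` are borrowed from the article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; values are random, not NBA data.
rng = np.random.default_rng(42)
n = 200
vorp = rng.normal(size=n)
df = pd.DataFrame({
    "vorp": vorp,
    "pts_per_g": rng.normal(size=n),
    # award_share is built to depend on vorp so that a correlation shows up.
    "award_share": vorp + rng.normal(size=n),
})

# Pearson correlation of every feature with the target.
corr = df.corr()["award_share"].drop("award_share")

# Order the features by absolute correlation, strongest first.
target_corr = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which matters when a feature hurts rather than helps a player's chances.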

FEATURE IMPORTANCE USING A RANDOM FOREST MODEL

The dataset contains 51 features, and using all of them during training may lead to issues such as overfitting, especially when many of the features carry little information. Feature selection can also improve the model’s performance. Therefore, I apply a feature selection method and keep only specific features. The choice can be based on correlation or on other methods.

I used feature importance from a random forest model, which I believe suits this dataset. It shows how much each feature contributes to the model’s predictions.

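from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# df is the cleaned dataset from the previous step
X = df.drop('award_share', axis=1)  # independent features
y = df['award_share']  # dependent feature

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a random forest and read the importance of each feature
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

feature_importance = model.feature_importances_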

After computing the feature importances, I set the threshold to 0.01 and selected the following features:


Index(['fg_per_g', 'fta_per_g', 'ast_per_g', 'pts_per_g', 'per', 'usg_pct',
'ws', 'ws_per_48', 'vorp', 'mov', 'win_loss_pct'],
dtype='object')
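The thresholding step itself is not shown in the article, so the following is only a sketch of how it could be done. It assumes a strict "greater than 0.01" rule and uses synthetic data with invented column names (`signal_a`, `signal_b`, `noise`) in place of the real features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: two informative columns and one pure-noise column.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 3.0 * X["signal_a"] + 2.0 * X["signal_b"] + 0.1 * rng.normal(size=n)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Pair each importance with its column name; the importances sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)

# Assumed rule: keep features whose importance exceeds the threshold.
THRESHOLD = 0.01
selected_features = importances[importances > THRESHOLD].index

print(importances.sort_values(ascending=False))
print(selected_features)
```

Printing the resulting index gives output in the same `Index([...], dtype='object')` form as the list above. The threshold is a judgment call: a lower value keeps more features, a higher one keeps fewer.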

Choosing a Regression Model

The value we are trying to predict is continuous, so we need a regression model. Although the Random Forest model seems the most suitable, it is important to compare several regression models using R-squared, Mean Squared Error (MSE), and Mean Absolute Error (MAE). This lets me determine the most appropriate model. I used a pipeline to test several regression models and applied a standard scaler to each.

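from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Candidate regression models to compare
models = [
    ('Linear Regression', LinearRegression()),
    ('Ridge Regression', Ridge()),
    ('Lasso Regression', Lasso()),
    ('Decision Tree', DecisionTreeRegressor()),
    ('Random Forest', RandomForestRegressor()),
    ('Polynomial Regression', make_pipeline(PolynomialFeatures(degree=2), LinearRegression()))
]

# Wrap each model in a pipeline that standardizes the features first
pipelines = []

for model_name, model in models:
    pipelines.append((
        model_name,
        Pipeline([
            ('scaler', StandardScaler()),
            ('model', model)
        ])
    ))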

The Random Forest model has the highest R-squared value and the lowest MSE and MAE values.
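The evaluation loop that produces such a comparison is not shown, so here is a minimal sketch of one possible version. It uses synthetic data from `make_regression` and only two of the models. The `results` dictionary is a name introduced for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the selected NBA features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipelines = [
    ("Linear Regression",
     Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])),
    ("Random Forest",
     Pipeline([("scaler", StandardScaler()),
               ("model", RandomForestRegressor(random_state=42))])),
]

# Fit each pipeline on the training split and score it on the test split.
results = {}
for name, pipeline in pipelines:
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, y_pred),
        "mse": mean_squared_error(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
    }
    print(f"{name}: R2={results[name]['r2']:.3f}, "
          f"MSE={results[name]['mse']:.3f}, MAE={results[name]['mae']:.3f}")
```

Because this synthetic data is linear by construction, the ranking of models here says nothing about the NBA dataset. The sketch only shows the mechanics of scoring every pipeline on the same held-out split.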

R-squared indicates how much of the variance in the dependent variable is explained by the independent variables. It usually lies between 0 and 1, although it can be negative on held-out data when a model does worse than simply predicting the mean. As the value approaches 1, the independent variables explain more of the dependent variable. However, a value very close to 1 can be a sign of overfitting, so it is not necessarily a desired outcome. You can check how R-squared is calculated from this link: https://www.geeksforgeeks.org/ml-r-squared-in-regression-analysis/.

MSE (Mean Squared Error) is the average of the squared differences between actual and predicted values, so a low value is desirable. MAE (Mean Absolute Error) is the average of the absolute differences between actual and predicted values. Both measure prediction error, but MSE penalizes large errors more heavily, which makes MAE more robust against outliers.
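To make the outlier point concrete, here is a small sketch in plain Python. The numbers are made up for illustration and are not model outputs.

```python
def mse(actual, predicted):
    """Mean of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented award shares for four players.
actual = [0.10, 0.20, 0.30, 0.40]
small_errors = [0.15, 0.25, 0.35, 0.45]  # every prediction is off by 0.05
one_big_miss = [0.10, 0.20, 0.30, 0.90]  # three exact, one off by 0.50

for label, predicted in [("small errors", small_errors),
                         ("one big miss", one_big_miss)]:
    print(f"{label}: MSE={mse(actual, predicted):.4f}, "
          f"MAE={mae(actual, predicted):.4f}")
```

Going from the first set of predictions to the second, MAE rises from 0.05 to 0.125, a factor of 2.5, while MSE rises from 0.0025 to 0.0625, a factor of 25. The single large miss dominates MSE far more than MAE.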

OUT OF SAMPLE PREDICTION

Now, we will try to predict the MVP for the years 2022, 2023 and 2024. The data for these years was not used when training the model, so the model has not seen it before.
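The idea of holding out a season and ranking its players by predicted share can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic, the `season` column is a hypothetical name, and only two feature names are borrowed from the selected list.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic player-season rows; values are random, not NBA data.
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "season": rng.integers(2015, 2023, size=n),  # seasons 2015 to 2022
    "vorp": rng.normal(size=n),
    "win_loss_pct": rng.uniform(0.2, 0.8, size=n),
})
raw_share = 0.2 * df["vorp"] + 0.5 * (df["win_loss_pct"] - 0.5)
df["award_share"] = np.clip(raw_share, 0.0, 1.0)

features = ["vorp", "win_loss_pct"]

# Train on earlier seasons only, so the 2022 rows stay unseen.
train = df[df["season"] < 2022]
holdout = df[df["season"] == 2022]

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(train[features], train["award_share"])

# Predict the held-out season and keep the original row index for lookup.
predicted = pd.Series(model.predict(holdout[features]), index=holdout.index)

for idx, share in predicted.sort_values(ascending=False).head(3).items():
    print(f"Player at index {idx}, Predicted Award Share: {share:.4f}")

# The row with the highest predicted share is the predicted winner.
predicted_winner = predicted.idxmax()
```

Keeping the DataFrame index on the predictions is what makes it possible to map a line such as "Player at index 289" back to a player name in the dataset.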

Predicting the Winner of the 2022 NBA MVP Award

In 2022, Nikola Jokic won the MVP award. Using the random forest regressor model, I predicted the players’ MVP shares for 2022, and Nikola Jokic came out with the highest award share.

Index 289 in the dataset corresponds to Nikola Jokic.

Player at index 289, Predicted Award Share: 0.7092
Photo by Silvan Arnet on Unsplash

Predicting the Winner of the 2023 NBA MVP Award

In 2023, Joel Embiid won the MVP award. I created a dataset of the 7 players with the highest chances of winning in 2023 and ran the prediction on them. In my model, the winner is again Nikola Jokic. Nikola Jokic is represented by index 0 and Joel Embiid by index 2.

Player at index 0, Predicted Award Share: 0.5298
Player at index 1, Predicted Award Share: 0.3049
Player at index 2, Predicted Award Share: 0.4485
Player at index 3, Predicted Award Share: 0.3781
Player at index 4, Predicted Award Share: 0.0878
Player at index 5, Predicted Award Share: 0.0293
Player at index 6, Predicted Award Share: 0.0313
Photo by engin akyurt on Unsplash

The model predicted that Nikola Jokic would win again in 2023, but the award went to Joel Embiid. There could be several reasons for this discrepancy, but two stand out. First, the features selected through the feature importance analysis might not adequately explain our dependent variable, so it may be necessary to reevaluate the feature selection. Second, it’s crucial to remember that voting is subjective.

Predicting the Winner of the 2024 NBA MVP Award

The winner of the 2024 NBA MVP will be announced in early May, but based on the data from the games played so far, we can make a prediction. For 2024, I selected the 5 candidates with the highest probability of winning: Nikola Jokic, Luka Doncic, Joel Embiid, Giannis Antetokounmpo, and Shai Gilgeous-Alexander. With the current data, the model predicts that Joel Embiid is the most likely winner in 2024.

Photo by Markus Spiske on Unsplash

Index 0: Nikola Jokic, Index 1: Luka Doncic, Index 2: Joel Embiid, Index 3: Giannis Antetokounmpo, Index 4: Shai Gilgeous-Alexander

Player at index 0, Predicted Award Share: 0.1882
Player at index 1, Predicted Award Share: 0.1379
Player at index 2, Predicted Award Share: 0.4142
Player at index 3, Predicted Award Share: 0.2775
Player at index 4, Predicted Award Share: 0.1546

I will predict the award share again with the updated dataset just before the 2024 winner is announced, and I will update this post with the result. (I am a Dallas Mavericks fan, so I hope Luka Doncic wins.) :))

Conclusion

In summary, the prediction for 2022 was accurate, while the one for 2023 was incorrect, and as mentioned above there could be various reasons for this. One notable observation is the model’s tendency to predict high award-share values, especially for players with a high probability of winning. Different models and features can be used. If I obtain different results with alternative models, I will update this section…

You can access all the code and data related to this project from my GitHub account.
