Stephen Curry Shot Selection and Predicting His Shots with Machine Learning

Ibrahim Kiceci
12 min read · Mar 21, 2024


Photo by Oleksii S on Unsplash

Many sports organizations analyze player and match performance through data and draw important conclusions from it. The NBA is one of the organizations for which a lot of match and player data is accessible. Although prediction models are not as successful as they are in baseball, we can still produce useful analyses thanks to the large amount of data provided by the NBA and other basketball organizations. In my previous post, I tried to predict the NBA MVP using machine learning; you can find it here.

A few months ago, I found Stephen Curry’s shot data from 2009 to 2023 on Kaggle. The data set is quite detailed. For instance, it includes the location of each shot taken by Stephen Curry, the type of shot, whether the shot was successful, and the seconds remaining when the shot was taken. I thought I could find meaningful results by visualizing this data and predict Stephen Curry’s shots by applying machine learning methods. Thus, in the first part of this article, I will try to reveal some interesting results by visualizing the data. In the second part, I will try to predict whether his shots are successful by using machine learning methods.

EXPLORATION AND ANALYSIS OF THE DATA SET

I will not explain in detail how I cleaned the dataset here. You can access the code from my GitHub account. As shown below, after cleaning the data, we have 15650 entries and 79 columns.

After exploring the data, I need a basketball court for data visualization. The dataset records where each shot was taken through columns such as LOC_X, LOC_Y and SHOT_DISTANCE, and I need to place these points accurately on a basketball court. For this purpose, I found a basketball court previously drawn in Python. Source: http://savvastjortjoglou.com/nba-shot-sharts.html

## source: http://savvastjortjoglou.com/nba-shot-sharts.html

import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Rectangle, Arc

def draw_court(ax=None, color='white', lw=2, outer_lines=False, bg_color='darkblue'):
    # If no axes object is passed in, draw on the current one
    if ax is None:
        ax = plt.gca()
    ax.set_facecolor(bg_color)

    # Creating the basketball hoop
    # Diameter of a hoop is 18" so it has a radius of 9", which is a value
    # 7.5 in our coordinate system
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)

    # backboard
    backboard = Rectangle((-30, -7.5), 60, -1, linewidth=lw, color=color)

    # The paint
    # outer box of the paint, width=16ft, height=19ft
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    # inner box of the paint, width=12ft, height=19ft
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)

    # free throw top arc
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)
    # free throw bottom arc
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')
    # Restricted Zone, it is an arc with 4ft radius from center of the hoop
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)

    # Three point line
    # the side 3pt lines, they are 14ft long before they begin to arc
    corner_three_a = Rectangle((-220, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((220, -47.5), 0, 140, linewidth=lw, color=color)
    # 3pt arc - center of arc will be the hoop, arc is 23'9" away from hoop
    three_arc = Arc((0, 0), 475, 475, theta1=22, theta2=158, linewidth=lw,
                    color=color)

    # Center Court
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)

    # List of the court elements to be plotted onto the axes
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]

    if outer_lines:
        # Draw the half court line, baseline and side out bound lines
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)

    # Add the court elements onto the axes
    for element in court_elements:
        ax.add_patch(element)

    return ax
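
The comments in draw_court imply a scale of 10 coordinate units per foot (a 16 ft wide paint is drawn 160 units wide), with the hoop at the origin. Under that reading, and assuming LOC_X and LOC_Y use the same scale, the distance of a shot from the hoop can be sketched as below. The helper names are hypothetical and are not part of the dataset or of the court-drawing code.

```python
import math

# Assumption inferred from draw_court: 16 ft of paint width -> 160 units
UNITS_PER_FOOT = 10


def feet_to_units(feet):
    """Convert a length in feet to court coordinate units."""
    return feet * UNITS_PER_FOOT


def shot_distance_feet(loc_x, loc_y):
    """Straight-line distance from the hoop at (0, 0), in feet."""
    return math.hypot(loc_x, loc_y) / UNITS_PER_FOOT


# The 9 inch (0.75 ft) hoop radius and a shot from the top of the 3pt arc
print(feet_to_units(0.75))
print(shot_distance_feet(0, 237.5))
```

This also gives a quick sanity check for the court drawing: the three-point arc is drawn with a diameter of 475 units, which matches a radius of 23 ft 9 in.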

SHOT ANALYSIS AND DATA VISUALIZATION

Now I can start analyzing the data by visualizing it.

1. An Overview of Successful and Unsuccessful Shots for Stephen Curry

First, let’s get an overview of Curry’s successful and unsuccessful shots using a scatter plot.

nba_chart = ['SHOT_RESULT', 'LOC_X', 'LOC_Y']
df_chart = df_visual[nba_chart]

## Creating a figure and axis
plt.figure(figsize=(12, 11), facecolor='darkblue')
draw_court(outer_lines=True, bg_color='darkblue')
ax = plt.gca()

## Drawing the basketball court
draw_court(ax)

## visualization of the successful shots (SHOT_RESULT == 1)
## getting x and y coordinates of successful shots
made_shots = df_chart[df_chart['SHOT_RESULT'] == 1]
plt.scatter(made_shots['LOC_X'], made_shots['LOC_Y'], color='green', alpha=0.5, label='Successful Shot')
## unsuccessful shots can be plotted the same way by filtering on SHOT_RESULT == 0

# setting the court for half
plt.xlim(-250, 250)
plt.ylim(422.5, -47.5)

# legend
plt.legend()

plt.show()

2. Heat Map for Curry’s Successful and Unsuccessful Shots

Heat-maps will help us to understand in which areas Curry made more successful or unsuccessful shots.

plt.figure(figsize=(12, 11))
ax = plt.gca()

draw_court(ax,color = 'black')

# successful shots
successful_shots = df_chart[df_chart['SHOT_RESULT'] == 1]
#unsuccessful_shots = df_chart[df_chart['SHOT_RESULT'] == 0]

# Create heatmap for successful shots
sns.kdeplot(x=successful_shots['LOC_X'], y=successful_shots['LOC_Y'], cmap='Blues', fill=True, thresh=0, levels=100, alpha=0.7, ax=ax, cbar=True, cbar_kws={'label': 'Successful Shots'})

plt.xlim(-250, 250)
plt.ylim(422.5, -47.5)

plt.show()

## Darker colors on the heat map indicate areas where successful shots are concentrated.
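
A density plot of made shots mostly reflects where Curry shoots often. To compare accuracy between areas, one option is to bin shots into a coarse grid and divide makes by attempts in each cell. The sketch below uses a few made-up shots and a hypothetical 50-unit cell size; only the meaning of LOC_X, LOC_Y and SHOT_RESULT is taken from the dataset.

```python
from collections import defaultdict


def make_rate_by_cell(shots, cell_size=50):
    """Return {(col, row): makes / attempts} for a square grid.

    shots: iterable of (loc_x, loc_y, shot_result), shot_result in {0, 1}.
    """
    attempts = defaultdict(int)
    makes = defaultdict(int)
    for loc_x, loc_y, result in shots:
        # Floor division assigns each shot to the grid cell that contains it
        cell = (int(loc_x // cell_size), int(loc_y // cell_size))
        attempts[cell] += 1
        makes[cell] += result
    return {cell: makes[cell] / attempts[cell] for cell in attempts}


# Made-up shots: two near the hoop, two in the left corner
sample_shots = [(10, 20, 1), (30, 40, 0), (-220, 10, 1), (-210, 30, 1)]
rates = make_rate_by_cell(sample_shots)
print(rates)
```

Cells with very few attempts give noisy rates, so a minimum attempt count per cell would be a sensible extra filter on real data.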

3. Successful Shots with Color Gradient Based on Total Seconds Remaining

I can visualize the relation between the time remaining in the game and Curry’s shots. Because there are so many shots, it may be difficult to draw a clear conclusion. However, I can at least observe where he scored as time was running out. I use a color gradient to represent the time remaining.

nba_chart = ['SHOT_RESULT', 'LOC_X', 'LOC_Y', 'TOTAL_SECONDS_REMAINING']
df_chart = df_visual[nba_chart]

# Creating a figure and setting the color
plt.figure(figsize=(12, 11), facecolor='grey')
ax = plt.gca()
draw_court(outer_lines=True, bg_color='grey')

# Drawing the basketball court
draw_court(ax)

# successful shots (SHOT_RESULT == 1) with color gradient based on TOTAL_SECONDS_REMAINING
successful_shots = df_chart[df_chart['SHOT_RESULT'] == 1]
sc = ax.scatter(
    successful_shots['LOC_X'],
    successful_shots['LOC_Y'],
    c=successful_shots['TOTAL_SECONDS_REMAINING'],
    cmap='Greens',
    alpha=0.5,
    label='Successful Shot',
)

# colorbar
cbar = plt.colorbar(sc)
cbar.set_label('Total Seconds Remaining')

# setting the court as a half
plt.xlim(-250, 250)
plt.ylim(422.5, -47.5)

plt.legend()

# title
plt.title('Successful Shots with Color Gradient Based on Total Seconds Remaining')

plt.show()
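
TOTAL_SECONDS_REMAINING is used here as it appears in the cleaned dataset. If it had to be rebuilt, one plausible construction, assuming the raw data carries the period plus separate minutes and seconds left in that period, is sketched below. The function and argument names are hypothetical, and whether the column counts time left in the period or in the whole game is not stated in this post.

```python
def seconds_left_in_period(minutes_remaining, seconds_remaining):
    """Time left in the current period, in seconds."""
    return minutes_remaining * 60 + seconds_remaining


def seconds_left_in_regulation(period, minutes_remaining, seconds_remaining,
                               period_minutes=12, periods=4):
    """Time left in regulation (four 12-minute periods), in seconds.

    For overtime periods only the time left in that period is returned.
    """
    in_period = seconds_left_in_period(minutes_remaining, seconds_remaining)
    if period > periods:
        return in_period
    later_periods = periods - period
    return later_periods * period_minutes * 60 + in_period


print(seconds_left_in_period(2, 30))
print(seconds_left_in_regulation(1, 12, 0))
```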

4. Stephen Curry’s Step Back Jump Shot

As I mentioned above, the data set includes the shot type of each shot taken by Curry. I selected the Step Back shot, the most famous shot type, to visualize here. Let’s analyze in detail how Stephen Curry performed with the Step Back Jump Shot.

First, let’s visualize successful step back shots.

nba_chart = ['SHOT_RESULT', 'LOC_X', 'LOC_Y', 'ACTION_TYPE_Step_Back_Jump_shot']
df_chart = df_visual[nba_chart]

# Creating a figure and setting the color
plt.figure(figsize=(12, 11), facecolor='grey')
ax = plt.gca()
draw_court(outer_lines=True, bg_color='grey')

## the basketball court
draw_court(ax)

# Scatter plot of successful shots for 'ACTION_TYPE_Step_Back_Jump_shot'
step_back_jump_shot_df = df_chart[df_chart['ACTION_TYPE_Step_Back_Jump_shot'] == 1]
successful_shots = step_back_jump_shot_df[step_back_jump_shot_df['SHOT_RESULT'] == 1]


plt.scatter(
    successful_shots['LOC_X'],
    successful_shots['LOC_Y'],
    color='green',
    alpha=0.5,
    label='Successful Step Back Jump Shot',
)


# half court
plt.xlim(-250, 250)
plt.ylim(422.5, -47.5)

plt.legend()

plt.title('Step Back Jump Shot - Successful Shot')

# Display the plot
plt.show()

Now let’s find out in which areas Curry is more successful or unsuccessful when making Step Back shots.

nba_chart = ['SHOT_RESULT', 'LOC_X', 'LOC_Y', 'ACTION_TYPE_Step_Back_Jump_shot']
df_chart = df_visual[nba_chart]

# Creating a figure and color
plt.figure(figsize=(12, 11), facecolor='grey')
ax = plt.gca()
draw_court(outer_lines=True, bg_color='grey')

#the basketball court
draw_court(ax)

# Scatter plot for 'ACTION_TYPE_Step_Back_Jump_shot' with different colors for successful shots
step_back_jump_shot_df = df_chart[df_chart['ACTION_TYPE_Step_Back_Jump_shot'] == 1]
successful_shots = step_back_jump_shot_df[step_back_jump_shot_df['SHOT_RESULT'] == 1]

## grouping by 'LOC_X' and 'LOC_Y' and counting the successful shots at each location
shot_counts = successful_shots.groupby(['LOC_X', 'LOC_Y']).size().reset_index(name='counts')

# Set up the colormap and normalize based on shot frequency
cmap = plt.get_cmap('Wistia') ## yellow to orange
## getting the min and max values for the heat map
## normalize can convert the data to color scale
normalize = plt.Normalize(vmin=shot_counts['counts'].min(), vmax=shot_counts['counts'].max())

# Scatter plot for the main points
sc_main = plt.scatter(
    successful_shots['LOC_X'],
    successful_shots['LOC_Y'],
    c=successful_shots['LOC_Y'],  # Using 'LOC_Y' for color to separate points
    cmap=cmap,
    s=50,  ## size of main points
    alpha=0.7,  ## transparency
    edgecolors='w',  ## white
    linewidth=0.5,  ## width
)

# Scatter plot for the shadow points; it highlights the locations with the most successful step back jump shots
sc_shadow = plt.scatter(
    shot_counts['LOC_X'],
    shot_counts['LOC_Y'],
    c=shot_counts['counts'],
    cmap=cmap,
    s=50,  # size for the shadow points
    alpha=0.7,
    edgecolors='none',  ## none for the edge
    norm=normalize,  ## mapping shot frequency to the color scale
)

# Adding colorbar
cbar = plt.colorbar(sc_shadow, ax=ax)
cbar.set_label('Shot Frequency')

plt.legend([sc_main, sc_shadow], ['The Highest Successful Step Back Jump Shot', 'Shot Frequency'])

plt.title('Step Back Jump Shot - The Most Successful Shot Area')

plt.show()

Now, let’s find out how many times Stephen Curry was successful or unsuccessful with the Step Back Jump Shot in each area of the court.

## getting the necessary columns 
nba_chart = ['SHOT_RESULT', 'LOC_X', 'LOC_Y', 'ACTION_TYPE_Step_Back_Jump_shot']
df_chart = df_visual[nba_chart]
## determining axis and colors so we can draw the basketball court
plt.figure(figsize=(12, 11), facecolor='navy')
ax = plt.gca()

# Drawing the basketball court
draw_court(outer_lines=True, bg_color='navy')

### Defining the regions according to the x and y axes; the thresholds were set manually
### 5 regions: left_corner, right_corner, center, right_center and left_center
regions = [
    {'name': 'Left Corner', 'condition': (df_chart['LOC_X'] <= -200) & (df_chart['LOC_Y'] <= 100)},
    {'name': 'Right Corner', 'condition': (df_chart['LOC_X'] > 100) & (df_chart['LOC_Y'] < 90)},
    {'name': 'Center', 'condition': (df_chart['LOC_X'] > -50) & (df_chart['LOC_X'] < 90)},
    {'name': 'Right Center', 'condition': (df_chart['LOC_X'] > 90) & (df_chart['LOC_Y'] >= 100)},
    {'name': 'Left Center', 'condition': (df_chart['LOC_X'] < -50) & (df_chart['LOC_Y'] >= 100)},
]

# choosing a colormap and normalize based on region
cmap = plt.get_cmap('Set1')
normalize = plt.Normalize(vmin=0, vmax=len(regions) - 1)

for region_index, region in enumerate(regions):
    # Filtering unsuccessful shots (SHOT_RESULT == 0) for 'ACTION_TYPE_Step_Back_Jump_shot' in the specific region
    ## every region is filtered by shot result and action type
    ## enumerate pairs each region with an index, which is used to pick its color
    unsuccessful_shots_region = df_chart[
        (df_chart['SHOT_RESULT'] == 0) &
        (df_chart['ACTION_TYPE_Step_Back_Jump_shot'] == 1) &
        region['condition']
    ]

    # Scatter plot for the region
    sc_region = plt.scatter(
        unsuccessful_shots_region['LOC_X'],
        unsuccessful_shots_region['LOC_Y'],
        color=cmap(region_index),  ## different color for each region
        alpha=0.7,  ## transparency
        label=region['name'],  ## name of the region
    )


legend = plt.legend()

# adding the count for each region next to the color legend
for region_index, region in enumerate(regions):
    count = len(df_chart[
        (df_chart['SHOT_RESULT'] == 0) &
        (df_chart['ACTION_TYPE_Step_Back_Jump_shot'] == 1) &
        region['condition']
    ])
    legend.get_texts()[region_index].set_text(f"{region['name']}: {count}")  ## adding the total number of unsuccessful shots


plt.title('Step Back Jump Shot - Unsuccessful Shot Regions', color = 'White')

plt.show()
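
Because each region is an independent boolean mask, a shot can fall into no region or into more than one, and it is then either left out of the plot or counted twice. The sketch below restates the same thresholds as plain functions so that the coverage of the manual boundaries can be checked on individual points. The names REGION_RULES and matching_regions are hypothetical helpers, not part of the code above.

```python
# Same thresholds as the regions list above, in the same order
REGION_RULES = [
    ("Left Corner", lambda x, y: x <= -200 and y <= 100),
    ("Right Corner", lambda x, y: x > 100 and y < 90),
    ("Center", lambda x, y: -50 < x < 90),
    ("Right Center", lambda x, y: x > 90 and y >= 100),
    ("Left Center", lambda x, y: x < -50 and y >= 100),
]


def matching_regions(x, y):
    """Return the names of every region whose condition holds for (x, y)."""
    return [name for name, rule in REGION_RULES if rule(x, y)]


print(matching_regions(0, 200))      # a shot straight on from the top
print(matching_regions(-100, 50))    # left wing, below the y >= 100 cut-off
print(matching_regions(-210, 100))   # on the boundary of two regions
```

Running a check like this over all step back shots would show how many of them the five regions actually cover before the legend counts are interpreted.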

5. Stephen Curry’s Shot Success in Home and Away Games

Let’s take a look at Stephen Curry’s performance in home and away games.

plt.figure(figsize=(10, 6))

## countplot of made and missed shots for away (0) and home (1) games
sns.countplot(x='GSW_HOME', hue='SHOT_RESULT', data=df_visual, palette='Set1')

plt.xlabel('GSW_AWAY AND GSW_HOME')
plt.ylabel('Count')
plt.title('Shot Success by GSW_AWAY & GSW_HOME')
plt.xticks(ticks=[0, 1], labels=['Away', 'Home'])
plt.show()
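
The count plot shows raw totals, so a numeric summary per venue can help when the number of attempts differs between home and away games. The following is a small sketch on made-up rows; only the column names GSW_HOME and SHOT_RESULT come from the dataset, and the values are invented for illustration.

```python
import pandas as pd

# Made-up rows: GSW_HOME is 1 for home games, SHOT_RESULT is 1 for a made shot
toy = pd.DataFrame({
    "GSW_HOME": [1, 1, 1, 0, 0, 0, 0, 1],
    "SHOT_RESULT": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Attempts, makes and make rate for each venue
summary = (
    toy.groupby("GSW_HOME")["SHOT_RESULT"]
    .agg(attempts="count", makes="sum", make_rate="mean")
    .rename(index={0: "Away", 1: "Home"})
)
print(summary)
```

On the real data the same call would take df_visual in place of toy.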

6. Relationship Between Period and Shot Results for Stephen Curry

Although the match period and the shot result are not correlated, we can still see from the graph that Stephen Curry made more shots in the 1st and 3rd periods.

USING MACHINE LEARNING METHODS TO PREDICT CURRY’S SHOT RESULT

In this part, I will try to predict whether Stephen Curry’s shots will be successful or not using machine learning methods.

First, I have a lot of independent features, and many of them have minimal impact on the dependent variable (SHOT_RESULT), as we can see in the correlation matrix, which you can check on my GitHub account. Therefore, I will use a feature selection method to eliminate the features that have the least impact on the dependent variable.

I will decide which feature selection method, scaling method, and classifier method to use by using a Pipeline.

Let’s go to the coding part.

FEATURE SELECTION, SCALING AND BEST CLASSIFIER METHOD

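import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

## Separating features (X) and target (y)

X = df.drop("SHOT_RESULT", axis=1) ## independent features
y = df["SHOT_RESULT"] ## dependent feature

# Splitting the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

## Defining Classifiers
classifiers = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=3),
    RandomForestClassifier(n_estimators=10, max_depth=5, random_state=42),
    AdaBoostClassifier(learning_rate=0.1),
    DecisionTreeClassifier(max_depth=5),
    SVC(kernel='linear', C=0.025, probability=True),
    XGBClassifier(),
]

feature_selection_methods = [
    SelectFromModel(RandomForestClassifier(n_estimators=10, max_depth=5, random_state=42)),
    VarianceThreshold(threshold=0.25),
    SelectKBest(score_func=mutual_info_classif, k=10),
    SelectFromModel(LinearSVC(dual='auto')),
]
scaling_methods = [MinMaxScaler(), StandardScaler()]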

best_accuracy = 0
best_params = {}

## Creating the 5-fold cross-validation object

cv = KFold(n_splits=5, shuffle=True, random_state=42)

for feature_selection_method in feature_selection_methods:
    for scaling_method in scaling_methods:
        for classifier in classifiers:
            ## Creating pipeline
            pipeline = Pipeline([
                ('feature_selection', feature_selection_method),
                ('scaling', scaling_method),
                ('classifier', classifier)
            ])

            #pipeline.fit(X, y)
            #selected_features = X.columns[pipeline.named_steps['feature_selection'].get_support()]

            ## Cross-validation accuracy (mean over the 5 folds)
            accuracy = np.mean(cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy'))

            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = {
                    'feature_selection': feature_selection_method,
                    'scaling': scaling_method,
                    'classifier': classifier,
                    #'selected_features': selected_features
                }

print('Best Parameters:', best_params)
print('Best Accuracy:', best_accuracy)

The best accuracy value is 0.6344

feature_selection: SelectFromModel(estimator=LinearSVC(dual='auto')), scaling: StandardScaler

classifier: SVC(C=0.025, kernel='linear', probability=True)

According to our best accuracy value, we select SelectFromModel with LinearSVC as the feature selection method, StandardScaler as the scaling method, and SVC (kernel = 'linear') as the classifier.
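
One of the candidates in the search, VarianceThreshold(threshold=0.25), deserves a note when many columns are 0/1 indicators such as ACTION_TYPE_Step_Back_Jump_shot. The variance of a 0/1 column is p(1 - p), where p is the share of ones, and this can never exceed 0.25. To my understanding scikit-learn keeps only features whose variance is strictly above the threshold, so this candidate would drop every binary column. The short check below illustrates the arithmetic; the helper name is hypothetical.

```python
from statistics import pvariance

THRESHOLD = 0.25  # same value as in VarianceThreshold(threshold=0.25)


def binary_column_variance(share_of_ones):
    """Population variance of a 0/1 column with the given share of ones."""
    return share_of_ones * (1 - share_of_ones)


# Largest variance any 0/1 column can reach, scanning shares 0.00 to 1.00
largest = max(binary_column_variance(k / 100) for k in range(101))

# A perfectly balanced indicator column reaches that maximum
balanced_column = [0, 1, 0, 1, 1, 0]
print(largest, pvariance(balanced_column))

# Under a strict "variance > threshold" rule even this column is removed
survives = largest > THRESHOLD
print(survives)
```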

I used cross-validation to get a more reliable estimate of each pipeline’s performance and to reduce the risk of choosing a model that only fits a single split well. It lets me compare the candidate combinations by their mean accuracy over five folds and select the best one.
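
The search above scores each pipeline with cross_val_score on the full X and y, so the rows later used as X_test also take part in choosing the pipeline. A variant that keeps the test split untouched until the final evaluation might look like the sketch below. It runs on synthetic data from make_classification with a reduced list of candidates and a SelectKBest step chosen only for brevity; these are stand-ins for illustration and not the procedure used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the shot data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3),
}


def build_pipeline(classifier):
    """Same three-step layout as in the post: select, scale, classify."""
    return Pipeline([
        ("feature_selection", SelectKBest(score_func=f_classif, k=6)),
        ("scaling", StandardScaler()),
        ("classifier", classifier),
    ])


# Model selection sees the training split only
cv_scores = {
    name: cross_val_score(
        build_pipeline(clf), X_tr, y_tr, cv=cv, scoring="accuracy"
    ).mean()
    for name, clf in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# The held-out split is used once, for the final estimate
best_pipe = build_pipeline(candidates[best_name]).fit(X_tr, y_tr)
test_score = best_pipe.score(X_te, y_te)
print(best_name, cv_scores[best_name], test_score)
```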

PREDICTION

Now let’s fit the selected models to our training datasets (X_train and y_train).

## getting the best params
best_feature_selection = best_params['feature_selection']
best_scaling = best_params['scaling']
best_classifier = best_params['classifier']

# creating pipeline with the best parameters
best_pipeline = Pipeline([
('feature_selection', best_feature_selection),
('scaling', best_scaling),
('classifier', best_classifier)
])

best_pipeline.fit(X_train, y_train)

Now our best pipeline is ready. We can predict on the X_test dataset and then compare our predictions with the actual values (y_test). For this comparison, I will use the accuracy_score function.

y_pred = best_pipeline.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy: ", test_accuracy)

The test accuracy score is 0.6291. This means that approximately 63% of the shots in our test dataset were predicted correctly by the model.

Finally, I will evaluate the model with a confusion matrix, a classification report and a ROC curve.

EVALUATION

  • Confusion Matrix

I used a confusion matrix to compare the predicted values with the actual values.

data = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(data, columns=np.unique(y_test), index=np.unique(y_test))
df_cm.index.name = "Actual"
df_cm.columns.name = "Predicted"

plt.figure(figsize=(4, 4))
sns.set(font_scale=1.2)
sns.heatmap(df_cm, annot=True, annot_kws={'size': 12}, cbar=False, square=True, fmt="d", cmap='Reds')

plt.show()

There are 1272 shots where the actual value is 0 (unsuccessful shot) and the predicted value is also 0. There are 1190 shots where the actual value is 1 (successful shot) and the predicted value is also 1. There are 732 shots where the actual value is 0 and the predicted value is 1. Finally, there are 719 shots where the actual value is 1 and the predicted value is 0.

  • Classification Report

print(classification_report(y_test, y_pred))

Precision, recall and accuracy values are important for evaluating the model. In this model, the two classes can be considered balanced. Therefore, our precision and recall values are quite close to, or the same as, our accuracy value. Our accuracy value indicates that the model can be further improved. You can access evaluations regarding accuracy, recall, and precision from the source provided here.

  • ROC Curve

The ROC AUC score is 63%. This still indicates that the model can be improved. You can find more information about ROC AUC from here.
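
One way to read the ROC AUC is as the probability that a randomly chosen made shot receives a higher predicted score than a randomly chosen missed shot, with ties counted as half. The sketch below computes it that way on four made-up scores. In this post the scores would presumably come from something like best_pipeline.predict_proba(X_test)[:, 1], which is available because the SVC was created with probability=True; that call is an assumption on my part, since the plotting code is not shown here.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: 3 of the 4 pairs are ranked correctly
demo_labels = [0, 0, 1, 1]
demo_scores = [0.1, 0.4, 0.35, 0.8]
demo_auc = pairwise_auc(demo_labels, demo_scores)
print(demo_auc)
```

The double loop is quadratic in the number of shots, so for the full test set a library routine such as scikit-learn's roc_auc_score is the practical choice.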

CONCLUSION

I tried various classification models to achieve the best accuracy. However, reaching the best accuracy among these models does not mean that the model cannot be improved. As I continue to improve the accuracy with different models, I will keep updating this post.

Photo by Mike van den Bos on Unsplash
