Is LeBron James really a difference-maker?

Apr 14, 2021

The NBA is “Superstar Driven”, and that is one reason it is fun to watch. One of my favourite pastimes is hearing my husband quote an unbelievable amount of sports stats, often centered around the incredible achievements of the GOATs. It’s impossible to deny the formidable talent (both my husband’s ability to recall these stats in great detail and the stats themselves) and it’s amazing to be an audience to it.

Just for fun, I often will take the stance that LeBron James is overrated; this never fails to spur my husband into effectively presenting the arguments as to why he is most certainly not. (I similarly take the stance that Connor McDavid is overrated, but we’ll save that for another post).

He carried a team of garbage to the finals in 2015. Irving was injured. Have you heard of Timofey Mozgov? Because you never should of. LeBron just took the ball and drove the basket. Is he a difference-maker? I take offense to that question. That’s not even a question. — My husband

Many of us watched Michael Jordan’s Netflix special, The Last Dance, over the pandemic. In the 90s, I was a bit too young to really appreciate just how special he was, so it was awesome to watch the highlights and commentary. 10/10 recommend this program if you haven’t watched it yet! But it left me wondering, who are the current “Air Jordans” that I am now old enough to appreciate? Who am I sleeping on? Could it be LeBron?


Google Trends: The Last Dance (Michael Jordan Netflix)

To help answer this question for myself, I have shared this simple analysis of the Cleveland Cavaliers’ performance over the 2013–2016 seasons. Sports analytics is a well-established field used to find an edge that enhances team performance and to spot emerging talent.

Let me show you one such way we can harness Data in the Sports World:

Objective: Find statistical support for difference-makers in the NBA.

Null hypothesis = everyone is the same

Load the Packages

# Load Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import scipy.stats as st

Load the Data

df1 = pd.read_csv('nba_games_2013_2015.csv', sep=';')
df2 = pd.read_csv('nba_playoff_games_2016.csv', sep=';')

# Join the tables
frames = [df1, df2]
df = pd.concat(frames)
df.sample(50)

Good news! This dataset is complete and does not require any cleaning.
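If you want to check completeness for yourself, here is a minimal sketch of the kind of check I mean. It runs on a tiny made-up table (the `toy` DataFrame and its values are hypothetical, not the real game log), so one missing value is planted on purpose to show what a problem would look like.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the game log; the real data comes from the CSV files above.
toy = pd.DataFrame({
    "TEAM_ABBREVIATION": ["CLE", "GSW", "OKC"],
    "PTS": [101, 118, np.nan],   # one deliberately missing value
    "FG_PCT": [0.45, 0.51, 0.47],
})

# Count missing values per column; all zeros would mean every column is complete.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count fully duplicated rows, another common source of dirty data.
duplicate_rows = int(toy.duplicated().sum())
print("duplicate rows:", duplicate_rows)
```

Running the same two checks on `df` should report zeros everywhere if the dataset really needs no cleaning.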

Create DataFrames per Season

The dates are a bit misleading because a season spans 2 calendar years (e.g. the 2013 season runs from Oct 2013 to April 2014). For example, the 2012–13 regular season ended on April 17, 2013, and its playoffs began on April 20, 2013, and ended on June 20, 2013.

df_2013 = df[df['SEASON_ID']==22013]
df_2014 = df[df['SEASON_ID']==22014]
df_2015 = df[df['SEASON_ID']==22015]
df_2015PO = df[df['SEASON_ID']==42015]
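The SEASON_ID values above look like a season-type digit followed by the year the season starts. Here is a small sketch of that reading. The mapping of 2 to regular season and 4 to playoffs is my assumption based on the IDs used in this post, and `describe_season` is a helper name I made up for illustration.

```python
def describe_season(season_id: int) -> str:
    """Split a SEASON_ID such as 22013 into a season label and a season type."""
    # Assumed from the IDs used in this post: 2 = regular season, 4 = playoffs.
    season_types = {2: "regular season", 4: "playoffs"}
    type_code, start_year = divmod(season_id, 10000)
    label = f"{start_year}-{(start_year + 1) % 100:02d}"
    return f"{label} {season_types.get(type_code, 'unknown type')}"

for sid in (22013, 22014, 22015, 42015):
    print(sid, "->", describe_season(sid))
```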

Data Exploration

fig_dims = (36, 24)
fig, ax = plt.subplots(figsize=fig_dims)
bplot = sns.boxplot(x='TEAM_ABBREVIATION', y='PTS', data=df, ax=ax)

bplot.axes.set_title("Points per Game for all NBA Teams 2013-2015 Seasons",
                    fontsize=30)
bplot.set_xlabel("Team Name",
                fontsize=20)
plt.xticks(rotation=45, fontsize=20)
plt.yticks(fontsize=20)

bplot.set_ylabel("Points per Game",
                fontsize=20)

# output file name
plot_file_name="boxplot.jpg"

# save as jpeg
bplot.figure.savefig(plot_file_name,
                    format='jpeg',
                    dpi=100)

The boxplot gives an interesting visual: points per team seem to be similarly distributed, except for (hey, who’s that?) the Golden State Warriors (GSW) and Oklahoma City (OKC).

Follow-up question: Did you know that GSW has more 3-point attempts than other teams? It’s a question of offensive philosophies.

Me: Why is OKC above the others?
Him: What years are you looking at?
Me: 2013–2016
Him: Who did they (OKC) have? (rhetorical)
Kevin Durant! Of course. The 2nd greatest basketball player in the world.

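# Set the figure size before plotting so it applies to this heatmap
sns.set(rc={'figure.figsize':(10,10)})
# numeric_only=True skips text columns such as team names (needed in pandas 2.0+)
sns.heatmap(abs(df.corr(numeric_only=True)), cmap='coolwarm', linewidths=.5, vmin=0, vmax=1, annot=False)
plt.title('NBA Correlation')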

What insights can you see in the heatmap? It would be interesting to do a similar analysis with the NHL data on puck possession. The more you have the ball/puck, the more you score, the more you win. Swish.
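A heatmap is great for a first look, but a ranked list can be easier to read. Here is a rough sketch of how you might rank columns by their absolute correlation with PTS. The data below is synthetic and the column relationships are invented for illustration (FGA stands for field goal attempts, NOISE is a random column), so the numbers say nothing about the real NBA dataset.

```python
import numpy as np
import pandas as pd

# Synthetic game log (NOT the real data): points depend on attempts and accuracy.
rng = np.random.default_rng(42)
n_games = 200
fga = rng.integers(75, 95, n_games)          # field goal attempts (hypothetical)
fg_pct = rng.normal(0.45, 0.04, n_games)     # field goal percentage (hypothetical)
toy = pd.DataFrame({
    "FGA": fga,
    "FG_PCT": fg_pct,
    "PTS": 2.2 * fga * fg_pct + rng.normal(20, 3, n_games),
    "NOISE": rng.normal(0, 1, n_games),      # unrelated column for contrast
})

# Absolute correlation of every other column with PTS, strongest first.
pts_corr = toy.corr()["PTS"].drop("PTS").abs().sort_values(ascending=False)
print(pts_corr)
```

On the real data you would swap `toy` for `df` and pass `numeric_only=True` to `corr()` so that text columns are ignored.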

Setting Up the Problem Statement

Does the data support the common assertion that LeBron James is a difference-maker?

LeBron: Destroying hopes and dreams since 2003 (SI, 2013).

**Disclaimer — this is a tongue-in-cheek question because you KNOW he is the MVP; what is interesting though is that this helps us set up the workflow to expand this analysis to less clear areas of sport. Don’t come @ me.

Analysis

Test the hypothesis that the offensive production of the Cleveland Cavaliers and the Golden State Warriors (the two finalists) was equally distributed in 2015/2016, using two separate tests: one for PTS (Points) and one for FG_PCT (Field Goal Percentage).

FYI: in basketball, a field goal is a basket scored on any shot other than a free throw, worth two or three points depending on the distance of the attempt from the basket. The better a player’s shooting skills, the higher their field goal percentage will be. A player with good shooting skills usually averages about 40% from the field. The term “field” or “the basketball field” refers to the court, hence the name field goal.
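To make the percentage concrete, here is a tiny worked example. The shot counts are made up for illustration: field goal percentage is simply field goals made divided by field goals attempted, with two-pointers and three-pointers counted alike.

```python
# Hypothetical box-score line: 9 made field goals on 20 attempts.
field_goals_made = 9
field_goals_attempted = 20

# Field goal percentage = made / attempted (free throws are not included).
fg_pct = field_goals_made / field_goals_attempted
print(f"FG_PCT = {fg_pct:.3f}")
```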

I will use a t-test to test the null hypothesis.

# Null hypothesis: Cavs points and GSW points are equally distributed during the regular season
Cavs = df_2015[df_2015.TEAM_NAME=='Cleveland Cavaliers'].PTS.values
GSW = df_2015[df_2015.TEAM_NAME=='Golden State Warriors'].PTS.values

ttest, pval = st.ttest_ind(Cavs, GSW)
print(pval)

if pval < 0.05:
  print("We reject the null hypothesis")
else:
  print("We fail to reject the null hypothesis")

>>> 1.4233420547764935e-08
>>> We reject the null hypothesis

Based on the PTS per team for the 2015 season, we reject the hypothesis that the Cavaliers and the Golden State Warriors are equally distributed. From the boxplot above, we see in fact that GSW is better.
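One caveat worth knowing: by default `st.ttest_ind` runs Student's t-test, which assumes both groups have the same variance. Welch's t-test is a common alternative that drops that assumption. This is a different variant from the one used in this post, shown here only as a sketch on synthetic numbers (the two samples below are randomly generated, not the real Cavs and GSW games).

```python
import numpy as np
import scipy.stats as st

# Synthetic points-per-game samples (NOT the real Cavs/GSW data), 82 games each.
rng = np.random.default_rng(0)
team_a = rng.normal(loc=100, scale=10, size=82)
team_b = rng.normal(loc=115, scale=12, size=82)

# Student's t-test (the default) assumes both groups share the same variance.
t_student, p_student = st.ttest_ind(team_a, team_b)

# Welch's t-test drops that assumption via equal_var=False.
t_welch, p_welch = st.ttest_ind(team_a, team_b, equal_var=False)

print("Student p-value:", p_student)
print("Welch p-value:  ", p_welch)
```

When the two groups have similar sizes and spreads, as here, the two p-values tend to be close, so the choice mostly matters when one team is far more erratic than the other.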

# Null hypothesis: Cavs and GSW Field Goal Percentages are equally distributed during the regular season
Cavs = df_2015[df_2015.TEAM_NAME=='Cleveland Cavaliers'].FG_PCT.values
GSW = df_2015[df_2015.TEAM_NAME=='Golden State Warriors'].FG_PCT.values

ttest, pval = st.ttest_ind(Cavs, GSW)
print(pval)

if pval < 0.05:
  print("We reject the null hypothesis")
else:
  print("We fail to reject the null hypothesis")

>>> 0.00206097581047554
>>> We reject the null hypothesis

Based on the FG_PCT per team for the 2015 season, we reject the hypothesis that the Cavaliers and the Golden State Warriors are equally distributed. As with points in the boxplot above, GSW comes out ahead. We also happen to know that GSW is known for its ability to score 3-pointers.

On the surface, shouldn’t GSW be better than the Cavs? What else could be ‘at play’ here?

Test the hypothesis that points per game (PTS) are equally distributed in all 3 seasons for Cleveland. I will use a one-way ANOVA, a hypothesis test that compares the means of more than 2 groups.

Cavs_2013 = df_2013[df_2013.TEAM_NAME=='Cleveland Cavaliers'].PTS.values
Cavs_2014 = df_2014[df_2014.TEAM_NAME=='Cleveland Cavaliers'].PTS.values
Cavs_2015 = df_2015[df_2015.TEAM_NAME=='Cleveland Cavaliers'].PTS.values

stat, pval = st.f_oneway(Cavs_2013, Cavs_2014, Cavs_2015)
print("F statistic:", stat, "p-value from the F distribution:", pval)

if pval > 0.05:
    print("There were no statistically significant differences between group means as determined by one-way ANOVA")
else:
    print("There were statistically significant differences between group means as determined by one-way ANOVA")

>>>F statistic: 5.9200250318080885 p-value from the F distribution: 0.003087727119983984
>>> There were statistically significant differences between group means as determined by one-way ANOVA

What was the difference between these seasons? LeBron James was on the Cavaliers for the 2014–2015 and 2015–2016 seasons (all that is included in this dataset).

To test whether this mattered, check which pairs of seasons differ significantly:

# Null hypothesis that 2013 and 2014 were equally distributed
ttest, pval = st.ttest_ind(Cavs_2013, Cavs_2014)
print(pval)

if pval < 0.05:
  print("We reject the null hypothesis")
else:
  print("We fail to reject the null hypothesis")

>>> 0.013091680534336523
>>> We reject the null hypothesis

2013 (LeBron_No) and 2014 (LeBron_Yes) seasons are not equally distributed. Perhaps LeBron's re-joining in 2014 made a difference?

# Null hypothesis that 2014 and 2015 were equally distributed
ttest, pval = st.ttest_ind(Cavs_2014, Cavs_2015)
print(pval)

if pval < 0.05:
  print("We reject the null hypothesis")
else:
  print("We fail to reject the null hypothesis")

>>> 0.5203507617734474
>>> We fail to reject the null hypothesis

2014 (LeBron_Yes) and 2015 (LeBron_Yes) seasons are not significantly different. Perhaps LeBron’s consistent presence leads to consistency?
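A quick sanity check on the pairwise tests: running several tests raises the chance of a false positive, so it is common to adjust the p-values. The sketch below applies a Bonferroni correction (my addition, not part of the original analysis) to the two p-values reported above, multiplying each by the number of tests and capping the result at 1.

```python
# p-values reported above for the two pairwise t-tests
pairwise_pvals = {
    "2013 vs 2014": 0.013091680534336523,
    "2014 vs 2015": 0.5203507617734474,
}
alpha = 0.05
n_tests = len(pairwise_pvals)

# Bonferroni: scale each p-value by the number of tests, capped at 1.0
adjusted = {pair: min(p * n_tests, 1.0) for pair, p in pairwise_pvals.items()}

for pair, p_adjusted in adjusted.items():
    verdict = "reject" if p_adjusted < alpha else "fail to reject"
    print(f"{pair}: adjusted p = {p_adjusted:.4f} -> {verdict} the null hypothesis")
```

Under this adjustment the 2013 vs 2014 difference still clears the 0.05 bar, so the story does not change.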

Conclusion

Performing a simple statistical analysis on the Cavaliers’ performance with and without LeBron James seems to suggest that he did indeed matter.

Follow-Up Thoughts

The NBA is a game of Super Stars. The data is consistent with the idea that if you spend the money on a Super Star, they will deliver.

“You want to talk about practice?” — The Answer, Allen Iverson

The bigger question: Do you have a market to attract the Super Star?

This is a good place for the next exploration because this could be your limiting reagent. What does the market analysis tell me about your fan base and your brand culture? What is advantageous about your city? Disadvantageous?

Let’s connect to start exploring those insights → Stacey.Waldal@gmail.com

Bonus

What about the removal of Coach Blatt on the 24th of Jan, 2016?

What you need to know: In the 2014/2015 Season, under Blatt, the Cavaliers lost the finals. In the 2015/2016 Season, after the removal of Blatt in Jan, the Cavaliers won the finals. It is believed that Blatt + LeBron was not the right fit. Does the data show that the removal of Blatt was a difference-maker?

# Cavaliers' 2015 season games, with GAME_DATE parsed as datetimes
team1_15 = df_2015[df_2015.TEAM_NAME=='Cleveland Cavaliers'].copy()
team1_15['GAME_DATE'] = pd.to_datetime(team1_15['GAME_DATE'])

# take all of 2014 season points
team1_14 = df_2014[df_2014.TEAM_NAME=='Cleveland Cavaliers'].PTS.values

# take 2015 season before Jan 25, 2016 (Cavaliers only, so filter team1_15 rather than df_2015)
team1_15_before = team1_15[team1_15['GAME_DATE'] < pd.to_datetime('2016-01-25')].PTS.values

# PTS after Jan 25, 2016
blatt_no = team1_15[(team1_15['GAME_DATE'] >= pd.to_datetime('2016-01-25')) & (team1_15['GAME_DATE'] <= pd.to_datetime('2016-04-13'))].PTS.values

# dataset where Blatt was coaching
blatt_yes = np.concatenate((team1_14, team1_15_before))

The Kruskal-Wallis H-test tests the null hypothesis that the population medians of all of the groups are equal. It is a non-parametric version of ANOVA. The test works on 2 or more independent samples, which may have different sizes. Note that rejecting the null hypothesis does not indicate which of the groups differs. Post hoc comparisons between groups are required to determine which groups are different.

# Null hypothesis is that the team was the same regardless of who was the coach
statistic, pvalue = st.kruskal(blatt_yes, blatt_no)

if pvalue < 0.05:
  print("We reject the null hypothesis")
else:
  print("We fail to reject the null hypothesis")

>>> We reject the null hypothesis

Coaches appear to matter too.
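A reject/fail-to-reject verdict on its own does not say which group scored more. Here is a minimal sketch of how you could print the H statistic, the p-value, and each group's median side by side. The samples are synthetic and deliberately different in size (they are not the real `blatt_yes` and `blatt_no` arrays), and the variable names are placeholders.

```python
import numpy as np
import scipy.stats as st

# Synthetic points-per-game samples (NOT the real Cavaliers data); sizes differ on purpose.
rng = np.random.default_rng(1)
coach_a_pts = rng.normal(loc=101, scale=10, size=120)
coach_b_pts = rng.normal(loc=106, scale=10, size=40)

# Kruskal-Wallis compares the groups using ranks, so unequal sizes are fine.
h_stat, p_value = st.kruskal(coach_a_pts, coach_b_pts)

print("H statistic:", h_stat, "p-value:", p_value)
print("median under coach A:", np.median(coach_a_pts))
print("median under coach B:", np.median(coach_b_pts))
```

Comparing the two medians tells you the direction of the difference, which the test result alone does not.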

This analysis is adapted from the Lighthouse Labs curriculum for the Data Science Full Time Bootcamp.

You can find the dataset and Jupyter notebook on my GitHub.