Empirical Bayes-ketball

Applying Empirical Bayes to basketball

Jesse Fischer • September 6, 2020

How much can you learn about the three-point shooting of a player when data is limited?

Potential lottery picks Cole Anthony and Tyrese Haliburton each played only 22 games before injuries cut their 2020 seasons short (finishing with 141 and 124 three-point attempts, respectively). Potential #1 draft pick LaMelo Ball had 80 three-point attempts in his 12 NBL games. James Wiseman, the top high school recruit in the country, missed his only three-point attempt in just 69 minutes at Memphis.

How do you confidently evaluate shooting performance based on such limited data?

This is especially important in a year where numerous NBA lottery picks will be selected based on so few data points.

Trying to find a signal in the noise with such small sample sizes is not a new problem in sports analytics. This comes up every year in MLB after a player starts the season on a hot streak, giving hope that they will be the first person to hit .400 since Ted Williams last did it in 1941. As good Bayesians, we know that a player batting over .400 after 100 at bats has a better shot than another player after only 10 at bats, but neither is likely to top .400.

One simple approach to estimating a player's end-of-season batting average is to regress their current average to the mean. For example, take a weighted average of their current batting average and the league average, weighted by how far along in the season they are (or, better yet, regress toward their career average).
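To make the regression-to-the-mean idea concrete, here's a minimal sketch. The padding constant of 250 at-bats and the .260 league average are illustrative assumptions, not fitted values:

```python
# A sketch of regressing a batting average toward the league mean.
# The padding constant (250) and league average (.260) are assumptions
# chosen for illustration, not fitted from data.

def regress_to_mean(hits, at_bats, league_avg, padding=250):
    """Weighted average of a player's rate and the league average.

    The more at-bats a player has, the more weight their own average gets.
    """
    weight = at_bats / (at_bats + padding)
    player_avg = hits / at_bats if at_bats else league_avg
    return weight * player_avg + (1 - weight) * league_avg

# A .400 hitter after 10 at-bats barely moves off the league average...
early = regress_to_mean(hits=4, at_bats=10, league_avg=0.260)
# ...while the same average over 100 at-bats earns noticeably more credit.
later = regress_to_mean(hits=40, at_bats=100, league_avg=0.260)
print(round(early, 3), round(later, 3))
```

Both players are batting .400, but the 100-at-bat player's estimate lands meaningfully higher, exactly the Bayesian intuition from the Ted Williams example above.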

This simple concept is the basis for approaches referred to as the "stabilization rate" or "padding method" (often used by @Tangotiger). You may have also heard of this in relation to a concept called "Empirical Bayes", as there is a whole series of blog posts that apply Bayes to batting averages (along with many other interesting extensions).

Which NBA player has had the best three-point shooting season of all time?

The easiest way to answer this question is to look at single season 3P% leaders. Note that for the sake of simplicity we will overlook shot difficulty and league-wide changes in three-point shooting trends.

rank  name                team  year  3p%    3pm  3pa
1     Jamie Feick         NJN   2000  1.000    3    3
2     Raja Bell           GOS   2010  1.000    3    3
3     Antonius Cleveland  ATL   2018  1.000    3    3
4     Beno Udrih          MEM   2014  1.000    2    2
5     Don MacLean         MIA   2001  1.000    2    2

Looks like the top seasons all come from players who shot 100% on only a few attempts, so this approach isn't particularly informative.

As a next step, we can filter out seasons with fewer than X three-point attempts (I chose an arbitrary threshold of 100 attempts below).

rank  name          team  year  3p%    3pm  3pa
1     Pau Gasol     SAN   2017  0.538   56  104
2     Kyle Korver   UTH   2010  0.536   59  110
3     Jason Kapono  MIA   2007  0.514  108  210
4     Luke Babbitt  NOP   2015  0.513   59  115
5     Kyle Korver   ATL   2015  0.492  221  449
6     Hubert Davis  DAL   2000  0.491   82  167
7     Kyle Korver   CLE   2017  0.485   97  200
8     Troy Daniels  CHA   2016  0.484   59  122
9     Fred Hoiberg  MIN   2005  0.483   70  145
10    Jason Kapono  TOR   2008  0.483   57  118
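The threshold filter is a one-liner in pandas. A sketch with a toy dataframe (the column names `3pm`/`3pa` are assumptions matching the tables here, and the three rows are just a subset for illustration):

```python
import pandas as pd

# Toy data: a few of the seasons from the tables above.
df = pd.DataFrame({
    "name": ["Jamie Feick", "Jason Kapono", "Kyle Korver"],
    "3pm":  [3, 108, 221],
    "3pa":  [3, 210, 449],
})
df["3p%"] = df["3pm"] / df["3pa"]

# Keep only seasons with at least 100 attempts, then rank by 3p%.
leaders = (df[df["3pa"] >= 100]
           .sort_values("3p%", ascending=False)
           .reset_index(drop=True))
print(leaders["name"].tolist())  # Feick's 3-for-3 season drops out
```

The filter removes the perfect-but-tiny seasons, but as noted below, the chosen cutoff of 100 remains arbitrary.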

This list is more intuitive but with Pau Gasol leading the pack and Steph Curry not even on the list…we remain skeptical. This approach is also sensitive to the specific threshold chosen which adds undesirable subjectivity to the process. There must be a better way…Empirical Bayes!

Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data.

Since @drob does a better job explaining these concepts than I ever will, I highly recommend reading his Empirical Bayes book, which is a compilation of baseball-themed blog posts on the topic. He does an excellent job explaining concepts using practical examples and even includes sample code to follow along.

The only thing missing is Python-specific code (thanks stackoverflow!), which is why I've included a code snippet to aid others in performing their own Empirical Bayes! This code assumes a beta-binomial distribution, which is great for sports analytics because it can be applied to any "success/attempt" statistic.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

def betabinom_func(params, *args):
    # Negative log-likelihood of the beta-binomial with prior (a, b)
    a, b = params[0], params[1]
    k = args[0]  # successes (e.g. hits, 3pm)
    n = args[1]  # attempts (e.g. at_bats, 3pa)
    return -np.sum(betabinom.logpmf(k, n, a, b))

def solve_a_b(hits, at_bats, max_iter=250):
    # Fit the prior's alpha/beta by maximum likelihood
    result = minimize(betabinom_func, x0=[1, 10],
                      args=(hits, at_bats), bounds=((0, None), (0, None)),
                      method='L-BFGS-B', options={'disp': True, 'maxiter': max_iter})
    a, b = result.x[0], result.x[1]
    return a, b

# Sanity check your data to ensure hits <= at_bats, at_bats > 0, and both are type int
def estimate_eb(hits, at_bats):
    a, b = solve_a_b(hits, at_bats)
    return (hits + a) / (at_bats + a + b)

df['3p%_eb'] = estimate_eb(df['3pm'], df['3pa'])

The results look much better after applying Empirical Bayes - many great shooters and multiple Curry sightings!

rank  name           team  year  3p% (eb)  3p%    3pm  3pa
1     Kyle Korver    ATL   2015  0.446     0.492  221  449
2     Stephen Curry  GOS   2016  0.433     0.454  402  886
3     J.J. Redick    LAC   2016  0.433     0.475  200  421
4     Joe Johnson    PHX   2005  0.431     0.478  177  370
5     Jason Kapono   MIA   2007  0.431     0.514  108  210
6     Glen Rice      CHA   1997  0.431     0.470  207  440
7     Joe Harris     BRK   2019  0.429     0.474  183  386
8     Kyle Korver    ATL   2014  0.428     0.472  185  392
9     Steve Nash     PHX   2008  0.426     0.470  179  381
10    Stephen Curry  GOS   2013  0.426     0.453  272  600

It is also interesting to look at how strongly results are regressed toward the mean depending on how many attempts a player has.

I've included the optimal alpha/beta values in the table below so you can regress statistics on your own[1]. I'll leave it to the reader to compare these results with other techniques like NBA stabilization rates (recent work by @kmedved).

stat    success  attempt     alpha  beta   avg
3p%     3pm      3pa          73.2  137.3  0.348
2p%     2pm      2pa          54.9   60.7  0.475
fg%     fgm      fga          44.4   55.2  0.446
ft%     ftm      fta          15.5    5.5  0.736
ast%    ast      ast_opp       2.1   13.8  13.5
blk%    blk      opp_2p_fga    0.7   20.7  3.1
drb%    drb      drb_opp       6.0   35.4  14.5
orb%    orb      orb_opp       2.0   33.0  5.7
stl%    stl      poss          8.4  508.0  1.6
pf%     pf       poss          8.5  159.8  5.0
tov%    tov      poss         13.3   83.6  13.8
usg%    usg_num  poss         12.0   52.2  18.7
efg%    efg_num  fga          60.4   63.8  0.486
ftr[2]  fta      fga           2.8    6.7  0.292
3par    3pa      fga           0.5    1.9  0.214
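As a worked example of the padding calculation from the first footnote, here's a sketch applying the fitted 3p% prior (alpha = 73.2, beta = 137.3 from the table) to LaMelo Ball's 80 NBL attempts. His made-threes total isn't given above, so the 25 makes below is a placeholder assumption:

```python
# Fitted beta prior for 3p% (from the table above)
ALPHA, BETA = 73.2, 137.3

def regress(successes, attempts, alpha, beta):
    # Posterior mean of the beta-binomial: (success + alpha) / (attempt + alpha + beta)
    return (successes + alpha) / (attempts + alpha + beta)

# LaMelo Ball's NBL line: 80 attempts; 25 makes is an assumed placeholder
raw = 25 / 80                      # 0.3125 on the raw numbers
eb = regress(25, 80, ALPHA, BETA)  # pulled toward the league avg of 0.348
print(round(raw, 3), round(eb, 3))
```

Because 25-of-80 sits below the prior mean of 0.348, the estimate gets pulled up; a hot small-sample shooter would be pulled down by the same mechanism.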

Note that these numbers are based on NBA data dating back to the 1996-97 season. The game is continually evolving, which means different time periods can change results slightly. With additional complexity it is possible to enhance the approach by calculating different values for each season or decade.

When dealing with limited data (as is often the case in sports analytics), Empirical Bayes is a powerful tool. By objectively regressing toward the mean, we can avoid overreacting to outlier data points and more accurately evaluate small-sample performances.

In my next post, I will discuss a related topic of "hierarchical modeling" and look at some specific examples from the 2020 draft class.

  1. As a reminder, the calculation is "(success + alpha) / (attempt + alpha + beta)"

  2. Note that the traditional free-throw rate metric (fta/fga) isn't a true success/attempt proportion (a player can attempt more free throws than field goals), so technically the beta-binomial model doesn't apply. Since rates above 1.0 are very rare, the results still make sense; to make it foolproof, we could instead change the statistic to "fta/(fta+fga)".