A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics

Posted: October 27th, 2011 | Filed under: Python, R, Statistics

I’m happy to announce my most recent publication, “A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics,” in the Journal of Quantitative Analysis in Sports. Though this might sound boring unless you are a baseball fan and/or a Bayesian (and perhaps even then), the paper is fundamentally about how to choose which metrics are predictive, a topic anyone in statistics, analytics, or any other data-driven field should care deeply about.

I’ll try to motivate this in as general a setting as possible. Suppose you have some metric (batting average, earnings, engagement) for a population of individuals (baseball players, businesses, users of your product) over several time periods. A traditional random effects model estimates a separate intercept term for every individual, implicitly assuming that each one differs from the overall mean. In some situations that assumption is unrealistic.
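For concreteness, the standard setup looks like this (notation mine, not necessarily the paper’s): individual i’s metric in period t is the overall mean plus an individual intercept plus noise,

    y_{it} = \mu + \alpha_i + \epsilon_{it}, \qquad \alpha_i \sim N(0, \tau^2), \qquad \epsilon_{it} \sim N(0, \sigma^2),

so every individual receives a nonzero intercept almost surely, even individuals who are effectively average.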

Often populations contain individuals who are indistinguishable from average, meaning it’s better to estimate their value with the overall mean rather than with their own data. This implies the metric is not predictive for that player. Those not in this first group are, by definition, systematically high or low relative to the average. Examples of the second group include Barry Bonds, who always hit more home runs and took more steroids than average, Warren Buffett and Berkshire Hathaway, who always made better investments than average, or my Google friends’ use of Facebook since the release of Google+, which is systematically lower than average. This is best visualized as two distributions: a black spike with all of its probability at the overall mean (average individuals), and a red distribution with most of its probability far above or below that value (non-average individuals).
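Here is a minimal Python sketch of draws from such a point-mass mixture prior; the function and parameter names are mine for illustration, not from our sampler:

    import numpy as np

    rng = np.random.default_rng(42)

    def draw_effects(n, p_spike=0.7, tau=1.0):
        """Draw n individual effects from a point-mass mixture prior:
        with probability p_spike the effect is exactly 0 (an average
        individual), otherwise it comes from a diffuse Normal(0, tau^2)
        (a systematically non-average individual)."""
        is_average = rng.random(n) < p_spike
        effects = rng.normal(0.0, tau, size=n)
        effects[is_average] = 0.0  # collapse average individuals onto the spike
        return effects, is_average

    alphas, avg = draw_effects(1000)
    print(f"{avg.mean():.0%} of simulated individuals are exactly average")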

Once we have, for each metric, the probability that each player belongs to each of the two categories, we call the metric predictive if: a) most individuals are systematically different from the average, and b) most of the metric’s variance is explained by the model. Finally, the obligatory plot: on a holdout sample, our method performs at least as well as competing methods across all 50 metrics tested.
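In code, the decision rule looks roughly like the sketch below, assuming you already have posterior draws of each player’s 0/1 “non-average” indicator and the model’s fitted values; all names and cutoffs here are illustrative, not the ones from the paper:

    import numpy as np

    def metric_is_predictive(indicator_samples, y, y_hat,
                             prob_cutoff=0.5, share_cutoff=0.5, r2_cutoff=0.5):
        """indicator_samples: (n_draws, n_players) posterior draws of the
        0/1 non-average indicator for one metric.
        y, y_hat: observed values and model fits, for explained variance."""
        # a) posterior probability that each player is non-average,
        #    and the share of players for whom that probability is high
        p_nonavg = indicator_samples.mean(axis=0)
        share_nonavg = (p_nonavg > prob_cutoff).mean()
        # b) fraction of the metric's variance explained by the model
        r2 = 1.0 - np.var(y - y_hat) / np.var(y)
        return share_nonavg > share_cutoff and r2 > r2_cutoff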

For those interested (which should be everyone, but in practice is almost no one), our method also automatically controls for multiple testing, which matters since we perform 1,575 tests in our analysis: because every membership probability is estimated under a shared hierarchical mixture prior, the posterior probabilities are adjusted for multiplicity without any explicit correction.

This paper was co-authored with Blake McShane, James Piette, and Shane Jensen, and can be viewed as a more technical companion piece to our previous paper, “A Point-Mass Mixture Random Effects Model for Pitching Metrics,” which can be downloaded here. The poorly commented Python code for the MCMC sampler can be found here. If you’re interested in implementing or tweaking our methodology, feel free to send me an email or reach out on Twitter.


