A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics

Posted: October 27th, 2011 | Filed under: Python, R, Statistics

I’m happy to announce my most recent publication, “A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics”, in the Journal of Quantitative Analysis in Sports. Though this might sound boring unless you are a baseball fan and/or a Bayesian (and perhaps even then), the paper is fundamentally about how to choose which metrics are predictive, a topic anyone in statistics, analytics, or any other data-driven field should care deeply about.

I’ll try to motivate this in as general a setting as possible. Suppose you have some metric (batting average, earnings, engagement) for a population of individuals (baseball players, businesses, users of your product) over several time periods. A traditional random effects model estimates a separate intercept term for every individual, which assumes that every individual truly differs from the overall mean. In some situations that assumption is unrealistic.

Often populations contain individuals who are indistinguishable from average, meaning it’s better to estimate their value with the overall mean rather than with their own data. This implies the metric is not predictive for those individuals. By definition, those not in this first group are systematically high or low relative to the average. Examples of the second group include Barry Bonds, who always hit more home runs and took more steroids than average, Warren Buffett and Berkshire Hathaway, who always made better investments than average, or my Google friends’ use of Facebook since the release of Google+, which is systematically lower than average. This is best visualized by two distributions: the black spike with all its probability at the overall mean (average individuals), and the red distribution with most of its probability far above or below this value (non-average individuals).
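To make the spike-versus-red-distribution idea concrete, here is a minimal Python sketch of how one might compute the probability that an individual belongs to the non-average group. This is not the MCMC sampler from the paper. It assumes the overall mean, the noise variance, and the mixing weight are already known, and it swaps in a simple normal distribution centered at the mean for the red distribution. The function names and the batting-average numbers are made up for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def prob_non_average(y_bar, n, mu, sigma2, tau2, p_slab):
    """Posterior probability that an individual is in the non-average group.

    Assumptions (illustrative, not taken from the paper):
      y_bar  : the individual's observed average over n observations
      mu     : overall population mean (treated as known)
      sigma2 : variance of a single observation (treated as known)
      tau2   : variance of true values in the non-average group (normal slab)
      p_slab : prior probability of being in the non-average group
    """
    se2 = sigma2 / n  # sampling variance of the observed average
    # Average individual: true value is exactly mu, so only sampling noise.
    spike = (1 - p_slab) * normal_pdf(y_bar, mu, se2)
    # Non-average individual: true value varies around mu, plus sampling noise.
    slab = p_slab * normal_pdf(y_bar, mu, tau2 + se2)
    return slab / (spike + slab)


if __name__ == "__main__":
    # Hypothetical league: mean batting average .265 over 500 at-bats.
    for avg in (0.265, 0.290, 0.340):
        p = prob_non_average(avg, n=500, mu=0.265, sigma2=0.195,
                             tau2=0.0009, p_slab=0.5)
        print(f"observed {avg:.3f}: P(non-average) = {p:.3f}")
```

An observed average sitting right at the overall mean gets pulled toward the spike, while one far from the mean is assigned almost entirely to the non-average group. A full treatment would estimate all of these parameters jointly instead of plugging them in.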

Once we find the probability that each player belongs to each of the two categories for each metric, we can call a metric predictive if: a) most individuals are systematically different from the average and b) most of the metric’s variance is explained by the model. Finally, the obligatory plot showing our method performs at least as well on a holdout sample as other methods for the 50 metrics tested:

For those interested (which should be everyone but in practice is almost no one), our method also automatically controls for multiple testing, as we perform 1,575 tests in our analysis.
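As a rough illustration of the two criteria for a predictive metric described above, the check could be sketched in Python as follows. Reading “most” as “more than half” is my own assumption here, as are the function name, the parameter names, and the 0.5 cutoffs. The paper should be consulted for the actual decision rule.

```python
def is_predictive(prob_non_average, variance_explained,
                  member_cutoff=0.5, majority=0.5):
    """Apply the two criteria to one metric.

    prob_non_average   : per-individual probabilities of being non-average
    variance_explained : fraction of the metric's variance explained (0 to 1)
    member_cutoff      : probability above which an individual counts as
                         non-average (assumed value)
    majority           : share that counts as "most" (assumed value)
    """
    if not prob_non_average:
        raise ValueError("need at least one individual")
    n_different = sum(1 for p in prob_non_average if p > member_cutoff)
    share_different = n_different / len(prob_non_average)
    # a) most individuals differ from average, b) most variance is explained
    return share_different > majority and variance_explained > majority


if __name__ == "__main__":
    # Hypothetical metric: three of four players look non-average.
    print(is_predictive([0.9, 0.8, 0.7, 0.2], variance_explained=0.6))
```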

This paper was co-authored with Blake McShane, James Piette, and Shane Jensen, and can be viewed as a more technical companion piece to our previous paper “A Point-Mass Mixture Random Effects Model for Pitching Metrics” which can be downloaded here. The poorly commented Python code for the MCMC sampler can be found here. If you’re interested in implementing or tweaking our methodology, feel free to send me an email or reach out on Twitter.


Hey Klout, Adding More Decimal Places Does Not Make Your Score More Accurate

Posted: October 26th, 2011 | Filed under: Klout, PeerIndex, Statistics

Klout has been hyping up their score changes for a week now. CEO Joe Fernandez has claimed that this makes the score more accurate, more transparent, and may cure some forms of cancer (well, maybe not the last claim). Let’s just say I haven’t been this disappointed since the 2000 election. Let’s start with their first claim: accuracy. See figure 1, my new score:

It’s exactly the same graphic as before, but with two decimal places. While my 8th grade Chemistry teacher may be glad that they are using more significant digits, I honestly don’t care. They were there before, just not displayed. Lame.

In their blog post, they claim: “This project represents the biggest step forward in accuracy, transparency and our technology in Klout’s history.” They support this vague claim with the histogram below, showing the differences in Klout scores, before and after the change:

This histogram leaves tons of open questions. Is this different from your normal daily shift in scores? The histogram reminds me of a t-distribution with a fatter positive tail. If more people are signing up for Klout than are leaving, that’s probably what it should look like anyway, as users hook up more networks and gradually become more active online. The graphic doesn’t show that your score is any better, just that it changed. That’s not impressive at all.

My beef with Klout remains simply that the service provides us with no real validation or explanation of our scores. They don’t show us how many times we have been RT’ed, mentioned, etc. On Google, you can look up your PageRank. On app stores, you can see your average rating and number of ratings. On Klout, you are told that your true reach has increased, but not told what that implies or how you can verify it.

Klout is still the social influence measurement leader, but with PeerIndex rapidly improving (and better in many ways, in my opinion), and new competitors such as Proskore and Kred popping up, Klout should be worried. I’ll have a review of both Proskore and Kred up shortly as well so you can easily compare them for yourself.


Honey Cake Recipe

Posted: October 18th, 2011 | Filed under: Uncategorized

In addition to hating on Klout and writing random code blog posts, I’ve decided to branch out a bit with my posts and write about one of my other passions: baking and cooking. I recently had a Rosh Hashanah (Jewish New Year) dinner party with my roommates and made several desserts. The traditional dessert for the holiday is a Honey Cake, for a “sweet” new year. I’ve tried about 5 recipes and this is my favorite. Pictures, ingredient list and directions can be found below.

Ingredients:

  • 3 1/2 cups all-purpose flour
  • 1 tablespoon baking powder
  • 1 teaspoon baking soda
  • 1/2 teaspoon kosher salt
  • 4 teaspoons ground cinnamon
  • 1/2 teaspoon ground cloves
  • 1/2 teaspoon ground allspice
  • 1 cup vegetable oil
  • 1 cup honey
  • 1 1/2 cups granulated sugar
  • 1/2 cup brown sugar
  • 3 large eggs at room temperature
  • 1 teaspoon vanilla extract
  • 1 cup coffee
  • 1/2 cup fresh orange juice
  • 1/4 cup bourbon

Preparation Instructions:

  1. In a large bowl, whisk together the flour, baking powder, baking soda, salt, cinnamon, cloves and allspice
  2. Make a well in the center, and add oil, honey, granulated sugar, brown sugar, eggs, vanilla, coffee, orange juice and bourbon
  3. Mix well and then portion into tins. Do not overbeat! If you are using an electric mixer, do so at a low setting.

This is a huge recipe and it fits into FIVE 7.75 inch x 3.75 inch x 2.75 inch loaf pans. I normally just pick up a pack of 6 disposable aluminum tins at the grocery store.

The batter should look like this once combined:

The cakes keep for up to 5 days and honestly are better after sitting for a day. To store them, just cover with foil or cling wrap in the tin. Enjoy!


Installing MACS (Markovian Coalescent Simulator) on OS X 10.7 Lion

Posted: October 15th, 2011 | Filed under: Uncategorized

After a few years’ break from my dissertation research on coalescent modeling of HIV sequences, I’ve decided to dive in again. I am interested in generating some sequences from coalescent models with various characteristics. Though ms and msHOT from the Hudson lab are what I’ve used previously, I decided to see what else is out there.

The paper describing the MACS methodology can be found here. The files are available here. In the extracted directory simply run “make all”.

I ran into two compilation errors:

g++ -Wall -g -I /Users/garychen/software/boost_1_36_0 -c algorithm.cpp
algorithm.cpp: In member function ‘void GraphBuilder::build()’:
algorithm.cpp:1272: error: ‘uint’ was not declared in this scope
algorithm.cpp:1272: error: expected `;’ before ‘iSegLength’
algorithm.cpp:1273: error: ‘iSegLength’ was not declared in this scope
make: *** [algorithm.o] Error 1

I changed uint to int on line 1272 and combined the cout on lines 1273 and 1274 into one line, which fixed the complaints. Then the build complained:

g++ -o macs simulator.o algorithm.o datastructures.o -static
ld: library not found for -lcrt0.o

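Simply comment out “LINKFLAGS = -static” in the Makefile and everything should happily compile. OS X does not ship the crt0.o needed for fully static binaries, so the -static flag cannot work there.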