A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics

Posted: October 27th, 2011 | Filed under: Python, R, Statistics

I’m happy to announce my most recent publication “A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics” in the Journal of Quantitative Analysis in Sports. Though this might sound boring unless you are a baseball fan and/or a Bayesian (and perhaps even then), the paper is fundamentally about how to choose which metrics are predictive, a topic anyone in statistics, analytics, or any other data-driven field should care deeply about.

I’ll try to motivate this in as general a setting as possible. Suppose you have some metric (batting average, earnings, engagement) for a population of individuals (baseball players, businesses, users of your product) over several time periods. A traditional random effects model estimates a separate intercept term for every individual, which assumes each individual differs from the overall mean. In some situations that assumption is unrealistic.

Often populations contain individuals who are indistinguishable from average, meaning it’s better to estimate their value with the overall mean rather than with their own data. This implies the metric is not predictive for that individual. By definition, those not in the previous group are systematically high or low relative to the average. Examples of the second group include Barry Bonds, who always hit more home runs and took more steroids than average, Warren Buffett and Berkshire Hathaway, who always made better investments than average, or my Google friends’ use of Facebook since the release of Google+, which is systematically lower than average. This is best visualized by two distributions, the black spike with all its probability at the overall mean (average individuals), and the red distribution with most of its probability far above or below this value (non-average individuals).
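To make the two-group idea concrete, here is a minimal sketch of how the probability of belonging to the non-average group could be computed for one individual. It is not the sampler from the paper. It assumes a normal distribution centered on the overall mean for the non-average group (a simplification of the red distribution described above), and it assumes the overall mean `mu`, the noise variance `sigma2`, the spread `tau2`, and the prior weight `prior_slab` are all known. Every name and number in it is made up for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def prob_non_average(ybar, n, mu, sigma2, tau2, prior_slab):
    """Posterior probability that an individual is in the non-average group.

    ybar is the individual's observed mean over n periods. Under the spike
    the true value equals mu, so ybar only varies through noise; under the
    (assumed normal) slab the true value has extra variance tau2.
    """
    noise_var = sigma2 / n
    spike = (1.0 - prior_slab) * normal_pdf(ybar, mu, noise_var)
    slab = prior_slab * normal_pdf(ybar, mu, noise_var + tau2)
    return slab / (spike + slab)


# Illustrative numbers only, not taken from the paper.
near_mean = prob_non_average(ybar=0.0, n=25, mu=0.0, sigma2=1.0, tau2=1.0, prior_slab=0.5)
far_away = prob_non_average(ybar=2.0, n=25, mu=0.0, sigma2=1.0, tau2=1.0, prior_slab=0.5)
print("near the mean: %.3f, far from the mean: %.3f" % (near_mean, far_away))
```

An individual whose observed mean sits right at the overall mean gets pulled toward the spike, while one far from it ends up almost certainly in the non-average group. In the real model these quantities are unknown and have to be sampled rather than plugged in.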

Once we find the probability that each player belongs to each of the two categories for each metric, we can say a metric is predictive if: a) most individuals are systematically different from the average and b) most of the metric’s variance is explained by the model. Finally, the obligatory plot showing our method performs at least as well on a holdout sample as other methods for the 50 metrics tested:

For those interested (which should be everyone but in practice is almost no one), our method also automatically controls for multiple testing, as we perform 1,575 tests in our analysis.

This paper was co-authored with Blake McShane, James Piette, and Shane Jensen, and can be viewed as a more technical companion piece to our previous paper “A Point-Mass Mixture Random Effects Model for Pitching Metrics” which can be downloaded here. The poorly commented Python code for the MCMC sampler can be found here. If you’re interested in implementing or tweaking our methodology, feel free to send me an email or reach out on Twitter.


A simple MySQLdb example python script

Posted: August 17th, 2011 | Filed under: MySQL, Python

I mostly stick to MongoDB nowadays, but every now and again I need to access data stored in a MySQL table. In my last post I talked about a MySQLdb error. This is a variant of the script which induced the error. It takes a .csv file with application ids piped to the script and joins them with price, category and name data from a db. This script uses the simplejson and MySQLdb packages.

#!/usr/bin/env python

import sys
import simplejson
import MySQLdb

def connect_db(host, port, user, password, db):
    # return a connection, or False if MySQL refuses it
    try:
        return MySQLdb.connect(host=host, port=port, user=user, passwd=password, db=db)
    except MySQLdb.Error as e:
        sys.stderr.write("[ERROR] %d: %s\n" % (e.args[0], e.args[1]))
        return False

def main():
    # the line below won't work for you unless you put in your working credentials
    # (ip, port, user, password and db are placeholders you need to define)
    # you didn't think I'd put working credentials on my blog did you?
    dbconn = connect_db(ip, port, user, password, db)
    if not dbconn:
        sys.exit(1)

    for line in sys.stdin:
        app_id = line.split(",")[0].strip()
        # let the driver quote the id instead of formatting it into the SQL
        sql = "SELECT info FROM apps WHERE id = %s"
        try:
            cursor = dbconn.cursor()
            cursor.execute(sql, (app_id,))
            result = cursor.fetchone()
        except MySQLdb.Error as e:
            sys.stderr.write("[ERROR] %d: %s\n" % (e.args[0], e.args[1]))
            continue
        if result is None:
            # no row for this id, so there is nothing to join
            continue

        data = simplejson.loads(result[0])
        # missing or empty fields are written out as "null"
        price = data.get("price") or "null"
        categories = data.get("categories") or "null"
        name = data.get("appName") or "null"
        print("%s,%s,%s,%s" % (name, line.strip(), price, categories))

if __name__ == "__main__":
    main()
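If you want to check the join logic without a database handy, the row-building step can be pulled out into a function and fed a JSON string directly. This is only a sketch: `format_row` and the sample record are names I made up for illustration, and it uses the standard library `json` module in place of simplejson so it runs anywhere.

```python
import json


def format_row(info_json, csv_line):
    """Build one output row from an app's JSON info and its input CSV line."""
    data = json.loads(info_json)
    # .get() returns None for a missing key, so absent and empty fields
    # both fall back to the string "null".
    price = data.get("price") or "null"
    categories = data.get("categories") or "null"
    name = data.get("appName") or "null"
    return "%s,%s,%s,%s" % (name, csv_line.strip(), price, categories)


# Hypothetical sample record, shaped like the fields the script reads.
sample_info = '{"appName": "Example App", "price": 0.99, "categories": ""}'
print(format_row(sample_info, "12345,2011-08-01\n"))
```

One thing this makes obvious: because the fallback tests for truthiness, a free app with a price of 0 is written out as "null" too.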


Library not loaded: libmysqlclient.18.dylib

Posted: August 12th, 2011 | Filed under: MySQL, Python

Last week I upgraded to Lion from Snow Leopard. While I love the subtle touches of the new OS and even the natural (inverted) scrolling, it seriously screwed with my existing Python packages and torched gcc until I installed Xcode 4. I have a Python script which uses the MySQLdb package. After a supposedly successful installation of MySQLdb, running my Python script yielded the error:

ImportError: dlopen(/Users/alex/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
Referenced from: /Users/alex/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so
Reason: image not found

For some reason the install pointed itself to the wrong place. Adding the following to your ~/.profile or ~/.bash_profile should fix the issue (assuming this is where your MySQL installation sits):

export DYLD_LIBRARY_PATH=/usr/local/mysql/lib:$DYLD_LIBRARY_PATH

Open up a new terminal and you should be good to go.

Update! This also fixes some Ruby 1.9 and Rails 3 installation issues on OS X Lion. Thanks to Mauro Morales (@noiz777) for finding this!
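If the error persists, it is worth checking that the library really lives in one of the directories on the path. Here is a small sketch that does that from Python. The `find_library` helper is my own name for illustration, and the file name is simply the one from the error message above.

```python
import os


def find_library(filename, search_path):
    """Return the first directory in a colon-separated path that holds filename."""
    for directory in search_path.split(":"):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None


# Read the variable the same way a new terminal session would see it.
search_path = os.environ.get("DYLD_LIBRARY_PATH", "")
location = find_library("libmysqlclient.18.dylib", search_path)
if location is None:
    print("libmysqlclient.18.dylib not found on DYLD_LIBRARY_PATH")
else:
    print("found libmysqlclient.18.dylib in %s" % location)
```

If it reports the library as missing, the path in the export line probably does not match where your MySQL installation sits.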


Pymongo: distinct items and an example map reduce on subset of db

Posted: June 7th, 2011 | Filed under: mongodb, pymongo, Python

I’ve been playing around with Pymongo for a few weeks now, and I’m slowly discovering quirks and differences in syntax relative to the mongodb shell. The two I’ll cover in this post are:

  1. Using distinct in pymongo
  2. map reduce example in pymongo on a queried subset of the db

Using distinct in pymongo

Let’s say you want to group users by some sort of id on a day (I’ll use May 21 as an example). From the mongodb shell this command is simply:

db.raw_data.distinct("id", {"_date": "2011-05-21"})

Running this command from within a Python file yields the following error:

File "test.py", line 13, in <module>
foo = db.raw_data.distinct("id", {"_date": "2011-05-21"})
TypeError: distinct() takes exactly 2 arguments (3 given)

It turns out Pymongo makes you do the find and then distinct the records:

foo = db.raw_data.find({"_date": "2011-05-21"}).distinct("id")

This is exactly what the mongodb shell interpreter is doing; it’s just annoying that the syntax is different.
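If the find-then-distinct order feels backwards, it may help to see the same two steps on a plain Python list. This is not Pymongo code, just a sketch of the semantics: the `find` and `distinct` functions and the sample documents are stand-ins I made up, and only exact-match queries are handled.

```python
def find(records, query):
    """Return the records whose fields equal every key/value pair in query."""
    return [r for r in records if all(r.get(k) == v for k, v in query.items())]


def distinct(records, key):
    """Return the unique values of key, in first-seen order."""
    seen = []
    for r in records:
        if key in r and r[key] not in seen:
            seen.append(r[key])
    return seen


# Hypothetical documents standing in for db.raw_data.
raw_data = [
    {"id": "a", "_date": "2011-05-21"},
    {"id": "b", "_date": "2011-05-21"},
    {"id": "a", "_date": "2011-05-21"},
    {"id": "c", "_date": "2011-05-22"},
]

# Filter first, then collect the unique ids from what is left.
foo = distinct(find(raw_data, {"_date": "2011-05-21"}), "id")
print(foo)
```

The filter narrows the records down to the day you care about, and distinct only ever looks at that subset.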

map reduce example in pymongo on a queried subset of the db

Following the example from my previous post, you simply add query = {...} with your filter and out = with the name of the collection in which you want your results to end up. Most of the examples I found on Stack Overflow or personal blogs were wrong or tried to pass these parameters in together. I tried roughly 865868 variations, and what I have below is the only combination that worked.

#!/usr/bin/env python
from bson.code import Code # for some this needs to be pymongo.bson
from pymongo import Connection

# code for example map/reduce
db = Connection().map_reduce_example
map = Code("function () {"
           "emit(this.id, 1)"
           "}")
reduce = Code("function (key, values) {"
              " var total = 0;"
              " for (var i = 0; i < values.length; i++) {"
              " total += values[i];"
              " }"
              " return total;"
              "}")

# code without query
result = db.things.map_reduce(map, reduce, "map_reduce_example")

# code with simple query
result = db.raw_data.map_reduce(map, reduce, out = "map_reduce_example", query = {"date": "2011-05-21"})

# code with query that grabs all records from May 2011
result = db.raw_data.map_reduce(map, reduce, out = "map_reduce_example2", query = {"date": {"$regex": "^2011-05"}})
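To see what the map and reduce functions above compute on a queried subset, here is a rough pure-Python imitation that needs no database. It is a sketch of the idea only, not how MongoDB runs the job. The `map_reduce`, `map_fn` and `reduce_fn` names and the sample documents are made up, and `query` is a Python predicate rather than a MongoDB query document.

```python
import re
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn, query=None):
    """Run map_fn over records that pass query, then reduce_fn per key."""
    emitted = defaultdict(list)
    for record in records:
        if query is None or query(record):
            for key, value in map_fn(record):
                emitted[key].append(value)
    return dict((key, reduce_fn(key, values)) for key, values in emitted.items())


def map_fn(record):
    # Same idea as emit(this.id, 1): one count per record, keyed by id.
    yield record["id"], 1


def reduce_fn(key, values):
    # Same idea as the JavaScript loop: add up the emitted counts.
    return sum(values)


# Hypothetical documents standing in for db.raw_data.
raw_data = [
    {"id": "a", "date": "2011-05-21"},
    {"id": "a", "date": "2011-05-30"},
    {"id": "b", "date": "2011-05-21"},
    {"id": "b", "date": "2011-06-01"},
]

# Counterpart of query = {"date": "2011-05-21"}
one_day = map_reduce(raw_data, map_fn, reduce_fn,
                     query=lambda r: r["date"] == "2011-05-21")
# Counterpart of query = {"date": {"$regex": "^2011-05"}}
all_may = map_reduce(raw_data, map_fn, reduce_fn,
                     query=lambda r: re.search("^2011-05", r["date"]) is not None)
print(one_day)
print(all_may)
```

The query only decides which records reach the map step; the map and reduce functions themselves stay the same for every variant.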

This may seem like a silly and simple blog post to write, but this wasn’t documented anywhere else online, so I wanted to save 5-10 minutes for anyone else who hits the distinct error or is trying to run a map reduce with a query.


Bayesian computation is so hot right now!

Posted: March 23rd, 2011 | Filed under: Python, Statistics

For everyone really into baseball AND Bayes’ Theorem, this post is for you. I finally got around to posting the Python code implementing the algorithm described in my paper, A Point-Mass Mixture Random Effects Model for Pitching Metrics. Sabermetrics, the study of statistical patterns in baseball, is a huge mess. Everyone is proposing new metrics and those evaluating them usually don’t have the credentials or skills to do so correctly. In this paper, we take a statistically rigorous approach and argue that metrics must (i) have a large fraction of players which are different from the league average and (ii) give high confidence about which players are not league average. We of course rigorously define these requirements within the paper.

The .tar.gz contains 5 files:

  1. bb-main.py – the main function that runs the sampler
  2. bb.py – class and function definitions
  3. BABIP.csv – a data file for the BABIP (batting average on balls in play) metric
  4. pitching_column_info.csv – index file needed because we were doing all runs concurrently on the Wharton Grid
  5. BABIP.sh – shell script for running the sampler under some default parameters for BABIP

A companion manuscript, A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics, is currently under review.