I’m happy to announce my most recent publication, “A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics,” in the Journal of Quantitative Analysis in Sports. Though this might sound boring unless you are a baseball fan and/or a Bayesian (and perhaps even then), the paper is fundamentally about how to choose which metrics are predictive, a topic anyone in statistics, analytics, or any other data-driven field should care deeply about.
I’ll try to motivate this in as general a setting as possible. Suppose you have some metric (batting average, earnings, engagement) for a population of individuals (baseball players, businesses, users of your product) over several time periods. A traditional random effects model estimates an intercept term for every individual, but in some situations the assumption that each individual merits a separate estimate is unrealistic.
Often populations contain individuals who are indistinguishable from average, meaning it's better to estimate their value with the overall mean rather than with their own data. For such individuals, the metric is not predictive. By definition, those not in the first group are systematically high or low relative to the average. Examples of this second group include Barry Bonds, who always hit more home runs and took more steroids than average; Warren Buffett and Berkshire Hathaway, who always made better investments than average; or my Google friends' use of Facebook since the release of Google+, which is systematically lower than average. This is best visualized by two distributions: the black spike with all its probability at the overall mean (average individuals), and the red distribution with most of its probability far above or below this value (non-average individuals).
Once we estimate the probability that each individual belongs to each of the two categories for each metric, we can call a metric predictive if: a) most individuals are systematically different from the average, and b) most of the metric's variance is explained by the model. Finally, the obligatory plot showing that our method performs at least as well on a holdout sample as other methods for the 50 metrics tested:
For those interested (which should be everyone but in practice is almost no one), our method also automatically controls for multiple testing, as we perform 1,575 tests in our analysis.
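To make the spike-versus-slab membership probability concrete, here is a toy sketch. This is not the paper's actual sampler; the function name and parameter values are mine, and I've collapsed each individual's data to its sample mean, whose distribution differs between the two components:

```python
import math

def posterior_prob_nonaverage(ybar, n_obs, mu0=0.0, sigma2=1.0,
                              tau2=0.25, prior_nonavg=0.5):
    """Posterior probability that an individual belongs to the 'slab'
    (systematically different from mu0) rather than the point mass at mu0.

    Under the spike, the sample mean ybar ~ N(mu0, sigma2 / n_obs);
    under the slab, ybar ~ N(mu0, sigma2 / n_obs + tau2).
    """
    def normal_pdf(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    like_spike = normal_pdf(ybar, mu0, sigma2 / n_obs)
    like_slab = normal_pdf(ybar, mu0, sigma2 / n_obs + tau2)
    num = prior_nonavg * like_slab
    return num / (num + (1 - prior_nonavg) * like_spike)

# An individual whose mean sits a full unit above average over 20 periods
# is almost surely non-average; one sitting exactly at the mean is not.
print(posterior_prob_nonaverage(1.0, 20))   # close to 1
print(posterior_prob_nonaverage(0.0, 20))   # well below 0.5
```

Note that an individual sitting exactly at the overall mean still gets a nonzero slab probability, since the slab also has density there; that is why criterion a) above looks at whether *most* individuals are systematically different.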
This paper was co-authored with Blake McShane, James Piette, and Shane Jensen, and can be viewed as a more technical companion piece to our previous paper “A Point-Mass Mixture Random Effects Model for Pitching Metrics” which can be downloaded here. The poorly commented python code for the MCMC sampler can be found here. If you’re interested in implementing or tweaking our methodology, feel free to send me an email or reach out on Twitter.
Klout has been hyping up their score changes for a week now. CEO Joe Fernandez has claimed that the changes make the score more accurate, more transparent, and may cure some forms of cancer (well, maybe not that last one). Let's just say I haven't been this disappointed since the 2000 election. Let's start with their first claim, accuracy. See Figure 1, my new score:
It’s exactly the same graphic as before, but with two decimal places. While my 8th grade Chemistry teacher may be glad that they are using more significant digits, I honestly don’t care. They were there before, just not displayed. Lame.
In their blog post, they claim: “This project represents the biggest step forward in accuracy, transparency and our technology in Klout’s history.” They support this vague claim with the histogram below, showing the differences in Klout scores, before and after the change:
This histogram leaves tons of open questions. Is this different from the normal daily shift in scores? The histogram reminds me of a t-distribution with a fatter positive tail. If more people are signing up for Klout than are leaving, that's probably what it should look like anyway, as users hook up more networks and gradually become more active online. The graphic doesn't show that your score is any better, just that it changed. That's not impressive at all.
My beef with Klout remains simply that the service provides no real validation or explanation of our scores. They don't show us how many times we have been RT'ed, mentioned, etc. On Google, you can look up your PageRank; on app stores, you can see your average rating and number of ratings; on Klout, you are told that your true reach has increased, but not what that implies or how you can verify it.
Klout is still the social influence measurement leader, but with Peerindex rapidly improving (and better in many ways in my opinion), and new competitors such as Proskore and Kred popping up, Klout should be worried. I’ll have a review of both Proskore and Kred up shortly as well so you can easily compare them for yourself.
In addition to hating on Klout and writing random code blog posts, I’ve decided to branch out a bit with my posts and write about one of my other passions: baking and cooking. I recently had a Rosh Hashanah (Jewish New Year) dinner party with my roommates and made several desserts. The traditional dessert for the holiday is a Honey Cake, for a “sweet” new year. I’ve tried about 5 recipes and this is my favorite. Pictures, ingredient list and directions can be found below.
3 1/2 cups all-purpose flour
1 tablespoon baking powder
1 teaspoon baking soda
1/2 teaspoon kosher salt
4 teaspoons ground cinnamon
1/2 teaspoon ground cloves
1/2 teaspoon ground allspice
1 cup vegetable oil
1 cup honey
1 1/2 cups granulated sugar
1/2 cup brown sugar
3 large eggs at room temperature
1 teaspoon vanilla extract
1 cup coffee
1/2 cup fresh orange juice
1/4 cup bourbon
In a large bowl, whisk together the flour, baking powder, baking soda, salt, cinnamon, cloves and allspice
Make a well in the center, and add oil, honey, granulated sugar, brown sugar, eggs, vanilla, coffee, orange juice and bourbon
Mix well and then portion into tins. Do not overbeat! If you are using an electric mixer, do so at a low setting.
This is a huge recipe: it fills FIVE 7.75 x 3.75 x 2.75 inch loaf pans. I normally just pick up a pack of 6 disposable aluminum tins at the grocery store.
The batter should look like this once combined:
The cakes keep for up to 5 days and honestly are better after sitting for a day; just cover with foil or cling wrap in the tin to store. Enjoy!
After a few years' break from my dissertation research on coalescent modeling of HIV sequences, I've decided to dive in again. I am interested in generating sequences from coalescent models with various characteristics. Though ms and msHOT from the Hudson lab are what I've used previously, I decided to see what else is out there.
The paper describing the methodology behind macs (the simulator I settled on) can be found here. The files are available here. In the extracted directory, simply run "make all".
I ran into two compilation errors:
g++ -Wall -g -I /Users/garychen/software/boost_1_36_0 -c algorithm.cpp
algorithm.cpp: In member function 'void GraphBuilder::build()':
algorithm.cpp:1272: error: 'uint' was not declared in this scope
algorithm.cpp:1272: error: expected ';' before 'iSegLength'
algorithm.cpp:1273: error: 'iSegLength' was not declared in this scope
make: *** [algorithm.o] Error 1
I changed uint to int on line 1272 and combined the cout statements on lines 1273 and 1274 into one line, which fixed those complaints. Then the build complained:
g++ -o macs simulator.o algorithm.o datastructures.o -static
ld: library not found for -lcrt0.o
Simply comment out “LINKFLAGS = -static” and everything should happily compile.
I was recently visiting Washington DC to see some friends from college and had a notary emergency. It was late on a Saturday, but My DC Notary came through. I've had to notarize a few things in DC, and this was by far the best experience. Before this I didn't even know that mobile 24/7 notary services existed. If you need a notary in DC, you should absolutely use this service. They also have a second service in Virginia, specifically Alexandria and Arlington; if you need a notary in either of those towns, you should use them as well.
While I hope I never have another notarizing emergency, I wanted to write them up because I know they happen often and this was as stress free as any emergency notary experience can be. The Yelp page of Mydcnotary is here.
Today Chomp released its monthly App Search Analytics Report. Along with the standard set of analytics, we dug into search traffic for the queries hurricane and earthquake. Both saw huge temporal spikes in traffic, which are outlined in the highlights below:
Query traffic for the term hurricane spiked >2000%, and Hurricane Irene's formation, warning, and landfall were all correlated with movement in search traffic.
A 250% spike was seen during the formation of Tropical Storm Emily.
Search traffic for earthquake apps was up 2000% immediately after the 5.8 magnitude event in Virginia.
Paid downloads dropped by 4% on Android, the first drop after three months of consecutive gains.
After taking a few weeks off from reaming Klout, their newest “improvements” have left me with no choice but to write a sardonic and snarky response. Klout has added 5 new services (Instagram, Flickr, tumblr, Last.fm, and Blogger) and removed ANY secondary statistics from our profile pages. I’m still not sure which is worse, just that both are stupid. I’ll start by criticizing the addition of new services with a simulated conversation between Klout and myself.
Part 1: A conversation with Klout about their new signals
Alex: This brings the total services to 10. Really Klout, you need 10 services?
Klout: Of course, this will help make your Klout score even better!
Alex: But you didn't do a good job with just Twitter and Facebook, how can I expect you to do a good job with 10?
Klout: More data always improves the performance of complicated, black box, machine learning algorithms like our own.
Alex: That's actually false.
Klout: Ummmm, look dude, I'm just a data whore and want to sell your data to the man.
Alex: So you just want all of my data to sell it to the man and give me nothing in return?
Klout: We actually have a terrific Klout perks program. I see you've received two Klout perks.
Alex: Yup, you sent me a Rizzoli and Isles gift pack, a TV show on a network I don't have and literally hadn't heard of before receiving the gift pack. Did I mention that the gift pack came with handcuff earrings?
Klout: But what about your other Klout perk, a party at Rolo, a store in SF that sells jeans? Careful analysis of your Twitter, Facebook, Foursquare and LinkedIn data led us to believe that you like or wear jeans.
Alex: Everyone wears jeans. That's similar to predicting that I like to go on vacation or eat tasty food. These jeans happened to be $175, which doesn't sound like much of a perk to me.
On top of this, Android users actually can't even connect their Klout accounts to Instagram because the app is iPhone-only. Ironically, the Klout blog just posted about the average Klout scores of iPhone and Android users, finding the former beat the latter 42.0 to 40.6. Perhaps the comparison would be more equal if Android users were allowed to connect 10 services rather than 9? Does MG Siegler actually need more Klout?
Part 2: Klout removes any accountability from website
Finally, let's discuss the complete lack of transparency imposed by their recent banishment of the majority of profile stats. Here is a screenshot of my "Network Influence" before:
You will notice that the supporting stats are gone. Though this absence makes it much harder for me to criticize the inconsistencies in their score, it also takes away most of the utility I received from Klout. Most people don't have their own Twitter analytics, so access to this info was one of the reasons Klout was cool. It indulges my latent nerd narcissism. How many RTs do I get? How many @ mentions? How many unique ones? Now I just get a number with little explanation. Luckily, Klout competitor PeerIndex still has much of that info:
From Klout’s point of view, I completely understand why they would want to add more services: greater reach, more data, more partners, etc. I suppose they could justify the removal of more specific stats by saying that things could get too crowded on the main page, but then put the data on another page, don’t take it away. Twitter and Facebook still drive the large majority of usage. Do you really think Blogger cares if their stats aren’t on the main page? Seems nefarious to me.
I like to develop for mongodb on my local machine to make sure everything is fast and won’t nuke one of our production dbs. To do this I often use the mongodump command to grab a collection and then load it locally so I can work with a snapshot of production data. Here is a problem I ran into recently that I thought would be worth blogging about. First I dumped the collection rob_job_metric:
mongodump --db data --collection rob_job_metric --out - > rob_job_metric.mongodump
After moving it to my local machine, I tried mongorestore:
alex@Alexander-Braunsteins-iMac-2:scripts> ~/Downloads/mongodb-osx-x86_64-1.8.1/bin/mongorestore ~/Desktop/scripts/rob_job_metric.mongodump
connected to: ###########
don’t know what to do with [/Users/alex/Desktop/scripts/rob_job_metric.mongodump]
Really mongodb? I’m pretty sure you know what to do with this file as you just created it. After messing around a bit I discovered that mongorestore requires the file to end with .bson, even though the file was already in that format, just not named so:
alex@Alexander-Braunsteins-iMac-2:scripts> mv rob_job_metric.mongodump rob_job_metric.mongodump.bson
alex@Alexander-Braunsteins-iMac-2:scripts> ~/Downloads/mongodb-osx-x86_64-1.8.1/bin/mongorestore ~/Desktop/scripts/rob_job_metric.mongodump.bson
connected to: ###########
Wed Aug 17 10:29:29 /Users/alex/Desktop/scripts/rob_job_metric.mongodump.bson
Wed Aug 17 10:29:29 going into namespace [scripts.rob_job_metric.mongodump]
Wed Aug 17 10:29:32 113872 objects found
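Since the only fix needed was the file extension, a tiny helper can rename mongodump output in bulk so mongorestore stops complaining. This is my own hypothetical snippet, assuming all your dumps sit in one directory:

```python
import glob
import os

def bsonify(directory):
    """Rename every *.mongodump file in `directory` to *.mongodump.bson,
    since mongorestore keys off the .bson extension. Returns the new paths."""
    renamed = []
    for path in glob.glob(os.path.join(directory, "*.mongodump")):
        new_path = path + ".bson"
        os.rename(path, new_path)
        renamed.append(new_path)
    return sorted(renamed)
```

After running it, point mongorestore at each renamed file exactly as above.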
I mostly stick to mongodb nowadays, but every now and again I need to access data stored in a MySQL table. In my last post I talked about a MySQLdb error; this is a variant of the script which induced that error. It takes a .csv file with application ids piped in on stdin and joins them with price, category, and name data from a db. The script uses the simplejson and MySQLdb modules (connect_db and the connection parameters are defined elsewhere in the full script):

import sys
import simplejson
import MySQLdb

def main():
    # the line below won't work for you unless you put in your working credentials
    # you didn't think I'd put working credentials on my blog did you?
    dbconn = connect_db(ip, port, user, password, db)

    for line in sys.stdin.readlines():
        # the app id is the first field of each csv line
        app_id = line.strip().split(",")[0]
        sql = "SELECT info FROM apps WHERE id = '%s'" % app_id
        try:
            cursor = dbconn.cursor()
            cursor.execute(sql)
            result = cursor.fetchone()
        except MySQLdb.Error, e:
            sys.stderr.write("[ERROR] %d: %s\n" % (e.args[0], e.args[1]))
            continue

        # the info column holds a json blob with the fields we want
        data = simplejson.loads(result[0])
        price = data["price"] if data["price"] else "null"
        categories = data["categories"] if data["categories"] else "null"
        name = data["appName"] if data["appName"] else "null"
        print "%s,%s,%s,%s" % (name, line.strip(), price, categories)
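One hazard in the script above is interpolating the app id straight into the SQL string. Here is the same lookup sketched with parameterized queries so the driver handles quoting. I've used sqlite3 (stdlib, so anyone can run it) rather than MySQLdb; the table name and JSON keys mirror the script, and the ? placeholder is sqlite's style where MySQLdb would use %s:

```python
import json
import sqlite3

def lookup_apps(conn, app_ids):
    """Join app ids against the apps table, returning
    (name, id, price, categories) tuples, 'null' where a field is missing."""
    rows = []
    cur = conn.cursor()
    for app_id in app_ids:
        # parameterized query: the driver handles quoting and escaping
        cur.execute("SELECT info FROM apps WHERE id = ?", (app_id,))
        result = cur.fetchone()
        if result is None:
            continue
        data = json.loads(result[0])
        rows.append((data.get("appName") or "null",
                     app_id,
                     data.get("price") or "null",
                     data.get("categories") or "null"))
    return rows
```

Ids that match no row are simply skipped, which matches the original script's behavior of continuing past errors.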