Reading big files in R

Posted: April 12th, 2012 | Filed under: R, Statistics

As the lone statistician in my workplace, I end up introducing many people to R. After the inevitable pirate jokes, my coworkers who program in real languages (C++, Java, Python, PHP, etc.) ultimately end up complaining about R, which does a couple of things very well and a lot of things VERY poorly. Each has complained about data tables and reading data into R.

For those who don’t know, read.csv does not return a plain array or list; it returns a data frame, which takes up far more memory than it should and (by default in R versions before 4.0.0) converts all strings to factors for easier use in regression. This is dumb, and in my experience it can multiply both the time it takes to read the file and the memory it takes up by 5x-10x. For a quick fix, set stringsAsFactors=FALSE. If you don’t have column headings, which I normally don’t, set header=FALSE as well:

data = read.csv("datafile.csv", header=FALSE, stringsAsFactors=FALSE)
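
To see the difference for yourself, here is a minimal sketch that writes a throwaway file and reads it both ways. The file contents and variable names are invented for the example; only read.csv and its header, stringsAsFactors, and colClasses arguments are real.

```r
# Write a small headerless CSV to a temporary file (hypothetical data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("alice,1.5", "bob,2.5", "alice,3.5"), tmp)

# Strings kept as plain character vectors.
as_chars <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)

# Strings converted to factors (the default before R 4.0.0).
as_factors <- read.csv(tmp, header = FALSE, stringsAsFactors = TRUE)

# With header = FALSE, R names the columns V1, V2, ...
str(as_chars)
str(as_factors)

# If you already know the column types, colClasses skips type guessing.
typed <- read.csv(tmp, header = FALSE,
                  colClasses = c("character", "numeric"))
unlink(tmp)
```

On a three-row file the difference is invisible, so treat this as a way to check column types rather than as a benchmark.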


Chomp App Search Analytics Year End Summary 2011

Posted: February 7th, 2012 | Filed under: apps, Chomp, Google, Statistics

My most recent App Search Analytics report from Chomp was written up in TechCrunch. Sarah Perez wrote a fantastic summary in her article, Games Decreasing In Popularity On Android, Entertainment Apps On The Rise, but I wanted to emphasize the most interesting points.

First, as implied by the title of the article, games are decreasing in popularity on Android as a share of total downloads, while that same share is increasing on iOS. In December, games were 36.1% of iTunes downloads and 22% of Android downloads.

Next, I wanted to tackle some misconceptions about app pricing on the two platforms. Paid apps are an almost negligible share of downloads on Android (where they hover around 3-4%). Consequently, the average “app purchase price” (shown below) is quite low compared to iOS, where the proportion of paid app downloads is between 6 and 10 times as high.

The above plot is misleading because it hides two important facts:

  • $0.99 apps are a VERY large proportion of iOS app downloads
  • a relatively larger proportion of paid app downloads on Android are at “premium” price points, because Android has relatively few apps priced below $1

As a result, average app price, conditional on non-free apps, is actually higher on Android.
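
To make the arithmetic concrete, here is a small sketch in R. The download counts are invented for illustration and are NOT Chomp data; they are only chosen so that paid apps make up 30% of downloads on one platform and 4% on the other.

```r
# Hypothetical download counts by price point; NOT real Chomp data.
prices <- c(0, 0.99, 2.99, 4.99)

ios     <- c(700, 240, 40, 20)   # 30% paid, mostly $0.99
android <- c(960,  10, 15, 15)   # 4% paid, mostly above $1

# Average price over all downloads, free ones included.
avg_all <- function(n) sum(prices * n) / sum(n)

# Average price over paid downloads only.
avg_paid <- function(n) {
  paid <- prices > 0
  sum(prices[paid] * n[paid]) / sum(n[paid])
}

round(c(ios = avg_all(ios),  android = avg_all(android)),  3)
round(c(ios = avg_paid(ios), android = avg_paid(android)), 3)
```

With these made-up counts the overall average is about $0.46 on iOS versus $0.13 on Android, while the paid-only average is about $1.52 on iOS versus $3.24 on Android. Same data, opposite conclusions, depending on which average you report.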

The point of this article isn’t to steer developers of apps (premium or otherwise) to or away from either platform, each of which has its strengths. You can read up on monetization of platforms here and here (one article is very pro-iOS, the other very pro-Android). Rather, I wanted to reinforce a basic lesson from Stat 101: averages can be very misleading.


Quotes from Alex on Social Media and the Super Bowl

Posted: February 7th, 2012 | Filed under: shameless self promotion

I had the privilege of being interviewed for an article titled Sports are going social, and it’s a winning combination. It’s not really super relevant to anything else on my blog, but I have a quote in there about why sports generate so much intense interest/conversation:

“We endear ourselves to the teams and even use terms such as ‘we’ and ‘us’ when talking about their most recent triumph or failure. It’s natural that something which creates such a deep emotional reaction yields so much conversation both in person and in social networks and social apps.”

and another one on why athletes and particularly fans engage with social networks:

“Rather than simply reading about players online, collecting cards, or wearing their jerseys, fans can follow their favorite players on Twitter, subscribe to their Facebook feed, or even see where they check-in on Foursquare, I see this trend not only continuing, but accelerating in the coming year as social media continues to become more ubiquitous.”

Just thought it was cool and would make my mom proud.


November App Search Analytics

Posted: December 6th, 2011 | Filed under: apps, Chomp, Statistics

The November Chomp App Search Analytics report is out! The official Chomp blog post can be found here. The quick summary is:

  • search traffic for the terms shopping and discounts spiked more than 1300% and 3000%, respectively, on Black Friday and returned to normal on Cyber Monday
  • I highlight tons of great Holiday Apps and Games to keep you busy through the season
  • paid downloads were up 7% on iPhone

More details can be found in the report below:

Chomp Charts November


Legends and Dates in R plots

Posted: December 5th, 2011 | Filed under: Uncategorized

After looking up how to create a legend using ?legend and searching the R forums for the 86586586th time, I’ve decided to write my own post with a few examples and tricks I’ve picked up. I also provide example code for using dates as an x-axis.

I do most of my heavy computation in Python, leaving R primarily for making pretty plots, exploratory data analysis (EDA) when I first get my hands on a data set, and using my favorite R packages/functions that I’ll never implement on my own (e.g. Random Forests, CART, SVM). Below is an example plot, with two sets of numbers, a legend, and dates on the axis. Hopefully this is more helpful than the R documentation.

# pick a length, and generate two random normals of this length
len = 43
vals = rnorm(len,0,1)
vals2 = rnorm(len,0,.5)

# pick an initial date in form YYYYMMDD, then generate a year's worth of weekly dates
date = 20110201
mydates<-as.Date(as.character(date),"%Y%m%d")
for(i in 1:52){
  # append the date one week (7 days) after the last one
  mydates = c(mydates,mydates[length(mydates)]+7)
}
# keep len of these dates (skipping the initial one) under a shorter name
x=mydates[2:(len+1)]

# set graphical parameters and plot both random normals, use xaxt="n" to eliminate the x-axis
par(mfrow=c(1,1))
plot(vals,type="l",col="blue",xaxt="n",ylab="y axis label",xlab="",main="Plot Title")
lines(vals2,col="red")
# add back an x-axis with the dates as labels, las and cex.axis set direction and size of dates
axis(1, at=1:len, labels=x, las=2, cex.axis=.9)
# add a legend, lwd sets line width, you can use x,y coordinates instead of "bottomleft"
legend("bottomleft", c("thing1","thing2"), col = c("blue", "red"), lwd = 1, title="legend title")
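
As a variation on the same plot, here is a sketch that puts the dates on the x-axis directly instead of plotting against 1:len, and places the legend by x,y coordinates. It is one way to do it under my own choices (a fixed seed, a tick every fourth week, the "%b %d" label format), not the only way.

```r
set.seed(1)                      # reproducible random draws
len   <- 43
vals  <- rnorm(len, 0, 1)
vals2 <- rnorm(len, 0, 0.5)

# len weekly dates, starting one week after 2011-02-01
x <- seq(as.Date("2011-02-08"), by = "week", length.out = len)

# plot against the dates themselves, still suppressing the default x-axis
plot(x, vals, type = "l", col = "blue", xaxt = "n",
     ylim = range(vals, vals2),
     ylab = "y axis label", xlab = "", main = "Plot Title")
lines(x, vals2, col = "red")

# label every fourth week; format controls how the dates are printed
ticks <- x[seq(1, len, by = 4)]
axis.Date(1, at = ticks, format = "%b %d", las = 2, cex.axis = 0.9)

# x,y give the top-left corner of the legend; dates are stored as day counts,
# so convert the date to a number first
legend(x = as.numeric(x[1]), y = max(vals, vals2),
       legend = c("thing1", "thing2"),
       col = c("blue", "red"), lwd = 1, title = "legend title")
```

Using seq with by = "week" replaces the loop, and plotting against real dates keeps the spacing correct if your observations are not evenly spaced.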


Best Noodle Kugel Recipe

Posted: December 4th, 2011 | Filed under: recipes

I’m still powering through all the recipe posts I wanted to do after Rosh Hashanah, so here’s the latest, the best noodle kugel recipe ever. It’s my grandmother’s, so I know that slightly biases me, but I’ve had at least 10 people tell me it’s better than their grandmothers’. First here’s a pic of all the ingredients:

The full ingredient list is:

  • 1 pkg egg noodles (cook, drain, put in a large pot, then add the rest)
  • 1/2 cup sugar
  • 1 cup sour cream
  • 1 container cottage cheese
  • 3 eggs beaten
  • 1 sm can crushed pineapple with juice
  • 1 stick melted margarine (butter)
  • 1 apple grated (thick)
  • 1 teaspoon vanilla
  • 1.5 teaspoons cinnamon

Beat eggs then add all other ingredients, cottage cheese last. When fully mixed the ingredients should smell like kugel and have roughly this consistency:

Mix well, put in a large baking pan, sprinkle with crushed cornflake crumbs, and dot with butter. Bake 1 hour at 350°F and it should look like this:

Last, but certainly not least, here’s a pic of my grandmother (right) and me, since it’s her recipe:


A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics

Posted: October 27th, 2011 | Filed under: Python, R, Statistics

I’m happy to announce my most recent publication “A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics” in the Journal of Quantitative Analysis in Sports. Though this might sound boring unless you are a baseball fan and/or a Bayesian (and perhaps even then), the paper is fundamentally about how to choose which metrics are predictive, a topic anyone in statistics, analytics, or any other data driven field should care deeply about.

I’ll try to motivate this in as general a setting as possible. Suppose you have some metric (batting average, earnings, engagement) for a population of individuals (baseball players, businesses, users of your product) over several time periods. A traditional random effects model estimates an intercept term for every individual, which assumes that every individual truly differs from the average. In some situations that assumption is unrealistic.

Often populations contain individuals who are indistinguishable from average, meaning it’s better to estimate their value with the overall mean rather than with their own data. This implies the metric is not predictive for that player. By definition, those not in the previous group are systematically high or low relative to the average. Examples of the second group include Barry Bonds, who always hit more home runs and took more steroids than average, Warren Buffett and Berkshire Hathaway, who always made better investments than average, or my Google friends’ use of Facebook since the release of Google+, which is systematically lower than average. This is best visualized by two distributions, the black spike with all its probability at the overall mean (average individuals), and the red distribution with most of its probability far above or below this value (non-average individuals).
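
To get a feel for the two groups, here is a toy simulation in R. It is only a sketch of the spike-plus-spread idea, not the model or the sampler from the paper: every number is made up, and for simplicity the non-average group is drawn from a normal distribution centered on the overall mean rather than from the red distribution described above.

```r
set.seed(42)

n_players <- 1000
mu        <- 0.260   # overall mean of the metric (hypothetical)
p_nonavg  <- 0.3     # share of individuals who truly differ from average
tau       <- 0.03    # spread of the non-average individuals
sigma     <- 0.02    # period-to-period noise

# z = 1 marks a non-average individual; z = 0 sits exactly on the overall mean
z     <- rbinom(n_players, 1, p_nonavg)
theta <- mu + z * rnorm(n_players, 0, tau)

# two noisy time periods observed for every individual
period1 <- theta + rnorm(n_players, 0, sigma)
period2 <- theta + rnorm(n_players, 0, sigma)

# period 1 only carries information about period 2 for the non-average group
cor(period1[z == 1], period2[z == 1])
cor(period1[z == 0], period2[z == 0])
```

The first correlation should come out clearly positive and the second close to zero: for the average group, last period’s number is pure noise, so the overall mean is the better guess.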

Once we find the probability that each player belongs to each of the two categories for each metric, we can call a metric predictive if: a) most individuals are systematically different from the average and b) most of the metric’s variance is explained by the model. Finally, the obligatory plot showing our method performs at least as well on a holdout sample as other methods for the 50 metrics tested:

For those interested (which should be everyone but in practice is almost no one), our method also automatically controls for multiple testing, as we perform 1,575 tests in our analysis.

This paper was co-authored with Blake McShane, James Piette, and Shane Jensen, and can be viewed as a more technical companion piece to our previous paper “A Point-Mass Mixture Random Effects Model for Pitching Metrics” which can be downloaded here. The poorly commented Python code for the MCMC sampler can be found here. If you’re interested in implementing or tweaking our methodology, feel free to send me an email or reach out on Twitter.


Hey Klout, Adding More Decimal Places Does Not Make Your Score More Accurate

Posted: October 26th, 2011 | Filed under: Klout, PeerIndex, Statistics

Klout has been hyping up their score changes for a week now. The CEO Joe Fernandez has claimed that this makes the score more accurate, more transparent, and may cure some forms of cancer (well maybe not the last claim). Let’s just say I haven’t been this disappointed since the 2000 election. Let’s start with their first claim: accuracy. See figure 1, my new score:

It’s exactly the same graphic as before, but with two decimal places. While my 8th grade Chemistry teacher may be glad that they are using more significant digits, I honestly don’t care. They were there before, just not displayed. Lame.

In their blog post, they claim: “This project represents the biggest step forward in accuracy, transparency and our technology in Klout’s history.” They support this vague claim with the histogram below, showing the differences in Klout scores, before and after the change:

This histogram leaves tons of open questions. Is this different than your normal daily shift in scores? The histogram reminds me of a t-distribution with a fatter positive tail. If more people are signing up for Klout than are leaving, that’s probably what it should look like anyway as users hook up more networks and gradually become more active online. The graphic doesn’t show that your score is any better, just that it changed. That’s not impressive at all.

My beef with Klout remains simply that the service provides us with no real validation or explanation of our scores. They don’t show us how many times we have been RT’ed, mentioned, etc. On Google, you can look up your page rank; on app stores, you can see your average rating and number of ratings; on Klout, you are told that your true reach has increased, but not told what that implies or how you can verify it.

Klout is still the social influence measurement leader, but with PeerIndex rapidly improving (and better in many ways in my opinion), and new competitors such as Proskore and Kred popping up, Klout should be worried. I’ll have a review of both Proskore and Kred up shortly as well so you can easily compare them for yourself.


Honey Cake Recipe

Posted: October 18th, 2011 | Filed under: Uncategorized

In addition to hating on Klout and writing random code blog posts, I’ve decided to branch out a bit with my posts and write about one of my other passions: baking and cooking. I recently had a Rosh Hashanah (Jewish New Year) dinner party with my roommates and made several desserts. The traditional dessert for the holiday is a Honey Cake, for a “sweet” new year. I’ve tried about 5 recipes and this is my favorite. Pictures, ingredient list and directions can be found below.

Ingredients:

  • 3 1/2 cups all-purpose flour
  • 1 tablespoon baking powder
  • 1 teaspoon baking soda
  • 1/2 teaspoon kosher salt
  • 4 teaspoons ground cinnamon
  • 1/2 teaspoon ground cloves
  • 1/2 teaspoon ground allspice
  • 1 cup vegetable oil
  • 1 cup honey
  • 1 1/2 cups granulated sugar
  • 1/2 cup brown sugar
  • 3 large eggs at room temperature
  • 1 teaspoon vanilla extract
  • 1 cup coffee
  • 1/2 cup fresh orange juice
  • 1/4 cup bourbon

Preparation Instructions:

  1. In a large bowl, whisk together the flour, baking powder, baking soda, salt, cinnamon, cloves and allspice
  2. Make a well in the center, and add oil, honey, granulated sugar, brown sugar, eggs, vanilla, coffee, orange juice and bourbon
  3. Mix well and then portion into tins. Do not overbeat! If you are using an electric mixer, do so at a low setting.

This is a huge recipe and it fits into FIVE 7.75 inch x 3.75 inch x 2.75 inch loaf pans. I normally just pick up a pack of 6 disposable aluminum tins at the grocery store.

The batter should look like this once combined:

The cakes keep for up to 5 days and honestly are better after sitting for a day. Just cover with foil or cling wrap in the tin to store. Enjoy!


Installing MACS (Markovian Coalescent Simulator) on OS X 10.7 Lion

Posted: October 15th, 2011 | Filed under: Uncategorized

After a few years’ break from my dissertation research on coalescent modeling of HIV sequences, I’ve decided to dive in again. I am interested in generating some sequences from coalescent models with various characteristics. Though ms and msHOT from the Hudson lab are what I’ve used previously, I decided to see what else is out there.

The paper describing the MACS methodology can be found here. The files are available here. In the extracted directory simply run “make all”.

I ran into two compilation errors:

g++ -Wall -g -I /Users/garychen/software/boost_1_36_0 -c algorithm.cpp
algorithm.cpp: In member function ‘void GraphBuilder::build()’:
algorithm.cpp:1272: error: ‘uint’ was not declared in this scope
algorithm.cpp:1272: error: expected `;’ before ‘iSegLength’
algorithm.cpp:1273: error: ‘iSegLength’ was not declared in this scope
make: *** [algorithm.o] Error 1

I changed uint to int on line 1272 (unsigned int would be the closer match, since uint is not a standard C++ type) and combined the cout on lines 1273 and 1274 into one line, which fixed the complaints. Then the build complained:

g++ -o macs simulator.o algorithm.o datastructures.o -static
ld: library not found for -lcrt0.o

Simply comment out “LINKFLAGS = -static” in the Makefile and everything should happily compile.