Reading big files in R

Posted: April 12th, 2012 | Filed under: R, Statistics

As the lone statistician in my workplace, I end up introducing many people to R. After the inevitable pirate jokes, my coworkers who program in real languages (C++, Java, Python, PHP, etc.) ultimately end up complaining about R, which does a couple of things very well and a lot of things VERY poorly. Each has complained about data frames and reading data into R.

For those who don’t know, read.csv does not return an array or list; it returns a data frame, which takes up far more memory than it should and (by default in versions of R before 4.0.0) converts all strings to factors for easier use in regression. This is dumb, and in my experience it can increase both the time it takes to read the file and the memory it takes up by 5x-10x. For a quick fix, set stringsAsFactors=FALSE. If you don’t have column headings, which I normally don’t, set header=FALSE as well:

data = read.csv("datafile.csv", header = FALSE, stringsAsFactors = FALSE)


A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics

Posted: October 27th, 2011 | Filed under: Python, R, Statistics

I’m happy to announce my most recent publication, “A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics,” in the Journal of Quantitative Analysis in Sports. Though this might sound boring unless you are a baseball fan and/or a Bayesian (and perhaps even then), the paper is fundamentally about how to choose which metrics are predictive, a topic anyone in statistics, analytics, or any other data-driven field should care deeply about.

I’ll try to motivate this in as general a setting as possible. Suppose you have some metric (batting average, earnings, engagement) for a population of individuals (baseball players, businesses, users of your product) over several time periods. A traditional random effects model estimates a separate intercept term for every individual, which implicitly assumes every individual differs from the average. In some situations that assumption is unrealistic.

Often populations contain individuals who are indistinguishable from average, meaning it’s better to estimate their value with the overall mean rather than with their own data. This implies the metric is not predictive for that individual. By definition, those not in this first group are systematically high or low relative to the average. Examples of the second group include Barry Bonds, who always hit more home runs and took more steroids than average, Warren Buffett and Berkshire Hathaway, who always made better investments than average, or my Google friends’ use of Facebook since the release of Google+, which is systematically lower than average. This is best visualized by two distributions: the black spike with all its probability at the overall mean (average individuals), and the red distribution with most of its probability far above or below this value (non-average individuals).
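The two-group idea can be sketched in a few lines of Python. This is only an illustration of the spike-plus-spread picture, not the model from the paper: the function name, the mixing probability, and the use of a normal distribution for the non-average group are all assumptions made for the example.

```python
import random

def draw_true_levels(n, overall_mean, p_average, spread_sd, rng):
    """Draw each individual's true level from a two-part mixture.

    With probability p_average the individual sits exactly at the overall
    mean (the spike); otherwise the level is drawn from a wider distribution
    (a normal here, purely as a stand-in assumption).
    """
    levels = []
    for _ in range(n):
        if rng.random() < p_average:
            levels.append(overall_mean)  # indistinguishable from average
        else:
            levels.append(rng.gauss(overall_mean, spread_sd))  # systematically high or low
    return levels

rng = random.Random(0)  # fixed seed so the draw is repeatable
levels = draw_true_levels(1000, overall_mean=0.260, p_average=0.6,
                          spread_sd=0.03, rng=rng)

# Share of individuals sitting exactly on the spike
share_average = sum(1 for v in levels if v == 0.260) / len(levels)
print(share_average)
```

The share printed at the end should land near the assumed p_average of 0.6; estimating that membership probability from noisy data is the hard part the paper addresses.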

Once we find the probability that each individual belongs to each of the two categories for each metric, we can say a metric is predictive if: a) most individuals are systematically different from the average and b) most of the metric’s variance is explained by the model. Finally, the obligatory plot showing our method performs at least as well on a holdout sample as other methods for the 50 metrics tested:

For those interested (which should be everyone but in practice is almost no one), our method also automatically controls for multiple testing, as we perform 1,575 tests in our analysis.

This paper was co-authored with Blake McShane, James Piette, and Shane Jensen, and can be viewed as a more technical companion piece to our previous paper, “A Point-Mass Mixture Random Effects Model for Pitching Metrics,” which can be downloaded here. The poorly commented Python code for the MCMC sampler can be found here. If you’re interested in implementing or tweaking our methodology, feel free to send me an email or reach out on Twitter.


How to hire a data scientist or statistician

Posted: August 9th, 2011 | Filed under: R, Statistics

In April, I interviewed with Chomp, Bing, Facebook, Foursquare, Zynga, and several other companies. Each company repeatedly expressed the difficulty of finding qualified candidates in the areas of data science and statistics. While everyone who interviewed me was incredibly talented and passionate about their work, few knew how to correctly interview a data scientist or statistician.

Hilary Mason (@hmason), chief scientist at bitly, created an excellent graphic describing where data science sits (though I think math should be replaced by statistics).

I am obviously biased with respect to the importance of statistics based on my education, though other people seem to agree with me. During interviews, we tend to either ask questions that play to our individual strengths or brainteasers. Though easier, this approach is fundamentally wrong. Interviewers should outline the skills required for the role and ask questions to ensure the candidate possesses all the necessary qualifications. If your interview questions don’t have a specific goal in mind, they are shitty interview questions. This means, by definition, that most brain teaser and probability questions are shitty interview questions.

A data scientist or statistician should be able to:

  • pull data from, create, and understand SQL and NoSQL databases (and the relative advantages of each)
  • understand and construct a good regression
  • write their own map/reduce
  • understand CART, boosting, Random Forests, and maybe SVM, and fit them in R or using some other open source implementation
  • take a project from start to finish, without the help of an engineer, and create actionable recommendations

Good Interview Questions

Below are a few of the interview questions I’ve heard or used over the past few years. Each has a very specific goal in mind, which I enumerate in the answer. Some are short and very easy, some are very long and can be quite difficult.

Q: How would you calculate the variance of the columns of a matrix (called mat) in R without using for loops?

A: This question establishes familiarity with R by indirectly asking about one of the biggest flaws of the language. If the candidate has used it for any non-trivial application, they will know the apply function and will bitch about the slowness of for loops in R. The solution is:

apply(mat, 2, var)

Q: Suppose you have a .csv file with two columns, the 1st of first names and the 2nd of last names. Write some code to create a .csv file with last names as the 1st column and first names as the 2nd column.

A: You should know basic cat, awk, grep, sed, etc.

cat names.csv | awk -F "," '{print $2 "," $1}' > flipped_names.csv

Q: Explain map/reduce and then write a simple one in your favorite programming language.

A: This establishes familiarity with map/reduce. See my previous blog post.
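For reference, here is roughly what a passing answer might look like in Python. It is a single-machine word count sketch, so it only shows the shape of the map, shuffle, and reduce steps rather than a distributed implementation, and the function names are my own illustrative choices.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

def word_count(documents):
    # Each document could be mapped on a different worker; here it is a loop.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))

counts = word_count(["the cat sat", "the dog sat down"])
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1, 'down': 1}
```

What I would listen for is whether the candidate can say why the map and reduce steps can run in parallel: each map call sees only one document, and each reduce call sees only one key.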

Q: Suppose you are Google and want to estimate the click-through rates (CTR) on your ads. You have 1000 queries, each of which has been issued 1000 times. Each query shows 10 ads and all ads are unique. Estimate the CTR for each ad.

A: This is my favorite interview question for a statistician. It doesn’t tackle one specific area, but gets at the depth of statistical knowledge they possess. Only good candidates receive this question. The candidate should immediately recognize this as a binomial trial, so the maximum likelihood estimator of the CTR is simply (# clicks)/(# impressions). This question is easily followed up by mentioning that click-through rates are empirically very low, so this will estimate many CTRs at 0, which doesn’t really make sense. The candidate should then suggest altering the estimate by adding pseudo counts: (# clicks + 2)/(# impressions + 4). This is often called the Wilson estimator (it is the “plus four” point estimate associated with the Agresti-Coull interval, which approximates the center of the Wilson score interval) and shrinks your estimate towards 0.5. Empirically, this does much better than the MLE. You should then ask if this can be interpreted in the context of Bayesian priors, to which they should respond, “Yes, this is the posterior mean under a Beta(2, 2) prior, and the beta distribution is the conjugate prior for the binomial distribution.”
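To make the two estimators concrete, here is a small Python sketch. The function names and the example click counts are made up for illustration; the only thing taken from the question is that each ad gets 1000 impressions.

```python
def mle_ctr(clicks, impressions):
    """Maximum likelihood estimate: the raw click fraction."""
    return clicks / impressions

def smoothed_ctr(clicks, impressions, prior_clicks=2, prior_misses=2):
    """Pseudo-count estimate: (clicks + 2) / (impressions + 4) by default.

    Equivalently, the posterior mean of the CTR under a
    Beta(prior_clicks, prior_misses) prior.
    """
    return (clicks + prior_clicks) / (impressions + prior_clicks + prior_misses)

# Hypothetical click counts for three ads, each shown 1000 times
for clicks in (0, 3, 50):
    print(clicks, mle_ctr(clicks, 1000), smoothed_ctr(clicks, 1000))
```

An ad with zero clicks gets an MLE of exactly 0, while the smoothed estimate is 2/1004, small but not zero. Since real CTRs sit far below 0.5, a good follow-up is to ask what prior would make more sense than Beta(2, 2).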

The discussion can be led multiple places from here. You can discuss: a) other shrinkage estimators (this is an actual term in Statistics, not a Seinfeld reference; see Stein estimators for further reading), b) pooling results from similar queries, c) use of covariates (position, ad text, query length, etc.) to assist in prediction, and d) the method for prediction (logistic regression, complicated ML models, etc.). A strong candidate can talk about this problem for at least 15 or 20 minutes.

Q: Suppose you run a regression with 10 variables and 1 is significant at the 95% level. Suppose you then find 10% of the data had been left out randomly and had their y values deleted. How would you predict their y values?

A: I would be very careful about doing this unless it’s sensationally predictive. If one generates 10 variables of random noise and regresses white noise on them, there is a ~40% chance at least one will be significant at a 95% confidence level: treating the 10 tests as independent, 1 - 0.95^10 ≈ 0.40. This question helps me understand if the individual understands regression. I also usually ask about regression diagnostics and assumptions.
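The ~40% figure is quick to check. A minimal sketch, assuming the 10 significance tests are independent (which is only approximately true for coefficients in a single regression); the function name is illustrative:

```python
def prob_any_significant(n_tests, alpha=0.05):
    """Chance that at least one of n_tests independent tests of true null
    hypotheses comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** n_tests

print(round(prob_any_significant(10), 3))  # 0.401
```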

Q: Suppose you have the option to go into one of two bank branches. Branch one has 10 tellers, each with a separate queue of 10 customers, and branch two has 10 tellers, sharing one queue of 100 customers. Which do you choose?

A: This question establishes familiarity with a wide range of basic stat concepts: mean, variance, waiting times, central limit theorem, and the ability to model and then analyze a real-world situation. Both options have the same mean wait time. The latter option has smaller variance, because you are averaging the wait times of 100 individuals before you rather than 10. One can fairly argue about utility functions and the merits of risk-seeking behavior over risk-averse behavior, but I’d go for same mean with smaller variance (think about how maddening it is when another line at the grocery store is faster than your own).

Q: Explain how Random Forests differs from a normal regression tree.

A: This question establishes familiarity with two popular ML algorithms. “Normal” regression trees have some splitting rule based on decrease in mean squared error or some other measure of error or misclassification. The tree grows until the next split decreases error by less than some threshold. This often leads to overfitting, and trees fit on data sets with large numbers of variables can completely leave out many variables from the data set. Random Forests are an ensemble of fully grown trees. Each tree is fit to a bootstrap sample of the data, with only a random subset of the variables considered at each split, and the trees’ predictions are then averaged together. Generally this prevents overfitting and allows all variables to “shine”. Some implementations do not grow trees fully, but the original implementation of Random Forests does. If the candidate is familiar with Random Forests, they should also know about partial dependence plots and variable importance plots. I generally ask this question of candidates that I fear may not be up to speed with modern techniques.

Bad Interview Questions

The following are probability and intro stat questions that are not appropriate for a data scientist or statistician role. Candidates should have learned this in intro statistics. These would be like asking an engineering candidate the complexity of binary search (O(log n)).

Q: Suppose you are playing a dice game; you roll a single die, then are given the option to re-roll a single time after observing the outcome. What is the expected value of the dice roll?

A: The expected value of a dice roll is 3.5 = (1+2+3+4+5+6)/6, so you should opt to re-roll only if the initial roll is a 1, 2, or 3. If you re-roll (which occurs with probability .5), the expected value of that roll is 3.5, so the expected value is:

4 * 1/6 + 5 * 1/6 + 6 * 1/6 + 3.5 * .5 = 4.25
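The same arithmetic can be checked exactly in Python. A small sketch using fractions, where the variable names are my own; it applies the rule above of keeping the first roll only when it beats the 3.5 expected from re-rolling:

```python
from fractions import Fraction

faces = range(1, 7)

# Expected value of a single roll: (1+2+3+4+5+6)/6 = 7/2
single_roll = Fraction(sum(faces), 6)

# Keep the first roll if it beats a re-roll's expected value, else re-roll.
with_reroll = sum(max(Fraction(face), single_roll) for face in faces) / 6

print(single_roll, with_reroll)  # 7/2 17/4
```

17/4 is the 4.25 from the formula above.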

Q: Suppose you have two variables, X and Y, each with standard deviation 1. Then, X + Y has standard deviation 2. If instead X and Y had standard deviations of 3 and 4, what is the standard deviation of X + Y?

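A: Variances are additive (when X and Y are independent, or at least uncorrelated), not standard deviations. The first example was a trick! With uncorrelated X and Y, each with standard deviation 1, X + Y has standard deviation sqrt(2), not 2. For the second case, sd(X+Y) = sqrt(Var(X+Y)) = sqrt(Var(X) + Var(Y)) = sqrt(sd(X)*sd(X) + sd(Y)*sd(Y)) = sqrt(3*3 + 4*4) = 5.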
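A short Python sketch makes the trick visible. It uses the general identity Var(X+Y) = Var(X) + Var(Y) + 2*Cov(X,Y); the function name and the corr parameter are illustrative additions, not part of the original question.

```python
import math

def sd_of_sum(sd_x, sd_y, corr=0.0):
    """Standard deviation of X + Y, given each sd and their correlation.

    Var(X + Y) = Var(X) + Var(Y) + 2 * corr * sd(X) * sd(Y)
    """
    return math.sqrt(sd_x ** 2 + sd_y ** 2 + 2 * corr * sd_x * sd_y)

print(sd_of_sum(3, 4))            # 5.0, the uncorrelated case in the answer
print(sd_of_sum(1, 1))            # about 1.414, not 2
print(sd_of_sum(1, 1, corr=1.0))  # 2.0, only when X and Y are perfectly correlated
```

So the setup in the question, where two standard deviations of 1 add up to 2, holds only if X and Y are perfectly correlated.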

A few closing notes

Don’t ask anything about traversing a tree or graph structure that you learned in your algorithms class. This is a question for a software engineer, not a data scientist or statistician. If you are a software engineer interviewing a data scientist, ask your data scientist friends for questions beforehand. I do this when I interview software engineers and it’s a much better experience for everyone involved. If you don’t know any data scientists, feel free to steal these or email me for more. Finally, I’d love to hear about your favorite interview questions, worst interview experiences, or anything else related to this topic.