Library not loaded: libmysqlclient.18.dylib

Posted: August 12th, 2011 | Author: | Filed under: MySQL, Python | 6 Comments »

Last week I upgraded to Lion from Snow Leopard. While I love the subtle touches of the new OS and even the natural (inverted) scrolling, it seriously screwed with my existing Python packages and torched gcc until I installed Xcode 4. I have a Python script which uses the MySQLdb package. After a supposedly successful installation of MySQLdb, running my Python script yielded the error:

ImportError: dlopen(/Users/alex/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
Referenced from: /Users/alex/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so
Reason: image not found

For some reason the install pointed itself to the wrong place. Adding the following to your ~/.profile or ~/.bash_profile should fix the issue (assuming this is where your MySQL installation sits):

export DYLD_LIBRARY_PATH=/usr/local/mysql/lib:$DYLD_LIBRARY_PATH

Open up a new terminal and you should be good to go.

Update! This also fixes some Ruby 1.9 and Rails 3 installation issues on OS X Lion. Thanks to Mauro Morales (@noiz777) for finding this!


How to hire a data scientist or statistician

Posted: August 9th, 2011 | Author: | Filed under: R, Statistics | 7 Comments »

In April, I interviewed with Chomp, Bing, Facebook, Foursquare, Zynga, and several other companies. Each company repeatedly expressed the difficulty of finding qualified candidates in the areas of data science and statistics. While everyone that interviewed me was incredibly talented and passionate about their work, few knew how to correctly interview a data scientist or statistician.

Hilary Mason (@hmason), chief scientist at bitly, created an excellent graphic describing where data science sits (though I think math should be replaced by statistics).

I am obviously biased with respect to the importance of statistics based on my education, though other people seem to agree with me. During interviews, we tend to either ask questions that play to our individual strengths or brainteasers. Though easier, this approach is fundamentally wrong. Interviewers should outline the skills required for the role and ask questions to ensure the candidate possesses all the necessary qualifications. If your interview questions don’t have a specific goal in mind, they are shitty interview questions. This means, by definition, that most brain teaser and probability questions are shitty interview questions.

A data scientist or statistician should be able to:

  • Pull data from, create, and understand SQL and noSQL dbs (and the relative advantages of each)
  • understand and construct a good regression
  • write their own map/reduce
  • understand CART, boosting, Random Forests, maybe SVM and fit them in R or using some other open source implementation
  • take a project from start to finish, without the help of an engineer, and create actionable recommendations

Good Interview Questions

Below are a few of the interview questions I’ve heard or used over the past few years. Each has a very specific goal in mind, which I enumerate in the answer. Some are short and very easy, some are very long and can be quite difficult.

Q: How would you calculate the variance of the columns of a matrix (called mat) in R without using for loops?

A: This question establishes familiarity with R by indirectly asking about one of the biggest flaws of the language. If the candidate has used it for any non-trivial application, they will know the apply function and will bitch about the slowness of for loops in R. The solution is:

apply(mat, 2, var)

Q: Suppose you have a .csv file with two columns, the 1st of first names and the 2nd of last names. Write some code to create a .csv file with last names as the 1st column and first names as the 2nd column.

A: You should know basic cat, awk, grep, sed, etc.

cat names.csv | awk -F "," '{print $2","$1}' > flipped_names.csv
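The awk one-liner splits on every comma, so it would mangle a quoted field such as "Smith, Jr.". As a hedged alternative, here is a minimal Python sketch that uses the standard csv module. The function name flip_columns and the sample names are made up for illustration, and in-memory buffers stand in for the real files.

```python
import csv
import io

def flip_columns(src, dst):
    """Copy rows from src to dst with the first two columns swapped."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        if len(row) >= 2:
            # Swap the first two fields; keep any extra columns in place.
            writer.writerow([row[1], row[0]] + row[2:])

# Hypothetical sample data; the second row has a comma inside a quoted field.
src = io.StringIO('Ada,Lovelace\n"Mary Ann","Smith, Jr."\n')
dst = io.StringIO()
flip_columns(src, dst)
flipped = dst.getvalue()
```

With real files, open names.csv for reading and flipped_names.csv for writing (both with newline="") and pass the file objects in place of the buffers.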

Q: Explain map/reduce and then write a simple one in your favorite programming language.

A: This establishes familiarity with map/reduce. See my previous blog post.
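For readers who want something concrete, here is a minimal single-machine sketch of the idea in Python, using word count as the example. It is only an illustration of the map, shuffle, and reduce phases, not the code from the earlier post, and every function name here is my own.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["the cat sat", "the dog sat down"])
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data between them. A good candidate should be able to explain which of the three steps is the expensive one.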

Q: Suppose you are Google and want to estimate the click through rates (CTR) on your ads. You have 1000 queries, each of which has been issued 1000 times. Each query shows 10 ads and all ads are unique. Estimate the CTR for each ad.

A: This is my favorite interview question for a statistician. It doesn’t tackle one specific area, but gets at the depth of statistical knowledge they possess. Only good candidates receive this question. The candidate should immediately recognize this as a binomial trial, so the maximum likelihood estimator of the CTR is simply (# clicks)/(# impressions). This question is easily followed up by mentioning that click through rates are empirically very low, so this will estimate many CTRs at 0, which doesn’t really make sense. The candidate should then suggest altering the estimate by adding pseudo counts: (# clicks + 2)/(# impressions + 4). This is called the Wilson estimator and shrinks your estimate towards .5. Empirically, this does much better than the MLE. You should then ask if this can be interpreted in the context of Bayesian priors, to which they should respond, “Yes, this is equivalent to a prior of beta(2,2), which is the conjugate prior for the binomial distribution.”
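The two estimates above are easy to compare in a few lines of Python. This is a sketch under the setup in the question, where each ad gets 1000 impressions. The function names are mine, and the pseudo-count estimate is written as the posterior mean under a beta(2,2) prior.

```python
def ctr_mle(clicks, impressions):
    """Maximum likelihood estimate of the click through rate."""
    return clicks / impressions

def ctr_shrunk(clicks, impressions, prior_clicks=2, prior_misses=2):
    """Posterior mean under a beta(prior_clicks, prior_misses) prior."""
    return (clicks + prior_clicks) / (impressions + prior_clicks + prior_misses)

# An ad with no clicks in 1000 impressions.
mle_zero = ctr_mle(0, 1000)        # exactly 0
shrunk_zero = ctr_shrunk(0, 1000)  # 2/1004, small but not zero
```

The shrunk estimate never hits 0 or 1. Because it pulls towards .5, it also moves low CTRs up a little, which is one reason a candidate might suggest a prior centered on the overall observed CTR instead.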

The discussion can be led multiple places from here. You can discuss: a) other shrinkage estimators (this is an actual term in statistics, not a Seinfeld reference; see Stein estimators for further reading), b) pooling results from similar queries, c) use of covariates (position, ad text, query length, etc.) to assist in prediction, d) methods for prediction (logistic regression, complicated ML models, etc.). A strong candidate can talk about this problem for at least 15 or 20 minutes.

Q: Suppose you run a regression with 10 variables and 1 is significant at the 95% level. Suppose you then find 10% of the data had been left out randomly and had their y values deleted. How would you predict their y values?

A: I would be very careful about doing this unless it’s sensationally predictive. If one generates 10 variables of random noise and regresses them against white noise, there is a ~40% chance at least one will be significant at a 95% confidence level. This question helps me understand if the individual understands regression. I also usually ask about regression diagnostics and assumptions.
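The ~40% figure follows from a one-line calculation. This sketch assumes the 10 significance tests are independent, which is a simplification.

```python
alpha = 0.05       # false positive rate per variable at the 95% level
n_variables = 10

# P(at least one false positive) = 1 - P(no false positives)
p_at_least_one = 1 - (1 - alpha) ** n_variables
```

Here p_at_least_one comes out to roughly 0.40, so a single significant variable out of 10 is weak evidence on its own.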

Q: Suppose you have the option to go into one of two bank branches. Branch one has 10 tellers, each with a separate queue of 10 customers, and branch two has 10 tellers, sharing one queue of 100 customers. Which do you choose?

A: This question establishes familiarity with a wide range of basic stat concepts: mean, variance, waiting times, central limit theorem, and the ability to model and then analyze a real world situation. Both options have the same mean wait time. The latter option has smaller variance, because you are averaging the wait times of 100 individuals before you rather than 10. One can fairly argue about utility functions and the merits of risk seeking behavior over risk averse behavior, but I’d go for same mean with smaller variance (think about how maddening it is when another line at the grocery store is faster than your own).

Q: Explain how Random Forests differs from a normal regression tree.

A: This question establishes familiarity with two popular ML algorithms. “Normal” regression trees have some splitting rule based on decrease in mean squared error or some other measure of error or misclassification. The tree grows until the next split decreases error by less than some threshold. This often leads to overfitting, and trees fit on data sets with large numbers of variables can completely leave out many variables from the data set. Random Forests are an ensemble of fully grown trees. For each tree, a subsample of the variables and a bootstrap sample of the data are taken and fit, and the trees’ predictions are then averaged together. Generally this prevents overfitting and allows all variables to “shine”. If the candidate is familiar with Random Forests, they should also know about partial dependence plots and variable importance plots. I generally ask this question of candidates that I fear may not be up to speed with modern techniques. Some implementations do not grow trees fully, but the original implementation of Random Forests does.

Bad Interview Questions

The following are probability and intro stat questions that are not appropriate for a data scientist or statistician role. They should have learned this in intro statistics. These would be like asking an engineering candidate the complexity of binary search (O(log n)).

Q: Suppose you are playing a dice game; you roll a single die, then are given the option to re-roll a single time after observing the outcome. What is the expected value of the dice roll?

A: The expected value of a dice roll is 3.5 = (1+2+3+4+5+6)/6, so you should opt to re-roll only if the initial roll is a 1, 2, or 3. If you re-roll (which occurs with probability .5), the expected value of that roll is 3.5, so the expected value is:

4 * 1/6 + 5 * 1/6 + 6 * 1/6 + 3.5 * .5 = 4.25
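The same answer can be checked by brute force. This sketch assumes a fair six-sided die and that you re-roll exactly when the first roll is below the expected value of a fresh roll.

```python
from fractions import Fraction

faces = range(1, 7)
single_roll = Fraction(sum(faces), 6)  # 7/2, the value of a fresh roll

# Keep the first roll if it beats the expected re-roll, otherwise re-roll.
value = sum(max(Fraction(f), single_roll) for f in faces) / 6
```

Using Fraction keeps the arithmetic exact, so value is 17/4, which is the 4.25 above.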

Q: Suppose you have two variables, X and Y, each with standard deviation 1. Then, X + Y has standard deviation 2. If instead X and Y had standard deviations of 3 and 4, what is the standard deviation of X + Y?

A: Variances are additive (for independent variables), not standard deviations. The first example was a trick! Assuming X and Y are independent, sd(X+Y) = sqrt(Var(X+Y)) = sqrt(Var(X) + Var(Y)) = sqrt(sd(X)*sd(X) + sd(Y)*sd(Y)) = sqrt(3*3 + 4*4) = 5.
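To see why the first example is a trick, it helps to write out the general formula Var(X+Y) = Var(X) + Var(Y) + 2*Cov(X,Y). The sketch below is mine, with the correlation as an explicit parameter. The answer of 5 corresponds to a correlation of 0.

```python
import math

def sd_of_sum(sd_x, sd_y, correlation=0.0):
    """Standard deviation of X + Y given each sd and their correlation."""
    variance = sd_x ** 2 + sd_y ** 2 + 2 * correlation * sd_x * sd_y
    return math.sqrt(variance)

uncorrelated = sd_of_sum(3, 4)           # the answer above
trick = sd_of_sum(1, 1, correlation=1)   # sd 2 needs perfect correlation
independent_units = sd_of_sum(1, 1)      # sqrt(2), not 2
```

Standard deviations only add when X and Y are perfectly correlated, which is the hidden assumption in the setup of the question.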

A few closing notes

Don’t ask anything about traversing a tree or graph structure that you learned in your algorithms class. This is a question for a software engineer, not a data scientist or statistician. If you are a software engineer interviewing a data scientist, ask your data scientist friends for questions beforehand. I do this when I interview software engineers and it’s a much better experience for everyone involved. If you don’t know any data scientists, feel free to steal these or email me for more. Finally, I’d love to hear about your favorite interview questions, worst interview experiences, or anything else related to this topic.


50 signals used to compute your Klout score

Posted: August 3rd, 2011 | Author: | Filed under: Klout | 3 Comments »

In my ongoing quest to deconstruct Klout, I’ve decided to begin to tackle the question “How is my Klout computed?” by looking at what signals make up an individual’s score. Klout CEO Joe Fernandez stated that his company’s score is computed using at least 50 signals. This post is my best guess at those 50 variables. Below I have a breakdown of putative signal by source (Twitter, Facebook, LinkedIn, etc.)

Twitter

  1. followers
  2. following – followers
  3. total RT
  4. weighted total RT
  5. unique RT
  6. weighted unique RT
  7. RT/tweet
  8. @mentions
  9. weighted @mentions
  10. unique @mentioners
  11. weighted unique @mentioners
  12. @mentions/tweet
  13. weighted @mentions/tweet

Facebook

  1. friends
  2. total likes
  3. likes/post
  4. total comments
  5. comments/post

LinkedIn

  1. recommenders
  2. likes
  3. comments
  4. connections

The astute among you will notice that only 22 signals are mentioned above. However, this fails to account for time, one critical aspect of Klout. Below I have a plot of my Network Influence subscore of my Klout score from a few days ago.

You will notice a big drop towards the beginning of the plot. This occurred exactly one month after my initial Klout post that was tweeted by Robert Scoble. That was the point at which my Klout began to increase significantly (it has since decreased significantly). If we include each of the signals above over all time and the past month, that yields 44 signals. In addition, the phrase “In the past 90 days” appears in the new Klout UI (pictured above), so I don’t think it’s a huge leap to infer that each signal is also used over a 90 day period, yielding 66 signals. Finally, Klout now allows you to connect your Foursquare and YouTube accounts, so I assume they are tracking friends, checkins, comments, mayorships, YouTube thumbs up, YouTube comments, etc., yielding an even larger signal total.

I don’t actually think that all of these “raw” signals are being used directly to calculate scores; that would be naive. I’m sure scores/totals are transformed (perhaps log), then normalized on a scale from 0 to 100, similar to the Klout score. I also think it’s likely that several of the individual signals listed above are multiplied or otherwise combined to create composite signals. As an example, it’s impressive if you are retweeted often OR if you receive many @mentions; however, it’s super impressive/kloutastic if you are retweeted often AND receive many @mentions. A composite signal may capture that interplay.
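To make that guess concrete, here is a purely hypothetical sketch of what such a transformation could look like. None of this is Klout’s actual method. The log transform, the min-max scaling, the product used for the composite, and all of the names and counts are my own assumptions.

```python
import math

def scale_signal(raw_values):
    """Log-transform raw counts, then min-max scale them onto 0..100."""
    logged = [math.log1p(v) for v in raw_values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0 for _ in logged]
    return [100 * (x - lo) / (hi - lo) for x in logged]

def composite(signal_a, signal_b):
    """Product of two 0..100 signals, rescaled so it is high only when both are."""
    return [a * b / 100 for a, b in zip(signal_a, signal_b)]

# Hypothetical retweet and @mention counts for three users.
retweets = scale_signal([0, 9, 99])
mentions = scale_signal([99, 9, 0])
both = composite(retweets, mentions)
```

In this toy data only the middle user scores on the composite, because the other two are strong on one signal and absent on the other.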

In closing, I think someone with some time and access to the Klout API could use these signals to reconstruct the Klout algorithm. If you’d like to try, shoot me an email and I’d be happy to help in my spare time.


June App Search Analytics

Posted: July 28th, 2011 | Author: | Filed under: Chomp, Statistics | No Comments »

Each month I create an App Search Analytics report for Chomp. You can read about the full details on the Chomp blog, but I wanted to quickly mention some of the monthly highlights here.

June App Search Analytics Highlights

Searching by Function: App Search v Web Search

Posted: July 6th, 2011 | Author: | Filed under: Chomp, Google | No Comments »

This year I was invited to speak at the Wharton Global Alumni Forum in San Francisco to talk about Apps and why we are so excited about them at Chomp. It was a tremendous honor to be one of the 50 speakers selected from the 86,000 alumni of the Wharton School. My presentation focused on three points:

  1. Explosive Growth in App Downloads and Usage
  2. Differences Between App Search and Web Search
  3. Chomp’s Advanced and Innovative Algorithms for App Search

Explosive growth in App Downloads and Usage

Apps have quickly become the window through which we consume content. IDC predicts 183 billion mobile app downloads by 2015 and analytics industry leader Flurry reports that Consumers Now Spend More Time on Mobile Apps Than the Web. Just like the early days of the web, algorithms are needed to manage the explosive growth and help users stay afloat and navigate the unending sea of apps.

Differences Between App Search and Web Search

The differences between App and Web search number far greater than what they have in common. Rather than searching by keyword, users search primarily by function or category when looking for apps. Web pages have a sophisticated link structure; apps do not. The same query has drastically different expected results on web search as compared to app search; consider a search on Google for social networking and a search on Chomp for social networking. As a result, very different algorithms are needed to tackle the problem of App Search.

Chomp’s Advanced and Innovative Algorithms for App Search

Unlike Google, Yahoo, and Bing, Chomp is the first search engine built from the ground up for apps. Our sophisticated machine learning and natural language processing algorithms understand your query beyond just the keyword; Chomp understands the topic in which you are interested. When you search for to do list, we understand that you are interested in managing your time and remembering to complete various tasks. Chomp returns a wide array of apps to help you achieve this goal, not just apps named “to do list”.

In my talk I step through the above three points in more detail and of course end with some app recommendations. Make sure you check out:

  • Findmytap – the best way to find your favorite beer on tap nearby
  • Foodspotting – a great app for finding foodie worthy eats wherever you are
  • Strava – for logging and tracking your bicycle rides.

Above all I work at Chomp because I love and am passionate about apps and the very positive impact they’ve had on us all. Can you imagine your life without a checkin on Foursquare? How many hours have you spent hurling angry birds? No matter how hard I try, I can’t even get lost anymore. The full presentation can be found here and happy app searching!


Indirect Content Privacy Surveys: Measuring Privacy Without Asking About It

Posted: June 25th, 2011 | Author: | Filed under: Google, Privacy, Statistics | No Comments »

Awesome news! My most recent publication: Indirect Content Privacy Surveys: Measuring Privacy Without Asking About It became a featured publication on the Google Research Homepage. For when they take it down I have included screenshots of the homepage (paper is in bottom right hand corner):

and tweet announcing its post:

This is exciting for a couple of reasons. First, recognition by Google is great validation of the importance of the work. Googlers publish tons of papers, and it’s a great honor to have mine showcased on the front page of the research blog. Second, should I ever want to return to academia, this paper now adds much more academic “street cred.” Finally, a press piece about the article was written by Thomas Claburn (@ThomasClaburn). This is the first time anything I’ve written has ever received any sort of non-academic press coverage. His article can be found here. Thomas did contact me for comment several hours before he sent the article, but my coauthors and I were unable to run things through the necessary PR people.

I’m not really going to comment on his article, because I am no longer at Google, don’t want to take on the role of spokesperson, and technically anything I say on the subject should go through the Google press/PR people. I’ll simply say that understanding your user is key to making ANY good product. Laura, Jessica, and I didn’t write this paper or conduct this research with any specific agenda or to right any wrong. We wanted to understand how users feel about and share their content, so we asked. Interesting patterns in their responses emerged, so we investigated and reported our findings. That’s it.

Though I have a disclaimer in my “about” section, I want to again emphasize that all opinions expressed in this post are strictly my own. In particular, they do not reflect those of any past, present, or future employer, especially Google.


Klout perks: Nudie jeans party at Rolo SF

Posted: June 24th, 2011 | Author: | Filed under: Klout | No Comments »

Thursday June 23, I attended my first Klout perk party. Apparently I’ve built up enough Klout by bashing Klout to deserve an invite. The event was showcasing Nudie Jeans at the store Rolo in the SOMA district of San Francisco.


They had free food:

and a DJ:

If you’ve ever wondered what I’d look like as a hipster, here’s a shot of me in some very tight hipster jeans.

For those curious, this pair is the Slim Jim Org. Dry Dark, which can be yours for $179. Overall it was a fun event and one that may signify a move by Klout into the local space. This was the first event of its kind.


An interview with Klout

Posted: June 20th, 2011 | Author: | Filed under: Klout | 2 Comments »

After my initial Klout blog posts, I followed up with their Marketing Manager Megan Berry @meganberry and Director of Ranking Ash Rust @ashrust in a series of emails. I sent them a barrage of questions; below are their answers.

Q: How will Klout deal with identical individuals across multiple networks?
K: We look at each platform holistically to try and determine what the signals of influence are. We then perform sophisticated analysis to weight the different platforms appropriately for each person.
A: No information was conveyed in this answer. In academia and finance the word “sophisticated” is a completely loaded term roughly translating to “it’s actually trivially easy, but I think you are too stupid to understand.”

Q: My Foursquare friends are a strict subset of my Facebook and Twitter friends. Will they double count?
K: Follower and friend count are really not part of what we do — it is about ability to drive action.
A: I thought this was a decent answer. I have a few friends that are very active on foursquare, but relatively inactive on Twitter. If my activity drives actions on both Twitter and foursquare, I should get more credit.

Q: CEO Joe Fernandez stated: “Klout Score is not about followers or your activity level but about how people react to your content.” This is a bit vague. Now that we have more than 140 characters, can you elaborate a bit?
K: Yes, we don’t believe that followers or friends are a good measure of influence. Instead we’re looking at the engagement you get from people (i.e. RTs, @msgs, likes, etc.) and how influential those people are.
A: I agree that followers/friends should be secondary to actions indicating that individuals are actively engaged with your content (i.e. RTs, @msgs, likes, etc.). I also agree that a Robert Scoble or Michael Arrington RT or @ reply should be worth far more than my mom RTing something I say.

Q: Why do you feel you are better than competitors such as Peerindex and twittergrader?
K: We are the emerging standard in this industry — we are used by over 2000 applications and major brands to understand and measure online influence.
A: Worst answer ever, even worse than your parents saying “because I said so.” I more or less agree with their assessment that they are the emerging industry standard, however, would they still be if they didn’t get a 1 year jump start over all other competitors? What if PeerIndex’s infrastructure were more scalable and could handle the same scale as Klout? I expected some statement assessing their relative quality in terms of ranking or infrastructure, not a catch 22 or tautological response.

Q: I think the new K+ system is awesome, but am worried about spam. What steps are you taking to ameliorate the risk?
K: We’re watching this very carefully to understand how people are using it. We also limit people to 5 +K’s a day.
A: Here’s my favorite example so far:

This wasn’t spam; the label was generated by Klout for Daniel Bogan @waferbaby. I strongly believe that Daniel is authoritative on unicorns, so I even voted for his Klout in this area.

His current Klout topics (which still include unicorns) are available here. It also seems like Klout slightly changed their shade of orange from the pics above.

On a more serious note, I think the K+ system is an incredibly important step in the evolution of Klout. The next step is to provide topic specific scores. Using these they can start to tackle the holy grail of influence measures: individualized influence scores. A serious problem with existing Klout scores is that they remove individual context from the equation. Justin Bieber will forever have a Klout of 0 for me, even though his systemwide Klout is 100. It would be easy to check that Justin Bieber has no Klout for the topics I am interested in (apps, startups, statistics, mongodb, etc.). This approach is much more computationally expensive and harder to get right. Tweets, Facebook posts, etc. are not labeled with topics, so these must be inferred. This is a VERY difficult problem due to the small amount of text. I’m sure Klout and PeerIndex are both working very hard to tackle the problem. Whoever gets that right will take the “influence” market.

Here are several questions they refused to answer:

  • What % of Klout systemwide is attributed to Twitter v Facebook?
  • Will adding another network ALWAYS increase your score? If not always, empirically, what % of the time does an increase occur?
  • In a recent Kloutchat, the statement was made: “Nearly 50 variables in generating klout score but it all boils down to how people react to your content” What’s the most interesting/surprising variable that you are willing to divulge?

It’s not surprising that Megan and Ash did not answer all of my questions. I think the answer to the first question is roughly 85/15, but they won’t say that publicly because a) it might piss off Facebook if they think they are “underweighted” and b) they don’t want people outside the company arguing about how this should be weighted. I’m sure they have had tons of discussions about this internally. For the second question, the answer is yes, until Klout tells us otherwise. I won’t go on a rant about the silliness of this; however, if you want to boost your score, attach your FB account. I’ll do a longer “Klout SEO” post in a few weeks.

I sent PeerIndex CEO Azeem Azhar the same group of questions, and will post his responses when I hear back. Next post, I’ll answer the question: “If I were creating my own Klout/PeerIndex/Twittergrader competitor, what signals would I use?” I bet I can come up with a set of 50 variables very close to those used by Klout.


Update: Wharton Global Alumni Forum 2011

Posted: June 19th, 2011 | Author: | Filed under: Chomp, Statistics | No Comments »

Update! The subject of my talk at the Wharton Global Alumni Forum in San Francisco June 23-24, 2011, has changed. The new title and abstract can be found below.

Searching by Function: App Search v. Web Search

The age of apps has officially arrived. There are more than 500,000 apps for iOS and nearly 300,000 on Android. Last year alone 8 billion apps were downloaded, and nearly that many have been downloaded this year. Growth in this multi-billion dollar industry continues to accelerate. Navigating this massive landscape has become a real problem for smartphone users, because traditional keyword-based search algorithms fail to perform as efficiently. Consumers search by app function, requiring a very different approach to ranking. In this talk we dissect current issues with app search and discuss several solutions implemented at Chomp.


Disregard anything said by the International Air Transport Association

Posted: June 13th, 2011 | Author: | Filed under: Statistics | No Comments »

Today I read the least consequential and most pointless news article in the history of journalism: Is It Really Safe to Use a Cellphone on a Plane? The article enumerates some recent findings of the International Air Transport Association concerning the danger in use of personal electronic devices on airplanes (not cellphones specifically). The agency concludes that 75 events over the years 2003 – 2009 have possibly been linked to these devices. It then provides some scary quotes about the use of personal electronic devices. My favorite was: “A clock spun backwards and a GPS in cabin read incorrectly while two laptops were being used nearby.” Sounds like a voiceover clip from a crappy B movie.

There are approximately 32,000 flights over the US every day, which totals 81,760,000 over the seven years of the study. 75 incidents implies an incident rate of about .0001%. Which is more likely: instruments working about 99.9999% of the time, or cell phones causing instruments to break, but only .0001% of the time?
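The arithmetic is simple enough to check in a few lines. This sketch uses the figures quoted above and ignores leap days, as the total above does.

```python
flights_per_day = 32_000
years = 7  # 2003 through 2009 inclusive
total_flights = flights_per_day * 365 * years

incidents = 75
rate_percent = 100 * incidents / total_flights
```

Here total_flights is 81,760,000 and rate_percent is a little under .0001, matching the numbers above.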

So who wrote this journalistic gem? ABC journalists Brian Ross @brianross (who usually produces very high quality work) and Avni Patel. ABC’s own expert John Nance explains, “If an airplane is properly hardened, in terms of the sheathing of the electronics, there’s no way interference can occur.” If your own expert thinks the report is wrong, why report on it? Is this really the only story they could come up with? How about reporting on something of value, rather than going for shock value and misleading headlines? In the words of my parents, “I’m not upset, just disappointed.” I’d love to get my hands on this report, but apparently it’s a “confidential industry study,” which I believe is code for “embarrassing and wrong.”