Klout improves score by making it less transparent and even harder to explain

Posted: August 29th, 2011 | Author: | Filed under: Klout, PeerIndex, Statistics | No Comments »

After taking a few weeks off from reaming Klout, I find their newest “improvements” have left me with no choice but to write a sardonic and snarky response. Klout has added 5 new services (Instagram, Flickr, tumblr, Last.fm, and Blogger) and removed ALL secondary statistics from our profile pages. I’m still not sure which is worse, just that both are stupid. I’ll start by criticizing the addition of new services with a simulated conversation between Klout and myself.

Part 1: A conversation with Klout about their new signals

Alex: This brings the total services to 10. Really Klout, you need 10 services?
Klout: Of course this will help make your Klout score even better!
Alex: But you didn’t do a good job with just Twitter and Facebook, how can I expect you to do a good job with 10?
Klout: More data always improves the performance of complicated, black box, machine learning algorithms like our own.
Alex: That’s actually false.
Klout: Ummmm, look dude, I’m just a data whore and want to sell your data to the man.
Alex: So you just want all of my data to sell it to the man and give me nothing in return?
Klout: We actually have a terrific Klout perks program. I see you’ve received two Klout perks.
Alex: Yup, you sent me a Rizzoli and Isles gift pack, a TV show on a network I don’t have and literally hadn’t heard of before receiving the gift pack. Did I mention that the gift pack came with handcuff earrings?
Klout: But what about your other Klout perk, a party at Rolo, a store in SF that sells jeans. Careful analysis of your Twitter, Facebook, Foursquare and LinkedIn data led us to believe that you like or wear jeans.
Alex: Everyone wears jeans. That’s similar to predicting that I like to go on vacation or eat tasty food. These jeans happened to be $175, which doesn’t sound like much of a perk to me.

On top of this, Android users actually can’t even connect their Klout accounts to Instagram because the app is iPhone only. Ironically, the Klout blog just posted about the average Klout of iPhone and Android users, finding the former beat out the latter 42.0 to 40.6. Perhaps the comparison would be more equal if Android users were allowed to connect 10 services rather than 9? Does MG Siegler actually need more Klout?

Part 2: Klout removes any accountability from website

Finally, let’s discuss the complete lack of transparency imposed by their recent banishment of the majority of profile stats. Here is a screenshot of my “Network Influence” before:

and after:

You will notice that the supporting stats are gone. Though this absence makes it much harder for me to criticize the inconsistencies in their score, it also takes away most of the utility I received from Klout. Unless you run your own Twitter analytics, you don’t have access to this info, and that’s one of the reasons Klout was cool. It indulges my latent nerd narcissism. How many RTs do I get? How many @ mentions? How many unique ones? Now I just get a number with little explanation. Luckily, Klout competitor PeerIndex still has much of that info:

From Klout’s point of view, I completely understand why they would want to add more services: greater reach, more data, more partners, etc. I suppose they could justify the removal of more specific stats by saying that things could get too crowded on the main page, but then put the data on another page; don’t take it away. Twitter and Facebook still drive the large majority of usage. Do you really think Blogger cares if their stats aren’t on the main page? Seems nefarious to me.


July App Search Analytics

Posted: August 27th, 2011 | Author: | Filed under: Chomp, Statistics | No Comments »

July’s App Search Analytics from Chomp have been posted on Chomp’s site. Here are the main bullet points:

  • Paid apps increased their share for the third consecutive month, up 2% on both Android and iPhone.
  • Blockbuster movie releases correlated strongly with app searches for the term “movie”.
  • Spotify searches surged 250% around the U.S. launch.

For more details see the post on Chomp’s blog here.

Chomp Charts July


mongorestore error: don’t know what to do with…

Posted: August 19th, 2011 | Author: | Filed under: mongodb | 2 Comments »

I like to develop for mongodb on my local machine to make sure everything is fast and won’t nuke one of our production dbs. To do this I often use the mongodump command to grab a collection and then load it locally so I can work with a snapshot of production data. Here is a problem I ran into recently that I thought would be worth blogging about. First I dumped the collection rob_job_metric:

mongodump --db data --collection rob_job_metric --out - > rob_job_metric.mongodump

After moving it to my local machine, I tried mongorestore:

alex@Alexander-Braunsteins-iMac-2:scripts> ~/Downloads/mongodb-osx-x86_64-1.8.1/bin/mongorestore ~/Desktop/scripts/rob_job_metric.mongodump
connected to: ###########
don’t know what to do with [/Users/alex/Desktop/scripts/rob_job_metric.mongodump]

Really mongodb? I’m pretty sure you know what to do with this file as you just created it. After messing around a bit I discovered that mongorestore requires the file to end with .bson, even though the file was already in that format, just not named so:

alex@Alexander-Braunsteins-iMac-2:scripts> mv rob_job_metric.mongodump rob_job_metric.mongodump.bson
alex@Alexander-Braunsteins-iMac-2:scripts> ~/Downloads/mongodb-osx-x86_64-1.8.1/bin/mongorestore ~/Desktop/scripts/rob_job_metric.mongodump.bson
connected to: ###########
Wed Aug 17 10:29:29 /Users/alex/Desktop/scripts/rob_job_metric.mongodump.bson
Wed Aug 17 10:29:29 going into namespace [scripts.rob_job_metric.mongodump]
22752893/23620007 96%
Wed Aug 17 10:29:32 113872 objects found


A simple MySQLdb example python script

Posted: August 17th, 2011 | Author: | Filed under: MySQL, Python | No Comments »

I mostly stick to mongodb nowadays, but every now and again I need to access data stored in a MySQL table. In my last post I talked about a MySQLdb error. This is a variant of the script which induced the error. It takes a .csv file with application ids piped to the script and joins them with price, category, and name data from a db. This script uses the simplejson and MySQLdb packages.

#!/usr/bin/env python

import sys
import simplejson
import MySQLdb
import re

def connect_db(host, port, user, password, db):
    try:
        return MySQLdb.connect(host=host, port=port, user=user, passwd=password, db=db)
    except MySQLdb.Error, e:
        sys.stderr.write("[ERROR] %d: %s\n" % (e.args[0], e.args[1]))
        return False

def main():
    # the line below won't work for you unless you put in your working credentials
    # you didn't think I'd put working credentials on my blog did you?
    dbconn = connect_db(ip, port, user, password, db)

    for line in sys.stdin.readlines():
        app_id = line.split(",")[0]
        sql = "SELECT info FROM apps WHERE id = '%s'" % app_id
        try:
            cursor = dbconn.cursor()
            cursor.execute(sql)
            result = cursor.fetchone()
        except MySQLdb.Error, e:
            sys.stderr.write("[ERROR] %d: %s\n" % (e.args[0], e.args[1]))
            continue

        data = simplejson.loads(result[0])
        price = data["price"] if data["price"] else "null"
        categories = data["categories"] if data["categories"] else "null"
        name = data["appName"] if data["appName"] else "null"
        print "%s,%s,%s,%s" % (name, line.strip(), price, categories)

if __name__ == "__main__":
    main()


Library not loaded: libmysqlclient.18.dylib

Posted: August 12th, 2011 | Author: | Filed under: MySQL, Python | 6 Comments »

Last week I upgraded to Lion from Snow Leopard. While I love the subtle touches of the new OS and even the natural (inverted) scrolling, it seriously screwed with my existing python packages and torched gcc until I installed Xcode 4. I have a python script which uses the MySQLdb package. After a supposedly successful installation of MySQLdb, running my python script yielded the error:

ImportError: dlopen(/Users/alex/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
Referenced from: /Users/alex/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so
Reason: image not found

For some reason the install pointed itself to the wrong place. Adding the following to your ~/.profile or ~/.bash_profile should fix the issue (assuming this is where your MySQL installation sits):

export DYLD_LIBRARY_PATH=/usr/local/mysql/lib:$DYLD_LIBRARY_PATH

Open up a new terminal and you should be good to go.
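If you want a quick sanity check that the fix took, a one-line import from a fresh shell session will tell you whether the library now resolves (a minimal sketch, assuming the same MySQL-python 1.2.3 install as in the error above):

# run from a *new* terminal so the updated DYLD_LIBRARY_PATH is picked up
import MySQLdb
print MySQLdb.__version__  # e.g. 1.2.3, instead of the ImportError above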

Update! This also fixes some ruby 1.9 and rails 3 installation issues on OS X Lion. Thanks to Mauro Morales (@noiz777) for finding this!


How to hire a data scientist or statistician

Posted: August 9th, 2011 | Author: | Filed under: R, Statistics | 7 Comments »

In April, I interviewed with Chomp, Bing, Facebook, Foursquare, Zynga, and several other companies. Each company repeatedly expressed the difficulty of finding qualified candidates in the areas of data science and statistics. While everyone that interviewed me was incredibly talented and passionate about their work, few knew how to correctly interview a data scientist or statistician.

Hilary Mason (@hmason), chief scientist at bitly, created an excellent graphic describing where data science sits (though I think math should be replaced by statistics).

I am obviously biased with respect to the importance of statistics based on my education, though other people seem to agree with me. During interviews, we tend to ask either questions that play to our individual strengths or brainteasers. Though easier, this approach is fundamentally wrong. Interviewers should outline the skills required for the role and ask questions to ensure the candidate possesses all the necessary qualifications. If your interview questions don’t have a specific goal in mind, they are shitty interview questions. This means, by definition, that most brain teaser and probability questions are shitty interview questions.

A data scientist or statistician should be able to:

  • pull data from, create, and understand SQL and NoSQL dbs (and the relative advantages of each)
  • understand and construct a good regression
  • write their own map/reduce
  • understand CART, boosting, and Random Forests (and maybe SVMs), and fit them in R or using some other open source implementation
  • take a project from start to finish, without the help of an engineer, and create actionable recommendations

Good Interview Questions

Below are a few of the interview questions I’ve heard or used over the past few years. Each has a very specific goal in mind, which I enumerate in the answer. Some are short and very easy, some are very long and can be quite difficult.

Q: How would you calculate the variance of the columns of a matrix (called mat) in R without using for loops?

A: This question establishes familiarity with R by indirectly asking about one of the biggest flaws of the language. If the candidate has used it for any non-trivial application, they will know the apply function and will bitch about the slowness of for loops in R. The solution is:

apply(mat, 2, var)

Q: Suppose you have a .csv file with two columns, the first containing first names and the second containing last names. Write some code to create a .csv file with last names as the first column and first names as the second column.

A: You should know basic cat, awk, grep, sed, etc.

cat names.csv | awk -F "," '{print $2","$1}' > flipped_names.csv

Q: Explain map/reduce and then write a simple one in your favorite programming language.

A: This establishes familiarity with map/reduce. See my previous blog post.
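If you want a toy example to hand the candidate, here is a minimal word-count map/reduce sketch in plain Python (my own illustration, not the version from that post):

import sys
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # emit (word, 1) for every word in the line
    for word in line.strip().lower().split():
        yield word, 1

def reducer(word, counts):
    # sum the counts for a single key
    return word, sum(counts)

def map_reduce(lines):
    # map phase
    pairs = [pair for line in lines for pair in mapper(line)]
    # shuffle/sort phase: group pairs by key
    pairs.sort(key=itemgetter(0))
    # reduce phase
    return [reducer(word, (c for _, c in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

if __name__ == "__main__":
    for word, count in map_reduce(sys.stdin.readlines()):
        print "%s\t%d" % (word, count)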

Q: Suppose you are Google and want to estimate the click through rates (CTR) on your ads. You have 1000 queries, each of which has been issued 1000 times. Each query shows 10 ads and all ads are unique. Estimate the CTR for each ad.

A: This is my favorite interview question for a statistician. It doesn’t tackle one specific area, but gets at the depth of statistical knowledge they possess. Only good candidates receive this question. The candidate should immediately recognize this as a binomial trial, so the maximum likelihood estimator of the CTR is simply (# clicks)/(# impressions). This question is easily followed up by mentioning that click through rates are empirically very low, so this will estimate many CTRs at 0, which doesn’t really make sense. The candidate should then suggest altering the estimate by adding pseudo counts: (# clicks + 2)/(# impressions + 4). This is called the Wilson estimator and shrinks your estimate towards .5. Empirically, this does much better than the MLE. You should then ask if this can be interpreted in the context of Bayesian priors, to which they should respond, “Yes, this is equivalent to a prior of beta(2,2), which is the conjugate prior for the binomial distribution.”

The discussion can be led multiple places from here. You can discuss: a) other shrinkage estimators (this is an actual term in Statistics, not a Seinfeld reference; see Stein estimators for further reading), b) pooling results from similar queries, c) use of covariates (position, ad text, query length, etc.) to assist in prediction, and d) methods for prediction: logistic regression, complicated ML models, etc. A strong candidate can talk about this problem for at least 15 or 20 minutes.
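To make the estimators concrete, here is a throwaway sketch (the click counts are made up purely for illustration):

# compare the MLE and the pseudo-count estimate for a few hypothetical ads
clicks = [0, 1, 3, 25]
impressions = [1000, 1000, 1000, 1000]

for x, n in zip(clicks, impressions):
    mle = float(x) / n              # (# clicks) / (# impressions)
    shrunk = (x + 2.0) / (n + 4.0)  # pseudo counts, i.e. a Beta(2,2) prior
    print "clicks=%2d  MLE=%.4f  shrunken=%.4f" % (x, mle, shrunk)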

Q: Suppose you run a regression with 10 variables and 1 is significant at the 95% level. Suppose you then find 10% of the data had been left out randomly and had their y values deleted. How would you predict their y values?

A: I would be very careful about doing this unless it’s sensationally predictive. If one generates 10 variables of random noise and regresses them against white noise, there is a ~40% chance at least one will be significant at a 95% confidence level. This question helps me understand if the individual understands regression. I also usually ask about regression diagnostics and assumptions.
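The ~40% figure is just 1 - 0.95^10 ≈ 0.40, and you can confirm it by simulation (a quick sketch using numpy and scipy; the sample size of 100 is my own arbitrary choice):

import numpy as np
from scipy import stats

np.random.seed(42)
n, p, reps = 100, 10, 2000
hits = 0
for _ in range(reps):
    X = np.random.randn(n, p)              # 10 columns of pure noise
    y = np.random.randn(n)                 # the response is also pure noise
    X1 = np.column_stack([np.ones(n), X])  # add an intercept
    beta, _, _, _ = np.linalg.lstsq(X1, y)
    resid = y - X1.dot(beta)
    sigma2 = resid.dot(resid) / (n - p - 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X1.T.dot(X1))))
    pvals = 2 * stats.t.sf(np.abs(beta / se), df=n - p - 1)
    if (pvals[1:] < 0.05).any():           # ignore the intercept
        hits += 1

print "analytical:", 1 - 0.95 ** 10        # ~0.40
print "simulated: ", hits / float(reps)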

Q: Suppose you have the option to go into one of two bank branches. Branch one has 10 tellers, each with a separate queue of 10 customers, and branch two has 10 tellers, sharing one queue of 100 customers. Which do you choose?

A: This question establishes familiarity with a wide range of basic stat concepts: mean, variance, waiting times, central limit theorem, and the ability to model and then analyze a real world situation. Both options have the same mean wait time. The latter option has smaller variance, because you are averaging the wait times of 100 individuals before you rather than 10. One can fairly argue about utility functions and the merits of risk seeking behavior over risk averse behavior, but I’d go for same mean with smaller variance (think about how maddening it is when another line at the grocery store is faster than your own).
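A quick simulation makes the point about variance (exponential service times with mean 1 are my own assumption here, just to have something to simulate):

import numpy as np

np.random.seed(1)
reps, tellers, per_queue = 10000, 10, 10
service = np.random.exponential(1.0, size=(reps, tellers, per_queue))

# branch one: you join one of the 10 separate queues of 10 people at random
separate = service.sum(axis=2)[np.arange(reps), np.random.randint(tellers, size=reps)]

# branch two: one shared queue of 100; roughly, your wait is the total work
# spread across the 10 tellers
shared = service.reshape(reps, -1).sum(axis=1) / tellers

print "separate queues: mean %.2f, sd %.2f" % (separate.mean(), separate.std())
print "shared queue:    mean %.2f, sd %.2f" % (shared.mean(), shared.std())

Both means come out around 10 service times, while the shared queue’s standard deviation is roughly a third of the separate queue’s.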

Q: Explain how Random Forests differs from a normal regression tree.

A: This question establishes familiarity with two popular ML algorithms. “Normal” regression trees have some splitting rule based on the decrease in mean squared error or some other measure of error or misclassification. The tree grows until the next split decreases error by less than some threshold. This often leads to overfitting, and trees fit on data sets with large numbers of variables can completely leave out many variables. Random Forests are an ensemble of fully grown trees. For each tree, a bootstrap sample of the data is taken and a random subset of the variables is considered at each split; the trees are fit and their predictions averaged together. Generally this prevents overfitting and allows all variables to “shine”. If the candidate is familiar with Random Forests, they should also know about partial dependence plots and variable importance plots. I generally ask this question of candidates that I fear may not be up to speed with modern techniques. Some implementations do not grow trees fully, but the original implementation of Random Forests does.
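For candidates who prefer code to prose, here is a minimal sketch of the contrast using scikit-learn (toy data of my own invention; only the first two of ten variables carry signal):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

np.random.seed(0)
X = np.random.randn(500, 10)
y = X[:, 0] + 0.5 * X[:, 1] + 0.5 * np.random.randn(500)
X_test = np.random.randn(200, 10)
y_test = X_test[:, 0] + 0.5 * X_test[:, 1] + 0.5 * np.random.randn(200)

# a single fully grown regression tree tends to overfit
tree = DecisionTreeRegressor().fit(X, y)

# a Random Forest: many trees, each fit on a bootstrap sample, considering a
# random subset of variables at each split, then averaged
forest = RandomForestRegressor(n_estimators=100, max_features=3).fit(X, y)

print "single tree test MSE:   %.3f" % np.mean((tree.predict(X_test) - y_test) ** 2)
print "random forest test MSE: %.3f" % np.mean((forest.predict(X_test) - y_test) ** 2)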

Bad Interview Questions

The following are probability and intro stat questions that are not appropriate for a data scientist or statistician role. They should have learned this in intro statistics. These would be like asking an engineering candidate the complexity of binary search (O(log n)).

Q: Suppose you are playing a dice game: you roll a single die, then are given the option to re-roll it once after observing the outcome. What is the expected value of your final roll if you play optimally?

A: The expected value of a dice roll is 3.5 = (1+2+3+4+5+6)/6, so you should opt to re-roll only if the initial roll is a 1, 2, or 3. If you re-roll (which occurs with probability .5), the expected value of that roll is 3.5, so the expected value is:

4 * 1/6 + 5 * 1/6 + 6 * 1/6 + 3.5 * .5 = 4.25
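The arithmetic is easy to check in a couple of lines (a throwaway sketch):

# optimal strategy: keep a 4, 5, or 6; re-roll a 1, 2, or 3
single_roll_ev = sum(range(1, 7)) / 6.0             # 3.5
game_ev = (4 + 5 + 6) / 6.0 + 0.5 * single_roll_ev  # 4.25
print single_roll_ev, game_ev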

Q: Suppose you have two independent variables, X and Y, each with standard deviation 1. Then, X + Y has standard deviation 2. If instead X and Y had standard deviations of 3 and 4, what is the standard deviation of X + Y?

A: Variances of independent variables are additive, not standard deviations, so the first statement was a trick! sd(X+Y) = sqrt(Var(X+Y)) = sqrt(Var(X) + Var(Y)) = sqrt(sd(X)*sd(X) + sd(Y)*sd(Y)) = sqrt(3*3 + 4*4) = 5.
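A two-line numerical check (assuming X and Y are independent normals, which the additivity argument requires):

import numpy as np

np.random.seed(0)
x = np.random.normal(0, 3, 1000000)  # sd 3
y = np.random.normal(0, 4, 1000000)  # sd 4
print np.std(x + y)                  # ~5, not 7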

A few closing notes

Don’t ask anything about traversing a tree or graph structure that you learned in your algorithms class. This is a question for a software engineer, not a data scientist or statistician. If you are a software engineer interviewing a data scientist, ask your data scientist friends for questions beforehand. I do this when I interview software engineers, and it’s a much better experience for everyone involved. If you don’t know any data scientists, feel free to steal these or email me for more. Finally, I’d love to hear about your favorite interview questions, worst interview experiences, or anything else related to this topic.


50 signals used to compute your Klout score

Posted: August 3rd, 2011 | Author: | Filed under: Klout | 3 Comments »

In my ongoing quest to deconstruct Klout, I’ve decided to begin to tackle the question “How is my Klout computed?” by looking at what signals make up an individual’s score. Klout CEO Joe Fernandez stated that his company’s score is computed using at least 50 signals. This post is my best guess at those 50 variables. Below I have a breakdown of putative signals by source (Twitter, Facebook, LinkedIn, etc.).

Twitter

  1. followers
  2. following – followers
  3. total RT
  4. weighted total RT
  5. unique RT
  6. weighted unique RT
  7. RT/tweet
  8. @mentions
  9. weighted @mentions
  10. unique @mentioners
  11. weighted unique @mentioners
  12. @mentions/tweet
  13. weighted @mentions/tweet

Facebook

  1. friends
  2. total likes
  3. likes/post
  4. total comments
  5. comments/post

LinkedIn

  1. recommenders
  2. likes
  3. comments
  4. connections

The astute among you will notice that only 22 signals are mentioned above; however, this fails to account for time, one critical aspect of Klout. Below I have a plot of the Network Influence subscore of my Klout score from a few days ago.

You will notice a big drop towards the beginning of the plot. This occurred exactly one month after my initial Klout post that was tweeted by Robert Scoble. That was the point at which my Klout began to increase significantly (it has since decreased significantly). If we include each of the signals above over all time and over the past month, that yields 44 signals. In addition, the phrase “In the past 90 days” appears in the new Klout UI (pictured above), so I don’t think it’s a huge leap to infer that each signal is also used over a 90 day period as well, yielding 66 signals. Finally, Klout now allows you to connect your Foursquare and YouTube accounts, so I assume they are tracking friends, checkins, comments, mayorships, YouTube thumbs up, YouTube comments, etc., yielding an even larger signal total.

I don’t actually think that all of these “raw” signals are being used directly to calculate scores; that would be naive. I’m sure scores/totals are transformed (perhaps log), then normalized on a scale from 0 to 100, similar to the Klout score. I also think it’s likely that several of the individual signals listed above are multiplied or otherwise combined to create composite signals. As an example, it’s impressive if you are retweeted often OR if you receive many @ mentions; however, it’s super impressive/kloutastic if you are retweeted often AND receive many @ mentions. A composite signal may capture that interplay.
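To make the composite-signal idea concrete, here is a toy sketch of what such a feature might look like (entirely my speculation; the raw counts and network-wide maxima are made up):

import math

def normalize(value, max_value):
    # log-transform a raw count and scale it to 0-100, as speculated above
    return 100.0 * math.log(1 + value) / math.log(1 + max_value)

# made-up raw signals for one user, with made-up network-wide maxima
rt_score = normalize(120, 5000)      # total retweets
mention_score = normalize(80, 3000)  # unique @mentioners

# a composite signal that rewards users who are high on BOTH
composite = math.sqrt(rt_score * mention_score)
print rt_score, mention_score, composite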

In closing, I think someone with some time and access to the Klout API could use these signals to reconstruct the Klout algorithm. If you’d like to try, shoot me an email and I’d be happy to help in my spare time.