In this paper we track the evolution of baby names from 1882 through 2006. First, we demonstrate that names increase in popularity when other, phonetically similar names have been popular in previous years (the more you hear a sound, the more you like it). Second, we find the effect to be non-linear: over-popularity hurts adoption (think of how annoying it was to have 8 people named Alex in your class; names and sounds can be too popular). Third, we find that the strength of these relationships varies by position within the name. The effect is strong for the first phonemes/sounds and weaker for ending phonemes/sounds (often rhyming) and internal phonemes/sounds.
Finally, we confirm the initial study by considering the impact of hurricane names on baby name frequency the following year. Hurricane names are a good case study because they are picked ahead of time and are effectively an exogenous shock to the system. We find that increased severity (which correlates with increased mentions) yields larger increases in the frequency of similar names the following year.
This paper was a fun exercise in Bayesian hierarchical modeling. I'd call it a zero-inflated Poisson regression, with all the usual priors.
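The zero-inflated idea is easy to sketch: every count is either a "structural" zero (the name simply isn't in use) or a draw from a Poisson. A toy version of the likelihood, with made-up numbers and none of the paper's hierarchical structure:

```python
import math

def zip_log_pmf(y, lam, pi):
    """Log-probability of count y under a zero-inflated Poisson:
    with probability pi the count is a structural zero,
    otherwise y ~ Poisson(lam)."""
    pois = math.exp(-lam) * lam**y / math.factorial(y)
    if y == 0:
        # Zeros can come from either component, so they are "inflated".
        return math.log(pi + (1 - pi) * pois)
    return math.log((1 - pi) * pois)

# Toy example: a name given 0 times vs. 3 times in a year,
# with rate lam = 2.0 and structural-zero probability pi = 0.3.
print(zip_log_pmf(0, 2.0, 0.3))  # zeros are more likely than plain Poisson's exp(-2)
print(zip_log_pmf(3, 2.0, 0.3))
```

In the full model, lam and pi would themselves get regression structure and priors; this is just the building block.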
If you want to build models like this or read up on Bayesian Statistics, here are 3 suggested books:
Bayesian Data Analysis – This is THE book in Bayesian statistics. It's a bit theoretical for those without a math background, but it is amazing, and if you want to really learn Bayesian stats you should start here.
This week there has been tremendous (even excessive) buzz in the popular media about the Higgs Boson (aka the G-d Particle). Though the result was unofficially reported back in December, the official announcement was still an exciting day for the physics community and general science nerds like me.
What struck me most about the article was the reason for the timing of the announcement. The two teams of 3,000+ scientists attempting to experimentally “prove” the existence of this particle decided to wait until, “the likelihood that their signal was a result of a chance fluctuation was less than one chance in 3.5 million.”
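That 1-in-3.5-million figure is the particle-physics convention of a "five sigma" discovery threshold: the one-sided tail probability of a standard normal beyond five standard deviations. A quick sanity check:

```python
import math

# One-sided upper-tail probability of a standard normal beyond 5 sigma.
sigma = 5
p = math.erfc(sigma / math.sqrt(2)) / 2
print(p)      # ~2.87e-07
print(1 / p)  # ~3.5 million
```

For comparison, the usual 1-in-20 standard corresponds to less than two sigma.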
While this may seem a bit too cautious to some, people with a Ph.D. in Physics are generally smarter than either of us, so we should trust them on how to define the correct standard of proof, right? Okay, does one in a million sound more reasonable? One in 1,000? The current accepted standard across essentially every other academic discipline, the US legal system, pharmaceutical companies, and the FDA is 1 in 20. I've ranted about the definition of statistical significance before, but think of it this way: if one in every 20 things your friend told you was wrong, would you trust them? If one in every 20 medications or vitamins you took had harmful and unreported side effects, would you still take them? If one in every 20 natural gas pockets that are mined leaked poison into your drinking water, would you let them drill in your backyard?
While these examples are obviously artificial, they raise the question: is the 1 in 20 standard stringent enough? If scientists use a one in 3.5 million standard to announce the existence of a particle, something which honestly doesn't directly impact any of us at all, why are we using a 1 in 20 standard for deciding which drugs are safe, something which could sicken, injure, or kill us or our loved ones? I think that's one solid reason to reevaluate, but here are several billion others: Diet Drug Wins FDA's Approval. Pharmaceutical companies are heavily incentivized to prove that their drugs are safe; otherwise they lose millions in research and development. Even if you believe the 1 in 20 threshold is stringent enough, it needs to be properly followed, which I'm not convinced is always the case. Not only are many drugs later taken off the market, but pharmaceutical companies are routinely fined for illegal marketing and false claims about their drugs. GlaxoSmithKline was just slapped with a several-billion-dollar fine, the largest of its kind.
There isn’t a magical threshold I’m suggesting for your statistical tests. In my own professional life and research, I tend to use 1 in 100 (p-value of .01), but it changes drastically based on the particular application. Keep the following things in mind when you are deciding your own statistical burden of proof:
What are the implications of a false positive? If it's not a big deal, a lower threshold is probably fine (e.g., testing alternate website designs or fantasy sports predictions).
Consider the amount of data you have. Are you not seeing statistical significance because you don't have much data yet, or do you have tons of data and the correlation just doesn't seem to be there? Plotting your significance level over time can help with this, though in general it isn't very precise. Does it look like it's converging smoothly, or is it "freaking out"?
Add a few random noise variables as predictors. If they are more significant than the variable you care about, be skeptical.
Think about how many variables you have. The more you have, the more likely one will spuriously appear "significant." Use Bonferroni (though this is almost always too aggressive), Scheffé, or other intervals to adjust for the problem of "multiple comparisons."
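The multiple-comparisons problem is easy to see by simulation: with 20 pure-noise predictors tested at the 1-in-20 level, you should expect roughly one false positive. A quick sketch (illustrative simulation, not a real analysis):

```python
import random

# With m independent null tests at level alpha, the chance of at least one
# false positive is 1 - (1 - alpha)**m. Bonferroni keeps it near alpha by
# testing each variable at alpha / m instead.
random.seed(0)
m, alpha, trials = 20, 0.05, 10_000

def any_significant(level):
    # Under the null, each test's p-value is Uniform(0, 1).
    return any(random.random() < level for _ in range(m))

naive = sum(any_significant(alpha) for _ in range(trials)) / trials
bonf = sum(any_significant(alpha / m) for _ in range(trials)) / trials
print(naive)  # close to 1 - 0.95**20, about 0.64
print(bonf)   # close to alpha = 0.05
```

With 20 noise variables, an unadjusted analysis flags at least one "significant" predictor about two-thirds of the time.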
The Fibonacci sequence is a good approximation for the conversion between miles and kilometers.
1 mile ≈ 1.61 km
2 miles ≈ 3.22 km
3 miles ≈ 4.83 km
5 miles ≈ 8.05 km
8 miles ≈ 12.87 km
13 miles ≈ 20.92 km
You can generate the sequence on your own by starting with two 1's and then adding the last two numbers together to get the next. The ratio between successive numbers in the sequence converges to the Golden Ratio, φ = (1 + √5)/2 ≈ 1.618, which is close to the true conversion factor of 1.609 kilometers per mile:
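A few lines of code make the trick concrete: each Fibonacci number, read in miles, is approximated in kilometers by the next one.

```python
# Generate Fibonacci numbers and compare successive terms to the
# true miles-to-kilometers factor, 1.609344.
fib = [1, 1]
while fib[-1] < 100:
    fib.append(fib[-1] + fib[-2])

for miles, km in zip(fib, fib[1:]):
    print(f"{miles} miles ≈ {km} km (true: {miles * 1.609344:.2f} km)")

# The ratio of successive terms converges to phi ≈ 1.618, close enough
# to 1.609 for back-of-the-envelope conversions.
print(fib[-1] / fib[-2])
```

The approximation slightly overestimates (1.618 vs. 1.609), but the error stays under 1%.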
As the lone statistician in my workplace, I end up introducing many people to R. After the inevitable pirate jokes, my coworkers who program in real languages (C++, Java, Python, PHP, etc.) ultimately end up complaining about R, which does a couple of things very well and a lot of things VERY poorly. Each has complained about data frames and reading data into R.
For those that don't know, the default return type of read.csv is not a matrix or list; it's a data frame, which takes up far more memory than it should and converts all strings to factors for easier use in regressions. This is dumb and, in my experience, can increase the read time and memory footprint by 5-10x. For a quick fix, set stringsAsFactors=FALSE. If you don't have column headings, which I normally don't, set header=FALSE as well:
data = read.csv("datafile.csv", header=FALSE, stringsAsFactors=FALSE)
First, as implied by the title of the article, games are decreasing in popularity on Android as a share of total downloads, while that same share is increasing on iOS. In December, games were 36.1% of iTunes downloads and 22% of Android downloads.
Next, I wanted to tackle some misconceptions about app pricing on the two platforms. Paid apps are an almost negligible proportion of downloads on Android (where they hover around 3-4%). Consequently, the average "app purchase price" (shown below) is quite low compared to iOS, where the proportion of paid app downloads is 6 to 10 times as high.
The above plot is misleading because it hides two important facts:
$.99 apps are a VERY large proportion of iOS app downloads
a relatively larger proportion of app downloads on Android are at "premium" price points, due to the relative lack of Android apps at price points under $1
As a result, average app price, conditional on non-free apps, is actually higher on Android.
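The conditional-average effect is pure arithmetic, and a toy example (made-up numbers, not App Annie data) shows how both claims can be true at once:

```python
# Illustrative numbers only: a platform where almost all downloads are free
# drags the unconditional "average price" toward zero, even if its paid
# apps are pricier on average.
downloads = {
    "iOS":     {"free": 80, "paid_prices": [0.99] * 15 + [2.99] * 5},
    "Android": {"free": 97, "paid_prices": [2.99] * 2 + [4.99] * 1},
}

results = {}
for platform, d in downloads.items():
    prices = d["paid_prices"]
    total = d["free"] + len(prices)
    overall = sum(prices) / total          # averaged over ALL downloads
    paid_only = sum(prices) / len(prices)  # conditional on actually paying
    results[platform] = (overall, paid_only)
    print(f"{platform}: overall {overall:.2f}, paid-only {paid_only:.2f}")
```

With these numbers, iOS wins on the unconditional average while Android wins conditional on a purchase, which is exactly the pattern in the real data.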
The point of this article isn’t to steer developers of apps (premium or otherwise) to or away from either platform, each of which has its strengths. You can read up on monetization of platforms here and here (one article is very pro-iOS, the other very pro-Android). Rather, I wanted to reinforce a basic lesson from Stat 101: averages can be very misleading.
I’m happy to announce my most recent publication, “A Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics,” in the Journal of Quantitative Analysis in Sports. Though this might sound boring unless you are a baseball fan and/or a Bayesian (and perhaps even then), the paper is fundamentally about how to choose which metrics are predictive, a topic anyone in statistics, analytics, or any other data-driven field should care deeply about.
I’ll try to motivate this in as general a setting as possible. Suppose you have some metric (batting average, earnings, engagement) for a population of individuals (baseball players, businesses, users of your product) over several time periods. A traditional random effects model estimates an intercept term for every individual. In some situations, the assumption that every individual truly differs from the average is unrealistic.
Often populations contain individuals who are indistinguishable from average, meaning it's better to estimate their value with the overall mean than with their own data. This implies the metric is not predictive for those individuals. By definition, those not in this group are systematically high or low relative to the average. Examples of the second group include Barry Bonds, who always hit more home runs and took more steroids than average, Warren Buffett and Berkshire Hathaway, who always made better investments than average, or my Google friends' use of Facebook since the release of Google+, which is systematically lower than average. This is best visualized by two distributions: the black spike with all its probability at the overall mean (average individuals), and the red distribution with most of its probability far above or below this value (non-average individuals).
Once we have the probability that each individual belongs to each of the two categories for each metric, we can call a metric predictive if: a) most individuals are systematically different from the average, and b) most of the metric's variance is explained by the model. Finally, the obligatory plot showing our method performs at least as well on a holdout sample as other methods for the 50 metrics tested:
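The spike-vs-slab classification can be sketched as a two-component posterior calculation. This is an illustrative simplification with made-up numbers, not the paper's actual MCMC, and every parameter value below is hypothetical:

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def prob_nonaverage(obs_mean, n, grand_mean, noise_sd, slab_sd, prior_slab=0.5):
    """Posterior probability that an individual belongs to the 'slab'
    (systematically non-average) component of a point-mass mixture.
    Toy two-component calculation, not the full hierarchical model."""
    se = noise_sd / math.sqrt(n)  # sampling error of the individual's mean
    # Spike: the individual's true effect is exactly the grand mean.
    like_spike = normal_pdf(obs_mean, grand_mean, se)
    # Slab: the true effect is drawn from a wide normal around the grand mean.
    like_slab = normal_pdf(obs_mean, grand_mean, math.sqrt(se**2 + slab_sd**2))
    return prior_slab * like_slab / (
        prior_slab * like_slab + (1 - prior_slab) * like_spike)

# Hypothetical hitters over ~500 at-bats: one far above the league mean of
# .260, one right at it. The first lands in the slab; the second is better
# shrunk all the way to the mean.
print(prob_nonaverage(0.320, n=500, grand_mean=0.260, noise_sd=0.45, slab_sd=0.02))
print(prob_nonaverage(0.262, n=500, grand_mean=0.260, noise_sd=0.45, slab_sd=0.02))
```

A metric for which almost everyone gets a low slab probability is, in this framing, simply not predictive.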
For those interested (which should be everyone but in practice is almost no one), our method also automatically controls for multiple testing, as we perform 1,575 tests in our analysis.
This paper was co-authored with Blake McShane, James Piette, and Shane Jensen, and can be viewed as a more technical companion piece to our previous paper “A Point-Mass Mixture Random Effects Model for Pitching Metrics” which can be downloaded here. The poorly commented python code for the MCMC sampler can be found here. If you’re interested in implementing or tweaking our methodology, feel free to send me an email or reach out on Twitter.
Klout has been hyping up its score changes for a week now. CEO Joe Fernandez has claimed that this makes the score more accurate, more transparent, and may cure some forms of cancer (well, maybe not the last claim). Let's just say I haven't been this disappointed since the 2000 election. Let's start with their first claim, accuracy. See Figure 1, my new score:
It’s exactly the same graphic as before, but with two decimal places. While my 8th grade Chemistry teacher may be glad that they are using more significant digits, I honestly don’t care. They were there before, just not displayed. Lame.
In their blog post, they claim: “This project represents the biggest step forward in accuracy, transparency and our technology in Klout’s history.” They support this vague claim with the histogram below, showing the differences in Klout scores, before and after the change:
This histogram leaves tons of open questions. Is this different from your normal daily shift in scores? The histogram reminds me of a t-distribution with a fatter positive tail. If more people are signing up for Klout than are leaving, that's probably what it should look like anyway, as users hook up more networks and gradually become more active online. The graphic doesn't show that your score is any better, just that it changed. That's not impressive at all.
My beef with Klout remains simply that the service provides us with no real validation or explanation of our scores. They don't show us how many times we have been RT'ed, mentioned, etc. On Google, you can look up your PageRank; on app stores, you can see your average rating and number of ratings; on Klout, you are told that your true reach has increased, but not what that implies or how you can verify it.
Klout is still the social influence measurement leader, but with Peerindex rapidly improving (and better in many ways in my opinion), and new competitors such as Proskore and Kred popping up, Klout should be worried. I’ll have a review of both Proskore and Kred up shortly as well so you can easily compare them for yourself.
After taking a few weeks off from reaming Klout, their newest "improvements" have left me with no choice but to write a sardonic and snarky response. Klout has added 5 new services (Instagram, Flickr, tumblr, Last.fm, and Blogger) and removed ALL secondary statistics from our profile pages. I'm still not sure which is worse, just that both are stupid. I'll start by criticizing the addition of new services with a simulated conversation between Klout and myself.
Part 1: A conversation with Klout about their new signals
Alex: This brings the total services to 10. Really, Klout, you need 10 services?
Klout: Of course! This will help make your Klout score even better!
Alex: But you didn't do a good job with just Twitter and Facebook; how can I expect you to do a good job with 10?
Klout: More data always improves the performance of complicated, black-box, machine learning algorithms like our own.
Alex: That's actually false.
Klout: Ummmm, look dude, I'm just a data whore and want to sell your data to the man.
Alex: So you just want all of my data to sell to the man, and give me nothing in return?
Klout: We actually have a terrific Klout Perks program. I see you've received two Klout Perks.
Alex: Yup, you sent me a Rizzoli and Isles gift pack, for a TV show on a network I don't have and literally hadn't heard of before receiving the gift pack. Did I mention that the gift pack came with handcuff earrings?
Klout: But what about your other Klout Perk, a party at Rolo, a store in SF that sells jeans? Careful analysis of your Twitter, Facebook, Foursquare, and LinkedIn data led us to believe that you like or wear jeans.
Alex: Everyone wears jeans. That's similar to predicting that I like to go on vacation or eat tasty food. The jeans happened to be $175, which doesn't sound like much of a perk to me.
On top of this, Android users can't even connect their Klout accounts to Instagram, because the app is iPhone-only. Ironically, the Klout blog just posted about the average Klout scores of iPhone and Android users, finding the former beat out the latter 42.0 to 40.6. Perhaps the comparison would be more equal if Android users were allowed to connect 10 services rather than 9? Does MG Siegler actually need more Klout?
Part 2: Klout removes any accountability from website
Finally, let's discuss the complete lack of transparency imposed by their recent banishment of the majority of profile stats. Here is a screenshot of my "Network Influence" before:
You will notice that the supporting stats are gone. Though this absence makes it much harder for me to criticize the inconsistencies in their score, it also takes away most of the utility I received from Klout. Most people don't have access to this info unless they run their own Twitter analytics; that's one of the reasons Klout was cool. It indulges my latent nerd narcissism. How many RTs do I get? How many @ mentions? How many unique ones? Now I just get a number with little explanation. Luckily, Klout competitor Peerindex still surfaces much of that info:
From Klout's point of view, I completely understand why they would want to add more services: greater reach, more data, more partners, etc. I suppose they could justify the removal of more specific stats by saying that the main page could get too crowded, but then put the data on another page; don't take it away. Twitter and Facebook still drive the large majority of usage. Do you really think Blogger cares if its stats aren't on the main page? Seems nefarious to me.