From Karen to Katie: Using Baby Names to Study Cultural Evolution

Posted: July 10th, 2012 | Author: | Filed under: Statistics | No Comments »

I’m happy to announce that my paper From Karen to Katie: Using Baby Names to Understand Cultural Evolution has been accepted for publication in the journal Psychological Science. The article also received some press coverage in Time magazine.

In this paper we track the evolution of baby names from 1882 through 2006. First, we demonstrate that names increase in popularity when other, phonetically similar names have been popular in previous years (the more you hear a sound, the more you like it). Second, we found the effect to be non-linear: over-popularity hurt adoption (think of how annoying it was to have 8 people named Alex in your class; names and sounds can be too popular). Third, the strength of these relationships varied by position within the name: the effect was strongest for the first phonemes/sounds and weaker for ending phonemes/sounds (often rhyming) and internal phonemes/sounds.

Finally, we confirmed the initial study by considering the impact of hurricane names on baby name frequency the following year. Hurricane names are a good case study because they are picked ahead of time and are effectively an exogenous shock to the system. We found that increased severity (which correlates with increased mentions) yielded larger increases in the frequency of similar names the following year.

This paper was a fun exercise in Bayesian hierarchical modeling. I’ll call it a zero-inflated Poisson regression, with all the usual priors.
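The paper’s full hierarchical model isn’t reproduced here, but a minimal Python sketch (numpy only, with made-up parameters `pi` and `lam`) shows what makes count data “zero-inflated”: a mixture of structural zeros and ordinary Poisson counts.

```python
import numpy as np

# A zero-inflated Poisson mixes a point mass at zero with a Poisson:
# with probability pi the count is forced to zero, otherwise it is Poisson(lam).
def sample_zip(pi, lam, n, rng):
    inflated = rng.random(n) < pi          # structural zeros
    counts = rng.poisson(lam, size=n)      # ordinary Poisson counts
    counts[inflated] = 0
    return counts

rng = np.random.default_rng(0)
zip_counts = sample_zip(pi=0.3, lam=2.0, n=100_000, rng=rng)
pois_counts = rng.poisson(2.0, size=100_000)

# The ZIP sample has far more zeros than a plain Poisson with the same rate:
print((zip_counts == 0).mean())   # roughly 0.3 + 0.7 * exp(-2), about 0.39
print((pois_counts == 0).mean())  # roughly exp(-2), about 0.14
```

The actual paper layers hierarchical priors on top of a regression for the rate; this only illustrates the shape of the likelihood.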

We received a shout out from Wharton, which is always fun.

If you want to build models like this or read up on Bayesian Statistics, here are 3 suggested books:


Convenient Definitions of Statistical Significance

Posted: July 9th, 2012 | Author: | Filed under: Statistics | No Comments »

This week there has been tremendous (even excessive) buzz in the popular media about the Higgs Boson (aka The G-d Particle). Though unofficially reported back in December, it was still an exciting day for the Physics community and general science nerds like me.

What struck me most about the article was the reason for the timing of the announcement. The two teams of 3,000+ scientists attempting to experimentally “prove” the existence of this particle decided to wait until, “the likelihood that their signal was a result of a chance fluctuation was less than one chance in 3.5 million.”

While this may seem a bit too cautious to some, people with a PhD in Physics are generally smarter than the rest of us, so we should trust them on how to define the correct standard of proof, right? Okay, does one in 1 million sound more reasonable? One in 1,000? The currently accepted standard across essentially every other academic discipline, the US legal system, pharmaceutical companies, and the FDA is 1 in 20. I’ve ranted about the definition of statistical significance before, but think of it this way. If one in every 20 things your friend told you was wrong, would you trust them? If one in every 20 medications or vitamins you took had harmful and unreported side effects, would you still take them? If one in every 20 natural gas pockets that are mined leaked poison into your drinking water, would you let them drill in your backyard?

While these examples are obviously artificial, they raise the question: is the 1 in 20 standard stringent enough? If scientists use a one in 3.5 million standard to announce the existence of a particle, something which honestly doesn’t directly impact any of us at all, why are we using a 1 in 20 standard for deciding which drugs are safe, something which could sicken, injure, or kill us or our loved ones? I think that’s one solid reason to reevaluate, but here are several billion others: Diet Drug Wins FDA’s Approval. Pharmaceutical companies are heavily incentivized to prove that their drugs are safe; otherwise they lose millions in research and development. Even if you believe the 1 in 20 threshold is stringent enough, it needs to be properly followed, which I’m not convinced is always the case. Not only are many drugs later taken off the market, but pharmaceutical companies are routinely fined for illegal marketing and false claims about their drugs. GlaxoSmithKline was just slapped with a several-billion-dollar fine, the largest of its kind.

There isn’t a magical threshold I’m suggesting for your statistical tests. In my own professional life and research, I tend to use 1 in 100 (p-value of .01), but it changes drastically based on the particular application. Keep the following things in mind when you are deciding your own statistical burden of proof:

  • What are the implications of a false positive? If it’s not a big deal, a looser threshold is probably fine (e.g., testing alternate website designs or fantasy sports predictions)
  • Consider the amount of data you have. Are you not seeing statistical significance because you don’t have much data yet, or do you have tons of data and the correlation just doesn’t seem to be there? Plotting your significance level over time can help with this, though in general it isn’t very precise. Does it look like it’s converging smoothly, or is it “freaking out?”
  • Add a few random noise variables as predictors. If they are more significant than the variable you care about, be skeptical.
  • Think about how many variables you have. The more you have, the more likely one will spuriously be “significant.” Use Bonferroni (though this is almost always too aggressive), Scheffé, or other intervals to adjust for the problem of “multiple comparisons.”
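The last point can be made concrete with a quick Python simulation (numpy, illustrative numbers only): with 20 predictors and no real effects anywhere, a naive 1-in-20 cutoff flags at least one “significant” predictor most of the time, while the Bonferroni-corrected cutoff keeps the false-positive rate near 5%.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 20          # number of predictors tested
alpha = 0.05
trials = 100_000

# Under the null hypothesis, p-values are uniform on [0, 1];
# simulate m independent tests per trial.
pvals = rng.random((trials, m))

# Chance of at least one "significant" result when there is no real effect
fp_naive = (pvals.min(axis=1) < alpha).mean()
fp_bonf = (pvals.min(axis=1) < alpha / m).mean()

print(fp_naive)  # about 1 - 0.95**20, i.e. roughly 0.64
print(fp_bonf)   # held near alpha, roughly 0.05
```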

Happy testing!


Fibonacci Sequence as a conversion for miles and km

Posted: July 7th, 2012 | Author: | Filed under: Statistics | No Comments »

The Fibonacci sequence is a good approximation for the conversion between miles and kilometers.

1,1,2,3,5,8,13,21…

1 mile = 1.61 km
2 miles = 3.22 km
3 miles = 4.83 km
5 miles = 8.05 km
8 miles = 12.87 km
13 miles = 20.92 km

You can generate the sequence on your own by starting with two 1s and then adding together the last two numbers to get the next. The ratio between consecutive numbers in the sequence converges to the Golden Ratio, phi:

φ = (1 + √5)/2 ≈ 1.6180339887

which is of course numerically very close to the actual conversion ratio of miles to kilometers, 1.609344.
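The trick is easy to check in a short Python sketch (the helper `fib_upto` is just for illustration): each Fibonacci number, read as miles, is approximated by the next one read as kilometers, since the ratio of consecutive terms approaches phi.

```python
def fib_upto(n):
    """First n Fibonacci numbers, starting 1, 1."""
    seq = [1, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

KM_PER_MILE = 1.609344
seq = fib_upto(8)
# each term read as miles vs. the next term as the km estimate
for miles, approx_km in zip(seq, seq[1:]):
    actual_km = miles * KM_PER_MILE
    print(f"{miles} mi ~ {approx_km} km (actual {actual_km:.2f} km)")
```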


My brother is amazing

Posted: June 7th, 2012 | Author: | Filed under: shameless self promotion | No Comments »

My brother, Michael Braunstein, graduated from the Ringling College of Art and Design in May, with a degree in Illustration.

Michael just put his first website up, so I wanted to give him a web shout out. As you can see, Michael Braunstein’s art is incredible and spans a wide range of styles and mediums. My favorite piece of Michael’s is this yawning lion.

It’s also neat to see Michael’s in-progress art, as I usually only see the finished product.

My brother is currently freelancing, so make sure to contact Michael Braunstein before he’s completely booked.


Reading big files in R

Posted: April 12th, 2012 | Author: | Filed under: R, Statistics | No Comments »

As the lone statistician in my workplace, I end up introducing many people to R. After the inevitable pirate jokes, my coworkers who program in real languages (C++, Java, Python, PHP, etc.) ultimately end up complaining about R, which does a couple of things very well and a lot of things VERY poorly. Each has complained about data frames and reading data into R.

For those that don’t know, the default data type read.csv produces is not an array or list; it’s a data frame, which takes up far more memory than it should and converts all strings to factors for easier use in regression. This is dumb, and in my experience it will increase the time it takes to read in the file, and the memory it takes up, by 5-10x. For a quick fix, set stringsAsFactors=F. If you don’t have column headings, which I normally don’t, set header=F as well:

data = read.csv("datafile.csv", header=F, stringsAsFactors=F)


Chomp App Search Analytics Year End Summary 2011

Posted: February 7th, 2012 | Author: | Filed under: apps, Chomp, Google, Statistics | No Comments »

My most recent App Search Analytics report from Chomp was written up in TechCrunch. Sarah Perez wrote a fantastic summary in her article, Games Decreasing In Popularity On Android, Entertainment Apps On The Rise, but I wanted to emphasize the most interesting points.

First, as implied by the title of the article, games are decreasing in popularity on Android as a share of total downloads, while that same share is increasing on iOS. In December, games were 36.1% of iTunes downloads and 22% of Android downloads.

Next, I wanted to tackle some misconceptions about app pricing on the two platforms. As a proportion, paid apps are an almost negligible proportion of downloads on Android (where they hover around 3-4%). Consequently, average “app purchase price” (shown below) is quite low compared to iOS, where the proportion of paid app downloads is between 6 and 10 times as high.

The above plot is misleading because it hides two important facts:

  • $.99 apps are a VERY large proportion of iOS app downloads
  • a relatively larger proportion of app downloads on Android are at “premium” price points, due to the relative lack of apps at price points below $1

As a result, average app price, conditional on non-free apps, is actually higher on Android.

The point of this article isn’t to steer developers of apps (premium or otherwise) to or away from either platform, each of which has its strengths. You can read up on monetization of platforms here and here (one article is very pro-iOS, the other very pro-Android). Rather, I wanted to reinforce a basic lesson from Stat 101: averages can be very misleading.
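A tiny Python example with hypothetical numbers (NOT Chomp’s actual data) shows how both claims above can be true at once: including free apps, the iOS average price is higher; conditional on paid apps only, the Android average is higher.

```python
# Hypothetical download counts by price point (illustrative only):
# lots of free and $0.99 downloads on iOS, almost all free on Android.
ios = {0.00: 900, 0.99: 80, 4.99: 20}        # price -> downloads
android = {0.00: 965, 0.99: 10, 4.99: 25}

def avg_price(downloads, paid_only=False):
    items = [(p, n) for p, n in downloads.items() if not paid_only or p > 0]
    total = sum(n for _, n in items)
    return sum(p * n for p, n in items) / total

# Overall average (free apps included): iOS comes out higher
print(avg_price(ios), avg_price(android))
# Conditional on paid apps only: Android comes out higher
print(avg_price(ios, paid_only=True), avg_price(android, paid_only=True))
```

Same data, opposite rankings, depending entirely on the conditioning.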


Quotes from Alex on Social Media and the Super Bowl

Posted: February 7th, 2012 | Author: | Filed under: shameless self promotion | No Comments »

I had the privilege of being interviewed for an article titled Sports are going social, and it’s a winning combination. It’s not really super relevant to anything else on my blog, but I have a quote in there about why sports generate so much intense interest/conversation:

“We endear ourselves to the teams and even use terms such as ‘we’ and ‘us’ when talking about their most recent triumph or failure. It’s natural that something which creates such a deep emotional reaction yields so much conversation both in person and in social networks and social apps.”

and another one on why athletes and particularly fans engage with social networks:

“Rather than simply reading about players online, collecting cards, or wearing their jerseys, fans can follow their favorite players on Twitter, subscribe to their Facebook feed, or even see where they check in on Foursquare. I see this trend not only continuing, but accelerating in the coming year as social media continues to become more ubiquitous.”

Just thought it was cool and would make my mom proud.


November App Search Analytics

Posted: December 6th, 2011 | Author: | Filed under: apps, Chomp, Statistics | No Comments »

The November Chomp App Search Analytics report is out! The official Chomp blog post can be found here. The quick summary is:

  • search traffic for the terms shopping and discounts spiked more than 1300% and 3000%, respectively, on Black Friday and returned to normal on Cyber Monday
  • I highlight tons of great Holiday Apps and Games to keep you busy through the season
  • paid downloads were up 7% on iPhone

More details can be found in the report below:

Chomp Charts November


Legends and Dates in R plots

Posted: December 5th, 2011 | Author: | Filed under: Uncategorized | 1 Comment »

After looking up how to create a legend using ?legend and searching the R forums for the 86586586th time, I’ve decided to write my own post with a few examples and tricks I’ve picked up. I also provide example code for using dates as an x-axis.

I do most of my heavy computation in Python, leaving R primarily for making pretty plots, exploratory data analysis (EDA) when I first get my hands on a data set, and using my favorite R packages/functions that I’ll never implement on my own (e.g., Random Forests, CART, SVM). Below is an example plot with two sets of numbers, a legend, and dates on the axis. Hopefully this is more helpful than the R documentation.

# pick a length, and generate two random normal series of this length
len <- 43
vals <- rnorm(len, 0, 1)
vals2 <- rnorm(len, 0, .5)

# pick an initial date in the form YYYYMMDD, then generate a year's worth of weekly dates
date <- 20110201
mydates <- as.Date(as.character(date), "%Y%m%d")
for (i in 1:52) {
  mydates <- c(mydates, mydates[length(mydates)] + 7)
}
# rename something shorter, keeping just enough dates to match the data
x <- mydates[2:(len + 1)]

# set graphical parameters and plot both series; xaxt="n" suppresses the default x-axis
par(mfrow = c(1, 1))
plot(vals, type = "l", col = "blue", xaxt = "n", ylab = "y axis label", xlab = "", main = "Plot Title")
lines(vals2, col = "red")
# add back an x-axis with dates; las and cex.axis set the direction and size of the labels
axis(1, at = 1:len, labels = x, las = 2, cex.axis = .9)
# add a legend; lwd sets line width; you can use x,y coordinates instead of "bottomleft"
legend("bottomleft", c("thing1", "thing2"), col = c("blue", "red"), lwd = 1, title = "legend title")


Best Noodle Kugel Recipe

Posted: December 4th, 2011 | Author: | Filed under: recipes | 1 Comment »

I’m still powering through all the recipe posts I wanted to do after Rosh Hashanah, so here’s the latest: the best noodle kugel recipe ever. It’s my grandmother’s, so I know that slightly biases me, but I’ve had at least 10 people tell me it’s better than their grandmothers’. First, here’s a pic of all the ingredients:

The full ingredient list is:

  • 1 pkg egg noodles (cook, drain, put in a large pot, and add the rest)
  • 1/2 cup sugar
  • 1 cup sour cream
  • 1 container cottage cheese
  • 3 eggs, beaten
  • 1 sm can crushed pineapple, with juice
  • 1 stick melted margarine (butter)
  • 1 apple, grated (thick)
  • 1 teaspoon vanilla
  • 1.5 teaspoons cinnamon

Beat the eggs, then add all the other ingredients, cottage cheese last. When fully mixed, the ingredients should smell like kugel and have roughly this consistency:

Mix well, put in a large baking pan, sprinkle with crushed cornflake crumbs, and dot with butter. Bake 1 hour at 350 and it should look like this:

Last, but certainly not least, here’s a pic of my grandmother (right) and me, as it’s her recipe: