Yes, I get the joke. I understand that Tyler Vigen is being cute by publishing an automatically generated series of spurious correlations (and visualizations) of time-series data that he collected on the internet. But two quick points about this website that has gone viral. One, it’s bad as a humor piece. Two, it’s bad for science. (It’s obviously not science, so there’s really no evaluation necessary on its scientific merits.) I suspect Vigen is a believer in science and well-intentioned, so I hope he’s paying attention. (At the end of the two-part rant, I’ve also added a didactic appendix about one way to avoid making inferences from spurious correlations.)
1. Satire is a good thing.
“If you live to be one hundred, you’ve got it made. Very few people die past that age.” ― George Burns
There are many witty jokes about statistics. Concerning prior art when it comes to exposing spurious correlation plots, human rights writer and blogger Filip Spagnoli did pretty much this exact thing five years ago.
A never-ending supply of pointless correlations, however, is a bit obnoxious.
2. Vigen says this:
“The charts on this site aren’t meant to imply causation nor are they meant to create a distrust for research or even correlative data. Rather, I hope this projects fosters interest in statistics and numerical research.”
Well, that’s a nice hope. But the fact on the ground is that politicians, news reporters, and regular folks alike frequently make anti-statistics and/or anti-science arguments, the gist of which is: quantitative analyses are a sham because you can tell any story you want with [scare-quotes optional] STATISTICS. Witness the climate change debate. The scientific method is reduced to fishing in a bucket of data for just the (spurious) correlation that makes your point. Right?
Wrong. Scientists and statisticians are acutely concerned about making erroneous inferences, which nevertheless does happen sometimes. Still, the framework of hypothesis testing is almost as old as statistics itself and provides many ways to discredit inferences drawn from fishing around in a bucket of data like Vigen does. One of the simplest of these is called the Bonferroni correction. If Vigen had published a Bonferroni-corrected p-value with each correlation, well, let’s just say the website may not have been as popular.
As it is, Vigen unwittingly adds fuel to the anti-science fire by turning correlation analysis into a (bad) joke. It works as a hit-generating machine for his website, but the cheap thrills come at the expense of informed debate on serious matters.
Reporting correlations that are obviously not statistically significant and then backing away from significance claims is disingenuous. It is no different from introducing misleading evidence into a criminal trial. Even if the jury is instructed to ignore the inadmissible evidence, the damage is done.
Appendix: What’s a Bonferroni correction?
Wolfram offers a pretty good and short explanation. What follows is a simplified non-mathematical description of significance along with a worked example from Vigen’s data.
The key idea behind much of hypothesis testing is making the following very general argument: it’s unlikely that the coincidence/correlation of THIS and THAT occurred by random chance. Thus, there is probably SOMETHING going on, rather than NOTHING. Chance coincidences are made more likely by either (a) small observation samples or (b) multiple comparisons.
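Point (a) is easy to see in a toy simulation. The sketch below is not Vigen’s data or method: it draws pairs of completely unrelated random series (an assumption made purely for illustration, as are the cutoff of 0.5 and the sample sizes of 5 and 50) and counts how often they look strongly correlated anyway.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chance_rate(sample_size, cutoff, trials, rng):
    """Fraction of trials where two unrelated series reach |r| >= cutoff."""
    hits = 0
    for _ in range(trials):
        # Independent standard-normal draws: NOTHING is going on by construction.
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        if abs(pearson_r(xs, ys)) >= cutoff:
            hits += 1
    return hits / trials


rng = random.Random(0)  # fixed seed so the run is repeatable
rate_small = chance_rate(sample_size=5, cutoff=0.5, trials=2000, rng=rng)
rate_large = chance_rate(sample_size=50, cutoff=0.5, trials=2000, rng=rng)
print(f"|r| >= 0.5 by chance, n=5:  {rate_small:.1%}")
print(f"|r| >= 0.5 by chance, n=50: {rate_large:.1%}")
```

With only a handful of points per series, a sizeable correlation between unrelated variables is routine; with fifty points it becomes rare.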
On Vigen’s website, you can pick one variable, say Number of films Nicolas Cage appeared in (CageFilms for short) and then select from a pull-down menu of 401 other variables to display a visualization of the data over time along with a correlation coefficient. Vigen’s pull-down menu is automatically sorted from largest positive correlation to largest negative correlation. Let’s say you pick the first one, Number of Female Editors on the Harvard Law Review (FemHLR for short).
Your hypothesis is that the two variables are really correlated. (Hypothetically speaking! Of course, you don’t believe that.) The null hypothesis is that they are not. Even if they are indeed not related, the correlation, which is after all just a mathematical function of the two random variables, may be non-zero by pure chance. To reject the null hypothesis, you want to say that it is unlikely that the observed non-zero correlation would have occurred by chance.
Vigen’s plot shows a correlation coefficient of 0.86! That’s high! Maximum correlation is 1. Is it significant? Well, there are 5 data points for each variable. That’s a pretty small sample. A really really simplistic analysis* will tell you that the probability of observing this degree of correlation by random chance is 78%. In other words, likely.
[* No, I am not going to even get into whether the distributional assumptions for the applicability of significance tests are appropriate. It’s beside the point!]
Here’s an animated example from skepticalscience.com of erroneously using small samples to conclude that the arctic ice sheet is not declining.
In practice, if the chance probability of the observation is more than 1% (or 5% at most), we do not reject the null hypothesis. So, there is nothing going on here. And you don’t need to overload your System 2 and even look at the labels to know that Nicolas Cage films have no relationship to the number of female HLR editors. The statistical analysis already tells you that for free!
But wait, it gets worse. The reason we picked FemHLR was because it had the highest correlation with CageFilms out of 400 comparisons! Even if the chance probability had come out to 1%, we would still be cheating if we labeled it significant. Why? Because even by random chance, a 1% event is expected to occur about 4 times in 400 draws! We can’t go through 400 comparisons, hand-pick one that is only 1% likely to have occurred by chance, and declare a finding. That’s what this comic is about.
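The “about 4 times in 400 draws” figure is plain arithmetic, and it is worth seeing next to the chance of getting at least one such fluke. This is a minimal sketch that assumes the 400 comparisons are independent of one another, which is a simplification; the function names are made up for illustration.

```python
def expected_false_positives(num_comparisons, alpha):
    """Expected number of comparisons that cross the threshold by chance alone."""
    return num_comparisons * alpha


def prob_at_least_one(num_comparisons, alpha):
    """Chance that at least one comparison crosses the threshold.

    Assumes the comparisons are independent (a simplification).
    """
    return 1 - (1 - alpha) ** num_comparisons


expected = expected_false_positives(400, 0.01)
p_any = prob_at_least_one(400, 0.01)
print(f"expected flukes in 400 comparisons at 1%: {expected:.1f}")
print(f"chance of at least one fluke: {p_any:.1%}")
```

Under that independence assumption, finding at least one “1% event” somewhere in the pull-down menu is close to a sure thing.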
Bonferroni tells us that if we want to use a 1% threshold to declare an observation unlikely due to chance alone, and if we further have 400 comparisons, then we have to apply a (1/400)% (= 0.0025%) threshold to each one. But, wait again! No one said we have to hold fixed the Nicolas Cage variable. If, as Vigen does, we look at all the possible pairwise correlations we can make between the 400 variables we have, there are actually about 80,000 of them! So we really need to set a threshold on any of these correlations at something like the .00001% level.
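For anyone who wants to check the arithmetic, here is a small sketch of the two thresholds. It uses the round figure of 400 variables from the paragraph above; `bonferroni_threshold` is just an illustrative helper name, not something from a statistics library.

```python
from math import comb


def bonferroni_threshold(alpha, num_comparisons):
    """Per-comparison threshold that keeps the overall error rate near alpha."""
    return alpha / num_comparisons


alpha = 0.01                 # the 1% level used in the text
one_vs_rest = 400            # one fixed variable against the others
all_pairs = comb(400, 2)     # every distinct pair among 400 variables

threshold_one = bonferroni_threshold(alpha, one_vs_rest)
threshold_all = bonferroni_threshold(alpha, all_pairs)

print(f"pairs among 400 variables: {all_pairs}")
print(f"threshold, 400 comparisons: {threshold_one:.7%}")
print(f"threshold, all pairs:       {threshold_all:.7%}")
```

The exact pair count is a little under 80,000, and the all-pairs threshold lands near the “something like .00001%” quoted above.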
Consider the correlation of cheese consumption with bedsheet-tangled deaths. The correlation is not only high (0.95), but our black box likelihood machine tells us there is only a .003% chance that we would have observed this level of correlation by chance (p = 3.22e-05). Significant? Nope. Not once you apply a Bonferroni correction to our 80,000 comparisons.
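An equivalent way to run the same check is to inflate the p-value instead of shrinking the threshold. The sketch below takes the p = 3.22e-05 quoted above as given (it does not recompute it) and uses the round counts of 80,000 and 400 comparisons; `bonferroni_adjust` is a hypothetical helper written for this example.

```python
def bonferroni_adjust(p_value, num_comparisons):
    """Bonferroni-adjusted p-value: multiply by the number of comparisons, cap at 1."""
    return min(1.0, p_value * num_comparisons)


p_cheese = 3.22e-05   # p-value quoted in the text for cheese vs. bedsheet deaths
alpha = 0.01          # the 1% level used in the text

adjusted_all = bonferroni_adjust(p_cheese, 80_000)  # all pairwise comparisons
adjusted_row = bonferroni_adjust(p_cheese, 400)     # one variable vs. the rest

for label, adjusted in [("80,000 comparisons", adjusted_all),
                        ("400 comparisons", adjusted_row)]:
    verdict = "significant" if adjusted < alpha else "not significant"
    print(f"{label}: adjusted p = {adjusted:.5f} -> {verdict} at 1%")
```

Even the more lenient count of 400 comparisons leaves the adjusted value above 1%.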
I don’t know if Vigen has scraped even more than 402 variables from the CDC and US Census. Could very well be. Also, some of the correlations are probably real and significant, like this profound correlation between 30-year precipitation data for Virginia and West Virginia.
But in case you were under the impression that this is how results in science/statistics are obtained, please rethink that.