Tyler Vigen’s Spurious Correlations are neither informative nor funny, but they are damaging to science.

Yes, I get the joke. I understand that Tyler Vigen is being cute by publishing an automatically generated series of spurious correlations (and visualizations) of time-series data that he collected on the internet. But two quick points about this website that has gone viral. One, it’s bad as a humor piece. Two, it’s bad for science. (It’s obviously not science, so there’s really no need to evaluate its scientific merits.) I suspect Vigen is a believer in science and well-intentioned, so I hope he’s paying attention. (At the end of the two-part rant, I’ve also added a didactic appendix about one way to avoid making inferences from spurious correlations.)

1. Satire is a good thing.

“If you live to be one hundred, you’ve got it made. Very few people die past that age.” ― George Burns
There are many witty jokes about statistics. Concerning prior art when it comes to exposing spurious correlation plots, human rights writer and blogger Filip Spagnoli did pretty much this exact thing five years ago.
(http://filipspagnoli.files.wordpress.com/2009/05/lemongraph.jpg)
Spagnoli, on the other hand, offers up a veritable treasure trove of statistical methods gone awry, accompanied by cartoons from xkcd.com and Dilbert, and screenshots of Fox News.
What makes statistics jokes funny is both context and craft, which involves self-control. Don’t beat a joke to death by repeating it in ten thousand variations. One or two well-curated spurious correlations are funny, like this correlation of divorce rates in Maine with per capita consumption of margarine.

[Figure: divorce rate in Maine vs. per capita consumption of margarine (US)]


A never-ending supply of pointless correlations, however, is a bit obnoxious.


2. Vigen says this:

“The charts on this site aren’t meant to imply causation nor are they meant to create a distrust for research or even correlative data. Rather, I hope this project fosters interest in statistics and numerical research.”


Well, that’s a nice hope. But the fact on the ground is that politicians, news reporters, and regular folks alike frequently make anti-statistics and/or anti-science arguments, the gist of which is: quantitative analyses are a sham because you can tell any story you want with [scare-quotes optional] STATISTICS. Witness the climate change debate. The scientific method is reduced to fishing in a bucket of data for just the (spurious) correlation that makes your point. Right?


Wrong. Scientists and statisticians are acutely concerned about making erroneous inferences, which nevertheless does happen sometimes. Still, the framework of hypothesis testing is almost as old as statistics itself and provides many ways to discredit inferences drawn from fishing around in a bucket of data like Vigen does. One of the simplest of these is called the Bonferroni correction. If Vigen had published a Bonferroni-corrected p-value with each correlation, well, let’s just say the website might not have been as popular.


As it is, Vigen unwittingly adds fuel to the anti-science fire by turning correlation analysis into a (bad) joke. It works as a hit-generating machine for his website, but the cheap thrills come at the expense of informed debate on serious matters.


Reporting correlations that are obviously not statistically significant and then backing away from significance claims is disingenuous. It is no different from introducing misleading evidence into a criminal trial. Even if the jury is instructed to ignore the inadmissible evidence, the damage is done.

Appendix: What’s a Bonferroni correction?

Wolfram offers a pretty good and short explanation. What follows is a simplified non-mathematical description of significance along with a worked example from Vigen’s data.


The key idea behind much of hypothesis testing is making the following very general argument: it’s unlikely that the coincidence/correlation of THIS and THAT occurred by random chance. Thus, there is probably SOMETHING going on, rather than NOTHING. Chance coincidences are made more likely by either (a) small observation samples or (b) multiple comparisons.


On Vigen’s website, you can pick one variable, say Number of films Nicolas Cage appeared in (CageFilms for short) and then select from a pull-down menu of 401 other variables to display a visualization of the data over time along with a correlation coefficient. Vigen’s pull-down menu is automatically sorted from largest positive correlation to largest negative correlation. Let’s say you pick the first one, Number of Female Editors on the Harvard Law Review (FemHLR for short).


Your hypothesis is that the two variables are really correlated. (Hypothetically speaking! Of course, you don’t believe that.) The null hypothesis is that they are not. Even if they are indeed not related, the correlation, which is after all just a mathematical function of the two random variables, may be non-zero by pure chance. To reject the null hypothesis, you want to say that it is unlikely that the observed non-zero correlation would have occurred by chance.


Vigen’s plot shows a correlation coefficient of 0.86! That’s high! Maximum correlation is 1. Is it significant? Well, there are 5 data points for each variable. That’s a pretty small sample. A really really simplistic analysis* will tell you that the probability of observing this degree of correlation by random chance is 78%. In other words, likely.
[* No, I am not going to even get into whether the distributional assumptions for the applicability of significance tests are appropriate. It’s beside the point!]
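If you want to play along at home, here is a minimal sketch of one way to estimate such a chance probability: a permutation test that shuffles one series and counts how often the shuffled correlation comes out at least as extreme as the observed one. The five data points below are placeholders, not Vigen’s actual numbers, and since my “really really simplistic analysis” isn’t spelled out above, this sketch won’t necessarily reproduce the 78% figure.

```python
import numpy as np

# Placeholder series of length 5 (NOT Vigen's actual data).
cage_films = np.array([2, 2, 3, 1, 4], dtype=float)
fem_hlr = np.array([9, 8, 10, 7, 12], dtype=float)

observed_r = np.corrcoef(cage_films, fem_hlr)[0, 1]

# Permutation test: shuffle one series many times and count how often
# the absolute correlation is at least as large as the observed one.
rng = np.random.default_rng(0)
n_perm = 100_000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(fem_hlr)
    if abs(np.corrcoef(cage_films, shuffled)[0, 1]) >= abs(observed_r):
        count += 1

p_chance = count / n_perm
print(f"observed r = {observed_r:.2f}, chance probability ~ {p_chance:.2f}")
```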


Here’s an animated example from skepticalscience.com of erroneously using small samples to conclude that the Arctic ice sheet is not declining.


[Figure: the 2013 Arctic “escalator” graphic from skepticalscience.com]


In practice, if the chance probability of the observation is more than 1% (or 5% at most), we do not reject the null hypothesis. So, there is nothing going on here. And you don’t need to overload your System 2 or even look at the labels to know that Nicolas Cage films have no relationship to the number of female HLR editors. The statistical analysis already tells you that for free!


But wait, it gets worse. The reason we picked FemHLR is that it had the highest correlation with CageFilms out of 400 comparisons! Even if the chance probability had come out to 1%, we would still be cheating if we labeled it significant. Why? Because even by random chance, a 1% event is expected to occur about 4 times in 400 draws! We can’t go through 400 comparisons, hand-pick the one that is only 1% likely to have occurred by chance, and declare a finding. That’s what this comic is about.
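A quick simulation makes the point: generate 400 pairs of pure-noise series and count how many pass a 1% test just by luck. Everything below is synthetic noise; the series length is arbitrary.

```python
import numpy as np
from scipy import stats

# 400 comparisons of pure noise: how many look "significant" at the 1% level?
rng = np.random.default_rng(0)
n_years = 10          # length of each fake time series (arbitrary)
false_alarms = 0
for _ in range(400):
    x = rng.normal(size=n_years)
    y = rng.normal(size=n_years)
    r, p = stats.pearsonr(x, y)
    if p < 0.01:
        false_alarms += 1

print(f"{false_alarms} of 400 pure-noise comparisons passed the 1% test")
```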

Bonferroni tells us that if we want to use a 1% threshold to declare an observation unlikely due to chance alone, and if we further have 400 comparisons, then we have to apply a 1/400% (=0.0025%) threshold to each one. But, wait again! No one said we have to hold the Nicolas Cage variable fixed. If, as Vigen does, we look at all the possible pairwise correlations we can make between the 400 variables we have, there are actually about 80,000 of them! So we really need to set the threshold for any of these correlations at something like the .00001% level.
Consider the correlation of cheese consumption with bedsheet-tangled deaths. The correlation is not only high (0.95), but our black box likelihood machine tells us there is only a .003% chance that we would have observed this level of correlation by chance (p = 3.22e-05). Significant? Nope. Not once you apply a Bonferroni correction to our 80,000 comparisons.
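To make the bookkeeping in the last two paragraphs explicit, here it is in a few lines of code; the only inputs are the 400-variable count and the p-value quoted above.

```python
# Bonferroni bookkeeping for the numbers quoted above.
alpha = 0.01                 # the 1% threshold we would like to use overall
n_vars = 400                 # variables in the pull-down menu (per the text)

# Case 1: hold CageFilms fixed and compare it against 400 other variables.
threshold_fixed = alpha / n_vars
print(f"per-comparison threshold, 400 comparisons: {threshold_fixed:.6%}")  # 0.0025%

# Case 2: all pairwise correlations among ~400 variables.
n_pairs = n_vars * (n_vars - 1) // 2        # about 80,000
threshold_pairs = alpha / n_pairs
print(f"number of pairs: {n_pairs}")
print(f"per-comparison threshold, all pairs: {threshold_pairs:.7%}")        # ~0.0000125%

# The cheese-vs-bedsheet correlation from the text:
p_cheese = 3.22e-05
print("significant after correction?", p_cheese < threshold_pairs)          # False
```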


I don’t know if Vigen has scraped even more than 402 variables from the CDC and US Census. Could very well be. Also, some of the correlations are probably real and significant, like this profound correlation between 30-year precipitation data for Virginia and West Virginia.
But in case you were under the impression that this is how results in science/statistics are obtained, please rethink that.

Which comes first, the Theta or the Q?

The more I try to reconcile EDM community thinking (largely influenced by the CMU cognitive community) with psychometric thinking, the more I suspect a philosophical roadblock. Q-matrix aka rule space theory was indeed influenced by AIED and diagnostic modeling ideas and has subsequently been reabsorbed by the EDM world as a stand-in for item response models and psychometrics as a whole. Which is odd, since the same cognitive EDM community is generally not thinking in terms of latent trait models, but rather in terms of mastery of some O(100) individual domain skills. In other words, from an increased likelihood of success in answering a question correctly, one makes an inference about which skills the examinee has mastered, not an inference about how skilled the examinee is. (This reminds me of the distinction between guilt and shame…)

Since “the truth” is some description of the knowledge space in terms of knowledge components, the cogEDM practitioner wants to assess the value of the Q-matrix, perhaps improve it (in the same vein as dividing a KC into subKCs), but learning the Q-matrix from the data is a bit disingenuous, since the number of KCs must jibe with the philosophical assumptions of how learning works. Learning a Q-matrix with only 2 factors is never done.

Latent trait modeling tends toward the opposite extreme, perhaps due to historical reasons: one might typically assume that one overall skill should explain all the observations unless the model-data fit is really bad, in which case consider adding a dimension. The latent trait model serves the purpose of explaining the variance/covariance in the observed data, and therefore the more parsimonious the model, the better.

We may know that a domain expert would classify the questions on an instrument into 4 or 5 groups, but that doesn’t mean, of course, that 4 or 5 skills (KCs, whatever) are necessary to explain student response data. Of course, if a single factor could explain the data, and if the items, thus parameterized, do not cluster at all, then we might safely say the expert is just plain wrong. A compromise, however, is that it is possible to find 5 clusters with only 3 skills, for example (a toy illustration follows below). Which brings me to the question in the title of this post.
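As an aside, here is a toy illustration of the 5-clusters-from-3-skills point. The Q-matrix below is invented for illustration, not taken from any real instrument: items that require the same subset of skills are indistinguishable to the model, so the number of distinct rows is the number of visible item clusters.

```python
import numpy as np

# Toy Q-matrix: 10 hypothetical items x 3 skills. Each row says which of the
# 3 skills an item requires. These assignments are invented for illustration.
Q = np.array([
    [1, 0, 0],   # items tapping skill A only
    [1, 0, 0],
    [0, 1, 0],   # skill B only
    [0, 1, 0],
    [0, 0, 1],   # skill C only
    [0, 0, 1],
    [1, 1, 0],   # skills A and B
    [1, 1, 0],
    [1, 1, 1],   # all three skills
    [1, 1, 1],
])

# Items with identical rows of Q behave identically under the model, so the
# number of distinct rows is the number of item clusters we could observe.
clusters = np.unique(Q, axis=0)
print(f"{Q.shape[1]} skills, {len(clusters)} item clusters")   # 3 skills, 5 clusters
```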

But are they a lot of medals, considering?

If you’re paying attention, you may have noticed that the U.S. and China have racked up a lot of Olympic medals. They are about head-to-head in the gold-medal count and far ahead of the rest of the pack. But China has about four times the population of the U.S., which has five times the population of the U.K. So how does it look if you adjust for population? See below for the simple version, or go here for the fancy-pants pro version.

In the first bar plot, countries are ordered from left to right by gold-medal count, which is usually what is shown in medal tables (the numbers in parentheses are still total medals: sorry about the confusion). In the second plot, besides sorting the countries, I have left out those that have earned only one medal.
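If you want to reproduce the simple version, the adjustment is just a division. The medal totals and populations below are rough, illustrative figures for London 2012 (total medals, not golds), not the exact numbers behind the plots.

```python
# Per-capita adjustment: total medals per 10 million people.
# Counts and populations are rough, illustrative figures, not the
# exact numbers used to make the plots.
medals = {               # approximate total medal counts
    "China": 88,
    "USA": 104,
    "UK": 65,
}
population_millions = {  # approximate 2012 populations
    "China": 1350,
    "USA": 314,
    "UK": 63,
}

per_10m = {
    country: medals[country] / population_millions[country] * 10
    for country in medals
}
for country, rate in sorted(per_10m.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {rate:.2f} medals per 10 million people")
```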

educational data mining <-?-> psychometrics

I have just returned from my first Psychometric Society meeting (IMPS 2012), which followed close on the heels of my first Educational Data Mining meeting (EDM 2012). Both were excellent, intellectually stimulating conferences. I am attaching the slides from my presentation IRT in the Style of Collaborative Filtering. In it, I expose myself to the psychometrician’s revulsion to Joint Maximum Likelihood Estimation, but not without a few defensive slides. I presented similar material–naturally from a different perspective–at EDM.

This is largely why I came to IMPS, and I succeeded in getting a conversation started with Alina von Davier at ETS, who appears to be interested in the same issues: EDM, driven by computer scientists and cognitive scientists with an artificial intelligence bent, is bringing out exciting, new models and learning algorithms that crunch (big) data and are measured by their ability to predict, i.e. by cross-validation.  Psychometricians, fueled by measurement theorists, statisticians and psychologists, worry (rightly) about reliability and construct validity. Especially for high-stakes implementations, they can’t afford to take a risk on a dynamic predictive model that may change the way it classifies a student from one day to the next. They need repeatable models with statistical guarantees behind them. In the middle ground (forthcoming title: When does cross-validation imply construct validity?) there has to be some interesting work to do.
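For readers who have never met Joint Maximum Likelihood Estimation, here is a generic, minimal sketch of what it looks like for a Rasch (1PL) model: both student abilities and item difficulties are treated as free parameters and updated together, which is precisely what makes psychometricians wince. This is only an illustration of the general idea under my own toy assumptions, not the method from the slides.

```python
import numpy as np

def jmle_rasch(responses, n_iter=500, lr=0.5):
    """Toy Joint Maximum Likelihood Estimation for a Rasch (1PL) model.

    responses: binary matrix (students x items), 1 = correct answer.
    Abilities (theta) and difficulties (beta) are both free parameters,
    nudged by gradient steps on the joint likelihood. (Students with
    all-correct or all-wrong rows would drift off to infinity in a real
    JMLE run; a toy example ignores that.)
    """
    n_students, n_items = responses.shape
    theta = np.zeros(n_students)   # student abilities
    beta = np.zeros(n_items)       # item difficulties

    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))  # P(correct)
        resid = responses - p
        theta += lr * resid.mean(axis=1)  # up if the student beat expectations
        beta -= lr * resid.mean(axis=0)   # down if the item was easier than expected
        beta -= beta.mean()               # pin the scale's location
    return theta, beta

# Tiny synthetic data set (invented, for illustration only).
rng = np.random.default_rng(0)
true_theta = rng.normal(size=200)
true_beta = rng.normal(size=20)
p_true = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_beta[None, :])))
data = (rng.random(p_true.shape) < p_true).astype(float)

theta_hat, beta_hat = jmle_rasch(data)
print("difficulty recovery r =", round(np.corrcoef(true_beta, beta_hat)[0, 1], 2))
```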

For interesting notes from a grad student who is grappling with the tension between knowledge engineering and learning models from data in CS education, check out http://teachwrestleeatsleep.blogspot.com (plus you’ll learn something about wrestling training).

The tipping point for EDM?

A draft report commissioned by the U.S. Dept. of Education:

“Big data, it seems, is everywhere—even in education. Researchers and developers of online learning systems, intelligent tutoring systems, virtual labs, simulations, games and learning management systems are exploring ways to better understand and use data from learners’ activities online to improve teaching and learning.”

NYC Value Added Score: Math Teacher Experience Adds No Added Value

In terms of actual gains for the year 2007-2008, the data do support the idea that math teachers get better with experience. But if you take the NYC DOE value-added score, which is what DOE bases its rankings on, there is no such indication, making this metric seem rather dubious…

Source data: NYC Dept. of Ed. (An alternate version of the figure includes error bars.)

Another figure suggests that DOE percentile rank (based on value-added score) does not correlate with teacher experience. (If it did, you would expect the peak of the distribution to shift to the right for more experienced teachers and to the left for less experienced ones.)

If the term NYC Department of Education ever crosses your lips, there is a good chance you’ve heard about the recent hullabaloo over the public release of NYC teacher data reports (Dennis Walcott Op-Ed, NYT, WSJ). Notice I say data here, not rankings, even though that’s the headline most everywhere. Critics are right to caution about taking these rankings too far (or anywhere maybe). I personally feel that the publication of names in the manner that has occurred is unfair. But oh, free data! I was not one of the researchers who got anonymized data ahead of time, so this has been my first glance at this trove.

The question that grabbed me was this: how do the DOE performance metrics vary with teacher experience? In my first pass through the three annual data sets, I noticed that the 2007-2008 set contained more fine-grained experience groups (see chart; the later sets only distinguish 1, 2, 3, or more than 3 years), so why not start with the 2007-2008 data? The answer turned out to be quite subtle.

For starters, I decided to look at math results only (personal choice). I removed all teachers for whom years of experience was “unknown” or those listed as “co-teaching.” I also removed any teachers whose total student count was less than 20. The remaining list does contain duplicates of individuals, since a teacher who teaches both 7th and 8th grade, say, appears once for each. But my interest was in averaging the performance numbers by experience groups anyway, so I ignored this duplication issue.

NYC DOE released both actual student gains on proficiency tests (post-score minus pre-score) and something called “value-added score”. The value-added score is meant to distinguish student groups that were projected to do well and did well from student groups that over- or under-performed expectations. In their own words, “on the 07-08 reports, value-added = actual gain – predicted gain.” A bit more on what predicted gain means below.

Here’s the meat of the matter: you might expect that if you average over hundreds of teachers in any random group, the average gains (z-scores) will wash out to zero. But if teachers get better with experience (a reasonable assumption, no?), then maybe you would see a trend if you grouped teachers by years of experience. It turns out that in terms of actual gains for the year 2007-2008 and for multi-year aggregate data in that report year (blue and green lines), the data do support the idea that teachers get better. Although there is a suggestion that after 10 years, there may be a bit of a slump…

On the other hand, if you take the DOE value-added score (red and purple lines), there is no such indication. (Standard errors on these points are large enough that the data are consistent with a flat trend-line.)
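For concreteness, the filtering and averaging described above looks roughly like this in code. The file name and column names are made up for illustration; the real DOE spreadsheets use different labels, and this sketch skips the error-bar calculation.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("nyc_doe_teacher_data_2007_2008.csv")

math = df[df["subject"] == "math"]
math = math[math["experience"] != "unknown"]         # drop unknown experience
math = math[math["teaching_mode"] != "co-teaching"]  # drop co-teaching assignments
math = math[math["num_students"] >= 20]              # require at least 20 students

# A teacher can appear once per grade taught; that duplication is ignored
# because we only average by experience group anyway.
summary = (
    math.groupby("experience")[["actual_gain", "value_added"]]
        .mean()
        .sort_index()
)
print(summary)
```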

So what does that mean? How could teacher value-added not improve with teacher experience?

There may be something going on with the DOE’s “predicted gain” estimate*. Their Teacher Data Report FAQ from 2009 does say that teacher experience is factored in when comparing teachers to peer teachers:

The “peer comparison” sections of the report are different from the “citywide comparison” sections […] the predicted gain in all peer comparison calculations takes into account the teacher’s experience overall and in that grade and subject.

I don’t think this is what’s happening here, though, and discussions of value-added elsewhere don’t mention teacher experience as a factor. For a citywide comparison, the predicted gains should not take teacher experience into consideration, and so more experienced teachers should show higher value-added scores (though parents and principals should accept that new teachers are still developing mastery). If the DOE is giving new teachers a boost by lowering their predicted gains, this is not really doing a service to anyone. Adjust for students, yes. Do not adjust expectations for the limited experience of the teacher.

Update: Based on this much more definitive report describing the details of the NYC value-added model, I understand that teacher experience is definitely a variable in the model, though the report does not say how much. This fact might obviate my comments below.

The purpose of using value-added scores, as I mentioned, is to level the playing field when comparing teachers with very different student groups. But in averaging hundreds of teachers together by years of experience, the “unequal field” effect should average away. So really the actual gains and the value-added scores should show the same trend with years of experience. But they don’t.

One possibility is that 1st and 2nd year teachers are given harder teaching assignments than their more senior colleagues. There may well be seniority privilege, and that could explain how lower actual gains in this group turn into average value-added scores. But this is not a comforting way out, because then it appears that years of experience really have no effect on teacher “value-added.”


*The astute observer may notice that (as expected) actual gains are the same for 07-08 and multi-year for teachers with 1 year of experience, whereas this is not the case for value-added scores. This is because the predicted gains factored into the teacher value-added score are based on the multi-year information, even though the actual gains are not.

Are 3D movies a fad?

Yes they are! But like many fads, this is a cyclical one which recurs, apparently, every 28 years. True, this post is not topical for this blog and would be much more at home on the information diet. But J and I watched Hugo in 3D this weekend, so naturally this question came up.

The source data are from Wikipedia as acknowledged in the figure, but I excluded the category of “post-1952 short films.” Just about 400 films remain. I was sort of expecting this result, but it still came out surprisingly clear. Just to clarify the horizontal axis labels: the 37 films represented in the bar above 1984 include those released in the years 1981-1984 inclusive.
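If you want to redo the binning, the convention is the one in that last sentence: each bar is labeled by the final year of a 4-year window. A sketch with made-up release years (the real list comes from Wikipedia):

```python
from collections import Counter

# Made-up release years for illustration; the real list comes from Wikipedia.
release_years = [1953, 1954, 1954, 1982, 1983, 1984, 2009, 2010, 2011, 2011]

WIDTH, ANCHOR = 4, 1984   # the bar labeled 1984 covers 1981-1984 inclusive

def bucket_label(year):
    """Label a year by the last year of its 4-year bucket."""
    return year + (ANCHOR - year) % WIDTH

counts = Counter(bucket_label(y) for y in release_years)
for label in sorted(counts):
    print(f"{label - WIDTH + 1}-{label}: {counts[label]} films")
```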

Mixed Reality State as metaphor

Back in 2007, Hübler and Gintautas published a paper (also in 2008) on an experiment that I think is fascinating: they coupled a real pendulum to a virtual pendulum and created what they called a mixed reality state.

You probably know what a real pendulum is and can imagine two of them connected somehow by a spring or a string in such a way that the two motions affect each other. Ok, here is help: despite Wikipedia, do not confuse two coupled pendulums with a double pendulum.

Now imagine simulating a pendulum on a computer using Newtonian mechanics and animating it subject to some initial conditions and/or a driving force. What Hübler and Gintautas did was physically couple the two types of pendulum, so that the virtual pendulum receives digital feedback about the position of the real one and the real pendulum receives physical feedback depending on the position of the virtual one.
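As a rough sketch of the idea (my own toy code, not their apparatus or their equations of motion), picture the virtual pendulum integrated step by step, with a coupling term that pulls it toward the measured angle of the real pendulum, while the real pendulum gets an analogous nudge from an actuator. Here the “real” pendulum is itself just simulated, standing in for the sensor reading.

```python
import math

# Toy sketch of a virtual pendulum coupled to a "real" one. In the actual
# experiment the real angle would come from a sensor; here it is simulated.
g_over_l = 9.81 / 0.5      # gravity / pendulum length
damping = 0.05
coupling = 0.8             # strength of the real<->virtual coupling (made up)
dt = 0.001

theta_v, omega_v = 0.3, 0.0    # virtual pendulum state
theta_r, omega_r = -0.2, 0.0   # stand-in for the real pendulum

for step in range(20_000):
    # Virtual pendulum: Newtonian dynamics plus feedback from the real angle.
    alpha_v = (-g_over_l * math.sin(theta_v) - damping * omega_v
               + coupling * (theta_r - theta_v))
    # "Real" pendulum: same dynamics plus feedback from the virtual angle
    # (in the lab this feedback would be applied by an actuator).
    alpha_r = (-g_over_l * math.sin(theta_r) - damping * omega_r
               + coupling * (theta_v - theta_r))

    omega_v += alpha_v * dt
    theta_v += omega_v * dt
    omega_r += alpha_r * dt
    theta_r += omega_r * dt

    if step % 5000 == 0:
        print(f"t={step * dt:5.1f}s  virtual={theta_v:+.3f}  real={theta_r:+.3f}")
```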

As far as I can tell, this work has not triggered a cascade of related research, but I like the idea as a metaphor for the use of educational technology and, even more broadly, for the integration of technology in many aspects of our lives.

Blended learning environments refer to the combined use of online and mobile computer technology with hands-on, peer-to-peer and instructor-guided classwork. Ideally these modes are not independent but coupled. The student receives information and feedback from both physical and virtual learning environments, and both environments respond to the student’s changes. A mixed reality state.

Some critics have imagined online learning as a sad and lonely substitute for social learning in groups. But online social networks and discussion groups suggest other possibilities; in fact, they are becoming the subject of intense research on human dynamics. Most of us respond to feedback from physical and virtual sources every day through actions we take in the physical and virtual world. Forget the yellow submarine; we all live in a mixed reality state.