Misunderstanding reproducibility

A lot has been written about the results of the Reproducibility Project’s analysis of papers in psychology (for instance, here and here). While some of the response has been overwrought handwringing, perhaps the most embarrassing response comes in defense of the work that failed to reproduce. Prof. Barrett at Northeastern wrote a NY Times op-ed saying that this was just normal science stuff: “But the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works”.

What balderdash.

She offers, as comparable to what is going on here, examples where ideas developed from some experiments turned out not to apply in other situations. For instance:

Similarly, when physicists discovered that subatomic particles didn’t obey Newton’s laws of motion, they didn’t cry out that Newton’s laws had “failed to replicate.” Instead, they realized that Newton’s laws were valid only in certain contexts, rather than being universal, and thus the science of quantum mechanics was born.

Er, this was not a failure to replicate; this was examining the application of a theory to a novel environment. (This is also a novel history of how QM developed, but we need not go there today.) A replication would have been repeating the experiments Newton did and seeing them produce different results. (If you like, measuring over and over the time it takes an apple dropped from a certain position to bonk Sir Isaac on the head.) If a professor in this field cannot separate a reproduction of research from its application to a novel situation, then one has to question the intellectual rigor of the field. (OK, well, maybe just that of one practitioner whom the NY Times felt was worthy of writing for them.)

The reason for the desire to reproduce results is that there is a clear bias towards publishing “novel” results. If twenty groups try the same thing and there is no real effect, on average one will produce a significant (at the 95% level) result that is in fact due to random chance. (This is why amateur earthquake predictors are occasionally right: so many are trying that a few get lucky.) That lucky project gets published while the other 19 don’t; that is the thesis behind publication bias. The intent of the Reproducibility Project, as far as GG can tell, is to do the same experiment as in the original work. If the same experiment is conducted and the result is not confirmed, this is not learning that different conditions affect the result. This is learning that the initial experiment’s results were not in fact statistically robust.
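For those who want to see the arithmetic behind the twenty-groups argument, here is a minimal Python sketch. It assumes the groups are independent, that there is no real effect at all, and that each test has exactly a 5% false-positive rate; the function names and the simulation setup are GG's own illustration, not anything taken from the Reproducibility Project. Under those assumptions the expected number of spurious "significant" results is 20 × 0.05 = 1, and the chance that at least one group gets one is about 0.64.

```python
import random

ALPHA = 0.05    # assumed false-positive rate of a test at the 95% level
N_GROUPS = 20   # assumed number of independent groups trying the same thing


def expected_false_positives(n_groups=N_GROUPS, alpha=ALPHA):
    """Expected count of spurious 'significant' results when no effect exists."""
    return n_groups * alpha


def prob_at_least_one(n_groups=N_GROUPS, alpha=ALPHA):
    """Chance that at least one group gets a spurious significant result."""
    return 1.0 - (1.0 - alpha) ** n_groups


def simulate(n_rounds=20_000, n_groups=N_GROUPS, alpha=ALPHA, seed=0):
    """Monte Carlo check: fraction of rounds with at least one false positive."""
    rng = random.Random(seed)
    rounds_with_hit = 0
    for _ in range(n_rounds):
        # Each group independently "finds" an effect with probability alpha.
        hits = sum(rng.random() < alpha for _ in range(n_groups))
        if hits > 0:
            rounds_with_hit += 1
    return rounds_with_hit / n_rounds


if __name__ == "__main__":
    print("expected false positives:", expected_false_positives())
    print("P(at least one), analytic:", round(prob_at_least_one(), 4))
    print("P(at least one), simulated:", round(simulate(), 4))
```

If only the lucky group publishes, a faithful repeat of that experiment would come up significant again only about 5% of the time under the same assumptions, which is the sense in which a failed replication points at the original result rather than at changed conditions.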

So the statement of Prof. Barrett, “But the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works”, is incorrect. While it may not be cause for alarm per se, what this means is that more than 60% of the psychology literature (and a fairly prominent part at that) is somewhere between misleading and in error. And mind, we are not talking about the interpretation of the results, we are talking about the results themselves. Anything remotely equivalent to this in geophysics would be scandalous (GG hopes); it would be like reoccupying the gravity stations from 10 studies and finding that, in 6 of them, the gravity anomalies were significantly different from those originally reported.

[You only encounter something remotely similar on the far side of inversion and model runs; the same data fed to different seismic inversions will usually produce different results.  But here the difference is the processing, not the original data, and most workers are careful in how they use their results.]

Now it is true that for some of the studies it may turn out that there are indeed confounding factors (American and British grad students having different behaviors, for instance) and, unsurprisingly, a number of the original researchers are busily hunting them down to try to rescue their previous work from oblivion. And, of course, it is plausible that a few of the Reproducibility Project’s studies are flawed and falsely show a failure to replicate original studies. But this still says that a sizable part of the scientific literature (at least in this field and, most likely, medicine as well) is not useful as the basis for new work.

We often would say that a textbook is 90% right and 10% wrong while scientific publications were 90% wrong and 10% right. But we were talking about the interpretations, not the data. Here it seems to be the data itself, as it makes it into the literature, that is the problem. Not a good sign, and you’d think a practitioner in the field might recognize it for the trouble it is. (For a far more realistic appraisal of the situation, see the discussion over at Retraction Watch.)
