Are we ready for “Big Literature”?
A lot has been written about Big Data, presenting it as some kind of super challenge and opportunity. Certainly there is a lot of data kicking around out there, but the thing about Big Data is that it is, well, data: if you know what you want to do with one data point, figuring out an algorithm to extract and process many data points of interest, while time consuming, isn’t hopeless. (So, for instance, one could argue that seismologists have for some time been in the Big Data pool, extracting and analyzing parts of the many terabytes of stored data for the pieces of interest.)
But what if you have to examine each data point on its own? Such is the case with the scientific literature. And the explosion of that literature may threaten the ability to make progress. (That explosion will only get worse if peer review is discarded in favor of open repositories.) So, can we deal with “Big Literature”?
In the past, the trend has been to become more and more specialized, which allows one to face a tractable part of the literature. But with an increasing emphasis on interdisciplinary work, building on advances in multiple fields is increasingly desirable. How to deal with these pressures?
Perhaps this is why GG has noticed increasingly poor citation habits, where authors are not citing the papers that are the source of the algorithm, interpretation, or data point they wish to make use of but instead are citing some recent paper that also made use of this object. If so, this is unfortunate in two ways: it deprives the originator of the object of the credit they deserve, and it runs the risk of turning into a game of telephone, where progressive misunderstandings of what was actually done/said/observed propagate through the literature, degrading later attempts to conduct research. That this kind of degraded citation practice persists despite modern tools allowing quick access to nearly the full literature and its citation history suggests that various pressures are breaking down more traditional behaviors. Certainly some of this is the pressure to publish frequently while generating lots of grant applications, but some probably reflects the growth of the literature beyond the capacity of many of us to digest.
How will we deal with this? One way is to simply focus on a subset of the literature–probably the literature that supports the direction your research is going. Certainly a filter many senior researchers apply before reading a paper is a quick skim of the author list–certain individuals may have consistently produced work worth reading, and so the paper gets read, while others may not have. This can lead to the kinds of echo chambers we see people falling into politically. Or maybe we skim only the latest work, missing out on perhaps relevant observations made in the past, thus risking repeating earlier mistakes, or simply unknowingly repeating earlier work (which is different from trying to reproduce a study). (We had a student once who liked to simply plow ahead, not caring what was in the literature. GG’s comment at the time was that when you reinvent the wheel, you risk reinventing the flat tire.)
GG isn’t sure where this will go. Will “Big Literature” produce new tools for meta-analysis of science? Are we doomed to trip over ourselves in our growing ignorance of the state of the whole field? Or will we react by compacting and reclassifying the literature in some way to make it digestible to human brains? It might be time to start worrying about it before some future Congress calls scientists and program directors before it to explain how they could have unknowingly done the exact same thing as their predecessors some years before.