GG is hunting around for some information related to the little trainwreck series of posts, and has noticed some issues that bear on the broader business of (upbeat music cue here) Big Data.
Now Big Data comes in lots of flavors. Two leap to mind: satellite imagery and national health records. Much satellite imagery is collected regardless of immediate interest; it is then in the interest of the folks who own it that people find the parts relevant to themselves. So DigitalGlobe, for instance, would very much like to sell its suite of images of, say, croplands to folks who trade in commodity futures. NASA would very much like to have people write their Congressional representatives about how Landsat imagery allowed them to build a business. So these organizations will invest in the metadata needed to find the useful stuff. And since there is a *lot* of useful stuff, it falls into the category of Big Data.
Health data is a bit different and far enough from GG’s specializations that the gory details are only faintly visible. There is raw mortality and morbidity information that governments collect, and there are some large and broad ongoing survey studies like the Nurses’ Health Study that collect a lot of data without a really specific goal. Marry this with data collected on the environment, say pollution measurements made by EPA, and you have the basis for most epidemiological studies. This kind of cross-datatype data mining is also a use of Big Data.
The funny thing in a way is that the earth sciences also collect big datasets, but their peculiarities show where cracks exist in the land of Big Data. Let’s start with arguably the most successful of the big datasets, the collection of seismograms from all around the world. This started with the World-Wide Standardized Seismograph Network (WWSSN) in the 1960s. Although created to help monitor for nuclear tests, the data was available to the research community, albeit in awkward photographic form and catalogs of earthquake locations. As instrumentation transitioned into digital formats, this was brought together into the Global Seismographic Network archived by IRIS.
So far, so NASA-like. But there is an interesting sidelight to this: not only does the IRIS Data Management Center collect and provide all this standard data from permanent stations, it also archives temporary experiments. Now one prominent such experiment (EarthScope’s USArray) was also pretty standard in that it was an institutionally run set of instruments with no specific goal, but nearly all the rest were investigator-driven experiments. And this is where things get interesting.
GG stumbled onto a story about Mike Morrison, a psychology PhD student, and his work remaking scientific posters. His video on the weaknesses of scientific posters and his suggested solution is well worth watching. Many recommendations are classics, essentially boiling down to KISS (Keep It Simple, Stupid). GG is interested in investigating something of the origins of the problem described and how, in earth science, things might not be quite as amenable to his solution.
First up, how did we get to the poster hall of doom, anyways?
Posters are actually a fairly recent innovation (so the NPR story line about changing a “century” of conformity is nonsense). Professional meetings started as everybody getting together in a single room and, often, each reading their paper to the rest of their society (the early issues of the Bulletin of the Geological Society of America not only included the oral presentation but the Q and A afterward). Splitting into multiple oral sessions followed in time. When posters first showed up at AGU in the 1970s, they were in a small room and were a definite side show (GSA came later). Some of these were presentations that people otherwise couldn’t present (maybe they missed the meeting, or had breaking results that were too late for inclusion in the regular program), but some were materials that simply didn’t lend themselves to oral presentations. Big seismic reflection profiles and detailed geologic maps were often such materials. “Posters” as seen today didn’t really exist: printed materials were tacked up in whatever form was handy; layouts were impressively fluid. So initially a lot of posters were things actually better shared in that format.
GG has been telling (begging?) folks for some time that all he really wanted was for somebody, somewhere, to review his book The Mountains that Remade America (latest count remains 0). The old saw that any publicity is good publicity seemed to make sense, and even when reviewers aren’t terribly fond of a book, they rarely would discourage you from looking at it.
But now GG sees some advantages in anonymity courtesy of a review of Jared Diamond’s latest book. Diamond, being now a name brand non-fiction author, is not to be overlooked, which means that a review must appear for better or worse. And worse it is. It would be hard to find a review more negative than this one. Reviewer Anand Giridharadas hammers Diamond for numerous factual errors, for substituting apocryphal stories for research, for forcing facts to fit his theory, and for being woefully out of touch with barriers facing minorities and women in many cultures. It is, frankly, devastating.
Perhaps this is just water off Diamond’s back; his bank account is probably in a pretty healthy state regardless. You could hope his publisher (Little, Brown and Company) is wondering if fact checking such high profile texts might be a good investment. But GG is now taking some small measure of solace that at least he didn’t get a review that would have made him want to crawl under a rock.
Recently the libraries of the University of California system finally pulled the plug on the predatory pricing policies of Elsevier. All GG can say is, finally! [Note: GG has not reviewed for or published with Elsevier as a matter of principle; his one lapse was accidentally agreeing to a single review]. What does this mean?
According to Marcus Banks, writing at Undark.org, it means that pure open access is the way out of this. His text implies that the true costs of publication are so low that such expenses are ridiculous, and that prestige publications are really a sham for fleecing the scientific public. The sooner that academics realize that the open access journals are just as good, the sooner all will be right in the publishing world.
OK, now maybe GG has read a bit more into this essay than is really there, but there is this sense that all publishers really do is collect money off the backs of funding agencies for no good reason. And this logic can lead to a terrible decay in journal quality.
Once upon a time, having a “subscription” meant that things would come to you until either the term of the subscription ran out or you cancelled the subscription. The stuff that had already come, whether issues of Teen Vogue, the record of the month or volumes of an encyclopedia, were yours to keep. But in the world of the academic library, that model is vanishing, and with it potentially are large parts of the academic literature.
In the paper past, an academic library’s subscription to a professional journal meant that the library got paper copies of the journal that they could then place on shelves and allow people to read. As budgets might tighten or interests wane, libraries would cancel subscriptions–but those journals they had purchased remained on the shelves unless purged to make room for other material. This model is essentially dead.
Instead publishers have shifted to the software definition of “subscription”–which isn’t really a subscription at all. Just as using Adobe’s Creative Cloud software requires an active subscription, so does getting access to all the issues of Science that you had subscribed to over the years. And if the journal decides to go to predatory pricing? Your options are nil. That money you poured into the journal all those years means nothing. In general, libraries are not allowed to make local copies of all the content they are subscribing to.
Arguably this is one of the best facets of a true open access policy: the freedom to copy materials means that there can be multiple archives. University archives can legally maintain and share copies of work produced at their institutions. Research groups can maintain thematic collections of articles relevant to their focus. (Note that current open access policies do not necessarily allow this: much as you can view some movies online so long as you watch the ads, some open access materials could require you to access the original portal and, perhaps, see advertisements there). In a sense, this can return libraries to their original function: instead of mere portals for providers, they return to being actual repositories of knowledge. So while we may have permanently lost the meaning of “subscription,” we can recover the true meaning of “library.”
Many of you no doubt have heard of the failures to reproduce published results in some scientific fields. This has led to condemnation of journals that have rejected or discouraged papers attempting to reproduce some observation or effect.
Now this is not such a big deal in solid earth science (and probably not even climate science, where things are so contentious politically that redoing things is viewed in a positive way). Basically, for most geological observations we have the Earth, which remains accessible to pretty nearly all of us. Raw observations are increasingly stored in open databases (seismology has been at this for decades, for instance). Cultural biases that color some psychological or anthropological works don’t apply much in solid earth, and the tweaky issues of precise use of reagents and detailed and inaccessible lab procedures that have caused heartburn in biological sciences are less prominent in earth science (but not absent! See discussions on how fission track ages are affected by etching procedures, or look at the failure of the USGS lab to use standards properly). We kind of have one experiment–Earth–and we aren’t capable of reproducing it (Hitchhiker’s Guide to the Galaxy notwithstanding, there is no Earth 2.0).
No, the problem isn’t failing to publish reproductions. It is failing to recognize when we are reproducing older work. And it is going to get worse.
As GG has noted before, citations to primary literature are becoming more and more scarce despite tools that make access to that literature easier and easier. This indicates that less and less background work is being done before studies move forward: in essence, it is easier to do a study than to prepare for it. The end result is pretty apparent: new studies will fail to uncover the old studies that essentially did the same thing.
Reexamining an area or data point is fine so long as you recognize that is what you are doing, but inadvertently conducting a replication experiment is not so great. Combine this with the already sloppier-than-desired citation habits we are forming and we risk running in circles, rediscovering what was already discovered without gaining any insight.
Long long ago, computers were big expensive machines lodged in climate-controlled rooms behind lock and key, access held tightly by the masters of campus IT. Users paid by the kilobyte, by the second of connect time, by the millisecond of compute time. The gods of IT raked in money like casinos.
Then came the PC. Within a few years, the IT department at MIT, for example, had collapsed from its previous lofty heights, discontinuing mainframes and reducing support staff to posting flyers around campus, offering services users were delighted to ignore. The totalitarian system was dead! Long live democracy!
Well, slowly but surely we’ve encouraged a new generation to take up the crown and beat us with the scepter of access until we bow down in homage to our noble masters. “The Cloud” is, in fact, on most campuses just the same mainframe. Better OS, much better iron, but as campus IT has decided that mere users must be protected from the world beyond, they have leveraged the need for security from the broader internet into security for the denizens of the IT department. This despite the fact that, all too frequently, their own staff are the source of the serious break-ins (in GG’s building, the two serious security lapses were both caused by mistakes made by IT professionals).
And yet it is even more insidious. Instructors are increasingly told to place their courses within course management systems, web-based monstrosities like Blackboard, Canvas, and Desire2Learn. These three (GG has had experience with all of them) are essentially interchangeable even as each is painful in its own way; their main advantage over just regular web pages is that intraclass materials are private, so protected information like grades and use of copyrighted materials can be freely placed online. Yet, practically like clockwork, campus IT decides it is time to shift from one to the next. Why? Usually some relatively trivial capability is trundled out to justify the move (Now on smartphones! Now with free-form answer quizzes! Now looks snazzier!)–despite the likelihood that the previous provider will match that new wrinkle within a year or two. So faculty and teaching staff and students are forced to learn yet another way of doing the same damn thing, which means… time for our boys (and a few girls) in IT to collect paychecks running workshops on how to do things and building web pages on how things are different and, of course, spending months if not years first installing and then troubleshooting the new software and then migrating content over, all while supporting the old system for a year or two longer than originally planned until it is time to begin the process of investigating the latest iterations of such software, which inevitably leads to… moving to a new system!
Something similar goes on with email support, internet video conferencing, personnel management software and other computer-related interfaces. Non-IT administrators who in theory are riding herd on this are so divorced from both users and the technology that they lack the backbone to say “no, what we have will suffice.” It remains unclear if the disruption to instructors and students plays any role in the calculations made to justify these changes (it seems certain to be underestimated).
Of course campus IT is at increasing risk of being outsourced to companies like Microsoft and Google (indeed many functions already have). It isn’t hard to predict that there will be a major scandal when a university’s “private” information somehow wanders off campus. Watching all this can make a grumpy geophysicist who remembers the early days of the internet and the last gasps of the old IT mainframes dwell fondly on the memories of hope…