Mining the Data Dumps
GG is hunting around for some information related to the little trainwreck series of posts, and has noticed some issues that bear on the broader business of (upbeat music cue here) Big Data.
Now Big Data comes in lots of flavors. Two leap to mind: satellite imagery and national health records. Much satellite imagery is collected regardless of immediate interest; it is then in the interest of the folks who own it for people to find the parts relevant to them. So DigitalGlobe, for instance, would very much like to sell its suite of images of, say, croplands to folks who trade in commodity futures. NASA would very much like to have people write their Congressional representatives about how Landsat imagery allowed them to build a business. So these organizations will invest in the metadata needed to find the useful stuff. And since there is a *lot* of useful stuff, it falls into the category of Big Data.
Health data is a bit different and far enough from GG’s specializations that the gory details are only faintly visible. There is raw mortality and morbidity information that governments collect, and there are some large and broad ongoing survey studies, like the Nurses’ Health Study, that collect a lot of data without a really specific goal. Marry this with data collected on the environment, say pollution measurements made by EPA, and you have the basis for most epidemiological studies. This kind of cross-dataset data mining is also a form of Big Data.
The funny thing in a way is that the earth sciences also collect big datasets, but their peculiarities show where cracks exist in the lands of Big Data. Let’s start with arguably the most successful of the big datasets, the collection of seismograms from all around the world. This started with the World-Wide Standardized Seismograph Network (WWSSN) in the 1960s. Although created to help monitor for nuclear tests, the data was available to the research community, albeit as awkward photographic records and catalogs of earthquake locations. As instrumentation transitioned into digital formats, this was brought together into the Global Seismographic Network archived by IRIS.
So far, so NASA-like. But there is an interesting sidelight to this: not only does the IRIS Data Management Center collect and provide all this standard data from permanent stations, it also archives temporary experiments. Now one prominent such experiment (EarthScope’s USArray) was also pretty standard in that it was an institutionally run set of instruments with no specific goal, but nearly all the rest were investigator-driven experiments. And this is where things get interesting.
You see, an investigator-driven experiment is designed to collect data toward some specific goal. So, for instance, several of GG’s portable experiments were to image parts of the Sierra Nevada. He and colleagues and students processed and analyzed that data for those purposes and wrote papers and all that. But that data has also been archived with IRIS, and so others can use it for something totally different, like looking for seismic waves skimming the core-mantle boundary. In other words, the seismic signals that may well have been thrown away by the original investigator (the data dumps, in the same sense as mine dumps) were grist (or maybe ore) for the mill for other investigators. And this works pretty seamlessly.
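Just how seamless this is deserves a concrete illustration. The IRIS DMC exposes its archive through the standard FDSN web services, so anyone can request archived waveforms with nothing more than a URL. The sketch below only builds such a request (the network, station, and time window are placeholder examples, not from any of GG’s experiments) rather than actually downloading data:

```python
from urllib.parse import urlencode

# IRIS DMC's public FDSN "dataselect" web service endpoint.
IRIS_DATASELECT = "http://service.iris.edu/fdsnws/dataselect/1/query"

def waveform_request_url(network, station, channel, start, end, location="--"):
    """Build a URL requesting archived waveforms from the IRIS DMC.

    The parameter names (net, sta, loc, cha, starttime, endtime) follow
    the FDSN web service conventions; "--" stands for a blank location code.
    """
    params = {
        "net": network,     # network code (e.g. a temporary experiment's code)
        "sta": station,     # station code
        "loc": location,
        "cha": channel,     # channel code, e.g. BHZ for broadband vertical
        "starttime": start, # ISO-8601 date-times
        "endtime": end,
    }
    return IRIS_DATASELECT + "?" + urlencode(params)

# Hypothetical example: one day of vertical-component data from one
# USArray Transportable Array (network TA) station.
url = waveform_request_url("TA", "R11A", "BHZ",
                           "2008-01-01T00:00:00", "2008-01-02T00:00:00")
print(url)
```

Point the resulting URL at the service and miniSEED data comes back, no matter whose experiment originally recorded it; that uniformity of access is what makes the reuse work.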
But what of other solid earth measurements, things like geologic mapping, geochemical measurements, geochronology? This is where we slam into a brick wall, and it is instructive to see why that is the case. So consider geochronology–getting dates from rocks.
On the face of it, this is pretty simple stuff: you have a rock with some name, and you have a date and how you got the date. Now there are scores of wrinkles here: paleontological dates are best expressed in terms of eras and stages and the like, while radiometric dates are numbers of years. And this is raw grist for a lot of geology. So you’d think that there would be a central database of this information. You would be wrong.
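To see just how simple the core record could be, and where the wrinkles creep in, here is a minimal sketch of what one entry might hold. The field names and the example values are invented for illustration; they do not reflect any actual database schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgeDetermination:
    """One date from one rock -- a hypothetical minimal record."""
    sample_name: str                        # rock unit or sample identifier
    latitude: float
    longitude: float
    method: str                             # e.g. "U-Pb zircon", "40Ar/39Ar", "biostratigraphy"
    age_ma: Optional[float] = None          # radiometric: age in millions of years
    age_uncertainty_ma: Optional[float] = None
    stage: Optional[str] = None             # paleontological: a stage name instead of a number
    reference: str = ""                     # where the date was published

# The wrinkle: a radiometric date and a paleontological one want
# different fields, so they sit awkwardly in the same table.
radiometric = AgeDetermination("granite sample (hypothetical)", 37.0, -119.0,
                               "U-Pb zircon", age_ma=95.0, age_uncertainty_ma=1.2)
fossil = AgeDetermination("fossiliferous limestone (hypothetical)", 37.2, -118.9,
                          "biostratigraphy", stage="Cenomanian")
```

Even this toy version shows the design tension: numeric ages and stage names are different kinds of answers to the same question, and any central database has to accommodate both.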
There are fits and starts and bits and pieces. There is the Paleobiology Database, which has a lot of info on fossil localities; this is active and pretty cool, but it doesn’t quite revolve around determining ages. There is the old USGS radiometric age database that stalled out around 2003, the NAVDAT database that was never fully complete and also stalled out around 2011, and the active but nearly empty geochron.org database. And most of these deal only with the U.S. If you want to try to pull data together using geologic ages, best of luck. A lot of the data is not in anything remotely resembling a database.
So why did this happen, and what does it tell us about the future of geologic Big Data? The contrast between seismology and geochronology is quite instructive. Seismology had to keep track of stuff being generated all the time, much as NASA has to do. Before digital data, seismology was similarly balkanized. But once there was an easy way to move and copy seismic data, consolidation and uniform archiving of all data became much more attractive (and indeed was mandated by NSF). In contrast, there was never an overarching program for collecting geologic age data all the time; the closest was probably the USGS during times when they were mapping large parts of the country. Lacking some central focus, there wasn’t an obvious place for data to be stored anyway. And even if there were, it isn’t clear anybody would have had any motivation to share that data.

So individual efforts scoured the literature time and time again to try to do things like map the position of the volcanic arc in the western U.S., as Coney and Reynolds did in 1977, without the benefit of a common database to work from. You could contact the authors and ask for a copy of whatever they had, and maybe they’d have it for you and maybe they wouldn’t. Most likely, though, if you were building on that, you had to combine other datasets and then plow through the literature again to get what you needed.
Even the partial successes have issues worth noting. NAVDAT accumulated tens of thousands of measurements out of the geologic literature, but it was not an institutional effort. It was investigator-driven and as such had a limited lifetime. That also means error correction is now impossible, and quality control problems did creep in: some ages are misstated as to how they were made or who made them. Quality control, then, is another issue that can block the mining of such big datasets.
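Some of that quality control could at least be automated at an institutional archive. The sketch below shows the flavor of sanity checks one might run on incoming age records; the record format, the method list, and the thresholds are all my own illustrative choices, and note that the NAVDAT-style errors above (wrong method or analyst attributed) are exactly the kind no automated check can catch:

```python
def qc_flags(record):
    """Return a list of problems found in a hypothetical age record (a dict)."""
    flags = []
    age = record.get("age_ma")
    if age is not None:
        if age < 0:
            flags.append("negative age")
        elif age > 4567:  # older than the solar system (~4567 Ma)
            flags.append("age exceeds age of the solar system")
    unc = record.get("age_uncertainty_ma")
    if age is not None and unc is not None and unc > age:
        flags.append("uncertainty larger than the age itself")
    # Illustrative whitelist of dating methods, not an official vocabulary.
    known_methods = {"U-Pb zircon", "40Ar/39Ar", "K-Ar", "Rb-Sr",
                     "fission track", "biostratigraphy"}
    if record.get("method") not in known_methods:
        flags.append("unrecognized method")
    return flags

print(qc_flags({"age_ma": -5.0, "method": "K-Ar"}))
print(qc_flags({"age_ma": 95.0, "age_uncertainty_ma": 1.2, "method": "U-Pb zircon"}))
```

Checks like these only catch internally inconsistent records; catching a date attributed to the wrong lab takes a human with the original paper in hand, which is why an archive with no ongoing staffing can’t fix its own errors.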
NSF, for instance, is well aware of this and mandates that data be archived. Somehow. Somewhere. And NSF funds organizations like EarthChem to help create those archives. But these efforts are, so far, not yielding the desired outcomes.
The lesson? Big data doesn’t happen on its own: there needs to be a focal point, and it helps a ton if somebody’s job is just to vacuum up all the information of a specific type. NSF isn’t the best home for such initiatives: investigators gravitate towards hypothesis-driven science, which is not the mission of these big data warehouses, and the duration of grants means there is always another deadline approaching. Whether other parts of earth science can replicate IRIS’s success in corralling most of the seismic data out there remains to be seen, but at least there is a model. Until something like that emerges, it will take much more work to mine the geologic Big Data literature than it should.