GG is hunting around for some information related to the little trainwreck series of posts, and has noticed some issues that bear on the broader business of (upbeat music cue here) Big Data.
Now Big Data comes in lots of flavors. Two leap to mind: satellite imagery and national health records. Much satellite imagery is collected regardless of immediate interest; it is then in the interests of the folks owning it that people will find the parts of interest to themselves. So Digital Globe, for instance, would very much like to sell its suite of images of, say, croplands to folks who trade in commodity futures. NASA would very much like to have people write their Congressional representatives about how Landsat imagery allowed them to build a business. So these organizations will invest in the metadata needed to find the useful stuff. And since there is a *lot* of useful stuff, it falls into the category of Big Data.
Health data is a bit different and far enough from GG’s specializations that the gory details are only faintly visible. There is raw mortality and morbidity information that governments collect, and there are some large and broad ongoing survey studies like the Nurses’ Health Study that collect a lot of data without a really specific goal. Marry this with data collected on the environment, say pollution measurements made by EPA, and you have the basis for most epidemiological studies. This kind of cross-datatype style of data mining is also using a form of Big Data.
The funny thing in a way is that the earth sciences also collect big datasets, but their peculiarities show where cracks exist in the lands of Big Data. Let’s start with arguably the most successful of the big datasets, the collection of seismograms from all around the world. This started with the World-Wide Standardized Seismograph Network (WWSSN) in the 1960s. Although created to help monitor for nuclear tests, the data were available to the research community, albeit in awkward photographic form and catalogs of earthquake locations. As instrumentation transitioned to digital formats, this was brought together into the Global Seismographic Network archived by IRIS.
So far, so NASA-like. But there is an interesting sidelight to this: not only does the IRIS Data Management Center collect and provide all this standard data from permanent stations, it also archives temporary experiments. Now one prominent such experiment (EarthScope’s USArray) was also pretty standard in that it was an institutionally run set of instruments with no specific goal, but nearly all the rest were investigator-driven experiments. And this is where things get interesting.
Nature asked nine “leading Europeans” what their top priority for science is. What is interesting is that if you read between the lines, you find their support for science–and these are definitely supporters of science–seems predicated on different ideas of what science is or should be. Consider these quotes:
“Curiosity-driven research is essential to support a knowledge-based society and push forward innovation….”–ISABELLE VERNOS, Member of the European Research Council Scientific Council, and ICREA Research Professor, Centre for Genomic Regulation, Barcelona, Spain.
“Science has not been primarily about raising questions that are relevant for society. Now it must be.”–JAN WOUTER VASBINDER & DANIEL R. BROOKS, Director, Para Limes, Valkenburg, the Netherlands (J.W.V.); Visiting senior fellow, Institute of Advanced Studies, Köszeg, Hungary (D.R.B.).
As these two quotes illustrate, the cross currents present among all these views are occasionally startling. Science needs more money, needs to be loved, needs to be relevant, needs to be fostered, needs to transcend national boundaries, needs to promote itself, needs to be practiced by the entire public, needs to be transparent. And that is not a complete list; the projections these nine put on science can be taken either as confirmation of the many ways science is important or as an accusation that science as an intellectual enterprise is on its deathbed.
Many of you no doubt have heard of the failures to reproduce studies in some scientific fields. This has led to condemnation of publications that have rejected or discouraged papers attempting to reproduce some observation or effect.
Now this is not such a big deal in solid earth science (and probably not even climate science, where things are so contentious politically that redoing things is viewed in a positive way). Basically, for most geological observations we have the Earth, which remains pretty accessible to nearly all of us. Raw observations are increasingly stored in open databases (seismology has been at this for decades, for instance). Cultural biases that color some psychological or anthropological works don’t apply much in solid earth, and the tweaky issues of precise use of reagents and detailed and inaccessible lab procedures that have caused heartburn in biological sciences are less prominent in earth science (but not absent! See discussions on how fission track ages are affected by etching procedures, or look at the failure of the USGS lab to use standards properly). We kind of have one experiment–Earth–and we aren’t capable of reproducing it (Hitchhiker’s Guide to the Galaxy notwithstanding, there is no Earth 2.0).
No, the problem isn’t failing to publish reproductions. It is failing to recognize when we are reproducing older work. And it is going to get worse.
As GG has noted before, citations to primary literature are becoming more and more scarce despite tools that make access to primary literature easier and easier. This indicates that less and less background work is being done before studies move forward: in essence, it is easier to do a study than to prepare for it. The end result is pretty apparent: new studies will fail to uncover the old studies that essentially did the same thing.
Reexamining an area or data point is fine so long as you recognize that is what you are doing, but inadvertently conducting a replication experiment is not so great. Combine this with the already sloppier-than-desired citation habits we are forming and we risk running in circles, rediscovering what has already been discovered without gaining any insight.
Certainly one of the most striking things about modern American political discourse is the magnitude of outright lying going on. While misdirection and obfuscation were not uncommon in political speech, outright provable lying wasn’t. And yet now we have a President who Politifact says has made statements that are either false or “pants on fire” 47% of the time and who has inspired the Washington Post fact checker to keep a running count of lies. This follows years of internet chain emails and conspiracy theorists that have made Snopes expand rapidly to capture and review all the questionable stuff circulating on the internet. Needless to say, this tends to encourage others to play equally fast and loose with truth. For a scientist, this is a distressing trend–but it isn’t really that new.
Now to be clear, big lies have made the circuit before, being a staple of the Nazi government, for instance; the related game of “whataboutism” was a favorite of the old Soviet state. Some might point to McCarthyism in the US as a domestic episode, though the Red Scare had less questioning of objective truth and more vilification by insinuation. Here GG refers to outright misrepresentations of what is going on. And as science’s goal is to discern the nature and rules of the reality we inhabit, it has a habit of landing in the crosshairs of those whose interests conflict with reality.
It’s been a while since we discussed ways to make publication figures both accurate and fair: part 1 dealt with the problem of mapping variables that varied across the map. Part 2 was mainly an illustration of just how horrible Excel is for earth science work. Here we’ll consider some issues with directional data such as paleomagnetic directions and paleocurrents.
Let’s start with the classic rose diagram:
Pretty different looking, no? On the right is the classic rose diagram where the length (radius) of each pie wedge is scaled by the value in that azimuth range. In this case, these are back azimuths of teleseismic arrivals measured for a tomography study. You can easily see that things are dominated by events to the northwest and to a lesser degree to the southeast and southwest.
To the left, the exact same data are plotted with wedge area, rather than length, scaled to the value. Which is better? As a test, what fraction of the data lies in the wedges from 120-140° and 300-320°?
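The distinction between the two scalings can be sketched in a few lines of code. This is a minimal illustration with made-up azimuth data (the synthetic clusters below only mimic the NW-dominated pattern described above; the real back azimuths are not reproduced here). The key point: if the radius of a wedge is proportional to its count, the wedge’s *area* grows as the count squared, so dominant directions look even more dominant; taking the radius proportional to the square root of the count makes area proportional to the count.

```python
import math
import random

random.seed(0)
# Hypothetical teleseismic back azimuths: a dominant NW cluster plus
# smaller SE and SW clusters (synthetic stand-ins for the real data).
az = ([random.gauss(315, 15) % 360 for _ in range(300)] +
      [random.gauss(135, 15) % 360 for _ in range(120)] +
      [random.gauss(225, 20) % 360 for _ in range(80)])

# Bin into 18 wedges of 20 degrees each.
counts = [0] * 18
for a in az:
    counts[int(a // 20)] += 1
frac = [c / len(az) for c in counts]

# Two choices of wedge radius for a rose (polar bar) plot:
r_length = frac                        # classic: radius ~ count, wedge AREA ~ count^2
r_area = [math.sqrt(f) for f in frac]  # equal-area: radius ~ sqrt(count), AREA ~ count

# Length scaling exaggerates the dominant direction: the perceived (area)
# ratio between two wedges is the square of the true count ratio.
i_nw = frac.index(max(frac))   # the dominant northwest wedge
i_se = 6                       # the 120-140 degree wedge
print("true count ratio:", frac[i_nw] / frac[i_se])
print("perceived area ratio (length-scaled):", (frac[i_nw] / frac[i_se]) ** 2)
```

Feeding `r_length` or `r_area` to any polar bar-plot routine then produces the two versions of the figure; only the area-scaled radii let the eye read wedge area as the fraction of the data.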
How should one read a scientific paper? As presenting conclusions one should take as our best estimate of truth? Or as information one can use to test competing hypotheses? You might think it must be one or the other, but that is rarely the case.
Consider the just-published paper by Bahadori, Holt and Rasbury entitled “Reconstruction modeling of crustal thickness and paleotopography of western North America since 36 Ma”. From the abstract you might be tempted to say that this paper is solving a problem, in this case the Late Cenozoic paleoelevation history of the western U.S.:
Our final integrated topography model shows a Nevadaplano of ∼3.95 ± 0.3 km average elevation in central, eastern, and southern Nevada, western Utah, and parts of easternmost California. A belt of high topography also trends through northwestern, central, and southeastern Arizona at 36 Ma (Mogollon Highlands). Our model shows little to no elevation change for the Colorado Plateau and the northern Sierra Nevada (north of 36°N) since at least 36 Ma, and that between 36 and 5 Ma, the Sierra Nevada was located at the Pacific Ocean margin, with a shoreline on the eastern edge of the present-day Great Valley.
There is one key word in that paragraph that should make you careful in accepting the results: “model”. What is the model, and how reliable is it?
Few if any scientists are wild about the modern funding environment. With the exception of some big planetary probes, where the sheer cost of the probe ensures some long-term funding, nearly all science is funded on a 1 to 3 year timescale. Competition can be fierce, and news of getting funded is often accompanied by a request to reduce the budget some amount.
GG reminds you who read this that this was not the sort of environment originally envisioned for NSF.
Even as this environment might not nurture an Einstein or Newton, one could argue that it rapidly prunes away uninteresting science. Such a view would not find comfort in the last paragraph of a perspective in Science on new research into the response of C3 versus C4 plants in a higher CO2 world (research that appears to challenge if not overturn the assumption that C3 plants will do far better than C4 plants):
Reich et al. were only able to make their discoveries because their experiment ran uninterrupted for two decades. This is extremely rare globally, showing that funding for long-term global-change experiments is a necessity. The experiment relied on a concerted effort to continually apply for funding, given the largely short-term nature of funding cycles. Because most funding agencies place a value on innovation and novelty, scientists are forced to come up with new reasons and new measurements to keep existing experiments running. The tenacity of Reich et al. and their ability to keep their experiment running has overturned existing theory and should lead to changes in how we think about and prepare for Earth’s future. Who knows how many processes remain undiscovered because of the unwillingness of funding agencies to support long-term experiments?
Frankly, similar long-term programs in very diverse fields have been terminated for similar reasons, including in solid earth science, so this isn’t just biology or climate change. For instance, the USGS has pulled a large number of stream gauges over the years in the western U.S. under the logic that we had seen enough to know what we needed to know–an absolute travesty given long-term climatic oscillations, the reality that rainfall in arid and semi-arid areas is highly erratic, and the real possibility that a long-term set of observations would be crucial in better understanding impacts of global warming on the hydrologic cycle. And that is for an agency that has monitoring as part of its mission; individual scientific projects are even harder to keep going. It would seem we really need a program for taking the long view–something few in politics ever do.