How should one read a scientific paper? As presenting conclusions one should take as the best available estimate of truth? Or as information one can use to test competing hypotheses? You might think it must be one or the other, but that is rarely the case.
Consider the just-published paper by Bahadori, Holt and Rasbury entitled “Reconstruction modeling of crustal thickness and paleotopography of western North America since 36 Ma”. From the abstract you might be tempted to say that this paper is solving a problem, in this case the Late Cenozoic paleoelevation history of the western U.S.:
Our final integrated topography model shows a Nevadaplano of ∼3.95 ± 0.3 km average elevation in central, eastern, and southern Nevada, western Utah, and parts of easternmost California. A belt of high topography also trends through northwestern, central, and southeastern Arizona at 36 Ma (Mogollon Highlands). Our model shows little to no elevation change for the Colorado Plateau and the northern Sierra Nevada (north of 36°N) since at least 36 Ma, and that between 36 and 5 Ma, the Sierra Nevada was located at the Pacific Ocean margin, with a shoreline on the eastern edge of the present-day Great Valley.
There is one key word in that paragraph that should make you careful in accepting the results: “model”. What is the model, and how reliable is it?
Why make a model? For engineers, models are ways to try things out: you know all the physics, you know the properties of the materials, but the thing you are making, maybe not so much. A successful engineering model is one that behaves in desirable ways and, of course, accurately reproduces how a final structure works. In a sense, you play with a model to get an acceptable answer.
How about in science? GG sometimes wonders, because the literature sometimes seems confused. From his perspective, a model offers two possible utilities: it can show that something you didn’t think could happen actually could happen, and it can show you situations where what you think you know isn’t adequate to explain what you observe. Or, more bluntly, models are useful when they give what seem to be unacceptable answers.
The strange thing is that some scientists seem to want to patch the model rather than celebrate the failure and explore what the failure means. As often as not, this is because the authors were heading somewhere else and the model failure was an annoyance that got in the way, but GG thinks that the failures are more often the interesting thing. To really show this, GG needs to show a couple of actual models, which means risking annoying the authors. Again. Guys, please don’t be offended. After all, you got published (and for one of these, are extremely highly cited, so an obscure blog post isn’t going to threaten your reputation).
First, let’s take a recent Sierran paper by Cao and Paterson. They made a fairly simple model of how a volcanic arc’s elevation should change as melt is added to the crust and erosion acts on the edifice. They then plugged in their estimates of magma inputs. Now GG has serious concerns with the model and a few of the data points in the figure below, but that is beside the present point. Here they plot their model’s output (the solid colored line) against some observational points [a couple of which are, um, misplotted, but again, let’s just go with the flow here]:
The time scale runs from today on the left edge to 260 million years ago on the right. The dashed line is apparently their intuitive curve connecting the points (it is never mentioned in the caption). What is exciting about this? Well, the paper essentially says “hey, we predicted most of what happened!” (well, what they wrote was “The simulations capture the first-order Mesozoic-Cenozoic histories of crustal thickness, elevation and erosion…”)–but that is not the story. The really cool thing is the vertically hatched area labeled “mismatch”. Basically their model demands that things got quite high about 180 Ma, but the observations say that isn’t the case.
What the authors said is this: “Although we could tweak the model to make the simulation results more close to observations (e.g., set Jurassic extension event temporally slightly earlier and add more extensional strain in Early-Middle Jurassic), we don’t want to tune the model to observations since our model is simplified and one-dimensional and thus exact matches to observations are not expected.” Actually there are a lot more knobs to play with than extensional strain: there might have been better production of a high-density root than their model allowed, there might have been a strong signal from dynamic topography, there might be some bias in Jurassic pluton estimates…in essence, there is something we didn’t expect to be true. This failure is far more interesting than the success.
A second example is from the highly cited 2008 paper by Lijun Liu and colleagues. Here they took seismic tomography and converted it to density contrasts (again, a step fraught with potential problems) and then ran a series of reverse convection runs, largely to see where a high-wavespeed anomaly under the easternmost U.S. originated. The result? The anomaly thought to be the Farallon plate rises up to appear…under the western Atlantic Ocean. “Essentially, the present Farallon seismic anomaly is too far to the east to be simply connected to the Farallon-North American boundary in the Mesozoic, a result implicit in forward models.”
This is, again, a really spectacular result, especially as “this cannot be overcome either by varying the radial viscosity structure or by performing additional forward-adjoint iterations...” It means that the model, as envisioned by these authors, is missing something important. That, to GG, is the big news here, but it isn’t what the authors wanted to explore: they wanted to look at the evolution of dynamic topography and its role in the Western Interior Seaway–so they patched the model, introducing what they called a stress guide, but which really looks like a sheet of teflon on the bottom of North America so that the anomaly would rise up in the right place, namely the west side of North America. While that evidently is a solution that can work (and makes a sort of testable hypothesis), it might not be the only one. For instance, the slab might have been delayed in reaching the lower mantle as it passed through the transition zone near 660 km depth, meaning that the model either neglected those forces or underestimated them. Exploring all the possible solutions to this rather profound misfit of the model would have seemed the really cool thing to do.
Finally, a brief mention of probably the biggest model failure and its amazingly long and controversial life. One of the most famous derivations is the calculation of the elevation of the sea floor based on the age of the oceanic crust; the simplest model is that of a cooling half-space, and it does a pretty good job of fitting ocean floor depths out to about 70 million years in age. Beyond that, most workers find that the seafloor is too shallow:
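The half-space curve itself is simple enough to sketch. Here is a minimal illustration, assuming nothing beyond the often-quoted Parsons and Sclater (1977) depth–age fits (the coefficients are theirs; the plate-model curve is their flattening fit, included only to show where the two diverge):

```python
import math

def halfspace_depth(age_myr):
    """Half-space cooling: depth increases as the square root of age.
    Coefficients are the Parsons & Sclater (1977) fit (meters, Myr)."""
    return 2500.0 + 350.0 * math.sqrt(age_myr)

def plate_depth(age_myr):
    """Plate model: depth flattens toward an asymptote as the plate
    equilibrates (Parsons & Sclater 1977 fit, meters, Myr)."""
    return 6400.0 - 3200.0 * math.exp(-age_myr / 62.8)

for age in (10, 50, 70, 100, 150):
    hs, pl = halfspace_depth(age), plate_depth(age)
    print(f"{age:4d} Myr  half-space {hs:6.0f} m  plate {pl:6.0f} m  diff {hs - pl:5.0f} m")
```

Out to about 70 Myr the two curves nearly coincide; by 150 Myr the half-space prediction is roughly 700 m deeper than the flattening plate curve, which is the discrepancy the figure above shows in the observations.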
This has spawned a fairly long list of papers seeking to explain the discrepancy (some by resampling the data to argue that the original curve can fit, others by using a cooling plate instead of a half-space, others by invoking the development of convective instabilities that cause the bottom of the plate to fall off, still others by invoking some flavor of dynamic topography, and more). In this case, the failure of the model was the focus of the community–that this remains controversial is a bit of a surprise, but it goes to show how interesting a model’s failure can be.
In part one, we saw that there are often differences between seismic tomographies of an area, and the suggestion was made that on occasion a tomographer might choose to make a big deal about an anomaly that in fact is noise or an artifact (GG does have a paper in mind but thinks it was entirely an honest interpretation). Playing with significance criteria (or not even having some) could allow an unscrupulous seismologist a chance to make a paper seem to have a lot more impact than it deserves.
Yet this is not really where the worst potential for abuse lies.
The worst is when others use the tomographic models as input for some other purpose. At present, this is most likely in geodynamics, but no doubt there are other applications. Which model should you use? If you run your geodynamic model with several tomographies and one yields the exciting result you were wanting to see, what do you do? Hopefully you share all the results, but it would be easy not to and instead provide some after-the-fact explanation for why you chose that model.
Has this happened? GG has heard accusations.
It’s not like the community is unaware of differences. Thorsten Becker published a paper in 2012 showing that in the western U.S. seismic models were pretty similar except for amplitude–but “pretty similar” described correlation coefficients of 0.6-0.7. (That amplitude part is pretty important, BTW.) About the same time (but less explicitly addressing the geodynamic modeling community), Gary Pavlis and coauthors similarly compared things in the western U.S. and reached a similar conclusion. But this only provides a start; the key question is, just how sensitive are geodynamic results to the differences in seismic tomography?
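Becker’s point, similar pattern but different amplitude, is easy to illustrate. Here is a toy sketch with purely synthetic numbers (nothing here comes from any published model): two “tomographies” that share a pattern yet differ in strength can correlate well while disagreeing on the one thing a buoyancy calculation cares about.

```python
import math
import random

random.seed(0)

# Synthetic "model A": velocity anomalies (percent) at N nodes.
N = 1000
model_a = [random.gauss(0.0, 1.0) for _ in range(N)]
# "Model B" shares the pattern but at half amplitude, plus noise.
model_b = [0.5 * v + random.gauss(0.0, 0.3) for v in model_a]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rms(x):
    """Root-mean-square amplitude of an anomaly field."""
    return math.sqrt(sum(v * v for v in x) / len(x))

r = pearson(model_a, model_b)
amp_ratio = rms(model_b) / rms(model_a)
print(f"correlation r = {r:.2f}, amplitude ratio = {amp_ratio:.2f}")
```

With these synthetic numbers the correlation comes out around 0.85 even though model B carries barely more than half the amplitude of model A; any density, and hence force, derived from the two would differ by nearly a factor of two despite the “similar” maps.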
Frankly, earth science has faced issues for a long time as workers in one specialty had need of results from another. Usually this meant choosing between interpretations of some kind (that this volcanic rock is really sourced from the mantle, not the crust, or that this paleomagnetic result is good and that other one is bad). But the profusion of seismic models and their role as direct starting points for even more complex numerical modeling seems to pose a bigger challenge than radiometric dates or geologic maps, which never were so overabundant that you could imagine finding the one that worked best for your hypothesis. When you toss in some equal ambiguity about viscosity models in the earth, it can seem difficult to know just how robust the conclusions of a geodynamic model are.
Heaven help you if you are then picking between geodynamic models for anything–say like plate motion histories. You could be a victim of a double vp hack….
Maybe it’s just that February is finally ending, but GG has been navel-gazing a bit after reading the exploits of some folks who don’t seem to understand what science is really for but who get to portray scientists in real life. If you have the stomach for it, Buzzfeed’s review of Brian Wansink’s rather unpleasant history of p-hacking at levels rarely seen is worth a read. Or you can see Retraction Watch’s ongoing accumulation of his retractions and revisions.
Those of us in geophysics pat ourselves on the back and are quietly happy that we don’t have hundreds of independent variables to go fishing in to find something marginally significant. But maybe we have issues that, while not as unscrupulous, are a means of finding something publishable in a pile of dreck.
So let’s go vp-hacking. (And yes, we’ll get in the weeds a bit here).
In looking at the little advertisements (“press releases”) for newsworthy new science that make up the website SciTechDaily, GG found this stunning assertion:
First-of-Its Kind Seismic Study Challenges Concepts of Geology
Wow! A first-of-its-kind study and challenging some unnamed concepts of geology. Not every day that happens. What was more, the study was authored by well-respected scientists like Vadim Levin, who was quoted in the puff piece saying “The upwelling we detected is like a hot air balloon, and we infer that something is rising up through the deeper part of our planet under New England.”
Frankly, this is a case of university promotion run amok, and Vadim has to take at least partial ownership.
First, the study is hardly the first of its kind. It compares tomographic wave speeds with measurements of shear-wave splitting, stuff that has been done now for decades. What is new are some SKS splitting measurements from some sites that hadn’t been included in previous regional studies. The splitting magnitudes were small, suggesting that the regionally present transverse [horizontal] anisotropy was damped or reoriented in this region. Yet we get quotes from Vadim (who certainly should know better) like this: “Our study challenges the established notion of how the continents on which we live behave.”
Oh, be real. This study is not about to rewrite the textbooks despite Levin’s statement that “It challenges the textbook concepts taught in introductory geology classes.”
Look, the paper is perfectly fine. But it was not the work that originated the idea that this body under New England was a convective upwelling; in fact, those papers don’t challenge any notion about continents, instead suggesting that the trailing edges of continents might generate convective motions in the mantle. (Vadim was a coauthor on at least one of these papers published a year ago).
Clearly the hype with the press release is way out of proportion to the significance of the paper. This is not how we should be promoting science; in fact, it is just the kind of press release that can torque other workers in the field. GG’s view is that scientists need to control their message–not only in their papers but in the press releases they contribute to.
As an aside, how believable is this interpretation?
GG was recently dismayed by student “error analyses” in some reports that simply amounted to “well, we could have made a mistake”. As awful as these are, they are better than some of what is published in the professional literature these days.
We have so much data, so many big computers, so many clever coders that we can crunch and process huge datasets and then, in the end, the answer emerges. There it is, usually in blue and red, the world beneath our feet! Ta-da!
But wait. One big new model says the world at this point is red, but another says it is blue. Which is it? How are we to believe one or the other? All too often, a new model says nothing about why it is better or more believable than a previous model. In essence what you want is an error bar. Good luck finding that in a typical tomography paper, or a numerical modeling paper. Error bars are out of fashion.
This is worth a little investigation…
Ah, fall is in the air and so it is a perfect time to be grumpy. Today it is about mistaking a model assumption for a model result, and our candidate for proving the point is the art of balancing cross sections.
Long ago, cross sections were drawn to, well, look like geologists thought they might look, without too much worry about whether they made any sense. That was of course silly, and over time some hardy souls wondered if you could take a cross section and treat it like a jigsaw puzzle, slicing it up on all the faults and unbending all the folds and then recovering something that looked reasonable as a starting state. Formalizing such sections provided rules, such as requiring that the length of a bed stay constant as you undid deformation, or that the area of a geologic unit be preserved. While this allowed one to see if a section might be possible, it didn’t make it easy to construct a section that would work out.
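Those balancing rules lend themselves to a simple check. Here is a toy sketch (the coordinates are made up for illustration, not taken from any real section) that tests line-length conservation for a bed and computes a unit’s area for an area-balance test:

```python
import math

def polyline_length(pts):
    """Total length of a bed traced as (x, z) points along a section."""
    return sum(math.dist(p, q) for p, q in zip(pts, pts[1:]))

def polygon_area(pts):
    """Shoelace area of a geologic unit drawn as a closed (x, z) polygon."""
    s = 0.0
    for (x1, z1), (x2, z2) in zip(pts, pts[1:] + pts[:1]):
        s += x1 * z2 - x2 * z1
    return abs(s) / 2.0

# Hypothetical restored bed: flat, 10 km long.
restored = [(0.0, 0.0), (10.0, 0.0)]
# The same bed folded over a thrust ramp (made-up geometry, km).
deformed = [(0.0, 0.0), (4.0, 0.0), (7.0, 3.0), (8.0, 3.0)]

len_restored = polyline_length(restored)
len_deformed = polyline_length(deformed)
misfit = abs(len_deformed - len_restored) / len_restored
print(f"restored {len_restored:.2f} km, deformed {len_deformed:.2f} km, "
      f"line-length misfit {misfit:.1%}")
```

Here the deformed trace comes out about 8% shorter than the restored bed, so this made-up section fails the line-length test; balancing a section means adjusting the geometry until misfits like this (and the corresponding area misfits) go to zero.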
In the late 1970s and early 1980s, John Suppe developed a geometrical approximation for deformation in fold-and-thrust belts he termed fault-bend folding, a methodology that allowed for the construction of balanced cross sections from primary geologic observations directly rather than through some trial-and-error process. Since then, the approach has had numerous adjustments and extensions made to it, but it still is the basis for most geologic cross sections made today. As such, it was a major step forward.
So what is the problem? As with many useful tools, it is in the approximations necessary to make the tool easily wielded.