Utopian Paper of the Future

NSF is essentially sponsoring a Geoscience Paper of the Future initiative that is now touring around the country trying to get geoscientists to make papers of the future.  A lot of what this initiative addresses amounts to significant problems in earth science that need correcting.  But as often happens with such idealistic visions, somebody grabs the steering wheel and veers off the road and into the weeds and as a result leaves the rest of us unsure if this bandwagon will really go anywhere.

OK, first a quick summary: the idea is in essence one championed long ago by Stanford’s Jon Claerbout, namely the reproducible paper. The idea was that a “paper” would actually be something of a metadocument containing all the data and processing steps needed to make the paper from scratch.  You could ideally go in and tweak a few data points in a primary data table (or maybe add some) and then push a button and see the paper change before your eyes.  The beauty of this is that you can follow the data all the way from the source to the end result.  The main difference between this vision (which, by the way, was actually put into practice) and the Geoscience Paper of the Future is that the GPotF recognizes distributed placement of the pieces.  It isn’t that you actually get everything in that one paper, it is that the paper tells you where all the pieces are.

Sounds great, and to the degree that researchers can create it, more power to them.  But there are some huge hurdles, and the GPotF team doesn’t really prioritize them.  The result: grumpy geophysicists….who might just forget the whole thing.

If this is so ideal, what is the problem?  Consider what all the elements of this vision are:

  • Data requirements: Making data available in a public repository, including documented metadata, a clear license specifying conditions of use, and citable using a unique and persistent identifier.

  • Software requirements: Making software available in a public repository, with documentation, a license for reuse, and a unique and citable persistent identifier. This includes not only modeling software, but also other ancillary software for data reformatting, data conversions, data filtering, and data visualization.

  • Provenance requirements: Documenting the provenance of results by explicitly describing the series of computations and their outcome in a workflow sketch, a formal workflow, or a provenance record, possibly in a shared repository and with a unique and persistent identifier.

Now one of the bugaboos of current publication can be original data–but not so much in seismology any more.  The IRIS Data Management Center hosts nearly all the research seismological data out there.  So primary data is no big issue.

What is an issue is that to get to the point where you can start to really do anything with that firehose of data, you have to trim it down.  Right now, that is in the Provenance requirement: you explain your workflow.  Here’s the thing: sometimes that workflow includes judgement. And that means that if all you have is the starting giant dataset and a filter that includes “judgement”, you have essentially blown it; other workers won’t have the same judgement.  What needs to be true is that the intermediate datasets be preserved, which isn’t so clearly noted.  However, realistic consideration of the reasons why data and provenance are included should lead to a better system for sharing what was done. So by and large this points in the right direction.
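To make the point about intermediate datasets concrete, here is a minimal sketch of what preserving a judgement-based selection could look like. Everything in it is an assumption for illustration: the station codes, the `snr` field, and the helper names `sha256_of` and `provenance_record` are hypothetical and not part of any GPotF specification or IRIS tool.

```python
import hashlib
import json


def sha256_of(records):
    """Checksum a list of records via a canonical JSON encoding."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def provenance_record(raw, kept, note):
    """Describe a hand-picked subset so it can be archived with the paper."""
    return {
        "raw_sha256": sha256_of(raw),
        "kept_sha256": sha256_of(kept),
        "n_raw": len(raw),
        "n_kept": len(kept),
        "selection": "manual",  # a judgement call, not a rerunnable filter
        "note": note,
    }


# Hypothetical trace list standing in for a much larger data request.
raw_traces = [
    {"station": "AAA", "snr": 12.0},
    {"station": "BBB", "snr": 1.5},
    {"station": "CCC", "snr": 7.3},
]

# The analyst's picks, made by eye. This list is the thing to preserve.
kept_traces = [raw_traces[0], raw_traces[2]]

record = provenance_record(
    raw_traces, kept_traces, "dropped traces judged too noisy"
)
print(json.dumps(record, indent=2))
```

The record only documents that a manual choice happened; a later worker still needs the kept subset itself archived alongside it, since no script can replay the judgement.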

Where this initiative goes off the road is in the middle bullet, the software requirement.  Consider that in order to satisfy this, you cannot use any commercial software with a typical license.  No Photoshop, Illustrator, Mathematica, Matlab, Excel, Word, etc. Really?  Does this matter? If all you are doing is using nice tools to make things pretty, it is insane to add this requirement.  Basically, is presentation software an essential element of a reproducible paper? No.  So don’t make it a requirement (we’ll come back to why we might view these standards as requirements).

It is potentially even worse. There are commercial pieces of software that do important analyses (potential field analysis, processing reflection profiles, etc.); there is no way you can provide that software.  What is more, unless the code is fully scriptable and you decide to create the script to make a figure, providing any modern GUI software package doesn’t fill the need envisioned–you can’t just switch it on and expect to get the same result. Satisfying this requires a wholesale abandonment of commercial software, meaning that generation of new software duplicating the characteristics of the commercial stuff is needed.  Now there certainly is a real issue sitting here, but the blanket requirement for everything is misguided and burdensome in the extreme.

It also carries other issues.  One of the pitfalls of the original reproducible paper was that it was, like it or not, OS and (to a lesser degree) hardware dependent. To minimize this, you provide data and source code in ASCII because binary files and executables are platform dependent. Yes, the papers and theses written that way were using a relatively bland Unix, but as anybody who has tried to install Linux or an open-source compiler knows, open source stuff forks like mad. The Unix you used might not match that used by others in some important respects.  Even within the same OS, slight differences in the revision history of a tool (like a compiler) could produce different results. Are you to provide a full base OS?  Claerbout’s vision in many ways parallels Bill Gates’s early fixation on providing his BASIC compiler on all platforms and later visions for Java (write once, deploy many times): while it sounds nice, the reality is that things will diverge and so you had better be prepared for that.
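Short of shipping a full base OS, one modest step is to record what the work was run on. The sketch below assumes a Python-based workflow and uses only the standard library; the function name `environment_snapshot` and the choice of fields are illustrative, not a standard.

```python
import json
import platform
import sys


def environment_snapshot():
    """Collect basic facts about the machine and interpreter in use."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "byteorder": sys.byteorder,  # matters when reading binary files
    }


snapshot = environment_snapshot()
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

A snapshot like this does not stop things from diverging, but it at least tells a later worker what they would need to match.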

As a result of all that, it is potentially more burdensome for a later worker to try to get all the tools used running again than it is for the author to provide them all in open-source (or open-use) format.  You might be tweaking things endlessly, changing compiler flags, looking for older versions of OSes, etc. Is that really helpful if all you are recovering is an x-y plotting routine?

So you probably want to limit software to a subset, probably the specific specialized software written for that paper.  Otherwise you describe how things were done in other software with sufficient clarity (perhaps including an example) so it is easy to follow along provided that software or an equally capable substitute is available. And some things, like making basic plots, don’t need a description at all.

Now why sometimes call the concepts here requirements or standards?  Because it is likely that this is where NSF wants to go.  After all, in answering to Congress it would be great to say “look! everything you paid for is out there and is totally reproducible”.  Never mind if it costs twice as much to put out that way–that is not NSF’s problem (in fact, it isn’t hard to imagine them seeking more funding to support just that). Never mind if the cost to reproduce is excessive. And you can bet that if NSF shows that this is something that can be done, somebody in Congress might suggest that this is how it should always be done. So it is worth wrestling with these concepts now, before it seems too good to pass up and before the people who avoid it for good reason get blindsided in having to do it.  For although current papers have too little of this, it is quite possible that the GPotF will have too much.

