The Stupidity of Searches

Not unexpected, but not what you wanted to hear: a study on how abstract construction affects citations found that abstracts that are longer, use more buzzwords and adjectives, and are generally more obtuse are more highly cited.  In geology, for instance, abstracts built of short sentences with few words and few buzzwords drew fewer citations than abstracts with long sentences, lots of words, and more words you can’t find in the dictionary. Even sadder, adding hyperbole to the abstract clearly helped. The single biggest effect was the number of words in the abstract, which had a 7% effect on citations in geology. (Thanks to ScienceDaily and Retraction Watch for pointing to this.)

The cause, the authors guess (and GG concurs), is the search engine.  There are two parts to this. One is that most search engines only have the keywords and abstract (and many, by default, will search only the title and abstract). The other is that in skimming the results of a search, many scientists will pick up on magic words relevant to what they are looking for. How do we fix this?

One possibility is that we all start writing the most bloated, self-congratulatory abstracts possible.  Yeech.

Another solution is to go to full text search.  For instance, GeoScienceWorld will do full text searches on papers that have been uploaded with text (some older papers are just scans). However, full text searches tend to pick up a lot of cruft: if you search on, say, Sierra Nevada, you might hit papers that merely cite a paper with Sierra Nevada in the title, so you are getting a hit on their bibliography.  You might have to try a few phrases to get what you want (and heaven help you if there is no phrase search; this has been a headache with Papers, for instance). It is quite possible that, even with full text search, scientists are still going to want to search on titles and abstracts.
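One way to cut down on the bibliography cruft would be to drop the reference list before matching. Here is a minimal sketch in Python, assuming the reference list starts at a line reading "References", "References Cited", "Bibliography" or "Literature Cited". Those heading names and both function names are GG's assumptions for illustration, not how GeoScienceWorld or any other service actually works.

```python
import re

# Assumed heading names that mark the start of a reference list.
REFERENCE_HEADING = re.compile(
    r"^\s*(references( cited)?|bibliography|literature cited)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def body_without_references(full_text):
    """Return the text before the last reference-list heading, if any."""
    matches = list(REFERENCE_HEADING.finditer(full_text))
    if not matches:
        return full_text
    return full_text[: matches[-1].start()]


def phrase_hit(full_text, phrase):
    """True if the phrase occurs in the body, ignoring the reference list."""
    return phrase.lower() in body_without_references(full_text).lower()
```

A paper that only cites a Sierra Nevada title would then no longer count as a hit, though papers with oddly labeled reference sections would slip through.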

A solution GG pitched more than 10 years ago was the development of individual Bayesian databases similar to what spam filters like SpamSieve use. You might pick out 3 or 4 papers that address the stuff you are interested in and ask for the most similar papers in a database.  This would use everything in the papers in question to identify similarity.  So if, say, your papers were all on the Sierra Nevada but covered both paleomag and structural geology, you would tend to get Sierra Nevada-ish papers.  And if you trained the filter (much as you train a filter in SpamSieve), it would rather quickly get to the heart of things and return interesting results. Right now the closest you can get to this is to use the common citations tools in Web of Science (which, by the way, is a vastly underutilized tool).
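To make the idea concrete, here is a rough sketch in Python of what such a filter might do: score each paper in a database by how much more likely its words are under the "papers I picked" model than under the database as a whole. This is a naive-Bayes-style log-odds score with add-one smoothing, chosen by GG for illustration. It is not SpamSieve's actual algorithm, and the function names are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_by_similarity(seed_texts, corpus):
    """Rank corpus papers by a naive-Bayes-style log-odds score.

    seed_texts: list of full texts the reader marked as interesting.
    corpus: dict mapping a paper id to its full text.
    Returns paper ids, most similar first.
    """
    seed_counts = Counter(w for t in seed_texts for w in tokenize(t))
    background = Counter(w for t in corpus.values() for w in tokenize(t))
    vocab_size = len(set(seed_counts) | set(background))
    seed_total = sum(seed_counts.values())
    background_total = sum(background.values())

    def log_odds(word):
        # Add-one smoothing keeps unseen words from producing log(0).
        p_seed = (seed_counts[word] + 1) / (seed_total + vocab_size)
        p_back = (background[word] + 1) / (background_total + vocab_size)
        return math.log(p_seed / p_back)

    scores = {}
    for paper_id, text in corpus.items():
        words = tokenize(text)
        # Average per word so long papers are not favored just for length.
        scores[paper_id] = sum(log_odds(w) for w in words) / max(len(words), 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Training the filter would amount to adding each paper you mark as interesting to `seed_texts` and ranking again. A fuller version would also keep a "not interesting" pile, the way a spam filter keeps both spam and good mail.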

Even this won’t be perfect.  Terminology varies with time.  “Core complexes” used to be “gneiss domes” in the older literature (but gneiss domes today need not be core complexes).  “Relamination” is replacing “underplating”. Digital databases are still fairly new and there should be lots more tricks out there.  Google, for instance, knows that if I am searching for “10.9 applications” it should also look for “Mavericks applications”, as MacOS 10.9 is also known as Mavericks; more specialized search engines could also comprehend synonyms.
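A specialized search engine could handle this with something as simple as a synonym table applied to the query before searching. Here is a minimal Python sketch. The table below is hypothetical and hand-built from the examples above (a real one would need curating by people who know the field), and `expand_query` is a made-up name.

```python
# Hypothetical synonym table; a real one would be curated per field.
# Each entry maps a current term to older or alternate terms.
SYNONYMS = {
    "core complex": ["gneiss dome"],
    "relamination": ["underplating"],
    "10.9": ["mavericks"],
}


def expand_query(query):
    """Return the lowercased query plus variants with synonyms swapped in."""
    query = query.lower()
    variants = [query]
    for term, alternates in SYNONYMS.items():
        if term in query:
            # Plain substring replacement; good enough for a sketch.
            variants.extend(query.replace(term, alt) for alt in alternates)
    return variants
```

Here `expand_query("10.9 applications")` gives both "10.9 applications" and "mavericks applications". The table deliberately runs one way only: a search on core complexes also looks for gneiss domes, but not the reverse, since gneiss domes today need not be core complexes. Even so, the expanded search will pull in some modern gneiss dome papers that are not what you wanted.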


