Citation Statistics Smackdown

Sorry, it isn’t that dramatic. But in updating various web tools, GG noticed large differences in his supposed citation counts between Google Scholar and Web of Science. In the past he had assumed the difference arose because Google was capturing junk citations, but today he decided to actually look at what is going on in detail. Which may or may not interest you, dear reader….

The raw starting point for Web of Science is here, and for Google is here. At the very top, GG’s h index is 21 with Web of Science, 27 with Google (a significant difference for those who love those things, just a numerical quirk for others). The most highly cited paper has 252 citations from WoS but a staggering 338 in Google. Although this is tedious to work through, there is clearly a lot of fodder for comparison, so let’s dive in.

An oddity of Google’s citation listing comes into focus quickly: sorting on date only yields the last 15 papers.

Google overestimates citations in at least one situation: it repeated the citation to papers in the Chinese Journal of Geophysics, linking to both the English-language version and the original Chinese HTML version of the papers. Another goofy thing is that Google will mess up from time to time and assign a citation made by the preceding paper in Nature to the article that starts on the same page as that citation. For instance, Google has an immunology paper citing the Zandt et al. tectonics paper. Google does end up with some number of duplicated citations: several preprints are counted along with the actual publication. Also, some Chinese and possibly Russian papers are counted twice, once in the original language and once in English.

Mostly, however, the difference is in theses and books, items Web of Science explicitly does not track. Since some theses contain papers published elsewhere, some of these are duplicates. More embarrassingly, there are some term papers on the web that are taken as citable materials.

What is the balance, though?

Of the 331 references identified overall, only 5 in Web of Science were not in Google. Two were chapters in the Treatise on Geochemistry, two others were in GSA Special Paper 456, and the last was a G^3 article. Of the remaining 326, 247 were also in WoS, leaving 79 found only by Google. Since 338-326=12, there are 12 outright duplicate entries in Google; what of the 79 other additional entries?
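For anybody who wants to check the bookkeeping, here is a minimal Python sketch of the overlap arithmetic. The numbers are the ones quoted above; the variable names are invented purely for illustration.

```python
# Counts quoted in the post for the Zandt et al. paper.
unique_total = 331   # distinct citing items found across both services
wos_total = 252      # citations reported by Web of Science
google_total = 338   # citations reported by Google Scholar
wos_only = 5         # in WoS but missing from Google

google_unique = unique_total - wos_only            # distinct items Google found
in_both = wos_total - wos_only                     # items both services found
google_only = google_unique - in_both              # items only Google found
google_duplicates = google_total - google_unique   # repeated Google entries

print(google_unique, in_both, google_only, google_duplicates)
# prints: 326 247 79 12
```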

Five did not cite the Zandt et al. paper at all; these were outright mistakes. Combined with the 12 duplicate entries, 17 of the 338, or about 5%, of the Google citations are simply wrong. The duplicates are sometimes multiple language versions of the same paper, or a preprint showing up as a separate item. The other 74 additional entries break down as follows:

  • Theses: 28
  • Books: 16 (including 8 from GSA Memoir 212, which WoS should have had)
  • Foreign language (Chinese and Russian): 12 (some of which might be duplicates or might not cite the paper at all)
  • “News” Journals (GSA Today, Eos): 6
  • Real journals missed by WoS: 6 (which, if you add the 8 from GSA Memoir 212, are 14 references that WoS should have had).
  • Miscellaneous: 6, including a term paper, a meeting abstract, and an in-press paper.
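The tallies in this list can be cross-checked against the 79 Google-only entries. What follows is just a sketch with made-up names, using the counts given in the post.

```python
# Google-only entries by category, as tallied in the list above.
google_only_by_category = {
    "theses": 28,
    "books": 16,
    "foreign language": 12,
    "news journals": 6,
    "journals missed by WoS": 6,
    "miscellaneous": 6,
}
non_citing = 5       # entries that never cite the paper at all
duplicates = 12      # repeated Google entries
google_total = 338   # raw Google Scholar count

categorized = sum(google_only_by_category.values())
# The categorized entries plus the outright mistakes account for all 79.
assert categorized + non_citing == 79

wrong_fraction = (non_citing + duplicates) / google_total
print(categorized, f"{wrong_fraction:.1%}")
# prints: 74 5.0%
```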

Which do you take to be more accurate? The 252 in WoS should clearly be at least 258 and probably over 260 with the GSA volumes that are supposed to be counted these days. The 6 GSA Today+Eos science articles probably deserve inclusion, though the Eos articles are shakier. On the other side, the 338 reported by Google should be no higher than 320 (338 – 17 – 6 + 5). Theses are an interesting case in this count, as they represent some kind of original research, but these days most thesis work worth anything is published. If you take that view we are down to 292, 26 above the 266 WoS probably should have had.
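Spelled out step by step, the adjustments look roughly like this. It is a sketch only; the variable names are hypothetical, and which items to add or drop is of course the judgment call being argued over.

```python
wos_reported, google_reported = 252, 338

missed_journals = 6   # real journal papers WoS missed
gsa_memoir = 8        # GSA Memoir 212 chapters WoS should have had
wrong = 17            # Google duplicates plus non-citing entries
miscellaneous = 6     # term paper, meeting abstract, in-press paper, etc.
wos_only = 5          # WoS citations that Google missed
theses = 28

wos_floor = wos_reported + missed_journals       # at least this many
wos_adjusted = wos_floor + gsa_memoir            # with the GSA volume
google_ceiling = google_reported - wrong - miscellaneous + wos_only
google_no_theses = google_ceiling - theses       # if theses are dropped

print(wos_floor, wos_adjusted, google_ceiling, google_no_theses)
# prints: 258 266 320 292
```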

This leaves as seriously gray at least 8 books, 12 foreign language papers, and the 6 news journals. So arguably the uncertainty on a citation count is in the 10-20% range.  If we say the correct number is 279 +/-13, the 252 of WoS is 27 low and Google is 59 high.
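The 279 +/-13 is simply the midpoint and half-width of the range between the adjusted WoS count (266) and the thesis-free Google count (292). A small sketch, with names invented for illustration:

```python
low, high = 266, 292   # adjusted WoS count and thesis-free Google count
gray = {"books": 8, "foreign language": 12, "news journals": 6}

# The gray items are exactly what separates the two adjusted counts.
assert sum(gray.values()) == high - low

estimate = (low + high) / 2     # midpoint of the range
half_range = (high - low) / 2   # the quoted uncertainty
wos_shortfall = estimate - 252  # how far WoS falls below the estimate
google_excess = 338 - estimate  # how far Google sits above it

print(estimate, half_range, wos_shortfall, google_excess)
# prints: 279.0 13.0 27.0 59.0
```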

What does this mean, aside from apparently we can’t even count integers? Perhaps a first-cut approach would be to get a closer approximation to a “true” measure of citations by going a third of the way from the WoS number to the Google number (true = WoS + (Google-WoS)/3, or equivalently true = 2/3(WoS) + 1/3(Google)).
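As a rough illustration of that rule of thumb, here is a tiny Python function (the name blended_estimate is made up for this sketch). Applied to the Zandt et al. numbers it lands close to the 279 estimated above, though one paper is hardly a calibration.

```python
import math


def blended_estimate(wos: float, google: float) -> float:
    """Go one third of the way from the WoS count to the Google count."""
    return wos + (google - wos) / 3


value = blended_estimate(252, 338)

# The two ways of writing the formula agree.
assert math.isclose(value, (2 / 3) * 252 + (1 / 3) * 338)

print(round(value, 1))
# prints: 280.7
```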


6 responses to “Citation Statistics Smackdown”

  1. Paul Braterman says :

    And then, of course, there’s the count on ResearchGate


  2. cjonescu says :

    Well, to give a quick idea, ResearchGate claims 285 citations to this Zandt et al. paper, which in some ways ironically comes close to what was given above. But ResearchGate makes it painful to actually access all of those to do an analysis like that above (one of those a year is quite enough).


  3. Torbjörn Björkman says :

    I just thought I’d say that these numbers look very much like what I saw when looking into this a while back, Scholar ~10-20% higher than WoS. For me (in physics), a fairly large chunk of the discrepancy is stuff on arXiv, but those mostly just amount to getting the citations a few months earlier than in WoS. The rest is mostly from theses, then roughly evenly split between “bilingual duplicates”, conference abstracts and things in Chinese, Japanese and Russian that I have no idea what they are, but which seem to be formatted according to some of the previous categories. Apart from that, a very small number of obvious mistakes.

    For what it’s worth, I think that both theses and conference proceedings are perfectly legit to count, if what you hope to capture is something like “at-least-somewhat-vetted scientific discussion” (a thesis would in fact be reviewed quite a lot harder than a journal paper around here [=Scandinavia]). The bilingual duplicates are wrong of course, but on the whole I think they are a small price to pay if it helps us get around the huge English bias in other databases.


    • cjonescu says :

      Thanks for the insight; in a way I am surprised it is somewhat similar in physics as I would expect fewer book-based citations. Theses might be moving into a less gray literature position (actually might write a bit on that), but there can be a huge variation in quality. There are MS theses that are more BS (and I don’t mean Bachelor of Science) while some unpublished PhDs might be better science than some minimum publishable unit stuff.

      Conference abstracts/proceedings in geoscience do not count for anything anymore, but that is very different in other fields (e.g., engineering). One of the ironies of citation counts is that the significance of the things being counted varies from discipline to discipline (for instance, WoS type numbers for history would make no sense given the far greater prevalence of books), but the numbers are most greatly relied upon by those outside an individual field (e.g., a university-level promotion committee).


Trackbacks / Pingbacks

  1. Changing Shades of Gray | The Grumpy Geophysicist - August 23, 2017
