Citation Statistics Smackdown

Sorry, it isn’t that dramatic. But in updating various web tools, GG noticed dramatic differences in his supposed citation counts between Google Scholar and Web of Science. In the past he has assumed the difference was because Google was capturing junk citations, but today he decided to actually look at what is going on in detail. Which may or may not interest you, dear reader….

The raw starting point for Web of Science is here, and the one for Google is here. At the very top, GG’s h index is 21 with Web of Science, 27 with Google (a significant difference for those who love those things, just a numerical quirk for others). The most highly cited paper has 252 citations from WoS but a staggering 338 in Google. Although this is tedious to work through, there is clearly a lot of fodder for comparison, so let’s dive in.

An oddity of Google’s citation listing comes into focus quickly: sorting on date only yields the last 15 papers.

Google overestimates citations in at least one situation: it repeated the citation to papers in the Chinese Journal of Geophysics, linking to both the English language version and the original Chinese html version of the papers. Another goofy thing is that Google will mess up from time to time and assign a citation from a previous paper in Nature to the article that starts on the same page as the citation. For instance, Google has an immunology paper citing the Zandt et al. tectonics paper. Google does end up with some number of duplicated citations: several preprints are counted along with the actual publication. Also, some Chinese and possibly Russian papers are counted twice, once as the original-language version and once as the English version.

Mostly, however, the difference is in theses and books, items Web of Science explicitly does not track. Since some theses contain papers published elsewhere, some of these are duplicates. More embarrassingly, there are some term papers on the web that are taken as citable materials.

What is the balance, though?

Of the 331 references identified overall, only 5 in Web of Science were not in Google.  Two were chapters in the Treatise on Geochemistry, two others were in GSA Special Paper 456, and the last was a G^3 article. So of the remaining 326, 247 were in WoS and so 79 more are in Google. Since 338-326=12, there are 12 outright duplicate entries in Google; what of the 79 other additional entries?

Five did not cite the Zandt et al. paper at all; these were outright mistakes. Combined with the 12 duplicate entries, 17 of the 338, or about 5%, of the Google citations are simply wrong. The duplicates are sometimes multiple language versions of the same paper, or a preprint showing up as a separate item. The other 74 additional entries break down as follows:

  • Theses: 28
  • Books: 16 (including 8 from GSA Memoir 212, which WoS should have had)
  • Foreign language (Chinese and Russian): 12 (Some of which might be duplicates or not even cite the paper at all)
  • “News” Journals (GSA Today, Eos): 6
  • Real journals missed by WoS: 6 (which, if you add the 8 from GSA Memoir 212, are 14 references that WoS should have had).
  • Miscellaneous: 6. A term paper was in there, a meeting abstract, an in press paper.
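For anyone who wants to check the bookkeeping, here is a minimal Python sketch of the tally above. The numbers are copied from the post; the variable names are made up for illustration, and this is only a consistency check, not how the counts were originally gathered.

```python
# Counts copied from the text; all variable names are illustrative.
google_total = 338   # citations reported by Google Scholar
wos_total = 252      # citations reported by Web of Science
unique_refs = 331    # distinct references identified overall
wos_only = 5         # in WoS but missing from Google

# Google-only entries by category, as listed above.
extra_categories = {
    "theses": 28,
    "books": 16,
    "foreign language": 12,
    "news journals": 6,
    "real journals missed by WoS": 6,
    "miscellaneous": 6,
}
non_citing = 5       # entries that do not cite the paper at all

in_google = unique_refs - wos_only            # distinct references Google has
duplicates = google_total - in_google         # repeated Google entries
in_both = wos_total - wos_only                # references found by both
google_only = in_google - in_both             # references only Google has
categorized = sum(extra_categories.values())  # Google-only entries with a category

# The categories plus the non-citing entries should account for every
# Google-only reference.
assert categorized + non_citing == google_only

wrong_fraction = (duplicates + non_citing) / google_total
print(f"duplicates={duplicates}, google_only={google_only}, "
      f"categorized={categorized}, wrong={wrong_fraction:.1%}")
```

The assertion passing is the useful part: the six categories and the five non-citing entries together cover all of the Google-only references.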

Which do you take to be more accurate? The 252 in WoS should clearly be at least 258 and probably over 260 with the GSA volumes that are supposed to be counted these days. The 6 GSA Today and Eos science articles probably deserve inclusion, though the Eos articles are shakier. On the other side, the 338 reported by Google should be no higher than 320 (338 – 17 – 6 + 5). Theses are something interesting in this count, as they represent some kind of original research, but these days most thesis work worth anything is published. If you take that view we are down to 292, 26 above the 266 WoS probably should have had.

This leaves as seriously gray at least 8 books, 12 foreign language papers, and the 6 news journals. So arguably the uncertainty on a citation count is in the 10-20% range.  If we say the correct number is 279 +/-13, the 252 of WoS is 27 low and Google is 59 high.
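The bounds in the last two paragraphs can be reproduced with a few lines of Python. This is a sketch under one assumption that the text does not spell out: in "338 – 17 – 6 + 5", the 6 is read as the miscellaneous items and the 5 as the references only WoS found. The variable names are hypothetical.

```python
wos_reported = 252
google_reported = 338

# Lower bound: WoS plus what it arguably should have had
# (6 real journal articles it missed, 8 chapters from GSA Memoir 212).
wos_adjusted = wos_reported + 6 + 8

# Upper bound: Google minus the 17 wrong entries and 6 miscellaneous items
# (assumed reading), plus the 5 WoS-only references Google missed.
google_cleaned = google_reported - 17 - 6 + 5

# Dropping the 28 theses as well.
google_no_theses = google_cleaned - 28

midpoint = (wos_adjusted + google_no_theses) / 2
half_range = (google_no_theses - wos_adjusted) / 2

print(f"bounds: {wos_adjusted} to {google_no_theses}")
print(f"estimate: {midpoint:.0f} +/- {half_range:.0f}")
print(f"WoS is {midpoint - wos_reported:.0f} low, "
      f"Google is {google_reported - midpoint:.0f} high")
```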

What does this mean, aside from apparently we can’t even count integers? Perhaps a first-cut approach to a closer approximation of a “true” measure of citations would be to go a third of the way from the WoS number to the Google number (true = WoS + (Google – WoS)/3, or equivalently true = (2/3)·WoS + (1/3)·Google).
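As a rough illustration, the one-third rule looks like this in Python. The function name and the `weight` parameter are invented for this sketch; the rule itself is just the first-cut guess described above, based on a single paper.

```python
def blended_citations(wos: int, google: int, weight: float = 1 / 3) -> float:
    """Move `weight` of the way from the WoS count toward the Google count."""
    return wos + (google - wos) * weight


# Applied to the paper discussed above.
estimate = blended_citations(252, 338)

# The two forms given in the text are algebraically the same.
alternate = (2 / 3) * 252 + (1 / 3) * 338

print(f"blended estimate: {estimate:.1f}")
```

For this paper the rule gives a value between 280 and 281, within a couple of citations of the 279 worked out by hand.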



3 responses to “Citation Statistics Smackdown”

  1. Paul Braterman says :

    And then, of course, there’s the count on ResearchGate


  2. cjonescu says :

    Well, to give a quick idea, ResearchGate claims 285 citations to this Zandt et al. paper, which in some ways ironically comes close to what was given above. But ResearchGate makes it painful to actually access all of those to do an analysis like that above (one of those a year is quite enough).


