This week, references to a couple of articles:
I recently read a post on the Language Log blog about the “train wreck” that is the metadata for Google Books. I won’t repeat the examples, but the post has plenty to show that the metadata are a mess, particularly for older books. Google’s chief engineer said the dates were supplied by libraries, but the post’s author, Geoff Nunberg, believes the errors result from dates being derived erroneously from OCR text. Then there are classification errors as well (for example, an edition of Moby Dick is classed under Computers). Google tried to blame libraries for these errors too, but the categories are not those used by libraries. I’ll leave you to read the post, but it makes for pretty horrific reading.
Then just yesterday I read this article in Library Journal about Google Scholar’s problems (its first paragraph happens to mention the blog post above). Again, it seems, Google did not use the perfectly good metadata on offer from experts but decided to rely on “smart” crawlers instead. The resulting problems include phantom citations, inflated numbers and lost authors (to name a few!).
While Google has fixed some of the problems, it seems evident that correcting the rest is not a priority.
It amazes me (or perhaps it shouldn’t) that Google appears to have such a lackadaisical attitude towards the accuracy of its results.