Bookworm: Ngrams Meet the (Open) Library Catalog

Google's Ngram Viewer promised some interesting insights into a subset of the books that the company had digitized. The tool offered an interactive visualization of a dataset containing more than 500 billion words from some 5.2 million books. By querying the Ngram Viewer, you can see how the usage of a word or phrase changes over time.

A new tool called Bookworm, released by Harvard's Cultural Observatory, offers another way to interact with digitized book content and full text search. Bookworm doesn't rely on the Google digitization efforts, but rather uses books in the public domain. It is also less concerned with tracking the history of a word or phrase than with enabling searches of other library metadata, including genre, author information, publication place and date.

As Ben Schmidt, a member of the team working on the project, describes it, "Bookworm in fact straddles the space between something like Ngrams and the more traditional library catalog."

Ngrams: An Incomplete Picture

I still remember the day when Google launched its Ngram Viewer late last year. I was writing for ReadWriteWeb back then, and I recall that few news sites initially picked up on the news. You know how it goes with those folks: "Oh. That looks scholarly. Yawn. Let's write about group messaging or group buying apps instead! Whee!" I remember sending an email to Dan Cohen, director of the Center for History and New Media at George Mason University, asking "This seems pretty cool, amirite?"

I think it was probably Alexis Madrigal from The Atlantic who started writing some incredibly fun posts about Ngrams -- as he is wont to do -- and then other news sites realized that, indeed, having a tool that lets you visualize some of the data from Google's book digitization efforts is pretty cool.

But then there came the pushback about the Ngram Viewer from some humanities scholars who questioned a quantitative approach to the study of literature, who pointed out the problems with the OCR technology that underlies the digitization efforts, and who questioned how much we can really learn from just this limited glimpse into part of the (limited) Google Books corpus. Tim Carmody, no surprise, wins the headline award here for his Snarkmarket piece "Google Ngrams F---ing Sucks."

The Problems Bookworm Solves

Despite the ease with which Ngrams purports to let users glean insights from the history of published words, it's pretty clear that it's not a complete (or completely accurate) tool. Yet the idea of this sort of search-plus-visualization is really compelling.

Bookworm builds on this visualization, but does so with a much richer sense of how libraries, metadata, and texts are interconnected. It feels as though it moves closer to the ways in which we use the library stacks -- you search for a subject or book; you go to that shelf; you grab that book and then you browse what's nearby. As our reading and research habits become more digital themselves, these sorts of discovery tools are crucial.

And Bookworm also points to the possibility of uncovering things that we can probably only know through these types of data-oriented projects. It suggests there's a great deal to be learned when people get to create their own searches rather than simply rely on the categorization and organization of the print-bound library.

The Bookworm project uses data from the Open Library and Internet Archive (that means not only that the content is open but also that you can make corrections to the data when you spot OCR errors). Bookworm is still in alpha but has been submitted to the Digital Public Library of America beta sprint.

Audrey Watters

Credits

Hack Education

The History of the Future of Education Technology