Saturday, May 30 • 9:00am - 12:00pm
Workshop on Text Analytics with the HathiTrust Research Center: An Introduction to Tools for Working with Digitized Text and Metadata

This workshop is intended for a broad audience, ranging from curious graduate students exploring digital humanities to experienced text mining researchers. The availability of large corpora of digitized text from the world’s research libraries has the potential to transform research in the humanities. Scholars in traditional areas, and not only those who specialize in digital humanities, can benefit from the resources and tools now becoming available in this field. The HathiTrust Digital Library (HTDL) is one of the premier resources for textual corpora. Its growing collection, drawn from some of the world’s foremost research libraries, currently consists of over thirteen million volumes of digitized text and their associated bibliographic metadata. Such an extensive corpus makes it possible to scale up inquiry and to ask new kinds of research questions. The HathiTrust Research Center (HTRC), the HathiTrust’s research-oriented affiliate, has been developing sophisticated computational tools, including ones that support textual analytics even when copyright restrictions prevent scholars from accessing the full text.

The workshop will provide a hands-on introduction to the HTDL collection and its metadata, and to the HTRC tools and functionalities that build on these resources. Using the HTRC tools as concrete examples, the workshop will orient attendees to the new challenges and opportunities that algorithmic text analysis at such a large scale presents to researchers. The workshop will cover the Secure Hathi Analytics Research Commons (SHARC), the HathiTrust+Bookworm (HT+BW) tool, and the HTRC Extracted Features Dataset. Attendees will be shown how to build their own worksets (small, customized subcorpora drawn from the HathiTrust Digital Library corpus) and how to conduct analyses on them. There will also be a group discussion among all attendees about the questions these developments are likely to raise in their own fields, and about how the developments can affirm or disrupt established practices of inquiry, or do both at once.


Sayan Bhattacharyya

CLIR Postdoctoral Research Fellow, University of Illinois at Urbana-Champaign

Main Library, 3 West Instruction Room

