Slices and Ad Hoc Collections

June 28th, 2010

In the Center for History and New Media’s study of text mining techniques and their utility for historical work, funded by the National Endowment for the Humanities, we have been trying out different use cases and techniques on focus groups of historians. As part of that process we have settled on two major kinds of collections and related methodologies that we intend to explore further as part of this Digging into Data collaboration: text mining on “slices” of individual corpora that meet certain criteria, and “ad hoc” research libraries assembled from disparate online collections.

The first kind of collection to mine involves a humanities researcher asking for all of the documents from a corpus with particular metadata or full-text keyword matches. The second uses a personal assemblage of individual documents, generally on a common topic, drawn from a wide range of collections. In Zotero, the tool being used for these tests, the former is equivalent to a “saved search” on a collection (i.e., all matching items from a Boolean search of a collection); the latter, the folders that a Zotero user can set up in the interface to organize their library into subsections manually.

CHNM has begun to look at passing both the slices and the ad hoc text corpora to analytical and visualization services. For slices, our assumption has been that Zotero cannot pull down the entirety of the corpus, but will instead pass a token, URI, or query string that references the text the analytical service needs, which will then be delivered directly from the collection to the text-mining service. For ad hoc collections, Zotero can bundle the text itself and send it from the client.
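To make the distinction concrete, here is a minimal sketch of what the two kinds of request might look like. It is an illustration only: the class names, field names, example URI and query syntax are all hypothetical, and none of this describes Zotero's actual API or that of any analytical service.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class SliceRequest:
    """Hypothetical request for a slice: carries only a reference.

    The analytical service would use the URI and query to fetch the
    matching texts directly from the collection, not from the client.
    """
    collection_uri: str  # assumed: where the corpus lives
    query: str           # assumed: the saved search that defines the slice
    kind: str = "slice"


@dataclass
class AdHocRequest:
    """Hypothetical request for an ad hoc collection: carries the texts."""
    documents: List[str]  # full texts bundled by the client
    kind: str = "ad_hoc"


def to_payload(request) -> str:
    """Serialize either request type to a JSON string."""
    return json.dumps(asdict(request), sort_keys=True)


# Invented example values, for illustration only.
slice_payload = to_payload(
    SliceRequest("https://example.org/corpus", "offence:theft AND year:1800")
)
ad_hoc_payload = to_payload(
    AdHocRequest(["Text of the first document.", "Text of the second document."])
)
print(slice_payload)
print(ad_hoc_payload)
```

The point of the sketch is the difference in size and responsibility: the slice payload stays small however large the slice is, while the ad hoc payload grows with the texts the client bundles.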

For slices, pre-processing of the text by analytical services may have significant advantages, e.g., to establish normalized compression distance and clusters of similarly themed documents. For ad hoc collections, pre-processing of texts is likely less helpful (since the historian has already clustered them for a reason). However, the need rises for more fine-grained text mining services, such as word usage analysis or entity extraction. For instance, Zotero recently added a popular mapping service that takes advantage of place-name extraction from full texts within an ad hoc collection.
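For readers unfamiliar with normalized compression distance (NCD), the sketch below shows the usual formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. The choice of zlib as the compressor and the sample sentences are assumptions made for illustration; the project's services may use a different compressor or implementation.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 suggest the texts share a lot of content; values
    near 1 suggest they share little. zlib is an assumed compressor.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented sample sentences, not quotations from any corpus.
trial_text = (
    "The prisoner was indicted for feloniously stealing a silver watch, "
    "value forty shillings, the goods of the prosecutor, from his dwelling house."
)
other_text = (
    "A researcher can collect citations, attach full texts, and organise "
    "a personal library into folders that are shared with a working group."
)

print(ncd(trial_text, trial_text))  # a text compared with itself: small
print(ncd(trial_text, other_text))  # unrelated texts: noticeably larger
```

A clustering service could compute this distance for every pair of documents in a slice and group those with small pairwise values. Very short texts give noisy results, because the compressor's fixed overhead dominates.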

The Old Bailey in Numbers

June 18th, 2010

Datamining is about discovering patterns in text, but the Old Bailey Proceedings already incorporates tagged data reflecting what contemporaries thought they were doing. The nature of the crime, the name, gender and age of the defendant, the verdict and punishment were described in words their authors thought beyond misinterpretation. To use datamining to find new patterns, it would help if we could subtract the patterns that we already know about. The huge rise in theft prosecutions in the first half of the nineteenth century, the changing proportion of men and women prosecuted, the evolving nature of the crime itself: each needs to be interrogated to illustrate where changes in language can be explained as the result of changing judicial practice, and where these changes suggest a new and different explanation.

Datamining with Criminal Intent

April 9th, 2010

Background

Over the past few decades scholars have increasingly used court records to illuminate historical themes in novel ways. The published Proceedings of the Old Bailey have been a fertile source for scholars working in these varied traditions, allowing them to use both qualitative and quantitative approaches to the evolution of the criminal justice system, of interpersonal relationships and of human behaviour more generally. But, despite the fact that 120 million words of court transcripts published in the Proceedings are now available online in a structured and searchable form, historians and humanist scholars continue to use these legal records in an essentially iterative and traditional manner, and largely fail to take full advantage of the variety of forms of analysis that the Proceedings’ online format allows. At the same time, in Zotero, a popular environment for managing online scholarship has been created that allows humanists to collect, index and manipulate large amounts of text, while in TAPoR Tools, a range of facilities for the quantitative analysis of text has been piloted and tested. By bringing together in one seamless online environment the text of the Proceedings, the functionality of Zotero and the tools created by TAPoR, this project will allow scholars to take new approaches to this old source.

This project will create an intellectual exemplar for the role of data mining in an important historical discipline (the history of crime) and illustrate how the fundamental conundrums of historical research on large bodies of text that have dogged humanist research over the last forty years might be addressed. By allowing the analysis and statistical representation of the types of language used in court and how it changed over time, and by comparing these “data mined” patterns to those found in tagged data, “With Criminal Intent” will achieve three things. First, a whole new way of charting changes in crime reporting and prosecution will be created; second, a new methodology for the consistent discovery of related descriptions will be benchmarked; and finally, a working model of how large corpora can be handled online and in a distributed fashion will be demonstrated. The significance of this project therefore runs beyond the discipline of the history of crime, and addresses historical scholarship more broadly, and scholarly engagement with large corpora.

Aims and Objectives

This project aims to demonstrate that greater historical rigour can be achieved, and new insights gained, through the application of data mining and statistical analysis to large bodies of primary sources such as the Proceedings of the Old Bailey. Given the availability and power of modern text mining techniques and the fact that the Proceedings have already been optimized for use with these techniques, we believe that by building on the success of previous work, this project will change the research paradigm. In the process, it will allow end users, both scholars and students, to experience the three separate components of this project (the Proceedings, Zotero and TAPoR tools) as a single seamless resource. To achieve this aim, we need to reach three specific objectives:

  1. The creation of Newgate Commons: a new form of interface for the Old Bailey Proceedings that supplements the current search interfaces. The Newgate Commons will allow scholars to use mining and clustering techniques to identify, collect and work with sets of relevant trials and related texts, and to extract them for further study with other tools. The interface will also make it easy for users to train machine learning “agents” to help identify patterns in the text (and underlying account of prosecutions and punishments) of interest to the researcher.
  2. Zotero Virtual Collections: the modification of the Zotero bibliographic reference management tool so that it can be used to manage the collections of documents created within the Newgate Commons and call upon full texts only when needed.
  3. Voyeur Analytics: the project will connect Zotero to analytical tools designed by the TAPoR project to work on large collections, including the Voyeur toolset for analysis and visualization. The emphasis throughout will be on extending existing tools as needed to allow researchers to navigate between them seamlessly and to use Zotero as a hub from which to manage large study collections. In the process we will create the potential to analyze and visualize change over time in a way that goes beyond current historical methodologies, illuminating the relationship between text and event in new ways.