Using Zotero and TAPoR on the Old Bailey Proceedings: Data Mining With Criminal Intent, JISC Offices, London, 22 February 2010.

April 16th, 2010

Datamining with Criminal Intent, 22 February 2010, slide 1

The notes for a presentation to the British partners in the ‘Digging into Data’ challenge delivered by Tim Hitchcock at a meeting held at the JISC offices in London, 22 February 2010:

Datamining with Criminal Intent:

This project combines three distinct elements and services. First, the Old Bailey Online: 120,000,000 words of double-rekeyed text. It is the largest corpus of accurately transcribed, rather than OCR'd, historical text we have, and it is also extremely heavily tagged in XML for content: gender, role, date, location, crime, etc.
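To give a sense of what this kind of content tagging makes possible, here is a minimal sketch in Python. The element and attribute names (`trial`, `person`, `offence`, `place`) and the sample record itself are invented for illustration; they are assumptions, not the actual Old Bailey schema or a real trial.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names, attributes and the record itself are
# made up for illustration and do not reflect the real Old Bailey XML.
SAMPLE = """
<trial id="t-example-1" date="1750-01-17">
  <person role="defendant" gender="female">Mary Smith</person>
  <person role="victim" gender="male">John Brown</person>
  <offence category="theft"/>
  <place>Fleet Street</place>
  <text>Mary Smith was indicted for stealing one silver watch.</text>
</trial>
"""


def summarise_trial(xml_string):
    """Pull the tagged metadata out of one (hypothetical) trial record."""
    trial = ET.fromstring(xml_string)
    return {
        "id": trial.get("id"),
        "year": int(trial.get("date")[:4]),
        "offence": trial.find("offence").get("category"),
        "places": [p.text for p in trial.findall("place")],
        "people": [
            (p.text, p.get("role"), p.get("gender"))
            for p in trial.findall("person")
        ],
    }


summary = summarise_trial(SAMPLE)
print(summary)
```

Once the metadata is available as plain values like these, trials can be filtered or grouped by gender, role, date, location or crime without re-reading the running text.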

Second, Zotero: the citation management plugin of choice. Zotero is an easy-to-use Firefox plugin for gathering, organizing, and analyzing sources (citations, full texts, web pages, images, and other objects). For this project, Zotero provides the all-important environment in which new tools for visualisation and data mining can be tried out and applied.

And finally, the TAPoR Portal: a tool set for the sophisticated analysis and retrieval of text, using the methodologies of quantitative linguistics and the Voyeur suite of visualisation tools for the analysis of text.

The methodology involves creating an API for the Old Bailey that allows subsets of text to be passed to Zotero, where they can be turned into study collections (either on their own or in combination with sources derived from elsewhere). These study collections will then be passed to TAPoR and Voyeur for analysis and representation.
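The flow described here (select a subset, assemble a study collection, hand it on for analysis) can be sketched in a few lines of Python. Everything in this sketch is a stand-in: `Trial`, `StudyCollection`, `query_api`, `build_collection` and `analyse` are hypothetical names, and the tiny in-memory corpus replaces what would in practice be calls to the Old Bailey API, Zotero and TAPoR.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    trial_id: str
    year: int
    offence: str
    text: str


@dataclass
class StudyCollection:
    name: str
    trials: list = field(default_factory=list)


def query_api(corpus, offence, start_year, end_year):
    """Stand-in for an API request: select trials by tagged metadata."""
    return [
        t for t in corpus
        if t.offence == offence and start_year <= t.year <= end_year
    ]


def build_collection(name, *subsets):
    """Stand-in for the collection step: merge subsets, dropping duplicates."""
    seen = {}
    for subset in subsets:
        for trial in subset:
            seen.setdefault(trial.trial_id, trial)
    return StudyCollection(name, list(seen.values()))


def analyse(collection):
    """Stand-in for the analysis step: here, just a total word count."""
    return sum(len(t.text.split()) for t in collection.trials)


# Invented sample data, for illustration only.
corpus = [
    Trial("a", 1720, "theft", "stole a watch"),
    Trial("b", 1760, "theft", "stole two silver spoons"),
    Trial("c", 1760, "murder", "killed with a knife"),
]

early = query_api(corpus, "theft", 1700, 1750)
all_theft = query_api(corpus, "theft", 1700, 1800)
collection = build_collection("Theft trials", early, all_theft)
word_count = analyse(collection)
print(collection.name, len(collection.trials), word_count)
```

Keeping each stage behind its own function mirrors the distributed design: any one stage could be swapped for a remote service without the others changing.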

The experience of the end user should be seamless, though the processing remains distributed. The current functionality of the Old Bailey (the search and representation of tagged data) will remain available through Zotero.

The idea is to create an intellectual exemplar for the role of data mining in an important historical discipline, the history of crime, and to illustrate how the fundamental conundrums of historical research on large bodies of text might be addressed. By allowing the analysis and statistical representation of the types of language used in court and how it changed over time, and by comparing these ‘data mined’ patterns to those found in tagged data, “With Criminal Intent” will achieve three things.

First, a whole new way of charting changes in crime reporting and prosecution will be created; second, a new methodology for the consistent discovery of related descriptions will be benchmarked; and finally, a working model of how large corpora can be handled online and in a distributed fashion will be demonstrated.

To take a few mock-ups of what the end user will be able to do:

Datamining With Criminal Intent, 22 February 2010, slide 2

In the first instance, users can simply collect trials and sessions as they want, either through the current search mechanisms or through a series of new tools that we are creating to sit with the Old Bailey API, which we are calling the Newgate Commons. These can then sit in Zotero, both for analysis and for easy citation when you eventually come to write up the results.

You can then use this as the basis for analysis using TAPoR Tools:

Datamining With Criminal Intent, 22 February 2010, slide 3

Datamining With Criminal Intent, 22 February 2010, slide 4

Frequency measures, proximity measures and word distribution graphs can all be invoked. The metadata embedded within the original text allows measures of change over time, and differences between types of trial or content, to be quickly compared.
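As a rough illustration of two such measures, the Python sketch below computes the relative frequency of a term per decade and counts the words that occur near it. It is a simplified stand-in written for this post, not the TAPoR implementation; the function names and the sample sentences are invented.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency_by_decade(trials, term):
    """trials: iterable of (year, text) pairs.

    Returns {decade: occurrences of term per 1,000 words}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in trials:
        decade = (year // 10) * 10
        words = tokenize(text)
        hits[decade] += words.count(term)
        totals[decade] += len(words)
    return {d: 1000 * hits[d] / totals[d] for d in sorted(totals) if totals[d]}


def neighbours(text, term, window=2):
    """Count the words found within `window` tokens either side of `term`."""
    words = tokenize(text)
    counts = Counter()
    for i, word in enumerate(words):
        if word == term:
            start = max(0, i - window)
            counts.update(words[start:i] + words[i + 1:i + 1 + window])
    return counts


# Invented sample data, for illustration only.
trials = [
    (1723, "he stole a watch"),
    (1728, "the watch was found"),
    (1761, "she took the bread"),
]
freq = relative_frequency_by_decade(trials, "watch")
near = neighbours("he stole a silver watch from the shop", "watch")
print(freq)
print(near)
```

Because the decade comes from the tagged date rather than from the text, the same function could group by any other piece of metadata, such as offence category or the gender of the defendant.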

Or, through Voyeur, you can apply visualisations including Wordle-like word clouds, which again can be separated out using the basic metadata for cross-comparison and analysis.

And finally, the study collections with their metadata will be available for mash-ups.

Datamining With Criminal Intent, 22 February 2010, slide 5

At the moment, while Old Bailey place names are tagged, they aren’t geo-referenced; this is the next job.

At the end of the day, all we are really doing is bringing together the largest body of historical text we know with the tools historians are increasingly comfortable with. We will then have a serious go at applying them, and using them to write some history: history of the sort that changes what people think they can do.

Datamining With Criminal Intent, 22 February 2010, slide 6

Voyeur and Old Bailey

April 15th, 2010

Screen shot of Voyeur with Old Bailey data

With Criminal Intent has connected Voyeur with the Old Bailey Online project in a preliminary prototype. Click here to try Voyeur with a subset of the full Old Bailey Corpus.

ComputerWorld Canada story on Criminal Intent project

April 9th, 2010

ComputerWorld Canada has published a story on the Data Mining With Criminal Intent project, “U of A text mining project could help businesses” (Rafael Ruffolo, ComputerWorld Canada, March 25, 2010). Canadian participant Geoffrey Rockwell is quoted to the effect that,

“instead of looking for a needle in the haystack, an effective text mining tool will try and show you the shape of the haystack and tell you the words you might want to find.”

Datamining with Criminal Intent

April 9th, 2010

Background

Over the past few decades scholars have increasingly used court records to illuminate historical themes in novel ways. The published Proceedings of the Old Bailey have been a fertile source for scholars working in these varied traditions, allowing them to use both qualitative and quantitative approaches to the evolution of the criminal justice system, of interpersonal relationships and of human behaviour more generally. But, despite the fact that 120 million words of court transcripts published in the Proceedings are now available online in a structured and searchable form, historians and humanist scholars continue to use these legal records in an essentially iterative and traditional manner, and largely fail to take full advantage of the variety of forms of analysis the Proceedings’ online format allows. At the same time, Zotero has provided a popular environment for managing online scholarship that allows humanists to collect, index and manipulate large amounts of text, while in TAPoR Tools a range of facilities for the quantitative analysis of text has been piloted and tested. By bringing together in one seamless online environment the text of the Proceedings, the functionality of Zotero and the tools created by TAPoR, this project will allow scholars to take new approaches to this old source.

This project will create an intellectual exemplar for the role of data mining in an important historical discipline, the history of crime, and illustrate how the fundamental conundrums of historical research on large bodies of text that have dogged humanist research over the last forty years might be addressed. By allowing the analysis and statistical representation of the types of language used in court and how it changed over time, and by comparing these ‘data mined’ patterns to those found in tagged data, “With Criminal Intent” will achieve three things. First, a whole new way of charting changes in crime reporting and prosecution will be created; second, a new methodology for the consistent discovery of related descriptions will be benchmarked; and finally, a working model of how large corpora can be handled online and in a distributed fashion will be demonstrated. The significance of this project therefore runs beyond the discipline of the history of crime, and addresses historical scholarship more broadly, and scholarly engagement with large corpora.

Aims and Objectives

This project aims to demonstrate that greater historical rigour can be achieved, and new insights gained, through the application of data mining and statistical analysis to large bodies of primary sources such as The Proceedings of the Old Bailey. Given the availability and power of modern text mining techniques and the fact that the Proceedings have already been optimized for use with these techniques, we believe that by building on the success of previous work, this project will change the research paradigm. In the process, it will allow end users, both scholars and students, to experience the three separate components of this project (the Proceedings, Zotero and TAPoR tools) as a single seamless resource. To achieve this aim, we need to reach three specific objectives:

  1. The creation of Newgate Commons: a new form of interface for the Old Bailey Proceedings that supplements the current search interfaces. The Newgate Commons will allow scholars to use mining and clustering techniques to identify, collect and work with sets of relevant trials and related texts, and to extract them for further study with other tools. The interface will also make it easy for users to train machine learning ‘agents’ to help identify patterns in the text (and the underlying account of prosecutions and punishments) of interest to the researcher.
  2. Zotero Virtual Collections: the modification of the Zotero bibliographic reference management tool so that it can be used to manage the collections of documents created within the Newgate Commons and call upon full texts only when needed.
  3. Voyeur Analytics: the project will connect Zotero to analytical tools designed by the TAPoR project to work on large collections, including the Voyeur toolset for analysis and visualization. The emphasis throughout will be on extending existing tools as needed to allow researchers to navigate between them seamlessly and to use Zotero as a hub from which to manage large study collections. In the process we will create the potential to analyze and visualize change over time in a way that goes beyond current historical methodologies, illuminating the relationship between text and event in new ways.
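
The idea in the second objective, a collection that holds references and calls upon full texts only when needed, can be sketched in Python as follows. This is an illustration of the general lazy-loading pattern under our own assumptions, not a description of how Zotero works; `VirtualCollection`, its methods and the `fake_fetch` helper are hypothetical names.

```python
class VirtualCollection:
    """Holds lightweight metadata; fetches full text only on demand."""

    def __init__(self, fetch_text):
        self._fetch_text = fetch_text  # callable: trial_id -> full text
        self._metadata = {}
        self._cache = {}

    def add(self, trial_id, **metadata):
        """Register a document by its identifier and metadata only."""
        self._metadata[trial_id] = metadata

    def select(self, **criteria):
        """Return ids whose metadata match, without fetching any text."""
        return [
            tid for tid, md in self._metadata.items()
            if all(md.get(key) == value for key, value in criteria.items())
        ]

    def full_text(self, trial_id):
        """Fetch the full text the first time it is asked for, then reuse it."""
        if trial_id not in self._cache:
            self._cache[trial_id] = self._fetch_text(trial_id)
        return self._cache[trial_id]


calls = []


def fake_fetch(trial_id):
    """Hypothetical stand-in for a remote request; records each call."""
    calls.append(trial_id)
    return f"full text of {trial_id}"


collection = VirtualCollection(fake_fetch)
collection.add("t1", year=1750, offence="theft")
collection.add("t2", year=1751, offence="murder")

thefts = collection.select(offence="theft")  # metadata only, no fetch
text = collection.full_text("t1")            # first request triggers a fetch
collection.full_text("t1")                   # second request uses the cache
print(thefts, calls)
```

Searching and organising work entirely on the metadata, so a collection of many thousands of trials stays light until the moment a tool actually needs the words.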