The notes for a presentation to the British partners in the ‘Digging into Data’ challenge delivered by Tim Hitchcock at a meeting held at the JISC offices in London, 22 February 2010:
Datamining with Criminal Intent:
This project combines three distinct elements and services. First, The Old Bailey Online: 120,000,000 words of double rekeyed text. It is the largest corpus of accurately transcribed, rather than OCR'd, historical text we have. It is also one that is extremely heavily tagged in XML for content: gender, role, date, location, crime etc.
Second: Zotero, the citation management plugin of choice. Zotero is an easy-to-use Firefox plugin for gathering, organising and analysing sources (citations, full texts, web pages, images, and other objects). For this project, Zotero provides the all-important environment in which new tools of visualisation and data mining can be tried out and applied.
And finally: The TAPoR Portal, a tool set for the sophisticated analysis and retrieval of text using the methodologies of quantitative linguistics, together with the Voyeur suite of visualisation tools for the analysis of text.
The methodology involves creating an API for the Old Bailey that allows subsets of text to be passed off to Zotero, where they can be turned into study collections (either on their own or in combination with sources derived from elsewhere). These study collections will then be passed off to TAPoR and Voyeur for analysis and representation.
The experience of the end user should be seamless, though the processing remains distributed. And the current functionality of the Old Bailey (the search and representation of tagged data) will remain available through Zotero.
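The hand-off described above can be pictured with a minimal sketch. It is an illustration only, not the project's actual API: the `Trial` record, the `query_trials` and `build_study_collection` functions, the JSON layout and the sample records are all hypothetical, and a small in-memory list stands in for the tagged corpus.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Trial:
    """Hypothetical record: one trial text plus a few of its tagged fields."""
    trial_id: str
    year: int
    offence: str
    text: str


# Invented sample records standing in for the tagged corpus.
CORPUS = [
    Trial("t1", 1750, "theft", "the prisoner took a silver watch"),
    Trial("t2", 1820, "theft", "the prisoner stole a handkerchief"),
    Trial("t3", 1820, "killing", "the deceased was struck with a poker"),
]


def query_trials(corpus, offence=None, year_from=None, year_to=None):
    """Return the subset of trials matching the given metadata filters."""
    results = []
    for trial in corpus:
        if offence is not None and trial.offence != offence:
            continue
        if year_from is not None and trial.year < year_from:
            continue
        if year_to is not None and trial.year > year_to:
            continue
        results.append(trial)
    return results


def build_study_collection(name, trials):
    """Bundle trials, metadata included, into JSON for the next service."""
    return json.dumps({"collection": name, "items": [asdict(t) for t in trials]})


thefts = query_trials(CORPUS, offence="theft")
payload = build_study_collection("thefts", thefts)
```

The point of the sketch is that the metadata travels with the text: whatever receives `payload` can still separate the items by year or offence without going back to the source.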
The idea is to create an intellectual exemplar for the role of data mining in an important historical discipline, the history of crime, and to illustrate how the fundamental conundrums of historical research on large bodies of text might be addressed. By allowing the analysis and statistical representation of the types of language used in court and how it changed over time, and by comparing these ‘data mined’ patterns to those found in tagged data, “With Criminal Intent” will achieve three things.
First, a whole new way of charting changes in crime reporting and prosecution will be created; second, a new methodology for the consistent discovery of related descriptions will be benchmarked; and finally, a working model of how large corpora can be handled online, in a distributed fashion, will be demonstrated.
To take a few mock-ups of what the end user will be able to do:
In the first instance, users can simply collect trials and sessions as they want, either through the current search mechanisms or through a series of new tools that we are creating to sit with the Old Bailey API, which we are calling the Newgate Commons. These can then sit in Zotero, for both analysis and easy citation when they eventually come to write up the results.
You can then use this as the basis for analysis using TAPoR Tools:
Frequency measures, proximity measures and word distribution graphs can all be invoked, and the metadata embedded within the original text allows change over time, and differences between types of trial or content, to be quickly compared.
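As a rough illustration of this kind of measure, the sketch below computes the relative frequency of a word per decade and counts the words that appear near it. It does not use the TAPoR tools themselves: the function names, the dictionary layout and the sample trials are assumptions made for the example, and the tokeniser is deliberately crude.

```python
import re
from collections import Counter, defaultdict

# Invented sample records: text plus the metadata used for grouping.
TRIALS = [
    {"year": 1745, "offence": "theft",
     "text": "The prisoner took a watch and a watch chain."},
    {"year": 1748, "offence": "theft",
     "text": "He took the watch from my pocket."},
    {"year": 1821, "offence": "theft",
     "text": "The prisoner stole a handkerchief."},
]


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_by_decade(trials, word):
    """Relative frequency of `word` per decade (occurrences / total tokens)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for trial in trials:
        decade = trial["year"] // 10 * 10
        tokens = tokenize(trial["text"])
        totals[decade] += len(tokens)
        hits[decade] += tokens.count(word)
    return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}


def cooccurring(trials, word, window=3):
    """Count the words found within `window` tokens of each use of `word`."""
    counts = Counter()
    for trial in trials:
        tokens = tokenize(trial["text"])
        for i, token in enumerate(tokens):
            if token == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts


watch_by_decade = frequency_by_decade(TRIALS, "watch")
near_watch = cooccurring(TRIALS, "watch")
```

Grouping by `offence` instead of decade would need only a different key in `frequency_by_decade`, which is what makes the embedded metadata so useful for comparison.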
Or, through Voyeur, you can apply visualisations including Wordle-like word clouds, which again can be separated out using the basic metadata for cross comparison and analysis.
And finally, the study collections with their metadata will be available for mash-ups.
At the moment, while Old Bailey place names are tagged, they aren’t geo-referenced, and this is the next job.
At the end of the day, all we are really doing is bringing together the largest body of historical text we know, with the tools historians are increasingly comfortable with. We will then have a serious go at applying them, and using them to write some history – history of the sort that changes what people think they can do.