Slices and Ad Hoc Collections

June 28th, 2010

In the Center for History and New Media’s study of text mining techniques and their utility for historical work, funded by the National Endowment for the Humanities, we have been trying out different use cases and techniques on focus groups of historians. As part of that process we have settled on two major kinds of collections and related methodologies that we intend to explore further as part of this Digging into Data collaboration: text mining on “slices” of individual corpora that meet certain criteria, and “ad hoc” research libraries assembled from disparate online collections.

The first kind of collection to mine involves a humanities researcher asking for all of the documents from a corpus with particular metadata or full-text keyword matches. The second uses a personal assemblage of individual documents, generally on a common topic, drawn from a wide range of collections. In Zotero, the tool being used for these tests, the former is equivalent to a “saved search” on a collection (i.e., all matching items from a Boolean search of a collection); the latter, the folders that a Zotero user can set up in the interface to organize their library into subsections manually.

CHNM has begun to look at passing both the slices and the ad hoc text corpora to analytical and visualization services. For slices, our assumption has been that Zotero cannot pull down the entirety of the corpus, but will instead pass a token, URI, or query string that references the text the analytical service needs, which will then be delivered directly from the collection to the text-mining service. For ad hoc collections, Zotero can bundle the text itself and send it from the client.
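To make the distinction concrete, here is a minimal sketch of the two kinds of request a client might send. It is not Zotero's actual API: the class names, field names, and the `build_payload` helper are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Union


@dataclass
class SliceRequest:
    """Hypothetical: a reference to text that stays on the collection's server."""
    corpus_uri: str   # where the corpus lives (assumed field)
    query: str        # the saved search that defines the slice (assumed field)


@dataclass
class AdHocRequest:
    """Hypothetical: the full texts, bundled by the client itself."""
    documents: List[str]


def build_payload(request: Union[SliceRequest, AdHocRequest]) -> str:
    """Serialize a request for an (assumed) text-mining service."""
    if isinstance(request, SliceRequest):
        # Slice: send only a pointer; the service fetches the text itself.
        body = {"mode": "slice", "reference": asdict(request)}
    else:
        # Ad hoc collection: send the text directly from the client.
        body = {"mode": "ad_hoc", "texts": request.documents}
    return json.dumps(body)
```

The point of the design is that a slice payload stays small however large the matching set is, while an ad hoc payload grows with the researcher's folder.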

For slices, pre-processing of the text by analytical services may have significant advantages, e.g., to establish normalized compression distance and clusters of similarly themed documents. For ad hoc collections, pre-processing of texts is likely less helpful (since the historian has already clustered them for a reason). However, the need for more fine-grained text mining services, such as word-usage analysis or entity extraction, rises. For instance, Zotero recently added a popular mapping service that takes advantage of place-name extraction from full texts within an ad hoc collection.
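For readers unfamiliar with normalized compression distance, the following is a minimal sketch of the standard formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It assumes zlib as the compressor; the function names are illustrative, and this is not necessarily the implementation any analytical service uses.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Lower values suggest the texts share more content; values near 1
    suggest they have little in common. zlib is an assumed choice of
    compressor, and results on very short texts are unreliable.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Computing `ncd` for every pair of documents in a slice gives a distance matrix that a clustering step could then group into similarly themed documents.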

Criminal Intent project in the news

June 19th, 2010

The Globe and Mail, Canada’s “national newspaper”, has an article today in the Focus section titled “Supercomputers seek to ‘model humanity’”. The article mentions the Digging into Data program and our Criminal Intent project.

The Old Bailey in Numbers

June 18th, 2010

Datamining is about discovering patterns in text, but the Old Bailey Proceedings already incorporates tagged data reflecting what contemporaries thought they were doing. The nature of the crime, the name, gender and age of the defendant, the verdict and punishment were described in words their authors thought beyond misinterpretation. To use datamining to find new patterns, it would help if we could subtract the patterns that we already know about. The huge rise in theft prosecutions in the first half of the nineteenth century, the changing proportion of men and women prosecuted, the evolving nature of the crime itself: each needs to be interrogated to illustrate where changes in language can be explained as the result of changing judicial practice, and where these changes suggest a new and different explanation.

Mind the Gap Results

June 17th, 2010

Tableau Display of Subset of Old Bailey Data

At the Mind the Gap workshop a Criminal Intent team experimented with a number of promising data visualization and mining techniques. For example, we tried a “data warehouse” visual comparison model using the Tableau software on data formatted for it. The image above shows how we can compare records based on structural information. Here is a visualization using correspondence analysis.

Correspondence Analysis Visualization

Now we have to select the most promising techniques and develop them further.

Normalized Compression Distance