March 21st, 2011
There are few instances when the data you have are exactly what you need, particularly for feeding into analytic tools of various kinds. This is one of the reasons I think it’s so important for a significant subset of digital humanists to have some basic programming skills. You don’t need to be building tools; sometimes you’re just building data. Otherwise, not only are you dependent on the tools built by others, but you’re also dependent on the data provided by others.
Recently we’ve been working on better integration of Voyeur Tools with the Old Bailey archives. Currently we have a cool prototype that allows you to do an Old Bailey query and to save the results as a Zotero entry and/or send the document directly to Voyeur Tools. The URL that is sent to Voyeur contains parameters that cause Voyeur to go fetch the individual trial accounts from Old Bailey via its experimental API. That’s great, except that fetching all of those documents adds considerable latency to the system, which for larger collections can cause network timeouts. It would be preferable to have a local document store in Voyeur Tools.
March 16th, 2011
The Digging Into Data program, which funded the Criminal Intent project, has announced a second round. They have added four more funders and one more country (the Netherlands). As Brett Bobley put it,
Round one proved to be enormously popular, with nearly 90 international research teams competing. One piece of feedback we heard very strongly from the field is that we needed more money in this program so that we could (a) award more total projects and (b) allow for more funds per project. I’m happy to say that by adding four extra funders we have tried to address both of these concerns.
The beauty of this programme is that it funds teams across countries (Canada, USA, UK, and now the Netherlands) without forcing you to apply separately to each national programme. One application, two or three budgets, one adjudication, and you can do research with international colleagues!
March 4th, 2011
January 25th, 2011
The Proceedings of the Old Bailey contains about 120 million words. Trial records can be searched to locate individual instances of particular words (like ‘brimstone’), but what do you do when you are curious about the kinds of patterns to be found across the archive as a whole? That’s where data mining comes in.
While glancing through some records that came up while running one of our programs, I noticed a reference to a scarlet cloak. I got to wondering what kinds of colour words appear in the OB, and how frequent each one is. The two graphs below come from a Mathematica hack that took about two hours to do. First, I pulled out all of the words in the Old Bailey that also appear in an English dictionary. I also generated a list of dictionary words for spectral colours, then took the intersection of the two sets. Both of these operations were made easy by access to Wolfram Research’s curated data sets from directly within Mathematica. I then used Wolfram Alpha to look up an RGB colour value for each colour word, again directly from Mathematica, and plotted each colour in an RGB space as a little cuboid; that’s the first image. The second image shows what proportion of all trials that include colour words mention a particular colour.
There are a number of possible confounding factors here. “Black,” “White,” and “Brown” are surnames, “silver” and “ivory” refer to materials, etc. So if we were going to use these results for something significant we would take the time to do a linguistic parse of the contexts in which each colour term appeared and make sure that we had properly disambiguated them. But this is just for fun. Because programming is fun.
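For anyone who wants to play along without Mathematica, here is a rough Python sketch of the second measurement: the proportion of colour-mentioning trials that mention each colour. It is not the original hack. The function name `colour_proportions`, the crude whitespace tokenizer, and the tiny sample data are all assumptions made for illustration, and the colour list is simply passed in rather than looked up in a curated data set.

```python
from collections import Counter


def colour_proportions(trials: list[str], colour_words: set[str]) -> dict[str, float]:
    """For each colour word, return the share of colour-mentioning trials that contain it."""
    mentions = Counter()
    trials_with_colour = 0
    for text in trials:
        # Crude tokenizer (an assumption): split on whitespace, trim punctuation, lowercase.
        words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
        found = words & colour_words  # set intersection, as in the post
        if found:
            trials_with_colour += 1
            mentions.update(found)  # count each colour at most once per trial
    if trials_with_colour == 0:
        return {}
    return {colour: n / trials_with_colour for colour, n in mentions.items()}


# Made-up sample texts, not real Old Bailey records.
sample_trials = [
    "A scarlet cloak and a black hat.",
    "Mr. Black stole a silver watch.",
    "No goods were recovered.",
]
sample_colours = {"scarlet", "black", "silver", "green"}
print(colour_proportions(sample_trials, sample_colours))
```

Note that the second sample trial counts “Black” the surname and “silver” the material as colours, which is exactly the kind of confound described above.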
January 14th, 2011
Part of our work on the Old Bailey collection has been to compute the Normalized Compression Distances between the approximately 200,000 documents. Normalized Compression Distance (NCD) is a way of measuring the similarity of documents by comparing them in compressed form (where repetition is essentially suppressed) – see Rudi Cilibrasi and Paul Vitányi’s description. Bill Turkel wrote about some initial small-scale experiments that he’d conducted on clustering items from NCD, and encouraged us to apply the same techniques on the Old Bailey collection.
Computing the NCD for all of Old Bailey is fairly computationally intensive: each of the 200,000 documents has an NCD value with respect to each other document, which produces some 20 billion records. That’s a lot, especially since each document and each individual pair of documents requires compression. Our first experiment used a Java implementation of the BZip2 compression algorithm. From what we could observe, the results were promising, but slow. We decided to switch to a GZip compression algorithm and we were shocked at the difference in speed for processing a couple hundred documents:
- BZip2: 4 minutes and 49 seconds
- GZip: 0 minutes and 6 seconds
For 20 billion records, that’s quite a difference. Our preliminary examination of the results indicated that there were only minor differences in the relative NCD values between the two algorithms. BZip2 does produce more compact representations of strings, but that’s not a significant factor for us (it changes the NCD results, but only slightly). We’ve obviously chosen to continue with GZip for computing the NCD values.
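For readers who want to try the comparison themselves, here is a minimal Python sketch of NCD. It is not the Java code described above. It assumes the standard formula from Cilibrasi and Vitányi, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length in bytes. The function names `ncd` and `pair_count` and the two sample texts are made up for illustration.

```python
import bz2
import gzip


def ncd(x: bytes, y: bytes, compress=gzip.compress) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # compress the concatenation of the two documents
    return (cxy - min(cx, cy)) / max(cx, cy)


def pair_count(n_documents: int) -> int:
    """Number of unordered document pairs that need an NCD value."""
    return n_documents * (n_documents - 1) // 2


# Made-up sample texts, not real Old Bailey records.
a = b"The prisoner was indicted for stealing a scarlet cloak, value ten shillings, the goods of John Smith."
b = b"Gunpowder and brimstone were found in the cellar beneath the warehouse near the river at midnight."

for name, compressor in [("gzip", gzip.compress), ("bz2", bz2.compress)]:
    print(name, "same:", ncd(a, a, compressor), "different:", ncd(a, b, compressor))

print(pair_count(200_000))
```

A document compared with itself should score close to 0, and unrelated documents closer to 1. Passing the compressor in as a parameter makes it easy to swap GZip for BZip2 and time the two on your own sample.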
January 6th, 2011
In a previous post we mentioned that Tim Hitchcock and Bill Turkel are using Mathematica to create prototypes of various text mining and visualization tools. One advantage of using Mathematica is that it is very easy to build dynamic prototypes that can be shared with colleagues to get their feedback. We have put one of these online at the Wolfram Demonstrations Project: “Term Weighting with TF-IDF.” The TF-IDF is a common information retrieval measure that indicates which terms in a document (in this case a legal trial) should be weighted most heavily when trying to understand what the document is about.
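To give a feel for the measure outside Mathematica, here is a small Python sketch of one common TF-IDF variant: term frequency (count divided by document length) times the natural log of (number of documents / number of documents containing the term). There are many variants, and the demonstration may well use a different weighting. The function name `tf_idf` and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy "trials", already split into words; not real Old Bailey text.
toy_trials = [
    ["theft", "of", "a", "cloak"],
    ["theft", "of", "a", "horse"],
    ["murder", "of", "a", "man"],
]
for doc_weights in tf_idf(toy_trials):
    print(sorted(doc_weights.items(), key=lambda item: -item[1]))
```

Words that appear in every trial (“of”, “a”) get a weight of zero, while a word that appears in only one trial (“cloak”) outweighs one shared by several (“theft”).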
December 11th, 2010
As we try to figure out which data mining tools and strategies might be most useful to the ordinary working historian (whom we call “the OWH”) we are experimenting with a variety of different approaches. This way we can capitalize on the fact that we represent many different disciplinary backgrounds and skill sets. One strategy that Tim Hitchcock and Bill Turkel have been pursuing is to try to model various characteristics of the entire run of Old Bailey trials. For this, they are using the new version of Mathematica, a computational environment that combines advanced tools for programming, statistics, visualization and many other tasks. While working in Mathematica is not typical for OWHs, at least not yet, it does have the advantage of allowing us to get an overview of patterns that appear across tens of thousands of pages or millions of words. It also allows us to build various measurements into prototype interfaces and demonstrations that can be shared with colleagues (examples will be posted here later).
December 11th, 2010
The British Library has recently mounted a ten-month exhibition to highlight new approaches to research and cutting-edge applications of digital technologies. Datamining with Criminal Intent is one of the thirty projects showcased by the exhibition. Dan Cohen and Tim Hitchcock were also interviewed for the exhibition, and their contributions can be viewed online at:
November 30th, 2010
Those following the Criminal Intent project may wish to experiment themselves with a related project that does text mining on Victorian books. In particular, the team at Mason plans to compare some of the topic graphs from the nineteenth century with subject matter in the Old Bailey. And while you’re exploring some of those graphs, be sure to read the caveats, which likely apply here as well.
November 19th, 2010
Joerg Sanders, working with John Simpson at the University of Alberta, has a first prototype of a “data warehousing” tool that will let users explore the Old Bailey data through an interface for making comparisons. Above you see an example of comparisons by gender over a time period.