January 25th, 2011
The Proceedings of the Old Bailey contains about 120 million words. Trial records can be searched to locate individual instances of particular words (like ‘brimstone’), but what do you do when you are curious about the kinds of patterns to be found across the archive as a whole? That’s where data mining comes in.
While glancing through some records that came up while running one of our programs, I noticed a reference to a scarlet cloak. I got to wondering what kinds of colour words appear in the OB, and how frequent each one is. The two graphs below come from a Mathematica hack that took about two hours to do. First, I pulled out all of the words in the Old Bailey that also appear in an English dictionary. I also generated a list of dictionary words for spectral colours, then took the intersection of the two sets. Both of these operations were made easy by access to Wolfram Research’s curated data sets from directly within Mathematica. I then used Wolfram Alpha to look up an RGB colour value for each colour word, again directly from Mathematica, and plotted each colour in an RGB space as a little cuboid. That’s the first image. The second image shows, of all the trials that include colour words, what proportion mention a particular colour.
There are a number of possible confounding factors here. “Black,” “White,” and “Brown” are surnames, “silver” and “ivory” refer to materials, etc. So if we were going to use these results for something significant we would take the time to do a linguistic parse of the contexts in which each colour term appeared and make sure that we had properly disambiguated them. But this is just for fun. Because programming is fun.
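For readers without Mathematica, the counting step can be roughly sketched in Python. This is not the code behind the graphs: the tiny `COLOUR_WORDS` set stands in for the dictionary-derived colour list, the function names are made up for illustration, and the denominator (trials containing at least one colour word) is my reading of the second graph.

```python
import re
from collections import Counter

# Hypothetical stand-in for the dictionary-derived list of colour words.
COLOUR_WORDS = {"scarlet", "black", "white", "brown", "silver", "ivory"}


def colour_terms_in(text, colour_words):
    """Return the colour words that occur in one trial text."""
    # Lowercase and split on non-letters, then intersect with the colour list.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens & colour_words


def colour_proportions(trials, colour_words):
    """For each colour, the share of colour-mentioning trials that use it."""
    counts = Counter()
    trials_with_colour = 0
    for text in trials:
        found = colour_terms_in(text, colour_words)
        if found:
            trials_with_colour += 1
            counts.update(found)  # each colour counted once per trial
    if trials_with_colour == 0:
        return {}
    return {colour: n / trials_with_colour for colour, n in counts.items()}


if __name__ == "__main__":
    sample_trials = [
        "He wore a scarlet cloak.",
        "A black horse and a white mare.",
        "Mr. Black stole a watch.",
        "Nothing of note.",
    ]
    print(colour_proportions(sample_trials, COLOUR_WORDS))
```

Note that the sample deliberately counts “Mr. Black” as a colour mention, which is exactly the kind of surname confound described above.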
January 14th, 2011
Part of our work on the Old Bailey collection has been to compute the Normalized Compression Distances between the approximately 200,000 documents. Normalized Compression Distance (NCD) is a way of measuring the similarity of documents by comparing them in compressed form (where repetition is essentially suppressed) – see Rudi Cilibrasi and Paul Vitányi’s description. Bill Turkel wrote about some initial small-scale experiments that he’d conducted on clustering items by NCD, and encouraged us to apply the same techniques to the Old Bailey collection.
Computing the NCD for all of the Old Bailey is computationally intensive: each of the 200,000 documents has an NCD value with respect to every other document, which produces some 20 billion records (roughly 200,000 × 199,999 / 2 unordered pairs). That’s a lot, especially since each document and each individual pair of documents requires compression. Our first experiment used a Java implementation of the BZip2 compression algorithm. From what we could observe, the results were promising, but slow. We decided to switch to a GZip compression algorithm and were shocked at the difference in speed when processing a couple of hundred documents:
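To make the measure concrete, here is a minimal Python sketch of NCD as Cilibrasi and Vitányi define it: NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the compressed length. It is not our project code; it assumes `zlib` as the compressor, and the function names are purely illustrative.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"the prisoner was indicted for stealing a scarlet cloak " * 20
    b = bytes(range(256)) * 4  # unrelated content
    print("same text:     ", ncd(a, a))
    print("unrelated text:", ncd(a, b))
```

Values near 0 suggest that the two documents share most of their content; values near 1 suggest they share very little. Real compressors are imperfect, so results can fall slightly outside that range.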
- BZip2: 4 minutes and 49 seconds
- GZip: 0 minutes and 6 seconds
For 20 billion records, that’s quite a difference. Our preliminary examination of the results indicated that there were only minor differences in the relative NCD values between the two algorithms. BZip2 does produce more compact representations of strings, but that’s not a significant factor for us (it changes the NCD results, but only slightly). We’ve obviously chosen to continue with GZip for computing the NCD values.
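Anyone who wants to try a similar comparison can do so with a small Python sketch like the one below. It is not our Java code: it assumes the standard-library `bz2` and `gzip` modules, the helper names are hypothetical, and the timings it prints will not match the figures above, since they depend on the implementation, the machine and the documents.

```python
import bz2
import gzip
import time
from itertools import combinations


def ncd(x, y, compress):
    """NCD of two byte strings using the given compress(bytes) -> bytes function."""
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def pair_count(n):
    """Number of unordered document pairs among n documents."""
    return n * (n - 1) // 2


def all_pairs_ncd(docs, compress):
    """NCD for every unordered pair of documents, keyed by index pair."""
    return {
        (i, j): ncd(docs[i], docs[j], compress)
        for i, j in combinations(range(len(docs)), 2)
    }


def time_compressor(docs, compress):
    """Return (elapsed seconds, distances) for one compressor."""
    start = time.perf_counter()
    distances = all_pairs_ncd(docs, compress)
    return time.perf_counter() - start, distances


if __name__ == "__main__":
    print(pair_count(200_000))  # 19999900000, i.e. about 20 billion
    # Made-up sample documents; substitute real trial texts.
    docs = [f"trial {i}: the prisoner was indicted ".encode() * 50 for i in range(30)]
    for name, compress in (("bz2", bz2.compress), ("gzip", gzip.compress)):
        elapsed, _ = time_compressor(docs, compress)
        print(f"{name}: {elapsed:.3f} s for {pair_count(len(docs))} pairs")
```

Passing the compressor in as a function keeps the NCD code identical for both runs, so any difference in speed or in the resulting distances comes from the compressor alone.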
January 6th, 2011
In a previous post we mentioned that Tim Hitchcock and Bill Turkel are using Mathematica to create prototypes of various text mining and visualization tools. One advantage of using Mathematica is that it is very easy to build dynamic prototypes that can be shared with colleagues to get their feedback. We have put one of these online at the Wolfram Demonstrations Project: “Term Weighting with TF-IDF.” TF-IDF (term frequency-inverse document frequency) is a common information retrieval measure that indicates which terms in a document (in this case a legal trial) should be weighted most heavily when trying to understand what the document is about.
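For a rough idea of what the measure computes, here is a minimal Python sketch. It is not the code behind the Demonstration. TF-IDF comes in many variants; this one assumes relative term frequency multiplied by the natural log of (number of documents / number of documents containing the term), and the function name is just for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    documents is a list of token lists, one list per trial.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    # Made-up token lists standing in for three trials.
    trials = [
        ["theft", "cloak", "cloak"],
        ["theft", "horse"],
        ["theft", "watch"],
    ]
    for weights in tf_idf(trials):
        print(weights)
```

In this toy example “theft” appears in every trial, so its weight is zero everywhere, while “cloak” is weighted heavily in the one trial that mentions it. That is the intuition behind the measure: terms that are frequent in one document but rare across the collection say the most about what that document is about.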