Digging Into Data: Round Two

March 16th, 2011

The Digging Into Data program, which funded the Criminal Intent project, has announced a second round. It has added four more funders and one more country (the Netherlands). As Brett Bobley put it:

Round one proved to be enormously popular, with nearly 90 international research teams competing. One piece of feedback we heard very strongly from the field is that we needed more money in this program so that we could (a) award more total projects and (b) allow for more funds per project. I’m happy to say that by adding four extra funders we have tried to address both of these concerns.

The beauty of this programme is that it funds teams across countries (Canada, USA, UK, and now the Netherlands) without forcing you to apply separately to each national programme. One application, two or three budgets, one adjudication, and you can do research with international colleagues!

New graphs for the data warehousing

March 4th, 2011

John Simpson, who is programming the Data Warehousing experiment at the University of Alberta, has developed a new graph for us that uses the dygraphs JavaScript visualization library. One of the things it allows us to do is add annotations (the boxes with letters like [C]) marking key historical events that help people orient themselves. It also allows users to zoom in and out and explore the graph in other ways.
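For anyone curious how those annotated graphs are wired up, here is a minimal sketch of a dygraphs chart with lettered annotations. The element id, CSV file, series name, and dates are placeholders rather than the actual Data Warehousing code:

```typescript
// A sketch of an annotated dygraphs chart, not the actual Data
// Warehousing code. The element id, CSV file, series name, and
// dates below are placeholders.
import Dygraph from "dygraphs";

const graph = new Dygraph(
  document.getElementById("trial-counts") as HTMLElement, // hypothetical container div
  "trial-counts-by-year.csv",                             // hypothetical date,count data
  { title: "Old Bailey trials per year", showRangeSelector: true }
);

// Annotations attach a small lettered box (like [C]) to a point on a
// named series, which is how key historical events can be marked.
graph.ready(() => {
  graph.setAnnotations([
    {
      series: "Trials",      // hypothetical series name from the CSV header
      x: "1780-01-01",       // placeholder date
      shortText: "C",
      text: "Example historical event used to orient the reader",
    },
  ]);
});
```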

What Colour is Your Archive?

January 25th, 2011

The Proceedings of the Old Bailey contains about 120 million words. Trial records can be searched to locate individual instances of particular words (like ‘brimstone’), but what do you do when you are curious about the kinds of patterns to be found across the archive as a whole? That’s where data mining comes in.

While glancing through some records that came up while running one of our programs, I noticed a reference to a scarlet cloak. I got to wondering what kinds of colour words appear in the OB, and how frequent each one is. The two graphs below come from a Mathematica hack that took about two hours to do. First, I pulled out all of the words in the Old Bailey that also appear in an English dictionary. I also generated a list of dictionary words for spectral colours, then took the intersection of the two sets. Both of these operations were made easy by access to Wolfram Research’s curated data sets from directly within Mathematica. I then used Wolfram Alpha to look up an RGB colour value for each colour word, again directly from Mathematica, and plotted each colour in an RGB space as a little cuboid. That’s the first image. The second image shows what proportion of all trials that include colour words mention a particular colour.
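For readers who want to try something similar without Mathematica, here is a rough sketch (in TypeScript) of the counting step behind the second image, using a hand-picked list of colour words in place of Wolfram's curated dictionary and colour data:

```typescript
// The counting step behind the second image, restated in TypeScript.
// The original hack used Mathematica with Wolfram's curated dictionary
// and colour data; the colour list here is an illustrative subset.
const colourWords = ["black", "white", "brown", "red", "scarlet", "silver", "ivory"];

// trials: one plain-text string per trial record (loading the Old
// Bailey XML is out of scope for this sketch).
function colourProportions(trials: string[]): Map<string, number> {
  const counts = new Map<string, number>(colourWords.map((c) => [c, 0] as [string, number]));
  let trialsWithColour = 0;

  for (const trial of trials) {
    const words = new Set(trial.toLowerCase().match(/[a-z]+/g) ?? []);
    const present = colourWords.filter((c) => words.has(c));
    if (present.length === 0) continue;
    trialsWithColour += 1;
    for (const c of present) counts.set(c, (counts.get(c) ?? 0) + 1);
  }

  // Proportion of colour-mentioning trials in which each colour appears.
  const proportions = new Map<string, number>();
  for (const [c, n] of counts) {
    proportions.set(c, trialsWithColour === 0 ? 0 : n / trialsWithColour);
  }
  return proportions;
}
```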

There are a number of possible confounding factors here. “Black,” “White,” and “Brown” are surnames, “silver” and “ivory” refer to materials, etc. So if we were going to use these results for something significant, we would take the time to do a linguistic parse of the contexts in which each colour term appeared and make sure that we had properly disambiguated them. But this is just for fun. Because programming is fun.

Normalized Compression Distance Algorithms

January 14th, 2011

Part of our work on the Old Bailey collection has been to compute the Normalized Compression Distances between the approximately 200,000 documents. Normalized Compression Distance (NCD) is a way of measuring the similarity of documents by comparing them in compressed form, where repetition is essentially suppressed (see Rudi Cilibrasi and Paul Vitányi’s description). Bill Turkel wrote about some initial small-scale experiments he had conducted on clustering items by NCD, and encouraged us to apply the same techniques to the Old Bailey collection.
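Cilibrasi and Vitányi define NCD(x, y) as (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s. Our implementation is in Java, but the idea fits in a few lines; here is a sketch using Node's built-in gzip:

```typescript
// Normalized Compression Distance, following Cilibrasi and Vitányi:
//   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
// where C(s) is the length of s after compression. This sketch uses
// Node's built-in gzip rather than the project's Java implementation.
import { gzipSync } from "zlib";

// Compressed size of a string, in bytes.
function compressedSize(s: string): number {
  return gzipSync(Buffer.from(s, "utf8")).length;
}

function ncd(x: string, y: string): number {
  const cx = compressedSize(x);
  const cy = compressedSize(y);
  const cxy = compressedSize(x + y);
  return (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
}

// Similar documents compress well together, so their NCD is near 0;
// unrelated documents come out closer to 1.
console.log(ncd("the prisoner stole a scarlet cloak", "the prisoner stole a scarlet cloak"));
console.log(ncd("the prisoner stole a scarlet cloak", "assignment of dower in chancery"));
```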

Computing the NCD for all of Old Bailey is fairly computationally intensive: each of the 200,000 documents has an NCD value with respect to each other document, which produces some 20 billion records. That’s a lot, especially since each document and each individual pair of documents requires compression. Our first experiment used a Java implementation of the BZip2 compression algorithm. From what we could observe, the results were promising, but slow. We decided to switch to a GZip compression algorithm and we were shocked at the difference in speed for processing a couple hundred documents:

  • BZip2: 4 minutes and 49 seconds
  • GZip: 0 minutes and 6 seconds

For 20 billion records, that’s quite a difference. Our preliminary examination of the results indicated that there were only minor differences in the relative NCD values between the two algorithms. BZip2 does produce more compact representations of strings, but that’s not a significant factor for us (it changes the NCD results, but only slightly). We’ve obviously chosen to continue with GZip for computing the NCD values.

Rapid Prototyping with Mathematica

January 6th, 2011

In a previous post we mentioned that Tim Hitchcock and Bill Turkel are using Mathematica to create prototypes of various text mining and visualization tools. One advantage of using Mathematica is that it is very easy to build dynamic prototypes that can be shared with colleagues to get their feedback. We have put one of these online at the Wolfram Demonstrations Project: “Term Weighting with TF-IDF.” TF-IDF is a common information retrieval measure that indicates which terms in a document (in this case a legal trial) should be weighted most heavily when trying to understand what the document is about.
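The Demonstration itself is a Mathematica notebook, but the weighting is easy to restate. The sketch below uses one common formulation, tf(t, d) × log(N / df(t)); the exact variant used in the Demonstration may differ slightly:

```typescript
// One common tf-idf weighting: weight(t, d) = tf(t, d) * log(N / df(t)),
// where N is the number of documents and df(t) is how many of them
// contain term t. The Demonstration itself may use a different variant.
type Doc = string[]; // a trial, already tokenized into lowercase terms

function tfidf(docs: Doc[]): Map<string, number>[] {
  const N = docs.length;

  // Document frequency: how many trials each term appears in.
  const df = new Map<string, number>();
  for (const doc of docs) {
    for (const term of new Set(doc)) df.set(term, (df.get(term) ?? 0) + 1);
  }

  // Per-document weights: raw term count scaled by log(N / df).
  return docs.map((doc) => {
    const tf = new Map<string, number>();
    for (const term of doc) tf.set(term, (tf.get(term) ?? 0) + 1);

    const weights = new Map<string, number>();
    for (const [term, count] of tf) {
      weights.set(term, count * Math.log(N / (df.get(term) ?? 1)));
    }
    return weights;
  });
}
```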
