What Colour is Your Archive?

January 25th, 2011

The Proceedings of the Old Bailey contains about 120 million words. Trial records can be searched to locate individual instances of particular words (like ‘brimstone’), but what do you do when you are curious about the kinds of patterns to be found across the archive as a whole? That’s where data mining comes in.

While glancing through some records that came up while running one of our programs, I noticed a reference to a scarlet cloak. I got to wondering what kinds of colour words appear in the OB, and how frequent each one is. The two graphs below come from a Mathematica hack that took about two hours to do. First, I pulled out all of the words in the Old Bailey that also appear in an English dictionary. I also generated a list of dictionary words for spectral colours, then took the intersection of the two sets. Both of these operations were made easy by access to Wolfram Research’s curated data sets from directly within Mathematica. I then used Wolfram Alpha to look up an RGB colour value for each color word, again directly from Mathematica, and plotted each colour in an RGB space as a little cuboid. That’s the first image. The second image shows what proportion of all trials that include colour words mention a particular colour. That’s the second image.

There a number of possible confounding factors here. “Black,” “White,” and “Brown” are surnames, “silver” and “ivory” refer to materials, etc. So if we were going to use these results for something significant we would take the time to do a linguistic parse of the contexts in which each colour term appeared and make sure that we had properly disambiguated them. But this is just for fun. Because programming is fun.

One response to “What Colour is Your Archive?”

  1. […] This post was mentioned on Twitter by Tim Sherratt and williamjturkel, doug moncur. doug moncur said: Criminal Intent: what colour is your archive ? http://bit.ly/dKKkO0 via @addthis […]