Fred Gibbs has updated the SendToUrl Firefox Plugin. The Plugin allows you to send data from your Zotero library to web-based tools such as Voyant, Wordle or user-defined resources. Download the plugin or see this post for more information.
SSHRC has awarded funding to Stéfan Sinclair & Geoffrey Rockwell for the Voyeur Notebooks project. Voyeur Notebooks will prototype a new, web-based interface that will allow humanities researchers and students to create analytic works of “literate programming” that interweave narrative about the intellectual process of text analysis with procedural source code. Designing and implementing this prototype will serve the following five core objectives:
- To determine whether or not it is possible to design and implement a literate programming interface as a web application.
- To determine the optimal architecture for functionality to occur immediately within the browser (client-side) or to require a call to a web service (server-side).
- To explore what code syntax would be most appropriate for a primarily humanities audience.
- To explore how Voyeur Notebooks might be designed to leverage social media practices.
- To determine the viability of a more robust and full-featured literate programming interface, perhaps supported by a proposal to the Insight Grant program.
We look forward to developing Voyeur Notebooks within the context of the Criminal Intent collaboration.
There are few instances when the data you have are exactly what you need, particularly for feeding into analytic tools of various kinds. This is one of the reasons I think it’s so important for a significant subset of digital humanists to have some basic programming skills; you don’t need to be building tools, sometimes you’re just building data. Otherwise, not only are you dependent on the tools built by others, but you’re also dependent on the data provided by others.
Recently we’ve been working on better integration of Voyeur Tools with the Old Bailey archives. Currently we have a cool prototype that allows you to do an Old Bailey query and to save the results as a Zotero entry and/or send the document directly to Voyeur Tools. The URL that is sent to Voyeur contains parameters that cause Voyeur to go fetch the individual trial accounts from Old Bailey via its experimental API. That’s great, except that fetching all of those documents adds considerable latency to the system, which for larger collections can cause network timeouts. It would be preferable to have a local document store in Voyeur Tools.
The Proceedings of the Old Bailey contains about 120 million words. Trial records can be searched to locate individual instances of particular words (like ‘brimstone’), but what do you do when you are curious about the kinds of patterns to be found across the archive as a whole? That’s where data mining comes in.
While glancing through some records that came up while running one of our programs, I noticed a reference to a scarlet cloak. I got to wondering what kinds of colour words appear in the OB, and how frequent each one is. The two graphs below come from a Mathematica hack that took about two hours to do. First, I pulled out all of the words in the Old Bailey that also appear in an English dictionary. I also generated a list of dictionary words for spectral colours, then took the intersection of the two sets. Both of these operations were made easy by access to Wolfram Research’s curated data sets from directly within Mathematica. I then used Wolfram Alpha to look up an RGB colour value for each colour word, again directly from Mathematica, and plotted each colour in an RGB space as a little cuboid. That’s the first image. The second image shows what proportion of all trials that include colour words mention a particular colour.
There are a number of possible confounding factors here. “Black,” “White,” and “Brown” are surnames, “silver” and “ivory” refer to materials, etc. So if we were going to use these results for something significant we would take the time to do a linguistic parse of the contexts in which each colour term appeared and make sure that we had properly disambiguated them. But this is just for fun. Because programming is fun.
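For readers who don't have Mathematica to hand, the counting step can be roughed out in a few lines of Python. This is only a sketch of the idea, not the code used for the graphs: the colour vocabulary and the three sample "trials" are invented for illustration, and the function name is hypothetical. It intersects each trial's words with a set of colour words and reports, for each colour, the share of colour-mentioning trials that contain it.

```python
import re
from collections import Counter

# Hypothetical stand-in for the dictionary-derived list of colour words.
COLOUR_WORDS = {"scarlet", "black", "white", "brown", "silver", "ivory", "blue"}


def colour_trial_proportions(trials, colour_words=COLOUR_WORDS):
    """For each colour word, return the share of colour-mentioning trials containing it."""
    trials_per_colour = Counter()
    trials_with_colour = 0
    for text in trials:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        found = tokens & colour_words  # intersection of trial words and colour words
        if found:
            trials_with_colour += 1
            trials_per_colour.update(found)
    if trials_with_colour == 0:
        return {}
    return {c: n / trials_with_colour for c, n in trials_per_colour.items()}


# Invented sample texts, not real Old Bailey records.
sample_trials = [
    "He stole a scarlet cloak and a silver spoon.",
    "Mr. Brown deposed that the prisoner wore a black coat.",
    "The prisoner was found with a watch.",
]
proportions = colour_trial_proportions(sample_trials)
```

Note that the second sample trial counts "Brown" as a colour even though it is a surname, which is exactly the kind of confounding factor mentioned above.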
Part of our work on the Old Bailey collection has been to compute the Normalized Compression Distances between the approximately 200,000 documents. Normalized Compression Distance (NCD) is a way of measuring the similarity of documents by comparing them in compressed form (where repetition is essentially suppressed) – see Rudi Cilibrasi and Paul Vitányi’s description. Bill Turkel wrote about some initial small-scale experiments that he’d conducted on clustering items from NCD, and encouraged us to apply the same techniques on the Old Bailey collection.
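To make the measure concrete: if C(x) is the compressed size of document x and C(xy) is the compressed size of x and y concatenated, then NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)). Below is a minimal Python sketch of that formula. It is an illustration rather than our production code: it assumes Python's standard `zlib` module (DEFLATE, the algorithm behind gzip) as the compressor, and the sample strings are invented.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (assumed compressor for this sketch)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented sample documents, not real trial texts.
a = b"The prisoner was indicted for stealing a silver watch. " * 20
c = b"Witnesses swore the horse had been sold at Smithfield market. " * 20

same = ncd(a, a)        # a document compared with itself: small distance
different = ncd(a, c)   # unrelated documents: larger distance
```

Values near 0 suggest that two documents share most of their content, while values near 1 suggest they share very little. Real compressors add some overhead, so even identical documents rarely score exactly 0.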
Computing the NCD for all of Old Bailey is fairly computationally intensive: each of the 200,000 documents has an NCD value with respect to each other document, which produces some 20 billion records. That’s a lot, especially since each document and each individual pair of documents requires compression. Our first experiment used a Java implementation of the BZip2 compression algorithm. From what we could observe, the results were promising, but slow. We decided to switch to a GZip compression algorithm and we were shocked at the difference in speed for processing a couple hundred documents:
- BZip2: 4 minutes and 49 seconds
- GZip: 0 minutes and 6 seconds
For 20 billion records, that’s quite a difference. Our preliminary examination of the results indicated that there were only minor differences in the relative NCD values between the two algorithms. BZip2 does produce more compact representations of strings, but that’s not a significant factor for us (it changes the NCD results, but only slightly). We’ve obviously chosen to continue with GZip for computing the NCD values.
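Anyone curious to try a comparison like this can do so with a short Python sketch. The assumptions are significant: it uses Python's standard `bz2` and `zlib` modules rather than the Java implementations we ran, the documents are synthetic, and the function names are hypothetical, so the timings will not match ours. It also checks the arithmetic behind the 20 billion figure: 200,000 documents yield 200,000 × 199,999 / 2 unordered pairs.

```python
import bz2
import time
import zlib
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered document pairs among n documents."""
    return n * (n - 1) // 2


def time_all_pairs(docs, compress):
    """Seconds spent compressing every document and every concatenated pair."""
    start = time.perf_counter()
    for doc in docs:
        compress(doc)
    for x, y in combinations(docs, 2):
        compress(x + y)
    return time.perf_counter() - start


# Synthetic documents, invented for this sketch.
docs = [
    (f"Trial {i}: the prisoner was indicted for theft. " * 50).encode()
    for i in range(30)
]

timings = {
    "bz2": time_all_pairs(docs, bz2.compress),
    "zlib": time_all_pairs(docs, zlib.compress),
}
total_pairs = pair_count(200_000)  # 19,999,900,000, i.e. roughly 20 billion
```

The absolute numbers depend heavily on the machine, the implementation and the documents, so only the ratio between the two timings is worth looking at.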
In a previous post we mentioned that Tim Hitchcock and Bill Turkel are using Mathematica to create prototypes of various text mining and visualization tools. One advantage of using Mathematica is that it is very easy to build dynamic prototypes that can be shared with colleagues to get their feedback. We have put one of these online at the Wolfram Demonstrations Project: “Term Weighting with TF-IDF.” The TF-IDF is a common information retrieval measure that indicates which terms in a document (in this case a legal trial) should be weighted most heavily when trying to understand what the document is about.
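For those who prefer to see the measure as code, here is a small Python sketch of one common TF-IDF variant: term frequency as a share of the document's tokens, multiplied by the natural log of the number of documents divided by the number of documents containing the term. There are many variants, and this is not necessarily the one used in the Demonstration; the token lists are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented, pre-tokenised sample "trials".
trials = [
    ["prisoner", "stole", "watch"],
    ["prisoner", "stole", "horse"],
    ["prisoner", "forged", "note"],
]
weights = tf_idf(trials)
```

In this toy example "prisoner" appears in every trial and so gets a weight of zero, while "watch", which appears in only one trial, is weighted more heavily than "stole", which appears in two.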
As we try to figure out which data mining tools and strategies might be most useful to the ordinary working historian (whom we call “the OWH”) we are experimenting with a variety of different approaches. This way we can capitalize on the fact that we represent many different disciplinary backgrounds and skill sets. One strategy that Tim Hitchcock and Bill Turkel have been pursuing is to try to model various characteristics of the entire run of Old Bailey trials. For this, they are using the new version of Mathematica, a computational environment that combines advanced tools for programming, statistics, visualization and many other tasks. While working in Mathematica is not typical for OWHs, at least not yet, it does have the advantage of allowing us to get an overview of patterns that appear across tens of thousands of pages or millions of words. It also allows us to build various measurements into prototype interfaces and demonstrations that can be shared with colleagues (examples will be posted here later).
The British Library has recently mounted a ten-month exhibition to highlight new approaches to research and cutting-edge applications of digital technologies. Datamining with Criminal Intent is one of the thirty projects showcased by the exhibition. Dan Cohen and Tim Hitchcock were also interviewed for the exhibition, and their contributions can be viewed online at:
We now have a demonstrator site for the Old Bailey API that will form the basis for the ‘Newgate Commons’. Thanks to the hard work of Jamie McLaughling at the HRI in Sheffield, the demonstrator is now fully functioning, and we hope to make it available in a more robust version for public use within the next couple of months.
As it stands, the demonstrator allows queries on keywords and phrases, as well as on structured and tagged data, and generates the output either as a search URL or as a Zip file of the relevant trial texts. The basic interface also allows the user to build a complex query and specify the output format.
The demonstrator also allows the search criteria to be manipulated (‘Drilled’ and ‘Undrilled’), and for the results to be further broken down by specific criteria (‘Broken Down by’).
The demonstrator creates a much improved server-side search and retrieval function that generates a frequency table describing how many of its hits contain specific ‘terms’ (i.e. tagged data from the Proceedings, such as verdict). It is fast and flexible, and will form the basis for swapping either full files, or persistent address information by using a Query URL, with both Zotero and TAPoR tools.
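To give a sense of what such a frequency table looks like, here is a toy Python sketch of a ‘Broken Down by’ style count. It is not the demonstrator's code or its API: the records, field names and function name are all invented for illustration, and the real work happens server-side against the tagged Proceedings.

```python
from collections import Counter


def break_down_by(hits, field):
    """Count how many hits carry each value of a tagged field (hypothetical helper)."""
    return Counter(hit.get(field, "unknown") for hit in hits)


# Invented search hits with hypothetical field names.
hits = [
    {"id": "t1", "offence": "theft", "verdict": "guilty"},
    {"id": "t2", "offence": "theft", "verdict": "not guilty"},
    {"id": "t3", "offence": "forgery", "verdict": "guilty"},
    {"id": "t4", "offence": "theft"},
]

by_verdict = break_down_by(hits, "verdict")
by_offence = break_down_by(hits, "offence")
```

Hits that lack the chosen field are grouped under "unknown" here; a real service might instead omit them or report them separately.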