Part of our work on the Old Bailey collection has been to compute the Normalized Compression Distances between the approximately 200,000 documents. Normalized Compression Distance (NCD) is a way of measuring the similarity of documents by comparing them in compressed form (where repetition is essentially suppressed) – see Rudi Cilibrasi and Paul Vitányi’s description. Bill Turkel wrote about some initial small-scale experiments that he’d conducted on clustering items from NCD, and encouraged us to apply the same techniques on the Old Bailey collection.
Computing the NCD for all of Old Bailey is fairly computationally intensive: each of the 200,000 documents has an NCD value with respect to each other document, which produces some 20 billion records. That’s a lot, especially since each document and each individual pair of documents requires compression. Our first experiment used a Java implementation of the BZip2 compression algorithm. From what we could observe, the results were promising, but slow. We decided to switch to a GZip compression algorithm and we were shocked at the difference in speed for processing a couple hundred documents:
- BZip2: 4 minutes and 49 seconds
- GZip: 0 minutes and 6 seconds
For 20 billion records, that’s quite a difference. Our preliminary examination of the results indicated that there were only minor differences in the relative NCD values between the two algorithms. BZip2 does produce more compact representations of strings, but that’s not a signficant factor for us (it changes the NCD results, but not only slightly). We’ve bviously chosen to continue with GZip for computing the NCD values.