More Like This
The ‘More Like This’ function allows you to identify texts similar to the trial you start with. It relies on measures of word frequency and density, and works best for locating trials with similar descriptive elements. As aspects of criminal justice – the precise nature of the crime, verdict and punishment – are frequently expressed in generic form, ‘More Like This’ will also tend to locate trials for the same kind of crime or which resulted in the same punishment.
In essence, this facility uses an index of every word in the Proceedings to determine both where words appear and how common they are. Starting with an individual trial, ‘More Like This’ counts all the words it contains and ranks them from the most frequent to the least frequent (excluding ‘stop’ words and two-character words). This generates a list of terms from the original trial ordered according to frequency. The number of appearances of each word is then multiplied by a measure of how rare that word is in the Proceedings as a whole: its Inverse Document Frequency. This allows a ‘score’ to be calculated for each word in the trial. The twenty-five highest-scoring words in the resulting list are then used to generate a query that locates all trials where these words can be found.
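The scoring steps described above can be sketched in a few lines of Python. This is an illustration only, not the code used by the site or by Lucene: the stop-word list is a short hypothetical sample, the function names are invented for the example, the Inverse Document Frequency formula is one common smoothed variant chosen as an assumption, and words shorter than two characters are dropped along with two-character words.

```python
import math
import re
from collections import Counter

# Hypothetical, abbreviated stop-word list; the real list is not given here.
STOP_WORDS = {"the", "and", "was", "that", "for", "with", "his", "her", "this"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def keep(word):
    """Drop stop words and words of two characters or fewer (an assumption)."""
    return len(word) > 2 and word not in STOP_WORDS


def document_frequencies(corpus):
    """Count, for each word, how many texts in the corpus contain it."""
    df = Counter()
    for text in corpus:
        df.update({w for w in tokenize(text) if keep(w)})
    return df


def top_terms(trial, corpus, limit=25):
    """Return the highest-scoring words of a trial, best first."""
    df = document_frequencies(corpus)
    n_docs = len(corpus)
    # Term frequency: how often each kept word appears in the starting trial.
    tf = Counter(w for w in tokenize(trial) if keep(w))
    # Score = term frequency x inverse document frequency.
    # Assumed IDF variant: log(N / (1 + df)) + 1; other variants exist.
    scores = {
        w: count * (math.log(n_docs / (1 + df[w])) + 1)
        for w, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:limit]
```

Called as top_terms(trial_text, all_trial_texts), the function returns up to twenty-five words. In the facility described here, a list of this kind would then be turned into a search query over the whole of the Proceedings.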
For a more detailed description of this function, see How More Like This Works in Lucene.