Thursday 15 May 2014

Detecting similar documents with "shingling"

The Problem


I looked into this based on an example interview question. Imagine that you're implementing a meta-search engine for job advertisements. The search engine queries a number of job databases, and combines the results.

One issue is that the same job is often advertised in more than one database. How would you detect duplicate listings across the sources? Keep in mind that the listings may not be completely identical. There may be slight differences in the wording, order of text, punctuation etc.

The question boils down to how to measure the similarity of two documents. Near-duplicates are documents with a high similarity measure. There is a simple way to measure this which is also quite fast on relatively small sets of short documents.

Jaccard Similarity


There is a mathematical measure of similarity known as the Jaccard similarity. Without getting too theoretical, Jaccard similarity is a value from 0 to 1 defined for two sets. It describes how much sets A and B overlap. It is calculated as: 

|A ∩ B| / |A ∪ B|
Venn diagram of sets A and B

In the diagram on the right, the Jaccard similarity is the size of the dark blue overlapping area divided by the total area of A and B.

Sets with Jaccard similarity 1.0 are identical. Sets with Jaccard similarity 0.0 are completely distinct. Sets with Jaccard similarity 0.95 are highly similar, but not identical.
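The calculation is easy to express with standard set operations. Here is a minimal sketch in Java (the class and method names are just for illustration), using the identity |A ∪ B| = |A| + |B| − |A ∩ B| to avoid building the union explicitly:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Jaccard {
    // Jaccard similarity: |A ∩ B| / |A ∪ B|
    public static <T> double similarity(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0; // convention: two empty sets are identical
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b); // keep only elements also in b
        // |A ∪ B| = |A| + |B| - |A ∩ B|, so no need to materialise the union
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> b = new HashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println(similarity(a, b)); // 2 shared elements / 4 distinct = 0.5
    }
}
```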

Shingling


How can we turn job descriptions into sets to calculate the Jaccard similarity?

A straightforward idea is to use the set of words in the text. For example "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" contains 9 unique words:

a, chuck, could, how, if, much, wood, woodchuck, would

Seems reasonable, but descriptions for similar jobs at different companies probably use very similar words. They may have a high Jaccard similarity even though they are different advertisements.

A better approach which includes information about word order is "shingling". A shingle is just a small slice of text from the document. Typically, shingles are 5 to 9 characters long, and can span multiple words. Using the previous example and shingles of length 7, the first three shingles would be:

"How muc", "ow much", "w much "

The entire document can be shingled this way, and the shingles used as the document set.
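Extracting shingles is just a sliding window over the text. A sketch in Java (again, the names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class Shingler {
    // Slide a window of width k across the text, collecting every substring.
    public static Set<String> shingles(String text, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= text.length(); i++) {
            result.add(text.substring(i, i + k));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> s = shingles("How much wood would a woodchuck chuck", 7);
        System.out.println(s.contains("How muc")); // true
        System.out.println(s.contains("ow much")); // true
        System.out.println(s.contains("w much ")); // true
    }
}
```

Note that storing the shingles in a set also deduplicates them, which is exactly what we want for the Jaccard calculation.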

Putting it together


The basic algorithm is:

1) Extract shingle sets for each document.
2) For every pair of documents:
    a) Calculate size of intersection of shingle sets.
    b) Calculate size of union of shingle sets.
    c) If a/b > t, flag the pair as duplicates.

The value t is a similarity threshold. I chose 0.5.
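The steps above can be sketched in Java as follows. This is my own illustrative reconstruction, not the original implementation; the class name, threshold constant and example documents are all made up for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateDetector {
    static final int SHINGLE_LENGTH = 7;
    static final double THRESHOLD = 0.5; // similarity threshold t

    static Set<String> shingles(String text, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= text.length(); i++) result.add(text.substring(i, i + k));
        return result;
    }

    // Returns index pairs {i, j} of documents whose shingle sets have
    // Jaccard similarity above the threshold.
    public static List<int[]> findDuplicates(List<String> docs) {
        // Step 1: extract shingle sets for each document.
        List<Set<String>> sets = new ArrayList<>();
        for (String d : docs) sets.add(shingles(d, SHINGLE_LENGTH));

        // Step 2: compare every pair of documents.
        List<int[]> duplicates = new ArrayList<>();
        for (int i = 0; i < sets.size(); i++) {
            for (int j = i + 1; j < sets.size(); j++) {
                Set<String> inter = new HashSet<>(sets.get(i));
                inter.retainAll(sets.get(j));                 // a) intersection size
                int union = sets.get(i).size() + sets.get(j).size()
                        - inter.size();                       // b) union size
                if (union > 0 && (double) inter.size() / union > THRESHOLD) {
                    duplicates.add(new int[]{i, j});          // c) flag as duplicates
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Senior Java developer wanted for fintech startup in London.",
            "Senior Java developer wanted for a fintech startup in London!",
            "Experienced barista needed for busy city cafe.");
        for (int[] p : findDuplicates(docs)) {
            System.out.println("Documents " + p[0] + " and " + p[1] + " look like duplicates");
        }
    }
}
```

Running this flags the two near-identical job listings, while the unrelated one is left alone.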

I implemented this basic algorithm in Java as a practice exercise. I ran the quick and dirty implementation over news articles downloaded from news websites, some of which were very similar. It seemed able to identify articles on the same topic, or articles with large sections of identical text, depending on the similarity threshold used.

But how efficient is this naive implementation?

Time


Time complexity is dominated by step 2). Using a hash-based set data structure (such as HashSet in Java) it is:

O(d² · n)

Where d is the number of documents and n is the document length in characters, which is approximately the same as the number of shingles. This is reasonable for relatively small collections of moderately long documents.

Space


Space complexity is not great, but again not an issue on a small scale. The uncompressed shingle sets extracted in step 1) use:

O(d · n · k)

For shingles of size k. In other words, k times as much memory as the documents themselves!

Further reading


I learned about shingling and Jaccard similarity in "Mining of Massive Datasets" by Rajaraman and Ullman (2013), which describes a more elegant and scalable technique for working with longer documents called "minhash signatures". It is much more efficient in both time and space, but still relatively straightforward.
