The Problem
One issue is that the same job is often advertised in more than one database. How would you detect duplicate listings across the sources? Keep in mind that the listings may not be completely identical. There may be slight differences in the wording, order of text, punctuation etc.
There is a mathematical measure of similarity known as the Jaccard similarity. Without getting too theoretical, Jaccard similarity is a value from 0 to 1 defined for two sets. It describes how much sets A and B overlap. It is calculated as:
In the diagram on the right, the Jaccard similarity is the size of the dark blue overlapping area divided by the total area of A and B.
The entire document can be shingled this way, and the shingles used as the document set.
The basic algorithm is:
1) Extract shingle sets for each document.
2) For every pair of documents:
a) Calculate size of intersection of shingle sets.
b) Calculate size of union of shingle sets.
c) If a÷b>c flag the pair as duplicates.
The value c is a similarity threshold. I chose 0.5.
I implemented this basic algorithm in Java as a practice exercise. I ran the quick and dirty implementation over news articles downloaded from news websites, some of which were very similar. It seemed able to identify articles on the same topic, or articles with large sections of identical text, depending on the similarity threshold used.
But how efficient is this naive implementation?
Time complexity is dominated by step 2). Using a hash-based set data structure (such as HashSet in Java) it is:
O(d2.n)
Where d is the number of documents and n is the document length in characters, which is approximately the same as the number of shingles. This is reasonable for relatively small collections of moderately long documents.
Space complexity is not great, but again not an issue on a small scale. The uncompressed shingle sets extracted in step 1) use:
O(d.n.k)
For shingles of size k. In other words, k times as much memory as the documents themselves!
I learned about shingling and Jaccard similarity in "Mining of Massive Datasets" by Rajaraman and Ullman (2013), which describes a more elegant and scalable technique for working with longer documents called "minhash signatures". It is much more efficient in both time and space, but still relatively straightforward.
The question boils down to how to measure the similarity of two documents. Near-duplicates are documents with a high similarity measure. There is a simple way to measure this which is also quite fast on relatively small sets of short documents.
Jaccard Similarity
|A∩B||A∪B|
In the diagram on the right, the Jaccard similarity is the size of the dark blue overlapping area divided by the total area of A and B.
Sets with Jaccard similarity 1.0 are identical. Sets with Jaccard similarity 0.0 are completely distinct. Sets with Jaccard similarity 0.95 are highly similar, but not identical.
How can we turn job descriptions into sets to calculate the Jaccard similarity?
Shingling
A straightforward idea is to use the set of words in the text. For example "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" contains 9 unique words:
a, chuck, could, if, how, much, wood, woodchuck, would
Seems reasonable, but descriptions for similar jobs at different companies probably use very similar words. They may have a high Jaccard similarity even though they are different advertisements.
A better approach which includes information about word order is "shingling". A shingle is just a small slice of text from the document. Typically, shingles are 5 to 9 characters long, and can span multiple words. Using the previous example and shingles of length 7, the first three shingles would be:
"How muc", "ow much", "w much "
The entire document can be shingled this way, and the shingles used as the document set.
Putting it together
1) Extract shingle sets for each document.
2) For every pair of documents:
a) Calculate size of intersection of shingle sets.
b) Calculate size of union of shingle sets.
c) If a÷b>c flag the pair as duplicates.
The value c is a similarity threshold. I chose 0.5.
I implemented this basic algorithm in Java as a practice exercise. I ran the quick and dirty implementation over news articles downloaded from news websites, some of which were very similar. It seemed able to identify articles on the same topic, or articles with large sections of identical text, depending on the similarity threshold used.
But how efficient is this naive implementation?
Time
O(d2.n)
Where d is the number of documents and n is the document length in characters, which is approximately the same as the number of shingles. This is reasonable for relatively small collections of moderately long documents.
Space
O(d.n.k)
For shingles of size k. In other words, k times as much memory as the documents themselves!
No comments:
Post a Comment