Introduction

Picture it…New York City, 2014: two documents walk into a bar. We are given the task of determining whether the documents are duplicates of each other or just near duplicates. How would we do this if we weren't allowed to actually read the documents? How would we solve this same problem if we needed to compare n documents?

Though the above scenario is cute, finding similar (aka near duplicate) documents is serious business. Search engines, news aggregators, e-commerce sites and a host of other software applications use algorithms to detect similar documents. One common algorithm for detecting similar documents is minhash. Minhash makes use of the Jaccard similarity coefficient to detect near duplicate documents. In this post, I will discuss the basics of the Jaccard similarity coefficient as well as a technique called "shingling," which allows us to represent documents as sets (or bags) in order to calculate this coefficient.

All About The Shingles: Representing Documents as Sets

Let's assume a document is just a ginormous string. In order to compare documents to determine if they are near duplicates, we need a way to represent them such that if short strings or phrases appear in both documents, even in different orders, we have a good chance of determining how similar the documents are. One common approach is to represent documents as sets of "contiguous subsequences of tokens"[1] or as sets of short substrings of characters. This approach is known as shingling.

There are two approaches to shingling: w-shingles and k-shingles. A w-shingle is a tuple of w tokens that appear contiguously in a given document. A k-shingle is a substring of k characters from a given document. Note that it is very possible for a shingle to appear more than once in a document. Using sets instead of bags (lists) to store shingles eliminates duplicates, because sets by their nature do not allow duplicate items.

Examples:

 4-shingles (k-shingles) from this string: "abcd efghij cklmn op!"

 ['abcd', 'bcde', 'cdef', 'fghi', 'hijc', 'mnop', 'lmno', 'defg', 'cklm', 'efgh', 'ghij', 'jckl', 'klmn', 'ijck']

 // Note, the above document string was normalized (white spaces and punctuation removed)

 4-gram shingles (w-shingles) from this string: "do or do not there is no try"
 
 [('do', 'or', 'do', 'not'), ('do', 'not', 'there', 'is'), ('there', 'is', 'no', 'try'), ('or', 'do', 'not', 'there'), ('not', 'there', 'is', 'no')]

There are a few things that can be done to normalize the substrings and tokens in shingles, such as removing all white space, removing punctuation, stemming and converting all strings to lowercase. These choices can affect the quality of the shingles generated and hence the results when comparing documents.
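As a rough sketch (not the OpenLSH implementation), the two shingling approaches above might look like this in Python, with the simple normalization described (lowercasing, stripping whitespace and punctuation) baked in:

```python
import re

def k_shingles(text, k):
    # Normalize: lowercase, then strip anything that isn't a letter or digit.
    normalized = re.sub(r'[^a-z0-9]', '', text.lower())
    # Every substring of length k, stored in a set to drop duplicates.
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}

def w_shingles(text, w):
    # Normalize: lowercase and strip punctuation, but keep whitespace
    # so we can split the document into tokens.
    tokens = re.sub(r'[^a-z0-9\s]', '', text.lower()).split()
    # Every run of w consecutive tokens, as a tuple, stored in a set.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}
```

Running these on the example strings above reproduces the shingle sets shown (as sets, so in no particular order).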

Your Shingle Size and You

After doing some research and writing my own shingling implementation, it seems choosing a shingle size is more of a dark art than a science, though there are some rules of thumb. The shingle size plays a big part in the accuracy of the results when comparing documents for similarity. It is determined by the length of the "typical" documents you are comparing as well as how large the character set is [2]. According to "Mining Massive Datasets," a good rule of thumb is that the shingle size "...should be picked large enough that the probability of any given shingle appearing in any given document is low." [2]
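To get a feel for that rule of thumb, it can help to look at the size of the shingle space: assuming (for illustration only) a 26-letter lowercase alphabet, there are 26^k possible k-shingles, so each extra character of shingle length makes any given shingle far less likely to appear in a document by chance:

```python
# Assumed alphabet size for illustration: lowercase English letters only.
ALPHABET = 26

def shingle_space(k):
    # Number of possible k-shingles over the assumed alphabet.
    return ALPHABET ** k

# A short shingle (k=4 or 5) may suit short documents,
# while longer documents tend to call for a larger k.
small = shingle_space(5)  # ~11.9 million possible shingles
large = shingle_space(9)  # ~5.4 trillion possible shingles
```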

For more information on choosing a shingle size, see Chapter 3 of "Mining Massive Datasets" [2].

Comparing Documents: A Set Problem

It's nice that we know how to represent the documents we want to compare as sets, but now what? Comparing two documents for textual similarity is essentially a set problem, thanks to the Jaccard similarity coefficient. The Jaccard similarity coefficient of sets A and B is defined as the cardinality of the intersection of A and B divided by the cardinality of the union of A and B. The ratio is a value between 0 and 1, where 0 means the documents aren't similar at all and 1 means they are identical.
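A minimal Python sketch of the coefficient (not the OpenLSH version), operating on any two shingle sets:

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; we define the value as 0.0 when both sets
    # are empty, since the ratio is otherwise undefined.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For example, two shingle sets that share two of four distinct shingles have a coefficient of 0.5.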

Okay, so this all sounds pretty simple and works fine if we don't have many documents to compare, but if we needed to search and compare a huge, practically unbounded number of documents, comparing every pair would be highly inefficient. In the next section, I will briefly discuss some optimizations that make near duplicate document detection more efficient when searching and comparing a large collection of documents.

Beyond The Basics: Optimizations

One common optimization for large scale search problems is reducing the search space; whether through heuristics or random sampling, shrinking the search space is critical when searching a very large, if not practically unbounded, number of documents. Another common optimization is reducing the amount of storage required to hold documents, whether short or long term.

Many applications of near duplicate document detection today are very large scale search problems. We may need to compare millions if not billions of documents at a time, so we need optimizations that reduce the search space and allow us to quickly compare and efficiently store documents. Two such popular optimizations are minhash and Locality Sensitive Hashing (LSH).

Minhash maps shingles (sets of strings of characters or tokens) to integers by hashing each shingle with n hashing functions; for each hash function, the minimum hash value over all of a document's shingles is kept, producing a signature of n integers. Using n hash functions and keeping the minimum value acts as "random" shingle sampling. By hashing strings we can compare and store them more efficiently, as integers are programmatically "cheaper" to work with than strings.
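A compact sketch of this idea (not the OpenLSH implementation; the random linear hash family and the use of Python's built-in hash to turn shingles into integers are illustrative assumptions):

```python
import random

def make_hash_functions(n, prime=2147483647, seed=42):
    # n random linear hash functions h(x) = (a*x + b) mod prime.
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in coeffs]

def minhash_signature(shingles, hash_funcs):
    # For each hash function, keep the minimum value over all shingles;
    # the result is a fixed-length signature of n integers.
    ids = [hash(s) for s in shingles]
    return [min(h(x) for x in ids) for h in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of signature positions that agree estimates the
    # Jaccard similarity of the underlying shingle sets.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

With more hash functions, the estimate concentrates more tightly around the true Jaccard coefficient, at the cost of longer signatures.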

Locality Sensitive Hashing is a probabilistic search technique that uses hashing to detect similar documents via collisions. LSH seeks to limit the search space to only those documents that are likely to be similar. One approach to LSH is similar to minhash: we hash shingles n times, but in this case we want hashing functions that will generate the same hash codes for similar inputs. Using these hashes we can determine which document pairs should be compared for similarity, as only similar documents should have hash collisions. This bucketing of similar items drastically reduces the search space, as we will not need to examine all candidate document pairs. [3] Another approach to optimizing LSH is to parallelize the algorithm; a paper written by researchers from MIT and Intel discusses this variant, called PLSH. [4]
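One common way to realize this for minhash signatures is the banding scheme described in [3]: split each signature into bands of rows, bucket documents by the contents of each band, and treat any two documents sharing a bucket as a candidate pair. A sketch (names and parameters are illustrative, not the OpenLSH API):

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, bands, rows):
    # signatures: dict mapping doc id -> minhash signature (list of ints),
    # where each signature has bands * rows entries.
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            # Key on the band index too, so identical values in
            # different bands never collide with each other.
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    # Any two documents sharing a bucket in any band are candidates.
    candidates = set()
    for members in buckets.values():
        ids = sorted(members)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                candidates.add((ids[i], ids[j]))
    return candidates
```

Tuning bands and rows trades false positives against false negatives: more bands with fewer rows each catches less-similar pairs, while fewer bands with more rows each only buckets together very similar documents.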

Conclusion

Though minhash, shingling and even LSH are relatively simple algorithms, they allow us to do very powerful work in searching for near duplicate documents. This post is only a brief overview of minhash and shingling, but I hope it proves useful to others looking to develop their own implementations. Below, I have included links to references as well as other helpful resources.

We've implemented minhash, k-shingling, w-shingling and Jaccard Similarity Coefficient calculation in OpenLSH (see github link below). Please note, as of the time of writing this post our minhash implementation is untested, so use at your own risk. I am working on unit tests and should have something pushed soon. Our LSH implementation is also coming soon.

References

[1] W-shingling - http://en.wikipedia.org/wiki/W-shingling
[2] Choosing the Shingle Size, Chapter 3, - http://infolab.stanford.edu/~ullman/mmds/book.pdf
[3] LSH for Minhash Signatures, Chapter 3, 3.4.1- http://infolab.stanford.edu/~ullman/mmds/book.pdf
[4] Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing - http://istc-bigdata.org/plsh/docs/plsh_paper.pdf