Tutorial: Google All Pairs Similarity Search

Google All Pairs Similarity Search is a program released on Google Code. It allows for extremely fast similarity calculations on data sets of millions of documents.

Similarity Join

This is the problem that Google All Pairs Similarity Search is trying to solve: given a dataset of sparse vectors, find all pairs of vectors whose similarity, according to a similarity function such as cosine similarity, meets or exceeds a given threshold.
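To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm used by the program, which is designed to avoid comparing every pair; it only shows what a correct answer looks like. The function names `cosine` and `all_pairs`, the dict-based vector layout, and the sample data are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature_id: weight}."""
    # Iterate over the shorter vector; missing features count as 0.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def all_pairs(vectors, threshold):
    """Return (id_a, id_b, score) for every pair with score >= threshold."""
    results = []
    # Naive O(n^2) comparison of every pair of vectors.
    for (id_a, u), (id_b, v) in combinations(sorted(vectors.items()), 2):
        score = cosine(u, v)
        if score >= threshold:
            results.append((id_a, id_b, score))
    return results

# Hypothetical sample data: three tiny "documents".
docs = {
    "d1": {1: 1.0, 2: 1.0, 3: 1.0},
    "d2": {1: 1.0, 2: 1.0, 4: 1.0},
    "d3": {5: 1.0},
}
pairs = all_pairs(docs, threshold=0.5)
print(pairs)
```

With this sample data only the pair `d1`, `d2` passes the threshold, because the two vectors share two of their three features. The quadratic loop is what becomes impractical at millions of documents.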

Vectors are binary: each component is either 1 or 0. Components with value 0 can be left out, creating a sparse data set.
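One way to picture the sparse binary representation is to store each vector as the set of positions that hold a 1. The sketch below assumes that layout; the helper names `to_sparse` and `binary_cosine` and the sample vectors are hypothetical. For binary vectors the cosine similarity reduces to the size of the intersection divided by the square root of the product of the two set sizes.

```python
from math import sqrt

def to_sparse(dense):
    """Keep only the positions whose value is 1; zeros are dropped."""
    return {i for i, x in enumerate(dense) if x == 1}

def binary_cosine(a, b):
    """Cosine similarity of two binary vectors stored as sets of positions."""
    if not a or not b:
        return 0.0
    # Dot product = number of shared positions; norm = sqrt(set size).
    return len(a & b) / sqrt(len(a) * len(b))

dense_a = [1, 0, 0, 1, 0, 1, 0, 0]
dense_b = [1, 0, 0, 1, 0, 0, 0, 1]

a = to_sparse(dense_a)  # positions 0, 3, 5
b = to_sparse(dense_b)  # positions 0, 3, 7
score = binary_cosine(a, b)
print(sorted(a), sorted(b), score)
```

Here each eight-component vector shrinks to three stored positions, and the similarity is computed from the two shared positions alone.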

The Paper

The algorithm was first described in detail in the 2007 paper Scaling Up All-Pairs Similarity Search [pdf].

It was published in the Proceedings of the Sixteenth International World Wide Web Conference.
