unique-content1[1]

Tutorial: Google All Pairs Similarity Search

Google All Pairs Similarity Search is a program released on Google Code. It allows for extremely fast similarity calculations on data sets of millions of documents.

Similarity Join

This is the problem that Google All Pairs Similarity Search is trying to solve: Given a dataset of sparse vector data, find all similar vector pairs according to a similarity function such as cosine distance and a given similarity score threshold.

Vectors are binary. Either 1 or 0. Vectors with 0 can be left out, creating a sparse data set.

The Paper

The algorithm is first described in detail in the 2007 paper Scaling Up All-Pairs Similarity Search [pdf].

It was released at the Proceedings of the Sixteenth International World Wide Web Conference.

Under construction. Come back soon.

How to build Google All Pairs Similarity Search

Here follow the instructions to build Google All Pairs Similarity Search on both Linux and Windows.

Linux

[~/ap-demo]: svn checkout http://google-all-pairs-similarity-search.googlecode.com/svn/trunk/ google-all-pairs-similarity-search-read-only
...
Checked out revision 26.
[~/ap-demo]: svn checkout http://sparsehash.googlecode.com/svn/trunk/ sparsehash-read-only
...
Checked out revision 46.
[~/ap-demo]: cd sparsehash-read-only/
[~/ap-demo/sparsehash-read-only]: ./configure
...
[~/ap-demo/sparsehash-read-only]: make
...
[~/ap-demo/sparsehash-read-only]: cd ../google-all-pairs-similarity-search-read-only/
[~/ap-demo/google-all-pairs-similarity-search-read-only]: make
...
g++ -DNDEBUG -I../sparsehash-read-only/src -O3 -D_FILE_OFFSET_BITS=64  -o ap main.o allpairs.o data-source-iterator.o
[~/ap-demo/google-all-pairs-similarity-search-read-only]: wget http://google-all-pairs-similarity-search.googlecode.com/files/dblp_le_fixed.bin.gz
...
10:20:59 (4.91 MB/s) - 'dblp_le_fixed.bin.gz' saved [26061298/26061298]
[~/ap-demo/google-all-pairs-similarity-search-read-only]: gunzip dblp_le_fixed.bin.gz
[~/ap-demo/google-all-pairs-similarity-search-read-only]: ./ap .9 dblp_le_fixed.bin > /dev/null
; User specified similarity threshold: 0.9
; Found 34807 similar pairs.
; Candidates considered: 4988117
; Vector intersections performed: 1618864
; Total running time: 5 seconds

Windows

C:\ap-demo>svn checkout http://google-all-pairs-similarity-search.googlecode.com/svn/trunk/ google-all-pairs-similarity-search-read-only
...

C:\ap-demo>cd google-all-pairs-similarity-search-read-only/

C:\ap-demo\google-all-pairs-similarity-search-read-only>"%VS90COMNTOOLS%vsvars32.bat"
Setting environment for using Microsoft Visual Studio 2008 x86 tools.

C:\ap-demo\google-all-pairs-similarity-search-read-only>nmake -f Makefile.w32 
...
/out:ap.exe 
...

C:\ap-demo\google-all-pairs-similarity-search-read-only>wget http://google-all-pairs-similarity-search.googlecode.com/files/dblp_le_fixed.bin.gz
...
15:20:55 (152.99 KB/s) - 'dblp_le_fixed.bin.gz' saved [26061298/26061298]

C:\ap-demo\google-all-pairs-similarity-search-read-only>gunzip dblp_le_fixed.bin.gz 

C:\ap-demo\google-all-pairs-similarity-search-read-only>ap.exe .9 dblp_le_fixed.bin > tmp.out
; User specified similarity threshold: 0.9
; Found 34807 similar pairs.
; Candidates considered:  4988117
; Vector intersections performed: 1618864
; Total running time: 9 seconds

Notice that Windows performs a bit slower on the same data set. This is because the Linux build uses sparse hash, which is a faster hashing mechanism.

Leave a Reply

Your email address will not be published. Required fields are marked *