Skip to main content

Python bindings for the Google All Pairs Similarity Search

Project description

Welcome to the Google all pairs similarity search package!

This package provides a bare-bones implementation of the
"All-Pairs-Binary" algorithm described in the following paper:

R. J. Bayardo, Yiming Ma, Ramakrishnan Srikant. Scaling Up All-Pairs
Similarity Search. In Proc. of the 16th Int'l Conf. on World Wide Web,
131-140, 2007. (download from: http://www.bayardo.org/ps/www2007.pdf)

===============================================================================
BUILDING

*** For Linux & similar platforms ***

Simply typing "make" in the directory containing this file will build
the executable. This code depends on the google-sparsehash package
(http://code.google.com/p/google-sparsehash/). If the build fails, you
may have to update the CFLAGS within the Makefile to specify the
location of the sparsehash header files.

If you'd rather not use the google sparsehash library, it is
straightforward to modify the algorithm to use STL hash_map instead.

*** For Win32 / Microsoft VC++ v9 & a command shell ***

NOTE: The build under windows does not exploit (and hence it does not
require) the Google sparsehash library. Instead it uses the regular
STL hash_map, which isn't quite as fast but seems to work well
enough. To build:

C:\google-all-pairs-similarity-search>"%VS90COMNTOOLS%vsvars32.bat"

C:\google-all-pairs-similarity-search>nmake -f Makefile.w32

===============================================================================
DATASET FORMAT

The dataset format expected by the algorithm is "apriori binary." In
an apriori binary encoded dataset, each vector has the following
format where each component is encoded as a raw 4-byte integer:

<record id> <number of features> <fid 1> <fid 2> ... <fid n>

(Endianness of the integers should match that of your platform,
e.g. little-endian for Intel x86 architectures.)

Record ids can be arbitrary integers. Feature ids should be assigned
such that feature id "i" corresponds to the "ith" least frequently
occuring feature in the dataset. For example, the feature with id
whose integer value is "1" should be the least frequently occurring
value in the dataset. Feature ids within a vector should then appear
in increasing order of their id.

Records in the dataset should appear in order of increasing vector
size (a vector's size is its number of features.)

It is easy to extend the algorithm to read CSV formatted data if
desired.

You can download the apriori-binary little-endian encoded dblp dataset
from here: http://www.bayardo.org/bin/dblp_le.bin.gz

===============================================================================
RUNNING THE ALGORITHM

To run the algorithm under Linux/Unix:

./ap <sim_threshold> <dataset_path>

Under Windows:

ap <sim_treshold> <dataset_path>

For example, to mine all pairs of vectors with .9 or higher cosine
similarity from the dblp dataset on a Linux-type system, you might
type:

[~/google-all-pairs-similarity-search]: /ap .9 dblp_le.bin > some_file.txt
; User specified similarity threshold: 0.9
; Found 35022 similar pairs.
; Candidates considered: 9274991
; Vector intersections performed: 2063790
; Total running time: 5 seconds
[~/google-all-pairs-similarity-search]:

The output of the algorithm is text format. Each line will contain a
pair of vector ids that were found to be similar, followed by the
actual cosine similarity score.

NOTE: The algorithm is currently configured to use no more than ~1GB
of main memory. The algorithm will enter an "out of core" mode should
the dataset require more than this amount of RAM to process in a
single pass. The constant bounding RAM usage can be changed in main.cc
as desired.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyallpairs-0.1.0.tar.gz (11.2 kB view details)

Uploaded Source

File details

Details for the file pyallpairs-0.1.0.tar.gz.

File metadata

  • Download URL: pyallpairs-0.1.0.tar.gz
  • Upload date:
  • Size: 11.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for pyallpairs-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c13599f2a115cee2d5b43ad14eee2bf17bebe2daf5f10a5f936c0f1b4d47169d
MD5 8038fe01ae92168954100e87ddbc04eb
BLAKE2b-256 ce4e65925066021747aa13846629c8e6053e7dbb0d608786864686a9d8a35d14

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page