Skip to main content

A Python implementation of Locality Sensitive Hashing for finding nearest neighbors and clusters in multidimensional numerical data

Project description

Consult the module API page at

https://engineering.purdue.edu/kak/distLSH/LocalitySensitiveHashing-1.0.1.html

for all information related to this module, including information regarding the latest changes to the code. The page at the URL shown above lists all of the module functionality you can invoke in your own code.

Locality Sensitive Hashing (LSH) is a computationally efficient approach for finding nearest neighbors in large datasets. The main idea in LSH is to avoid having to compare every pair of data samples in a large dataset in order to find the nearest similar neighbors for the different data samples. With LSH, one can expect a data sample and its closest similar neighbors to be hashed into the same bucket with a high probability. By treating the data samples placed in the same bucket as candidates for similarity checking, we significantly reduce the computational burden associated with similarity detection in large datasets.

While LSH algorithms have traditionally been used for finding nearest neighbors, this module goes a step further and explores using LSH for clustering the data. Strictly speaking, this violates the basic mandate of LSH, which is to return just the nearest neighbors. (A data sample X being Y’s nearest neighbor and Y being Z’s nearest neighbor, in the sense neighbors are commonly defined with the Cosine metric in LSH, does not imply that X and Z will always be sufficiently close to be considered each other’s nearest neighbors.) Nonetheless, if you believe that your datafile consists of non-overlapping data clusters, this module may do a decent job of finding those clusters.

Typical usage syntax for invoking LocalitySensitiveHashing in your own code:

from LocalitySensitiveHashing import *
datafile = "data_for_lsh.csv"
lsh = LocalitySensitiveHashing(
                   datafile = datafile,
                   dim = 10,
                   r = 50,
                   b = 100,
                   expected_num_of_clusters = 10,
          )
lsh.get_data_from_csv()
lsh.initialize_hash_store()
lsh.hash_all_data()
similarity_groups = lsh.lsh_basic_for_neighborhood_clusters()
coalesced_similarity_groups = lsh.merge_similarity_groups_with_coalescence( similarity_groups )
merged_similarity_groups = lsh.merge_similarity_groups_with_l2norm_sample_based( coalesced_similarity_groups )
lsh.write_clusters_to_file( merged_similarity_groups, "clusters.txt" )

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

LocalitySensitiveHashing-1.0.1.tar.gz (63.6 kB view details)

Uploaded Source

File details

Details for the file LocalitySensitiveHashing-1.0.1.tar.gz.

File metadata

File hashes

Hashes for LocalitySensitiveHashing-1.0.1.tar.gz
Algorithm Hash digest
SHA256 a4b38b34b94f5f251096f53655cd2fdff0dbdfa552ed89a3d4ec84e738f0e5c9
MD5 5ad10c6eef6acdf7912e2960f552021e
BLAKE2b-256 6d334ac1ce68013e61a418968f3bded7571b4259f45655445c321d6d3fb84ac1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page