Skip to main content

Nearest neighbors search using locality-sensitive hashing

Project description

lshashing

Build Status Downloads

python library to perform Locality-Sensitive Hashing to search for nearest neighbors in high dimensional data.

For now it only supports random projections but future versions will support more methods and techniques.

Implementation

pip install lshashing

Make sure the data and query points are numpy arrays!.

from lshashing import LSHRandom
import numpy as np

sample_data = np.random.randint(size = (20, 20), low = 0, high = 100)
point = np.random.randint(size = (1,20), low = 0, high = 100)

lshashing = LSHRandom(sample_data, hash_len = 4, num_tables = 2)

print(lshashing.tables[0].hash_table)
# {25: [0, 1, 6, 11, 13, 15], 
#  0: [2, 5], 
#  1: [3, 7, 16, 17, 18, 19], 
#  16: [4, 8, 14], 
# 144: [9], 
#  9: [10], 
#  81: [12]}

print(lshashing.knn_search(sample_data, point[0], k = 4, buckets = 3, radius = 2))
# [[150.33961554   2.        ]
#  [151.30432909   5.        ]
#  [155.11608556   3.        ]
#  [166.76030703  18.        ]]

First column is the distances while the second column is the indices of the neighbors.

lshashing also supports parallelism using joblib library.

sample_data = np.random.randint(size = (20, 20), low = 0, high = 100)
point = np.random.randint(size = (1, 20), low = 0, high = 100)

lsh_random_parallel = LSHRandom(sample_data, 4, parallel = True)
lsh_random_parallel.knn_search(sample_data, point[0], 4, 3, parallel = True)
# [Neighbor(index=1, distance=152.6106156202772, value=[[47 51 23]...]),
#  Neighbor(index=16, distance=168.08331267558955, value=[[55 61 83]...]),
#  Neighbor(index=14, distance=171.8254928699464, value=[[98 43 81]...]),
#  Neighbor(index=7, distance=183.15294155431957, value=[[75 39 27]...])]

Adding new entries

Simply you can add new entries to the hash tables using the add_new_entry method.

from lshashing import LSHRandom
import numpy as np

sample_data = np.random.randint(size = (15, 20), low = 0, high = 100)
point = np.random.randint(size = (1, 20), low = 0, high = 100)

lshashing = LSHRandom(sample_data, hash_len = 3, num_tables = 2)
print(lshashing.tables[0].hash_table)
# {9: [0, 3, 9, 11], 36: [1, 4, 5, 10, 12, 13], 4: [2, 6, 7, 8], 49: [14]}

print(lshashing.n_rows)
# 15

lshashing.add_new_entry(point)

print(lshashing.n_rows)
# 16

print(lshashing.tables[0].hash_table)
# {9: [0, 3, 9, 11], 36: [1, 4, 5, 10, 12, 13], 4: [2, 6, 7, 8, 15], 49: [14]}

Locality-sensitive hashing is an approximate nearest neighbors search technique which means that the resulted neighbors may not always be the exact nearest neighbor to the query point. To enhance and ensure better extactness, hash length used, number of hash tables and the buckets to search need to be tweaked.

I also made some comparison between lshashing, linear method to get KNNs and scikit-learn's BallTree and KDTree and here are the results.

python examples/lshashing_compare.py

#                 ##### LSHashing Module #####
# sample data shape:  (15000, 30000)

# query point
# (30000,)

#         ##### Start comparison in searching for 5 nearest neighbors #####

# ##### search knn traditionaly
# time to search: 5.72 seconds

# [(69872.09976807624, 1844),
#  (69876.11714312695, 7288),
#  (69876.55885202132, 2554),
#  (69939.38269673246, 9227),
#  (69939.697440009, 5948)]


# ##### Search with lshashing package:
# time to construct 2 lsh tables of 15 hash length: 9.10 seconds
# time to search in 10 buckets with radius 5: 0.37 seconds

#          distances        indices
# array([[69939.38269673,  9227.        ],
#        [70032.03792551,  9782.        ],
#        [70094.87232316,  6840.        ],
#        [70117.99188511,  8432.        ],
#        [70137.69944901,   331.        ]])


# ##### Now with Scikit Learn
# time to construct ball_tree: 40.51 seconds
# time to search: 0.49 seconds

# (array([[69872.09976808, 69876.11714313, 69876.55885202, 69939.38269673,
#         69939.69744001]]),
#  array([[1844, 7288, 2554, 9227, 5948]]))


# ##### With sklearn KDTree
# time to construct the tree: 51.69 seconds
# time to search: 0.65 seconds

# (array([[69872.09976808, 69876.11714313, 69876.55885202, 69939.38269673,
#         69939.69744001]]),
#  array([[1844, 7288, 2554, 9227, 5948]]))


# ##### basic scikit-learn
# time to fit dataset: 0.24 seconds
# time to search: 4.36 seconds

# (array([[69872.09976808, 69876.11714313, 69876.55885202, 69939.38269673,
#         69939.69744001]]),
#  array([[1844, 7288, 2554, 9227, 5948]]))

LSHashing performs the search a little bit slower than sklearn tree implementations, sometimes better but much faster to construct. However, the main advantage comes when we need to add new entry or remove from our data, i.e. updating the table. In sklearn trees this can be hard as we may need to reconstruct the trees all over again. It is clearly obvious that it takes much more time to construct the trees than creating the buckets with LSHashing. LSHashing also allows addition of new data easily and in no time.

Also as we can see the nearest neighbors returned by the LSHashing are not the exact neighbors, that's why it is called approximate nearest neighbor search. Of course, when dealing with reasonable amount of data it is better to go with normal nearest neighbor searching. However with very big data, this will be time consuming so it is more efficient to approach it differently.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lshashing-1.1.3.tar.gz (123.0 kB view details)

Uploaded Source

Built Distribution

lshashing-1.1.3-cp37-cp37m-win_amd64.whl (76.7 kB view details)

Uploaded CPython 3.7m Windows x86-64

File details

Details for the file lshashing-1.1.3.tar.gz.

File metadata

  • Download URL: lshashing-1.1.3.tar.gz
  • Upload date:
  • Size: 123.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1.post20200604 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.6

File hashes

Hashes for lshashing-1.1.3.tar.gz
Algorithm Hash digest
SHA256 2a80d3358ed1eafc93aa7814d00bb25204ee4b8e5dd1666d695f82c26fb1ee6d
MD5 faddae2f39965da4024d0547e42f071d
BLAKE2b-256 679e7c2264e6fc932c28c747b30c1db394e75baa734f9676fb103526c797182b

See more details on using hashes here.

File details

Details for the file lshashing-1.1.3-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: lshashing-1.1.3-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 76.7 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1.post20200604 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.6

File hashes

Hashes for lshashing-1.1.3-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 824dbcffa8174bbcfbc4d3d8fdd0ca66f7ca487f46c3d96014a6f599af21fe6f
MD5 d6b60a9f85ce410d39ebf921a8fe6303
BLAKE2b-256 b90ea5dc34a3e136a60f9bee06cd54ac2314a52b2c2fec3031c20a30da1d617d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page