Locality Sensitive Hashing
Gaoya
About
This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.
- 64-, 32-, 16-, and 8-bit MinHash
- 64- and 128-bit SimHash
- MinHash | SimHash
- Powered by Rust
- Multi-threaded
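To make the MinHash idea concrete, here is a minimal pure-Python sketch of how a signature can estimate Jaccard similarity between two token sets. It is not Gaoya's implementation: the helper names (`tokenize`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and seeded BLAKE2b from the standard library is an assumption chosen only so that the example is deterministic.

```python
import hashlib


def tokenize(text):
    """Split into lowercase word tokens, stripping basic punctuation (a simplification)."""
    return {w.strip('.,?!') for w in text.lower().split()} - {''}


def _hash(token, seed):
    """64-bit hash of a token under a given seed (illustrative choice of hash)."""
    salt = seed.to_bytes(8, 'little')
    digest = hashlib.blake2b(token.encode('utf-8'), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, 'little')


def minhash_signature(tokens, num_hashes=126):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = tokenize('This is the first document.')
doc_b = tokenize('And this is the third document.')
exact = jaccard(doc_a, doc_b)
approx = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f'exact={exact:.3f} estimated={approx:.3f}')
```

More hash functions give a tighter estimate at the cost of a longer signature; the smaller hash sizes listed above trade some accuracy for memory.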
>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32,
...                                          jaccard_threshold=0.5,
...                                          num_bands=42,
...                                          band_size=3,
...                                          num_hashes=42*3,
...                                          analyzer='word',
...                                          lowercase=True,
...                                          ngram_range=(1,1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>>
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
...
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>>
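The `num_bands` and `band_size` values in the example above control how likely two documents are to collide in at least one band. The sketch below applies the standard banding analysis from Chapter 3 of Mining of Massive Datasets: with b bands of r hashes each, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. The function names are hypothetical, and how Gaoya combines this with `jaccard_threshold` is not stated here, so treat this as background rather than a description of the library's internals.

```python
def candidate_probability(s, num_bands=42, band_size=3):
    """Probability that two sets with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** band_size) ** num_bands


def approximate_threshold(num_bands=42, band_size=3):
    """Similarity near which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / band_size)


# Print the S-curve for the parameters used in the example above.
for s in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    print(f's={s:.1f} -> P(candidate)={candidate_probability(s):.4f}')
print(f'approximate threshold: {approximate_threshold():.3f}')
```

Raising `band_size` makes each band harder to match and pushes the curve toward higher similarities; raising `num_bands` gives pairs more chances to collide and pushes it the other way.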
Installation
$ pip3 install gaoya
Examples
Document Deduplication with Gaoya
References
[1] Chapter 3, Mining of Massive Datasets
[2] Similarity Estimation Techniques from Rounding Algorithms