Locality Sensitive Hashing

Gaoya

About

This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.

  • 64-, 32-, 16-, and 8-bit MinHash
  • 64- and 128-bit SimHash
  • MinHash | SimHash
  • Powered by Rust
  • Multi-threaded
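
For readers new to MinHash, the idea behind the signatures can be sketched in a few lines of pure Python. This is only an illustration, not Gaoya's Rust implementation: the tokenizer (`word_set`), the seeded BLAKE2b hash, and the default of 128 hash functions are all assumptions made for the sketch.

```python
import hashlib
import re


def word_set(text):
    """Lowercased word tokens as a set (an assumed tokenizer, not Gaoya's)."""
    return set(re.findall(r"\w+", text.lower()))


def _hash64(token, seed):
    """64-bit hash of a token; the seed simulates a family of hash functions."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(words, num_hashes=128):
    """One minimum per hash function. Raises ValueError on an empty set."""
    return [min(_hash64(w, seed) for w in words) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets, for comparison."""
    return len(a & b) / len(a | b)


a = word_set("This is the first document.")
b = word_set("This document is the second document.")
print("exact:    ", exact_jaccard(a, b))
print("estimated:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate gets closer to the exact value as `num_hashes` grows, at the cost of larger signatures.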

>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32,
...                                          jaccard_threshold=0.5,
...                                          num_bands=42,
...                                          band_size=3,
...                                          num_hashes=42*3,
...                                          analyzer='word',
...                                          lowercase=True,
...                                          ngram_range=(1,1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>> for i, doc in enumerate(corpus):
...     index.insert_document(i, doc)
...
>>> index.query('This is the first document.')
[0, 1, 2, 3]
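
To see why the query returns documents 0 through 3 but not 4, it can help to compute the exact word-level Jaccard similarities and the standard LSH banding probability 1 - (1 - s^r)^b for b = 42 bands of r = 3 rows. The sketch below is pure Python and does not call Gaoya. The tokenizer is an assumption about what `analyzer='word'` with `lowercase=True` does, and the formula is the textbook banding analysis, not something read out of the library.

```python
import re


def word_set(text):
    """Lowercased word tokens as a set (an assumed tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def candidate_probability(s, num_bands, band_size):
    """Chance that two documents with similarity s share at least one band."""
    return 1.0 - (1.0 - s ** band_size) ** num_bands


corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third document.',
    'Is this the first document?',
    'This not the first nor the second nor the third, but the fourth document',
]
query = word_set('This is the first document.')

for i, doc in enumerate(corpus):
    s = jaccard(query, word_set(doc))
    p = candidate_probability(s, num_bands=42, band_size=3)
    print(f"doc {i}: jaccard={s:.3f}  P(shares a band)={p:.3f}")
```

Under these assumptions the exact similarities are 1, 2/3, 4/7, 1, and 4/11, so only document 4 falls below `jaccard_threshold=0.5`. Document 4 is still fairly likely to share a band with the query, which suggests the threshold is applied to candidates after the band lookup; that last point is an inference from the output, not a documented guarantee.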

Installation

$ pip3 install gaoya

Examples

Document Deduplication with Gaoya
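
The linked example is not reproduced here. As a rough sketch of the deduplication pattern, the loop below keeps a document only if the index has no similar document yet. It relies only on the two calls shown in the About section (`insert_document` and `query`). `ExactJaccardIndex` is a hypothetical brute-force stand-in, not part of Gaoya, included so the sketch runs without the library. It also assumes that `query` on an index with no match returns an empty list.

```python
import re


class ExactJaccardIndex:
    """Hypothetical stand-in with the same two methods, using exact Jaccard."""

    def __init__(self, jaccard_threshold=0.5):
        self.jaccard_threshold = jaccard_threshold
        self._docs = {}

    @staticmethod
    def _words(text):
        # Assumed tokenizer: lowercased word tokens.
        return set(re.findall(r"\w+", text.lower()))

    def insert_document(self, doc_id, text):
        self._docs[doc_id] = self._words(text)

    def query(self, text):
        words = self._words(text)
        hits = []
        for doc_id, other in self._docs.items():
            union = words | other
            if union and len(words & other) / len(union) >= self.jaccard_threshold:
                hits.append(doc_id)
        return hits


def deduplicate(docs, index):
    """Return ids of documents kept: the first of each near-duplicate group."""
    kept = []
    for doc_id, text in enumerate(docs):
        if index.query(text):
            continue  # a similar document is already in the index
        index.insert_document(doc_id, text)
        kept.append(doc_id)
    return kept


corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third document.',
    'Is this the first document?',
    'This not the first nor the second nor the third, but the fourth document',
]
print(deduplicate(corpus, ExactJaccardIndex(jaccard_threshold=0.5)))
```

Because `deduplicate` only needs `query` and `insert_document`, an index built as in the About section could be passed in place of the stand-in, provided its `query` behaves as assumed.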



