Locality Sensitive Hashing

Gaoya

About

This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.
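To make the idea concrete, here is a minimal pure-Python sketch of MinHash similarity estimation. It is not Gaoya's implementation: the helper names (`tokenize`, `minhash_signature`, `estimate_jaccard`, `exact_jaccard`), the `\w+` word tokenizer, and the use of salted BLAKE2b as the hash family are all assumptions chosen for illustration.

```python
import hashlib
import re


def tokenize(text):
    """Lowercased word tokens (an assumed stand-in for a 'word' analyzer)."""
    return set(re.findall(r"\w+", text.lower()))


def _seeded_hash(token, seed):
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=128):
    """For every hash function keep the smallest hash value over the token set.

    Assumes `tokens` is non-empty.
    """
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    a = tokenize("This is the first document.")
    b = tokenize("This document is the second document.")
    print("exact:   ", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate converges on the exact value as `num_hashes` grows. An index then only needs to compare short signatures rather than full token sets.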

  • MinHash | SimHash
  • Powered by Rust
  • Multi-threaded

>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(
...     hash_size=32,
...     jaccard_threshold=0.5,
...     num_bands=42,
...     band_size=3,
...     num_hashes=42 * 3,
...     analyzer='word',
...     lowercase=True,
...     ngram_range=(1, 1),
... )
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>> 
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
... 
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>> 
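The `num_bands` and `band_size` arguments above control how readily two documents become candidate matches. The sketch below assumes Gaoya uses the standard banding scheme described in Chapter 3 of Mining of Massive Datasets, which this README does not state explicitly. Under that scheme, a pair with Jaccard similarity s agrees on at least one of b bands of r rows with probability 1 - (1 - s^r)^b. The function names are hypothetical.

```python
def candidate_probability(similarity, num_bands, band_size):
    """Probability that two signatures agree on every row of at least one band."""
    return 1.0 - (1.0 - similarity ** band_size) ** num_bands


def approximate_threshold(num_bands, band_size):
    """Similarity near which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / band_size)


if __name__ == "__main__":
    # Parameters taken from the example above: 42 bands of 3 rows each.
    for s in (0.1, 0.3, 0.5, 0.7):
        p = candidate_probability(s, num_bands=42, band_size=3)
        print(f"similarity={s:.1f}  P(candidate)={p:.3f}")
    print("approximate threshold:", round(approximate_threshold(42, 3), 3))
```

With 42 bands of 3 rows, the steep part of the curve sits near a similarity of 0.29, and a pair at similarity 0.5 becomes a candidate with probability above 0.99. More rows per band push the curve to the right; more bands push it to the left.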

Installation

$ pip3 install gaoya

Examples

Document Deduplication with Gaoya
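The linked example is not reproduced here. As a rough sketch of what index-based deduplication can look like, the loop below keeps a document only if the index returns no match for it. It relies solely on the two calls shown earlier, `insert_document` and `query`. `ToyJaccardIndex` is a hypothetical brute-force stand-in so that the snippet runs without Gaoya installed; its tokenizer and exact-Jaccard comparison are assumptions, not Gaoya's behavior.

```python
import re


class ToyJaccardIndex:
    """Hypothetical stand-in for an LSH index: brute-force exact Jaccard matching."""

    def __init__(self, jaccard_threshold=0.5):
        self.jaccard_threshold = jaccard_threshold
        self._docs = {}

    @staticmethod
    def _tokens(text):
        return set(re.findall(r"\w+", text.lower()))

    def insert_document(self, doc_id, text):
        self._docs[doc_id] = self._tokens(text)

    def query(self, text):
        """Return ids of stored documents whose similarity meets the threshold."""
        query_tokens = self._tokens(text)
        matches = []
        for doc_id, tokens in self._docs.items():
            union = tokens | query_tokens
            if union and len(tokens & query_tokens) / len(union) >= self.jaccard_threshold:
                matches.append(doc_id)
        return matches


def deduplicate(corpus, index):
    """Greedy pass: keep a document only if nothing similar is already indexed."""
    kept = []
    for doc_id, text in enumerate(corpus):
        if not index.query(text):
            index.insert_document(doc_id, text)
            kept.append(doc_id)
    return kept


if __name__ == "__main__":
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third document.',
        'Is this the first document?',
        'This not the first nor the second nor the third, but the fourth document',
    ]
    print(deduplicate(corpus, ToyJaccardIndex(jaccard_threshold=0.5)))  # [0, 4]
```

Because the pass is greedy, the result depends on document order: the first member of each near-duplicate group is the one that survives. An approximate index should yield similar results on larger corpora, with occasional misses near the threshold.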

References

[1] Chapter 3, Mining of Massive Datasets

[2] Similarity Estimation Techniques from Rounding Algorithms

[3] Detecting Near-Duplicates for Web Crawling
