Locality Sensitive Hashing

Project description

Gaoya

About

This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.
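
To make the idea concrete, here is a minimal pure-Python sketch of how a MinHash signature estimates the Jaccard similarity of two documents. It is not Gaoya's implementation: the helper names (`tokens`, `minhash_signature`, `estimate_jaccard`) and the use of seeded BLAKE2b as the hash family are assumptions made for illustration only.

```python
import hashlib


def tokens(text):
    """Split on whitespace, strip basic punctuation, lowercase (illustrative tokenizer)."""
    return {w.strip('.,?!').lower() for w in text.split()} - {''}


def _hash(token, seed):
    """64-bit hash of a token; the seed selects one member of the hash family."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(token_set, num_hashes=128):
    """Keep the minimum hash value per seed. Assumes token_set is non-empty."""
    return [min(_hash(t, seed) for t in token_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree approximates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


a = tokens('This is the first document.')
b = tokens('Is this the first document?')
c = tokens('And this is the third document.')

sig_a, sig_b, sig_c = (minhash_signature(s) for s in (a, b, c))

print(exact_jaccard(a, b), estimate_jaccard(sig_a, sig_b))  # identical token sets
print(exact_jaccard(a, c), estimate_jaccard(sig_a, sig_c))  # partial overlap
```

The estimate converges on the exact value as `num_hashes` grows; a library such as Gaoya would add an index on top of the signatures so that similar documents can be found without comparing every pair.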

  • 64-, 32-, 16-, and 8-bit MinHash
  • 64- and 128-bit SimHash
  • MinHash | SimHash
  • Powered by Rust
  • Multi-threaded

>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32,
...                                          jaccard_threshold=0.5,
...                                          num_bands=42,
...                                          band_size=3,
...                                          num_hashes=42*3,
...                                          analyzer='word',
...                                          lowercase=True,
...                                          ngram_range=(1,1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>> for i, doc in enumerate(corpus):
...     index.insert_document(i, doc)
...
>>> index.query('This is the first document.')
[0, 1, 2, 3]
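
The `num_bands` and `band_size` values in the example follow the standard LSH banding scheme described in Chapter 3 of Mining of Massive Datasets: a signature of `num_bands * band_size` hashes is split into bands, and two documents become candidates if they agree on every hash in at least one band. The sketch below computes that candidate probability from the textbook formula. The function names are hypothetical and not part of Gaoya's API, and how Gaoya combines this with `jaccard_threshold` is not covered here.

```python
def candidate_probability(similarity, num_bands, band_size):
    """Probability that two documents with the given Jaccard similarity
    share at least one identical band: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity ** band_size) ** num_bands


def approximate_threshold(num_bands, band_size):
    """Similarity at which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / band_size)


# Parameters taken from the example above: 42 bands of 3 hashes each.
NUM_BANDS, BAND_SIZE = 42, 3

for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    p = candidate_probability(s, NUM_BANDS, BAND_SIZE)
    print(f"similarity={s:.1f} -> candidate probability={p:.4f}")

print(f"approximate threshold: {approximate_threshold(NUM_BANDS, BAND_SIZE):.4f}")
```

More bands with fewer hashes per band shift the curve left (more candidates, higher recall); fewer bands with more hashes per band shift it right (fewer candidates, higher precision).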

Installation

$ pip3 install gaoya

Examples

Document Deduplication with Gaoya

References

[1] Chapter 3, Mining of Massive Datasets

[2] Similarity Estimation Techniques from Rounding Algorithms

[3] Detecting Near-Duplicates for Web Crawling

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

gaoya-0.2.0-cp37-abi3-win_amd64.whl (485.7 kB view details)

Uploaded CPython 3.7+ Windows x86-64

gaoya-0.2.0-cp37-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (998.9 kB view details)

Uploaded CPython 3.7+ macOS 10.9+ universal2 (ARM64, x86-64) macOS 10.9+ x86-64 macOS 11.0+ ARM64

File details

Details for the file gaoya-0.2.0-cp37-abi3-win_amd64.whl.

File metadata

  • Download URL: gaoya-0.2.0-cp37-abi3-win_amd64.whl
  • Upload date:
  • Size: 485.7 kB
  • Tags: CPython 3.7+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/0.15.1

File hashes

Hashes for gaoya-0.2.0-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 f36b192a2cd7af7abe01b48b937048aa33d8c1b85e1b12708f9227673d171408
MD5 b4b50c4942eb507ed12399d9041f185d
BLAKE2b-256 3d1e63a76bee30b4cdbde44272ba728c2d59b86cab7a2cf2cf2cd08a5a30bbf4

File details

Details for the file gaoya-0.2.0-cp37-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl.

File hashes

Hashes for gaoya-0.2.0-cp37-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl
Algorithm Hash digest
SHA256 c691cb78d90f0957a36429de420dacd3bd35b3d91af5dd8b662e26b3c974cb30
MD5 d68b04e29b8ee0c53c5583884a885c1c
BLAKE2b-256 c306712d000f2c599e734d9fa3c7a15cc165e7c90724170b38da170ec68cc5b4
