Locality Sensitive Hashing

Gaoya

About

This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.

  • 64-, 32-, 16-, and 8-bit MinHash
  • 64- and 128-bit SimHash
  • MinHash | SimHash
  • Powered by Rust
  • Multi-threaded
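
To make the MinHash idea concrete, here is a minimal pure-Python sketch of how a MinHash signature estimates Jaccard similarity. It is not Gaoya's implementation: the helper names (`word_tokens`, `minhash_signature`, `estimate_jaccard`), the whitespace tokenizer, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def word_tokens(text, lowercase=True):
    """Split text on whitespace into a set of word tokens (punctuation is kept)."""
    if lowercase:
        text = text.lower()
    return set(text.split())


def minhash_signature(tokens, num_hashes=126):
    """Keep the minimum 64-bit hash of the tokens under each of num_hashes salted hashes.

    tokens must be non-empty, otherwise min() raises ValueError.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in tokens
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison with the estimate."""
    return len(a & b) / len(a | b)


a = word_tokens("This is the first document.")
b = word_tokens("Is this the first document?")
print("exact:   ", exact_jaccard(a, b))
print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate approaches the exact value as `num_hashes` grows. A library such as Gaoya would do this work in compiled code and with narrower hash values; the 8- to 64-bit options listed above presumably trade accuracy for memory.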

>>> import gaoya
>>> # Here num_hashes equals num_bands * band_size (42 bands of 3 hashes each).
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32,
...                                          jaccard_threshold=0.5,
...                                          num_bands=42,
...                                          band_size=3,
...                                          num_hashes=42*3,
...                                          analyzer='word',
...                                          lowercase=True,
...                                          ngram_range=(1,1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>> # Insert each document under its position in the corpus as its id.
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
...
>>> index.query('This is the first document.')
[0, 1, 2, 3]
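
The `num_bands` and `band_size` arguments above correspond to the standard LSH banding scheme described in Chapter 3 of Mining of Massive Datasets [1]. The sketch below computes the textbook probability that two documents with a given Jaccard similarity share at least one band. The function names are illustrative and not part of Gaoya, and how Gaoya combines banding with `jaccard_threshold` internally is not shown here.

```python
def candidate_probability(similarity, num_bands, band_size):
    """Probability that two signatures agree on every row of at least one band.

    With b bands of r rows each, this is 1 - (1 - s**r)**b.
    """
    return 1.0 - (1.0 - similarity ** band_size) ** num_bands


def approximate_threshold(num_bands, band_size):
    """Similarity at which the S-curve rises most steeply, roughly (1/b)**(1/r)."""
    return (1.0 / num_bands) ** (1.0 / band_size)


# Parameters taken from the example above: 42 bands of 3 hashes each.
for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s, 42, 3):.4f}")
print(f"approximate threshold: {approximate_threshold(42, 3):.3f}")
```

More bands with fewer rows per band make the index more permissive (higher recall, more false candidates). Fewer bands with more rows per band make it stricter.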

Installation

$ pip3 install gaoya

Examples

Document Deduplication with Gaoya
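
As a rough illustration of the deduplication use case, the sketch below keeps a document only if the index has no match for it yet. It relies only on the two methods shown in the example above, `insert_document(id, text)` and `query(text)`. The `deduplicate` helper and the `ExactMatchIndex` stand-in are hypothetical and are not taken from Gaoya or from the linked example. The stand-in exists only so the snippet runs without Gaoya installed.

```python
def deduplicate(index, docs):
    """Return the ids of documents to keep, skipping any the index already matches."""
    kept = []
    for doc_id, text in enumerate(docs):
        if index.query(text):  # non-empty result: a similar document is already indexed
            continue
        index.insert_document(doc_id, text)
        kept.append(doc_id)
    return kept


class ExactMatchIndex:
    """Hypothetical stand-in with the same two methods; it matches only identical lowercased text."""

    def __init__(self):
        self._ids_by_text = {}

    def insert_document(self, doc_id, text):
        self._ids_by_text.setdefault(text.lower(), []).append(doc_id)

    def query(self, text):
        return list(self._ids_by_text.get(text.lower(), []))


docs = ["This is the first document.", "this is the first document.", "Something else."]
print(deduplicate(ExactMatchIndex(), docs))
```

With Gaoya installed, an index built as in the example above could presumably be passed in place of `ExactMatchIndex`, so that near-duplicates are skipped as well as exact ones.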

References

[1] Chapter 3, Mining of Massive Datasets

[2] Similarity Estimation Techniques from Rounding Algorithms

[3] Detecting Near-Duplicates for Web Crawling
