Efficient MinHashing

PyMinHash

MinHashing is a very efficient way of finding similar records in a dataset based on Jaccard similarity. PyMinHash implements efficient minhashing for Pandas dataframes. See instructions below or look at the example notebook to get started.
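To make the idea concrete, here is a minimal pure-Python sketch of how minhash signatures approximate Jaccard similarity. It is not PyMinHash's implementation: the character 3-gram tokenisation, the seeded BLAKE2b hash, and all function names below are assumptions chosen for illustration only.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def seeded_hash(token, seed):
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, n_hashes=256):
    """Keep the minimum hash value of the token set under each hash function."""
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(n_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = shingles("Frits Hermans")
b = shingles("Fritz Hermans")
print("exact:   ", jaccard(a, b))
print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The signatures have a fixed length regardless of how long the strings are, which is what makes comparing many records cheap. The estimate gets closer to the exact value as the number of hash functions grows.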

Developed by Frits Hermans

Documentation

Documentation can be found here

Installation

Normal installation

Using PyPI

pip install pyminhash

Using conda

conda install -c conda-forge pyminhash

Install to contribute

Clone this GitHub repo and install it in editable mode:

python -m pip install -e ".[dev]"
python setup.py develop

Usage

Apply record matching to the 'name' column of your Pandas dataframe df as follows:

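from pyminhash import MinHash

# Create a hasher with 10 hash tables and match records on the 'name' column
myHasher = MinHash(n_hash_tables=10)
myHasher.fit_predict(df, 'name')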

This will return the row pairs from df that have non-zero Jaccard similarity.
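For intuition about what "row pairs with non-zero Jaccard similarity" means, the sketch below computes them by brute force on a tiny list of names. It is only a reference for the concept, not how PyMinHash works internally: the whitespace word tokenisation, the function names, and the sample records are all assumptions made for this illustration, and PyMinHash's own tokenisation and output format may differ.

```python
from itertools import combinations


def word_set(text):
    """Split a lowercased string on whitespace into a set of words (assumed tokenisation)."""
    return set(text.lower().split())


def nonzero_jaccard_pairs(records):
    """Return (row_i, row_j, similarity) for every pair of rows sharing at least one word."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        set_a, set_b = word_set(a), word_set(b)
        union = set_a | set_b
        if not union:
            continue  # both records are empty, similarity is undefined
        similarity = len(set_a & set_b) / len(union)
        if similarity > 0:
            pairs.append((i, j, similarity))
    return pairs


# Hypothetical sample data
names = ["acme corp", "acme corporation", "globex inc"]
for row_i, row_j, similarity in nonzero_jaccard_pairs(names):
    print(row_i, row_j, round(similarity, 3))
```

This brute-force approach compares every pair of rows, so its cost grows quadratically with the number of records. Minhashing avoids that by only comparing records whose signatures collide.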

Source Distribution

PyMinHash-0.1.5.tar.gz (16.9 kB)

Uploaded Source

Built Distribution

PyMinHash-0.1.5-py3-none-any.whl (16.9 kB)

Uploaded Python 3

File details

Details for the file PyMinHash-0.1.5.tar.gz.

File metadata

  • Download URL: PyMinHash-0.1.5.tar.gz
  • Upload date:
  • Size: 16.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.11.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.11

File hashes

Hashes for PyMinHash-0.1.5.tar.gz
Algorithm Hash digest
SHA256 16a9ab842811c6c53d9153ef402e401b853e55808f267f769d2154117e6ec94f
MD5 d1fda4ffd6dc858ea3c1ae1bd4bb1258
BLAKE2b-256 15c5a268e236817ba8f7b51b48fe79e2a5dbfba0afb4d548742ee1dd54c8ce53

File details

Details for the file PyMinHash-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: PyMinHash-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 16.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.11.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.11

File hashes

Hashes for PyMinHash-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 81eb397ef996fb1f273682f7c5dd59e0abe0d0e6353b5c232ae4a0ea04c1ff13
MD5 56048bbbb16fa50745ed39675279b38f
BLAKE2b-256 07d24f14214f87ae904930d1bef6f24f5a31dcb6983e98f48966678d2c784bec
