PyMinHash

MinHashing is an efficient way of finding similar records in a dataset: it approximates the Jaccard similarity between records without comparing every pair in full. PyMinHash implements minhashing for Pandas dataframes. See the instructions below or look at the example notebook to get started.
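To make the idea concrete, here is a minimal standard-library sketch of Jaccard similarity and a MinHash estimate of it. It is not PyMinHash's implementation: the helper names (`tokens`, `jaccard`, `signature`, `estimate`), the whitespace tokenization, and the seeded BLAKE2b hash are all assumptions chosen for illustration.

```python
import hashlib


def tokens(text):
    """Split a record into a set of lowercased, whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Exact Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def _hash(token, seed):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(token_set, n_hashes=128):
    """MinHash signature: the smallest hash value under each seeded hash function.

    Assumes token_set is non-empty.
    """
    return [min(_hash(t, seed) for t in token_set) for seed in range(n_hashes)]


def estimate(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical records: 2 shared words out of 6 distinct words in total.
a = tokens("frits hermans data science")
b = tokens("frits hermans machine learning")

exact = jaccard(a, b)
approx = estimate(signature(a), signature(b))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

Each signature position matches with probability equal to the Jaccard similarity, so the fraction of matches converges to the exact value as the number of hash functions grows. Comparing short signatures instead of full token sets is what makes the approach cheap on large datasets.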

Developed by Frits Hermans

Documentation

Documentation can be found here

Installation

Normal installation

Install directly from PyPI:

pip install pyminhash

Install to contribute

Clone this GitHub repo and install in editable mode:

python -m pip install -e ".[dev]"
python setup.py develop

Usage

Apply record matching to your Pandas dataframe df as follows:

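from pyminhash import MinHash  # import path assumed from the package name
# Hash the 'name' column of df using 10 hash tables
myHasher = MinHash(n_hash_tables=10)
myHasher.fit_predict(df, 'name')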

This will return the row pairs from df that have non-zero Jaccard similarity.
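For intuition about what "row pairs with non-zero Jaccard similarity" means, the brute-force reference below lists every pair of records that shares at least one word. It is only an illustration: the function name `nonzero_jaccard_pairs`, the whitespace tokenization, the sample names, and the `(i, j, similarity)` output shape are assumptions, not PyMinHash's actual output format.

```python
from itertools import combinations


def nonzero_jaccard_pairs(records):
    """Return (i, j, similarity) for every pair of records sharing at least one word."""
    token_sets = [set(r.lower().split()) for r in records]
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        shared = token_sets[i] & token_sets[j]
        if shared:  # a non-empty intersection means non-zero Jaccard similarity
            union = token_sets[i] | token_sets[j]
            pairs.append((i, j, len(shared) / len(union)))
    return pairs


# Hypothetical records standing in for a 'name' column.
names = ["Frits Hermans", "F. Hermans", "Jane Doe"]
result = nonzero_jaccard_pairs(names)
print(result)
```

This reference compares all pairs, so its cost grows quadratically with the number of records. Minhashing avoids that by only comparing records whose hashes collide.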

Release history

Version	Release date
0.1.5	Jan 6, 2023
0.1.4	Feb 28, 2022
0.1.3	Jan 30, 2022
0.1.2	Jan 27, 2022
0.1.1	Jan 14, 2022
0.1 (this version)	Jan 1, 2022
0.0.1	Jan 1, 2022

Download files

Source Distribution

PyMinHash-0.1.tar.gz (16.8 kB), uploaded Jan 1, 2022

Built Distribution

PyMinHash-0.1-py3-none-any.whl (16.3 kB), uploaded Jan 1, 2022, Python 3

Hashes for PyMinHash-0.1.tar.gz
Algorithm	Hash digest
SHA256	`8f8b5ae5637b5b571157a78151c6a2a0deae00fddd3aca22ada4479e0aafaa8f`
MD5	`f67dbde440d15d0c3c61bf207020f38d`
BLAKE2b-256	`2071cfab01d75fce0fcec513034a0e433934839e64559f5ab3a6244d5a2b9641`

Hashes for PyMinHash-0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3306244bad6ca478a5292878e1b2a3022c98fcd55e66a8eb13e79ac93b8db3fd`
MD5	`01b87b073b665e46ffc4c9ccec2af1a3`
BLAKE2b-256	`011480da3a1b7b3e230517b0a5f9267dd268b92b7ebce78942d5a4100b427f4b`