deduplication

Remove duplicate documents via popular algorithms such as SimHash, SpotSig, Shingling, etc.

Install

Run the following commands:

# install current library
pip install deduplication

# install required pretrained NLP models 
python -m spacy download xx_ent_wiki_sm
python -m spacy download en_core_web_sm

Examples

SimHash

from deduplication import simhash

hashvalue1 = simhash('this is text')
hashvalue2 = simhash('this is another text', n_block=4)
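
For readers unfamiliar with the technique, the following is a minimal, self-contained sketch of how a SimHash fingerprint can be computed and compared. It is not this library's implementation: the names `sketch_simhash` and `hamming_distance`, the whitespace tokenizer, the MD5-based token hash, and the 64-bit width are all assumptions made for illustration, and it does not model the `n_block` parameter shown above.

```python
import hashlib
from collections import Counter


def sketch_simhash(text: str, n_bits: int = 64) -> int:
    """Illustrative SimHash over whitespace tokens (hypothetical helper,
    not the code used by the deduplication package). n_bits must be <= 128."""
    weights = Counter(text.lower().split())  # token -> frequency used as weight
    totals = [0] * n_bits
    for token, weight in weights.items():
        # Stable per-token hash: the top n_bits of an MD5 digest.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest, "big") >> (128 - n_bits)
        for bit in range(n_bits):
            if (token_hash >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    # Bit i of the fingerprint is 1 when the weighted vote for bit i is positive.
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp1 = sketch_simhash("this is text")
fp2 = sketch_simhash("this is another text")
print(hamming_distance(fp1, fp2))
```

Texts that share most of their tokens tend to produce fingerprints with a small Hamming distance, so near-duplicates are typically flagged by comparing that distance against a threshold chosen for the corpus.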

L-SimHash

from deduplication import lsimhash

hashvalue = lsimhash('this is very long article texts. maybe with a lot of sentences.')

Citation

SimHash

Sadowski C, Levin G. Simhash: Hash-based similarity detection. Technical report, Google, 2007.
