deduplication

Remove duplicate documents via popular algorithms such as SimHash, SpotSig, Shingling, etc.

Install

Run the following commands:

# install the library from PyPI
pip install deduplication

# install the required pretrained spaCy models
python -m spacy download xx_ent_wiki_sm
python -m spacy download en_core_web_sm
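
To confirm that the installation worked, a small check along the following lines may help. This is only a sketch: the helper name `missing_packages` is hypothetical, and it assumes the spaCy models are importable under their model names, which is how `spacy download` normally installs them.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be located as importable top-level packages.

    Hypothetical helper for illustration; not part of the deduplication library.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# The library itself plus the two spaCy models from the install commands.
required = ["deduplication", "xx_ent_wiki_sm", "en_core_web_sm"]
print("missing:", missing_packages(required))
```

An empty list means all three packages were found in the current environment.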

Example

SimHash

from deduplication import simhash

hashvalue1 = simhash('this is text')
hashvalue2 = simhash('this is another text', n_block=4)
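
For readers unfamiliar with the technique, the general SimHash idea can be sketched as follows. This is a minimal illustration, not this library's implementation: the names `simhash_sketch` and `hamming_distance`, the whitespace tokenization, the use of MD5, and the 64-bit width are all assumptions made for the example, and the return type of the library's own `simhash` may differ.

```python
import hashlib


def simhash_sketch(text: str, bits: int = 64) -> int:
    """Illustrative SimHash over whitespace tokens (hypothetical helper).

    `bits` must be a multiple of 8 and at most 128, because the token hash
    is taken from the leading bytes of a 16-byte MD5 digest.
    """
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash_sketch("this is text")
b = simhash_sketch("this is another text")
print(hamming_distance(a, b))
```

Similar texts tend to produce fingerprints with a small Hamming distance, so near-duplicates are usually detected by comparing that distance against a threshold chosen for the data at hand.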

L-SimHash

from deduplication import lsimhash

hashvalue = lsimhash('this is very long article texts. maybe with a lot of sentences.')

Citation

SimHash

Sadowski C, Levin G. SimHash: Hash-based similarity detection. Technical report, Google, 2007.
