NLPDedup

Remove duplicates and near-duplicates from text corpora, no matter the scale.


Installation

The package is available on PyPI, so you can install it with your favourite package manager, for instance pip install nlp_dedup or poetry add nlp_dedup.

Quick Start

If the corpus is stored as corpus.txt (both txt and jsonl files are supported), the following command deduplicates it and stores the deduplicated corpus in the folder deduplicated:

$ dedup corpus.txt deduplicated

By default, deduplication is based on blocks of 13 consecutive words, and two documents are considered near-duplicates if they have more than 80% of these blocks in common. All of this can be changed to suit your needs; see $ dedup --help for more information on the available settings.
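To make the default criterion concrete, here is a minimal sketch of the idea in plain Python. It is not the package's implementation: the function names are hypothetical, and measuring "blocks in common" as the Jaccard similarity of the two block sets is an assumption made for illustration. A tool aimed at large corpora would most likely use an approximate scheme rather than this exact pairwise comparison.

```python
from __future__ import annotations


def word_blocks(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of blocks of n consecutive words in text."""
    words = text.split()
    if len(words) < n:
        # Assumption: a document shorter than n words forms a single block.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def block_overlap(a: str, b: str, n: int = 13) -> float:
    """Share of blocks the two documents have in common (Jaccard similarity)."""
    blocks_a, blocks_b = word_blocks(a, n), word_blocks(b, n)
    if not blocks_a and not blocks_b:
        return 1.0  # two empty documents are treated as identical
    return len(blocks_a & blocks_b) / len(blocks_a | blocks_b)


def is_near_duplicate(a: str, b: str, n: int = 13, threshold: float = 0.8) -> bool:
    """True if more than `threshold` of the blocks are shared."""
    return block_overlap(a, b, n) > threshold
```

With smaller blocks, for example n=2, two six-word documents that differ only in their last word share four of six distinct blocks, which falls below the 80% threshold.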

Deduplication can also be done directly from Python:

>>> from nlp_dedup import Deduper
>>> deduper = Deduper()
>>> corpus = ["Test", "Another test", "Test"]
>>> deduper.deduplicate(corpus=corpus)

Here corpus does not have to be a list: it can be any iterable or generator of strings, which is useful if the corpus is too big to fit in memory. Dictionaries are also supported in place of strings, in which case the text entry of each dictionary is used (change this with the text_column argument when calling deduplicate).
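As an illustration of the streaming case, the sketch below reads a JSON Lines file lazily and yields one dictionary per line, so the whole corpus never has to sit in memory. The helper name iter_jsonl and the file layout (one JSON object per line) are assumptions made for this example, not part of the package.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one dictionary per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Such a generator could then be passed as the corpus argument, for example deduper.deduplicate(corpus=iter_jsonl("corpus.jsonl")), with text_column set if the text is stored under a key other than text.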

See more in the documentation.
