Remove duplicates and near-duplicates from text corpora, no matter the scale.

NLPDedup


Developers:

Installation

The package is available on PyPI, so you can install it with your favourite package manager, for instance pip install nlp_dedup or poetry add nlp_dedup.

Quick start

If the corpus is stored as corpus.txt (both txt and jsonl files are supported), the following command deduplicates the corpus and stores the deduplicated corpus in the folder deduplicated:

$ dedup corpus.txt deduplicated

By default, deduplication is based on blocks of 13 consecutive words, and two documents are considered near-duplicates if they have more than 80% of these blocks in common. All of these settings can be changed to suit your needs.
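
To make the default criterion concrete, here is a minimal Python sketch of the idea. It is not the package's implementation: the function names are hypothetical, it compares exact sets of word blocks rather than using any scalable approximation, and it assumes that "80% of these blocks in common" means a Jaccard similarity above 0.8.

```python
from __future__ import annotations


def word_blocks(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of all blocks of n consecutive words in the text."""
    words = text.split()
    if len(words) < n:
        # Assumption: a document shorter than n words forms a single block.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of blocks in common: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(
    doc_a: str, doc_b: str, n: int = 13, threshold: float = 0.8
) -> bool:
    """Hypothetical helper: True if the two documents share more than
    `threshold` of their n-word blocks (Jaccard similarity, an assumption)."""
    return jaccard(word_blocks(doc_a, n), word_blocks(doc_b, n)) > threshold


if __name__ == "__main__":
    original = " ".join(f"word{i}" for i in range(100))
    lightly_edited = original + " extra"
    unrelated = " ".join(f"other{i}" for i in range(100))
    print(is_near_duplicate(original, lightly_edited))
    print(is_near_duplicate(original, unrelated))
```

Comparing every pair of documents this way scales quadratically with corpus size, so the sketch only illustrates the criterion, not how to apply it to a large corpus.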

See $ dedup --help for more information on all the settings.
