NLPDedup

Remove duplicates and near-duplicates from text corpora, no matter the scale.


Installation

The package is available on PyPI, so you can install it with your favourite package manager, for instance pip install nlp_dedup or poetry add nlp_dedup.

Quick Start

If the corpus is stored as corpus.txt (both txt and jsonl files are supported), the following deduplicates the corpus and stores the deduplicated corpus in the folder deduplicated:

$ dedup corpus.txt deduplicated

By default, deduplication is based on blocks of 13 consecutive words, and two documents are considered near-duplicates if they have more than 80% of these blocks in common. All of this can be changed to suit your needs; see $ dedup --help for more information on all the settings.
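
To make the default criterion concrete, here is a minimal sketch of comparing two documents by their shared blocks of 13 consecutive words. It is an illustration only, not the package's implementation: the function names word_shingles, shingle_overlap and is_near_duplicate are hypothetical, and measuring "blocks in common" as shared blocks divided by all distinct blocks across both documents (Jaccard similarity) is an assumption.

```python
from __future__ import annotations


def word_shingles(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of blocks of n consecutive whitespace-separated words."""
    words = text.split()
    if not words:
        return set()
    if len(words) < n:
        # Documents shorter than one block are treated as a single block.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shingle_overlap(a: str, b: str, n: int = 13) -> float:
    """Share of blocks the two documents have in common (Jaccard similarity)."""
    blocks_a, blocks_b = word_shingles(a, n), word_shingles(b, n)
    if not blocks_a or not blocks_b:
        return 0.0
    return len(blocks_a & blocks_b) / len(blocks_a | blocks_b)


def is_near_duplicate(a: str, b: str, n: int = 13, threshold: float = 0.8) -> bool:
    """Apply the 'more than 80% of blocks in common' rule (assumed semantics)."""
    return shingle_overlap(a, b, n) > threshold
```

Comparing every pair of documents this way scales quadratically with corpus size, so the sketch only shows what the two settings mean; it is not a way to process a large corpus.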

Deduplication can also be done directly from Python:

>>> from nlp_dedup import Deduper
>>> deduper = Deduper()
>>> corpus = ["Test", "Another test", "Test"]
>>> deduper.deduplicate(corpus=corpus)

Here corpus does not have to be a list; it can be any iterable or generator of strings, which is useful if the corpus is too big to be stored in memory. Dictionaries are also supported instead of strings, in which case the text entry of each dictionary is used (change this with the text_column argument when calling deduplicate).
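
As a sketch of the streaming case, the following feeds a JSON Lines file to the deduper one record at a time, so the whole corpus never sits in memory. The helpers iter_jsonl and deduplicate_jsonl are hypothetical names introduced for this example. Only Deduper, deduplicate, corpus and text_column come from the description above; where the result is written, and any other arguments, are not shown here and should be checked in the documentation.

```python
import json
from pathlib import Path
from typing import Iterator, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[dict]:
    """Lazily yield one dictionary per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def deduplicate_jsonl(path: Union[str, Path], text_column: str = "text") -> None:
    """Stream the records of a JSONL file into the deduper (hypothetical helper)."""
    # Imported here so that iter_jsonl stays usable without nlp_dedup installed.
    from nlp_dedup import Deduper

    deduper = Deduper()
    # The generator is consumed lazily; text_column names the dictionary entry
    # holding the document text, as described above.
    deduper.deduplicate(corpus=iter_jsonl(path), text_column=text_column)
```

For a file whose documents are stored under a different key, a call such as deduplicate_jsonl("corpus.jsonl", text_column="body") would select that entry instead; the file name and the "body" key are made up for the example.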

See more in the documentation.
