NLPDedup
Remove duplicates and near-duplicates from text corpora, no matter the scale.
Developers:
- Dan Saattrup Nielsen (dan.nielsen@alexandra.dk)
- Kenneth Enevoldsen (kennethcenevoldsen@gmail.com)
Installation
The package is available on PyPI, so you can install it using your favourite
package manager. For instance, `pip install nlp_dedup` or `poetry add nlp_dedup`.
Quick start
If the corpus is stored as `corpus.txt` (both `txt` and `jsonl` files are supported),
the following deduplicates the corpus and stores the deduplicated corpus in the
folder `deduplicated`:
$ dedup corpus.txt deduplicated
By default, deduplication is based on blocks of 13 consecutive words, and two documents are considered near-duplicates if they have more than 80% of these blocks in common. All of these settings can be changed to suit your needs.
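To make the default criterion concrete, here is a minimal sketch of comparing two documents by their blocks of 13 consecutive words. It is an illustration only and not the package's implementation: the function names `word_ngrams`, `block_overlap` and `is_near_duplicate` are hypothetical, splitting on whitespace is an assumption, and reading "blocks in common" as Jaccard similarity (shared blocks divided by all distinct blocks) is one possible interpretation.

```python
def word_ngrams(text: str, n: int = 13) -> set:
    """Return the set of all blocks of n consecutive words in the text."""
    words = text.split()  # assumption: words are separated by whitespace
    if len(words) < n:
        # Assumption: a document shorter than n words forms a single block.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def block_overlap(doc_a: str, doc_b: str, n: int = 13) -> float:
    """Jaccard similarity between the n-word blocks of two documents."""
    blocks_a = word_ngrams(doc_a, n)
    blocks_b = word_ngrams(doc_b, n)
    union = blocks_a | blocks_b
    if not union:
        return 0.0
    return len(blocks_a & blocks_b) / len(union)


def is_near_duplicate(doc_a: str, doc_b: str, n: int = 13,
                      threshold: float = 0.8) -> bool:
    """Flag a pair whose block overlap exceeds the threshold."""
    return block_overlap(doc_a, doc_b, n) > threshold


# Toy documents made of 100 distinct placeholder words.
doc_a = " ".join(f"w{i}" for i in range(100))
# Same document with one extra word appended.
doc_b = doc_a + " extra"
# First half identical to doc_a, second half entirely different.
doc_c = " ".join([f"w{i}" for i in range(50)] + [f"x{i}" for i in range(50)])

print(is_near_duplicate(doc_a, doc_b))  # True
print(is_near_duplicate(doc_a, doc_c))  # False
```

Comparing every pair of documents this way scales quadratically with corpus size, so the sketch is only meant to show what the two default values control.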
See `dedup --help` for more information on all the settings.
Hashes for nlp_dedup-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7452ef9925b8a78f76238cdbe5e1b329301bfb1cc43e36797125b71df0eb5811
MD5 | 6b99d03645c4e54947f48a15c00ff821
BLAKE2b-256 | 6f4410765d46b677a6227dc68054d7ac55b7e6e397febf76fab7548b62068cb0