Remove duplicates and near-duplicates from text corpora, no matter the scale.
NLPDedup
Developers:
- Dan Saattrup Nielsen (dan.nielsen@alexandra.dk)
- Kenneth Enevoldsen (kennethcenevoldsen@gmail.com)
Installation
The package is available on PyPI, so you can install it using your favourite
package manager. For instance:
$ pip install nlp_dedup
or:
$ poetry add nlp_dedup
Quick start
If the corpus is stored as corpus.txt (both txt and jsonl files are supported),
the following deduplicates the corpus and stores the deduplicated corpus in the
folder deduplicated:
$ dedup corpus.txt deduplicated
By default, deduplication is based on blocks of 13 consecutive words, and two documents are considered near-duplicates if they have more than 80% of these blocks in common. All of this can be changed to suit your needs.
See $ dedup --help for more information on all the settings.
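To make the default criterion concrete, here is a minimal sketch of the idea in plain Python. It is not the package's implementation or API: the function names (word_shingles, jaccard, is_near_duplicate) are hypothetical, and reading "80% of blocks in common" as a Jaccard similarity over the two sets of blocks is an assumption. It also compares documents exactly and pairwise, which would not scale to a large corpus.

```python
from __future__ import annotations


def word_shingles(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of all blocks of n consecutive words in the text."""
    words = text.split()
    if not words:
        return set()
    if len(words) < n:
        # Assumption: a document shorter than n words forms a single block.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of blocks the two sets have in common (intersection over union)."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str, n: int = 13, threshold: float = 0.8) -> bool:
    """True if the documents share more than `threshold` of their word blocks."""
    return jaccard(word_shingles(doc_a, n), word_shingles(doc_b, n)) > threshold


if __name__ == "__main__":
    base = " ".join(f"word{i}" for i in range(100))
    print(is_near_duplicate(base, base + " extra"))
    print(is_near_duplicate(base, " ".join(f"other{i}" for i in range(100))))
```

Because the blocks are 13 words long, a single changed word in the middle of a document affects up to 13 blocks. This is why the block size and the threshold are worth tuning together.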
Hashes for nlp_dedup-0.1.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0a02ce1265730e8e001623ea63af4929a909f93b60153635e0207ac1d92b3242
MD5 | 46e4cc6a02d840a2b601292e8e968a60
BLAKE2b-256 | 117bff21c3e7f6eec0f55eacb9ba792792000645eaa9300b504fcf6705bbb2e1