Remove duplicates and near-duplicates from text corpora, no matter the scale.
NLPDedup
Developers:
- Dan Saattrup Nielsen (dan.nielsen@alexandra.dk)
- Kenneth Enevoldsen (kennethcenevoldsen@gmail.com)
Installation
The package is available on PyPI, so you can install the package using your favourite
package manager. For instance, pip install nlp_dedup or poetry add nlp_dedup.
Quick Start
If the corpus is stored as corpus.txt (both txt and jsonl files are supported),
the following deduplicates the corpus and stores the deduplicated corpus in the
folder deduplicated:
$ dedup corpus.txt deduplicated
This defaults to deduplicating based on blocks of 13 consecutive words, where two
documents are considered near-duplicate if they have more than 80% of these blocks in
common. This can all be changed to your specific needs, however. See $ dedup --help
for more information on all the settings.
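To make the default criterion concrete, here is a minimal sketch of what "blocks of 13 consecutive words" and an 80% overlap can mean. It is not the package's implementation: the function names are hypothetical, the overlap is assumed to be the Jaccard similarity of the two block sets, and exact set comparison like this would not scale to large corpora.

```python
def word_blocks(text, block_size=13):
    """Return the set of blocks of `block_size` consecutive words in `text`."""
    words = text.split()
    if len(words) < block_size:
        # Assumption: a short document forms a single block of all its words.
        return {tuple(words)} if words else set()
    return {
        tuple(words[i:i + block_size])
        for i in range(len(words) - block_size + 1)
    }


def block_overlap(doc1, doc2, block_size=13):
    """Jaccard similarity of the two documents' block sets (an assumption)."""
    blocks1 = word_blocks(doc1, block_size)
    blocks2 = word_blocks(doc2, block_size)
    if not blocks1 and not blocks2:
        return 1.0  # two empty documents are treated as identical
    return len(blocks1 & blocks2) / len(blocks1 | blocks2)


def is_near_duplicate(doc1, doc2, block_size=13, threshold=0.8):
    """True if the documents share more than `threshold` of their blocks."""
    return block_overlap(doc1, doc2, block_size) > threshold
```

With these definitions, a 100-word document and a copy with one extra word appended share 88 of 89 distinct blocks, so they count as near-duplicates, while two documents with no words in common do not.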
Deduplication can also be done directly from Python:
>>> from nlp_dedup import Deduper
>>> deduper = Deduper()
>>> corpus = ["Test", "Another test", "Test"]
>>> deduper.deduplicate(corpus=corpus)
Here corpus does not have to be a list, but can also be an iterable or generator of
strings, if the corpus is too big to be stored in memory. Dictionaries are also
supported instead of strings, in which case the text entry in the dictionaries will
be used (change this with the text_column argument when calling deduplicate).
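As an illustration of the streaming case, the following sketch defines two generators that read a corpus lazily from disk, one yielding strings and one yielding dictionaries. The helper names, the file names, and the "body" key are hypothetical. The commented calls at the end use only the deduplicate arguments mentioned above and assume nlp_dedup is installed.

```python
import json


def stream_lines(path):
    """Yield one non-empty line at a time, so the file is never fully loaded."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def stream_records(path):
    """Yield one dictionary per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical usage, assuming nlp_dedup is installed:
#
#   from nlp_dedup import Deduper
#   Deduper().deduplicate(corpus=stream_lines("corpus.txt"))
#   Deduper().deduplicate(
#       corpus=stream_records("corpus.jsonl"), text_column="body"
#   )
```

Each generator can be consumed only once, so create a fresh one for every call to deduplicate.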
See more in the documentation.