Remove duplicates from parallel corpora

RemoveDup

A fast, memory-efficient Python module to remove duplicates from parallel text corpora.

It's useful for cleaning up language model training datasets that contain duplicate entries.

Installation

pip install removedup

Usage

from removedup import rdup

src, tgt, removed = rdup("source.txt", "target.txt")
print(src, tgt, removed)
# source.txt.dedup
# target.txt.dedup
# <num lines removed>

Notes

Source and target must have the same number of lines. No validation checks are made.
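
Since no validation is performed, you may want to check the line counts yourself before deduplicating. The following is a minimal sketch, not part of removedup: the helper names count_lines and check_parallel are hypothetical and use only the Python standard library.

```python
def count_lines(path):
    """Count lines in a file without loading it into memory.

    A final line without a trailing newline is still counted.
    """
    count = 0
    with open(path, "rb") as f:
        for _ in f:
            count += 1
    return count


def check_parallel(source_path, target_path):
    """Hypothetical helper: raise ValueError if the line counts differ."""
    src_n = count_lines(source_path)
    tgt_n = count_lines(target_path)
    if src_n != tgt_n:
        raise ValueError(
            f"{source_path} has {src_n} lines, {target_path} has {tgt_n}"
        )
    return src_n
```

Calling check_parallel("source.txt", "target.txt") before rdup("source.txt", "target.txt") would catch a misaligned corpus early.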

Duplication checks are only made on the source content. If you want to check for duplicates on the target, simply switch the order of the parameters.
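
To illustrate what source-only duplicate checking means, here is a pure Python sketch of the idea. It is not removedup's implementation: the function name dedup_by_source is hypothetical, and it assumes exact string matching and that the first occurrence of each source line is kept, neither of which is stated above.

```python
def dedup_by_source(src_lines, tgt_lines):
    """Illustrative only: drop pairs whose source line was already seen.

    Assumes exact matching and keeps the first occurrence of each source line.
    """
    seen = set()
    kept_src, kept_tgt = [], []
    removed = 0
    for src, tgt in zip(src_lines, tgt_lines):
        if src in seen:
            # The whole pair is dropped, even if the target text differs.
            removed += 1
            continue
        seen.add(src)
        kept_src.append(src)
        kept_tgt.append(tgt)
    return kept_src, kept_tgt, removed


src = ["hello", "world", "hello"]
tgt = ["ciao", "mondo", "salve"]

# Keyed on the source: the third pair is removed because "hello" repeats.
print(dedup_by_source(src, tgt))
# (['hello', 'world'], ['ciao', 'mondo'], 1)

# Swapping the arguments keys on the target instead: all targets are unique.
print(dedup_by_source(tgt, src))
# (['ciao', 'mondo', 'salve'], ['hello', 'world', 'hello'], 0)
```

Note that after swapping, the first returned list holds the target lines and the second holds the source lines.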

Build

git clone https://github.com/LibreTranslate/RemoveDup
cd RemoveDup
python setup.py build

Standalone Binary

You can also use removedup as a standalone Windows, macOS or Linux application. We don't currently provide binaries, so you need to build it from source.

mkdir build
cd build && cmake .. && make -j4
./rdup source.txt target.txt

Contributing

We welcome pull requests!

License

AGPLv3

