Quickly shuffle parallel corpora

Project description

FastShuffle

A memory-efficient Python module for quickly shuffling parallel text corpora. Its main advantage is that it never loads the entire dataset into memory; it works with memory-mapped file offsets instead.
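The offset-based idea can be illustrated with a minimal pure-Python sketch. This is not fastshuffle's actual implementation; the names `line_offsets` and `shuffle_parallel_sketch` are hypothetical, and the sketch assumes non-empty files with `\n` line endings. Only the byte offsets of each line are held in memory, and the same permutation is applied to both files so that line pairs stay aligned.

```python
import mmap
import random


def line_offsets(path):
    """Return a list of (start, end) byte offsets, one pair per line.

    Assumes a non-empty file; mmap cannot map a zero-length file.
    """
    offsets = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        start = 0
        while start < size:
            nl = mm.find(b"\n", start)
            end = size if nl == -1 else nl + 1
            offsets.append((start, end))
            start = end
    return offsets


def shuffle_parallel_sketch(src_path, tgt_path, seed=None):
    """Write <path>.shuffled for both files using one shared permutation."""
    src_off = line_offsets(src_path)
    tgt_off = line_offsets(tgt_path)
    if len(src_off) != len(tgt_off):
        raise ValueError("source and target have different line counts")

    # One permutation of line indices, reused for both files.
    order = list(range(len(src_off)))
    random.Random(seed).shuffle(order)

    outputs = []
    for path, offs in ((src_path, src_off), (tgt_path, tgt_off)):
        out_path = path + ".shuffled"
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
                open(out_path, "wb") as out:
            for i in order:
                start, end = offs[i]
                line = mm[start:end]
                # The last line of a file may lack a trailing newline.
                if not line.endswith(b"\n"):
                    line += b"\n"
                out.write(line)
        outputs.append(out_path)
    return tuple(outputs)
```

The operating system pages file contents in and out on demand, so the Python process keeps only the offset list and the permutation in memory, not the text itself.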

Installation

pip install fastshuffle

Usage

from fastshuffle import file_shuffle

src, tgt = file_shuffle("source.txt", "target.txt")
print(src, tgt)
# source.txt.shuffled target.txt.shuffled

You can also sample (isolate) a given number of sentences from the dataset while shuffling. The sampled sentences are removed from the shuffled result:

from fastshuffle import file_shuffle_sample

src, tgt, src_sample, tgt_sample = file_shuffle_sample("source.txt", "target.txt", 5)  # Sample 5 sentences
print(src, tgt, src_sample, tgt_sample)
# source.txt.shuffled target.txt.shuffled source.txt.shuffled.sample target.txt.shuffled.sample

Notes

Source and target must have the same number of lines. No validation checks are made.
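Because no validation is performed, it may be worth checking the line counts yourself before shuffling. The following is a minimal sketch; `count_lines` and `check_parallel` are hypothetical helpers, not part of fastshuffle. Files are read in fixed-size chunks so memory use stays small even for large corpora.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count lines by scanning for newline bytes in fixed-size chunks."""
    count = 0
    last_byte = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        count += 1
    return count


def check_parallel(src_path, tgt_path):
    """Raise ValueError if the two files differ in line count; else return the count."""
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f"line count mismatch: {src_path} has {n_src}, {tgt_path} has {n_tgt}"
        )
    return n_src
```

Calling check_parallel("source.txt", "target.txt") before file_shuffle("source.txt", "target.txt") turns a silent misalignment into an explicit error.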

Build

git clone https://github.com/LibreTranslate/FastShuffle
cd FastShuffle
python setup.py build

Standalone Binary

You can also use fastshuffle as a standalone application on Windows, macOS, or Linux. Binaries are not currently provided, so you need to build from source.

mkdir build
cd build && cmake .. && make -j4
./shuffle source.txt target.txt

Contributing

We welcome pull requests!

License

AGPLv3

Download files

Download the file for your platform.

Source Distribution

fastshuffle-1.0.1.tar.gz (344.4 kB view hashes)

Uploaded Source

Built Distributions

fastshuffle-1.0.1-pp310-pypy310_pp73-win_amd64.whl (71.0 kB view hashes)

Uploaded PyPy Windows x86-64

fastshuffle-1.0.1-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (87.0 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-pp310-pypy310_pp73-manylinux_2_17_i686.manylinux2014_i686.whl (92.7 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-pp310-pypy310_pp73-macosx_10_9_x86_64.whl (63.9 kB view hashes)

Uploaded PyPy macOS 10.9+ x86-64

fastshuffle-1.0.1-pp39-pypy39_pp73-win_amd64.whl (70.9 kB view hashes)

Uploaded PyPy Windows x86-64

fastshuffle-1.0.1-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (87.1 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-pp39-pypy39_pp73-manylinux_2_17_i686.manylinux2014_i686.whl (92.5 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-pp39-pypy39_pp73-macosx_10_9_x86_64.whl (63.9 kB view hashes)

Uploaded PyPy macOS 10.9+ x86-64

fastshuffle-1.0.1-pp38-pypy38_pp73-win_amd64.whl (71.0 kB view hashes)

Uploaded PyPy Windows x86-64

fastshuffle-1.0.1-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (87.1 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-pp38-pypy38_pp73-manylinux_2_17_i686.manylinux2014_i686.whl (92.5 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-pp38-pypy38_pp73-macosx_10_9_x86_64.whl (64.0 kB view hashes)

Uploaded PyPy macOS 10.9+ x86-64

fastshuffle-1.0.1-pp37-pypy37_pp73-win_amd64.whl (70.8 kB view hashes)

Uploaded PyPy Windows x86-64

fastshuffle-1.0.1-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (86.5 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-pp37-pypy37_pp73-manylinux_2_17_i686.manylinux2014_i686.whl (92.2 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-pp37-pypy37_pp73-macosx_10_9_x86_64.whl (63.6 kB view hashes)

Uploaded PyPy macOS 10.9+ x86-64

fastshuffle-1.0.1-cp312-cp312-win_amd64.whl (70.7 kB view hashes)

Uploaded CPython 3.12 Windows x86-64

fastshuffle-1.0.1-cp312-cp312-win32.whl (64.1 kB view hashes)

Uploaded CPython 3.12 Windows x86

fastshuffle-1.0.1-cp312-cp312-musllinux_1_1_x86_64.whl (612.9 kB view hashes)

Uploaded CPython 3.12 musllinux: musl 1.1+ x86-64

fastshuffle-1.0.1-cp312-cp312-musllinux_1_1_i686.whl (671.1 kB view hashes)

Uploaded CPython 3.12 musllinux: musl 1.1+ i686

fastshuffle-1.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (88.0 kB view hashes)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl (93.8 kB view hashes)

Uploaded CPython 3.12 manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-cp312-cp312-macosx_10_9_x86_64.whl (64.0 kB view hashes)

Uploaded CPython 3.12 macOS 10.9+ x86-64

fastshuffle-1.0.1-cp311-cp311-win_amd64.whl (71.8 kB view hashes)

Uploaded CPython 3.11 Windows x86-64

fastshuffle-1.0.1-cp311-cp311-win32.whl (64.9 kB view hashes)

Uploaded CPython 3.11 Windows x86

fastshuffle-1.0.1-cp311-cp311-musllinux_1_1_x86_64.whl (613.4 kB view hashes)

Uploaded CPython 3.11 musllinux: musl 1.1+ x86-64

fastshuffle-1.0.1-cp311-cp311-musllinux_1_1_i686.whl (671.2 kB view hashes)

Uploaded CPython 3.11 musllinux: musl 1.1+ i686

fastshuffle-1.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (88.3 kB view hashes)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-cp311-cp311-manylinux_2_17_i686.manylinux2014_i686.whl (94.5 kB view hashes)

Uploaded CPython 3.11 manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-cp311-cp311-macosx_10_9_x86_64.whl (65.2 kB view hashes)

Uploaded CPython 3.11 macOS 10.9+ x86-64

fastshuffle-1.0.1-cp310-cp310-win_amd64.whl (71.0 kB view hashes)

Uploaded CPython 3.10 Windows x86-64

fastshuffle-1.0.1-cp310-cp310-win32.whl (63.9 kB view hashes)

Uploaded CPython 3.10 Windows x86

fastshuffle-1.0.1-cp310-cp310-musllinux_1_1_x86_64.whl (612.0 kB view hashes)

Uploaded CPython 3.10 musllinux: musl 1.1+ x86-64

fastshuffle-1.0.1-cp310-cp310-musllinux_1_1_i686.whl (669.9 kB view hashes)

Uploaded CPython 3.10 musllinux: musl 1.1+ i686

fastshuffle-1.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (87.1 kB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-cp310-cp310-manylinux_2_17_i686.manylinux2014_i686.whl (93.8 kB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-cp310-cp310-macosx_10_9_x86_64.whl (63.9 kB view hashes)

Uploaded CPython 3.10 macOS 10.9+ x86-64

fastshuffle-1.0.1-cp39-cp39-win_amd64.whl (71.0 kB view hashes)

Uploaded CPython 3.9 Windows x86-64

fastshuffle-1.0.1-cp39-cp39-win32.whl (64.0 kB view hashes)

Uploaded CPython 3.9 Windows x86

fastshuffle-1.0.1-cp39-cp39-musllinux_1_1_x86_64.whl (612.3 kB view hashes)

Uploaded CPython 3.9 musllinux: musl 1.1+ x86-64

fastshuffle-1.0.1-cp39-cp39-musllinux_1_1_i686.whl (670.1 kB view hashes)

Uploaded CPython 3.9 musllinux: musl 1.1+ i686

fastshuffle-1.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (87.1 kB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-cp39-cp39-manylinux_2_17_i686.manylinux2014_i686.whl (94.0 kB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-cp39-cp39-macosx_10_9_x86_64.whl (64.0 kB view hashes)

Uploaded CPython 3.9 macOS 10.9+ x86-64

fastshuffle-1.0.1-cp38-cp38-win_amd64.whl (71.1 kB view hashes)

Uploaded CPython 3.8 Windows x86-64

fastshuffle-1.0.1-cp38-cp38-win32.whl (63.9 kB view hashes)

Uploaded CPython 3.8 Windows x86

fastshuffle-1.0.1-cp38-cp38-musllinux_1_1_x86_64.whl (611.9 kB view hashes)

Uploaded CPython 3.8 musllinux: musl 1.1+ x86-64

fastshuffle-1.0.1-cp38-cp38-musllinux_1_1_i686.whl (670.0 kB view hashes)

Uploaded CPython 3.8 musllinux: musl 1.1+ i686

fastshuffle-1.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (87.0 kB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-cp38-cp38-manylinux_2_17_i686.manylinux2014_i686.whl (93.7 kB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-cp38-cp38-macosx_10_9_x86_64.whl (63.9 kB view hashes)

Uploaded CPython 3.8 macOS 10.9+ x86-64

fastshuffle-1.0.1-cp37-cp37m-win_amd64.whl (71.5 kB view hashes)

Uploaded CPython 3.7m Windows x86-64

fastshuffle-1.0.1-cp37-cp37m-win32.whl (64.7 kB view hashes)

Uploaded CPython 3.7m Windows x86

fastshuffle-1.0.1-cp37-cp37m-musllinux_1_1_x86_64.whl (612.8 kB view hashes)

Uploaded CPython 3.7m musllinux: musl 1.1+ x86-64

fastshuffle-1.0.1-cp37-cp37m-musllinux_1_1_i686.whl (670.9 kB view hashes)

Uploaded CPython 3.7m musllinux: musl 1.1+ i686

fastshuffle-1.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (87.8 kB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64

fastshuffle-1.0.1-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl (93.9 kB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.17+ i686

fastshuffle-1.0.1-cp37-cp37m-macosx_10_9_x86_64.whl (63.8 kB view hashes)

Uploaded CPython 3.7m macOS 10.9+ x86-64
