
Sotastream is a command-line tool that augments a batch of text and produces an infinite stream of records.


Sotastream

License: MIT

Sotastream is a data augmentation tool for training pipelines. It uses infinibatch internally to generate an infinite stream of shuffled training data and provides a means for on-the-fly data manipulation, augmentation, mixing, and sampling.

Setup

To install from PyPI (https://pypi.org/project/sotastream/):

pip install sotastream

Developer Setup:

# To begin, clone the repository:
git clone https://github.com/marian-nmt/sotastream
cd sotastream
# option 1:
python -m pip install .
# option 2: install in --editable mode
python -m pip install -e .

Entry points

  • As a module: python -m sotastream
  • As an executable on your $PATH: sotastream

Development

Install development tools

python -m pip install -e ".[dev,test]"   # editable mode; quotes keep shells such as zsh from expanding the brackets

Editable mode (-e / --editable) is recommended for development: pip links the installed package to your source tree, so any edits you make are reflected directly in the installed package. [dev,test] installs the dependencies for development and testing, including black and pytest.

We use black to reformat code to a common code style.

make reformat

Before creating any pull requests, run

make check          # runs reformatter and tests

Running tests

make test           # run unit tests
make regression     # run regression tests

See Makefile for more details.

Usage examples

A folder like split/parallel contains training data in TSV format (src<tab>tgt), split into *.gz files of around 100,000 lines each for better shuffling. The command below outputs an infinite stream of data generated from the gzipped files in these folders, according to the "wmt" recipe found in sotastream/pipelines/example_pipeline.py.

python -m sotastream example split/parallel split/backtrans
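
To make the idea of an infinite stream over sharded files concrete, here is a minimal sketch in plain Python. It is not how sotastream or infinibatch implement it; the function name infinite_shard_stream, the per-pass shard shuffling, and the in-shard shuffling are assumptions chosen for illustration.

```python
import gzip
import random
from pathlib import Path
from typing import Iterator, Tuple


def infinite_shard_stream(folder: str, seed: int = 0) -> Iterator[Tuple[str, str]]:
    """Yield (src, tgt) pairs forever from the *.gz shards in `folder`.

    Hypothetical helper: each pass visits the shards in a fresh random
    order and shuffles the lines inside each shard.
    """
    shards = sorted(Path(folder).glob("*.gz"))
    if not shards:
        raise FileNotFoundError(f"no *.gz shards found in {folder}")
    rng = random.Random(seed)
    while True:  # never terminates; the consumer decides when to stop
        order = shards[:]
        rng.shuffle(order)
        for shard in order:
            # Load one shard at a time so memory use is bounded by shard size.
            with gzip.open(shard, "rt", encoding="utf-8") as lines:
                rows = [line.rstrip("\n").split("\t", 1) for line in lines]
            rng.shuffle(rows)
            for row in rows:
                if len(row) == 2:  # skip lines that lack a tab separator
                    yield row[0], row[1]
```

Small shards are what make this kind of shuffling cheap: only one shard has to be held in memory at a time, which is one plausible reading of "for better shuffling" above.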

You can also provide compressed TSV files directly, in which case sotastream will split them to checksummed folders under /tmp/sotastream/{checksum}:

python -m sotastream example parallel.tsv.gz backtrans.tsv.gz
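
The splitting step can be pictured roughly as follows. This is a sketch under stated assumptions, not sotastream's actual code: the MD5 checksum of the compressed file, the shard naming scheme part.NNNNN.gz, and the function name split_to_shards are all hypothetical.

```python
import gzip
import hashlib
from itertools import count, islice
from pathlib import Path


def split_to_shards(tsv_gz: str, root: str, lines_per_shard: int = 100_000) -> Path:
    """Split a gzipped TSV into gzipped shards under root/<checksum>/.

    Hypothetical helper: the checksum algorithm and file names are
    assumptions made for this example.
    """
    digest = hashlib.md5()
    with open(tsv_gz, "rb") as raw:
        for block in iter(lambda: raw.read(1 << 20), b""):
            digest.update(block)  # hash the compressed bytes in 1 MiB blocks
    out_dir = Path(root) / digest.hexdigest()
    if out_dir.is_dir():
        return out_dir  # same input was split before; reuse the shards
    out_dir.mkdir(parents=True)
    with gzip.open(tsv_gz, "rt", encoding="utf-8") as lines:
        for index in count():
            chunk = list(islice(lines, lines_per_shard))
            if not chunk:
                break
            shard_path = out_dir / f"part.{index:05d}.gz"
            with gzip.open(shard_path, "wt", encoding="utf-8") as shard:
                shard.writelines(chunk)
    return out_dir
```

Keying the output folder on a checksum means the same input file maps to the same folder, so a repeated run can reuse the existing shards instead of splitting again.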

There are currently two main pipelines: "default" and "wmt". They differ in the data sources they take and in the other options available to them.

Global options control behavior such as splitting and parallelization, and each pipeline also has its own arguments. To see them, run:

# see global options
python -m sotastream -h

# see default pipeline options
python -m sotastream default -h

# see wmt pipeline options
python -m sotastream wmt -h

Don't cross the streams!

Sotastream workflows build a directed acyclic graph (DAG) of cascaded generators that pass mutable lines from the graph inputs to the pipeline output. Because each step may transform or otherwise manipulate every input line, the one requirement is that modifications made along separate branches must not be merged into a single node of the graph, or at least that great care is taken when doing so. The Mixer is an example: it does not merge modifications from alternate branches, but instead selects among multiple incoming branches according to a provided probability distribution.
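
The selection behavior described for the Mixer can be sketched with plain generators. This is an illustration only, not sotastream's Mixer class; the function name mix and its parameters are assumptions.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def mix(streams: Sequence[Iterator[T]], probs: Sequence[float], seed: int = 0) -> Iterator[T]:
    """Yield items by picking one incoming stream per step according to `probs`.

    Hypothetical stand-in for a mixer node: each output item comes from
    exactly one branch, so per-branch modifications are never merged.
    """
    if len(streams) != len(probs):
        raise ValueError("need exactly one probability per stream")
    rng = random.Random(seed)
    indices = list(range(len(streams)))
    while True:
        chosen = rng.choices(indices, weights=probs, k=1)[0]
        try:
            item = next(streams[chosen])
        except StopIteration:
            return  # a finite branch ran dry; end the mixed stream
        yield item
```

For example, mix([parallel_stream, backtrans_stream], [0.7, 0.3]) would draw roughly 70% of its output from the first branch, where both stream names are placeholders for any two iterators.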

Custom/private pipelines from your own (private) directory

You can create a custom pipeline by adding a file in the current (invocation) directory with a file name matching the pattern "*_pipeline.py". This should follow the interface defined in sotastream/pipelines, namely:

  • Apply the @pipeline("name") decorator to give your pipeline a name. This name must not conflict with existing names.
  • Inherit from the Pipeline base class in sotastream.pipeline. For document pipelines, use DocumentPipeline as the base class.

You can find some examples in test/dummy_pipeline.py, as well as the real examples in sotastream/pipelines.
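
The two conventions above (a name-registering decorator and discovery by file name) can be illustrated with a self-contained sketch. Everything here is a stand-in written for this example: PIPELINES, the local pipeline decorator, the local Pipeline class, and find_pipeline_files are hypothetical and do not reproduce sotastream's internals. A real custom pipeline should import the decorator and base class from sotastream and follow the examples in sotastream/pipelines.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical registry mapping pipeline names to classes.
PIPELINES: Dict[str, type] = {}


def pipeline(name: str) -> Callable[[type], type]:
    """Stand-in decorator that registers a class under `name`."""
    def register(cls: type) -> type:
        if name in PIPELINES:
            # Mirrors the rule that names must not conflict.
            raise ValueError(f"pipeline name already registered: {name}")
        PIPELINES[name] = cls
        return cls
    return register


class Pipeline:
    """Stand-in base class; the real one lives in sotastream."""


@pipeline("my_example")
class MyExamplePipeline(Pipeline):
    """A custom pipeline would define its data flow here."""


def find_pipeline_files(directory: str = ".") -> List[Path]:
    """List files in `directory` whose names match *_pipeline.py."""
    return sorted(Path(directory).glob("*_pipeline.py"))
```

A registry keyed by name is one common way for a command-line tool to expose user-defined classes as subcommands, which is why a duplicate name has to be rejected rather than silently overwritten.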

Authors

Sotastream is developed by TextMT Team @ Microsoft Translator.

If you use this tool, please cite:

@misc{post2023sotastream,
      title={SOTASTREAM: A Streaming Approach to Machine Translation Training}, 
      author={Matt Post and Thamme Gowda and Roman Grundkiewicz and Huda Khayrallah and Rohit Jain and Marcin Junczys-Dowmunt},
      year={2023},
      eprint={2308.07489},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Paper link: https://arxiv.org/abs/2308.07489

