Large-Scale Translation Data Mining.

stopes

stopes: A library for preparing data for machine translation research

As part of the FAIR No Language Left Behind (NLLB) (Paper, Website, Blog) project to drive inclusion through machine translation, a large amount of data was processed to create training data. We provide the libraries and tools we used to:

  1. create clean monolingual data from web data
  2. mine bitext
  3. easily write scalable pipelines for processing data for machine translation

Full documentation is available at https://facebookresearch.github.io/stopes

Examples

Check out the demo directory for an example of usage with the WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages data.

Requirements

stopes relies on:

  • submitit to schedule jobs when run on clusters
  • hydra-core version >= 1.2.0 for configuration
  • fairseq to use LASER encoders
  • PyTorch version >= 1.5.0
  • Python version >= 3.8
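
As a quick sanity check before installing, the minimum versions listed above can be compared against the current environment. The following is a minimal sketch, not part of stopes: the helper names (parse_version, meets_minimum, check_requirements) are hypothetical, and the PyPI distribution names hydra-core and torch are assumed to be the packages the list refers to. Pre-release suffixes such as rc1 are ignored, so treat the result as a hint rather than a guarantee.

```python
import re
import sys
from importlib import metadata

# Minimum versions copied from the requirements list above.
# Keys are assumed PyPI distribution names.
MINIMUMS = {"hydra-core": "1.2.0", "torch": "1.5.0"}


def parse_version(text):
    """Return the leading numeric release, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    release = re.match(r"\d+(?:\.\d+)*", text)
    if release is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in release.group().split("."))


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


for problem in check_requirements():
    print(problem)
```

The check does not cover submitit or fairseq, for which the list gives no minimum version.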

Installing stopes

stopes uses flit to manage its setup, so you will need a recent version of pip for the install to work. We recommend that you first upgrade pip: python -m pip install --upgrade pip

You can then install stopes with pip from the root of a local clone of the repository: pip install -e '.[dev,mono,mining]'

You can choose which extras to install. If you are only interested in mining, you do not need to install dev and mono.

The mining pipeline relies on fairseq to run LASER encoders. pip cannot currently install fairseq, so you will have to install it manually. Check the fairseq repo for up-to-date instructions and requirements:

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

If you plan to train many NMT models, you will also want to set up apex for faster training.

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

How stopes works

stopes is made of a few different parts:

  1. core provides a library to write readable pipelines
  2. modules provides a set of modules using the core library and implementing common steps in our mining and evaluation pipelines
  3. pipelines provides pipeline implementations for the data pipelines we use in NLLB:
     • monolingual to preprocess and clean single-language data
     • bitext to run the "global mining" pipeline and extract aligned sentences from two monolingual datasets (inspired by CCMatrix)
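
To make the layering above more concrete, here is a toy sketch of the general idea: small reusable steps chained into a pipeline. It is not the stopes API. Every name in it (Step, normalize_whitespace, deduplicate, run_pipeline) is hypothetical, and it does not model the hydra configuration or submitit scheduling that stopes relies on. The two steps only loosely mirror the kind of cleaning a monolingual pipeline might perform.

```python
from typing import Callable, Iterable, List

# Hypothetical names throughout: this is NOT the stopes API, only an
# illustration of reusable steps composed into a pipeline.
Step = Callable[[List[str]], List[str]]


def normalize_whitespace(lines: List[str]) -> List[str]:
    """Collapse runs of whitespace and drop lines that end up empty."""
    cleaned = (" ".join(line.split()) for line in lines)
    return [line for line in cleaned if line]


def deduplicate(lines: List[str]) -> List[str]:
    """Keep the first occurrence of each line, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


def run_pipeline(lines: List[str], steps: Iterable[Step]) -> List[str]:
    """Feed the output of each step into the next one."""
    for step in steps:
        lines = step(lines)
    return lines


raw = ["Hello   world ", "", "Hello world", "Bonjour  le monde"]
clean = run_pipeline(raw, [normalize_whitespace, deduplicate])
print(clean)  # ['Hello world', 'Bonjour le monde']
```

Keeping each step a plain function from lines to lines means steps can be reordered or reused across pipelines, which is the property the core/modules/pipelines split is meant to provide. See the full documentation for how actual stopes modules are written and configured.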

Full documentation: see https://facebookresearch.github.io/stopes or the websites/docs folder.

Contributing

See the CONTRIBUTING file for how to help out.

Contributors

(in alphabetical order)

Citation

If you use stopes in your work or any models/datasets/artifacts published in NLLB, please cite:

@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={{NLLB Team} and Costa-jussà, Marta R. and Cross, James and Çelebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Mejia-Gonzalez, Gabriel and Hansanti, Prangthip and Hoffman, John and Jarrett, Semarley and Sadagopan, Kaushik Ram and Rowe, Dirk and Spruit, Shannon and Tran, Chau and Andrews, Pierre and Ayan, Necip Fazil and Bhosale, Shruti and Edunov, Sergey and Fan, Angela and Gao, Cynthia and Goswami, Vedanuj and Guzmán, Francisco and Koehn, Philipp and Mourachko, Alexandre and Ropers, Christophe and Saleem, Safiyyah and Schwenk, Holger and Wang, Jeff},
  year={2022}
}

License

stopes is MIT licensed, as found in the LICENSE file.
