Sentence-level Multilingual Augmentation

SMAUG: Sentence-level Multilingual AUGmentation

smaug is a package for multilingual data augmentation. It offers transformations that each change one specific aspect of a sentence, such as its named entities or numbers.

Getting Started

To start using smaug, you can install it with pip:

pip install unbabel-smaug

To run a simple pipeline with all transforms and default validations, first create the following yaml file:

pipeline:
- cmd: io-read-lines
  path: <path to input file with single sentence per line>
  lang: <two letter language code for the input sentences>
- cmd: transf-swp-ne
- cmd: transf-swp-num
- cmd: transf-swp-poisson-span
- cmd: transf-neg
- cmd: transf-ins-text
- cmd: transf-del-punct-span
- cmd: io-write-json
  path: <path to output file>
# Remove this line for no seed
seed: <seed for the pipeline>

Then run the following command:

augment --cfg <path_to_config_file>
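As an illustration, the configuration file above can also be generated from a script. The following is a minimal sketch, not part of smaug: render_config is a hypothetical helper, and the file names, language code and seed are placeholder values. It only reproduces the YAML layout shown above; paths containing characters that are special in YAML may need quoting.

```python
# Transform commands copied from the pipeline shown above.
TRANSFORMS = [
    "transf-swp-ne",
    "transf-swp-num",
    "transf-swp-poisson-span",
    "transf-neg",
    "transf-ins-text",
    "transf-del-punct-span",
]


def render_config(input_path, lang, output_path, seed=None):
    """Return YAML text for a pipeline with all transforms (hypothetical helper)."""
    lines = [
        "pipeline:",
        "- cmd: io-read-lines",
        f"  path: {input_path}",
        f"  lang: {lang}",
    ]
    lines += [f"- cmd: {name}" for name in TRANSFORMS]
    lines += ["- cmd: io-write-json", f"  path: {output_path}"]
    # The seed entry is optional; leave it out for an unseeded run.
    if seed is not None:
        lines.append(f"seed: {seed}")
    return "\n".join(lines) + "\n"


# Placeholder values, chosen only for this example.
config_text = render_config("sentences.en.txt", "en", "augmented.json", seed=42)
print(config_text)
```

Saving the printed text to a file and passing that file to augment --cfg should then run the same pipeline as the hand-written version.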

Usage

The smaug package can be used as a command line interface (CLI) or by importing the package and calling its Python API directly. To use smaug, first install it as described in the Getting Started section.

Command Line Interface

The CLI offers a way to read, transform, validate and write perturbed sentences to files. For more information, see the full details.

Configuration File

The easiest way to run smaug is through a configuration file (see the full specification) that specifies an entire pipeline (as shown in the Getting Started section), using the following command:

augment --cfg <path_to_config_file>

Single Transform

As an alternative, you can use the command line to directly specify the pipeline to apply. To apply a single transform to a set of sentences, execute the following command:

augment io-read-lines -p <input_file> -l <input_lang_code> <transf_name> io-write-json -p <output_file>

<transf_name> is the name of the transform to apply (see this section for a list of available transforms).

<input_file> is a text file with one sentence per line.

<input_lang_code> is a two-letter language code for the input sentences.

<output_file> is a json file to be created with the transformed sentences.
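To make the placeholders concrete, here is a minimal sketch that fills them in from Python and prints the resulting command. The helper single_transform_args and the file names are hypothetical, and using transf-swp-num as <transf_name> assumes that the command names from the configuration file are also valid on the command line.

```python
import shlex


def single_transform_args(input_file, lang, transform, output_file):
    """Build the argument list for a one-transform run (hypothetical helper)."""
    return [
        "augment",
        "io-read-lines", "-p", input_file, "-l", lang,
        transform,
        "io-write-json", "-p", output_file,
    ]


# Placeholder values, chosen only for this example.
args = single_transform_args(
    "sentences.en.txt", "en", "transf-swp-num", "augmented.json"
)
command = shlex.join(args)
print(command)
```

With smaug installed, the same list can be passed to subprocess.run(args, check=True) to execute the command instead of printing it.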

Multiple Transforms

To apply multiple transforms, just specify them in arbitrary order between the read and write operations:

augment io-read-lines -p <input_file> -l <input_lang_code> <transf_name_1> <transf_name_2> ... io-write-json -p <output_file>

Multiple Inputs

To read from multiple input files, also specify them in arbitrary order:

augment io-read-lines -p <input_file_1> -l <input_lang_code_1> io-read-lines -p <input_file_2> -l <input_lang_code_2> ... <transf_name_1> <transf_name_2> ... io-write-json -p <output_file>
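When the number of inputs and transforms varies, the command can be assembled programmatically. The sketch below is one way to do that, assuming each input is read with its own io-read-lines step, as in the other commands in this section. build_augment_args and all file names are hypothetical and not part of smaug.

```python
def build_augment_args(inputs, transforms, output_file):
    """Build an augment command for several inputs and transforms.

    inputs is a list of (path, language code) pairs; transforms is a list
    of transform command names. Hypothetical helper, not part of smaug.
    """
    args = ["augment"]
    # One read step per input file, each with its own language code.
    for path, lang in inputs:
        args += ["io-read-lines", "-p", path, "-l", lang]
    # Transforms go between the read steps and the write step.
    args += list(transforms)
    args += ["io-write-json", "-p", output_file]
    return args


# Placeholder values, chosen only for this example.
multi_args = build_augment_args(
    inputs=[("sentences.en.txt", "en"), ("sentences.pt.txt", "pt")],
    transforms=["transf-swp-ne", "transf-neg"],
    output_file="augmented.json",
)
print(" ".join(multi_args))
```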

A single file can also mix languages if every line has the structure <lang code>,<sentence>. Read such a file with the following command:

augment io-read-csv -p <input_file> <transf_name_1> <transf_name_2> ... io-write-json -p <output_file>
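A mixed-language input file in this layout could be prepared as in the following sketch. format_rows is a hypothetical helper and the sentences are made-up examples. The text above does not say how smaug treats quoting or commas inside a sentence, so this sketch writes plain <lang code>,<sentence> lines and makes no assumption beyond that.

```python
def format_rows(pairs):
    """Render (language code, sentence) pairs as '<lang code>,<sentence>' lines."""
    return "".join(f"{lang},{sentence}\n" for lang, sentence in pairs)


# Made-up example sentences in two languages.
rows = format_rows([
    ("en", "I have 3 apples."),
    ("pt", "Tenho 3 maçãs."),
])
print(rows, end="")
```

Writing rows to a UTF-8 text file gives a file that can be passed as <input_file> to io-read-csv.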

Developing

To develop this package, execute the following steps:

  • Install the poetry tool for dependency management.

  • Clone this git repository and install the project.

git clone https://github.com/Unbabel/smaug.git
cd smaug
poetry install

