Sentence-level Multilingual Augmentation
SMAUG: Sentence-level Multilingual AUGmentation
smaug is a package for multilingual data augmentation. It offers transformations that change specific aspects of sentences, such as named entities and numbers.
Getting Started
To start using smaug, install it with pip:
pip install unbabel-smaug
To run a simple pipeline with all transforms and default validations, first create the following YAML file:
pipeline:
- cmd: io-read-lines
  path: <path to input file with single sentence per line>
  lang: <two letter language code for the input sentences>
- cmd: transf-swp-ne
- cmd: transf-swp-num
- cmd: transf-swp-poisson-span
- cmd: transf-neg
- cmd: transf-ins-text
- cmd: transf-del-punct-span
- cmd: io-write-json
  path: <path to output file>
# Remove this line for no seed
seed: <seed for the pipeline>
Then run the following command:
augment --cfg <path_to_config_file>
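The same two steps can be scripted. The sketch below is only an illustration and not part of smaug: the helpers build_config and run_pipeline are hypothetical names, and the code relies solely on the configuration keys and the augment --cfg command shown above. It assumes the paths contain no characters that need YAML quoting and that the augment executable is on the PATH when run_pipeline is called.

```python
import subprocess

# Transform commands copied from the example configuration above.
TRANSFORMS = [
    "transf-swp-ne",
    "transf-swp-num",
    "transf-swp-poisson-span",
    "transf-neg",
    "transf-ins-text",
    "transf-del-punct-span",
]


def build_config(input_path, lang, output_path, seed=None):
    """Return YAML text for a pipeline with all example transforms (hypothetical helper)."""
    lines = [
        "pipeline:",
        "- cmd: io-read-lines",
        f"  path: {input_path}",
        f"  lang: {lang}",
    ]
    lines += [f"- cmd: {name}" for name in TRANSFORMS]
    lines += ["- cmd: io-write-json", f"  path: {output_path}"]
    # The seed line is optional, as noted in the example configuration.
    if seed is not None:
        lines.append(f"seed: {seed}")
    return "\n".join(lines) + "\n"


def run_pipeline(config_path):
    """Call the CLI on a configuration file; requires smaug to be installed."""
    subprocess.run(["augment", "--cfg", str(config_path)], check=True)
```

A typical use would write the text returned by build_config("input.txt", "en", "output.json", seed=42) to a file and pass that file's path to run_pipeline.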
Usage
The smaug package can be used as a command line interface (CLI) or by directly importing and calling the package's Python API. To use smaug, first install it as described in the Getting Started section.
Command Line Interface
The CLI offers a way to read, transform, validate and write perturbed sentences to files. For more information, see the full details.
Configuration File
The easiest way to run smaug is through a configuration file (see the full specification) that specifies an entire pipeline (as shown in the Getting Started section), using the following command:
augment --cfg <path_to_config_file>
Single transform
As an alternative, you can use the command line to directly specify the pipeline to apply. To apply a single transform to a set of sentences, execute the following command:
augment io-read-lines -p <input_file> -l <input_lang_code> <transf_name> io-write-json -p <output_file>
- <transf_name> is the name of the transform to apply (see this section for a list of available transforms).
- <input_file> is a text file with one sentence per line.
- <input_lang_code> is a two-letter language code for the input sentences.
- <output_file> is a JSON file to be created with the transformed sentences.
Multiple Transforms
To apply multiple transforms, just specify them in arbitrary order between the read and write operations:
augment io-read-lines -p <input_file> -l <input_lang_code> <transf_name_1> <transf_name_2> ... io-write-json -p <output_file>
Multiple Inputs
To read from multiple input files, also specify them in arbitrary order:
augment io-read-lines -p <input_file_1> -l <input_lang_code_1> io-read-lines -p <input_file_2> -l <input_lang_code_2> ... <transf_name_1> <transf_name_2> ... io-write-json -p <output_file>
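When the set of inputs and transforms varies between runs, the argument list can be assembled programmatically. The following is a minimal sketch under stated assumptions: build_augment_command is a hypothetical helper, not a smaug function, and it assumes every input file is read with io-read-lines and the -p and -l options, as in the single-input command above.

```python
import shlex


def build_augment_command(inputs, transforms, output_file):
    """Build the argument list for a direct CLI call (hypothetical helper).

    inputs: iterable of (path, language code) pairs, one per input file.
    transforms: iterable of transform command names, applied in the given order.
    """
    cmd = ["augment"]
    # One read operation per input file.
    for path, lang in inputs:
        cmd += ["io-read-lines", "-p", path, "-l", lang]
    # Transforms go between the read and write operations.
    cmd += list(transforms)
    cmd += ["io-write-json", "-p", output_file]
    return cmd


def as_shell_string(cmd):
    """Render the argument list as a shell-quoted string, e.g. for logging."""
    return shlex.join(cmd)
```

The returned list can be passed to subprocess.run(cmd, check=True); keeping the arguments as a list avoids shell quoting problems with file names that contain spaces.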
A single file can also contain sentences in multiple languages. In that case, each line must have the structure <lang code>,<sentence>, and the file is read with the following command:
augment io-read-csv -p <input_file> <transf_name_1> <transf_name_2> ... io-write-json -p <output_file>
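An input file in this format can be produced from (language code, sentence) pairs as sketched below. This is an assumption-laden illustration: write_multilingual_input is a hypothetical helper, and the code writes exactly the <lang code>,<sentence> structure described above without any quoting, since how sentences that themselves contain commas are parsed is not specified here.

```python
from pathlib import Path


def write_multilingual_input(records, path):
    """Write (lang, sentence) pairs as '<lang code>,<sentence>' lines (hypothetical helper)."""
    lines = []
    for lang, sentence in records:
        # Each record must stay on one line, since the file is read line by line.
        if "\n" in sentence:
            raise ValueError("each sentence must fit on a single line")
        lines.append(f"{lang},{sentence}")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

The resulting file is what would be passed as <input_file> to io-read-csv.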
Developing
To develop this package, execute the following steps:
- Install the poetry tool for dependency management.
- Clone this git repository and install the project.
git clone https://github.com/Unbabel/smaug.git
cd smaug
poetry install