AutONMT-tf

AutONMT-tf is a configuration tool designed to simplify the creation of complete OpenNMT-tf pipelines (data loading, preprocessing, training, inference...). It can also be used for other tasks not related to OpenNMT-tf, but there are no built-in modules for other NMT frameworks.

AutONMT-tf is still at an early stage of development: neither stability nor backward compatibility is guaranteed.

Requirements

AutONMT-tf requires:

Python 3.7 or above
OpenNMT-tf 2.20 or above

Installation

Using pip

This is the recommended (and simplest) installation method:

pip install --upgrade pip
pip install AutONMT-tf

From source

You can also install AutONMT-tf directly from source:

git clone https://gitlab.com/mehdidou99/AutONMT-tf.git
cd AutONMT-tf
pip install --upgrade pip
pip install .

Usage

Quickstart

Once AutONMT-tf is installed, you can run a simple Transformer model pipeline with some preprocessing:

git clone https://gitlab.com/mehdidou99/AutONMT-tf.git
cd AutONMT-tf
autonmt_cli -v --config examples/pipelines/simple_transformer.yml --pipeline train

The input data used in this example is a toy dataset made of several corpora. You can inspect and modify it in examples/data/

Pipeline examples

Some examples are available in examples/pipelines/:

simple_encoder.yml: A very simple example showcasing base functionalities of AutONMT-tf
fren_triple_encoder.yml: A more complex example showcasing the future functionalities of AutONMT-tf, which will allow it to have the flexibility needed for more complex models and pipelines.

Command line

AutONMT-tf is used through the autonmt_cli command line interface.

Simplest usage: autonmt_cli --config path/to/pipeline/config/file.yml --pipeline name_of_the_pipeline_to_run
Key options:
- --until step: stops execution after the step named step
- --use_cache: resumes execution from the cache instead of restarting the pipeline from the beginning
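The options above can be combined in a single invocation. The following is a minimal sketch of a Python helper that assembles such a command line. The helper name build_autonmt_command and the step name "vocab_building" are hypothetical, and the sketch assumes that --use_cache is a plain flag, as the option list suggests.

```python
import shlex
from typing import List, Optional


def build_autonmt_command(
    config: str,
    pipeline: str,
    until: Optional[str] = None,
    use_cache: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Build the argument list for an autonmt_cli call (hypothetical helper)."""
    cmd = ["autonmt_cli"]
    if verbose:
        cmd.append("-v")
    cmd += ["--config", config, "--pipeline", pipeline]
    if until is not None:
        # Stop after the given step.
        cmd += ["--until", until]
    if use_cache:
        # Assumption: --use_cache takes no value.
        cmd.append("--use_cache")
    return cmd


cmd = build_autonmt_command(
    "examples/pipelines/simple_transformer.yml",
    "train",
    until="vocab_building",  # hypothetical step name
    use_cache=True,
)
# Print a shell-safe version of the command.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually launch the pipeline, the list can be passed to subprocess.run(cmd, check=True), which requires AutONMT-tf to be installed.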

Pipeline elements

Each pipeline configuration file is made of the following elements:

Global configuration
Pipelines made of pipeline blocks
Modules

The simple_transformer example illustrates all of those elements.

Global configuration

The global configuration defines the elements that are shared by all the pipelines defined in the file:

Experiment name
Custom directories
Model configuration
Scripts directory
Cache directory

Pipelines

Pipelines are the core element of AutONMT-tf. A pipeline is a list of pipeline blocks, each of which defines a specific step of the process. Each block is applied to a list of corpora; it receives its input through input tags and produces output tags. See Tags to learn more.

AutONMT-tf currently provides the following block types:

data_query: Loads data. It is usually the first block of a pipeline and creates the corpora used by the subsequent blocks.
merging: Merges data from several datasets into one new dataset, usually to assemble training data.
vocab_building: Builds a vocabulary using the 'onmt-build-vocab' command from OpenNMT-tf.
splitting: Splits input data into several parts. The intended use is to split training data into train, test and validation sets.
training: Trains the model using the 'onmt-main' command from OpenNMT-tf.
script: Executes custom scripts, usually for experiment-dependent features such as preprocessing, tokenization or score computation.
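To make the block and tag idea more concrete, here is a purely conceptual sketch. It is not AutONMT-tf's implementation: the Block class, the run_pipeline function, the tag names and the in-memory store are all hypothetical, chosen only to illustrate how each block reads an input tag and writes an output tag for every corpus it is applied to.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical store: maps (corpus, tag) pairs to lines of text.
Store = Dict[Tuple[str, str], List[str]]


@dataclass
class Block:
    """One pipeline step, applied to a list of corpora (illustrative only)."""
    name: str
    corpora: List[str]
    input_tag: str
    output_tag: str
    transform: Callable[[List[str]], List[str]]

    def run(self, store: Store) -> None:
        for corpus in self.corpora:
            lines = store[(corpus, self.input_tag)]
            store[(corpus, self.output_tag)] = self.transform(lines)


def run_pipeline(blocks: List[Block], store: Store, until: Optional[str] = None) -> Store:
    """Run blocks in order, stopping after the block named `until` if given."""
    for block in blocks:
        block.run(store)
        if block.name == until:
            break
    return store


store: Store = {("Tatoeba", "raw"): ["  Bonjour ", "Salut  "]}
pipeline = [
    Block("strip", ["Tatoeba"], "raw", "stripped",
          lambda lines: [line.strip() for line in lines]),
    Block("lowercase", ["Tatoeba"], "stripped", "lowercased",
          lambda lines: [line.lower() for line in lines]),
]
run_pipeline(pipeline, store, until="strip")
print(sorted(tag for _, tag in store))
```

Because the run stops after the "strip" block, the store only gains the "stripped" tag; the "lowercase" block would pick it up as its input tag on a full run.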

Modules

Modules are currently plain configuration modules: a block can delegate its specific configuration to a module. Their role should be extended in future versions, so that complete blocks can be defined as modules and external module files can be loaded, allowing blocks to be reused across experiments.

Artifacts

Some of the generated files are needed by the user, either to be inspected (e.g. training data) or to be used in other pipelines. For example, a tokenizer can be trained on the training data, and the output of that training is then needed to tokenize test data. Users can retrieve such files through artifacts, by defining correspondences between corpus/tag pairs and the custom filenames under which they want to save the files. See simple_transformer for a concrete example.
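The idea of mapping corpus/tag pairs to custom filenames can be sketched as follows. This is an illustration only, not the way AutONMT-tf stores or exports files: the export_artifacts function, the "train" corpus, the "tokenizer_model" tag and all file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Tuple

Key = Tuple[str, str]  # (corpus, tag)


def export_artifacts(
    generated: Dict[Key, Path],
    artifacts: Dict[Key, str],
    dest_dir: Path,
) -> Dict[Key, Path]:
    """Copy the generated file of each requested (corpus, tag) pair
    to the custom filename chosen by the user."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    exported = {}
    for key, filename in artifacts.items():
        target = dest_dir / filename
        shutil.copyfile(generated[key], target)
        exported[key] = target
    return exported


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    # Hypothetical file produced by some pipeline step.
    model_file = work / "cache_0001.model"
    model_file.write_text("tokenizer model\n")

    generated = {("train", "tokenizer_model"): model_file}
    artifacts = {("train", "tokenizer_model"): "tokenizer.model"}

    exported = export_artifacts(generated, artifacts, work / "artifacts")
    exported_name = exported[("train", "tokenizer_model")].name
    exported_text = exported[("train", "tokenizer_model")].read_text()

print(exported_name)
```

Another pipeline could then refer to the exported file by its stable, user-chosen name rather than by an internal cache path.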

AutONMT-tf blocks

Data Query

The data_query block can load data from several corpora according to a query. Only one kind of query is currently supported; here is an example from simple_transformer:

data_query_train:
    search_root: data/multilingual
    subpaths:
        Tatoeba: 'Tatoeba/Tatoeba'
        GlobalVoices: 'GlobalVoices/GlobalVoices'
        NewsCommentary: 'NewsCommentary/News-Commentary'
        TED2013: 'TED2013/TED2013'
    query:
        src_lgs:
            - fr
        tgt_lgs:
            - en
        max_entries: 1000000
        options:
            - try_reversed_lg_pairs

To be found by this kind of query, corpus files must be named according to the following convention: <subpath>.<src>-<tgt>.<src or tgt>

For example, for Tatoeba in the preceding example: Tatoeba/Tatoeba.fr-en.fr and Tatoeba/Tatoeba.fr-en.en.

If the try_reversed_lg_pairs option is active, source and target languages can be reversed in the suffix, e.g. Tatoeba/Tatoeba.en-fr.fr and Tatoeba/Tatoeba.en-fr.en.

Other query types will be added in a future version, allowing for less rigid naming conventions.
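The naming convention described in this section can be sketched as a small function that lists the file names a query would look for. This is an illustration of the convention as read here (one file per language side of each pair), not AutONMT-tf's own lookup code; the function name candidate_files is hypothetical.

```python
from typing import List, Tuple


def candidate_files(
    subpath: str,
    src: str,
    tgt: str,
    try_reversed: bool = False,
) -> List[Tuple[str, str]]:
    """Return (source_file, target_file) name pairs following
    <subpath>.<src>-<tgt>.<src or tgt>, optionally with the
    language pair reversed in the suffix."""
    pairs = [(src, tgt)]
    if try_reversed:
        pairs.append((tgt, src))
    return [
        (f"{subpath}.{a}-{b}.{src}", f"{subpath}.{a}-{b}.{tgt}")
        for a, b in pairs
    ]


for source_file, target_file in candidate_files(
    "Tatoeba/Tatoeba", "fr", "en", try_reversed=True
):
    print(source_file, target_file)
```

In this sketch the names are relative to the search_root of the query, so the files would be looked up under data/multilingual in the example configuration.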

Release history

0.1.2 (this version): Jul 30, 2021
0.1.1: Jul 30, 2021
0.1.0: Jul 30, 2021
