TEA - Translation Engine Architect
A command line tool to create translation engines.
Install
First install pipx, then run:
pipx install pangeamt-tea
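If pipx itself is not installed yet, the usual bootstrap from the pipx documentation is (python3 is assumed to be on your PATH):

# Install pipx for the current user and make its apps available on PATH
python3 -m pip install --user pipx
python3 -m pipx ensurepath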
Usage
Step 1: Create a new project
tea new --customer customer --src_lang es --tgt_lang en --flavor automotion --version 2
This command will create the project directory structure:
├── customer_es_en_automotion_2
│   ├── config.yml
│   └── data
Then enter the directory:
cd customer_es_en_automotion_2
Step 2: Configuration
Tokenizer
A tokenizer can be applied to the source and target languages:
tea config tokenizer --src mecab --tgt moses
To list all available tokenizers:
tea config tokenizer --help
If you do not want to use tokenizers, run:
tea config tokenizer -s none -t none
Truecaser
tea config truecaser --src --tgt
If you do not want to use the truecaser, run:
tea config truecaser
BPE / SentencePiece
For joint BPE:
tea config bpe -j
For non-joint BPE:
tea config bpe -s -t
For using sentencepiece:
tea config bpe --sentencepiece
The options --model_type TEXT (default: unigram) and --vocab_size INTEGER (default: 8000) can be added to change the defaults.
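For example, to override both defaults (this assumes TEA forwards --model_type to SentencePiece, whose model types include unigram, bpe, char, and word):

tea config bpe --sentencepiece --model_type bpe --vocab_size 16000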
Processors
tea config processors -s "{processors}"
where {processors} is a list of preprocessing and postprocessing steps.
To list all available processors:
tea config processors --list
To test the processors that will be applied, you can run this script from the main TEA project directory:
debug_normalizers.py <config_file> <src_test> <tgt_test>
where config_file is the YAML config file, and src_test and tgt_test are the source and target segments to test.
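A possible invocation (the test file names here are hypothetical):

python debug_normalizers.py config.yml sample.src sample.tgt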
Prepare
tea config prepare --shard_size 100000 --src_seq_length 400 --tgt_seq_length 400
Translation model
tea config translation-model -n onmt
Step 3: Add data
Copy some multilingual resources (.tmx, bilingual files, .af) into the 'data' directory.
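For example (the corpus path is hypothetical):

cp /path/to/corpus.tmx data/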
Step 4: Run
Create the workflow:
tea workflow new
Clean the data by applying the normalizers and validators:
tea workflow clean -n {clean_th} -d
where {clean_th} is the number of threads.
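For example, with 4 threads (the thread count is illustrative):

tea workflow clean -n 4 -d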
Preprocess the data (split into train, dev, and test sets, tokenization, BPE):
tea workflow prepare -n {prepare_th} -s 3
where {prepare_th} is the number of threads.
Train model
tea workflow train --gpu 0
If you do not want to use a GPU, omit this parameter.
Evaluate model
tea workflow eval --step {step} --src file.src --ref file.tgt --log file.log --out file.out --gpu 0
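Putting it all together, a minimal end-to-end run using only the commands above (thread counts, the checkpoint step, and the evaluation file names are illustrative):

# Create the workflow, then clean, prepare, train, and evaluate
tea workflow new
tea workflow clean -n 4 -d
tea workflow prepare -n 4 -s 3
tea workflow train --gpu 0
tea workflow eval --step 10000 --src file.src --ref file.tgt --log file.log --out file.out --gpu 0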
Reset
First, you can check the current status of the workflow:
tea workflow status
Then you can reset your workflow at any step (clean, prepare, train, eval) using:
tea workflow reset -s {step_name}
Or if you want to make a full reset of the workflow use:
tea workflow reset
If you need help with the reset command:
tea workflow reset --help
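For example, to reset only the train step:

tea workflow reset -s train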