Small-vocabulary neural sequence-to-sequence models

Yoyodyne 🪀

Yoyodyne provides neural models for small-vocabulary sequence-to-sequence generation with and without feature conditioning.

These models are implemented using PyTorch and Lightning.

While we provide classic LSTM and transformer models, some of the provided models are particularly well-suited for problems where the source-target alignments are roughly monotonic (e.g., transducer_lstm and hard_attention_lstm) and/or where source and target vocabularies have substantial overlap (e.g., pointer_generator_lstm).

Philosophy

Yoyodyne is inspired by FairSeq (Ott et al. 2019) but differs on several key points of design:

It is for small-vocabulary sequence-to-sequence generation, and therefore includes no affordances for machine translation or language modeling. Because of this:
- The architectures provided are intended to be reasonably exhaustive.
- There is little need for data preprocessing; it works with TSV files.
It has support for using features to condition decoding, with architecture-specific code for handling feature information.
It supports the use of validation accuracy (not loss) for model selection and early stopping.
Releases are made regularly.
🚧 UNDER CONSTRUCTION 🚧: It has exhaustive test suites.
🚧 UNDER CONSTRUCTION 🚧: It has performance benchmarks.

Authors

Yoyodyne was created by Adam Wiemerslage, Kyle Gorman, Travis Bartley, and other contributors like yourself.

Installation

Local installation

First install dependencies:

pip install -r requirements.txt

Then install:

pip install .

Google Colab

Yoyodyne is compatible with Google Colab GPU runtimes. This notebook provides a worked example. Colab also provides access to TPU runtimes, but to our knowledge these are not yet compatible with Yoyodyne.

Usage

Training

Training is performed by the yoyodyne-train script. One must specify the following required arguments:

--model_dir: path for model metadata and checkpoints
--train: path to TSV file containing training data
--val: path to TSV file containing validation data

The user can also specify various optional training and architectural arguments. See below or run yoyodyne-train --help for more information.

Validation

Validation is run at intervals requested by the user; see the --val_check_interval and --check_val_every_n_epoch flags, which correspond to the Lightning trainer options of the same names. Additional evaluation metrics can also be requested with --eval_metric. For example,

yoyodyne-train --eval_metric ser ...

will additionally compute symbol error rate (SER) each time validation is performed. Additional metrics can be added to evaluators.py.
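
As a rough illustration of what a symbol error rate measures, the sketch below computes the Levenshtein distance between predicted and gold symbol sequences and divides the total number of edits by the total number of gold symbols. The function names and the normalization are assumptions made for illustration; the metric implemented in evaluators.py may differ in detail.

```python
from typing import Sequence


def edit_distance(hyp: Sequence[str], gold: Sequence[str]) -> int:
    """Levenshtein distance between two symbol sequences."""
    # prev[j] is the distance between the hyp prefix seen so far and gold[:j].
    prev = list(range(len(gold) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, g in enumerate(gold, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # Deletion.
                    curr[j - 1] + 1,  # Insertion.
                    prev[j - 1] + (h != g),  # Substitution or match.
                )
            )
        prev = curr
    return prev[-1]


def symbol_error_rate(hyps, golds) -> float:
    """Total edits divided by total gold symbols (assumed normalization)."""
    edits = sum(edit_distance(h, g) for h, g in zip(hyps, golds))
    total = sum(len(g) for g in golds)
    return edits / total if total else 0.0


print(symbol_error_rate(["walked", "runned"], ["walked", "ran"]))
```

Strings are treated here as sequences of characters, but any sequence of symbols (e.g., a list of tokens) works the same way.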

Prediction

Prediction is performed by the yoyodyne-predict script. One must specify the following required arguments:

--arch: architecture, matching the one used for training
--model_dir: path for model metadata
--checkpoint: path to checkpoint
--predict: path to file containing data to be predicted
--output: path for predictions

The --predict file can either be a TSV file or an ordinary TXT file with one source string per line; in the latter case, specify --target_col 0. Run yoyodyne-predict --help for more information.

Beam search is implemented (currently only for LSTM-based models) and can be enabled by setting --beam_width to a value greater than 1. When using beam search, the outputs are pairs of hypotheses and their associated log-likelihoods.
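
For readers unfamiliar with the technique, here is a generic beam search sketch that returns (hypothesis, log-likelihood) pairs. It is not Yoyodyne's implementation: step_fn, toy_step, and the "<E>" end symbol are hypothetical stand-ins for a trained decoder.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (symbols, log-likelihood)


def beam_search(
    step_fn: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_width: int,
    max_length: int,
    end: str = "<E>",
) -> List[Hypothesis]:
    """Returns up to beam_width hypotheses with their log-likelihoods."""
    beam: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_length):
        candidates: List[Hypothesis] = []
        for prefix, score in beam:
            for symbol, prob in step_fn(prefix).items():
                candidates.append((prefix + (symbol,), score + math.log(prob)))
        # Keeps only the best beam_width extensions.
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        beam = []
        for hyp in candidates[:beam_width]:
            (finished if hyp[0][-1] == end else beam).append(hyp)
        if not beam:
            break
    # Unfinished hypotheses are returned as-is if max_length is reached.
    ranked = sorted(finished + beam, key=lambda pair: pair[1], reverse=True)
    return ranked[:beam_width]


def toy_step(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical model: prefers "a" first, then prefers to end."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    return {"<E>": 0.9, "a": 0.1}


for symbols, loglik in beam_search(toy_step, beam_width=2, max_length=5):
    print(" ".join(symbols), round(loglik, 3))
```

With a beam width of 1 this reduces to greedy decoding, since only the single best extension survives each step.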

Data format

The default data format is a two-column TSV file in which the first column is the source string and the second the target string.

source   target

To enable the use of a features column, one specifies a (non-zero) argument to --features_col, and optionally a --features_sep. For instance, for the SIGMORPHON 2016 shared task data:

source   feat1,feat2,...    target

this format is specified by --features_col 2 --features_sep , --target_col 3.

Alternatively, for the CoNLL-SIGMORPHON 2017 shared task, the first column is the source (a lemma), the second is the target (the inflection), and the third contains semicolon-delimited feature strings:

source   target    feat1;feat2;...

this format is specified by --features_col 3 because ; is the default separator for features.

In order to ensure that targets are ignored during prediction, one can specify --target_col 0.
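
The column conventions above can be sketched as a small parser. This is an illustration only, not Yoyodyne's own data loader: parse_row, Example, and the source_col parameter are hypothetical names, and the sketch assumes 1-indexed columns where 0 means "absent".

```python
from typing import List, NamedTuple, Optional


class Example(NamedTuple):
    source: str
    features: List[str]
    target: Optional[str]


def parse_row(
    line: str,
    source_col: int = 1,
    target_col: int = 2,
    features_col: int = 0,
    features_sep: str = ";",
) -> Example:
    """Parses one TSV row; columns are 1-indexed and 0 means "absent"."""
    cells = line.rstrip("\n").split("\t")
    source = cells[source_col - 1]
    features = cells[features_col - 1].split(features_sep) if features_col else []
    target = cells[target_col - 1] if target_col else None
    return Example(source, features, target)


# SIGMORPHON 2016 layout: source, comma-separated features, target.
print(parse_row("walk\tV,PST\twalked", features_col=2, features_sep=",", target_col=3))
# CoNLL-SIGMORPHON 2017 layout: source, target, semicolon-separated features.
print(parse_row("walk\twalked\tV;PST", features_col=3))
# Prediction on a plain text file: no target column.
print(parse_row("walk", target_col=0))
```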

Reserved symbols

Yoyodyne reserves symbols of the form <...> for internal use. Feature-conditioned models also use [...] to avoid clashes between features symbols and source and target symbols, and --no_tie_embeddings uses {...} to avoid clashes between source and target symbols. Therefore, users should not provide any symbols of the form <...>, [...], or {...}.
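
Before training, one might scan the data for symbols that collide with these reserved forms. The check below is a minimal sketch; the exact matching rule (a symbol wholly wrapped in one of the three bracket pairs) is an assumption, and find_reserved is a hypothetical helper.

```python
import re

# Matches symbols wholly wrapped in <...>, [...], or {...}.
RESERVED = re.compile(r"^(<.*>|\[.*\]|\{.*\})$")


def find_reserved(symbols):
    """Returns the symbols that collide with the reserved forms."""
    return [symbol for symbol in symbols if RESERVED.match(symbol)]


print(find_reserved(["a", "<s>", "[PST]", "{x}", "<", "b>"]))
```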

Model checkpointing

Checkpointing is handled by Lightning. The path for model information, including checkpoints, is specified by --model_dir; each run of an experiment with the same model_dir is namespaced with a new version number under model_dir/lightning_logs/version_n. The following are stored:

the index (model_dir/index.pkl),
the hyperparameters (model_dir/lightning_logs/version_n/hparams.yaml),
the metrics (model_dir/lightning_logs/version_n/metrics.csv), and
the checkpoints (model_dir/lightning_logs/version_n/checkpoints).

By default, each run initializes a new model from scratch, unless the --train_from argument is specified. To continue training from a specific checkpoint, pass the full path to the checkpoint as the --train_from argument. This creates a new version, but starts training from the provided model checkpoint.

By default, one checkpoint is saved. To save more than one checkpoint, use the --num_checkpoints flag. To save a checkpoint every epoch, set --num_checkpoints -1. By default, the checkpoints saved are those which maximize validation accuracy. To instead select checkpoints which minimize validation loss, set --checkpoint_metric loss.
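
The selection rule can be illustrated as follows. In practice Lightning performs this bookkeeping; the function below is only a sketch with hypothetical names (select_checkpoints and the checkpoint file names), showing how the two flags interact.

```python
from typing import Dict, List


def select_checkpoints(
    metrics: Dict[str, Dict[str, float]],
    num_checkpoints: int = 1,
    checkpoint_metric: str = "accuracy",
) -> List[str]:
    """Picks which checkpoints to keep from per-epoch validation metrics."""
    if num_checkpoints == -1:
        return sorted(metrics)  # Keeps every epoch's checkpoint.
    # Higher is better for accuracy; lower is better for loss.
    maximize = checkpoint_metric == "accuracy"
    ranked = sorted(
        metrics,
        key=lambda name: metrics[name][checkpoint_metric],
        reverse=maximize,
    )
    return ranked[:num_checkpoints]


# Hypothetical validation results for three epochs.
history = {
    "epoch=0.ckpt": {"accuracy": 0.5, "loss": 1.2},
    "epoch=1.ckpt": {"accuracy": 0.8, "loss": 0.9},
    "epoch=2.ckpt": {"accuracy": 0.7, "loss": 0.6},
}
print(select_checkpoints(history))
print(select_checkpoints(history, checkpoint_metric="loss"))
print(select_checkpoints(history, num_checkpoints=-1))
```

Note that the best checkpoint by accuracy and the best by loss need not coincide, which is why the choice of --checkpoint_metric matters.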

Models

The user specifies the overall architecture for the model using the --arch flag. The value of this flag specifies the decoder's architecture and whether or not an attention mechanism is present. This flag also specifies a default architecture for the encoder(s), but it is possible to override this with additional flags. Supported values for --arch are:

attentive_gru: This is a GRU decoder with GRU encoders (by default) and an attention mechanism. The initial hidden state is treated as a learned parameter.
attentive_lstm: This is similar to the attentive_gru but instead uses an LSTM decoder and encoder (by default).
gru: This is a GRU decoder with GRU encoders (by default); in lieu of an attention mechanism, the last non-padding hidden state of the encoder is concatenated with the decoder hidden state.
hard_attention_gru: This is a GRU encoder/decoder modeling generation as a Markov process. By default, it assumes a non-monotonic progression over the source string, but with --enforce_monotonic the model must progress over each source character in order. A non-zero value of --attention_context (default: 0) widens the context window for conditioning state transitions to include one or more previous states.
hard_attention_lstm: This is similar to the hard_attention_gru but instead uses an LSTM decoder and encoder (by default). A non-zero value of --attention_context (default: 0) widens the context window for conditioning state transitions to include one or more previous states.
lstm: This is similar to the gru but instead uses an LSTM decoder and encoder (by default).
pointer_generator_gru: This is a GRU decoder with GRU encoders (by default) and a pointer-generator mechanism. Since this model contains a copy mechanism, it may be superior to an ordinary attentive GRU when the source and target vocabularies overlap significantly. Note that this model requires that the number of --encoder_layers and --decoder_layers match.
pointer_generator_lstm: This is similar to the pointer_generator_gru but instead uses an LSTM decoder and encoder (by default).
pointer_generator_transformer: This is similar to the pointer_generator_gru and pointer_generator_lstm but instead uses a transformer decoder and encoder (by default). When using features, the user may wish to specify the number of features attention heads (with --features_attention_heads).
transducer_gru: This is a GRU decoder with GRU encoders (by default) and a neural transducer mechanism. On model creation, expectation maximization is used to learn a sequence of edit operations, and imitation learning is used to train the model to implement the oracle policy, with roll-in controlled by the --oracle_factor flag (default: 1). Since this model assumes monotonic alignment, it may be superior to attentive models when the alignment between input and output is roughly monotonic and when input and output vocabularies overlap significantly.
transducer_lstm: This is similar to the transducer_gru but instead uses an LSTM decoder and encoder (by default).
transformer: This is a transformer decoder with transformer encoders (by default). Sinusoidal positional encodings and layer normalization are used. The user may wish to specify the number of attention heads (with --attention_heads; default: 4).

The --arch flag specifies the decoder type; the user can override default encoder types using the --source_encoder_arch flag and, when features are present, the --features_encoder_arch flag. Valid values are:

feature_invariant_transformer (usually used with --features_encoder_arch): a variant of the transformer encoder used with features; it concatenates source and features and uses a learned embedding to distinguish between source and features symbols.
linear (usually used with --features_encoder_arch): a non-contextual encoder with an affine transformation applied to embeddings.
gru: a GRU encoder.
lstm: an LSTM encoder.
transformer: a transformer encoder.

For all models, the user may also wish to specify:

--decoder_layers (default: 1): number of decoder layers
--embedding (default: 128): embedding size
--encoder_layers (default: 1): number of encoder layers
--hidden_size (default: 512): hidden layer size

By default, RNN-backed (i.e., GRU and LSTM) encoders are bidirectional. One can disable this with the --no_bidirectional flag.

Training options

A non-exhaustive list includes:

Batch size:
- --batch_size (default: 32)
- --accumulate_grad_batches (default: not enabled)
Regularization:
- --dropout (default: 0.2)
- --label_smoothing (default: 0.0)
- --gradient_clip_val (default: not enabled)
Optimizer:
- --learning_rate (default: 0.001)
- --optimizer (default: "adam")
- --beta1 (default: 0.9): $\beta_1$ hyperparameter for the Adam optimizer (--optimizer adam)
- --beta2 (default: 0.99): $\beta_2$ hyperparameter for the Adam optimizer (--optimizer adam)
- --scheduler (default: not enabled)
Duration:
- --max_epochs
- --min_epochs
- --max_steps
- --min_steps
- --max_time
Seeding:
- --seed
Weights & Biases:
- --log_wandb (default: False): enables Weights & Biases tracking; the "project" name can be specified using the environment variable $WANDB_PROJECT.

Additional training options are discussed below.

Early stopping

To enable early stopping, use the --patience and --patience_metric flags. Early stopping occurs after --patience epochs with no improvement (when validation loss stops decreasing if --patience_metric loss, or when validation accuracy stops increasing if --patience_metric accuracy). Early stopping is not enabled by default.
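
The patience rule can be made concrete with a short sketch. This is a simplified illustration rather than the Lightning callback actually used; stopping_epoch is a hypothetical helper, and it assumes that an epoch counts as an improvement only if it strictly beats the best value seen so far.

```python
from typing import List, Optional


def stopping_epoch(
    values: List[float], patience: int, patience_metric: str = "loss"
) -> Optional[int]:
    """Returns the 0-based epoch at which training would stop, if any."""
    best = None
    bad_epochs = 0
    for epoch, value in enumerate(values):
        if patience_metric == "loss":
            improved = best is None or value < best
        else:  # "accuracy"
            improved = best is None or value > best
        if improved:
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None  # Patience was never exhausted.


# Validation loss stops decreasing after epoch 1.
print(stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.7], patience=2))
# Validation accuracy keeps recovering before patience runs out.
print(stopping_epoch([0.5, 0.6, 0.6, 0.7], patience=2, patience_metric="accuracy"))
```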

Schedulers

By default, Yoyodyne uses a constant learning rate during training, but best practice is to gradually decrease learning rate as the model approaches convergence using a scheduler. The following schedulers are supported and are selected with --scheduler:

reduceonplateau: reduces the learning rate (multiplying it by --reduceonplateau_factor) after --reduceonplateau_patience epochs with no improvement (when validation loss stops decreasing if --reduceonplateau_metric loss, or when validation accuracy stops increasing if --reduceonplateau_metric accuracy) until the learning rate is less than or equal to --min_learning_rate.
warmupinvsqrt: linearly increases the learning rate from 0 to --learning_rate for --warmup_steps steps, then decreases learning rate according to an inverse square root schedule.
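
A warmup-plus-inverse-square-root schedule is commonly written as below. The exact formula used by Yoyodyne is not given here, so this is a sketch under the assumption that the decay is scaled to be continuous with the end of the warmup; warmup_inv_sqrt is a hypothetical function name.

```python
import math


def warmup_inv_sqrt(step: int, learning_rate: float, warmup_steps: int) -> float:
    """Learning rate at a given (0-based) optimizer step."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup from 0 to learning_rate.
        return learning_rate * step / warmup_steps
    # Inverse square root decay, equal to learning_rate at step == warmup_steps.
    return learning_rate * math.sqrt(max(warmup_steps, 1) / max(step, 1))


for step in (0, 50, 100, 400, 1600):
    print(step, warmup_inv_sqrt(step, learning_rate=0.001, warmup_steps=100))
```

Under this formulation, quadrupling the step count after warmup halves the learning rate.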

Tied embeddings

By default, the source and target vocabularies are shared. This can be disabled with the flag --no_tie_embeddings, which uses {...} to avoid clashes between source and target symbols.
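
To see why untied embeddings need a marking scheme, consider this sketch of vocabulary construction. It is an illustration only: build_vocabulary is a hypothetical helper, strings are split into characters, and which side is wrapped in {...} is an assumption.

```python
from typing import Iterable, List


def build_vocabulary(
    sources: Iterable[str], targets: Iterable[str], tie_embeddings: bool = True
) -> List[str]:
    """Builds a symbol inventory from character-level source/target strings."""
    source_symbols = {symbol for string in sources for symbol in string}
    target_symbols = {symbol for string in targets for symbol in string}
    if tie_embeddings:
        # One shared entry per symbol, regardless of side.
        return sorted(source_symbols | target_symbols)
    # Assumption: target symbols are the ones wrapped in {...}.
    return sorted(source_symbols) + sorted(f"{{{s}}}" for s in target_symbols)


print(build_vocabulary(["ab"], ["bc"]))
print(build_vocabulary(["ab"], ["bc"], tie_embeddings=False))
```

In the tied case "b" gets a single entry shared by both sides; in the untied case it gets two distinct entries, so the two sides learn separate embeddings for it.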

Batch size tricks

Choosing a good batch size is key to fast training and optimal performance. Batch size is specified by the --batch_size flag.

One may wish to train with a larger batch size than will fit "in core". For example, suppose one wishes to fit with a batch size of 4,096, but this gives an out-of-memory (OOM) exception. Then, with minimal overhead, one could simulate an effective batch size of 4,096 by using batches of size 1,024, accumulating gradients from 4 batches per update:

yoyodyne-train --batch_size 1024 --accumulate_grad_batches 4 ...

The --find_batch_size flag enables automatic computation of the batch size. With --find_batch_size max, it simply uses the maximum batch size, ignoring --batch_size. With --find_batch_size opt, it finds the maximum batch size, and then interprets it as follows:

If the maximum batch size is greater than --batch_size, then --batch_size is used as the batch size.
However, if the maximum batch size is less than --batch_size, it solves for the optimal gradient accumulation trick and uses the largest batch size and the smallest number of gradient accumulation steps whose product is --batch_size.

If one wishes to solve for these quantities without actually training, pass --find_batch_size opt and --max_epochs 0. This will halt after computing and logging the solution.
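
The "opt" rule described above amounts to finding the largest divisor of the requested batch size that fits in memory. The following sketch shows one way to compute it; solve_accumulation is a hypothetical helper, not the function Yoyodyne uses, and the maximum batch size is taken as given rather than measured.

```python
from typing import Tuple


def solve_accumulation(batch_size: int, max_batch_size: int) -> Tuple[int, int]:
    """Returns (per-step batch size, gradient accumulation steps)."""
    if max_batch_size >= batch_size:
        return batch_size, 1
    # Finds the fewest accumulation steps that divide batch_size evenly
    # while keeping each step's batch within the maximum.
    for steps in range(2, batch_size + 1):
        if batch_size % steps == 0 and batch_size // steps <= max_batch_size:
            return batch_size // steps, steps
    return 1, batch_size  # Unreachable when max_batch_size >= 1.


print(solve_accumulation(4096, 1500))
print(solve_accumulation(32, 64))
```

In the first call, batches of 2,048 would exceed the assumed maximum of 1,500, so the sketch settles on 1,024 with 4 accumulation steps.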

Hyperparameter tuning

No neural model should be deployed without proper hyperparameter tuning. However, the default options give reasonable initial settings for an attentive biLSTM. For transformer-based architectures, experiment with multiple encoder and decoder layers, much larger batches, and the warmup-plus-inverse square root decay scheduler.

Weights & Biases tuning

wandb_sweeps shows how to use Weights & Biases to run hyperparameter sweeps.

Accelerators

Hardware accelerators can be used during training or prediction. In addition to CPU (the default) and GPU (--accelerator gpu), other accelerators may also be supported but not all have been tested yet.

Precision

By default, training uses 32-bit precision. However, the --precision flag allows the user to perform training with half precision (16) or with the bfloat16 half precision format if supported by the accelerator. This may reduce the size of the model and batches in memory, allowing one to use larger batches. Note that only default precision is expected to work with CPU training.

Examples

The examples directory contains interesting examples, including:

concatenate provides sample code for concatenating source and features symbols à la Kann & Schütze (2016).
wandb_sweeps shows how to use Weights & Biases to run hyperparameter sweeps.

Related projects

Maxwell is used to learn a stochastic edit distance model for the neural transducer.
Yoyodyne Pretrained provides a similar interface but uses large pre-trained models to initialize the encoder and decoder modules.

For developers

Developers, developers, developers! - Steve Ballmer

This section contains instructions for the Yoyodyne maintainers.

Design

Yoyodyne is beholden to the heavily object-oriented design of Lightning, and wherever possible uses Torch to keep computations on the user-selected accelerator. Furthermore, since it is developed at "low-intensity" by a geographically-dispersed team, consistency is particularly important. Some consistency decisions made thus far:

Abstract class overrides are enforced using abstract base classes (PEP 3119).
numpy is used for basic mathematical operations and constants even in places where the built-in math would do.
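
For maintainers unfamiliar with PEP 3119, the following sketch shows how the standard abc module enforces overrides. The class names are hypothetical and are not part of the Yoyodyne API.

```python
import abc


class BaseEncoder(abc.ABC):
    """Hypothetical base class; not part of the Yoyodyne API."""

    @abc.abstractmethod
    def encode(self, symbols: list) -> list: ...

    @property
    @abc.abstractmethod
    def name(self) -> str: ...


class IdentityEncoder(BaseEncoder):
    """Overrides every abstract member, so it can be instantiated."""

    def encode(self, symbols: list) -> list:
        return list(symbols)

    @property
    def name(self) -> str:
        return "identity"


class BrokenEncoder(BaseEncoder):
    """Forgets to override encode, so it cannot be instantiated."""

    @property
    def name(self) -> str:
        return "broken"


print(IdentityEncoder().encode(["a", "b"]))
try:
    BrokenEncoder()
except TypeError as error:
    print("TypeError:", error)
```

The error is raised at instantiation time rather than when the missing method is first called, which surfaces incomplete subclasses early.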

Models and modules

A model in Yoyodyne is a sequence-to-sequence architecture and inherits from yoyodyne.models.BaseModel. These models in turn consist of ("have-a") one or more encoders responsible for building a numerical representation of the source (and features, where appropriate) and a decoder responsible for predicting the target sequence using the representation generated by the encoders. The encoders and decoder are themselves Torch modules.

The model is responsible for constructing the encoders and decoders. The model dictates the type of decoder; each model has a preferred encoder type as well, though it may work with others. The model communicates with its modules by calling them as functions (which invokes their forward methods); however, in some cases it is also necessary for the model to call ancillary members or methods of its modules. The base.ModuleOutput class is used to capture the output of the various modules, and it is this which is essential to, e.g., abstracting between different kinds of encoders which may or may not have hidden or cell state to return.

When features are present, models are responsible for fusing encoded source and features and do so in a model-specific fashion. For example, ordinary RNNs and transformers concatenate source and features encodings on the length dimension, whereas hard attention and transducer models average the features encoding across the length dimension and then concatenate the resulting tensor with the source encoding on the encoding dimension; by doing so they preserve the source length and make it impossible to attend directly to features symbols.
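
The two fusion strategies differ in which dimension grows. The sketch below illustrates this with plain nested lists standing in for tensors of shape (length, encoding size), omitting the batch dimension; the function names are hypothetical and the real models operate on Torch tensors.

```python
from typing import List

Matrix = List[List[float]]  # (length, encoding size)


def concat_on_length(source: Matrix, features: Matrix) -> Matrix:
    """Appends features encodings after source encodings (length grows)."""
    return source + features


def pool_and_concat_on_encoding(source: Matrix, features: Matrix) -> Matrix:
    """Averages features over length, then widens each source position."""
    size = len(features[0])
    pooled = [sum(row[k] for row in features) / len(features) for k in range(size)]
    return [row + pooled for row in source]


source = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # Length 3, size 2.
features = [[0.0, 2.0], [2.0, 4.0]]  # Length 2, size 2.

fused = concat_on_length(source, features)
print(len(fused), len(fused[0]))  # Length is 3 + 2; size is unchanged.

fused = pool_and_concat_on_encoding(source, features)
print(len(fused), len(fused[0]))  # Length is unchanged; size is 2 + 2.
```

In the second strategy every source position sees the same pooled features vector, so there are no separate features positions left to attend to.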

Decoding strategies

Each model supports greedy decoding implemented via a greedy_decode method; some also support beam decoding via beam_decode. Some models (e.g., the hard attention models) require teacher forcing, but most can be trained with either student or teacher forcing.

Releasing

Create a new branch. E.g., if you want to call this branch "release": git checkout -b release
Sync your fork's branch to the upstream master branch. E.g., if the upstream remote is called "upstream": git pull upstream master
Increment the version field in pyproject.toml.
Stage your changes: git add pyproject.toml
Commit your changes: git commit -m "your commit message here"
Push your changes. E.g., if your branch is called "release": git push origin release
Submit a PR for your release and wait for it to be merged into master.
Tag the master branch's last commit. The tag should begin with v; e.g., if the new version is 3.1.4, the tag should be v3.1.4. This can be done:
- on GitHub itself: click the "Releases" or "Create a new release" link on the right-hand side of the Yoyodyne GitHub page and follow the dialogues.
- from the command-line using git tag.
Build the new release: python -m build
Upload the result to PyPI: twine upload dist/*

References

Kann, K. and Schütze, H. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 555-560.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. 2019. fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48-53.

(See also yoyodyne.bib for more work used during the development of this library.)

Release history

0.5.15: Apr 27, 2026
0.5.14: Apr 23, 2026
0.5.13: Apr 19, 2026
0.5.12: Mar 18, 2026
0.5.11: Mar 15, 2026
0.5.10: Mar 11, 2026
0.5.9: Mar 4, 2026
0.5.8: Feb 28, 2026
0.5.7: Feb 25, 2026
0.5.6: Feb 24, 2026
0.5.5: Feb 23, 2026
0.5.3: Feb 12, 2026
0.5.2: Jan 4, 2026
0.5.1: Jan 3, 2026
0.5.0: Jan 2, 2026
0.4.9: Dec 30, 2025
0.4.8: Dec 30, 2025
0.4.7: Dec 29, 2025
0.4.6: Dec 15, 2025
0.4.5: Dec 10, 2025
0.4.4: Oct 4, 2025
0.4.3: Sep 26, 2025
0.4.2: Sep 26, 2025
0.4.1: Sep 22, 2025
0.4.0: Sep 20, 2025
0.3.3: Jul 15, 2025 (this version)
0.3.2: Jun 20, 2025
0.3.1: Apr 6, 2025
0.3.0: Mar 9, 2025
0.2.20: Feb 14, 2025
0.2.19: Jan 18, 2025
0.2.18: Dec 10, 2024
0.2.17: Dec 2, 2024
0.2.16: Dec 2, 2024
0.2.15: Nov 28, 2024
0.2.14: Oct 31, 2024
0.2.13: Oct 25, 2024
0.2.12: Jul 30, 2024
0.2.11: Jul 8, 2024
0.2.10: May 9, 2024
0.2.9: Mar 6, 2024
0.2.8: Nov 4, 2023
0.2.7: Oct 23, 2023
0.2.6: Oct 3, 2023
0.2.5: Aug 3, 2023
0.2.4: Jul 19, 2023
0.2.3: Jun 30, 2023
0.2.1: Dec 5, 2022

Download files

Source Distribution

yoyodyne-0.3.3.tar.gz (77.1 kB)

Uploaded Jul 15, 2025 Source

Built Distribution

yoyodyne-0.3.3-py3-none-any.whl (88.6 kB)

Uploaded Jul 15, 2025 Python 3

File details

Details for the file yoyodyne-0.3.3.tar.gz.

File metadata

Download URL: yoyodyne-0.3.3.tar.gz
Upload date: Jul 15, 2025
Size: 77.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for yoyodyne-0.3.3.tar.gz
Algorithm	Hash digest
SHA256	`2351b2164ebb87379074d8b64b9306a3998c4b970917163f8802f72062f35f22`
MD5	`550ca3250483c78457a69f546ced722a`
BLAKE2b-256	`f0c77691fe35518b5713b2f6a93f49b85d2c922effbecc5e2cd8f45435874920`

File details

Details for the file yoyodyne-0.3.3-py3-none-any.whl.

File metadata

Download URL: yoyodyne-0.3.3-py3-none-any.whl
Upload date: Jul 15, 2025
Size: 88.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for yoyodyne-0.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`58039cba792fb851b81af219ab6df3720dd504b6a5e6302e6f832a1f62da903c`
MD5	`a1cfc90f2bb041dbf3b8b5c8c67408a1`
BLAKE2b-256	`9cac8b2ed2044837d613f1ef3ddfedb074ac3d66d6aa7017c2554c78f9f3bba5`
