Neural language modeling and sequence-to-sequence transduction in PyTorch.
Project description
Rau (rhymes with “now”) is a Python module and command-line tool that provides PyTorch implementations of neural network-based language modeling and sequence-to-sequence generation. It is primarily suited for academic researchers. Out of the box, it provides implementations of the simple recurrent neural network (RNN), long short-term memory (LSTM), and transformer architectures. It includes extensible Python APIs and command-line tools for data preprocessing, training, and evaluation. It is easy to get started quickly with the command-line tools if you provide your data as pretokenized (space-separated) plaintext.
There are, of course, many excellent implementations of neural network training out in the world for you to choose from, but I think Rau does have some neat features and avoids some pitfalls that I have not seen handled well in other codebases. To see if using Rau is a good idea for you, see Technical Details, Features, and Limitations.
Rau has been used in the following papers:
Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, Brian DuSell. Training Neural Networks as Recognizers of Formal Languages. ICLR 2025.
Taiga Someya, Anej Svete, Brian DuSell, Timothy J. O’Donnell, Mario Giulianelli, Ryan Cotterell. Information Locality as an Inductive Bias for Neural Language Models. ACL 2025.
Rau is based on code that was originally used for adding stack data structures to LSTMs and transformers. A version of Rau that includes implementations of stack-augmented neural network architectures can be found on this branch.
How to Use Rau
Rau can be used to train neural networks from scratch on pretokenized data, save them to disk, and re-load them to process pretokenized data. For example, it can be used to train a transformer language model and then calculate its perplexity on a set of test data. Or, it can be used to train a transformer encoder-decoder on sequence-to-sequence data and then translate a set of source sequences to outputs using beam search.
For language modeling, data simply needs to be formatted with one example sequence per line, where each example is a sequence of whitespace-separated tokens. For sequence-to-sequence transduction, it is formatted the same way, but with separate files for source and target sequences. Rau will automatically take care of building a vocabulary mapping token strings to integer IDs, adding BOS and EOS symbols as needed, grouping examples into batches, adding padding, and undoing all of these transformations at inference time. The end user only needs to deal with sequences of whitespace-separated tokens.
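For example, a language modeling main.tok might look like this, with made-up sentences shown purely to illustrate the one-sequence-per-line, whitespace-tokenized format:

the walrus does giggle .
my unicorns near the yak do not move .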
Quickstart
Clone the repo:
git clone git@github.com:bdusell/rau.git
cd rau
Optional: To simplify software installation, set up the pre-made Docker container defined in this repo. To do this, install Docker and the NVIDIA Container Toolkit (for GPU support), then run this script in order to start a shell inside of the container:
bash scripts/docker_shell.bash --build
(If you don’t have an NVIDIA GPU, don’t install the NVIDIA Container Toolkit, and run the above command with the flag --cpu added.)
Install Python dependencies. This can be done by installing the package manager Poetry (it’s already installed in the Docker container) and running this script:
bash scripts/install_python_packages.bash
Start a shell inside the Python virtual environment using Poetry:
bash scripts/poetry_shell.bash
You are now ready to use Rau.
Below are some quick examples of setting up pipelines for language modeling and sequence-to-sequence transduction. We will use the pretokenized datasets from McCoy et al., 2020; their simplicity and small size make them convenient for our purposes.
Language Modeling
For this example, we’ll train a transformer language model on simple declarative sentences in English (the data comes from the source side of the question formation task of McCoy et al., 2020).
Download the dataset:
mkdir language-modeling-example
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.train | sed 's/[a-z]\+\t.*//' > language-modeling-example/main.tok
mkdir language-modeling-example/datasets
mkdir language-modeling-example/datasets/validation
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.dev | sed 's/[a-z]\+\t.*//' > language-modeling-example/datasets/validation/main.tok
mkdir language-modeling-example/datasets/test
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.test | sed 's/[a-z]\+\t.*//' > language-modeling-example/datasets/test/main.tok
Note the directory structure:
language-modeling-example/: a directory representing this dataset
main.tok: the pretokenized training set
datasets/: additional datasets that should be processed with the training set’s vocabulary
validation/
main.tok: the pretokenized validation set
test/
main.tok: the pretokenized test set
Now, “prepare” the data by figuring out the vocabulary of the training data and converting all tokens to integers:
python src/rau/tasks/language_modeling/prepare_data.py \
    --training-data language-modeling-example \
    --more-data validation \
    --more-data test \
    --never-allow-unk
The flag --training-data refers to the directory containing our dataset. The flag --more-data indicates the name of a directory under language-modeling-example/datasets to prepare using the vocabulary of the training data. The flag --never-allow-unk indicates that the training data does not contain a designated unknown (UNK) token, and out-of-vocabulary tokens should be treated as errors at inference time.
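Conceptually, preparation just builds a token-to-integer mapping from the training file and encodes each line as a sequence of IDs. The sketch below illustrates that idea in plain Python; the actual implementation and on-disk format of prepare_data.py differ:

# Illustrative sketch only: build a token-to-integer mapping from the training
# file and encode each line as a list of IDs. This is not Rau's actual
# implementation or output format.
def build_vocab(path):
    vocab = {}
    with open(path) as f:
        for line in f:
            for token in line.split():
                vocab.setdefault(token, len(vocab))
    return vocab

def encode(path, vocab):
    with open(path) as f:
        return [[vocab[token] for token in line.split()] for line in f]

vocab = build_vocab('language-modeling-example/main.tok')
train_ids = encode('language-modeling-example/main.tok', vocab)
print(len(vocab), 'token types')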
Note the new files generated:
language-modeling-example/
main.prepared: the prepared training set
main.vocab: the vocabulary of the training data
datasets/
validation/
main.prepared: the prepared validation set
test/
main.prepared: the prepared test set
Now, train a transformer language model:
python src/rau/tasks/language_modeling/train.py \
    --training-data language-modeling-example \
    --architecture transformer \
    --num-layers 6 \
    --d-model 64 \
    --num-heads 8 \
    --feedforward-size 256 \
    --dropout 0.1 \
    --init-scale 0.1 \
    --max-epochs 10 \
    --max-tokens-per-batch 2048 \
    --optimizer Adam \
    --initial-learning-rate 0.01 \
    --gradient-clipping-threshold 5 \
    --early-stopping-patience 2 \
    --learning-rate-patience 1 \
    --learning-rate-decay-factor 0.5 \
    --examples-per-checkpoint 50000 \
    --output saved-language-model
This saves a transformer language model to the directory saved-language-model.
Finally, calculate the perplexity of this language model on the test set:
python src/rau/tasks/language_modeling/evaluate.py \
    --load-model saved-language-model \
    --training-data language-modeling-example \
    --input test \
    --batching-max-tokens 2048
Sequence-to-Sequence
For this example, we’ll train a transformer encoder-decoder on the question formation task of McCoy et al. (2020), which involves converting a declarative sentence in English to question form.
Download the dataset:
mkdir sequence-to-sequence-example
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.train > sequence-to-sequence-example/train.tsv
cut -f 1 < sequence-to-sequence-example/train.tsv > sequence-to-sequence-example/source.tok
cut -f 2 < sequence-to-sequence-example/train.tsv > sequence-to-sequence-example/target.tok
mkdir sequence-to-sequence-example/datasets
mkdir sequence-to-sequence-example/datasets/validation
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.dev > sequence-to-sequence-example/validation.tsv
cut -f 1 < sequence-to-sequence-example/validation.tsv > sequence-to-sequence-example/datasets/validation/source.tok
cut -f 2 < sequence-to-sequence-example/validation.tsv > sequence-to-sequence-example/datasets/validation/target.tok
mkdir sequence-to-sequence-example/datasets/test
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.test | head -100 > sequence-to-sequence-example/test.tsv
cut -f 1 < sequence-to-sequence-example/test.tsv > sequence-to-sequence-example/datasets/test/source.tok
cut -f 2 < sequence-to-sequence-example/test.tsv > sequence-to-sequence-example/datasets/test/target.tok
rm sequence-to-sequence-example/{train,validation,test}.tsv
Note the directory structure:
sequence-to-sequence-example/: a directory representing this dataset
source.tok: the source side of the pretokenized training set
target.tok: the target side of the pretokenized training set
datasets/: additional datasets that should be processed with the training set’s vocabulary
validation/
source.tok: the source side of the pretokenized validation set
target.tok: the target side of the pretokenized validation set
test/
source.tok: the source side of the pretokenized test set
target.tok: the target side of the pretokenized test set
Now, “prepare” the data by figuring out the vocabulary of the training data and converting all tokens to integers:
python src/rau/tasks/sequence_to_sequence/prepare_data.py \
    --training-data sequence-to-sequence-example \
    --vocabulary-types shared \
    --more-data validation \
    --more-source-data test \
    --never-allow-unk
The flag --training-data refers to the directory containing our dataset. The flag --vocabulary-types shared means that the script will generate a single vocabulary that is shared by both the source and target sides. This makes it possible to tie source and target embeddings. The flag --more-data indicates the name of a directory under sequence-to-sequence-example/datasets to prepare using the vocabulary of the training data (both the source and target sides will be prepared). The flag --more-source-data does the same thing, but it only prepares the source side (only the source side is necessary for generating translations on a test set). The flag --never-allow-unk indicates that the training data does not contain a designated unknown (UNK) token, and out-of-vocabulary tokens should be treated as errors at inference time.
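As a rough illustration of why a shared vocabulary enables embedding tying, tying is just weight sharing in PyTorch, which only makes sense when source and target tokens index into the same vocabulary. The sizes below are hypothetical, and this is not Rau's actual code:

import torch.nn as nn

vocab_size, d_model = 1000, 64  # hypothetical sizes
shared_embedding = nn.Embedding(vocab_size, d_model)
# With a shared vocabulary, the encoder input embedding, decoder input
# embedding, and output projection can all point at the same weight matrix.
output_projection = nn.Linear(d_model, vocab_size, bias=False)
output_projection.weight = shared_embedding.weight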
Note the new files generated:
sequence-to-sequence-example/
source.shared.prepared
target.shared.prepared
shared.vocab: a shared vocabulary of tokens that appear in either the source or target side of the training set
datasets/
validation/
source.shared.prepared
target.shared.prepared
test/
source.shared.prepared
target.shared.prepared
Now, train a transformer encoder-decoder model:
python src/rau/tasks/sequence_to_sequence/train.py \
    --training-data sequence-to-sequence-example \
    --vocabulary-type shared \
    --num-encoder-layers 6 \
    --num-decoder-layers 6 \
    --d-model 64 \
    --num-heads 8 \
    --feedforward-size 256 \
    --dropout 0.1 \
    --init-scale 0.1 \
    --max-epochs 10 \
    --max-tokens-per-batch 2048 \
    --optimizer Adam \
    --initial-learning-rate 0.01 \
    --label-smoothing-factor 0.1 \
    --gradient-clipping-threshold 5 \
    --early-stopping-patience 2 \
    --learning-rate-patience 1 \
    --learning-rate-decay-factor 0.5 \
    --examples-per-checkpoint 50000 \
    --output saved-sequence-to-sequence-model
This saves a model to the directory saved-sequence-to-sequence-model.
Finally, translate the source sequences in the test data using beam search:
python src/rau/tasks/sequence_to_sequence/translate.py \
    --load-model saved-sequence-to-sequence-model \
    --input sequence-to-sequence-example/datasets/test/source.shared.prepared \
    --beam-size 4 \
    --max-target-length 50 \
    --batching-max-tokens 256 \
    --shared-vocabulary-file sequence-to-sequence-example/shared.vocab
Technical Details
This section is for people who want to understand the low-level details of Rau, including details of the neural network architectures, training algorithm, and decoding algorithms. This may be useful for researchers who need to be mindful of these details and describe them in their papers, or for people who are just deciding if Rau is up to snuff.
All language models and decoders operate exclusively on whole sequences ending in EOS, without truncation, and without assigning any probability to tokens that cannot be generated, namely padding and BOS. This means that, mathematically, Rau’s language models always define tight language models, i.e., probability distributions over the set of all strings of tokens. Training examples are never truncated, split across multiple minibatches, or shifted to different positions. This is in contrast to other setups that treat the training data as one long sequence and split it into chunks of fixed size.
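Concretely, this means each model defines the usual autoregressive factorization over a whole string \(w = w_1 \cdots w_n\) followed by EOS (shown here for reference):

\[ p(w) = p(\text{EOS} \mid w_1 \cdots w_n) \prod_{t=1}^{n} p(w_t \mid w_1 \cdots w_{t-1}) \]

and these probabilities sum to 1 over the set of all finite strings, with no mass assigned to padding or BOS.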
The RNN and LSTM use learned initial hidden states.
During training, checkpoints are taken every \(N\) examples, where \(N\) can be configured by --examples-per-checkpoint. At each checkpoint, the model is evaluated on the validation set. The model’s performance on the validation set controls the learning rate schedule and early stopping.
When training ends, the parameters of the best checkpoint have been saved to disk.
Parameters can be optimized using either simple gradient descent or Adam. This can be configured with --optimizer.
An initial learning rate can be set with --initial-learning-rate. The learning rate is reduced every time the validation performance does not improve after a certain number of epochs, which can be configured with --learning-rate-patience. It is reduced by multiplying it by a number in \((0, 1)\), which can be configured with --learning-rate-decay-factor.
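This is the familiar reduce-on-plateau pattern. For illustration, equivalent behavior can be expressed with PyTorch's ReduceLROnPlateau scheduler; whether Rau uses that class internally is not stated here, so treat this as an analogy:

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Placeholder model and optimizer; mirrors --initial-learning-rate 0.01.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Multiply the learning rate by 0.5 (--learning-rate-decay-factor 0.5) whenever
# the validation metric fails to improve for 1 evaluation in a row
# (--learning-rate-patience 1).
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=1)

for checkpoint in range(10):
    validation_loss = 1.0  # placeholder; compute the real validation loss here
    scheduler.step(validation_loss)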
Training stops early when the validation performance does not improve after a certain number of epochs, which can be configured with --early-stopping-patience.
Features
Provides a flexible Python API for building neural network architectures by composing simpler ones. In particular, it provides an abstract base class called Unidirectional that represents a unidirectional sequential neural network, which makes it effortless to modify or compose sequential neural network architectures. The Unidirectional class supports both timestep-parallel training and autoregressive decoding. If you have two Unidirectional models that support both of these modes, you can compose them into a model that feeds the outputs of the first model as inputs to the second, and the composite model will also support both modes efficiently, for free. See Composable Neural Networks.
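To illustrate the idea only (the class and method names below are hypothetical and are not Rau's actual Unidirectional interface), composing two stepwise sequence models might look like this:

import torch
import torch.nn as nn

# Conceptual sketch: two sequential models composed so that the outputs of the
# first feed the second, with a single stepwise interface for the composite.
# The method names (initial_state, step) are made up for this sketch.
class TanhRNNLayer(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.cell = nn.RNNCell(size, size)
        self.h0 = nn.Parameter(torch.zeros(size))

    def initial_state(self, batch_size):
        return self.h0.expand(batch_size, -1)

    def step(self, h, x_t):
        h = self.cell(x_t, h)
        return h, h  # output, next state

class Composed(nn.Module):
    """Feeds the outputs of `first` into `second`, timestep by timestep."""
    def __init__(self, first, second):
        super().__init__()
        self.first, self.second = first, second

    def initial_state(self, batch_size):
        return (self.first.initial_state(batch_size),
                self.second.initial_state(batch_size))

    def step(self, state, x_t):
        s1, s2 = state
        y_t, s1 = self.first.step(s1, x_t)
        z_t, s2 = self.second.step(s2, y_t)
        return z_t, (s1, s2)

model = Composed(TanhRNNLayer(8), TanhRNNLayer(8))
state = model.initial_state(batch_size=4)
out, state = model.step(state, torch.zeros(4, 8))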
The RNN and LSTM use learned initial hidden states.
None of the architectures have upper limits on sequence length. This includes the transformer, which uses sinusoidal positional encodings that can be extended arbitrarily. You can train on short sequences and evaluate on arbitrarily long sequences.
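Sinusoidal positional encodings are a closed-form function of position, so they can be computed on the fly for any length. The snippet below is the standard formulation from Vaswani et al. (2017); it is an illustration, not necessarily identical to what Rau computes internally:

import math
import torch

def sinusoidal_positional_encoding(length, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(length, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Works for any length, so a model trained on short sequences can still be
# evaluated on much longer ones.
pe = sinusoidal_positional_encoding(length=1000, d_model=64)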
PyTorch uses two bias terms in the recurrent layers of the RNN and LSTM. However, only one is required, and the second one is redundant. Including the second term only serves to effectively double the learning rate of the bias term at the cost of adding additional parameters to the model. This means that RNNs and LSTMs can have speciously high parameter counts, which is undesirable if you are trying to compare different models with comparable parameter counts. Rau takes care to remove these redundant bias parameters, resulting in better parameter counts.
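The redundancy is easy to see in stock PyTorch: each recurrent layer computes W_ih x + b_ih + W_hh h + b_hh, and the two bias vectors only ever appear as the sum b_ih + b_hh. The snippet below simply counts the duplicated parameters in a plain nn.LSTM; it says nothing about how Rau removes them internally:

import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=64, num_layers=1)
# PyTorch keeps two bias vectors per layer, bias_ih_l0 and bias_hh_l0, each of
# size 4 * hidden_size. Mathematically they only appear as their sum, so one
# of them is redundant.
redundant = sum(
    p.numel() for name, p in lstm.named_parameters() if name.startswith('bias_hh')
)
print(redundant)  # 256 parameters that add nothing to the model's expressivity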
Supports minibatching with padding. For the sake of efficiency, Rau groups sequences of similar length together to reduce the number of padding tokens, and it enforces upper limits on the number of tokens in a minibatch.
Padding is handled correctly, in the sense that there is mathematically no difference between processing \(N\) sequences in a single minibatch with padding and processing the same \(N\) sequences individually while accumulating their gradients. Minibatching is simply an implementation detail that increases speed.
Padding tokens do not take up space in the vocabulary or in the embedding matrix of the model. That is, there is no integer ID in the vocabulary that is devoted to padding. Instead, Rau dynamically figures out integer IDs to use for padding that don’t conflict with other tokens. They are an implementation detail that is entirely hidden from the user. Language models and decoders never assign probability to padding tokens and are unaware that padding tokens exist.
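As an illustration of the general principle (plain PyTorch, not a description of Rau's internals), masking padded positions out of the loss makes the choice of placeholder ID at those positions irrelevant, so no vocabulary slot needs to be reserved for padding:

import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(2, 5, vocab_size)           # (batch, time, vocab)
targets = torch.randint(0, vocab_size, (2, 5))   # padded positions may hold any ID
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]], dtype=torch.bool)  # second sequence is shorter

# Cross-entropy over all positions, then zero out the padded ones, so the
# result is identical to processing each sequence separately.
loss_per_token = F.cross_entropy(
    logits.transpose(1, 2), targets, reduction='none'
)
loss = (loss_per_token * mask).sum() / mask.sum()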
Everything is efficiently vectorized and supports both CPU and GPU modes.
Rau is very fast for small model sizes and small dataset sizes, even on CPU. An example of a “small” language modeling experiment would be a model with about 128k parameters and a dataset of about 100k sequences up to length 40. Rau can train hundreds of small models to convergence in under 20 minutes on a scientific computing cluster using only CPU nodes—no GPUs! This is very useful for researchers who train neural networks on small, synthetic experiments.
It is not tied to a particular tokenization algorithm, because it does not implement tokenization at all. It is compatible with datasets preprocessed by external tokenization tools, such as SentencePiece.
The dataset format is deliberately simple: plaintext consisting of one sequence per line, where each sequence consists of whitespace-separated tokens.
Offers different ways of handling UNK tokens. You can declare a particular token, such as <unk>, to represent a catch-all UNK token. Or, you can disable UNK tokens entirely and treat out-of-vocabulary tokens as errors.
Beam search is parallelized across beam elements (but not minibatch elements).
Beam search terminates as soon as EOS is the top beam element, rather than waiting for the beam to fill up with EOS. This is correct because a beam element can never have a descendant with higher probability than itself. The latter approach is only required if scores can increase, e.g., when using certain kinds of length normalization.
Composable Neural Networks
This section is yet to be written.
Limitations
The only tasks implemented are language modeling and sequence-to-sequence generation. Generation from language models has not been implemented, although it might be in the future.
The only architectures available for language modeling are the simple RNN, LSTM, and transformer.
The only architecture available for sequence-to-sequence generation is the transformer.
The only algorithm currently implemented for generating outputs is beam search. In the future, other generation algorithms such as ancestral sampling, greedy decoding, and constrained ancestral sampling may be added.
Beam search is not parallelized across minibatch elements.
Due to limitations in the API for PyTorch’s transformer implementation, decoding for transformers is very inefficient. At every step of decoding, all of the hidden representations are re-computed from scratch, and the model generates outputs for all previous timesteps, even though only the most recent one is needed. It does not implement what is commonly called “KV caching.” The only things that are cached are the input embeddings. This might be fixed in the future.
It does not include tokenization and detokenization in the pipeline. You need to handle tokenization and detokenization yourself.
It slurps the entire training set into memory during training, so it will run out of memory on large datasets (~1m sequences). This might be fixed in the future.
Training cannot be stopped and restarted, so it cannot recover from crashes. This feature might be added in the future.
What does the name “Rau” mean?
The name is pronounced /ɹaʊ/ (rhymes with “now”). It’s named after a magical mask that gives the person who wears it the ability to translate languages.