
Sequence-to-Sequence Framework in Pytorch


License: MIT Python 3.6

nmtpytorch allows training of various end-to-end neural architectures including, but not limited to, neural machine translation, image captioning and automatic speech recognition systems. The initial codebase was written in Theano and was inspired by the well-known dl4mt-tutorial codebase.

nmtpytorch is mainly developed by the Language and Speech Team of Le Mans University but receives valuable contributions from the Grounded Sequence-to-sequence Transduction Team of Frederick Jelinek Memorial Summer Workshop 2018:

Loic Barrault, Ozan Caglayan, Amanda Duarte, Desmond Elliott, Spandana Gella, Nils Holzenberger, Chirag Lala, Jasmine (Sun Jae) Lee, Jindřich Libovický, Pranava Madhyastha, Florian Metze, Karl Mulligan, Alissa Ostapenko, Shruti Palaskar, Ramon Sanabria, Lucia Specia and Josiah Wang.

If you use nmtpytorch, you may want to cite the following paper:

  author    = {Ozan Caglayan and
               Mercedes Garc\'{i}a-Mart\'{i}nez and
               Adrien Bardet and
               Walid Aransa and
               Fethi Bougares and
               Lo\"{i}c Barrault},
  title     = {NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems},
  journal   = {Prague Bull. Math. Linguistics},
  volume    = {109},
  pages     = {15--28},
  year      = {2017},
  url       = {},
  doi       = {10.1515/pralin-2017-0035},
  timestamp = {Tue, 12 Sep 2017 10:01:08 +0100}


nmtpytorch currently requires python>=3.6 and torch==0.4.1. We are not planning to support Python 2.x.

IMPORTANT: After installing nmtpytorch, you need to run nmtpy-install-extra to download METEOR related files into your ${HOME}/.nmtpy folder. This step is only required once.


You can install nmtpytorch from PyPI using pip (or pip3 depending on your operating system and environment):

$ pip install nmtpytorch

This will automatically fetch and install the dependencies as well. For the torch dependency, it will specifically install the torch 0.4.1 package from PyPI, which bundles CUDA 9.0.


We provide an environment.yml file in the repository that you can use to create a ready-to-use anaconda environment for nmtpytorch:

$ conda update --all
$ git clone
$ conda env create -f nmtpytorch/environment.yml

Development Mode

For continuous development and testing, it is sufficient to run python setup.py develop in the root folder of your Git checkout. From then on, all modifications to the source tree are picked up directly without requiring reinstallation.


We currently only provide some preliminary documentation in our wiki.

Release Notes

v3.0.0 (05/10/2018)

Major release that brings support for Pytorch 0.4 and drops support for 0.3.

Training and testing on CPUs are now supported thanks to the easier device semantics of Pytorch 0.4: just pass -d cpu to nmtpy to switch to CPU mode. NOTE: Training on CPUs only makes sense for debugging; otherwise it is very slow.

  • NOTE: device_id is no longer a configuration option. It should be removed from your old configurations.
  • Multi-GPU is not supported. Always restrict nmtpy to a single GPU using the CUDA_VISIBLE_DEVICES environment variable.

You can now override the config options used to train a model during inference: Example: nmtpy translate (...) -x model.att_temp:0.9
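The override string above reads as section.option:value. As a rough illustration only, the sketch below shows how such a string could be applied to a nested dictionary of options. The helper name apply_override and the dictionary layout are hypothetical and are not taken from nmtpytorch's code.

```python
import ast


def apply_override(config, override):
    """Apply a 'section.option:value' string to a nested dict (hypothetical helper)."""
    # Split only on the first ':' so that values may themselves contain colons.
    key, _, raw_value = override.partition(":")
    section, _, option = key.partition(".")
    if not section or not option or not raw_value:
        raise ValueError(f"Malformed override: {override!r}")
    try:
        # Interpret numbers, booleans, lists, etc. as Python literals.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        # Anything that is not a literal is kept as a plain string.
        value = raw_value
    config.setdefault(section, {})[option] = value
    return config


# Assumed layout: one dict per configuration section.
config = {"model": {"att_temp": 1.0, "sampler_type": "bucket"}}
apply_override(config, "model.att_temp:0.9")
apply_override(config, "model.sampler_type:approximate")
```

After the two calls, config["model"] holds the float 0.9 for att_temp and the string "approximate" for sampler_type.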

nmtpy train now detects invalid/old [train] options and refuses to train the model.

New sampler: ApproximateBucketBatchSampler. It is similar to the default BucketBatchSampler but more efficient for sparsely distributed sequence lengths, as in speech recognition. It bins items of similar length into buckets, so it no longer guarantees that batches consist entirely of same-length sequences. Encoders therefore have to support packing/padding/masking: TextEncoder already does this automatically, while the speech encoder BiLSTMp does not handle padded positions.

EXPERIMENTAL: You can decode an ASR system using the approximate sampler even though the model does not take care of the padded positions (a warning will be printed at each batch). The degradation was 0.2% WER on the one dataset that we tried. So although the computations in the encoder become noisy and not entirely correct, the model handles this noise quite robustly:

$ nmtpy translate -s val -o hyp -x model.sampler_type:approximate best_asr.ckpt

This type of batching cuts ASR decoding time almost by a factor of 2-3.
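The binning idea behind the approximate sampler can be sketched in a few lines. This is only an illustration of length-based bucketing: the function name, the bucket_width parameter and the fixed-width binning rule are assumptions, and the actual ApproximateBucketBatchSampler may bin items differently.

```python
from collections import defaultdict


def approximate_buckets(lengths, bucket_width, batch_size):
    """Group item indices into batches of similar (not identical) length."""
    buckets = defaultdict(list)
    for idx, length in enumerate(lengths):
        # Items whose lengths fall into the same interval share a bucket.
        buckets[length // bucket_width].append(idx)

    batches = []
    for key in sorted(buckets):
        items = buckets[key]
        # Slice each bucket into batches of at most batch_size items.
        for start in range(0, len(items), batch_size):
            batches.append(items[start:start + batch_size])
    return batches


# Hypothetical frame counts for six utterances.
lengths = [101, 98, 250, 103, 247, 99]
batches = approximate_buckets(lengths, bucket_width=10, batch_size=2)
print(batches)
```

Because a batch may now mix lengths such as 101 and 103, the shorter items have to be padded, which is why the encoder needs to mask or pack the padded positions.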

Other changes

  • Vocabularies generated by nmtpy-build-vocab now contain frequency information as well. The code is backward-compatible with old vocab files.
  • Batch objects should now be explicitly moved to the allocated device using .device() method. See and test_performance() from the NMT model.
  • Training no longer shows the cached GPU allocation from nvidia-smi output as it was in the end a hacky thing to call nvidia-smi periodically. We plan to use torch.cuda.* to get an estimate on memory consumption.
  • NOTE: Multi-process data loading is temporarily disabled as it was crashing from time to time so num_workers > 0 does not have an effect in this release.
  • Attention is separated into DotAttention and MLPAttention and a convenience function get_attention() is provided to select between them during model construction.
  • get_activation_fn() should be used to select between non-linearities dynamically instead of doing getattr(nn.functional, activ). The latter will not work for tanh and sigmoid in the next Pytorch releases.
  • Simplification: ASR model is now derived from NMT.

v2.0.0 (26/09/2018)

  • Ability to install through pip.
  • Advanced layers are now organized into subfolders.
  • New basic layers: Convolution over sequence, MaxMargin.
  • New attention layers: Co-attention, multi-head attention, hierarchical attention.
  • New encoders: Arbitrary sequence-of-vectors encoder, BiLSTMp speech feature encoder.
  • New decoders: Multi-source decoder, switching decoder, vector decoder.
  • New datasets: Kaldi dataset (.ark/.scp reader), Shelve dataset, Numpy sequence dataset.
  • Added learning rate annealing: See lr_decay* options in
  • Removed subword-nmt and METEOR files from repository. We now depend on the PIP package for subword-nmt. For METEOR, nmtpy-install-extra should be launched after installation.
  • More multi-task and multi-input/output translate and training regimes.
  • New early-stopping metrics: Character and word error rate (cer,wer) and ROUGE (rouge).
  • Curriculum learning option for the BucketBatchSampler, i.e. length-ordered batches.
  • New models:
    • ASR: Listen-attend-and-spell like automatic speech recognition
    • Multitask*: Experimental multi-tasking & scheduling between many inputs/outputs.
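For reference, the character and word error rates mentioned among the new early-stopping metrics are conventionally defined as the edit distance between hypothesis and reference, normalised by the reference length. The sketch below shows that standard definition. It is not nmtpytorch's own implementation, and the function names are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by the reference length, as a percentage."""
    if not ref_tokens:
        raise ValueError("Reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


ref = "the cat sat on the mat"
hyp = "the cat sit on mat"
wer = error_rate(ref.split(), hyp.split())  # word level
cer = error_rate(list(ref), list(hyp))      # character level
```

The same routine serves both metrics; only the tokenisation (words versus characters) changes.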

v1.4.0 (09/05/2018)

  • Add environment.yml for easy installation using conda. You can now create a ready-to-use conda environment by just calling conda env create -f environment.yml.
  • Make NumpyDataset memory efficient by keeping float16 arrays as they are until batch creation time.
  • Rename Multi30kRawDataset to Multi30kDataset which now supports both raw image files and pre-extracted visual features file stored as .npy.
  • Add CNN feature extraction script under scripts/.
  • Add doubly stochastic attention to ShowAttendAndTell and multimodal NMT.
  • New model MNMTDecinit to initialize decoder with auxiliary features.
  • New model AMNMTFeatures which is the attentive MMT but with features file instead of end-to-end feature extraction which was memory hungry.

v1.3.2 (02/05/2018)

  • Updates to ShowAttendAndTell model.

v1.3.1 (01/05/2018)

  • Removed old Multi30kDataset.
  • Sort batches by source sequence length instead of target.
  • Fix ShowAttendAndTell model. It should now work.

v1.3 (30/04/2018)

  • Added Multi30kRawDataset for training end-to-end systems from raw images as input.
  • Added NumpyDataset to read .npy/.npz tensor files as input features.
  • You can now pass -S to nmtpy train to produce shorter experiment file names that do not include all the hyperparameters.
  • New post-processing filter option de-spm for Google SentencePiece (SPM) processed files.
  • sacrebleu is now a dependency as it is now accepted as an early-stopping metric. It only makes sense to use it with SPM processed files since they are detokenized once post-processed.
  • Added sklearn as a dependency for some metrics.
  • Added momentum and nesterov parameters to [train] section for SGD.
  • ImageEncoder layer is improved in many ways. Please see the code for further details.
  • Added unmerged upstream PR for ModuleDict() support.
  • METEOR will now fall back to English if the language cannot be detected from file suffixes.
  • -f now produces a separate numpy file for token frequencies when building vocabulary files with nmtpy-build-vocab.
  • Added new command nmtpy test for non beam-search inference modes.
  • Removed nmtpy resume command and added pretrained_file option for [train] to initialize model weights from a checkpoint.
  • Added freeze_layers option for [train] to give comma-separated list of layer name prefixes to freeze.
  • Improved seeding: seed is now printed in order to reproduce the results.
  • Added IPython notebook for attention visualization.
  • Layers
    • New shallow SimpleGRUDecoder layer.
    • TextEncoder: Ability to set maxnorm and gradscale of embeddings and work with or without sorted-length batches.
    • ConditionalDecoder: Make it work with GRU/LSTM, allow setting maxnorm/gradscale for embeddings.
    • ConditionalMMDecoder: Same as above.
  • nmtpy translate
    • --avoid-double and --avoid-unk removed for now.
    • Added Google's length penalty normalization switch --lp-alpha.
    • Added ensembling, which is enabled automatically if you give more than one model checkpoint.
  • New machine learning metric wrappers in utils/
    • Label-ranking average precision lrap
    • Coverage error
    • Mean reciprocal rank
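The --lp-alpha switch listed under nmtpy translate above refers to the length penalty from Google's NMT system (Wu et al., 2016), which divides a hypothesis log-probability by ((5 + |Y|) / 6) ** alpha. The sketch below assumes that --lp-alpha maps directly to alpha in that formula. The function names and the example scores are hypothetical, and the exact computation inside nmtpy translate may differ.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty: ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob, length, alpha):
    """Divide a hypothesis log-probability by its length penalty."""
    return log_prob / length_penalty(length, alpha)


# Two hypothetical beam hypotheses: (sum of token log-probs, length).
short_hyp = (-4.0, 4)
long_hyp = (-5.0, 10)

for alpha in (0.0, 1.0):
    scores = [normalized_score(lp, n, alpha) for lp, n in (short_hyp, long_hyp)]
    print(alpha, scores)
```

With alpha set to 0 the penalty is 1 and the shorter hypothesis wins on raw log-probability; with alpha set to 1 the longer hypothesis obtains the better normalised score, which is the intended effect of the penalty.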

v1.2 (20/02/2018)

  • You can now use $HOME and $USER in your configuration files.
  • Fixed an overflow error that would cause NMT with more than 255 tokens to fail.
  • METEOR worker process is now correctly killed after validations.
  • Multiple runs of an experiment are now suffixed with a unique random string instead of incremental integers to avoid race conditions in cluster setups.
  • Replaced utils.nn.get_network_topology() with a new Topology class that parses the direction string of the model more intelligently.
  • If CUDA_VISIBLE_DEVICES is set, the GPUManager will always honor it.
  • Dropped creation of temporary/advisory lock files under /tmp for GPU reservation.
  • Time measurements during training are now structured into batch overhead, training and evaluation timings.
  • Datasets
    • Added TextDataset for standalone text file reading.
    • Added OneHotDataset, a variant of TextDataset where the sequences are not prefixed/suffixed with <bos> and <eos> respectively.
    • Added experimental MultiParallelDataset that merges an arbitrary number of parallel datasets together.
  • nmtpy translate
    • .nodbl and .nounk suffixes are now added to output files for --avoid-double and --avoid-unk arguments respectively.
    • A model-agnostic enough beam_search() is now separated out into its own file nmtpytorch/
    • max_len default is increased to 200.

v1.1 (25/01/2018)

  • New experimental Multi30kDataset and ImageFolderDataset classes
  • torchvision dependency added for CNN support
  • nmtpy-coco-metrics now computes one METEOR without norm=True
  • Mainloop mechanism is completely refactored with backward-incompatible configuration option changes for [train] section:
    • patience_delta option is removed
    • Added eval_batch_size to define batch size for GPU beam-search during training
    • eval_freq default is now 3000 which means per 3000 minibatches
    • eval_metrics now defaults to loss. As before, you can provide a list of metrics like bleu,meteor,loss to compute all of them and early-stop based on the first
    • Added eval_zero (default: False), which evaluates the model once on the dev set right before training starts. Useful for sanity checking if you fine-tune a model initialized with pre-trained weights
    • Removed save_best_n: we no longer save the best N models on dev set w.r.t. early-stopping metric
    • Added save_best_metrics (default: True), which saves the best models on the dev set w.r.t. each metric provided in eval_metrics. This partly compensates for the removal of save_best_n
    • checkpoint_freq now defaults to 5000, which means every 5000 minibatches.
    • Added n_checkpoints (default: 5) to define the number of last checkpoints that will be kept if checkpoint_freq > 0 i.e. checkpointing enabled
  • Added ExtendedInterpolation support to configuration files:
    • You can now define intermediate variables in .conf files to avoid typing the same paths again and again. A variable can be referenced from within its section using the ${save_path} notation, e.g. tensorboard_dir: ${save_path}/tb. Cross-section references are also possible: ${data:root} will be replaced by the value of the root variable defined in the [data] section.
  • Added -p/--pretrained to nmtpy train to initialize the weights of the model using another checkpoint .ckpt.
  • Improved input/output handling for nmtpy translate:
    • -s accepts a comma-separated list of test sets defined in the configuration file of the experiment to translate them at once. Example: -s val,newstest2016,newstest2017
    • The mutually exclusive counterpart of -s is -S, which receives a single input file of source sentences.
    • For both cases, an output prefix should now be provided with -o. In the case of multiple test sets, the name of the test set and the beam size will be appended to the output prefix. If you just provide a single file with -S, the final output name will only reflect the beam size information.
  • Two new arguments for nmtpy-build-vocab:
    • -f: Stores frequency counts as well inside the final json vocabulary
    • -x: Does not add special markers <eos>,<bos>,<unk>,<pad> into the vocabulary
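The ExtendedInterpolation support mentioned in the list above comes from Python's standard configparser module. The sketch below shows both same-section and cross-section references resolving. The paths and the train_file option are invented for illustration and do not come from an actual nmtpytorch configuration.

```python
import configparser

# Hypothetical configuration text; only the ${...} syntax matters here.
CONF_TEXT = """
[data]
root = /data/multi30k

[train]
save_path = /experiments/run1
tensorboard_dir: ${save_path}/tb
train_file: ${data:root}/train.en
"""

parser = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation())
parser.read_string(CONF_TEXT)

# ${save_path} is looked up in the same section, ${data:root} in [data].
print(parser["train"]["tensorboard_dir"])
print(parser["train"]["train_file"])
```

The two printed values are /experiments/run1/tb and /data/multi30k/train.en.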


New layers/architectures:

  • Added Fusion() layer to concat,sum,mul an arbitrary number of inputs
  • Added experimental ImageEncoder() layer to seamlessly plug a VGG or ResNet CNN using torchvision pretrained models
  • Attention layer arguments improved. You can now select the bottleneck dimensionality for MLP attention with att_bottleneck. The dot attention is still not tested and probably broken.

Changes in NMT:

  • dec_init defaults to mean_ctx, i.e. the decoder will be initialized with the mean context computed from the source encoder
  • enc_lnorm, which was just a placeholder, is now removed since we do not provide layer normalization for now
  • Beam Search is completely moved to GPU
