Neural diacritization

Neudia ✒️

Neudia is a neural network-based diacritization system.

Philosophy

Neudia is closely inspired by the Nakdimon diacritization system for Hebrew (Gershuni & Pinter 2022), but is intended to be much more general.

Design

The Neudia model consists of an encoder which feeds into a tagger layer.

Neudia supports RNN (GRU and LSTM) and transformer (vanilla and rotary) encoders adapted from Yoyodyne. It also supports ByT5, a pre-trained transformer encoder.

Lightning is used to generate the training, validation, inference, and evaluation loops. The LightningCLI interface is used to provide a user interface and manage configuration.

Below, we use YAML to specify configuration options, and we strongly recommend users do the same. However, most configuration options can also be specified using POSIX-style command-line flags.

Authors

Neudia was created by Kyle Gorman and other contributors like you.

Installation

To install Neudia and its dependencies, run the following command:

pip install .

File formats

YAML configuration files

Neudia uses YAML configuration files; see the example configuration files, and see the Yoyodyne documentation for information on variable interpolation.

TSV data files

Neudia operates on basic tab-separated values (TSV) data files in which the first column is the source string and the second the target string.

source   target

One can specify different 1-indexed column indices using arguments under data:, like so:

...
data:
  source_col: 2
  target_col: 1
  ...
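
As an illustration of the 1-indexed convention, here is a minimal Python sketch that extracts source and target strings from TSV lines. It is not Neudia's own reader: the function name read_pairs and the sample data are hypothetical, and only source_col and target_col come from the configuration above.

```python
import csv
import io
from typing import Iterable, Iterator, Tuple


def read_pairs(
    lines: Iterable[str], source_col: int = 1, target_col: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yields (source, target) pairs from tab-separated lines.

    Column indices are 1-indexed, mirroring source_col and target_col.
    """
    if source_col < 1 or target_col < 1:
        raise ValueError("column indices are 1-indexed")
    # QUOTE_NONE treats quotation marks as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Subtracts one to convert to Python's 0-indexed lists.
        yield row[source_col - 1], row[target_col - 1]


# Hypothetical sample in which the target precedes the source.
sample = io.StringIO("canō\tcano\nTrōiae\tTroiae\n")
pairs = list(read_pairs(sample, source_col=2, target_col=1))
print(pairs)
```

With source_col: 2 and target_col: 1, the second column is read as the source and the first as the target.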

Usage

The neudia command-line tool uses a subcommand interface, with four different modes. To see the full set of options available for a subcommand, use the --print_config flag. For example:

neudia fit --print_config

will show all configuration options (and their default values) for the fit subcommand.

For more detailed examples, see the configs directory.

Training (fit)

In fit mode, one trains a model, either from scratch or, optionally, by resuming from a pre-existing checkpoint. Naturally, most configuration options need to be set at training time.

This mode is invoked using the fit subcommand, like so:

neudia fit --config path/to/config.yaml

Alternatively, one can resume training from a pre-existing checkpoint so long as it matches the specification of the configuration file.

neudia fit --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt

Seeding

Setting the seed_everything: argument to some fixed value ensures a reproducible experiment (modulo hardware non-determinism).

Model architecture

A specification for a model goes under model:, and includes:

  • the dimensionality of the embeddings (embedding_size)
  • label_smoothing probability
  • the class_path of the encoder

There are five types of encoders supported:

  • GRU (neudia.encoders.GRUEncoder)
  • LSTM (neudia.encoders.LSTMEncoder)
  • Transformer (neudia.encoders.TransformerEncoder)
  • Rotary transformer (neudia.encoders.RotaryTransformerEncoder)
  • ByT5 (neudia.encoders.ByT5Encoder)

One provides the class path to the encoder, and then under init_args:, includes:

  • the dropout probability (NB: all dropout occurs within the encoder)
  • the number of encoder layers
  • (for GRU and LSTM encoders) whether to use a bidirectional encoder
  • (for the transformer encoders) the number of attention_heads
  • (for ByT5) the number of pooling_layers
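
The label_smoothing option can be illustrated with a small worked example. The sketch below assumes the conventional formulation (the one used by PyTorch's CrossEntropyLoss), in which the one-hot target is mixed with a uniform distribution over all classes. Whether Neudia computes it exactly this way is an assumption, and smooth_target is a hypothetical helper.

```python
from typing import List


def smooth_target(
    gold: int, num_classes: int, label_smoothing: float
) -> List[float]:
    """Mixes a one-hot target with a uniform distribution over classes."""
    if not 0.0 <= label_smoothing <= 1.0:
        raise ValueError("label_smoothing must be in [0, 1]")
    # Every class receives an equal share of the smoothing mass.
    uniform = label_smoothing / num_classes
    target = [uniform] * num_classes
    # The gold class additionally keeps the remaining probability mass.
    target[gold] += 1.0 - label_smoothing
    return target


# Four hypothetical tags, gold tag at index 2, smoothing probability 0.1.
target = smooth_target(gold=2, num_classes=4, label_smoothing=0.1)
print(target)
```

The resulting distribution still sums to one, but no longer assigns all of its mass to the gold tag.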

Optimization

Neudia requires an optimizer and a learning rate scheduler. The system is borrowed from Yoyodyne; see the Yoyodyne documentation for more information.

Checkpointing

A checkpoint config must be specified or no checkpoints will be generated; see here for more information.

Callbacks

See here for more information.

Logging

See here for more information.

Other options

Batch size is specified using data: batch_size: ....

By default, training uses 32-bit precision. However, the trainer: precision: flag allows the user to perform training with half precision (16), or with mixed-precision formats like bf16-mixed if supported by the accelerator. This might reduce the size of the model and batches in memory, allowing one to use larger batches, or it may simply provide small speed-ups.

There are a number of ways to specify how long a model should train for. For example, the following YAML snippet specifies that training should run for 100 epochs or 6 wall-clock hours, whichever comes first:

...
trainer:
  max_epochs: 100
  max_time: 00:06:00:00
  ...
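
Lightning reads max_time strings in the format DD:HH:MM:SS (days, hours, minutes, seconds), so 00:06:00:00 means six hours. The following Python sketch only illustrates that format; it is not Lightning's parser, and parse_max_time is a hypothetical helper.

```python
from datetime import timedelta


def parse_max_time(value: str) -> timedelta:
    """Parses a DD:HH:MM:SS string into a timedelta."""
    parts = value.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected DD:HH:MM:SS, got {value!r}")
    days, hours, minutes, seconds = (int(part) for part in parts)
    return timedelta(
        days=days, hours=hours, minutes=minutes, seconds=seconds
    )


# The value used in the snippet above.
limit = parse_max_time("00:06:00:00")
print(limit.total_seconds() / 3600, "hours")
```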

Validation (validate)

In validation mode, one runs the validation step over labeled validation data (specified as data: val: path/to/validation.tsv) using a previously trained checkpoint (--ckpt_path path/to/checkpoint.ckpt from the command line), recording loss and other statistics for the validation set. In practice this is mostly useful for debugging.

This mode is invoked using the validate subcommand, like so:

neudia validate --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt

Evaluation (test)

In test mode, one computes accuracy over held-out test data (specified as data: test: path/to/test.tsv) using a previously trained checkpoint (--ckpt_path path/to/checkpoint.ckpt from the command line); it differs from validation mode in that it uses the test file rather than the val file.

This mode is invoked using the test subcommand, like so:

neudia test --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt

Inference (predict)

In predict mode, a previously trained model checkpoint (--ckpt_path path/to/checkpoint.ckpt from the command line) is used to label an input file. One must also specify the path where the predictions will be written:

...
predict:
  path: path/to/predictions.txt
...

This mode is invoked using the predict subcommand, like so:

neudia predict --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt

Examples

The examples directory contains some relevant examples.

Related projects

  • Neudia is closely based on Yoyodyne and reuses much of its core code.

License

Neudia is distributed under an Apache 2.0 license.

For developers

We welcome contributions using the fork-and-pull model.

Testing

An integration test diacritizes lines of the Aeneid. This test unfortunately cannot be run on continuous integration. To run the test, run the following:

pytest -vvv tests

Releasing

  1. Create a new branch. E.g., if you want to call this branch "release": git checkout -b release
  2. Sync your fork's branch to the upstream master branch. E.g., if the upstream remote is called "upstream": git pull upstream master
  3. Increment the version field in pyproject.toml.
  4. Stage your changes: git add pyproject.toml.
  5. Commit your changes: git commit -m "your commit message here"
  6. Push your changes. E.g., if your branch is called "release": git push origin release
  7. Submit a PR for your release and wait for it to be merged into master.
  8. Tag the master branch's last commit. The tag should begin with v; e.g., if the new version is 3.1.4, the tag should be v3.1.4. This can be done:
    • on GitHub itself: click the "Releases" or "Create a new release" link on the right-hand side of the GitHub page and follow the dialogues.
    • from the command-line using git tag.
  9. Build the new release: python -m build
  10. Upload the result to PyPI: twine upload dist/*

References

Gershuni, E. and Pinter, Y. 2022. Restoring Hebrew diacritics without a dictionary. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1010-1018.
