Diacritic restoration for Hebrew

Nakdimon: a simple Hebrew diacritizer

Repository for the paper Restoring Hebrew Diacritics Without a Dictionary by Elazar Gershuni and Yuval Pinter.

Demo: https://nakdimon.org/

Requires Python 3.12+. The runtime uses ONNX Runtime (no TensorFlow required to run inference).

Locally:

$ pip install nakdimon
$ diacritize input_file.txt -o=output_file.txt
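To diacritize many files, the same command can be driven from a script. The following is a minimal sketch, assuming the `diacritize` entry point is on your PATH after `pip install nakdimon`; the helper names `build_diacritize_command` and `diacritize_folder` are hypothetical and not part of the package.

```python
import subprocess
from pathlib import Path


def build_diacritize_command(input_path: Path, output_path: Path) -> list[str]:
    """Build the argument list for: diacritize input_file.txt -o=output_file.txt"""
    return ["diacritize", str(input_path), f"-o={output_path}"]


def diacritize_folder(src_dir: Path, dst_dir: Path, dry_run: bool = False) -> list[list[str]]:
    """Run the CLI once per .txt file in src_dir, writing to dst_dir.

    With dry_run=True, nothing is executed or created; the commands are
    only collected and returned so they can be inspected first.
    """
    if not dry_run:
        dst_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for input_path in sorted(src_dir.glob("*.txt")):
        command = build_diacritize_command(input_path, dst_dir / input_path.name)
        commands.append(command)
        if not dry_run:
            # check=True raises CalledProcessError if the CLI exits with an error.
            subprocess.run(command, check=True)
    return commands
```

Calling `diacritize_folder(Path("plain"), Path("diacritized"), dry_run=True)` first is a cheap way to inspect the commands before running them.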

Building and running the Docker container

$ docker build -t nakdimon .
$ docker run --rm -it nakdimon /bin/bash

Development setup (with uv)

$ uv sync                  # install runtime deps
$ uv sync --extra train    # add TensorFlow stack (Python 3.12–3.13 only)
$ uv sync --extra research # add matplotlib/seaborn for plots

Training and evaluating

Training requires the [train] extra (TensorFlow + wandb + tf2onnx):

$ pip install 'nakdimon[train]'

Then:

> python -m nakdimon train --model=models/Nakdimon.keras
> python scripts/convert_to_onnx.py models/Nakdimon.keras models/Nakdimon.onnx
> python -m nakdimon run_test --test_set=tests/new --model=models/Nakdimon.onnx
> python -m nakdimon results --test_set=tests/new --systems Snopi Morfix Dicta MajAllWithDicta Nakdimon

The trained .keras model is converted to .onnx once; the runtime predictor consumes the .onnx file. By default, the bundled model is nakdimon/data/Nakdimon.onnx (shipped in the wheel).

The run_test step (the third command) predicts the diacritics for the test set. A folder for the results is created in the chosen test folder, with the same name as the model; in this case, tests/new/Nakdimon. By default, the test set is the one used in the paper (tests/new); you can use tests/dicta instead. If the test results already exist, you may skip this step; if you are not sure, use the --skip_existing flag.
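The folder convention described above can be checked from a script before deciding whether to rerun the predictions. This is only a sketch: it assumes the results folder is named after the model file's stem (so models/Nakdimon.onnx maps to a folder called Nakdimon), and the helpers `results_dir` and `needs_run` are hypothetical, not functions provided by nakdimon.

```python
from pathlib import Path


def results_dir(test_set: str, model_path: str) -> Path:
    """Assumed location of the predictions: <test_set>/<model file stem>."""
    return Path(test_set) / Path(model_path).stem


def needs_run(test_set: str, model_path: str) -> bool:
    """True if the results folder is missing or empty."""
    folder = results_dir(test_set, model_path)
    return not (folder.is_dir() and any(folder.iterdir()))


if __name__ == "__main__":
    if needs_run("tests/new", "models/Nakdimon.onnx"):
        print("No results found; run the run_test command first.")
```

The --skip_existing flag covers the same case from the command line; the sketch is only useful when orchestrating several runs from Python.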

The results step (the fourth command) calculates and prints the results (DEC, CHA, WOR and VOC metrics, as well as OOV_WOR and OOV_VOC). By default, the systems are the folders in the chosen test folder. For the Dicta test set (tests/dicta) you should use MajAllNoDicta instead of MajAllWithDicta; otherwise the vocabulary for the Majority baseline would include the test set itself.
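To see why a Majority vocabulary that contains the test set inflates the scores, consider the toy sketch below. It assumes the Majority baseline maps each undiacritized word to its most frequent diacritized form in the vocabulary and leaves unseen words untouched; this is an illustration of the leakage argument, not the repository's implementation, and all function names are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word: str) -> str:
    """Drop combining marks (Unicode category Mn), which covers Hebrew niqqud."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def build_majority_vocab(diacritized_words):
    """Map each bare word to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for word in diacritized_words:
        counts[strip_diacritics(word)][word] += 1
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}


def majority_diacritize(bare_words, vocab):
    """Look each word up; unseen words are returned without diacritics."""
    return [vocab.get(word, word) for word in bare_words]


# Three readings of the same consonants: "book" (twice), "counted", and "barber".
train = ["סֵפֶר", "סֵפֶר", "סָפַר"]
test_expected = ["סַפָּר"]
test_input = [strip_diacritics(word) for word in test_expected]

# Vocabulary built without the test set: the baseline picks the common reading.
clean_vocab = build_majority_vocab(train)
# Vocabulary that also saw the test set (repeated here to make it the majority).
leaky_vocab = build_majority_vocab(train + test_expected * 3)

clean_prediction = majority_diacritize(test_input, clean_vocab)
leaky_prediction = majority_diacritize(test_input, leaky_vocab)
```

With the clean vocabulary the baseline gets the test word wrong, while the leaky vocabulary reproduces the expected answer simply because it has already seen it.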

Diacritizing a single file

> python -m nakdimon predict input_file.txt output_file.txt

Using other systems

You can use the run_test command to run the test set on other systems, such as Dicta:

> python -m nakdimon run_test --test_set=tests/new --system=Dicta

This will create a folder named Dicta for the results in the tests/new folder. Note that Morfix cannot be used in this manner, as its license prohibits automated use.

Running ablation tests

You can use the --ablation flag to train different models for the ablation tests and other experiments:

> python -m nakdimon train --model=models/SingleLayer.keras --ablation=SingleLayer

See the file ablation.py for the list of available ablation parameters.

Important folders

  • hebrew_diacritized is the training set.
  • tests contains three test sets: new, dicta and validation. Each test set has an expected folder that holds the ground truth. The results of python -m nakdimon run_test are stored in a sibling folder, named after the model.
  • models contains the trained model.
  • nakdimon holds the source code.

Citation

@inproceedings{gershuni2022restoring,
  title={Restoring Hebrew Diacritics Without a Dictionary},
  author={Gershuni, Elazar and Pinter, Yuval},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2022},
  pages={1010--1018},
  year={2022}
}

Gershuni, Elazar, and Yuval Pinter. "Restoring Hebrew Diacritics Without a Dictionary." Findings of the Association for Computational Linguistics: NAACL 2022. 2022.
