Diacritic restoration for Hebrew
Nakdimon: a simple Hebrew diacritizer
Repository for the paper Restoring Hebrew Diacritics Without a Dictionary by Elazar Gershuni and Yuval Pinter.
Demo: https://nakdimon.org/
Requires Python 3.12+. The runtime uses ONNX Runtime (no TensorFlow required to run inference).
Locally:
$ pip install nakdimon
$ diacritize input_file.txt -o=output_file.txt
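To process many files, the documented command can be driven from a script. The following is a minimal sketch, not part of the package: the helper names `build_command` and `diacritize_folder` are hypothetical, and it assumes only the `diacritize input_file.txt -o=output_file.txt` invocation shown above and that `diacritize` is on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_command(input_path, output_path):
    """Build the argument list for the documented CLI call.

    Mirrors: diacritize input_file.txt -o=output_file.txt
    """
    return ["diacritize", str(input_path), f"-o={output_path}"]


def diacritize_folder(src_dir, dst_dir):
    """Run the CLI once per .txt file in src_dir, writing results to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.txt")):
        # check=True raises CalledProcessError if the CLI exits with an error
        subprocess.run(build_command(path, dst / path.name), check=True)
```

Calling `diacritize_folder("plain", "diacritized")` would then write one output file per input file, keeping the original file names.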
Building and running the Docker container
$ docker build -t nakdimon .
$ docker run --rm -it nakdimon /bin/bash
Development setup (with uv)
$ uv sync # install runtime deps
$ uv sync --extra train # add TensorFlow stack (Python 3.12–3.13 only)
$ uv sync --extra research # add matplotlib/seaborn for plots
Training and evaluating
Training requires the [train] extra (TensorFlow + wandb + tf2onnx):
$ pip install 'nakdimon[train]'
Then:
> python -m nakdimon train --model=models/Nakdimon.keras
> python scripts/convert_to_onnx.py models/Nakdimon.keras models/Nakdimon.onnx
> python -m nakdimon run_test --test_set=tests/new --model=models/Nakdimon.onnx
> python -m nakdimon results --test_set=tests/new --systems Snopi Morfix Dicta MajAllWithDicta Nakdimon
The trained .keras model is converted to .onnx once; the runtime predictor consumes the .onnx file.
By default, the bundled model is nakdimon/data/Nakdimon.onnx (shipped in the wheel).
The run_test step (the third command) asks the Nakdimon server to predict the diacritics for the test set.
A folder for the results is created in the chosen test folder, with the same name as the model; in this case, tests/new/Nakdimon.
By default, the test set is the one used in the paper (tests/new); you can use tests/dicta instead.
If the test results already exist, you may skip this step. If you are not sure, you can use the --skip_existing flag.
The results step (the fourth command) calculates and prints the results (DEC, CHA, WOR and VOC metrics, as well as OOV_WOR and OOV_VOC).
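This README does not define the metrics. As a rough illustration only, the sketch below assumes that CHA is the share of Hebrew letters whose diacritics are all correct and that WOR is the share of words that are entirely correct. The function names are hypothetical and this is not the project's evaluation code; DEC, VOC and the OOV variants are not covered.

```python
import unicodedata

# Hebrew letters alef..tav (U+05D0..U+05EA)
HEBREW_LETTERS = {chr(c) for c in range(0x05D0, 0x05EB)}


def split_letters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    # NFD puts combining marks into canonical order, so mark order cannot
    # cause a spurious mismatch
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and groups:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


def char_accuracy(expected, actual):
    """Share of Hebrew letters whose attached diacritics match exactly."""
    exp, act = split_letters(expected), split_letters(actual)
    if [g[0] for g in exp] != [g[0] for g in act]:
        raise ValueError("texts differ in their base characters")
    pairs = [(e, a) for e, a in zip(exp, act) if e[0] in HEBREW_LETTERS]
    if not pairs:
        raise ValueError("no Hebrew letters to compare")
    return sum(e == a for e, a in pairs) / len(pairs)


def word_accuracy(expected, actual):
    """Share of whitespace-separated words that match exactly."""
    exp = unicodedata.normalize("NFD", expected).split()
    act = unicodedata.normalize("NFD", actual).split()
    if not exp or len(exp) != len(act):
        raise ValueError("texts must have the same, non-zero number of words")
    return sum(e == a for e, a in zip(exp, act)) / len(exp)
```

A single wrong vowel lowers both scores, but the word-level score drops faster because the whole word counts as an error.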
By default, the systems are the folders in the chosen test folder.
For the Dicta test set (tests/dicta) you should use MajAllNoDicta instead of MajAllWithDicta; otherwise the vocabulary of the Majority baseline would include the test set itself.
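To see why the vocabulary matters, consider a toy version of such a baseline. The sketch below assumes that Majority means "for each undiacritized word, return the diacritized form seen most often in the vocabulary"; the function names are hypothetical and this is not the project's implementation.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks (such as niqqud) from a word."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


def build_vocabulary(diacritized_text):
    """Count, for each bare word, how often each diacritized form occurs."""
    vocab = defaultdict(Counter)
    for word in diacritized_text.split():
        vocab[strip_diacritics(word)][word] += 1
    return vocab


def majority_diacritize(text, vocab):
    """Replace each known word by its most frequent diacritized form."""
    out = []
    for word in text.split():
        forms = vocab.get(word)  # .get avoids creating empty entries
        out.append(forms.most_common(1)[0][0] if forms else word)
    return " ".join(out)
```

If the vocabulary were built from the test set itself, every test word would be "known" with its ground-truth form, so the baseline's score would be inflated.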
Diacritizing a single file
> python -m nakdimon predict input_file.txt output_file.txt
Using other systems
You can use the run_test command to run the test set on other systems, such as Dicta:
> python -m nakdimon run_test --test_set=tests/new --system=Dicta
This will create a folder named Dicta for the results in the tests/new folder.
Note that Morfix cannot be used in this manner, as its license prohibits automated use.
Running ablation tests
You can use the --ablation flag to train different models for the ablation tests and other experiments:
> python -m nakdimon train --model=models/SingleLayer.keras --ablation=SingleLayer
See the file ablation.py for the list of available ablation parameters.
Important folders
- hebrew_diacritized is the training set.
- tests contains three test sets: new, dicta and validation. Each test set has an expected folder that describes the ground truth. The results of python -m nakdimon run_test are stored in a sibling folder, named after the model.
- models contains the trained model.
- nakdimon holds the source code.
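The layout of a test set can be walked with a few lines of Python. This is a sketch under stated assumptions: it supposes that a system's results folder mirrors the relative paths of the expected folder and that the files end in .txt, neither of which is stated above. The helper name `pair_with_expected` is hypothetical.

```python
from pathlib import Path


def pair_with_expected(test_set, system):
    """Pair each ground-truth file with the matching file produced by a system.

    Assumes <test_set>/expected and <test_set>/<system> share relative paths.
    """
    test_dir = Path(test_set)
    expected_dir = test_dir / "expected"
    system_dir = test_dir / system
    pairs = []
    for expected_file in sorted(expected_dir.rglob("*.txt")):
        actual_file = system_dir / expected_file.relative_to(expected_dir)
        if actual_file.exists():  # skip files the system has not produced yet
            pairs.append((expected_file, actual_file))
    return pairs
```

For example, `pair_with_expected("tests/new", "Nakdimon")` would list the files available for comparison once run_test has been executed for that model.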
Citation
@inproceedings{gershuni2022restoring,
title={Restoring Hebrew Diacritics Without a Dictionary},
author={Gershuni, Elazar and Pinter, Yuval},
booktitle={Findings of the Association for Computational Linguistics: NAACL 2022},
pages={1010--1018},
year={2022}
}
Gershuni, Elazar, and Yuval Pinter. "Restoring Hebrew Diacritics Without a Dictionary." Findings of the Association for Computational Linguistics: NAACL 2022. 2022.