Diacritic restoration for Hebrew

Nakdimon: a simple Hebrew diacritizer

Repository for the paper Restoring Hebrew Diacritics Without a Dictionary by Elazar Gershuni and Yuval Pinter.

Demo: https://nakdimon.org/

Building and running docker container

Build the docker container:

$ docker build -t nakdimon .

Run the docker container:

$ docker run --rm --gpus all --user 1000:1000 -it nakdimon /bin/bash

The --gpus all flag is required to run the container with GPU support.

Training and evaluating

To train, test and evaluate the system, run the following commands:

> python nakdimon train --model=models/Nakdimon.h5
> python nakdimon run_test --test_set=tests/new --model=models/Nakdimon.h5
> python nakdimon results --test_set=tests/new --systems Snopi Morfix Dicta MajAllWithDicta Nakdimon

The first step trains the model and creates a file named Nakdimon.h5 in the models directory. By default, the model path is models/Nakdimon.h5, the model described in the paper. If the model file already exists, you may skip this step.
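As a minimal sketch of the "skip if the model already exists" rule, the helper below returns the train command only when the model file is missing. The function name train_command_if_needed and its root parameter are hypothetical and not part of the repository; only the command line itself comes from the instructions above.

```python
from pathlib import Path


def train_command_if_needed(model="models/Nakdimon.h5", root="."):
    """Return the train command as an argument list, or None if the model file exists.

    `root` is assumed to be the repository root, so that `model` resolves
    to the same path the CLI would write to.
    """
    if (Path(root) / model).exists():
        return None  # model already trained: the first step can be skipped
    return ["python", "nakdimon", "train", f"--model={model}"]
```

A returned command can be executed from the repository root with subprocess.run(cmd, check=True).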

The second step asks the Nakdimon server to predict the diacritics for the test set. A folder for the results is created in the chosen test folder, with the same name as the model; in this case, tests/new/NakdimonNew. By default, the test set is the one used in the paper (tests/new); you can use tests/dicta instead. If the test results already exist, you may skip this step. If you are not sure whether they exist, you can use the --skip_existing flag.

The third step calculates and prints the results (DEC, CHA, WOR and VOC metrics, as well as OOV_WOR and OOV_VOC). By default, the systems are the folders in the chosen test folder. For the Dicta test set (tests/dicta) you should use MajAllNoDicta instead of MajAllWithDicta; otherwise the Majority vocabulary would include the test set itself.
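To illustrate why a Majority vocabulary must not include the test set, here is a toy majority-vote diacritizer. It is a sketch under stated assumptions, not the repository's MajAllWithDicta or MajAllNoDicta implementation: all function names are hypothetical, words are split on whitespace only, and word_accuracy is a simplified exact-match measure rather than the paper's WOR definition.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Drop combining marks (Hebrew points have a non-zero combining class)."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def build_majority_vocab(diacritized_texts):
    """Map each bare word to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for text in diacritized_texts:
        for word in text.split():
            counts[strip_diacritics(word)][word] += 1
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}


def majority_diacritize(text, vocab):
    """Replace known words by their majority form; leave unknown words bare."""
    return " ".join(vocab.get(word, word) for word in text.split())


def word_accuracy(expected, actual):
    """Fraction of whitespace-separated words that match exactly (simplified)."""
    pairs = list(zip(expected.split(), actual.split()))
    if not pairs:
        return 0.0
    return sum(e == a for e, a in pairs) / len(pairs)


# Toy data, written with escapes so the code points are unambiguous.
SHALOM = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"  # "shalom", diacritized
BAYIT = "\u05d1\u05bc\u05b7\u05d9\u05b4\u05ea"         # "bayit", diacritized
TRAIN = [SHALOM]
TEST_EXPECTED = SHALOM + " " + BAYIT
test_input = strip_diacritics(TEST_EXPECTED)

# Vocabulary built from training data only: "bayit" is unseen and stays bare.
clean_vocab = build_majority_vocab(TRAIN)
clean_score = word_accuracy(TEST_EXPECTED, majority_diacritize(test_input, clean_vocab))

# Vocabulary that also contains the test set: every test word is memorized.
leaky_vocab = build_majority_vocab(TRAIN + [TEST_EXPECTED])
leaky_score = word_accuracy(TEST_EXPECTED, majority_diacritize(test_input, leaky_vocab))
```

On this toy data clean_score is 0.5 and leaky_score is 1.0: the leaky vocabulary simply looks up the answers, which is the inflation the MajAllNoDicta variant avoids on the Dicta test set.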

Diacritizing a single file

> python nakdimon predict input_file.txt output_file.txt
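The predict command handles one file per invocation. For a folder of files, one option is to build one invocation per file, as in the sketch below. The helper name predict_commands and the choice of the .txt extension are assumptions made for illustration; the command line itself is the one shown above.

```python
from pathlib import Path


def predict_commands(input_dir, output_dir):
    """Build one `python nakdimon predict` argument list per .txt file in input_dir.

    Each output file keeps the input file's name and is placed in output_dir,
    which is assumed to exist before the commands are run.
    """
    out = Path(output_dir)
    return [
        ["python", "nakdimon", "predict", str(src), str(out / src.name)]
        for src in sorted(Path(input_dir).glob("*.txt"))
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the repository root.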

Using other systems

You can use the run_test command to run the test set on other systems, such as Dicta:

> python nakdimon run_test --test_set=tests/new --system=Dicta

This will create a folder named Dicta for the results in the tests/new folder. Note that Morfix cannot be used in this manner, as its license prohibits automatic use.

Running ablation tests

You can use the --ablation flag to train different models for the ablation tests and other experiments:

> python nakdimon train --model=models/SingleLayer.h5 --ablation=SingleLayer

See the file ablation.py for the list of available ablation parameters.

Important folders

  • hebrew_diacritized is the training set.
  • tests contains three test sets: new, dicta and validation. Each test set has an expected folder that describes the ground truth. The results of python nakdimon run_test are stored in a sibling folder, named after the model.
  • models contains the trained model.
  • nakdimon holds the source code.
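Given this layout, the systems available for a test set can be listed by looking at the folders next to expected. The sketch below assumes that every sub-folder other than expected holds one system's results; the helper name list_systems is hypothetical, and the results command may select systems differently.

```python
from pathlib import Path


def list_systems(test_set, ground_truth="expected"):
    """Return the names of result folders in a test set, excluding the ground truth."""
    root = Path(test_set)
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and p.name != ground_truth
    )
```

For example, list_systems("tests/new") would return the folder names that can be passed to --systems, provided the corresponding run_test results exist.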

Citation

@inproceedings{gershuni2022restoring,
  title={Restoring Hebrew Diacritics Without a Dictionary},
  author={Gershuni, Elazar and Pinter, Yuval},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2022},
  pages={1010--1018},
  year={2022}
}

Gershuni, Elazar, and Yuval Pinter. "Restoring Hebrew Diacritics Without a Dictionary." Findings of the Association for Computational Linguistics: NAACL 2022. 2022.
