Diacritic restoration for Hebrew
Project description
Nakdimon: a simple Hebrew diacritizer
Repository for the paper Restoring Hebrew Diacritics Without a Dictionary by Elazar Gershuni and Yuval Pinter.
Demo: https://nakdimon.org/
Building and running docker container
Build the docker container:
$ docker build -t nakdimon .
Run the docker container:
$ docker run --rm --gpus all --user 1000:1000 -it nakdimon /bin/bash
The --gpus all
flag is required to run the container with GPU support.
Training and evaluating
To train, test and evaluate the system, run the following commands:
> python nakdimon train --model=models/Nakdimon.h5
> python nakdimon run_test --test_set=tests/new --model=models/Nakdimon.h5
> python nakdimon results --test_set=tests/new --systems Snopi Morfix Dicta MajAllWithDicta Nakdimon
The first step trains the model and create a file named Nakdimon.h5
in the models
directory.
By default, the model is the one described in the paper: models/Nakdimon.h5
.
If the model already exists, you may skip this step.
The second step asks the Nakdimon server to predict the diacritics for the test set. You may skip this step.
A folder for the results is created in the chosen test folder, with the same name as the model; in this case, tests/new/NakdimonNew
.
By default, the test set is the one used in the paper (tests/new
); you can use tests/dicta
instead.
If the test results already exist, you may skip this step. If you are not sure, you can use the --skip_existing
flag.
The third step calculates and prints the results (DEC, CHA, WOR and VOC metrics, as well as OOV_WOR and OOV_VOC).
By default, the systems are the folders in the chosen test folder.
For the Dicta test set (/tests/dicta
) you should use MajAllNoDicta
instead of MajAllWithDicta
, otherwise the vocabulary for the Majority would include the test set itself.
Diacritizing a single file
> python nakdimon predict input_file.txt output_file.txt
Using other systems
You can use the run_test
command to run the test set on other systems, such as Dicta:
> python nakdimon run_test --test_set=tests/new --system=Dicta
This will create a folder named Dicta
for the results in the tests/new
folder.
Note that Morfix
cannot be used in this manner, as its license prohibit automatic use.
Running ablation tests
You can use the --ablation
flag to train different models for the ablation tests and other experiments:
> python nakdimon train --model=models/SingleLayer.h5 --ablation=SingleLayer
See the file ablation.py
for the list of available ablation parameters.
Important folders
hebrew_diacritized
is the training set.tests
contains three tests sets:new
,dicta
andvalidation
. Each test set has anexpected
folder that describes the ground truth. The results ofpython nakdimon run_test
are stored in sibling folder, named after the model.models
contains the trained model.nakdimon
holds the source code.
Citation
@inproceedings{gershuni2022restoring,
title={Restoring Hebrew Diacritics Without a Dictionary},
author={Gershuni, Elazar and Pinter, Yuval},
booktitle={Findings of the Association for Computational Linguistics: NAACL 2022},
pages={1010--1018},
year={2022}
}
Gershuni, Elazar, and Yuval Pinter. "Restoring Hebrew Diacritics Without a Dictionary." Findings of the Association for Computational Linguistics: NAACL 2022. 2022.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file nakdimon-0.1.2.tar.gz
.
File metadata
- Download URL: nakdimon-0.1.2.tar.gz
- Upload date:
- Size: 59.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f4d76793325039f2d12f9fc5f9c0bcdf5bbaadf261f16292707f0b7efed7572c |
|
MD5 | 6c7f1a26f70d7640178c5441a6332c67 |
|
BLAKE2b-256 | 1dc83148685b25c8d780c04d29b1173f0c468e804757b31e1367324ccfe86a17 |
File details
Details for the file nakdimon-0.1.2-py3-none-any.whl
.
File metadata
- Download URL: nakdimon-0.1.2-py3-none-any.whl
- Upload date:
- Size: 59.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | bc5f7220d468307a0d3b88db61f76b92dc4ee283c56ebebb4138613d183aa0e8 |
|
MD5 | 10c84bd3a83492cd4e618cb7727b2a7d |
|
BLAKE2b-256 | bbb49daaf336ff7e17c0dea891852b082bd66d3c01fcb63e8956ab3bfff0817d |