Diacritic restoration for Hebrew
Nakdimon: a simple Hebrew diacritizer
Repository for the paper Restoring Hebrew Diacritics Without a Dictionary by Elazar Gershuni and Yuval Pinter.
Demo: https://nakdimon.org/
Building and running docker container
Build the docker container:
$ docker build -t nakdimon .
Run the docker container:
$ docker run --rm --gpus all --user 1000:1000 -it nakdimon /bin/bash
The --gpus all flag is required to run the container with GPU support.
Training and evaluating
To train, test and evaluate the system, run the following commands:
> python nakdimon train --model=models/Nakdimon.h5
> python nakdimon run_test --test_set=tests/new --model=models/Nakdimon.h5
> python nakdimon results --test_set=tests/new --systems Snopi Morfix Dicta MajAllWithDicta Nakdimon
The first step trains the model and creates a file named Nakdimon.h5 in the models directory.
By default, the model is the one described in the paper: models/Nakdimon.h5.
If the model already exists, you may skip this step.
The second step asks the Nakdimon server to predict the diacritics for the test set.
A folder for the results is created in the chosen test folder, with the same name as the model; in this case, tests/new/NakdimonNew.
By default, the test set is the one used in the paper (tests/new); you can use tests/dicta instead.
If the test results already exist, you may skip this step. If you are not sure, you can use the --skip_existing flag.
The third step calculates and prints the results (DEC, CHA, WOR and VOC metrics, as well as OOV_WOR and OOV_VOC).
By default, the systems are the folders in the chosen test folder.
For the Dicta test set (tests/dicta), use MajAllNoDicta instead of MajAllWithDicta; otherwise, the vocabulary for the Majority system would include the test set itself.
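The metric names above are not defined in this README. As a rough illustration only, the sketch below assumes that CHA is the share of Hebrew letters whose diacritics are all correct and that WOR is the share of words with no mistakes. The helper names (split_marks, char_accuracy, word_accuracy) are hypothetical and are not part of the nakdimon package; the results command remains the authoritative implementation.

```python
import unicodedata


def split_marks(text):
    """Group each base character with the set of combining marks that follow it."""
    units = []
    for ch in text:
        if unicodedata.combining(ch) and units:
            base, marks = units[-1]
            units[-1] = (base, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def is_hebrew_letter(ch):
    # Alef (U+05D0) through tav (U+05EA), including final forms.
    return "\u05d0" <= ch <= "\u05ea"


def char_accuracy(expected, actual):
    """Assumed CHA: fraction of Hebrew letters carrying exactly the expected marks."""
    exp, act = split_marks(expected), split_marks(actual)
    if [base for base, _ in exp] != [base for base, _ in act]:
        raise ValueError("texts differ in their base characters")
    pairs = [(e, a) for e, a in zip(exp, act) if is_hebrew_letter(e[0])]
    if not pairs:
        raise ValueError("no Hebrew letters to score")
    return sum(e == a for e, a in pairs) / len(pairs)


def word_accuracy(expected, actual):
    """Assumed WOR: fraction of whitespace-separated words with no mistakes."""
    exp_words, act_words = expected.split(), actual.split()
    if len(exp_words) != len(act_words) or not exp_words:
        raise ValueError("texts must contain the same non-zero number of words")
    matches = sum(split_marks(e) == split_marks(a)
                  for e, a in zip(exp_words, act_words))
    return matches / len(exp_words)
```

Marks are compared as sets, so two strings that order the same diacritics differently on one letter still count as equal. Both functions expect the ground truth and the prediction to share the same undiacritized text.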
Diacritizing a single file
> python nakdimon predict input_file.txt output_file.txt
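To diacritize several files, the predict command can be called once per file. The following is a minimal sketch that assumes it is run from the repository root, so that python nakdimon resolves as in the command above. The function names and the one-output-per-input layout are illustrative assumptions, not part of the project.

```python
import subprocess
import sys
from pathlib import Path


def build_predict_command(input_file, output_file):
    """Build the argument list for: python nakdimon predict INPUT OUTPUT."""
    return [sys.executable, "nakdimon", "predict",
            str(input_file), str(output_file)]


def diacritize_folder(src_dir, dst_dir, runner=subprocess.run):
    """Run the predict command on every .txt file in src_dir.

    Outputs are written to dst_dir under the same file names.
    The runner parameter exists so the call can be replaced in tests.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for input_file in sorted(src_dir.glob("*.txt")):
        output_file = dst_dir / input_file.name
        # check=True makes a failed prediction raise instead of passing silently.
        runner(build_predict_command(input_file, output_file), check=True)
        outputs.append(output_file)
    return outputs
```

Calling diacritize_folder("plain", "diacritized") would then process every text file in a hypothetical plain folder. Each call starts a new process, so for many files the startup cost of loading the model is paid each time.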
Using other systems
You can use the run_test command to run the test set on other systems, such as Dicta:
> python nakdimon run_test --test_set=tests/new --system=Dicta
This will create a folder named Dicta for the results in the tests/new folder.
Note that Morfix cannot be used in this manner, as its license prohibits automatic use.
Running ablation tests
You can use the --ablation flag to train different models for the ablation tests and other experiments:
> python nakdimon train --model=models/SingleLayer.h5 --ablation=SingleLayer
See the file ablation.py for the list of available ablation parameters.
Important folders
- hebrew_diacritized is the training set.
- tests contains three test sets: new, dicta and validation. Each test set has an expected folder that describes the ground truth. The results of python nakdimon run_test are stored in a sibling folder, named after the model.
- models contains the trained model.
- nakdimon holds the source code.
Citation
@inproceedings{gershuni2022restoring,
title={Restoring Hebrew Diacritics Without a Dictionary},
author={Gershuni, Elazar and Pinter, Yuval},
booktitle={Findings of the Association for Computational Linguistics: NAACL 2022},
pages={1010--1018},
year={2022}
}
Gershuni, Elazar, and Yuval Pinter. "Restoring Hebrew Diacritics Without a Dictionary." Findings of the Association for Computational Linguistics: NAACL 2022. 2022.