Hebrew nakdan (diacritizer) with Shva Na and Hat'ama

Phonikud

Phonikud is a Hebrew diacritizer based on dictabert-large-char-menaked with added phonetic symbols for Shva Na and Hat'ama (Stress).

Added Symbols

  • Hat'ama (Stress): \u05ab (U+05AB), also called ole
  • Mobile Shva (Shva Na): \u05bd (U+05BD), also called meteg
  • Prefix: vertical bar |

Example: סֵ֫לֵרִי בְּֽ|מַעְבַּד מָזוֹן
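The added symbols are ordinary Unicode characters, so they can be counted or removed with plain string operations. The sketch below is illustrative and not part of the Phonikud package: the names `strip_added_symbols` and `count_added_symbols` are hypothetical helpers, and only the three code points listed above are taken from this README.

```python
import re

HATAMA = "\u05ab"   # Hebrew accent ole, used here for stress
SHVA_NA = "\u05bd"  # Hebrew point meteg, used here for mobile shva
PREFIX = "|"        # vertical bar marking a prefix

# Character class matching any of the three added symbols.
_ADDED = re.compile(f"[{HATAMA}{SHVA_NA}{re.escape(PREFIX)}]")


def strip_added_symbols(text: str) -> str:
    """Remove the added symbols, leaving standard nikud untouched."""
    return _ADDED.sub("", text)


def count_added_symbols(text: str) -> dict:
    """Count how often each added symbol occurs in the text."""
    return {
        "hatama": text.count(HATAMA),
        "shva_na": text.count(SHVA_NA),
        "prefix": text.count(PREFIX),
    }
```

Stripping the symbols this way yields text with conventional diacritics only, which may be handy when comparing output against a plain nikud reference.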

Setup

pip install uv
uv sync

Prepare data

Add text files with diacritics, including Hat'ama and Shva Na, to data/train.

Example input: סֵ֫לֵרִי בְּֽ|מַעְבַּד מָזוֹן

wget https://huggingface.co/datasets/thewh1teagle/phonikud-data/resolve/main/knesset_nikud_v4.txt.7z
sudo apt install p7zip-full -y
7z x knesset_nikud_v4.txt.7z
mv knesset_nikud_v4.txt data/train/
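Before training, it can be useful to confirm that the files in data/train actually contain the Hat'ama and Shva Na marks. The following is a minimal sketch, assuming the training files are UTF-8 encoded and end in .txt; `summarize_lines` and `summarize_dir` are hypothetical helpers, not functions shipped with this project.

```python
from pathlib import Path
from typing import Iterable

HATAMA = "\u05ab"   # stress mark (ole)
SHVA_NA = "\u05bd"  # mobile shva mark (meteg)


def summarize_lines(lines: Iterable[str]) -> dict:
    """Count non-empty lines and how many contain each added mark."""
    stats = {"lines": 0, "with_hatama": 0, "with_shva_na": 0}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        stats["lines"] += 1
        stats["with_hatama"] += int(HATAMA in line)
        stats["with_shva_na"] += int(SHVA_NA in line)
    return stats


def summarize_dir(train_dir: str = "data/train") -> dict:
    """Aggregate the per-line counts over every .txt file in train_dir."""
    total = {"lines": 0, "with_hatama": 0, "with_shva_na": 0}
    for path in sorted(Path(train_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for key, value in summarize_lines(handle).items():
                total[key] += value
    return total


if __name__ == "__main__":
    print(summarize_dir())
```

A corpus where `with_hatama` or `with_shva_na` is zero most likely has only standard nikud and would not teach the model the added symbols.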

Train

uv run src/train/main.py

Monitor loss

uv run tensorboard --logdir ./ckpt

Monitor GPU

uv pip install nvitop
uv run nvitop

Run

Run the model with:

uv run src/run/main.py -m path/to/checkpoint/

Export onnx

See onnx_lib

Upload to HuggingFace

uv pip install huggingface_hub
git config --global credential.helper store # Allow cloning private repos from HF
huggingface-cli login --token "token" --add-to-git-credential # https://huggingface.co/settings/tokens
uv run huggingface-cli upload --repo-type model phonikud ./ckpt/last ./ckpt/last

# Fetch the model with git
git lfs install
git clone https://huggingface.co/user/phonikud

# Fetch a single file with the CLI
huggingface-cli download --repo-type dataset user/some-dataset some_file.7z --local-dir .
sudo apt install p7zip-full
7z x some_file.7z

Gotchas

  1. Hebrew is not displayed in the terminal when connected over SSH

Generate and select a UTF-8 locale:

sudo locale-gen en_US.UTF-8
sudo update-locale LANG=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

Then, close the terminal and reconnect.
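After reconnecting, a quick way to confirm the fix took effect is to ask Python which encodings it sees. This is a small sketch using only the standard library; `is_utf8` is a hypothetical helper introduced for illustration.

```python
import locale
import sys
from typing import Optional


def is_utf8(encoding_name: Optional[str]) -> bool:
    """Return True for common spellings of UTF-8 ("UTF-8", "utf8", "utf_8")."""
    normalized = (encoding_name or "").lower().replace("-", "").replace("_", "")
    return normalized == "utf8"


if __name__ == "__main__":
    stdout_encoding = sys.stdout.encoding
    locale_encoding = locale.getpreferredencoding(False)
    print("stdout encoding:", stdout_encoding, "ok" if is_utf8(stdout_encoding) else "NOT UTF-8")
    print("locale encoding:", locale_encoding, "ok" if is_utf8(locale_encoding) else "NOT UTF-8")
```

If either line reports something other than UTF-8, the locale settings above have probably not been picked up by the current session.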

TODO:

  • Organize train/val/test splits -- track val performance over time, log to tensorboard/wandb, ...
  • Check that hatama/shva targets are guaranteed to be aligned with tokenized characters (use return_offsets_mapping=True? cf. dictabert code)
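The alignment question in the second item can be explored without the tokenizer by first reducing marked text to one label record per base character. The sketch below is an assumption-laden illustration, not the project's training code: `CharLabel` and `split_labels` are hypothetical names, and it relies on `unicodedata.combining` to decide which characters are diacritics. Whether the model's tokenizer produces exactly one token per base character is the open point the TODO raises, and this sketch does not settle it.

```python
import unicodedata
from typing import List, NamedTuple

HATAMA = "\u05ab"   # stress mark (ole)
SHVA_NA = "\u05bd"  # mobile shva mark (meteg)
PREFIX = "|"        # prefix marker


class CharLabel(NamedTuple):
    char: str      # base character
    nikud: str     # standard diacritics attached to it
    hatama: bool   # stress mark follows this character
    shva_na: bool  # mobile shva mark follows this character
    prefix: bool   # prefix bar follows this character


def split_labels(text: str) -> List[CharLabel]:
    """Attach every mark to the base character that precedes it."""
    labels: List[CharLabel] = []
    for ch in text:
        if labels and ch == HATAMA:
            labels[-1] = labels[-1]._replace(hatama=True)
        elif labels and ch == SHVA_NA:
            labels[-1] = labels[-1]._replace(shva_na=True)
        elif labels and ch == PREFIX:
            labels[-1] = labels[-1]._replace(prefix=True)
        elif labels and unicodedata.combining(ch):
            labels[-1] = labels[-1]._replace(nikud=labels[-1].nikud + ch)
        else:
            # Base character (or a stray mark at the very start of the text).
            labels.append(CharLabel(ch, "", False, False, False))
    return labels
```

Joining the `char` fields gives the undiacritized text, so its length can be compared with the number of character tokens the tokenizer returns for the same input as a basic alignment check.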
