Hebrew nakdan with Shva Na and Hat'ama
Project description
Phonikud
Phonikud is a Hebrew diacritizer based on dictabert-large-char-menaked with added phonetic symbols for Shva Na and Hat'ama (Stress).
Added Symbols
- Hat'ama (Stress): \u05ab, also called ole
- Mobile Shva (Shva Na): \u05bd, also called meteg
- Prefix: vertical bar |
Example: סֵ֫לֵרִי בְּֽ|מַעְבַּד מָזוֹן
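The three added symbols are ordinary characters, so they can be removed to get back plain diacritized text. A minimal sketch, assuming nothing about Phonikud's own API: `strip_phonetic_marks` is a hypothetical helper name, and it also removes any literal vertical bars that happen to appear in the text.

```python
# Hypothetical helper, not part of the Phonikud package.
HATAMA = "\u05ab"   # HEBREW ACCENT OLE, used here to mark stress
SHVA_NA = "\u05bd"  # HEBREW POINT METEG, used here to mark Shva Na
PREFIX = "|"        # vertical bar, used here to mark a prefix boundary

# str.translate deletes every code point that maps to None.
_REMOVE = {ord(HATAMA): None, ord(SHVA_NA): None, ord(PREFIX): None}


def strip_phonetic_marks(text: str) -> str:
    """Return text without the Hat'ama, Shva Na, and prefix symbols."""
    return text.translate(_REMOVE)
```

Ordinary nikud such as tsere or dagesh is left untouched; only the three symbols listed above are dropped.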
Setup
pip install uv
uv sync
Prepare data
Add text files with diacritics, including Hat'ama and Shva Na, to data/train.
Example input: סֵ֫לֵרִי בְּֽ|מַעְבַּד מָזוֹן
wget https://huggingface.co/datasets/thewh1teagle/phonikud-data/resolve/main/knesset_nikud_v4.txt.7z
sudo apt install p7zip-full -y
7z x knesset_nikud_v4.txt.7z
mv knesset_nikud_v4.txt data/train/
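Before training, it can help to confirm that the files in data/train really contain the added symbols. A rough sketch under stated assumptions: the function names are hypothetical, and the files are assumed to be UTF-8 text with a .txt extension.

```python
from collections import Counter
from pathlib import Path

# Symbols described under "Added Symbols"; the label names are illustrative.
MARKS = {"\u05ab": "hatama", "\u05bd": "shva_na", "|": "prefix"}


def count_marks(text: str) -> dict:
    """Count how often each added symbol occurs in text."""
    counts = {name: 0 for name in MARKS.values()}
    for ch in text:
        name = MARKS.get(ch)
        if name is not None:
            counts[name] += 1
    return counts


def count_marks_in_dir(train_dir: str = "data/train") -> dict:
    """Sum the counts over every .txt file in train_dir (assumed UTF-8)."""
    totals = Counter({name: 0 for name in MARKS.values()})
    for path in sorted(Path(train_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                totals.update(count_marks(line))
    return dict(totals)
```

Calling `count_marks_in_dir()` from the repository root and seeing zero for hatama or shva_na suggests the file has only plain nikud.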
Train
uv run src/train/main.py
Monitor loss
uv run tensorboard --logdir ./ckpt
Monitor GPU
uv pip install nvitop
uv run nvitop
Run
Run the model with:
uv run src/run/main.py -m path/to/checkpoint/
Export ONNX
See onnx_lib
Upload to HuggingFace
uv pip install huggingface_hub
git config --global credential.helper store # Allow cloning private repos from HF
huggingface-cli login --token "token" --add-to-git-credential # https://huggingface.co/settings/tokens
uv run huggingface-cli upload --repo-type model phonikud ./ckpt/last ./ckpt/last
# Fetch the model with
git lfs install
git clone https://huggingface.co/user/phonikud
# Fetch a file with
huggingface-cli download --repo-type dataset user/some-dataset some_file.7z --local-dir .
sudo apt install p7zip-full
7z x some_file.7z
Gotchas
- Hebrew is not printed in the terminal when using SSH
Run:
sudo locale-gen en_US.UTF-8
sudo update-locale LANG=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
Then, close the terminal and reconnect.
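After reconnecting, one quick way to check the result from Python is to ask whether the encoding of standard output can represent Hebrew. This is only a sketch with a hypothetical helper name; recent Python versions may pick UTF-8 even under a C locale, so a passing check does not prove the terminal itself renders Hebrew.

```python
import sys


def can_encode_hebrew(encoding) -> bool:
    """Return True if the named encoding can represent Hebrew letters."""
    try:
        # Four Hebrew letters; a missing encoding is treated as ASCII.
        "\u05e9\u05dc\u05d5\u05dd".encode(encoding or "ascii")
    except (UnicodeEncodeError, LookupError):
        return False
    return True


def stdout_can_print_hebrew() -> bool:
    """Check the encoding Python chose for standard output."""
    return can_encode_hebrew(getattr(sys.stdout, "encoding", None))
```

If `stdout_can_print_hebrew()` returns False, the locale settings above have probably not taken effect in the current session.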
TODO:
- Organize train/val/test splits -- track val performance over time, log to tensorboard/wandb, ...
- Check that hatama/shva targets are guaranteed to be aligned with tokenized characters (use return_offsets_mapping=True? cf. dictabert code)
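To make the alignment item concrete, one possible starting point is to split marked text into plain text plus per-letter targets, keyed by character index in the plain text. This is a hypothetical sketch, not the project's training code: it assumes each symbol belongs to the nearest preceding Hebrew letter in the same word, and it does not involve the tokenizer at all.

```python
# Hypothetical sketch; label names and attachment rule are assumptions.
MARKS = {"\u05ab": "hatama", "\u05bd": "shva_na", "|": "prefix"}


def is_hebrew_letter(ch: str) -> bool:
    """True for the letters alef (U+05D0) through tav (U+05EA)."""
    return "\u05d0" <= ch <= "\u05ea"


def extract_targets(text: str):
    """Return (plain_text, targets).

    targets maps the index of a letter in plain_text to the set of
    symbol names that followed that letter in the marked text.
    """
    plain = []
    targets = {}
    last_letter = None  # index in plain of the most recent Hebrew letter
    for ch in text:
        name = MARKS.get(ch)
        if name is None:
            if is_hebrew_letter(ch):
                last_letter = len(plain)
            elif ch.isspace():
                last_letter = None  # symbols never cross a word boundary
            plain.append(ch)
        elif last_letter is None:
            raise ValueError(f"{name} symbol has no preceding Hebrew letter")
        else:
            targets.setdefault(last_letter, set()).add(name)
    return "".join(plain), targets
```

The indices in `targets` could then be compared against the character offsets a tokenizer reports, which is the check the TODO item asks about.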
File details
Details for the file phonikud-0.1.0.tar.gz.
File metadata
- Download URL: phonikud-0.1.0.tar.gz
- Upload date:
- Size: 39.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b125be7032ec78dd56048165634696ca7a6d602446e67a14f6e8d228ae8a98a |
| MD5 | e858b27061a55339f7040547c3ce3545 |
| BLAKE2b-256 | 5a3f3805db5041043f9bbf3bd1e8ba1256394c0a02f889e0884b2fa97f0c8656 |
File details
Details for the file phonikud-0.1.0-py3-none-any.whl.
File metadata
- Download URL: phonikud-0.1.0-py3-none-any.whl
- Upload date:
- Size: 14.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6f0133038c5cc4a322e005d511020474feb9fe44ef7eac2fc735aded3fc1774f |
| MD5 | c2883458ae732a6c8390c0f35449ac24 |
| BLAKE2b-256 | ca9f49804166295c3f403f515ca3e590ab7c32738e123148e82c9601240f677a |