Convert Hebrew text into IPA for TTS systems and learning
Phonikud
Grapheme to phoneme in Hebrew
Features
- Nikud model with phonetic marks 🧠
- Convert nikud text to modern spoken phonemes 🗣️
- Expand dates, numbers, etc 📚
- Handle mixed English/Hebrew with fallback 🌍
- Real time onnx model support 💫
Install
Due to ongoing development, it is recommended to install directly from git.
pip install git+https://github.com/thewh1teagle/phonikud
The package is also available on PyPI at pypi.org/project/phonikud-hebrew
Play 🕹️
See Phonemize with Hebrew Space
Usage
from phonikud import phonemize
phonemes = phonemize('שָׁלוֹם עוֹלָם')
print(phonemes) # ʃalˈom olˈam
Note: Phonikud expects text with diacritics and phonetic marks.
Please use phonikud-onnx to add diacritics.
Examples
See examples
Docs 📚
- It is recommended to add nikud with the phonikud-onnx model
- Hebrew nikud is normalized
- Most of the Hebrew rules happen in phonemize.py
- It is highly recommended to normalize Hebrew using phonikud.normalize('שָׁלוֹם') when training models
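To illustrate why normalizing nikud matters before training, here is a minimal sketch that uses only the Python standard library. It does not reproduce what phonikud.normalize does internally (that is not described here); it only shows that two pointed strings that render identically can differ in code point order, and that Unicode NFD brings them to one canonical form.

```python
import unicodedata

SHIN = "\u05e9"
QAMATS = "\u05b8"    # vowel point
SHIN_DOT = "\u05c1"  # dot that distinguishes Shin from Sin

# The same pointed letter, typed with its marks in two different orders.
a = SHIN + SHIN_DOT + QAMATS
b = SHIN + QAMATS + SHIN_DOT


def canonical(text: str) -> str:
    """Return text with combining marks in Unicode canonical order (NFD)."""
    return unicodedata.normalize("NFD", text)


print(a == b)                        # False
print(canonical(a) == canonical(b))  # True
```

Without a step like this, a model could see the same word as two different character sequences.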
Nikud set and symbols
- Chars from \u05b0 to \u05ea (Letters and nikud)
- ' and " (Gershaim)
- \u05ab (Hat'ama)
- \u05bd (Shva Na)
- | (Prefix letters)
The use of \u05ab and \u05bd here is not standard: we chose them to mark Hat'ama and Shva Na clearly.
See Hebrew UTF-8
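The character set above can be checked with a few lines of Python. This is a rough sketch, not part of the library: the helper names are hypothetical, and the placement of the Hat'ama mark in the sample string is illustrative only.

```python
HATAMA = "\u05ab"   # stress mark (Hat'ama), as used by Phonikud
SHVA_NA = "\u05bd"  # vocal shva mark (Shva Na), as used by Phonikud
PREFIX = "|"        # separates prefix letters from the rest of the word
EXTRA = {"'", '"', HATAMA, SHVA_NA, PREFIX}


def is_hebrew_char(ch: str) -> bool:
    """True for code points from U+05B0 to U+05EA (letters and nikud)."""
    return "\u05b0" <= ch <= "\u05ea"


def unexpected_chars(text: str) -> list:
    """Return characters outside the documented set (whitespace is ignored)."""
    return [
        ch for ch in text
        if not ch.isspace() and not is_hebrew_char(ch) and ch not in EXTRA
    ]


def count_marks(text: str) -> dict:
    """Count the Phonikud-specific marks in a string."""
    return {
        "hatama": text.count(HATAMA),
        "shva_na": text.count(SHVA_NA),
        "prefix": text.count(PREFIX),
    }


# A pointed word with one Hat'ama mark (mark position is illustrative).
sample = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05ab\u05dd"
print(unexpected_chars(sample))  # []
print(count_marks(sample))       # {'hatama': 1, 'shva_na': 0, 'prefix': 0}
```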
Hebrew phonemes 🔠
Stress marks (1)
- ˈ - stress, visually looks like a single quote, but it's \u02c8
Vowels (5)
- a - Shamar
- e - Shemer
- i - Shimer
- o - Shomer
- u - Shumar
Consonants (24)
- b - Bet
- v - Vet, Vav
- d - Daled
- h - Hey
- z - Zain
- χ - Het, Haf
- t - Taf, Tet
- j - Yud
- k - Kuf, Kaf
- l - Lamed
- m - Mem
- n - Nun
- s - Sin, Samekh
- f - Fey
- p - Pey
- ts - Tsadik
- tʃ - Tsadik with Geresh (צִ'יפְּס)
- w - Example: וָואלָה
- ʔ - Alef/Ayin, visually looks like ?, but it's \u0294
- ɡ - Gimel, visually looks like g, but it's actually \u0261
- ʁ - Resh \u0281
- ʃ - Shin \u0283
- ʒ - Zain with Geresh (בֵּז׳) \u0292
- dʒ - Gimel with Geresh (גִּ׳ירָפָה)
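When feeding phonemes to a TTS model, it can help to split a phoneme string into the symbols listed above. The following is a minimal sketch, not the library's own tokenizer: it assumes a greedy longest-match strategy so that ts, tʃ and dʒ are kept as single units, which means a genuine t followed by s would also be read as ts.

```python
STRESS = "\u02c8"  # looks like a single quote, but is a distinct character
VOWELS = ["a", "e", "i", "o", "u"]
CONSONANTS = [
    "b", "v", "d", "h", "z", "χ", "t", "j", "k", "l", "m", "n",
    "s", "f", "p", "ts", "tʃ", "w", "ʔ", "\u0261", "ʁ", "ʃ", "ʒ", "dʒ",
]
# Longest symbols first, so two-character phonemes win over their prefixes.
SYMBOLS = sorted(VOWELS + CONSONANTS + [STRESS], key=len, reverse=True)


def tokenize(phonemes: str) -> list:
    """Split a phoneme string into inventory symbols (greedy longest match)."""
    tokens = []
    i = 0
    while i < len(phonemes):
        for sym in SYMBOLS:
            if phonemes.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # Spaces, punctuation and unknown characters pass through as-is.
            tokens.append(phonemes[i])
            i += 1
    return tokens


print(tokenize("ʃalˈom"))  # ['ʃ', 'a', 'l', 'ˈ', 'o', 'm']
print(tokenize("tʃips"))   # ['tʃ', 'i', 'p', 's']
```

Note that the ASCII letter g is not in the inventory; only \u0261 is.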
Mixed English 🌎
You can mix the phonemization of English by providing a fallback function that accepts an English string and returns phonemes.
Note: if you use this with TTS, it is recommended to train the model on phonemized English. Otherwise, the model may not recognize the phonemes correctly.
Cool fact: most modern Hebrew phonemes also exist in English, except ʔ (Alef/Ayin), ʁ (Resh) and χ (Het).
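A fallback is simply a function that accepts an English string and returns phonemes. The sketch below uses a toy dictionary as a stand-in for a real English G2P tool; the lexicon, the function name, and the `fallback=` keyword shown in the trailing comment are assumptions made for illustration, so check the library's examples for the exact wiring.

```python
import re

# Hypothetical toy lexicon; a real setup would call an English G2P tool.
TOY_LEXICON = {"hello": "həlˈoʊ", "world": "wˈɜːld"}


def english_fallback(text: str) -> str:
    """Accept an English string and return phonemes (toy dictionary lookup)."""
    words = re.findall(r"[A-Za-z']+", text)
    # Unknown words are passed through in lowercase instead of being dropped.
    return " ".join(TOY_LEXICON.get(w.lower(), w.lower()) for w in words)


print(english_fallback("Hello world"))  # həlˈoʊ wˈɜːld

# Assumed wiring (the keyword name is not confirmed by the text above):
# from phonikud import phonemize
# phonemize("שָׁלוֹם world", fallback=english_fallback)
```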
How It Works 🔧
To train TTS models, it’s essential to represent speech accurately. Plain Hebrew text is ambiguous without diacritics, and even with them, Shva Na and Hat'ama can cause confusion. For example, "אני אוהב אורז" (I like rice) and "אני אורז מזוודה" (I pack a suitcase) share the same diacritics for "אורז" but have different Hat'ama.
The workflow is as follows:
- Add diacritics using a standard Nakdan.
- Enhance the diacritics with an enhanced Nakdan that adds invented diacritics for Hat'ama and Shva Na. See phonikud
- Convert the text with diacritics to phonemes (alphabet characters that represent sounds) using this library, based on coding rules.
- Train the TTS model on phonemes, and at runtime, feed the model phonemes to generate speech.
This helps produce accurate and clear speech synthesis. Since the output phonemes are similar to English ones, an English model can be fine-tuned with as little as one hour of Hebrew data.
ℹ️ Limitations
- Some of the nikud may sound a bit formal, similar to other models
- Some words share the same nikud but differ in Hat'ama, so the prediction is not always accurate
- Non-words (gibberish, typos) have only basic support and are not always handled
- Names and non-Hebrew words are sometimes predicted incorrectly
💡 You can always pass your own phonemes using markdown-like syntax:
[...title](/ʔantsiklopˈedja/)
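To make the syntax concrete, here is a small sketch that splits a string into plain segments and segments with hand-written phonemes. The regular expression is inferred from the single example above and is an assumption; how the library itself parses this syntax is not described here.

```python
import re

# Matches [text](/phonemes/) as in the example above (pattern is an assumption).
OVERRIDE = re.compile(r"\[([^\]]*)\]\(/([^/]*)/\)")


def split_overrides(text: str) -> list:
    """Return (segment, phonemes) pairs; phonemes is None for plain text."""
    parts = []
    pos = 0
    for match in OVERRIDE.finditer(text):
        if match.start() > pos:
            parts.append((text[pos:match.start()], None))
        parts.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        parts.append((text[pos:], None))
    return parts


print(split_overrides("a [b](/c/) d"))
# [('a ', None), ('b', 'c'), (' d', None)]
```

Segments paired with None would go through normal phonemization, while the others would use the supplied phonemes verbatim.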
🧠 Future Ideas
- Multilingual LLM Expander
  Expand numbers, emojis, dates, times, and more using a lightweight multilingual LLM or transformer.
  The idea is to train a small model on pairs of raw text → expanded text, making it easier to generate speech-friendly inputs.
- Transformer/LLM G2P
  Skip coding rules - make a dataset with current G2P, then train a model end-to-end on text to phonemes.
Datasets
- ILSpeech (speech, MIT)
- RanSpeech (speech, non commercial)
- Saspeech (speech, non commercial)
- phonikud-data (nikud and phonetics, cc-4.0)
Notes
- The default schema is modern. You can use the plain schema for simplicity (e.g. x instead of χ). Use phonemize(..., schema='plain')
- There's no secondary stress (only Milel and Milra)
- The ʔ/h phonemes are trimmed from the suffix
- Stress is usually placed on the last syllable (Milra), sometimes on the one before it (Milel), and rarely on the one before Milel
- Stress should always be placed in the syllable right before the vowel and NOT on the first character of the syllable
- See Unicode Hebrew table
- See Modern Hebrew phonology
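The stress conventions in these notes can be turned into simple sanity checks on phonemizer output. The sketch below is not part of the library: the function names are hypothetical, and "no secondary stress" is interpreted here as "at most one stress mark per word".

```python
STRESS = "\u02c8"
VOWELS = set("aeiou")


def stress_before_vowel(phonemes: str) -> bool:
    """True if every stress mark is immediately followed by a vowel."""
    for i, ch in enumerate(phonemes):
        if ch == STRESS:
            if i + 1 >= len(phonemes) or phonemes[i + 1] not in VOWELS:
                return False
    return True


def single_stress_per_word(phonemes: str) -> bool:
    """True if no whitespace-separated word has more than one stress mark."""
    return all(word.count(STRESS) <= 1 for word in phonemes.split())


print(stress_before_vowel("ʃalˈom olˈam"))     # True
print(stress_before_vowel("ʃaˈlom"))           # False: mark precedes a consonant
print(single_stress_per_word("ʃalˈom olˈam"))  # True
```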
Paper 📑
See phonikud-paper
Testing 🧪
Run uv run pytest
Credits
Special thanks ❤️ to dicta-il for their amazing Hebrew diacritics model ✨ and the dataset that made this possible!
File details
Details for the file phonikud-0.3.7.tar.gz.
File metadata
- Download URL: phonikud-0.3.7.tar.gz
- Upload date:
- Size: 133.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b953ebc08c0b98e6883a2472864693d326d4f3f7ba03d952322309f1237f2612 |
| MD5 | ff2a4a2a4638be55356694fb251fc8e8 |
| BLAKE2b-256 | 8a4da5f4a38dc2b45f53e04496453bfe25759dec511da641663262a0a833f270 |
File details
Details for the file phonikud-0.3.7-py3-none-any.whl.
File metadata
- Download URL: phonikud-0.3.7-py3-none-any.whl
- Upload date:
- Size: 28.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6e62a55a0f23270926ca7d969cb7525d4080dc1d58ae80a66f2c2d473dbc53d0 |
| MD5 | bbd8c0b95764d3f255cf520184997ce8 |
| BLAKE2b-256 | e7a798d6a92a9654e4e7e1d9833b9456c287de5fec7388e35581acc2ca698155 |