Convert Hebrew text into IPA for TTS systems and learning

Phonikud

Phonikud is a Hebrew grapheme-to-phoneme (G2P) engine that converts text into IPA for TTS and learning.

Project Page | Paper

Features

  • Nikud model with phonetic marks 🧠
  • Convert nikud text to modern spoken phonemes 🗣️
  • Expand dates, numbers, etc. 📚
  • Handle mixed English/Hebrew with fallback 🌍
  • Real-time ONNX model support 💫
  • Lightweight TTS library: phonikud-tts 🎤

The package is also available on PyPI: pypi.org/project/phonikud

Install

Due to ongoing development, it is recommended to install directly from git.

pip install git+https://github.com/thewh1teagle/phonikud

Usage

from phonikud import phonemize
phonemes = phonemize('שָׁלוֹם עוֹלָם')
print(phonemes) # ʃalˈom olˈam

Note: Phonikud expects text with diacritics and phonetic marks.

Please use phonikud-onnx for adding diacritics.
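
Since the phonemizer expects diacritized input, it can help to check for nikud before calling it. The sketch below is not part of Phonikud: `has_nikud` is a hypothetical helper, and it assumes that "diacritics" means the combining Hebrew points in U+05B0..U+05C7.

```python
import unicodedata

# Assumption: Hebrew points live in U+05B0..U+05C7. Non-combining characters
# in that range (e.g. Maqaf, U+05BE) are skipped via the category check.
NIKUD_START, NIKUD_END = 0x05B0, 0x05C7


def has_nikud(text: str) -> bool:
    """Return True if the text contains at least one combining Hebrew point."""
    return any(
        NIKUD_START <= ord(ch) <= NIKUD_END and unicodedata.category(ch) == "Mn"
        for ch in text
    )
```

A caller could use this to decide whether the text must first go through a diacritization model.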

Examples

See examples

Play 🕹️

See TTS with Hebrew Space

See Phonemize with Hebrew Space

Docs 📚

  • It is recommended to add nikud with the phonikud-onnx model
  • Hebrew nikud is normalized
  • Most Hebrew rules are handled in phonemize.py - a fast rule-based FST for converting text to phonemes.
  • It is highly recommended to normalize Hebrew using phonikud.normalize('שָׁלוֹם') when training models

Nikud set and symbols

  • Chars from \u05b0 to \u05ea (Letters and diacritics)
  • ' and " (Geresh and Gershaim)
  • \u05ab (Hat'ama, e.g. טח֫ינה vs. טחינ֫ה: tahini vs. grinding)
  • \u05bd (Vocal Shva, e.g. תְֽפרְסם; note the Meteg on ת)
  • | (Prefix letters, e.g. ב|ירושלים)

Using \u05ab and \u05bd this way is not standard - we repurposed them to mark Hat'ama and Vocal Shva clearly.
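
To make the special marks concrete, here is a minimal sketch that inspects a single word for them. It only uses the code points listed above; `inspect_marks` is a hypothetical helper and does not reflect how Phonikud parses text internally.

```python
HATAMA = "\u05ab"      # marks the stressed syllable (Hat'ama)
VOCAL_SHVA = "\u05bd"  # marks a pronounced Shva
PREFIX_SEP = "|"       # separates prefix letters from the rest of the word


def inspect_marks(word: str) -> dict:
    """Report which of the special marks appear in a single word."""
    return {
        # Index of the Hat'ama mark in the string, or -1 if absent.
        "hatama_index": word.find(HATAMA),
        "has_vocal_shva": VOCAL_SHVA in word,
        # Text before the first separator, or "" if there is no prefix.
        "prefix": word.split(PREFIX_SEP, 1)[0] if PREFIX_SEP in word else "",
    }
```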

See Hebrew UTF-8

Hebrew phonemes 🔠

Stress marks (1)

  • ˈ - stress, visually looks like single quote, but it's \u02c8

Vowels (5)

  • a - Shamar
  • e - Shemer
  • i - Shimer
  • o - Shomer
  • u - Shumar

Consonants (24)

  • b - Bet
  • v - Vet, Vav
  • d - Daled
  • h - Hey
  • z - Zain
  • χ - Het, Haf
  • t - Taf, Tet
  • j - Yud
  • k - Kuf, Kaf
  • l - Lamed
  • m - Mem
  • n - Nun
  • s - Sin, Samekh
  • f - Fey
  • p - Pey
  • ts - Tsadik
  • tʃ - Tsadik with Geresh (צִ'יפְּס)
  • w - Example: וָואלָה
  • ʔ - Alef/Ayin, visually looks like ?, but it's \u0294
  • ɡ - Gimel, visually looks like g, but it's actually \u0261
  • ʁ - Resh \u0281
  • ʃ - Shin \u0283
  • ʒ - Zain with Geresh (בֵּז׳) \u0292
  • dʒ - Gimel with Geresh (גִּ׳ירָפָה)
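
The inventory above can be used to sanity-check phonemizer output. The following is a minimal sketch, not Phonikud code: it assumes the two Geresh affricates are tʃ and dʒ, and it tokenizes greedily (longest match first), so "ts" is always read as one phoneme rather than t + s.

```python
STRESS = "\u02c8"  # ˈ
VOWELS = {"a", "e", "i", "o", "u"}
CONSONANTS = {
    "b", "v", "d", "h", "z", "t", "j", "k", "l", "m", "n", "s", "f", "p", "w",
    "ts", "t\u0283", "d\u0292",    # ts, tʃ, dʒ
    "\u03c7", "\u0294", "\u0261",  # χ, ʔ, ɡ
    "\u0281", "\u0283", "\u0292",  # ʁ, ʃ, ʒ
}
INVENTORY = VOWELS | CONSONANTS | {STRESS}


def tokenize(phonemes: str) -> list[str]:
    """Split a phoneme string into inventory symbols, skipping whitespace."""
    tokens = []
    i = 0
    while i < len(phonemes):
        if phonemes[i].isspace():
            i += 1
            continue
        for size in (2, 1):  # try two-character phonemes first
            chunk = phonemes[i:i + size]
            if chunk in INVENTORY:
                tokens.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError(f"unknown phoneme at index {i}: {phonemes[i]!r}")
    return tokens
```

A `ValueError` here points at a symbol outside the 24 consonants, 5 vowels and the stress mark, for example a plain ASCII `g` where `\u0261` was expected.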

Mixed English 🌎

You can phonemize mixed English text by providing a fallback function that accepts an English string and returns phonemes. Note: if you use this with TTS, it is recommended to train the model on phonemized English as well; otherwise the model may not recognize the English phonemes correctly. Cool fact: most modern Hebrew phonemes also exist in English, except ʔ (Alef/Ayin), ʁ (Resh) and χ (Het).
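
The shape of such a fallback (string in, phonemes out) can be illustrated without Phonikud itself. In this standalone sketch, `apply_fallback` and `toy_english_fallback` are hypothetical names, and the lookup table stands in for a real English G2P; how Phonikud actually receives and calls the fallback is not shown here.

```python
import re
from typing import Callable

# Runs of ASCII letters are treated as "English" (a simplifying assumption).
LATIN_RUN = re.compile(r"[A-Za-z]+")


def toy_english_fallback(text: str) -> str:
    """Hypothetical stand-in for an English G2P: a tiny lookup table."""
    table = {"hello": "h\u0259l\u02c8o\u028a"}
    # Unknown words are returned unchanged in this toy version.
    return table.get(text.lower(), text)


def apply_fallback(text: str, fallback: Callable[[str], str]) -> str:
    """Send each Latin-script run through the fallback, leave the rest as-is."""
    return LATIN_RUN.sub(lambda m: fallback(m.group(0)), text)
```

In a real setup the fallback would wrap an English phonemizer, and the TTS model would need to have seen its phoneme set during training.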

How It Works 🔧

To train TTS models, speech must be represented accurately. Plain Hebrew text is ambiguous without diacritics, and even with them, Vocal Shva and Hat'ama remain ambiguous. For example, "אני אוהב אורז" (I like rice) and "אני אורז מזוודה" (I pack a suitcase) share the same diacritics for "אורז" but differ in Hat'ama.

The workflow is as follows:

  1. Add diacritics using a standard Nakdan.

  2. Enrich the diacritics with an enhanced Nakdan that adds the invented diacritics for Hat'ama and Vocal Shva. See phonikud.

  3. Convert the text with diacritics to phonemes (alphabet characters that represent sounds) using this library, which applies hand-coded rules.

  4. Train the TTS model on phonemes, and at runtime, feed the model phonemes to generate speech.

This ensures accurate and clear speech synthesis. Since the output phonemes are similar to English, we can fine-tune an English model with as little as one hour of Hebrew data.
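
Steps 1 to 3 of the workflow compose as a simple chain of text-to-text stages. The sketch below only shows that composition: every stage is passed in as a callable, and `text_to_phonemes` with its parameter names is hypothetical rather than an actual Phonikud API.

```python
from typing import Callable

Stage = Callable[[str], str]


def text_to_phonemes(
    text: str,
    add_nikud: Stage,    # step 1: standard Nakdan
    enhance: Stage,      # step 2: add Hat'ama and Vocal Shva marks
    to_phonemes: Stage,  # step 3: rule-based G2P
) -> str:
    """Run the three text stages in order and return the phoneme string."""
    return to_phonemes(enhance(add_nikud(text)))
```

The resulting phoneme string is what the TTS model is trained on and what it receives at runtime (step 4).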

ℹ️ Limitations

  • Some of the nikud may sound a bit formal - similar to other models
  • Some words share the same nikud but differ in Hat'ama - the prediction is not always accurate
  • Basic support for non-words (gibberish, typos) - not always handled
  • Names and non-Hebrew words are sometimes predicted incorrectly

💡 You can always pass your own phonemes using markdown-like syntax:
[...title](/ʔantsiklopˈedja/)
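
To show what this syntax encodes, here is a minimal, independent sketch that extracts the overrides with a regular expression. It assumes the form `[text](/phonemes/)` exactly as in the example above and that the phonemes contain no `/`; Phonikud's own parsing may differ.

```python
import re

# [surface text](/phonemes/) -> group 1: surface text, group 2: phonemes
OVERRIDE = re.compile(r"\[([^\]]*)\]\(/([^/]*)/\)")


def find_overrides(text: str) -> list[tuple[str, str]]:
    """Return (surface text, phonemes) pairs for every override in the text."""
    return OVERRIDE.findall(text)


def apply_overrides(text: str) -> str:
    """Replace each override with the phonemes supplied by the author."""
    return OVERRIDE.sub(lambda m: m.group(2), text)
```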

🧠 Future Ideas

  • Multilingual LLM Expander

    Expand numbers, emojis, dates, times, and more using a lightweight multilingual LLM or transformer.
    The idea is to train a small model on pairs of raw text → expanded text, making it easier to generate speech-friendly inputs.

  • Punctuation model

    Train a model to restore missing punctuation for better intonation.

  • Transformer/LLM G2P

    Skip coding rules - build a dataset with the current G2P, then train an end-to-end model on text to phonemes.

Datasets

Notes

  • The default schema is modern. You can use the plain schema for simplicity (e.g. x instead of χ): phonemize(..., schema='plain')
  • There's no secondary stress (only Milel and Milra)
  • The ʔ/h phonemes are trimmed from the suffix
  • Stress usually falls on the last syllable (Milra), sometimes on the one before it (Milel), and rarely on the one before Milel
  • The stress mark is always placed immediately before the vowel of the stressed syllable, NOT before the first character of the syllable
  • See Unicode Hebrew table
  • See Modern Hebrew phonology
  • Initially we called Vocal Shva "Shva Na", but we learned that in modern Hebrew the spoken Shva differs from the written Shva Na. A catchy name for it: שווא נשמע. See Shva#Pronunciation_in_Modern_Hebrew
  • To type Hebrew diacritics, use Right ALT (Windows), Left Option (macOS), or long-press the corresponding letter (Google Keyboard), based on the diacritic's name. E.g. for Kamatz use Alt + ק, for Hat'ama use Alt + ^, and for Vocal Shva use Alt + &
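
The stress-placement note can be checked mechanically. This is a minimal sketch assuming the five vowels listed earlier and the stress mark \u02c8; `stress_before_vowel` is a hypothetical helper, not a Phonikud function.

```python
STRESS = "\u02c8"  # ˈ
VOWELS = "aeiou"


def stress_before_vowel(phonemes: str) -> bool:
    """Return True if every stress mark is immediately followed by a vowel."""
    return all(
        i + 1 < len(phonemes) and phonemes[i + 1] in VOWELS
        for i, ch in enumerate(phonemes)
        if ch == STRESS
    )
```

For the usage example's output, the mark sits before `o` and `a`, so the check passes; a mark placed before a consonant (at the start of the syllable) would fail.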

Testing 🧪

Run uv run pytest

Citation

If you find this code or our data helpful in your research or work, please cite the following paper.

@misc{kolani2025phonikud,
  title={Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech},
  author={Yakov Kolani and Maxim Melichov and Cobi Calev and Morris Alper},
  year={2025},
  eprint={2506.12311},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.12311},
}

Credits

Special thanks ❤️ to dicta-il for their amazing Hebrew diacritics model ✨ and the dataset that made this possible!
