A multilingual phonemizer combining lexica, NLP, and probabilistic scoring for improved phonemization accuracy.

OLaPh — Optimal Language Phonemizer

License: MIT

OLaPh (Optimal Language Phonemizer) is a multilingual phonemization framework that converts text into phonemes with higher accuracy than comparable frameworks.


NEWS

  • 05/2026: The instruction finetuning dataset for OLaPhLLM is available here
  • 05/2026: A new version of OLaPhLLM is available here


Overview

Traditional phonemizers rely on simple rule-based mappings or lexicon lookups. Neural and hybrid approaches improve generalization but still struggle with:

  • Names and foreign words
  • Abbreviations and acronyms
  • Loanwords and compounds
  • Ambiguous homographs

OLaPh tackles these challenges by combining:

  • Extensive language-specific dictionaries
  • Abbreviation, number, and letter normalization
  • Compound resolution with probabilistic scoring
  • Cross-language handling
  • NLP-based preprocessing via spaCy and Lingua

Evaluations on the WikiPron dataset show improved accuracy and robustness over existing phonemizers, including on out-of-vocabulary (OOV) words.


Features

  • Multilingual phonemization (DE, EN-US, EN-UK, FR, ES, NL, SV, DA, PL, IT, FI)
  • Abbreviation and letter pronunciation dictionaries
  • Number normalization
  • Cross-language acronym detection
  • Compound splitting with statistical scoring
  • Freely available lexica for research and development derived from wiktionary.org.
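One common way to realize compound splitting with statistical scoring is a dynamic-programming search over all segmentations, scored by summed log-probabilities of the parts. The sketch below illustrates that general technique only. It does not claim to match OLaPh's scoring function, and the counts in `COUNTS` as well as the name `split_compound` are hypothetical.

```python
import math

# Hypothetical toy subword counts; not taken from OLaPh's lexica.
COUNTS = {"haus": 50, "tür": 30, "schlüssel": 20, "haustür": 5, "tor": 10}


def split_compound(word, counts=COUNTS):
    """Return the segmentation with the highest summed log-probability.

    Returns None if the word cannot be fully covered by known parts.
    """
    word = word.lower()
    total = sum(counts.values())
    n = len(word)
    # best[i] holds (score, parts) for the best segmentation of word[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is None or piece not in counts:
                continue
            score = best[start][0] + math.log(counts[piece] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n][1] if best[n] is not None else None


print(split_compound("Haustürschlüssel"))  # ['haus', 'tür', 'schlüssel']
```

With these toy counts, the three-part split wins over `haustür` + `schlüssel` because its parts are individually far more frequent. Each part could then be looked up in the lexicon and the resulting phoneme strings concatenated.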

Large Language Model

An LLM based on OLaPh output is also available. It is a GemmaX 2B model trained on ~10M sentences derived from the FineWeb corpus and phonemized with the OLaPh framework.

Find it here on Hugging Face (DE, EN, FR, US; training for additional languages is planned).


Installation

From PyPI

pip install olaph

spaCy models are downloaded on demand.

From source

git clone https://github.com/iisys-hof/olaph.git
cd olaph
pip install -e .

Example Usage

from olaph import Olaph

phonemizer = Olaph()

output = phonemizer.phonemize_text("He ordered a Brezel and a beer in a tavern near München.", lang="en-us")

print(output)
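For several sentences at once, a small wrapper around the call shown above may be convenient. This is a sketch: `phonemize_batch` is a hypothetical helper, not part of the package, and it relies only on `phonemize_text(text, lang=...)` as used in the example. Language codes other than `"en-us"` are not confirmed here.

```python
from typing import Iterable, List, Tuple


def phonemize_batch(phonemizer, items: Iterable[Tuple[str, str]]) -> List[str]:
    """Phonemize (text, lang) pairs in order.

    `phonemizer` is any object exposing phonemize_text(text, lang=...),
    such as the Olaph instance from the example above.
    """
    results = []
    for text, lang in items:
        results.append(phonemizer.phonemize_text(text, lang=lang))
    return results


# Usage, assuming OLaPh is installed:
#   from olaph import Olaph
#   outputs = phonemize_batch(Olaph(), [("A beer in a tavern.", "en-us")])
```

Creating the `Olaph()` instance once and reusing it across calls avoids repeating any setup work it performs.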

Dependencies


Dictionary Sources

Research Summary

Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary (OOV) terms. This work introduces OLaPh (Optimal Language Phonemizer), a hybrid framework that integrates extensive multilingual lexica with advanced NLP techniques and a statistical subword segmentation function. Evaluations on the WikiPron benchmark show that the OLaPh framework significantly outperforms established baselines in overall accuracy and maintains robustness on OOV data through advanced fallback mechanisms. To further explore neural generalization, we utilize the framework to synthesize a high-consistency training corpus for an instruction-tuned Large Language Model (LLM). While the deterministic framework remains more accurate overall, the LLM demonstrates strong generalization, matching or partly exceeding the framework’s performance. This suggests that the LLM successfully internalized phonetic intuitions from the synthetic data that transcend the framework’s capabilities. Together, these tools provide a comprehensive, open-source resource for multilingual G2P research.


Citation

If you use OLaPh in academic work, please cite:

@misc{wirth2026olaphoptimallanguagephonemizer,
      title={OLaPh: Optimal Language Phonemizer}, 
      author={Johannes Wirth},
      year={2026},
      eprint={2509.20086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.20086}, 
}
