Rule-based Arabic (MSA) text-to-IPA with tashkeel diacritization — an orthography2ipa G2P plugin
arbtok
Rule-based Arabic (MSA) text→IPA with tashkeel diacritization — a downstream Arabic engine built on orthography2ipa.
A context-sensitive token tree (sentence → words → characters) implements the morpho-phonological rules that table lookups cannot express: sun-letter assimilation, hamzat al-waṣl elision, tanwīn pausal forms, tāʾ marbūṭa, mater-lectionis vowel lengthening, definite-article waṣl, and idgham/iqlab nasal assimilation. Bare (undiacritized) text is diacritized first via text2tashkeel — a model picker over bundled ONNX diacritization models.
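To make the point about context concrete, here is a minimal sketch of one such rule, sun-letter assimilation, where the lām of the definite article is silent and the following consonant is geminated. The names `SUN_LETTERS` and `article_assimilates` are hypothetical and are not part of arbtok's API. The check is purely orthographic, so it also fires on words that merely begin with alif-lām without being definite, which is the kind of case a fuller context-aware engine has to resolve.

```python
import re

# The 14 "sun letters": consonants that assimilate the lam of the
# definite article (e.g. al + shams is pronounced "ash-shams").
SUN_LETTERS = set("تثدذرزسشصضطظلن")

# Tashkeel marks U+064B..U+0652 plus tatweel U+0640, stripped before matching.
_MARKS = re.compile("[\u064B-\u0652\u0640]")


def article_assimilates(word: str) -> bool:
    """Return True if the word starts with alif-lam followed by a sun letter.

    Hypothetical helper for illustration only: it looks at spelling alone
    and does not verify that the leading alif-lam is really the article.
    """
    bare = _MARKS.sub("", word)
    return (
        len(bare) >= 3
        and bare[0] in "اٱ"      # plain alif or alif wasla
        and bare[1] == "ل"
        and bare[2] in SUN_LETTERS
    )


if __name__ == "__main__":
    for w in ["الشمس", "القمر", "اَلسَّلَامُ"]:
        print(w, article_assimilates(w))
```

A table lookup on individual characters cannot express this, because the pronunciation of the lām depends on the letter that follows it.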
Honesty note: the gold IPA reference set was LLM-generated and has not been validated by a native MSA speaker. If you speak MSA, pull requests are very welcome.
Installation
pip install arbtok
Usage
arbtok is built on orthography2ipa, which supplies the spec data and the shared G2PPlugin/WordContext base types. arbtok owns the Arabic pipeline, while orthography2ipa stays the language-agnostic base library.
Engine class
from arbtok.tokenizer import Sentence
Sentence("اَلسَّلَامُ عَلَيْكُمْ").ipa
Bare text is handled by diacritizing first:
from arbtok.plugin import ArbtokG2PPlugin
plugin = ArbtokG2PPlugin()
plugin.transcribe("كتاب جميل") # auto-tashkeel + IPA
Diacritization only
from arbtok.tashkeel import TashkeelDiacritizer # wraps text2tashkeel
TashkeelDiacritizer().diacritize("كتاب جميل")
Quality benchmarks
The test suite pins a gold sentence set, with a target character error rate (CER) of at most 5% against the reference transcriptions, and benchmarks against espeak-ng. See tests/test_ipa_fuzzy.py and docs/ for details.
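For readers unfamiliar with the metric, the sketch below shows one common way to compute a character error rate: the Levenshtein edit distance between the predicted and reference strings, divided by the reference length. This is an assumption about the definition, not a copy of the project's test code; the functions `levenshtein` and `cer` are illustrative names, and the actual suite may normalize or tokenize IPA differently (for example, this version counts combining marks as separate characters).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + cost,         # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis measured against reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substitution in a four-character reference gives a CER of 0.25.
    print(cer("abcd", "abxd"))
```

Under this definition, a 5% target means at most one character edit per twenty reference characters on average.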
File details
Details for the file arbtok-0.0.0a2.tar.gz.
File metadata
- Download URL: arbtok-0.0.0a2.tar.gz
- Upload date:
- Size: 104.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed05542099740ec34d0e3dbc20e12984d7235f953c2e3ec7db2d35e58e10d5b6 |
| MD5 | 44783674a61635c7fd0d54633547925a |
| BLAKE2b-256 | aca41fe89f1ff494f2805e181d676668e086f13b718bddec039f5fc90f1cedc8 |
File details
Details for the file arbtok-0.0.0a2-py3-none-any.whl.
File metadata
- Download URL: arbtok-0.0.0a2-py3-none-any.whl
- Upload date:
- Size: 100.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eb59999eb028d4702647d2265d293a9458e65da17c7fa20aac136d0f91893a14 |
| MD5 | da4ba0ff7e1f61a1b5270b9ced33b196 |
| BLAKE2b-256 | 860407328a4796a229775e9b7b188e5391b1e8a73fcbbb8cf0b9d315f20c4c39 |