Rule-based Arabic (MSA) text-to-IPA with tashkeel diacritization — an orthography2ipa G2P plugin

arbtok

Rule-based Arabic (MSA) text→IPA with tashkeel diacritization — a downstream Arabic engine built on orthography2ipa.

A context-sensitive token tree (sentence → words → characters) implements the morpho-phonological rules that table lookups cannot express: sun-letter assimilation, hamzat al-waṣl elision, tanwīn pausal forms, tāʾ marbūṭa, mater-lectionis vowel lengthening, definite-article waṣl, and idghām/iqlāb nasal assimilation. Bare (undiacritized) text is diacritized first via text2tashkeel, a model picker over bundled ONNX diacritization models.
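To illustrate why a plain character table is not enough, here is a minimal sketch of one of the rules listed above, sun-letter assimilation. It is not arbtok's implementation: the function name `lam_assimilates` and the regex are hypothetical, and the sketch assumes the caller already knows the word carries the definite article (a word such as a Form VIII verb can also begin with alif + lām).

```python
import re

# Tashkeel marks U+064B..U+0652 (tanwin, short vowels, shadda, sukun).
TASHKEEL = re.compile("[\u064B-\u0652]")

# The 14 sun letters; the lam of the article assimilates to any of them.
SUN_LETTERS = frozenset("تثدذرزسشصضطظلن")


def lam_assimilates(word: str) -> bool:
    """Return True if the article's lam is silent before the next consonant.

    Assumption: `word` is known to start with the definite article.
    """
    letters = TASHKEEL.sub("", word)  # look at base letters only
    return (
        len(letters) >= 3
        and letters.startswith("ال")   # alif + lam
        and letters[2] in SUN_LETTERS  # the decision depends on the neighbour
    )


print(lam_assimilates("الشمس"))  # sun letter after the article
print(lam_assimilates("القمر"))  # moon letter after the article
```

The point of the sketch is that the pronunciation of lām depends on the letter that follows it, which is the kind of context a token tree can expose and a one-character lookup cannot.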

Honesty note: the gold IPA reference set was LLM-generated and has not been validated by a native MSA speaker. If you speak MSA, corrections via pull request are very welcome.

Installation

pip install arbtok

Usage

arbtok is built on orthography2ipa, which supplies the spec data and the shared G2PPlugin/WordContext base types. arbtok owns the Arabic pipeline; orthography2ipa stays the language-agnostic base library.

Engine class

from arbtok.tokenizer import Sentence

print(Sentence("اَلسَّلَامُ عَلَيْكُمْ").ipa)    # input is already diacritized

Bare text is handled by diacritizing first:

from arbtok.plugin import ArbtokG2PPlugin

plugin = ArbtokG2PPlugin()
plugin.transcribe("كتاب جميل")    # auto-tashkeel + IPA
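How the plugin decides that input is bare is not documented here. As a rough sketch, one way to test for it is to check for any tashkeel code point. The helper name `is_bare` and the choice of marks are assumptions made for illustration, not part of arbtok's API.

```python
# Tashkeel marks U+064B..U+0652, plus superscript alef U+0670.
# Assumption: this set is a reasonable definition of "diacritized".
TASHKEEL_MARKS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def is_bare(text: str) -> bool:
    """Return True if the text carries no tashkeel marks at all."""
    return not any(ch in TASHKEEL_MARKS for ch in text)


print(is_bare("كتاب جميل"))  # no marks present
```

Partially diacritized text would count as not bare under this check. A real pipeline might instead compare the number of marks to the number of letters.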

Diacritization only

from arbtok.tashkeel import TashkeelDiacritizer   # wraps text2tashkeel

TashkeelDiacritizer().diacritize("كتاب جميل")

Quality benchmarks

The test suite pins a gold sentence set (CER target ≤ 5% against the reference transcriptions) and benchmarks against espeak-ng. See tests/test_ipa_fuzzy.py and docs/ for details.
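For readers unfamiliar with the metric, character error rate (CER) is commonly computed as the Levenshtein distance between a hypothesis and its reference, divided by the reference length. The sketch below shows that standard definition. Whether the test suite normalizes strings or counts code points the same way is an assumption, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over code points, keeping one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against a non-empty reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abxd"))  # one substitution over four characters
```

Because the distance is counted per code point, IPA length marks and combining diacritics each count as a character in this sketch.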

