Rule-based Arabic (MSA) text-to-IPA with tashkeel diacritization — an orthography2ipa G2P plugin

arbtok

Rule-based Arabic (MSA) text→IPA with tashkeel diacritization — a downstream Arabic engine built on orthography2ipa.

A context-sensitive token tree (sentence → words → characters) implements the morpho-phonological rules that table lookups cannot express: sun-letter assimilation, hamzat al-waṣl elision, tanwīn pausal forms, tāʾ marbūṭa, mater-lectionis vowel lengthening, definite-article waṣl, and idgham/iqlab nasal assimilation. Bare (undiacritized) text is diacritized first via text2tashkeel — a model picker over bundled ONNX diacritization models.
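Sun-letter assimilation, the first rule in that list, can be illustrated with a toy check. This is a minimal sketch and not arbtok's implementation: the names `SUN_LETTERS`, `strip_marks` and `article_assimilates` are hypothetical, and the check simply assumes that a leading ال is the definite article, which a real engine has to decide from context.

```python
import unicodedata

# The 14 "sun letters": the lām of the definite article assimilates to them.
SUN_LETTERS = set("تثدذرزسشصضطظلن")


def strip_marks(word: str) -> str:
    """Drop combining marks (tashkeel) so only base letters remain."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def article_assimilates(word: str) -> bool:
    """Return True if the word starts with ال followed by a sun letter.

    Assumption: a leading ال is treated as the definite article without
    any further morphological check.
    """
    bare = strip_marks(word)
    return len(bare) > 2 and bare.startswith("ال") and bare[2] in SUN_LETTERS


print(article_assimilates("الشمس"))  # sun letter ش after the article
print(article_assimilates("القمر"))  # moon letter ق after the article
```

When the check is true, the lām is not pronounced and the following consonant is geminated instead, which is why a plain letter-to-phone table cannot produce the right output on its own.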

Honesty note: the gold IPA reference set was LLM-generated and has not been validated by a native MSA speaker. If you speak MSA, pull requests are very welcome.

Installation

pip install arbtok

Usage

arbtok builds on orthography2ipa, which supplies the spec data and the shared G2PPlugin/WordContext base types. arbtok itself owns the Arabic pipeline; orthography2ipa stays the language-agnostic base library.

Engine class

from arbtok.tokenizer import Sentence

Sentence("اَلسَّلَامُ عَلَيْكُمْ").ipa

Bare text is handled by diacritizing first:

from arbtok.plugin import ArbtokG2PPlugin

plugin = ArbtokG2PPlugin()
plugin.transcribe("كتاب جميل")    # auto-tashkeel + IPA
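How the plugin decides that input is bare is not described here. One plausible heuristic, shown purely as an assumption, is to look for any tashkeel mark in the Unicode range U+064B to U+0652 (the tanwīn marks, short vowels, shadda and sukūn). The names `TASHKEEL` and `is_bare` are hypothetical and are not part of arbtok's API.

```python
# Tashkeel marks U+064B (fathatan) through U+0652 (sukun), inclusive.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def is_bare(text: str) -> bool:
    """Return True if the text carries no tashkeel marks at all.

    Assumption: partially diacritized text counts as diacritized here;
    a real pipeline may want a density threshold instead.
    """
    return not any(ch in TASHKEEL for ch in text)


for sample in ("كتاب جميل", "اَلسَّلَامُ عَلَيْكُمْ"):
    print(sample, is_bare(sample))
```

A caller could use such a check to skip the diacritization step for text that already carries its vowels.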

Diacritization only

from arbtok.tashkeel import TashkeelDiacritizer   # wraps text2tashkeel

TashkeelDiacritizer().diacritize("كتاب جميل")

Quality benchmarks

The test suite pins a gold sentence set with a character error rate (CER) target of ≤ 5% against the reference transcriptions, and benchmarks the output against espeak-ng. See tests/test_ipa_fuzzy.py and docs/ for details.
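For readers unfamiliar with the metric, CER is conventionally the Levenshtein edit distance between hypothesis and reference divided by the reference length. The sketch below shows that standard definition; whether tests/test_ipa_fuzzy.py computes it exactly this way (for example, per code point rather than per grapheme) is an assumption, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


# A hypothesis passes the 5% target when cer(...) <= 0.05.
print(cer("kitab", "kitaab"))
```

Because the distance is counted per code point here, an IPA length mark or combining diacritic counts as one character each.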
