
Rule-based Arabic (MSA) text-to-IPA with tashkeel diacritization — an orthography2ipa G2P plugin


arbtok

Rule-based Arabic (MSA) text→IPA with tashkeel diacritization — a downstream Arabic engine built on orthography2ipa.

A context-sensitive token tree (sentence → words → characters) implements the morpho-phonological rules that table lookups cannot express: sun-letter assimilation, hamzat al-waṣl elision, tanwīn pausal forms, tāʾ marbūṭa, mater-lectionis vowel lengthening, definite-article waṣl, and idghām/iqlāb nasal assimilation. Bare (undiacritized) text is first diacritized via text2tashkeel, a model picker over bundled ONNX diacritization models.
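To illustrate why a rule such as sun-letter assimilation needs context rather than a per-character table, here is a minimal standalone sketch. It is not arbtok's implementation or API: the names `SUN_LETTERS` and `article_prefix` are hypothetical, and the IPA values are a simplified assumption. The point is only that the rendering of the definite article depends on the letter that follows it.

```python
# Hypothetical illustration only; not arbtok's code or API.
# Sun letters mapped to a simplified IPA value (assumed for this sketch).
SUN_LETTERS = {
    "ت": "t", "ث": "θ", "د": "d", "ذ": "ð", "ر": "r", "ز": "z",
    "س": "s", "ش": "ʃ", "ص": "sˤ", "ض": "dˤ", "ط": "tˤ", "ظ": "ðˤ",
    "ل": "l", "ن": "n",
}


def article_prefix(next_letter: str) -> str:
    """Render the utterance-initial definite article before `next_letter`.

    Before a sun letter the /l/ assimilates to the following consonant,
    so the article contributes a copy of that consonant (gemination).
    Before any other letter it stays "ʔal". Hamzat al-waṣl elision in
    connected speech is deliberately not modeled here.
    """
    consonant = SUN_LETTERS.get(next_letter)
    if consonant is None:
        return "ʔal"
    return "ʔa" + consonant


# The same two article letters yield different output depending on context.
print(article_prefix("س"))  # sun letter: /l/ assimilates
print(article_prefix("ق"))  # moon letter: /l/ is kept
```

A flat character-to-phoneme table has no way to express this, because the output for lām depends on its neighbour; a tree of sentence, word, and character tokens gives each character access to that context.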

Honesty note: the gold IPA reference set was LLM-generated and has not been validated by a native MSA speaker. If you speak MSA, pull requests are very welcome.

Installation

pip install arbtok

Usage

arbtok builds on orthography2ipa, which supplies the spec data and the shared G2PPlugin/WordContext base types. arbtok owns the Arabic pipeline; orthography2ipa remains the language-agnostic base library.

Engine class

from arbtok.tokenizer import Sentence

# Fully diacritized input: transcribe it directly to IPA.
print(Sentence("اَلسَّلَامُ عَلَيْكُمْ").ipa)

Bare text is handled by diacritizing first:

from arbtok.plugin import ArbtokG2PPlugin

plugin = ArbtokG2PPlugin()
plugin.transcribe("كتاب جميل")    # auto-tashkeel + IPA

Diacritization only

from arbtok.tashkeel import TashkeelDiacritizer   # wraps text2tashkeel

TashkeelDiacritizer().diacritize("كتاب جميل")

Quality benchmarks

The test suite pins a gold sentence set with a target character error rate (CER) of ≤ 5% against the reference transcriptions, and benchmarks against espeak-ng. See tests/test_ipa_fuzzy.py and docs/ for details.
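For readers unfamiliar with the metric, CER is commonly computed as the Levenshtein edit distance between the hypothesis and the reference, divided by the reference length. The sketch below shows that standard definition. It is an assumption that the project's tests compute it this way; the function names are hypothetical, and counting is done per Unicode code point, so IPA length marks and combining diacritics count as separate characters.

```python
# Hypothetical sketch of a standard CER computation; the project's own
# test code may differ (for example in normalization or tokenization).

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over Unicode code points (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One missing length mark in a six-character reference.
print(f"{cer('kitaːb', 'kitab'):.3f}")
```

Under this definition, a 5% target means at most one character edit per twenty reference characters, averaged however the test suite chooses to aggregate.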
