Lightweight Arabic diacritization (tashkeel) — a model picker over bundled ONNX models, no PyTorch, no network

text2tashkeel

A utility for lightweight Arabic diacritization (tashkeel): it puts the missing vowel marks back into Arabic text. It is not one model but a model picker: a single tiny API over interchangeable diacritization models, all running on onnxruntime, with no PyTorch, no API keys, and offline operation by default. Pick the model that fits your accuracy/speed/size budget; the only runtime dependencies are numpy and onnxruntime.

from text2tashkeel import Diacritizer
Diacritizer().diacritize("بسم الله الرحمن الرحيم")              # default model - 2.04% DER
Diacritizer("rawi-v2-int8").diacritize("بسم الله الرحمن الرحيم")  # lean single model

More than vowels. Most diacritizers only add the short-vowel marks to text that is already spelled correctly. The default rawi models also restore the hamza (ء) and the silent dagger-alef — so they fix real, inconsistently-spelled input (e.g. a bare ا typed for أ), not just clean text. This is rare among diacritizers; here's exactly why and how.
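Because the rawi models may change letters as well as add marks, it can help to see which base letters differ between an input and its output. A minimal sketch using only the standard library: `strip_tashkeel` and `base_letter_changes` are hypothetical helpers, not part of text2tashkeel, and the sample output string is hand-written for illustration rather than produced by a model.

```python
import re

# Arabic combining marks: fathatan..sukun (U+064B-U+0652) and dagger alef (U+0670).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving only base characters."""
    return TASHKEEL.sub("", text)

def base_letter_changes(original: str, diacritized: str):
    """List (index, before, after) for every base letter that differs.

    Assumes letters were substituted (e.g. bare alef -> alef with hamza)
    but never inserted or deleted, so both strings align one-to-one.
    """
    before = strip_tashkeel(original)
    after = strip_tashkeel(diacritized)
    if len(before) != len(after):
        raise ValueError("base lengths differ; a real diff would need alignment")
    return [(i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a]

# Hand-written example: alef, hah, meem, dal typed with a bare alef ...
typed = "\u0627\u062D\u0645\u062F"
# ... and a vocalized form with the hamza restored (not real model output).
vocalized = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"
print(base_letter_changes(typed, vocalized))
```

Running the same comparison on real output from `Diacritizer().diacritize(...)` would show where a model corrected spelling rather than only adding marks.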

Install

pip install text2tashkeel

The wheel is small (~10 MB): it bundles our best models, which work fully offline (no downloads, no torch). The full-precision (fp32) variants are fetched from Hugging Face on first use if you opt in:

pip install text2tashkeel        # int8 + flagship, offline
pip install text2tashkeel[hf]    # + auto-download fp32 models on demand

Without [hf], asking for a non-bundled model raises an error whose message includes the model's Hugging Face link. You can also point at your own model (e.g. one trained on a different corpus) with register_model(...), described under Models. For development: pip install -e ".[test]" then pytest.

Models

Two models cover almost every use; both ship in the wheel and run offline:

| Use case | Model | DER ↓ | Latency | Size |
| --- | --- | --- | --- | --- |
| best accuracy (default) | rawi-ensemble | 2.04% | ~2 ms | 4.9 MB |
| fastest & smallest | rawi-v2-int8 | 2.30% | ~1 ms | 2.5 MB |
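For readers unfamiliar with the metric, DER (diacritic error rate) is commonly computed as the share of letters whose predicted marks differ from the reference. The sketch below is one simple variant that assumes both strings have identical base letters. It is not the scoring script behind the figures in the table, and published DER numbers differ in details such as whether unmarked letters or case endings are counted. The names `split_marks` and `der` are hypothetical.

```python
# Arabic combining marks: U+064B..U+0652 plus dagger alef (U+0670).
MARKS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}

def split_marks(text: str):
    """Return a list of (base_char, marks) pairs; marks are order-normalized."""
    pairs = []
    for ch in text:
        if ch in MARKS:
            if pairs:  # a mark with no preceding base character is ignored
                base, marks = pairs[-1]
                pairs[-1] = (base, "".join(sorted(marks + ch)))
        else:
            pairs.append((ch, ""))
    return pairs

def der(reference: str, hypothesis: str) -> float:
    """Fraction of Arabic letters whose marks differ from the reference."""
    ref, hyp = split_marks(reference), split_marks(hypothesis)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("base letters differ; this sketch needs aligned text")
    # Score only characters in the Arabic letter range, skipping spaces and digits.
    scored = [(r, h) for (b, r), (_, h) in zip(ref, hyp) if "\u0621" <= b <= "\u064A"]
    if not scored:
        return 0.0
    return sum(r != h for r, h in scored) / len(scored)

# Three letters, last one carries kasra in the reference but fatha in the guess.
reference = "\u0628\u0650\u0633\u0652\u0645\u0650"
hypothesis = "\u0628\u0650\u0633\u0652\u0645\u064E"
print(f"{der(reference, hypothesis):.2%}")
```

Because the default rawi models can also change base letters (restoring hamza), scoring their output this way would first need an alignment step, which this sketch deliberately leaves out.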

22 model configurations are available — the rawi family (V1/V2/V3 + INT8), two independent diacritizers (bilstm and libtashkeel), and gated ensembles of them — for comparison, research, or special cases:

from text2tashkeel import available_models, Diacritizer
available_models()                 # all models
available_models(bundled_only=True)  # the models that ship in the wheel (offline)
Diacritizer("rawi-v2-int8").diacritize("بسم الله الرحمن الرحيم")

Bundled vs fetched. available_models(bundled_only=True) lists the models that ship in the wheel. Everything else downloads from Hugging Face on first use with [hf] installed; each model's weights live in its own repo (rawi, rawi-v2, rawi-v3, rawi-ensemble, bilstm, libtashkeel), all grouped in the Arabic Diacritizers collection.

Bring your own model. Trained a diacritizer on a different corpus? Point at it:

from text2tashkeel import register_model, Diacritizer
register_model("my-rawi", "my_model.onnx", "my_vocab.json", arch="rawi")  # or arch="rawi-v3"
Diacritizer("my-rawi").diacritize("نص عربي")

Diacritizer is callable (d("...")) and lazily builds one onnxruntime session it reuses — construct once, call many times. Full credits and licenses for every model: docs/07-credits-and-license.md.
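The "construct once, call many times" advice follows from the lazy-session design. Below is a generic sketch of that pattern, with a stand-in factory in place of a real onnxruntime session so that it runs anywhere. `LazyRunner` and `fake_session_factory` are hypothetical names, and this is not text2tashkeel's actual implementation.

```python
from functools import cached_property
from typing import Callable

class LazyRunner:
    """Callable wrapper that builds its expensive session on first use only."""

    def __init__(self, session_factory: Callable[[], Callable[[str], str]]):
        self._session_factory = session_factory
        self.builds = 0  # counts how often the session was constructed

    @cached_property
    def session(self) -> Callable[[str], str]:
        # Runs on first access; the result is cached on the instance afterwards.
        self.builds += 1
        return self._session_factory()

    def __call__(self, text: str) -> str:
        return self.session(text)

def fake_session_factory() -> Callable[[str], str]:
    # Stand-in for model loading; a real factory would create an inference session.
    return lambda text: text.upper()

runner = LazyRunner(fake_session_factory)
print(runner.builds)   # nothing is built at construction time
runner("abc")
runner("def")
print(runner.builds)   # the session was built once and then reused
```

The practical consequence is the same as for `Diacritizer`: creating a new object per call would pay the session setup cost every time, while one long-lived object pays it once.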

CLI

text2tashkeel "الحمد لله رب العالمين"          # flagship default
echo "محمد رسول الله" | text2tashkeel
text2tashkeel -m rawi-v2-int8 < input.txt > output.txt
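The file-to-file CLI call has a straightforward Python counterpart. The sketch assumes only that the diacritizer is a callable taking and returning a string, which is what the README states for `Diacritizer`. `diacritize_file` is a hypothetical helper, and whether the CLI itself processes input line by line is not stated in the source.

```python
from pathlib import Path
from typing import Callable

def diacritize_file(src: Path, dst: Path, diacritize: Callable[[str], str]) -> int:
    """Diacritize src line by line into dst; return the number of lines written."""
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            text = line.rstrip("\n")
            # Blank lines pass through untouched instead of going to the model.
            fout.write((diacritize(text) if text.strip() else text) + "\n")
            count += 1
    return count
```

With the package installed, `diacritize_file(Path("input.txt"), Path("output.txt"), Diacritizer("rawi-v2-int8"))` would roughly mirror the third CLI line above, reusing one model object for every line.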

Benchmarks

Measured DER and WER for every model, across the corpus's train/test/val splits, are in benchmarks/.

Download files

Source Distribution

text2tashkeel-0.1.0a2.tar.gz (10.5 MB view details)

Uploaded Source

Built Distribution

text2tashkeel-0.1.0a2-py3-none-any.whl (10.5 MB view details)

Uploaded Python 3

File details

Details for the file text2tashkeel-0.1.0a2.tar.gz.

File metadata

  • Download URL: text2tashkeel-0.1.0a2.tar.gz
  • Upload date:
  • Size: 10.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for text2tashkeel-0.1.0a2.tar.gz
Algorithm Hash digest
SHA256 49d23418fc3b3e278f7e9d61bf5ca7ab08bc532214a1271d7916fce2e0fcf835
MD5 bc6cda05d07b92332d73977cfd8e23de
BLAKE2b-256 771493ecbb9199f1daaecf90f5f699530e419f81792eff084e73b19b06ea4f38
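The SHA256 digest listed above can be checked locally after downloading the archive. A minimal sketch with the standard library: `sha256_of` is a hypothetical helper, and the local path is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Digest copied from the listing for text2tashkeel-0.1.0a2.tar.gz.
EXPECTED = "49d23418fc3b3e278f7e9d61bf5ca7ab08bc532214a1271d7916fce2e0fcf835"

archive = Path("text2tashkeel-0.1.0a2.tar.gz")  # assumed download location
if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED else "MISMATCH")
```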

File details

Details for the file text2tashkeel-0.1.0a2-py3-none-any.whl.

File hashes

Hashes for text2tashkeel-0.1.0a2-py3-none-any.whl
Algorithm Hash digest
SHA256 54356372e5a9bc95952fcd250a71b7330c95388b07c76d1f9ed7eb77fe7618e7
MD5 761169df0c7b5d02e070143cab18825c
BLAKE2b-256 f2a7db082f8d4c3ba35c76563d4d2ffaa618a52bea27214e24df42aab0e2af10
