Lightweight Arabic diacritization (tashkeel) — a model picker over bundled ONNX models, no PyTorch, no network
text2tashkeel
A utility for lightweight Arabic diacritization (tashkeel) — it puts the
missing vowel marks back into Arabic text. Not one model but a model picker: a
single tiny API over interchangeable diacritization models, all running on
onnxruntime — no PyTorch, no API keys, offline by default. Pick the model
that fits your accuracy/speed/size budget; the only runtime
dependencies are numpy and onnxruntime.
from text2tashkeel import Diacritizer
Diacritizer().diacritize("بسم الله الرحمن الرحيم") # default model - 2.04% DER
Diacritizer("rawi-v2-int8").diacritize("بسم الله الرحمن الرحيم") # lean single model
More than vowels. Most diacritizers only add the short-vowel marks to text that is already spelled correctly. The default rawi models also restore the hamza (ء) and the silent dagger-alef, so they fix real, inconsistently spelled input (e.g. a bare ا typed for أ), not just clean text. This is rare among diacritizers; here's exactly why and how.
Install
pip install text2tashkeel
The wheel is small (~10 MB): it bundles our best models, which work fully offline (no downloads, no torch). The full-precision (fp32) variants are fetched from Hugging Face on first use if you opt in:
pip install text2tashkeel # int8 + flagship, offline
pip install "text2tashkeel[hf]" # + auto-download fp32 models on demand (quoted so shells like zsh do not expand the brackets)
Without [hf], asking for a non-bundled model raises a clear message with its
Hugging Face link. You can also point at your own model (e.g. one trained on a
different corpus) with register_model(...) — see below. For development:
pip install -e ".[test]" then pytest.
Models
Two models cover almost every use; both ship in the wheel and run offline:
| Use case | Model | DER ↓ | Latency | Size |
|---|---|---|---|---|
| best accuracy (default) ⭐ | rawi-ensemble | 2.04% | ~2 ms | 4.9 MB |
| fastest & smallest | rawi-v2-int8 | 2.30% | ~1 ms | 2.5 MB |
22 model configurations are available — the rawi family (V1/V2/V3 + INT8), two
independent diacritizers (bilstm and libtashkeel), and gated ensembles of them —
for comparison, research, or special cases:
from text2tashkeel import available_models, Diacritizer
available_models() # all models
available_models(bundled_only=True) # the models that ship in the wheel (offline)
Diacritizer("rawi-v2-int8").diacritize("بسم الله الرحمن الرحيم")
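For side-by-side comparison, one option is a small helper that runs the same text through several models. This is a minimal sketch, not part of the package: `compare_models` is a hypothetical name, and it only assumes what this README states, namely that `Diacritizer(name)` returns an object that can be called on a string.

```python
from typing import Callable, Dict, Iterable


def compare_models(
    text: str,
    names: Iterable[str],
    make: Callable[[str], Callable[[str], str]],
) -> Dict[str, str]:
    """Run `text` through each named model and collect the outputs by name.

    `make` is any factory that turns a model name into a callable;
    with text2tashkeel installed, the Diacritizer class itself can be passed.
    """
    return {name: make(name)(text) for name in names}
```

With the package installed, `compare_models("بسم الله الرحمن الرحيم", ["rawi-ensemble", "rawi-v2-int8"], Diacritizer)` would return one output per model, keyed by model name. Each name builds its own session, so this is meant for occasional comparison rather than a hot path.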
Bundled vs fetched. available_models(bundled_only=True) lists the models that
ship in the wheel. Everything else downloads from Hugging Face on first use with [hf] installed;
each model's weights live in its own repo (rawi, rawi-v2, rawi-v3, rawi-ensemble, bilstm, libtashkeel), all
grouped in the Arabic Diacritizers collection.
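In an offline deployment it can help to check whether a model is bundled before constructing it. The sketch below is an illustration under assumptions: `choose_model` is a hypothetical helper, and it assumes `available_models(bundled_only=True)` yields model names as strings (a list of names or a dict keyed by name would both work). The README does not document the return type, so check it before relying on this.

```python
from typing import Iterable


def choose_model(
    preferred: str,
    bundled: Iterable[str],
    fallback: str = "rawi-v2-int8",
) -> str:
    """Return `preferred` if it ships in the wheel, otherwise `fallback`.

    `bundled` is assumed to be an iterable of model-name strings.
    The default fallback is the bundled lean model named in this README.
    """
    names = set(bundled)
    return preferred if preferred in names else fallback
```

Usage would look like `Diacritizer(choose_model(wanted, available_models(bundled_only=True)))`, where `wanted` is whatever model name the caller asked for.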
Bring your own model. Trained a diacritizer on a different corpus? Point at it:
from text2tashkeel import register_model, Diacritizer
register_model("my-rawi", "my_model.onnx", "my_vocab.json", arch="rawi") # or arch="rawi-v3"
Diacritizer("my-rawi").diacritize("نص عربي")
Diacritizer is callable (d("...")) and lazily builds one onnxruntime
session it reuses — construct once, call many times. Full credits and licenses for
every model: docs/07-credits-and-license.md.
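Because one instance reuses its session, batch work is best done by constructing a single Diacritizer and calling it per line. A minimal sketch of that pattern follows; `diacritize_lines` is a hypothetical helper, and passing blank lines through unchanged is a choice made here, not documented package behavior.

```python
from typing import Callable, Iterable, Iterator


def diacritize_lines(
    lines: Iterable[str],
    diacritize: Callable[[str], str],
) -> Iterator[str]:
    """Yield each line diacritized, reusing the same callable for every line.

    Trailing newlines are removed, and blank lines are passed through
    without calling the model.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        yield diacritize(stripped) if stripped.strip() else stripped
```

With the package installed, `d = Diacritizer()` followed by `for out in diacritize_lines(open("input.txt", encoding="utf-8"), d): print(out)` builds the session once and streams the file through it.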
CLI
text2tashkeel "الحمد لله رب العالمين" # flagship default
echo "محمد رسول الله" | text2tashkeel
text2tashkeel -m rawi-v2-int8 < input.txt > output.txt
Benchmarks
Measured DER/WER for every model across the corpus's train/test/val splits is in benchmarks/.
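This README does not define DER (diacritic error rate). A common reading is the fraction of letters whose predicted marks differ from the reference. The sketch below assumes that definition, assumes both strings share the same base letters, and counts only Arabic letters. The figures in benchmarks/ may follow different conventions (for example in how case endings or undiacritized letters are counted), so treat this as an illustration of the metric rather than the project's scoring code.

```python
from typing import List, Tuple

# Tanween, short vowels, shadda, sukun (U+064B..U+0652) plus dagger alef (U+0670).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653)) | {"\u0670"}


def split_marks(text: str) -> List[Tuple[str, str]]:
    """Split text into (base_char, marks) pairs; marks attach to the preceding base."""
    units: List[Tuple[str, str]] = []
    for ch in text:
        if ch in DIACRITICS:
            if units:  # a mark with no preceding base character is ignored
                base, marks = units[-1]
                units[-1] = (base, marks + ch)
        else:
            units.append((ch, ""))
    return units


def der(reference: str, prediction: str) -> float:
    """Fraction of Arabic letters whose set of marks differs from the reference."""
    def is_letter(ch: str) -> bool:
        return "\u0621" <= ch <= "\u064A"

    ref = [u for u in split_marks(reference) if is_letter(u[0])]
    hyp = [u for u in split_marks(prediction) if is_letter(u[0])]
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and prediction have different base letters")
    if not ref:
        return 0.0
    # Sort marks so that shadda+vowel and vowel+shadda compare as equal.
    errors = sum(sorted(r[1]) != sorted(h[1]) for r, h in zip(ref, hyp))
    return errors / len(ref)
```

For instance, scoring the prediction كَتِبَ against the reference كَتَبَ gives one wrong mark over three letters. A model that also rewrites the hamza changes the base letters, which this sketch rejects with a `ValueError` instead of scoring.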