
Lightweight Arabic diacritization (tashkeel) — a model picker over bundled ONNX models, no PyTorch, no network

text2tashkeel

A utility for lightweight Arabic diacritization (tashkeel): it puts the missing vowel marks back into Arabic text. It is not one model but a model picker: a single tiny API over interchangeable diacritization models, all running on onnxruntime, with no PyTorch, no API keys, and offline operation by default. Pick the model that fits your accuracy/speed/size budget; the only runtime dependencies are numpy and onnxruntime.

from text2tashkeel import Diacritizer
Diacritizer().diacritize("بسم الله الرحمن الرحيم")              # default model - 2.04% DER
Diacritizer("rawi-v2-int8").diacritize("بسم الله الرحمن الرحيم")  # lean single model

More than vowels. Most diacritizers only add the short-vowel marks to text that is already spelled correctly. The default rawi models also restore the hamza (ء) and the silent dagger-alef — so they fix real, inconsistently-spelled input (e.g. a bare ا typed for أ), not just clean text. This is rare among diacritizers; here's exactly why and how.
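To make the distinction above concrete, here is a minimal sketch of the two kinds of degradation a diacritizer may have to undo: stripped marks and hamza-less alef spellings. The helper names `strip_tashkeel` and `drop_hamza` are hypothetical and are not part of text2tashkeel; they only simulate the input, they do not show how the models work.

```python
import re

# Harakat, tanween, shadda and sukun (U+064B..U+0652) plus the dagger alef (U+0670).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")

# Alef carrying a hamza or madda: U+0622 (آ), U+0623 (أ), U+0625 (إ).
HAMZA_ALEF = re.compile("[\u0622\u0623\u0625]")


def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving only the base letters."""
    return TASHKEEL.sub("", text)


def drop_hamza(text: str) -> str:
    """Replace hamza-carrying alef forms with a bare alef (U+0627),
    imitating the inconsistent spelling found in casually typed text."""
    return HAMZA_ALEF.sub("\u0627", text)


vocalized = "\u0628\u0650\u0633\u0652\u0645\u0650"   # "bismi" with its marks
bare = strip_tashkeel(vocalized)                     # the same letters, marks removed
sloppy = drop_hamza("\u0623\u062D\u0645\u062F")      # "Ahmad" typed with a bare alef
```

A vowel-only diacritizer is expected to reverse `strip_tashkeel`; per the paragraph above, the rawi models also aim to reverse the kind of change `drop_hamza` makes.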

Install

pip install text2tashkeel

The wheel is small (~10 MB): it bundles our best models, which work fully offline (no downloads, no torch). The full-precision (fp32) variants are fetched from Hugging Face on first use if you opt in:

pip install text2tashkeel        # int8 + flagship, offline
pip install text2tashkeel[hf]    # + auto-download fp32 models on demand

Without [hf], asking for a non-bundled model raises an error whose message includes the model's Hugging Face link. You can also point at your own model (e.g. one trained on a different corpus) with register_model(...); see below. For development, run pip install -e ".[test]" and then pytest.

Models

Two models cover almost every use; both ship in the wheel and run offline:

| Use case | Model | DER ↓ | Latency | Size |
|---|---|---|---|---|
| best accuracy (default) | rawi-ensemble | 2.04% | ~2 ms | 4.9 MB |
| fastest & smallest | rawi-v2-int8 | 2.30% | ~1 ms | 2.5 MB |
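DER (diacritic error rate) is the share of letters whose predicted marks differ from the reference. The sketch below shows one common way to compute it, assuming the reference and the prediction have identical base letters and that only Arabic letters are scored. The project's own benchmark may count differently (for example around hamza restoration or word-final marks), so treat this as an illustration rather than the scoring script behind the table.

```python
import re

# Marks treated as diacritics: U+064B..U+0652 plus the dagger alef (U+0670).
DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670"
_UNIT = re.compile(f"([^{DIACRITICS}])([{DIACRITICS}]*)")


def split_marks(text: str) -> list[tuple[str, str]]:
    """Split text into (base character, marks) pairs.
    Marks are sorted so that shadda+vowel order does not affect comparison."""
    return [(base, "".join(sorted(marks))) for base, marks in _UNIT.findall(text)]


def der(reference: str, prediction: str) -> float:
    """Fraction of Arabic letters whose marks differ between the two strings."""
    ref, hyp = split_marks(reference), split_marks(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and prediction differ in their base letters")
    # Score only characters in the basic Arabic letter range; skip spaces, digits, etc.
    scored = [(r, h) for (base, r), (_, h) in zip(ref, hyp)
              if "\u0621" <= base <= "\u064A"]
    if not scored:
        return 0.0
    return sum(r != h for r, h in scored) / len(scored)


reference = "\u0628\u0650\u0633\u0652\u0645\u0650"   # kasra on the last letter
prediction = "\u0628\u0650\u0633\u0652\u0645\u064E"  # fatha on the last letter
rate = der(reference, prediction)                    # one wrong letter out of three
```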

22 model configurations are available — the rawi family (V1/V2/V3 + INT8), two independent diacritizers (bilstm and libtashkeel), and gated ensembles of them — for comparison, research, or special cases:

from text2tashkeel import available_models, Diacritizer
available_models()                 # all models
available_models(bundled_only=True)  # the models that ship in the wheel (offline)
Diacritizer("rawi-v2-int8").diacritize("بسم الله الرحمن الرحيم")

Bundled vs fetched. available_models(bundled_only=True) lists the models that ship in the wheel. Everything else downloads from Hugging Face on first use with [hf] installed; each model's weights live in its own repo (rawi, rawi-v2, rawi-v3, rawi-ensemble, bilstm, libtashkeel), all grouped in the Arabic Diacritizers collection.

Bring your own model. Trained a diacritizer on a different corpus? Point at it:

from text2tashkeel import register_model, Diacritizer
register_model("my-rawi", "my_model.onnx", "my_vocab.json", arch="rawi")  # or arch="rawi-v3"
Diacritizer("my-rawi").diacritize("نص عربي")

Diacritizer is callable (d("...")) and lazily builds one onnxruntime session it reuses — construct once, call many times. Full credits and licenses for every model: docs/07-credits-and-license.md.
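The "construct once, call many times" advice can be enforced at the application level with a small cache, sketched below. `cached_loader` and `fake_build` are hypothetical names introduced for this example; the stand-in constructor is there only so the snippet runs without the package installed.

```python
from functools import lru_cache
from typing import Callable

Diacritize = Callable[[str], str]


def cached_loader(build: Callable[[str], Diacritize]) -> Callable[[str], Diacritize]:
    """Wrap a constructor so each model name is built at most once per process."""
    @lru_cache(maxsize=None)
    def get(model_name: str) -> Diacritize:
        return build(model_name)
    return get


# Stand-in constructor that records how often it is called.
# With text2tashkeel installed, the Diacritizer class could be passed instead:
#     get_diacritizer = cached_loader(Diacritizer)
build_calls: list[str] = []


def fake_build(model_name: str) -> Diacritize:
    build_calls.append(model_name)
    return lambda text: f"[{model_name}] {text}"


get_diacritizer = cached_loader(fake_build)
first = get_diacritizer("rawi-v2-int8")
second = get_diacritizer("rawi-v2-int8")   # served from the cache, no second build
```

Because the returned object is reused, any session it builds lazily is also shared across calls, which is the behavior the paragraph above recommends.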

CLI

text2tashkeel "الحمد لله رب العالمين"          # flagship default
echo "محمد رسول الله" | text2tashkeel
text2tashkeel -m rawi-v2-int8 < input.txt > output.txt
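For file processing from Python rather than the shell, a rough equivalent of the last command could look like the sketch below. It assumes line-by-line processing, which the CLI description does not confirm, and `diacritize_stream` is a hypothetical helper that accepts any callable taking and returning a string.

```python
import io
from typing import Callable, TextIO


def diacritize_stream(src: TextIO, dst: TextIO,
                      diacritize: Callable[[str], str]) -> int:
    """Copy src to dst, diacritizing each non-blank line; return the line count."""
    count = 0
    for line in src:
        text = line.rstrip("\n")
        # Blank lines are passed through untouched to preserve paragraph breaks.
        dst.write((diacritize(text) if text.strip() else text) + "\n")
        count += 1
    return count


# Demonstration with an in-memory stream and str.upper standing in for a model.
# With the package installed, a Diacritizer instance (which is callable) would
# be passed instead, with files opened using encoding="utf-8".
demo_out = io.StringIO()
demo_count = diacritize_stream(io.StringIO("a\n\nb\n"), demo_out, str.upper)
```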

Benchmarks

Measured DER/WER for every model across the corpus's train/test/val splits is in benchmarks/.

