Character-based Arabic Tashkeel Transformer (CATT) by Abjad AI

CATT: Character-based Arabic Tashkeel Transformer

This is the official implementation of the paper CATT: Character-based Arabic Tashkeel Transformer.

How to Run?

First, download the models from the Releases section of this repo.
The best checkpoint for the Encoder-Decoder (ED) model is best_ed_mlm_ns_epoch_178.pt.
For the Encoder-Only (EO) model, the best checkpoint is best_eo_mlm_ns_epoch_193.pt.
Use the following bash script to download the models:

mkdir models/
wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_ed_mlm_ns_epoch_178.pt
wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_eo_mlm_ns_epoch_193.pt
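
On systems without wget, the same downloads can be sketched in Python with only the standard library. The URLs and file names come from the commands above; the helper names release_url and download_checkpoints are illustrative assumptions and are not part of this repo.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and checkpoint names taken from the wget commands above.
RELEASE_BASE = "https://github.com/abjadai/catt/releases/download/v2"
CHECKPOINTS = ("best_ed_mlm_ns_epoch_178.pt", "best_eo_mlm_ns_epoch_193.pt")


def release_url(filename: str) -> str:
    """Build the download URL for a release asset (hypothetical helper)."""
    return f"{RELEASE_BASE}/{filename}"


def download_checkpoints(dest: str = "models") -> list:
    """Download each checkpoint into dest, skipping files already present."""
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in CHECKPOINTS:
        target = out_dir / name
        if not target.exists():
            # Network access happens only here.
            urlretrieve(release_url(name), str(target))
        paths.append(target)
    return paths
```

Calling download_checkpoints() with no arguments should leave both checkpoints under models/, matching the layout the bash script produces.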

Two inference examples are provided: predict_ed.py for ED models and predict_eo.py for EO models.
Both examples support batch inference. Read the source code for details.

python predict_ed.py
python predict_eo.py

EO models are recommended for faster inference.
ED models are recommended for better accuracy of the predicted diacritics.
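
The trade-off above can be captured in a small lookup, for example when a wrapper script has to decide which example to launch. This is only a sketch: the checkpoint and script names come from this README, while the "speed" and "accuracy" keys and the pick_model helper are assumptions made for illustration.

```python
# Checkpoint and script names come from this README; the priority keys
# and the pick_model helper are illustrative assumptions.
MODELS = {
    "speed": {
        "type": "eo",
        "checkpoint": "models/best_eo_mlm_ns_epoch_193.pt",
        "script": "predict_eo.py",
    },
    "accuracy": {
        "type": "ed",
        "checkpoint": "models/best_ed_mlm_ns_epoch_178.pt",
        "script": "predict_ed.py",
    },
}


def pick_model(priority: str) -> dict:
    """Return the model entry recommended for the given priority."""
    try:
        return MODELS[priority]
    except KeyError:
        valid = ", ".join(sorted(MODELS))
        raise ValueError(f"priority must be one of: {valid}") from None
```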

Converting Models to ONNX

Export PyTorch Models

To convert your trained PyTorch models to ONNX format, use the export script:

python export_to_onnx.py

This script will:

  • Load your trained PyTorch model checkpoints
  • Export separate ONNX models for encoder and decoder components
  • Validate the exported models for correctness
  • Save the ONNX models in the onnx_models/ directory

Output files:

  • encoder.onnx - The encoder component
  • decoder.onnx - The decoder component (or linear layer for encoder-only models)
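
A quick way to confirm that the export produced both files is to check the output directory before running inference. The sketch below assumes the default onnx_models/ directory and the two file names listed above; missing_onnx_files is a hypothetical helper, not something shipped with this repo.

```python
from pathlib import Path

# File names as listed under "Output files" above.
EXPECTED_ONNX_FILES = ("encoder.onnx", "decoder.onnx")


def missing_onnx_files(output_dir: str = "onnx_models") -> list:
    """Return the expected ONNX file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_ONNX_FILES if not (root / name).is_file()]
```

An empty list means both components are in place; otherwise the returned names indicate which export step to re-run.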

Running ONNX Models

To test and run inference with the exported ONNX models:

python run_onnx.py

This script will:

  • Load the exported ONNX models
  • Run inference using ONNX Runtime

For more details on the export process, check the export_to_onnx.py script configuration.

How to Train?

To start training, first download the dataset from the Releases section of this repo.

wget https://github.com/abjadai/catt/releases/download/v2/dataset.zip
unzip dataset.zip

Then, edit the script train_catt.py and adjust the default values:

# Model's Configs
model_type = 'ed' # 'eo' for Encoder-Only OR 'ed' for Encoder-Decoder
dl_num_workers = 32
batch_size = 32
max_seq_len = 1024
threshold = 0.6

# Pretrained Char-Based BERT
pretrained_mlm_pt = None # Use None if you want to initialize weights randomly OR the path to the char-based BERT
#pretrained_mlm_pt = 'char_bert_model_pretrained.pt'
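
Before launching a long training run, it can help to sanity-check these values. The function below is a minimal sketch and is not part of train_catt.py: validate_config is a hypothetical helper, and the checks reflect only what the comments above state (model_type is 'eo' or 'ed', and pretrained_mlm_pt is either None or a path to a file).

```python
import os


def validate_config(model_type, batch_size, max_seq_len, pretrained_mlm_pt=None):
    """Return a list of problems found in the settings; empty means OK."""
    problems = []
    if model_type not in ("eo", "ed"):
        problems.append("model_type must be 'eo' or 'ed'")
    if not isinstance(batch_size, int) or batch_size < 1:
        problems.append("batch_size must be a positive integer")
    if not isinstance(max_seq_len, int) or max_seq_len < 1:
        problems.append("max_seq_len must be a positive integer")
    # None means random initialization; otherwise the file must exist.
    if pretrained_mlm_pt is not None and not os.path.isfile(pretrained_mlm_pt):
        problems.append(f"pretrained_mlm_pt not found: {pretrained_mlm_pt}")
    return problems
```

For the defaults shown above, validate_config('ed', 32, 1024) returns an empty list.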

Finally, run the training script.

python train_catt.py

Resources

ToDo

  • inference script
  • upload our pretrained models
  • upload CATT dataset
  • upload DER scripts
  • training script

License Change: CC-BY-NC to Apache 2.0

This repository has updated its license from Creative Commons Attribution-NonCommercial (CC-BY-NC) to Apache 2.0 License. This change removes commercial use restrictions and adopts an industry-standard open-source license, enabling broader adoption and collaboration. This transition supports the project's growth while maintaining our commitment to open-source principles.
