High-performance Arabic-first tokenizer with morphology awareness

Deeplatent

High-performance Arabic tokenizer with morphology and parity awareness. Built with Rust for speed, with Python bindings for ease of use.

Features

  • Arabic-Optimized: Designed specifically for Arabic and morphologically-rich languages
  • Fast: Rust core with Python bindings (more than 43,000 texts/sec at peak with parallel processing)
  • Accurate: 100% roundtrip accuracy on 999,999 test samples
  • Edge Case Handling: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
  • Unicode Support: Full support for Arabic diacritics and mixed scripts
  • Parallel Processing: Good thread scaling (about 4.4x speedup from 1 to 8 threads in the throughput benchmark)

Installation

pip install deeplatent-nlp

Quick Start

from deeplatent import SARFTokenizer

# Load tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# Encode text
ids = tok.encode("مرحبا بالعالم")
print(ids)

# Decode back
text = tok.decode(ids)
print(text)

Edge Cases Handled

| Case | Example | Handling |
|---|---|---|
| Diacritics | بِسْمِ | Properly normalized |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا |
| Taa marbuta | ة | Optionally normalized |
| Tatweel (kashida) | كـتـاب | Removed |
| Mixed Arabic/English | Hello مرحبا | Both handled |
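
The normalizations in the table above can be illustrated with a short standalone sketch. This is not Deeplatent's implementation: the function name `normalize_arabic`, its flags, and the choice to map taa marbuta to haa are assumptions made for illustration. Whether the library strips or keeps diacritics is not stated here, so stripping is left as an opt-in flag.

```python
import re
import unicodedata

# Arabic diacritics (tashkeel): U+064B..U+0652, plus superscript alef U+0670.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # kashida, a purely typographic elongation character

# Alef with hamza above/below and alef with madda all map to bare alef.
ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0622": "\u0627",  # alef with madda
})


def normalize_arabic(text, strip_diacritics=False, normalize_taa_marbuta=False):
    """Hypothetical normalizer mirroring the cases in the table (assumption)."""
    # NFC first, so decomposed sequences such as alef + combining madda
    # become the single precomposed code point handled below.
    text = unicodedata.normalize("NFC", text)
    text = text.replace(TATWEEL, "")
    text = text.translate(ALEF_VARIANTS)
    if strip_diacritics:
        text = TASHKEEL.sub("", text)
    if normalize_taa_marbuta:
        # Assumed mapping: taa marbuta (U+0629) to haa (U+0647).
        text = text.replace("\u0629", "\u0647")
    return text


print(normalize_arabic("كـتـاب"))                      # tatweel removed
print(normalize_arabic("بِسْمِ", strip_diacritics=True))  # diacritics stripped
print(normalize_arabic("٠١٢٣٤٥"))                      # digits left untouched
```

Keeping each rule behind its own flag makes it easy to compare the function's output against the tokenizer's decoded text when checking which normalizations are actually applied.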

Performance

Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers (5 runs, 5000 samples each).

Dataset used: almaghrabima/deeplatent-benchmark-data (60k samples: 30k Arabic + 30k English)

| Rank | Tokenizer | Vocab | AR Fertility | EN Fertility | AR C/T | EN C/T | Parity |
|---|---|---|---|---|---|---|---|
| 1 | SARFTokenizer | 64,641 | 1.71 | 1.57 | 3.45 | 2.99 | 1.155 |
| 2 | Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.42 | 3.01 | 0.804 |
| 3 | Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.27 | 2.94 | 0.774 |
| 4 | GPT-4o | 200,019 | 2.81 | 1.44 | 2.45 | 3.38 | 0.725 |
| 5 | Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.17 | 3.04 | 0.713 |
| 6 | Qwen3-4B | 151,669 | 3.05 | 1.50 | 2.04 | 2.93 | 0.696 |
| 7 | GPT-4 | 100,277 | 4.59 | 1.50 | 1.35 | 3.25 | 0.416 |

Metrics explained:

  • Fertility: Average tokens per word (lower is better)
  • C/T: Characters per token (higher is better - more compression)
  • Parity: AR C/T ÷ EN C/T (1.0 = equal treatment of both languages)
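
As a worked version of the definitions above, the following sketch computes the three metrics for any tokenizer exposed as an `encode` callable. The function names are hypothetical, and splitting words on whitespace and counting characters as Unicode code points are assumptions; the benchmark script may count differently.

```python
from typing import Callable, Sequence


def fertility(texts: Sequence[str], encode: Callable[[str], list]) -> float:
    """Average tokens per whitespace-separated word (lower is better)."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


def chars_per_token(texts: Sequence[str], encode: Callable[[str], list]) -> float:
    """Average characters (code points) per token (higher is better)."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens


def parity(ar_cpt: float, en_cpt: float) -> float:
    """AR C/T divided by EN C/T; 1.0 means both languages compress equally."""
    return ar_cpt / en_cpt


# Parity recomputed from the rounded SARFTokenizer row of the table.
print(f"{parity(3.45, 2.99):.3f}")  # 1.154
```

The recomputed value differs from the reported 1.155 in the last digit, presumably because the table's parity was derived from unrounded C/T values.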

Key findings:

  • SARFTokenizer achieves the parity closest to 1.0 (1.155), meaning near-equal treatment of Arabic and English; every other tokenizer scores 0.804 or lower
  • SARFTokenizer has the lowest Arabic fertility (1.71 tokens/word vs 2.78 or more for the others)
  • Morpheme-aware encoding significantly improves Arabic tokenization efficiency

Throughput Benchmark (1M samples, 680 MB)

Comparison with tiktoken and HF tokenizers on 1,000,000 documents:

| Tokenizer | 1 Thread | 2 Threads | 4 Threads | 8 Threads |
|---|---|---|---|---|
| SARFTokenizer | 3.14 MB/s | 5.57 MB/s | 9.00 MB/s | 13.72 MB/s |
| tiktoken (o200k) | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s |
| tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s |
| HF tokenizers | 1.88 MB/s | 3.97 MB/s | 9.27 MB/s | 17.47 MB/s |

Key findings:

  • SARFTokenizer outperforms tiktoken at 8 threads (13.72 MB/s vs 8.47-10.60 MB/s), though HF tokenizers is faster still at 17.47 MB/s
  • Excellent parallel scaling: 4.4x speedup from 1 to 8 threads
  • tiktoken degrades with more threads (peaks at 4T, drops at 8T)
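
The scaling figures above can be rechecked directly from the table. The sketch below copies the reported MB/s values into a dictionary and derives speedup and peak thread count; the helper names are hypothetical, and nothing here re-runs the benchmark.

```python
# Throughput in MB/s, copied from the table above and keyed by thread count.
THROUGHPUT = {
    "SARFTokenizer":     {1: 3.14, 2: 5.57, 4: 9.00, 8: 13.72},
    "tiktoken (o200k)":  {1: 6.23, 2: 10.55, 4: 14.90, 8: 10.60},
    "tiktoken (cl100k)": {1: 7.99, 2: 11.68, 4: 12.02, 8: 8.47},
    "HF tokenizers":     {1: 1.88, 2: 3.97, 4: 9.27, 8: 17.47},
}


def speedup(row, threads):
    """Throughput at `threads` relative to the single-thread run."""
    return row[threads] / row[1]


def peak_threads(row):
    """Thread count at which the reported throughput is highest."""
    return max(row, key=row.get)


for name, row in THROUGHPUT.items():
    print(f"{name:18s} speedup@8 = {speedup(row, 8):.2f}x, peak at {peak_threads(row)} threads")
```

For SARFTokenizer this gives 13.72 / 3.14, about 4.37x, which matches the 4.4x figure, and both tiktoken rows peak at 4 threads.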

Million-Scale Roundtrip Accuracy

Tested on 999,999 samples from real-world data:

| Category | Samples | Success | Accuracy |
|---|---|---|---|
| Arabic | 333,333 | 333,333 | 100.00% |
| English | 333,333 | 333,333 | 100.00% |
| Mixed | 333,333 | 333,333 | 100.00% |
| TOTAL | 999,999 | 999,999 | 100.00% |
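
A roundtrip check of this kind can be sketched as follows. The harness is generic and hypothetical: it takes `encode` and `decode` callables rather than assuming anything about Deeplatent's test script, and the optional `normalize` argument is an assumption for tokenizers that normalize input (for example alef variants), where the decoded text should be compared against the normalized original. A UTF-8 byte codec stands in for a real tokenizer so the example runs on its own.

```python
def roundtrip_accuracy(texts, encode, decode, normalize=None):
    """Return (accuracy, failures) for decode(encode(text)) over `texts`.

    If `normalize` is given, the decoded text is compared against
    normalize(text) instead of the raw input.
    """
    total = 0
    failures = []
    for text in texts:
        total += 1
        expected = normalize(text) if normalize else text
        if decode(encode(text)) != expected:
            failures.append(text)
    accuracy = (total - len(failures)) / total if total else 1.0
    return accuracy, failures


# Stand-in codec (not a real tokenizer): UTF-8 bytes as token ids.
def utf8_encode(text):
    return list(text.encode("utf-8"))


def utf8_decode(ids):
    return bytes(ids).decode("utf-8")


samples = ["مرحبا بالعالم", "Hello world", "Hello مرحبا"]
accuracy, failures = roundtrip_accuracy(samples, utf8_encode, utf8_decode)
print(f"{accuracy:.2%} ({len(failures)} failures)")
```

To run it against the tokenizer from the Quick Start, pass `tok.encode` and `tok.decode` in place of the stand-in codec.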

Edge Case Tests (58/58 Passed)

All 12 edge case categories pass with 100% success:

| Category | Tests | Status |
|---|---|---|
| Unicode Normalization | 6 | PASS |
| Zero-Width Characters | 6 | PASS |
| Unicode Whitespace | 6 | PASS |
| Grapheme Clusters | 6 | PASS |
| Apostrophes | 4 | PASS |
| Dashes | 4 | PASS |
| Decimal Separators | 3 | PASS |
| URLs/Emails | 4 | PASS |
| File Paths | 3 | PASS |
| Code Identifiers | 4 | PASS |
| Mixed Scripts/RTL | 6 | PASS |
| Robustness | 6 | PASS |

Reproduce Benchmark Results

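Datasets: the parity benchmark uses almaghrabima/deeplatent-benchmark-data (see Tokenizer Benchmark Results).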

# Install dependencies
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub

# Run parity benchmark (vs GPT-4o, Gemma, etc.)
python benchmark_pypi.py

# Run throughput benchmark (vs tiktoken)
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8

# Run comprehensive tests (roundtrip + edge cases)
python test_comprehensive_million.py --samples 1000000 --report

Requirements

  • Python 3.9+
  • Rust 1.70+ (for building from source)

License

CC-BY-NC-4.0

Citation

@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}
