Deeplatent
High-performance Arabic tokenizer with morphology and parity awareness. Built with Rust for speed, with Python bindings for ease of use.
Features
- Arabic-Optimized: Designed specifically for Arabic and morphologically-rich languages
- Fast: Rust core with Python bindings (43,000+ texts/sec with parallel processing)
- Accurate: 100% roundtrip accuracy on a million-scale test set (999,999 samples)
- Edge Case Handling: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- Unicode Support: Full support for Arabic diacritics and mixed scripts
- Parallel Processing: Strong thread scaling (4.4x speedup with 8 threads)
Installation
pip install deeplatent-nlp
Quick Start
from deeplatent import SARFTokenizer
# Load tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")
# Encode text
ids = tok.encode("مرحبا بالعالم")
print(ids)
# Decode back
text = tok.decode(ids)
print(text)
Edge Cases Handled
| Case | Example | Handling |
|---|---|---|
| Diacritics | بِسْمِ | Properly normalized |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا |
| Taa marbuta | ة | Optionally normalized |
| Tatweel (kashida) | كـتـاب | Removed |
| Mixed Arabic/English | Hello مرحبا | Both handled |
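As an illustration, the snippet below runs a few of these edge cases through the tokenizer. It is a minimal sketch that uses only the encode/decode API from Quick Start; the sample strings are taken from the table above.

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# Edge-case inputs from the table above
samples = ["بِسْمِ", "٠١٢٣٤٥", "كـتـاب", "Hello مرحبا"]
for text in samples:
    ids = tok.encode(text)
    print(f"{text!r}: {len(ids)} tokens -> {tok.decode(ids)!r}")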
Performance
Tokenizer Benchmark Results
Comparison with state-of-the-art tokenizers (5 runs, 5000 samples each).
Dataset used: almaghrabima/deeplatent-benchmark-data (60k samples: 30k Arabic + 30k English)
| Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | Parity | Fert Rank | Parity Rank |
|---|---|---|---|---|---|---|---|
| SARFTokenizer | 64,641 | 1.71 | 1.57 | 1.64 | 1.155 | #1 | #2 |
| ALLaM-7B | 64,000 | 1.81 | 1.48 | 1.65 | 1.162 | #2 | #3 |
| Falcon-H1-7B | 130,049 | 2.64 | 1.55 | 2.10 | 0.926 | #3 | #1 |
| Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.10 | 0.774 | #4 | #4 |
| Hala-9B | 128,256 | 2.85 | 1.36 | 2.10 | 0.774 | #5 | #5 |
| GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 0.725 | #6 | #6 |
| Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 0.713 | #7 | #7 |
| Qwen3-4B | 151,669 | 3.05 | 1.50 | 2.28 | 0.696 | #8 | #8 |
| GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 0.416 | #9 | #10 |
| Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 0.417 | #10 | #9 |
Metrics explained (a computation sketch follows the list):
- Fertility: Average tokens per word (lower is better)
- Parity: AR chars/token ÷ EN chars/token (closer to 1.0 means both languages are treated equally)
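As a rough sketch of how these two metrics can be computed: the helper functions below are hypothetical, not part of the library; only encode() from Quick Start is assumed, and the tiny corpora are placeholders for the 30k Arabic + 30k English benchmark samples.

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("SARFTokenizer")

def fertility(texts):
    # Tokens per whitespace-delimited word, averaged over the corpus
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tok.encode(t)) for t in texts)
    return tokens / words

def chars_per_token(texts):
    chars = sum(len(t) for t in texts)
    tokens = sum(len(tok.encode(t)) for t in texts)
    return chars / tokens

ar_texts = ["مرحبا بالعالم"]  # placeholder corpora
en_texts = ["Hello world"]
parity = chars_per_token(ar_texts) / chars_per_token(en_texts)
print(f"AR fertility: {fertility(ar_texts):.2f}, parity: {parity:.3f}")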
Key findings:
- SARFTokenizer ranks #1 in fertility (1.64 avg tokens/word) and #2 in parity (1.155)
- Falcon-H1-7B has best parity (0.926) but lower fertility efficiency
- SARFTokenizer achieves the best Arabic fertility (1.71 tokens/word, vs 1.81 for ALLaM-7B and 2.6+ for the rest)
- Morpheme-aware encoding significantly improves Arabic tokenization efficiency
- SARFTokenizer uses smallest vocab (64k) among top performers
Throughput Benchmark (1M samples, 680 MB)
Comparison with tiktoken and Hugging Face tokenizers on 1,000,000 documents:
| Tokenizer | 1 Thread | 2 Threads | 4 Threads | 8 Threads |
|---|---|---|---|---|
| SARFTokenizer | 3.14 MB/s | 5.57 MB/s | 9.00 MB/s | 13.72 MB/s |
| tiktoken (o200k) | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s |
| tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s |
| HF tokenizers | 1.88 MB/s | 3.97 MB/s | 9.27 MB/s | 17.47 MB/s |
Key findings:
- SARFTokenizer outperforms tiktoken at 8 threads (13.72 MB/s vs 8.47-10.60 MB/s)
- Excellent parallel scaling: 4.4x speedup from 1 to 8 threads
- tiktoken degrades with more threads (peaks at 4T, drops at 8T)
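A minimal way to take this kind of measurement is sketched below. It is illustrative only: it assumes the Rust core releases the GIL so Python threads scale (consistent with the scaling numbers above, but an assumption here), and benchmark_tiktoken_style.py remains the authoritative script.

import time
from concurrent.futures import ThreadPoolExecutor
from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("SARFTokenizer")
docs = ["مرحبا بالعالم. Hello world."] * 100_000  # stand-in corpus

def throughput(n_threads):
    total_mb = sum(len(d.encode("utf-8")) for d in docs) / 1e6
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for _ in pool.map(tok.encode, docs):  # assumes GIL-releasing encode
            pass
    return total_mb / (time.perf_counter() - start)

for n in (1, 2, 4, 8):
    print(f"{n} thread(s): {throughput(n):.2f} MB/s")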
Million-Scale Roundtrip Accuracy
Tested on 999,999 samples from real-world data:
| Category | Samples | Success | Accuracy |
|---|---|---|---|
| Arabic | 333,333 | 333,333 | 100.00% |
| English | 333,333 | 333,333 | 100.00% |
| Mixed | 333,333 | 333,333 | 100.00% |
| TOTAL | 999,999 | 999,999 | 100.00% |
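The check itself reduces to decode(encode(text)) == text. A miniature version, with placeholder samples standing in for the real corpus:

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("SARFTokenizer")
samples = ["مرحبا بالعالم", "Hello, world!", "Hello مرحبا ٠١٢٣"]
passed = sum(tok.decode(tok.encode(t)) == t for t in samples)
print(f"{passed}/{len(samples)} exact roundtrips")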
Edge Case Tests (58/58 Passed)
All 12 edge case categories pass with 100% success:
| Category | Tests | Status |
|---|---|---|
| Unicode Normalization | 6 | PASS |
| Zero-Width Characters | 6 | PASS |
| Unicode Whitespace | 6 | PASS |
| Grapheme Clusters | 6 | PASS |
| Apostrophes | 4 | PASS |
| Dashes | 4 | PASS |
| Decimal Separators | 3 | PASS |
| URLs/Emails | 4 | PASS |
| File Paths | 3 | PASS |
| Code Identifiers | 4 | PASS |
| Mixed Scripts/RTL | 6 | PASS |
| Robustness | 6 | PASS |
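A few illustrative probes in the same spirit: the inputs here are hypothetical examples chosen to match the categories above, not the actual test suite (which ships as test_comprehensive_million.py).

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("SARFTokenizer")
probes = {
    "zero-width character": "foo\u200bbar",  # ZWSP
    "unicode whitespace": "a\u00a0b",        # NBSP
    "mixed scripts / RTL": "abc مرحبا def",
}
for name, text in probes.items():
    assert tok.decode(tok.encode(text)) == text, name
print("all probes roundtrip exactly")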
Reproduce Benchmark Results
Datasets:
- Benchmark data (60k samples): almaghrabima/deeplatent-benchmark-data
- Eval test data: almaghrabima/eval-test-data
# Install dependencies
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub
# Run parity benchmark (vs GPT-4o, Gemma, etc.)
python benchmark_pypi.py
# Run throughput benchmark (vs tiktoken)
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8
# Run comprehensive tests (roundtrip + edge cases)
python test_comprehensive_million.py --samples 1000000 --report
Requirements
- Python 3.9+
- Rust 1.70+ (for building from source)
License
CC-BY-NC-4.0
Citation
@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}