High-performance Arabic-first tokenizer with morphology awareness


DeepLatent SARF Tokenizer

Part of the Suhail Project, independent research by Mohammed Almaghrabi

This is the SARF (Sarf-Aware Representation Framework) tokenizer, designed for the DeepLatent language model and trained on bilingual Arabic/English data.

What is SARF?

SARF (صَرْف) is the Arabic term for morphology. In classical and modern Arabic linguistics, ṣarf refers to the system that governs:

  • Word formation
  • Roots and patterns (جذر / وزن)
  • Prefixes, suffixes, infixes
  • Tense, gender, number, and derivation

Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.

SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.

Most tokenizers treat Arabic as bytes or characters. SARF treats Arabic as a language.
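
For intuition, here is a minimal sketch of what morphology-aware pre-segmentation can look like: peel a likely prefix and suffix off a word, then run BPE on each morpheme separately so merges never cross a morpheme boundary. The affix lists and the segment() helper are illustrative stand-ins, not SARF's actual rules.

# Minimal sketch of morphology-aware pre-segmentation before BPE.
# The affix lists and segment() are illustrative, not SARF's real analyzer.
PREFIXES = ["وال", "بال", "فال", "كال", "ال", "و", "ف", "ب", "ك", "ل"]
SUFFIXES = ["ها", "هم", "كم", "نا", "ون", "ين", "ات", "ة"]

def segment(word):
    """Split an Arabic word into prefix / stem / suffix pieces when possible."""
    pieces = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 2:
            pieces.append(p)
            word = word[len(p):]
            break
    tail = ""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 2:
            tail = s
            word = word[:-len(s)]
            break
    pieces.append(word)
    if tail:
        pieces.append(tail)
    return pieces

print(segment("والكتاب"))  # ['وال', 'كتاب']: BPE then runs per segment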

Features

  • Arabic-Optimized: Designed specifically for Arabic and other morphologically rich languages
  • Fast: Rust core with Python bindings (43,000+ texts/sec with parallel processing)
  • Accurate: 100% roundtrip accuracy on a million-scale test set (999,999 samples)
  • Edge Case Handling: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
  • Unicode Support: Full support for Arabic diacritics and mixed scripts
  • Parallel Processing: Near-linear thread scaling (4.4x speedup with 8 threads; see the sketch below)
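
Since the heavy lifting happens in the Rust core, batch encoding can be driven with ordinary Python threads. The sketch below assumes encode() releases the GIL while the Rust core runs, which is what the thread-scaling numbers in the benchmarks suggest; it uses only the from_pretrained() and encode() calls shown in the Quick Start.

from concurrent.futures import ThreadPoolExecutor

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("SARFTokenizer")
texts = ["مرحبا بالعالم", "Hello world"] * 5_000

# Plain Python threads only scale if encode() releases the GIL inside
# the Rust core -- an assumption here, consistent with the benchmarks.
with ThreadPoolExecutor(max_workers=8) as pool:
    all_ids = list(pool.map(tok.encode, texts))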

Installation

pip install deeplatent-nlp

Quick Start

from deeplatent import SARFTokenizer

# Load tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# Encode text
ids = tok.encode("مرحبا بالعالم")
print(ids)

# Decode back
text = tok.decode(ids)
print(text)

Edge Cases Handled

Case | Example | Handling
Diacritics | بِسْمِ | Properly normalized
Arabic-Indic digits | ٠١٢٣٤٥ | Preserved
Alef variants | أ إ آ ا | Normalized to ا
Taa marbuta | ة | Optionally normalized
Tatweel (kashida) | كـتـاب | Removed
Mixed Arabic/English | Hello مرحبا | Both handled
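
A plain-Python approximation makes these rules easy to see in action. The Unicode ranges below are the standard Arabic code points; the real rules (including the optional taa marbuta handling, omitted here) live in SARF's Rust core, so treat this as a sketch of the table, not the implementation.

import re

TATWEEL = "\u0640"                        # ـ kashida
ALEF_MAP = str.maketrans("أإآ", "ااا")     # hamza/madda alefs -> bare alef
TASHKEEL = re.compile("[\u064B-\u0652]")   # fathatan .. sukun

def normalize(text):
    """Sketch of the table's rules; diacritic handling shown as removal."""
    text = text.replace(TATWEEL, "")   # tatweel removed: كـتـاب -> كتاب
    text = text.translate(ALEF_MAP)    # alef variants -> ا
    text = TASHKEEL.sub("", text)      # tashkeel stripped: بِسْمِ -> بسم
    return text                        # Arabic-Indic digits pass through as-is

print(normalize("كـتـاب"), normalize("بِسْمِ"), normalize("٠١٢٣٤٥"))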

Performance

Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English).

Dataset: almaghrabima/deeplatent-benchmark-data

Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | AR C/T | EN C/T | Parity
SARFTokenizer | 64,641 | 1.72 | 1.57 | 1.64 | 3.45 | 2.99 | 1.156
ALLaM-7B | 64,000 | 1.82 | 1.48 | 1.65 | 3.08 | 2.65 | 1.163
Gemma-3-4B | 262,145 | 2.78 | 1.33 | 2.05 | 2.42 | 3.00 | 0.805
Falcon-H1-7B | 130,049 | 2.65 | 1.55 | 2.10 | 2.55 | 2.75 | 0.926
Fanar-1-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775
Hala-9B | 128,256 | 2.85 | 1.36 | 2.11 | 2.27 | 2.93 | 0.775
GPT-4o | 200,019 | 2.81 | 1.44 | 2.12 | 2.45 | 3.37 | 0.726
Command-R-Arabic | 255,033 | 3.00 | 1.33 | 2.16 | 2.17 | 3.04 | 0.714
Qwen3-4B | 151,669 | 3.06 | 1.50 | 2.28 | 2.04 | 2.92 | 0.697
GPT-4 | 100,277 | 4.59 | 1.50 | 3.05 | 1.35 | 3.24 | 0.417
Mistral-7B-v0.3 | 32,768 | 5.56 | 1.48 | 3.52 | 1.11 | 2.64 | 0.418

Metrics explained:

  • Fertility: Average tokens per word (lower is better - more efficient encoding)
  • C/T: Characters per token (higher is better - more characters encoded per token)
  • Parity: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)
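
All three metrics fall out of raw token, word, and character counts. A minimal sketch, assuming whitespace word-splitting and the encode() API from the Quick Start (the actual benchmark script may count words differently):

def tokenizer_metrics(tok, ar_texts, en_texts):
    """Fertility, chars/token, and parity as defined above."""
    def fert_and_cpt(texts):
        tokens = sum(len(tok.encode(t)) for t in texts)
        words = sum(len(t.split()) for t in texts)  # naive word count
        chars = sum(len(t) for t in texts)
        return tokens / words, chars / tokens       # fertility, C/T

    ar_fert, ar_cpt = fert_and_cpt(ar_texts)
    en_fert, en_cpt = fert_and_cpt(en_texts)
    return {
        "AR Fert": ar_fert, "EN Fert": en_fert,
        "Avg Fert": (ar_fert + en_fert) / 2,
        "AR C/T": ar_cpt, "EN C/T": en_cpt,
        "Parity": ar_cpt / en_cpt,                  # 1.0 = equal treatment
    }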

Key findings:

  • SARFTokenizer achieves the best Arabic fertility (1.72 tokens/word) - about 39% fewer tokens per Arabic word than GPT-4o (2.81)
  • Lowest average fertility (1.64) among all tokenizers tested
  • Best Arabic characters/token (3.45) - encodes more Arabic per token than any competitor
  • Compact vocabulary (64k) while maintaining top performance
  • ALLaM-7B shows similar efficiency (both use morpheme-aware approaches)
  • Falcon-H1-7B has best parity (0.926) but 28% higher fertility than SARF
  • GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF)

Throughput Benchmark (1M samples, 680 MB)

Comparison with tiktoken and Hugging Face tokenizers on 1,000,000 documents:

Tokenizer | 1 Thread | 2 Threads | 4 Threads | 8 Threads
SARFTokenizer | 3.14 MB/s | 5.57 MB/s | 9.00 MB/s | 13.72 MB/s
tiktoken (o200k) | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s
tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s
HF tokenizers | 1.88 MB/s | 3.97 MB/s | 9.27 MB/s | 17.47 MB/s

Key findings:

  • SARFTokenizer outperforms tiktoken at 8 threads (13.72 MB/s vs 8.47-10.60 MB/s)
  • Excellent parallel scaling: 4.4x speedup from 1 to 8 threads
  • tiktoken degrades with more threads (peaks at 4T, drops at 8T)
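
An MB/s figure like those above can be approximated in a few lines; benchmark_tiktoken_style.py is the authoritative harness, so this is only a sketch under the same GIL-release assumption as before:

import time
from concurrent.futures import ThreadPoolExecutor

def throughput_mb_s(tok, texts, n_threads):
    """Bytes of UTF-8 input tokenized per wall-clock second, in MB/s."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(tok.encode, texts))  # drain all results
    return total_bytes / (time.perf_counter() - start) / 1e6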

Million-Scale Roundtrip Accuracy

Tested on 999,999 samples from real-world data:

Category | Samples | Success | Accuracy
Arabic | 333,333 | 333,333 | 100.00%
English | 333,333 | 333,333 | 100.00%
Mixed | 333,333 | 333,333 | 100.00%
TOTAL | 999,999 | 999,999 | 100.00%
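
Roundtrip accuracy here means exact reconstruction: a sample passes only if decoding its token ids reproduces the input string exactly. A minimal check using only the Quick Start API:

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("SARFTokenizer")

def roundtrip_ok(tok, text):
    """Pass = decode(encode(text)) returns the input unchanged."""
    return tok.decode(tok.encode(text)) == text

samples = ["مرحبا بالعالم", "Hello world", "Hello مرحبا ٠١٢٣ 123"]
assert all(roundtrip_ok(tok, s) for s in samples)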

Edge Case Tests (58/58 Passed)

All 12 edge case categories pass with 100% success:

Category | Tests | Status
Unicode Normalization | 6 | PASS
Zero-Width Characters | 6 | PASS
Unicode Whitespace | 6 | PASS
Grapheme Clusters | 6 | PASS
Apostrophes | 4 | PASS
Dashes | 4 | PASS
Decimal Separators | 3 | PASS
URLs/Emails | 4 | PASS
File Paths | 3 | PASS
Code Identifiers | 4 | PASS
Mixed Scripts/RTL | 6 | PASS
Robustness | 6 | PASS
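
The same roundtrip check can probe these categories directly. The inputs below are illustrative examples, one per category, not the project's actual 58-case suite; given the results above they are expected to roundtrip exactly.

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# One illustrative probe per category (the real suite has 58 cases).
edge_cases = [
    "caf\u00e9 vs cafe\u0301",                     # Unicode normalization (NFC vs NFD)
    "كلمة\u200bكلمة",                              # zero-width space
    "a\u00a0b\u2009c",                             # non-breaking / thin spaces
    "\U0001F469\u200D\U0001F4BB",                  # grapheme cluster (emoji ZWJ)
    "don\u2019t",                                  # curly apostrophe
    "1\u20132 and 3\u20144",                       # en dash, em dash
    "3.14 vs 3,14",                                # decimal separators
    "user@example.com https://example.com/a?b=1",  # URLs/emails
    "C:\\Users\\name\\file.txt",                   # file paths
    "snake_case camelCase",                        # code identifiers
    "RTL نص with LTR",                             # mixed scripts / RTL
]
for case in edge_cases:
    assert tok.decode(tok.encode(case)) == case    # exact roundtrip expected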

Reproduce Benchmark Results

All benchmarks run against the almaghrabima/deeplatent-benchmark-data dataset:

# Install dependencies
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub

# Run parity benchmark (vs GPT-4o, Gemma, etc.)
python benchmark_pypi.py

# Run throughput benchmark (vs tiktoken)
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8

# Run comprehensive tests (roundtrip + edge cases)
python test_comprehensive_million.py --samples 1000000 --report

Requirements

  • Python 3.9+
  • Rust 1.70+ (for building from source)

License

CC-BY-NC-4.0

Citation

@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions


File | Size | Python | Platform
deeplatent_nlp-0.3.15-cp313-cp313-manylinux_2_34_x86_64.whl | 1.2 MB | CPython 3.13 | manylinux (glibc 2.34+), x86-64
deeplatent_nlp-0.3.15-cp312-cp312-manylinux_2_34_x86_64.whl | 1.2 MB | CPython 3.12 | manylinux (glibc 2.34+), x86-64
deeplatent_nlp-0.3.15-cp311-cp311-manylinux_2_34_x86_64.whl | 1.2 MB | CPython 3.11 | manylinux (glibc 2.34+), x86-64
deeplatent_nlp-0.3.15-cp310-cp310-manylinux_2_34_x86_64.whl | 1.2 MB | CPython 3.10 | manylinux (glibc 2.34+), x86-64
deeplatent_nlp-0.3.15-cp39-cp39-manylinux_2_34_x86_64.whl | 1.2 MB | CPython 3.9 | manylinux (glibc 2.34+), x86-64

File details

Hashes for each built distribution:

deeplatent_nlp-0.3.15-cp313-cp313-manylinux_2_34_x86_64.whl
  SHA256: 04aca61cf001c3caa3959457b9e831b9ef57e1d1455717b52531ca00bd5b766a
  MD5: 7add68cab7d10d88f200815483f73497
  BLAKE2b-256: a2540d75097fdb3e0cf22cc44fd05d420dd464b04aeab5c7b5b50b84b31a88b7

deeplatent_nlp-0.3.15-cp312-cp312-manylinux_2_34_x86_64.whl
  SHA256: 5b53a8459a7367af6c18cd0b48bd3dc7f4aef9bd1c9861c53ae9694ee94e3677
  MD5: a75568b3724ce4c72b99ecd332f8b483
  BLAKE2b-256: 4325dd91b6b32257123925cf0fb35b7ff61b3b44bf0ed61d7b7911327cb149bc

deeplatent_nlp-0.3.15-cp311-cp311-manylinux_2_34_x86_64.whl
  SHA256: 17f9694662c06535efe05546d3cdd26c154a4e45a52f29bd9c58b26a6012094b
  MD5: aa887e5144d40058ce371c11cec40832
  BLAKE2b-256: a300d5badeebe07c1de87638214986b4642d1a655b51d528c0156254bffac2cc

deeplatent_nlp-0.3.15-cp310-cp310-manylinux_2_34_x86_64.whl
  SHA256: f6767b7f94e4cde37517562f3e1a40055f964511897d2110563946fe63993ab8
  MD5: 71deffdc7cc0c8882beaafc2637aa13e
  BLAKE2b-256: d8b675dc0941e3f1d2a20ec13e2c1e04c53cb15c23c500c774b5c911f7695c95

deeplatent_nlp-0.3.15-cp39-cp39-manylinux_2_34_x86_64.whl
  SHA256: 032ba6658055cf15af234e37be486d7e5af05c61a75d6e86b123cf54882fdf8f
  MD5: 1dd8f4540f31c048566d65d39a5d719f
  BLAKE2b-256: 34d4911f55c3fac016a6ea92705921d38e0e247a8c55ecd0fbfca56d8ea80324
