
High-performance Arabic-first tokenizer with morphology awareness


DeepLatent SARF Tokenizer

Part of Suhail Project - Independent Research by Mohammed Almaghrabi

This is SARF (the Sarf-Aware Representation Framework), the tokenizer designed for the DeepLatent language model and trained on bilingual Arabic/English data.

What is SARF?

SARF (صَرْف) is the Arabic term for morphology. In classical and modern Arabic linguistics, ṣarf refers to the system that governs:

  • Word formation
  • Roots and patterns (جذر / وزن)
  • Prefixes, suffixes, infixes
  • Tense, gender, number, and derivation

Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.

SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.

Most tokenizers treat Arabic as bytes or characters. SARF treats Arabic as a language.
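
To make this concrete, here is a deliberately naive sketch of what morpheme-aware pre-segmentation means: peel off a clitic prefix and suffix so BPE sees the stem intact. The affix lists and the presegment function below are illustrative inventions, not SARF's actual algorithm.

# Illustrative only: naive affix stripping to show the idea of
# morpheme-aware pre-segmentation before BPE. Not SARF's real logic.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ك", "ل"]
SUFFIXES = ["ها", "هم", "كم", "نا", "ون", "ين", "ات", "ة"]

def presegment(word):
    parts = []
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            parts.append(p)          # split off one clitic prefix
            word = word[len(p):]
            break
    suffix = None
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            suffix = s               # split off one suffix
            word = word[:-len(s)]
            break
    parts.append(word)               # the stem, left intact for BPE
    if suffix:
        parts.append(suffix)
    return parts

print(presegment("والكتاب"))  # ['وال', 'كتاب'] (prefix split off, stem intact)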

Features

  • Arabic-Optimized: Designed specifically for Arabic and other morphologically rich languages
  • Fast: Rust core with Python bindings (43,000+ texts/sec with parallel processing)
  • Accurate: 100% roundtrip accuracy on 1,000,000 test samples
  • Edge Case Handling: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
  • Unicode Support: Full support for Arabic diacritics and mixed scripts
  • Parallel Processing: Strong thread scaling (4.4x speedup with 8 threads in the throughput benchmark below)

Installation

pip install deeplatent-nlp

Quick Start

from deeplatent import SARFTokenizer

# Load tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# Encode text
ids = tok.encode("مرحبا بالعالم")
print(ids)

# Decode back
text = tok.decode(ids)
print(text)

Edge Cases Handled

| Case                 | Example      | Handling             |
|----------------------|--------------|----------------------|
| Diacritics           | بِسْمِ       | Properly normalized  |
| Arabic-Indic digits  | ٠١٢٣٤٥       | Preserved            |
| Alef variants        | أ إ آ ا      | Normalized to ا      |
| Taa marbuta          | ة            | Optionally normalized |
| Tatweel (kashida)    | كـتـاب       | Removed              |
| Mixed Arabic/English | Hello مرحبا  | Both handled         |
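
If normalization is applied during encoding, as the table describes, some of these behaviors can be observed directly from the token ids. A small check, assuming the tok object from the Quick Start and default normalization settings:

# Assumes `tok` from the Quick Start and default normalization settings.
# Tatweel is stripped, so elongated and plain forms should match:
print(tok.encode("كـتـاب") == tok.encode("كتاب"))  # expected: True

# Alef variants are folded to ا, so these should encode identically:
print(tok.encode("أمل") == tok.encode("امل"))      # expected: True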

Performance

Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English).

Dataset: almaghrabima/deeplatent-benchmark-data

| Tokenizer        | Vocab   | AR Fert | EN Fert | Avg Fert | AR C/T | EN C/T | Parity |
|------------------|---------|---------|---------|----------|--------|--------|--------|
| SARFTokenizer    | 64,641  | 1.72    | 1.57    | 1.64     | 3.45   | 2.99   | 1.156  |
| ALLaM-7B         | 64,000  | 1.82    | 1.48    | 1.65     | 3.08   | 2.65   | 1.163  |
| Gemma-3-4B       | 262,145 | 2.78    | 1.33    | 2.05     | 2.42   | 3.00   | 0.805  |
| Falcon-H1-7B     | 130,049 | 2.65    | 1.55    | 2.10     | 2.55   | 2.75   | 0.926  |
| Fanar-1-9B       | 128,256 | 2.85    | 1.36    | 2.11     | 2.27   | 2.93   | 0.775  |
| Hala-9B          | 128,256 | 2.85    | 1.36    | 2.11     | 2.27   | 2.93   | 0.775  |
| GPT-4o           | 200,019 | 2.81    | 1.44    | 2.12     | 2.45   | 3.37   | 0.726  |
| Command-R-Arabic | 255,033 | 3.00    | 1.33    | 2.16     | 2.17   | 3.04   | 0.714  |
| Qwen3-4B         | 151,669 | 3.06    | 1.50    | 2.28     | 2.04   | 2.92   | 0.697  |
| GPT-4            | 100,277 | 4.59    | 1.50    | 3.05     | 1.35   | 3.24   | 0.417  |
| Mistral-7B-v0.3  | 32,768  | 5.56    | 1.48    | 3.52     | 1.11   | 2.64   | 0.418  |

Metrics explained:

  • Fertility: Average tokens per word (lower is better - more efficient encoding)
  • C/T: Characters per token (higher is better - more characters encoded per token)
  • Parity: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)
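
For reference, all three metrics can be computed from nothing more than encode. A minimal sketch (whitespace word-splitting is a simplification of whatever segmentation the benchmark script actually uses, and the two sample lists are stand-ins):

from deeplatent import SARFTokenizer

tok = SARFTokenizer.from_pretrained("SARFTokenizer")

def fertility(texts):
    # Average tokens per whitespace-separated word (lower is better).
    words = sum(len(t.split()) for t in texts)
    return sum(len(tok.encode(t)) for t in texts) / words

def chars_per_token(texts):
    # Average characters encoded per token (higher is better).
    tokens = sum(len(tok.encode(t)) for t in texts)
    return sum(len(t) for t in texts) / tokens

ar = ["مرحبا بالعالم"]   # stand-in Arabic sample
en = ["Hello world"]     # stand-in English sample
parity = chars_per_token(ar) / chars_per_token(en)  # 1.0 = equal treatment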

Key findings:

  • SARFTokenizer achieves the best Arabic fertility (1.72 tokens/word vs 2.81 for GPT-4o, a 39% reduction)
  • Lowest average fertility (1.64) among all tokenizers tested
  • Best Arabic characters/token (3.45) - encodes more Arabic per token than any competitor
  • Compact vocabulary (64k) while maintaining top performance
  • ALLaM-7B shows similar efficiency (both use morpheme-aware approaches)
  • Falcon-H1-7B has best parity (0.926) but 28% higher fertility than SARF
  • GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF)

Throughput Benchmark (1M samples, 680 MB)

Comparison with tiktoken on 1,000,000 documents:

| Tokenizer         | 1 Thread  | 2 Threads  | 4 Threads  | 8 Threads  |
|-------------------|-----------|------------|------------|------------|
| SARFTokenizer     | 3.14 MB/s | 5.57 MB/s  | 9.00 MB/s  | 13.72 MB/s |
| tiktoken (o200k)  | 6.23 MB/s | 10.55 MB/s | 14.90 MB/s | 10.60 MB/s |
| tiktoken (cl100k) | 7.99 MB/s | 11.68 MB/s | 12.02 MB/s | 8.47 MB/s  |
| HF tokenizers     | 1.88 MB/s | 3.97 MB/s  | 9.27 MB/s  | 17.47 MB/s |

Key findings:

  • SARFTokenizer outperforms tiktoken at 8 threads (13.72 MB/s vs 8.47-10.60 MB/s)
  • Excellent parallel scaling: 4.4x speedup from 1 to 8 threads
  • tiktoken degrades with more threads (peaks at 4T, drops at 8T)
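
A rough way to reproduce the scaling shape yourself. This sketch assumes the Rust core releases the GIL during encode so that plain Python threads can run in parallel; if it does not, the package's own parallel entry point should be used instead. The corpus here is synthetic:

import time
from concurrent.futures import ThreadPoolExecutor

# `tok` as loaded in the Quick Start above.
texts = ["مرحبا بالعالم! Hello world."] * 100_000  # synthetic stand-in corpus

def throughput_mb_s(n_threads):
    # Encode every text on a thread pool; report MB of UTF-8 input per second.
    total_mb = sum(len(t.encode("utf-8")) for t in texts) / 1e6
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for _ in pool.map(tok.encode, texts):
            pass
    return total_mb / (time.perf_counter() - start)

for n in (1, 2, 4, 8):
    print(f"{n} threads: {throughput_mb_s(n):.2f} MB/s")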

Million-Scale Roundtrip Accuracy

Tested on 999,999 samples from real-world data:

| Category | Samples | Success | Accuracy |
|----------|---------|---------|----------|
| Arabic   | 333,333 | 333,333 | 100.00%  |
| English  | 333,333 | 333,333 | 100.00%  |
| Mixed    | 333,333 | 333,333 | 100.00%  |
| TOTAL    | 999,999 | 999,999 | 100.00%  |
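
The property under test is simply that decode(encode(x)) returns x. A minimal version of such a check (the published script presumably also buckets results by category):

# `tok` as loaded in the Quick Start above.
def roundtrip_accuracy(samples):
    # Fraction of samples that survive encode -> decode unchanged.
    ok = sum(tok.decode(tok.encode(s)) == s for s in samples)
    return ok / len(samples)

print(roundtrip_accuracy(["مرحبا بالعالم", "Hello world", "Hello مرحبا"]))

For inputs that normalization rewrites (tatweel, for instance), the comparison would presumably be made against the normalized form.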

Edge Case Tests (58/58 Passed)

All 12 edge case categories pass with 100% success:

| Category              | Tests | Status |
|-----------------------|-------|--------|
| Unicode Normalization | 6     | PASS   |
| Zero-Width Characters | 6     | PASS   |
| Unicode Whitespace    | 6     | PASS   |
| Grapheme Clusters     | 6     | PASS   |
| Apostrophes           | 4     | PASS   |
| Dashes                | 4     | PASS   |
| Decimal Separators    | 3     | PASS   |
| URLs/Emails           | 4     | PASS   |
| File Paths            | 3     | PASS   |
| Code Identifiers      | 4     | PASS   |
| Mixed Scripts/RTL     | 6     | PASS   |
| Robustness            | 6     | PASS   |
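
A few representative probes in the spirit of these categories (the suite's exact test strings are not published here; these are stand-ins):

# `tok` as loaded in the Quick Start above.
probes = {
    "zero-width": "كت\u200bاب",            # zero-width space inside a word
    "whitespace": "a\u00a0b\u2009c",       # NBSP and thin space
    "grapheme":   "👩‍👩‍👧‍👦",                  # ZWJ emoji family cluster
    "url":        "https://example.com/a?b=1",
    "mixed RTL":  "Hello مرحبا world",
}
for name, text in probes.items():
    ids = tok.encode(text)
    print(f"{name:12s} {len(ids):3d} tokens, roundtrip={tok.decode(ids) == text}")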

Reproduce Benchmark Results

Install the dependencies, then run the benchmark scripts against the dataset listed above (almaghrabima/deeplatent-benchmark-data):

# Install dependencies
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub

# Run parity benchmark (vs GPT-4o, Gemma, etc.)
python benchmark_pypi.py

# Run throughput benchmark (vs tiktoken)
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8

# Run comprehensive tests (roundtrip + edge cases)
python test_comprehensive_million.py --samples 1000000 --report

Requirements

  • Python 3.9+
  • Rust 1.70+ (for building from source)

License

CC-BY-NC-4.0

Citation

@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}
