High-performance Arabic-first tokenizer with morphology awareness

DeepLatent SARF Tokenizer

Part of the Suhail Project, independent research by Mohammed Almaghrabi

This is the SARF (Sarf-Aware Representation Framework) tokenizer designed for the DeepLatent language model, trained on bilingual Arabic/English data.

What is SARF?

SARF (صَرْف) is the Arabic term for morphology. In classical and modern Arabic linguistics, ṣarf refers to the system that governs:

Word formation
Roots and patterns (جذر / وزن)
Prefixes, suffixes, infixes
Tense, gender, number, and derivation

Ṣarf is the exact linguistic layer that makes Arabic hard for naive tokenizers.

SARF combines morphological analysis with BPE tokenization to achieve better compression, especially for morphologically rich languages like Arabic.

Most tokenizers treat Arabic as bytes or characters. SARF treats Arabic as a language.
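
To make the idea concrete, here is a toy sketch of morpheme-aware pre-segmentation: affixes are split from the stem before any subword merging happens, so that a BPE stage could be restricted to merges inside each piece. This is not the SARF implementation. The affix lists, the minimum stem length, and the function names are hypothetical and far simpler than a real morphological analyzer.

```python
# Toy affix lists (hypothetical); a real analyzer would use a full lexicon.
PREFIXES = ("وال", "بال", "ال", "و", "ب")  # longer prefixes are listed first
SUFFIXES = ("ات", "ون", "ها", "ة")
MIN_STEM = 3  # assumption: never shrink the stem below three letters


def split_affixes(word: str) -> list:
    """Peel at most one prefix and one suffix off a single word."""
    pieces = []
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            pieces.append(prefix)
            word = word[len(prefix):]
            break
    suffix = None
    for candidate in SUFFIXES:
        if word.endswith(candidate) and len(word) - len(candidate) >= MIN_STEM:
            suffix = candidate
            word = word[: -len(candidate)]
            break
    pieces.append(word)  # whatever remains is treated as the stem
    if suffix is not None:
        pieces.append(suffix)
    return pieces


def pre_segment(text: str) -> list:
    """Split on whitespace, then split each word into affix and stem pieces."""
    return [split_affixes(word) for word in text.split()]


print(pre_segment("مرحبا بالعالم"))
```

In this sketch the pieces act only as boundaries; how SARF actually combines its morphological analysis with BPE is not described in this README.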

Features

Arabic-Optimized: Designed specifically for Arabic and morphologically-rich languages
Fast: Rust core with Python bindings (43,000+ texts/sec with parallel processing)
Accurate: 100% roundtrip accuracy on 999,999 test samples
Edge Case Handling: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
Unicode Support: Full support for Arabic diacritics and mixed scripts
Parallel Processing: Good thread scaling (4.4x speedup from 1 to 8 threads in the throughput benchmark)
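
As a rough illustration of parallel processing from Python, the sketch below fans a list of texts out over a thread pool. It takes any `encode` callable and does not assume a particular batch API in this package. The helper name `encode_batch` and the stand-in encoder are hypothetical. Threads only give a speedup if the native core releases the GIL while encoding, which is an assumption here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def encode_batch(encode: Callable[[str], List[int]],
                 texts: Sequence[str],
                 threads: int = 8) -> List[List[int]]:
    """Encode texts on a thread pool, keeping results in input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(encode, texts))


def fake_encode(text: str) -> List[int]:
    """Stand-in encoder (code points), used only to keep the sketch runnable."""
    return [ord(ch) for ch in text]


batch = encode_batch(fake_encode, ["abc", "de"], threads=2)
print(batch)
```

With the real tokenizer, `tok.encode` from the Quick Start would be passed in place of `fake_encode`.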

Installation

pip install deeplatent-nlp

Quick Start

from deeplatent import SARFTokenizer

# Load tokenizer
tok = SARFTokenizer.from_pretrained("SARFTokenizer")

# Encode text
ids = tok.encode("مرحبا بالعالم")
print(ids)

# Decode back
text = tok.decode(ids)
print(text)

Edge Cases Handled

Case	Example	Handling
Diacritics	بِسْمِ	Properly normalized
Arabic-Indic digits	٠١٢٣٤٥	Preserved
Alef variants	أ إ آ ا	Normalized to ا
Taa marbuta	ة	Optionally normalized
Tatweel (kashida)	كـتـاب	Removed
Mixed Arabic/English	Hello مرحبا	Both handled
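
The normalizations in the table can be approximated in plain Python. The sketch below is illustrative only and is not the library's normalizer. The function name, the flags, the diacritic range, and the choice to map taa marbuta to haa are assumptions.

```python
import re

TATWEEL = "\u0640"
# Alef with hamza above, hamza below, and madda all map to bare alef.
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
# Harakat, tanween, shadda, sukun (U+064B..U+0652) and superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize_arabic(text: str,
                     strip_diacritics: bool = False,
                     normalize_taa_marbuta: bool = False) -> str:
    """Apply the table's normalizations; the optional steps are off by default."""
    text = text.replace(TATWEEL, "")       # remove kashida
    text = text.translate(ALEF_VARIANTS)   # unify alef variants
    if strip_diacritics:
        text = DIACRITICS.sub("", text)
    if normalize_taa_marbuta:
        text = text.replace("\u0629", "\u0647")  # assumption: taa marbuta becomes haa
    return text


print(normalize_arabic("كـتـاب"))
print(normalize_arabic("Hello مرحبا"))
```

Arabic-Indic digits and Latin text pass through unchanged because no rule touches them.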

Performance

Tokenizer Benchmark Results

Comparison with state-of-the-art tokenizers on 60,000 samples (30k Arabic + 30k English).

Dataset: almaghrabima/deeplatent-benchmark-data

Tokenizer	Vocab	AR Fert	EN Fert	Avg Fert	AR C/T	EN C/T	Parity
SARFTokenizer	64,641	1.72	1.57	1.64	3.45	2.99	1.156
ALLaM-7B	64,000	1.82	1.48	1.65	3.08	2.65	1.163
Gemma-3-4B	262,145	2.78	1.33	2.05	2.42	3.00	0.805
Falcon-H1-7B	130,049	2.65	1.55	2.10	2.55	2.75	0.926
Fanar-1-9B	128,256	2.85	1.36	2.11	2.27	2.93	0.775
Hala-9B	128,256	2.85	1.36	2.11	2.27	2.93	0.775
GPT-4o	200,019	2.81	1.44	2.12	2.45	3.37	0.726
Command-R-Arabic	255,033	3.00	1.33	2.16	2.17	3.04	0.714
Qwen3-4B	151,669	3.06	1.50	2.28	2.04	2.92	0.697
GPT-4	100,277	4.59	1.50	3.05	1.35	3.24	0.417
Mistral-7B-v0.3	32,768	5.56	1.48	3.52	1.11	2.64	0.418

Metrics explained:

Fertility: Average tokens per word (lower is better - more efficient encoding)
C/T: Characters per token (higher is better - more characters encoded per token)
Parity: AR chars/token ÷ EN chars/token (1.0 = equal treatment of both languages)
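
A minimal sketch of how these three metrics can be computed for any tokenizer follows. It assumes words are whitespace-separated and that characters are counted as Python code points, spaces included. The benchmark script may count differently, so numbers from this sketch need not match the table. `toy_encode` is a stand-in, not a real tokenizer.

```python
from typing import Callable, List, Sequence

Encoder = Callable[[str], List]


def fertility(encode: Encoder, texts: Sequence[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    tokens = sum(len(encode(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words


def chars_per_token(encode: Encoder, texts: Sequence[str]) -> float:
    """Average number of characters covered by one token."""
    chars = sum(len(t) for t in texts)
    tokens = sum(len(encode(t)) for t in texts)
    return chars / tokens


def parity(encode: Encoder, ar_texts: Sequence[str], en_texts: Sequence[str]) -> float:
    """Arabic chars/token divided by English chars/token."""
    return chars_per_token(encode, ar_texts) / chars_per_token(encode, en_texts)


def toy_encode(text: str) -> List[str]:
    """Stand-in tokenizer: cut every word into chunks of up to 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


print(fertility(toy_encode, ["hello world"]))
print(chars_per_token(toy_encode, ["hello world"]))
```

Passing `tok.encode` from the Quick Start instead of `toy_encode` would measure the real tokenizer on your own sample texts.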

Key findings:

SARFTokenizer achieves the best Arabic fertility (1.72 tokens/word), about 39% fewer tokens per word than GPT-4o (2.81)
Lowest average fertility (1.64) among all tokenizers tested
Best Arabic characters/token (3.45) - encodes more Arabic per token than any competitor
Compact vocabulary (64k) while maintaining top performance
ALLaM-7B shows similar efficiency (both use morpheme-aware approaches)
Falcon-H1-7B has the parity closest to 1.0 (0.926) but 28% higher average fertility than SARF (2.10 vs 1.64)
GPT-4 and Mistral struggle with Arabic (4.6-5.6 tokens/word vs 1.7 for SARF)

Throughput Benchmark (1M samples, 680 MB)

Comparison with tiktoken and Hugging Face tokenizers on 1,000,000 documents:

Tokenizer	1 Thread	2 Threads	4 Threads	8 Threads
SARFTokenizer	3.14 MB/s	5.57 MB/s	9.00 MB/s	13.72 MB/s
tiktoken (o200k)	6.23 MB/s	10.55 MB/s	14.90 MB/s	10.60 MB/s
tiktoken (cl100k)	7.99 MB/s	11.68 MB/s	12.02 MB/s	8.47 MB/s
HF tokenizers	1.88 MB/s	3.97 MB/s	9.27 MB/s	17.47 MB/s

Key findings:

SARFTokenizer outperforms tiktoken at 8 threads (13.72 MB/s vs 8.47-10.60 MB/s); tiktoken is faster at 1 to 4 threads, and HF tokenizers is fastest at 8 threads (17.47 MB/s)
Excellent parallel scaling: 4.4x speedup from 1 to 8 threads
tiktoken degrades with more threads (peaks at 4T, drops at 8T)
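
The scaling figures can be recomputed directly from the throughput table. The short script below copies the table values and derives the 1-to-8-thread speedup and the thread count at which each tokenizer peaks; the helper names are illustrative.

```python
# Throughput in MB/s, copied from the table above (threads -> MB/s).
throughput_mb_s = {
    "SARFTokenizer": {1: 3.14, 2: 5.57, 4: 9.00, 8: 13.72},
    "tiktoken (o200k)": {1: 6.23, 2: 10.55, 4: 14.90, 8: 10.60},
    "tiktoken (cl100k)": {1: 7.99, 2: 11.68, 4: 12.02, 8: 8.47},
    "HF tokenizers": {1: 1.88, 2: 3.97, 4: 9.27, 8: 17.47},
}


def speedup(name: str, threads: int = 8) -> float:
    """Throughput at `threads` relative to single-thread throughput."""
    row = throughput_mb_s[name]
    return row[threads] / row[1]


def peak_threads(name: str) -> int:
    """Thread count with the highest measured throughput."""
    row = throughput_mb_s[name]
    return max(row, key=row.get)


for name in throughput_mb_s:
    print(f"{name}: {speedup(name):.2f}x at 8 threads, peak at {peak_threads(name)} threads")
```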

Million-Scale Roundtrip Accuracy

Tested on 999,999 samples from real-world data:

Category	Samples	Success	Accuracy
Arabic	333,333	333,333	100.00%
English	333,333	333,333	100.00%
Mixed	333,333	333,333	100.00%
TOTAL	999,999	999,999	100.00%
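
A roundtrip check of this kind can be sketched in a few lines. The version below counts a sample as a success when `decode(encode(text))` equals the input exactly. Whether the project's test script compares against the raw or the normalized text is not stated here, so exact equality is an assumption. The UTF-8 byte codec is a stand-in used only to keep the sketch runnable.

```python
from typing import Callable, List, Sequence, Tuple


def roundtrip_accuracy(encode: Callable[[str], List[int]],
                       decode: Callable[[List[int]], str],
                       samples: Sequence[str]) -> Tuple[int, int, float]:
    """Return (successes, total, accuracy in percent) for exact roundtrips."""
    successes = sum(1 for s in samples if decode(encode(s)) == s)
    total = len(samples)
    return successes, total, 100.0 * successes / total


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder: UTF-8 bytes as token ids."""
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    """Stand-in decoder matching byte_encode."""
    return bytes(ids).decode("utf-8")


samples = ["مرحبا بالعالم", "Hello", "Hello مرحبا"]
result = roundtrip_accuracy(byte_encode, byte_decode, samples)
print(result)
```

With the real tokenizer, `tok.encode` and `tok.decode` from the Quick Start would replace the stand-in codec.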

Edge Case Tests (58/58 Passed)

All 12 edge case categories pass with 100% success:

Category	Tests	Status
Unicode Normalization	6	PASS
Zero-Width Characters	6	PASS
Unicode Whitespace	6	PASS
Grapheme Clusters	6	PASS
Apostrophes	4	PASS
Dashes	4	PASS
Decimal Separators	3	PASS
URLs/Emails	4	PASS
File Paths	3	PASS
Code Identifiers	4	PASS
Mixed Scripts/RTL	6	PASS
Robustness	6	PASS

Reproduce Benchmark Results

Datasets:

Benchmark data (60k samples): almaghrabima/deeplatent-benchmark-data
Eval test data: almaghrabima/eval-test-data

# Install dependencies
pip install deeplatent-nlp pyarrow tiktoken transformers huggingface-hub

# Run parity benchmark (vs GPT-4o, Gemma, etc.)
python benchmark_pypi.py

# Run throughput benchmark (vs tiktoken)
python benchmark_tiktoken_style.py --samples 1000000 --threads 1 2 4 8

# Run comprehensive tests (roundtrip + edge cases)
python test_comprehensive_million.py --samples 1000000 --report

Requirements

Python 3.9+
Rust 1.70+ (for building from source)

License

CC-BY-NC-4.0

Citation

@misc{sarf-tokenizer-2026,
  title={SARF: A Morpheme-Aware Tokenization Framework for Arabic-English - Suhail Project},
  author={Almaghrabi, Mohammed},
  year={2026},
  url={https://huggingface.co/almaghrabima/SARFTokenizer},
  note={Independent research, part of Suhail Project}
}
