DeepLatent - Morphology-aware tokenizer for Arabic/English bilingual text with native core
DeepLatent
DeepLatent is a SARF tokenizer for Arabic/English bilingual text with a native Rust core.
This package provides the SARF (Sarf-Aware Representation Framework) tokenizer, which applies morpheme-level preprocessing before BPE tokenization. With preprocessing enabled, it reaches an Arabic/English parity of 1.09 (see Performance).
Installation
pip install deeplatent-nlp
Building from Source
Installing from source requires a Rust toolchain:
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install from source
pip install .
Quick Start
from deeplatent import SARFTokenizer
# Load tokenizer from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")
# Encode text (SARF preprocessing is applied automatically for Arabic)
arabic_text = "مرحبا بكم في هذا الاختبار"
tokens = tokenizer.encode(arabic_text)
print(f"Token count: {len(tokens)}")
# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
# Works with English too
english_text = "Hello world, this is a test"
tokens = tokenizer.encode(english_text)
print(f"English token count: {len(tokens)}")
Performance
| Metric | With SARF Preprocessing | Without Preprocessing |
|---|---|---|
| Arabic Fertility | 2.29 | 5.65 |
| English Fertility | 2.10 | 2.91 |
| Parity (Ar/En) | 1.09 | 1.94 |
| Interpretation | EXCELLENT | Moderate |
Fertility is the average number of tokens per word; lower is better. Parity is the ratio of Arabic fertility to English fertility; a value closer to 1.0 means the two languages are tokenized at a more equal cost.
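The two metrics can be sketched in a few lines of Python. This is an illustration of the definitions above, not the evaluation script behind the table: the `fertility` and `parity` helpers are hypothetical names, and counting words by whitespace is an assumption, since the word-counting method is not stated here.

```python
from typing import Callable, Iterable, List


def fertility(encode: Callable[[str], List[int]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed word definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def parity(arabic_fertility: float, english_fertility: float) -> float:
    """Ratio of Arabic to English fertility; 1.0 means equal cost per word."""
    return arabic_fertility / english_fertility


# Recompute the parity row of the table from its fertility rows.
print(round(parity(2.29, 2.10), 2))  # 1.09
print(round(parity(5.65, 2.91), 2))  # 1.94
```

Any callable that maps a string to a list of token IDs can be passed as `encode`, for example `tokenizer.encode` from the Quick Start.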
Supported Platforms
Pre-built wheels are available for:
- Linux (manylinux2014, x86_64)
- macOS (x86_64, arm64)
- Windows (x86_64)
For other platforms, the package will build from source (requires Rust).
What is SARF?
SARF (صَرْف) is the Arabic term for morphology. In Arabic linguistics, ṣarf refers to the system that governs:
- Word formation
- Roots and patterns (جذر / وزن)
- Prefixes, suffixes, infixes
- Tense, gender, number, and derivation
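Most tokenizers treat Arabic as a sequence of bytes or characters. SARF instead splits Arabic text at the morpheme level before BPE, so the tokenizer follows the structure of the language.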
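To make "morpheme-level preprocessing" concrete, here is a toy affix splitter. It is not DeepLatent's algorithm: the affix lists, the `segment` function, and the minimum stem length are all assumptions chosen for illustration, and a real morphological analyzer needs far more than string matching.

```python
from typing import List

# Hypothetical, tiny affix inventory for illustration only.
CONJUNCTIONS = ("و", "ف")        # "and", "so"
ARTICLE = "ال"                   # definite article
SUFFIXES = ("ها", "ات", "ون")    # a pronoun and two plural endings
MIN_STEM = 3                     # keep at least three letters in the stem


def segment(word: str) -> List[str]:
    """Split a word into [prefixes..., stem, suffix] using naive string rules."""
    prefixes: List[str] = []
    suffixes: List[str] = []
    stem = word
    # Peel one conjunction, then the article, only if enough stem remains.
    if stem.startswith(CONJUNCTIONS) and len(stem) - 1 >= MIN_STEM:
        prefixes.append(stem[0])
        stem = stem[1:]
    if stem.startswith(ARTICLE) and len(stem) - len(ARTICLE) >= MIN_STEM:
        prefixes.append(ARTICLE)
        stem = stem[len(ARTICLE):]
    # Peel at most one suffix under the same length guard.
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
            suffixes.append(suffix)
            stem = stem[: -len(suffix)]
            break
    return prefixes + [stem] + suffixes


print(segment("والكتاب"))
print(segment("كتابها"))
print(segment("ولد"))
```

The first call splits the word into the conjunction, the article, and the stem; the second separates the pronoun suffix; the third leaves the word whole because the stem would become too short. String rules like these also missplit any word that merely begins with one of the listed letters, which is why real morphological segmentation relies on linguistic knowledge rather than prefix matching.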
API Reference
SARFTokenizer
from deeplatent import SARFTokenizer
# Load from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")
# Load from local directory
tokenizer = SARFTokenizer.from_directory("./my_tokenizer")
# Disable preprocessing (not recommended for Arabic)
tokenizer = SARFTokenizer.from_pretrained(
    "almaghrabima/deeplatent-tokenizer",
    use_preprocessing=False,
)
Encoding
# Simple encoding
tokens = tokenizer.encode("مرحبا بكم")
# With options
result = tokenizer.encode(
    "مرحبا بكم",
    add_special_tokens=True,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",  # or "tf" for TensorFlow
)
# Batch encoding
texts = ["مرحبا", "Hello", "مرحبا بكم في العالم"]
batch_tokens = tokenizer.encode_batch(texts)
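What `padding`, `truncation`, and `max_length` usually mean for a batch of token IDs can be sketched in plain Python. This is a generic illustration, not the library's implementation: `pad_and_truncate` is a hypothetical helper, and the pad ID of 0, right-side padding and truncation, and the attention mask are assumptions borrowed from common tokenizer conventions.

```python
from typing import List, Tuple


def pad_and_truncate(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max((len(ids) for ids in truncated), default=0)
    # Padded positions get pad_id (assumed to be 0 here) and a mask value of 0.
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask


ids, mask = pad_and_truncate([[5, 6, 7], [8]], max_length=2)
print(ids)   # [[5, 6], [8, 0]]
print(mask)  # [[1, 1], [1, 0]]
```

Equal-length rows are what make it possible to stack a batch into a single tensor, which is the usual reason to combine padding with `return_tensors`.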
Decoding
# Simple decoding
text = tokenizer.decode([1234, 5678, 9012])
# Batch decoding
texts = tokenizer.decode_batch([[1234, 5678], [9012, 3456]])
# Keep special tokens
text = tokenizer.decode(tokens, skip_special_tokens=False)
License
This tokenizer is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
For commercial licensing, please contact: almaghrabima@gmail.com
Author
- Mohammed Almaghrabi
- Email: almaghrabima@gmail.com
File details
Details for the file deeplatent_nlp-0.2.4.tar.gz.
File metadata
- Download URL: deeplatent_nlp-0.2.4.tar.gz
- Upload date:
- Size: 208.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7b19582f4d83fbbbffc7c2d1c7af37fe4b41a8f5230a3f00ca901fb701dc30bc |
| MD5 | ee0d5f524ccabc6e9ec25760fbcd533c |
| BLAKE2b-256 | cfb93eaa414ccbb0f9fc83d998432f255e00cc09700664f9926a55eebd7ce6bf |
File details
Details for the file deeplatent_nlp-0.2.4-cp310-cp310-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: deeplatent_nlp-0.2.4-cp310-cp310-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 420.5 kB
- Tags: CPython 3.10, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 76c523351d76a95d83784e5ba99bda47ba04189fb5ce7a2169d8d797e6c28cee |
| MD5 | 2760fffdb1268f82441f399dda6fba3c |
| BLAKE2b-256 | 97e101d52b37f14d0d827c1354d212df87ba3fc3d55a7865d4b15f0b4e5ae41c |