
Project description

DeepLatent

DeepLatent: a SARF tokenizer for Arabic/English bilingual text with a native Rust core.

This package provides the SARF (Sarf-Aware Representation Framework) tokenizer, which achieves near-equal Arabic/English tokenization parity (1.09) by applying morpheme-level preprocessing before BPE tokenization.
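To illustrate the idea of morpheme-level preprocessing before BPE (this is a conceptual sketch, not the actual SARF algorithm or part of the deeplatent API), consider splitting common Arabic clitic prefixes off each word so that a subword vocabulary can reuse the same stem across inflected forms:

```python
# Illustrative only: greedily strip one common Arabic clitic prefix
# (e.g. the definite article "ال" or the conjunction "و") before
# handing the text to a subword tokenizer.
COMMON_PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ب", "ل", "ك"]

def segment_clitics(word: str) -> list[str]:
    """Split off one leading clitic, if present and the remainder is long enough."""
    for p in COMMON_PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 1:
            return [p, word[len(p):]]
    return [word]

def preprocess(text: str) -> str:
    """Apply clitic segmentation word by word."""
    return " ".join(" ".join(segment_clitics(w)) for w in text.split())

print(preprocess("والكتاب جميل"))  # وال كتاب جميل
```

A real morphological segmenter is far more careful than this greedy prefix match (it would avoid splitting words that merely start with a prefix-shaped letter), but the principle is the same: present the BPE learner with morphemes rather than raw surface forms.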

Installation

pip install deeplatent-nlp

Or with uv:

uv add deeplatent-nlp

Building from Source

If installing from source, you'll need Rust installed:

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install from source
pip install .

Quick Start

from deeplatent import SARFTokenizer

# Load tokenizer from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Encode text (SARF preprocessing is applied automatically for Arabic)
arabic_text = "مرحبا بكم في هذا الاختبار"
tokens = tokenizer.encode(arabic_text)
print(f"Token count: {len(tokens)}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Works with English too
english_text = "Hello world, this is a test"
tokens = tokenizer.encode(english_text)
print(f"English token count: {len(tokens)}")

Performance

Metric              With SARF Preprocessing   Without Preprocessing
Arabic Fertility    2.29                      5.65
English Fertility   2.10                      2.91
Parity (Ar/En)      1.09                      1.94
Interpretation      Excellent                 Moderate

Fertility is the average number of tokens per word; lower is better. Parity is the ratio of Arabic fertility to English fertility; a value closer to 1.0 means the two languages are tokenized with similar efficiency.
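These metrics are easy to compute for any tokenizer. A minimal sketch, where the `encode` callable stands in for any tokenizer's encode method:

```python
def fertility(texts: list[str], encode) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

def parity(ar_fertility: float, en_fertility: float) -> float:
    """Ratio of Arabic to English fertility; 1.0 means equal treatment."""
    return ar_fertility / en_fertility

# Plugging in the table's numbers:
print(round(parity(2.29, 2.10), 2))  # 1.09
```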

Supported Platforms

Pre-built wheels are available for:

  • Linux (manylinux2014, x86_64)
  • macOS (x86_64, arm64)
  • Windows (x86_64)

For other platforms, the package will build from source (requires Rust).

What is SARF?

SARF (صَرْف) is the Arabic term for morphology. In Arabic linguistics, ṣarf refers to the system that governs:

  • Word formation
  • Roots and patterns (جذر / وزن)
  • Prefixes, suffixes, infixes
  • Tense, gender, number, and derivation

Most tokenizers treat Arabic as bytes or characters. SARF treats Arabic as a language.
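Root-and-pattern (templatic) morphology can be shown with a small toy. This is a conceptual illustration of the جذر/وزن system, not part of the deeplatent API: a triliteral root such as k-t-b (كتب) is slotted into a pattern whose placeholder letters ف / ع / ل mark the root's three consonants.

```python
# Toy illustration of Arabic templatic morphology: fill the ف/ع/ل slots
# of a pattern (وزن) with the three consonants of a root (جذر).

def apply_pattern(root: str, pattern: str) -> str:
    """Replace the placeholder letters ف/ع/ل in `pattern` with the root consonants."""
    slots = dict(zip("فعل", root))  # ف -> 1st radical, ع -> 2nd, ل -> 3rd
    return "".join(slots.get(ch, ch) for ch in pattern)

root = "كتب"                        # k-t-b, the root for "writing"
print(apply_pattern(root, "فاعل"))   # كاتب  (active participle, "writer")
print(apply_pattern(root, "مفعول"))  # مكتوب (passive participle, "written")
```

Because thousands of surface words are generated from a few thousand roots via a small set of patterns, a tokenizer that is aware of this structure can represent Arabic far more compactly than one that sees only bytes.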

API Reference

SARFTokenizer

from deeplatent import SARFTokenizer

# Load from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Load from local directory
tokenizer = SARFTokenizer.from_directory("./my_tokenizer")

# Disable preprocessing (not recommended for Arabic)
tokenizer = SARFTokenizer.from_pretrained(
    "almaghrabima/deeplatent-tokenizer",
    use_preprocessing=False
)

Encoding

# Simple encoding
tokens = tokenizer.encode("مرحبا بكم")

# With options
result = tokenizer.encode(
    "مرحبا بكم",
    add_special_tokens=True,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"  # or "tf" for TensorFlow
)

# Batch encoding
texts = ["مرحبا", "Hello", "مرحبا بكم في العالم"]
batch_tokens = tokenizer.encode_batch(texts)

Decoding

# Simple decoding
text = tokenizer.decode([1234, 5678, 9012])

# Batch decoding
texts = tokenizer.decode_batch([[1234, 5678], [9012, 3456]])

# Keep special tokens
text = tokenizer.decode(tokens, skip_special_tokens=False)

License

This tokenizer is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

For commercial licensing, please contact: almaghrabima@gmail.com

Author

Mohammed Almaghrabi
Email: almaghrabima@gmail.com
