DeepLatent - Morphology-aware tokenizer for Arabic/English bilingual text with native core

DeepLatent

DeepLatent - SARF Tokenizer for Arabic/English bilingual text with native Rust core.

This package provides the SARF (Sarf-Aware Representation Framework) tokenizer, which applies morpheme-level preprocessing before BPE tokenization. With preprocessing enabled it reaches an Arabic/English parity of 1.09, meaning Arabic text costs about 9% more tokens per word than English.

Installation

pip install deeplatent-nlp

Building from Source

If installing from source, you'll need Rust installed:

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install from source
pip install .

Quick Start

from deeplatent import SARFTokenizer

# Load tokenizer from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Encode text (SARF preprocessing is applied automatically for Arabic)
arabic_text = "مرحبا بكم في هذا الاختبار"
tokens = tokenizer.encode(arabic_text)
print(f"Token count: {len(tokens)}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Works with English too
english_text = "Hello world, this is a test"
tokens = tokenizer.encode(english_text)
print(f"English token count: {len(tokens)}")

Performance

Metric              With SARF Preprocessing   Without Preprocessing
Arabic Fertility    2.29                      5.65
English Fertility   2.10                      2.91
Parity (Ar/En)      1.09                      1.94
Interpretation      Excellent                 Moderate

Fertility is the average number of tokens per word; lower is better. Parity is the ratio of Arabic fertility to English fertility; a value closer to 1.0 means the two languages are tokenized at more equal cost.
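
Both metrics can be computed for any encoder. The sketch below is a minimal illustration, assuming words are whitespace-delimited and that `encode` is any callable returning a sequence of token IDs (as `tokenizer.encode` does in the Quick Start). The helper names `fertility` and `parity` are hypothetical and not part of the package.

```python
from typing import Callable, Iterable, Sequence


def fertility(encode: Callable[[str], Sequence[int]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def parity(arabic_fertility: float, english_fertility: float) -> float:
    """Ratio of Arabic to English fertility; 1.0 means equal cost per word."""
    return arabic_fertility / english_fertility


# Recompute the parity row of the table from its fertility rows.
print(round(parity(2.29, 2.10), 2))  # 1.09
print(round(parity(5.65, 2.91), 2))  # 1.94
```

To measure a real tokenizer, pass `tokenizer.encode` as the first argument of `fertility` along with a list of sample sentences in each language.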

Supported Platforms

Pre-built wheels are available for:

  • Linux (manylinux2014, x86_64)
  • macOS (x86_64, arm64)
  • Windows (x86_64)

For other platforms, the package will build from source (requires Rust).
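
To check whether the current machine matches one of the listed platforms, something like the following can help. It is a rough sketch based only on the list above: the `PREBUILT` set and the `has_prebuilt_wheel` helper are hypothetical, and the check ignores the Python version and glibc version, which also determine whether a wheel is usable.

```python
import platform

# (system, machine) pairs derived from the platform list above. The machine
# spellings are the values platform.machine() typically reports on each OS
# (64-bit Windows on x86 reports "AMD64").
PREBUILT = {
    ("Linux", "x86_64"),
    ("Darwin", "x86_64"),
    ("Darwin", "arm64"),
    ("Windows", "AMD64"),
}


def has_prebuilt_wheel(system: str, machine: str) -> bool:
    """Return True if the platform pair appears in the list above."""
    return (system, machine) in PREBUILT


if has_prebuilt_wheel(platform.system(), platform.machine()):
    print("A pre-built wheel is listed for this platform.")
else:
    print("No listed wheel; pip will build from source (requires Rust).")
```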

What is SARF?

SARF (صَرْف) is the Arabic term for morphology. In Arabic linguistics, ṣarf refers to the system that governs:

  • Word formation
  • Roots and patterns (جذر / وزن)
  • Prefixes, suffixes, infixes
  • Tense, gender, number, and derivation

Most tokenizers treat Arabic as bytes or characters. SARF treats Arabic as a language.
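
To make the idea of morpheme-level preprocessing concrete, here is a toy sketch of affix splitting. It is not the SARF algorithm, whose rules are not described in this document: the affix lists, the `min_stem` guard, and the `segment` function are all assumptions made for illustration.

```python
# Toy affix inventories (illustrative only, far from complete).
CONJUNCTIONS = ("و", "ف")           # "and", "so"
ARTICLE = "ال"                       # definite article "the"
SUFFIXES = ("ها", "هم", "ات", "ون", "ين")


def segment(word: str, min_stem: int = 3) -> list[str]:
    """Split a word into [prefixes..., stem, suffix] using the toy lists.

    An affix is only stripped if at least `min_stem` characters remain,
    which avoids cutting letters that belong to the root itself.
    """
    pieces = []
    stem = word
    if stem[:1] in CONJUNCTIONS and len(stem) - 1 >= min_stem:
        pieces.append(stem[0])
        stem = stem[1:]
    if stem.startswith(ARTICLE) and len(stem) - len(ARTICLE) >= min_stem:
        pieces.append(ARTICLE)
        stem = stem[len(ARTICLE):]
    suffix = None
    for candidate in SUFFIXES:
        if stem.endswith(candidate) and len(stem) - len(candidate) >= min_stem:
            suffix = candidate
            stem = stem[: -len(candidate)]
            break
    pieces.append(stem)
    if suffix is not None:
        pieces.append(suffix)
    return pieces


print(segment("والكتاب"))  # conjunction + article + stem ("and the book")
print(segment("كتابها"))   # stem + pronoun suffix ("her book")
print(segment("ولد"))      # left whole: the first letter is part of the root
```

Splitting on morpheme boundaries before BPE lets frequent affixes and stems be learned as reusable units, which is one plausible reason preprocessing lowers Arabic fertility.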

API Reference

SARFTokenizer

from deeplatent import SARFTokenizer

# Load from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Load from local directory
tokenizer = SARFTokenizer.from_directory("./my_tokenizer")

# Disable preprocessing (not recommended for Arabic)
tokenizer = SARFTokenizer.from_pretrained(
    "almaghrabima/deeplatent-tokenizer",
    use_preprocessing=False
)

Encoding

# Simple encoding
tokens = tokenizer.encode("مرحبا بكم")

# With options
result = tokenizer.encode(
    "مرحبا بكم",
    add_special_tokens=True,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"  # or "tf" for TensorFlow
)

# Batch encoding
texts = ["مرحبا", "Hello", "مرحبا بكم في العالم"]
batch_tokens = tokenizer.encode_batch(texts)

Decoding

# Simple decoding
text = tokenizer.decode([1234, 5678, 9012])

# Batch decoding
texts = tokenizer.decode_batch([[1234, 5678], [9012, 3456]])

# Keep special tokens
text = tokenizer.decode(tokens, skip_special_tokens=False)
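
A quick sanity check when adopting a tokenizer is to confirm that decoding the encoded text gives back the input. The sketch below assumes only the `encode` and `decode` methods shown above. `roundtrip_report` is a hypothetical helper, and `StubTokenizer` is a stand-in so the example runs without the package installed. Whether SARF preprocessing round-trips every input exactly is not stated here, so treat the result as a diagnostic rather than a guarantee.

```python
def roundtrip_report(tokenizer, text: str) -> dict:
    """Encode then decode `text` and summarize the result."""
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    return {
        "tokens": len(tokens),
        "words": len(text.split()),
        "lossless": decoded == text,
    }


class StubTokenizer:
    """Stand-in with the same encode/decode shape (one ID per word)."""

    def __init__(self):
        self._ids = {}
        self._words = []

    def encode(self, text):
        ids = []
        for word in text.split():
            if word not in self._ids:
                self._ids[word] = len(self._words)
                self._words.append(word)
            ids.append(self._ids[word])
        return ids

    def decode(self, ids):
        return " ".join(self._words[i] for i in ids)


print(roundtrip_report(StubTokenizer(), "Hello world"))
```

With the real package, pass the object returned by `SARFTokenizer.from_pretrained(...)` in place of `StubTokenizer()`.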

License

This tokenizer is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

For commercial licensing, please contact: almaghrabima@gmail.com
