
Suhail

High-performance Arabic tokenizer with morphology awareness. Built with Rust for speed, with Python bindings for ease of use.

Features

  • Arabic-Optimized: Designed specifically for Arabic and other morphologically rich languages
  • Fast: Rust core with Python bindings (~30,000 operations/sec)
  • Accurate: 100% roundtrip accuracy on 300,000+ test samples
  • Edge Case Handling: Correct treatment of diacritics (tashkeel), prefixes, suffixes, and special characters
  • Unicode Support: Full support for Arabic diacritics, PUA characters, and mixed scripts
  • IP Protection: AES-256-GCM encrypted morpheme maps (no license key required)

Installation

pip install deeplatent-nlp

Quick Start

Using SarfCodec (Recommended)

The SarfCodec class provides encode/decode functionality using a morpheme map:

from suhail import SarfCodec

# Create codec from morpheme map dictionary
morf_map = {
    'ال': '\uE000',      # definite article -> PUA
    'كتاب': '\uE001',    # kitab -> PUA
    'و': '\uE002',       # wa (and) -> PUA
    'ب': '\uE003',       # bi (with) -> PUA
}
codec = SarfCodec(morf_map)

# Encode Arabic text (morphemes -> PUA characters)
text = "الكتاب"
encoded = codec.encode(text)
print(f"Encoded: {repr(encoded)}")  # '\ue000\ue001'

# Decode back to Arabic (PUA -> morphemes)
decoded = codec.decode(encoded)
print(f"Decoded: {decoded}")  # 'الكتاب'

# Verify roundtrip
normalized, decoded, is_ok = codec.roundtrip(text)
print(f"Roundtrip OK: {is_ok}")  # True
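
Conceptually, the encode/decode cycle is a reversible substitution: each known morpheme is swapped for a unique PUA character, and everything else passes through unchanged. A minimal pure-Python sketch of that idea (the real implementation is the compiled Rust core; this greedy longest-match version is only illustrative):

```python
# Illustrative sketch only -- not the library's actual algorithm.
MORF_MAP = {
    'ال': '\uE000',    # definite article -> PUA
    'كتاب': '\uE001',  # kitab -> PUA
}
REVERSE_MAP = {v: k for k, v in MORF_MAP.items()}

def encode(text: str) -> str:
    """Greedy longest-match: replace known morphemes with PUA codes,
    pass unknown characters through unchanged."""
    out, i = [], 0
    lengths = sorted({len(m) for m in MORF_MAP}, reverse=True)
    while i < len(text):
        for n in lengths:
            chunk = text[i:i + n]
            if chunk in MORF_MAP:
                out.append(MORF_MAP[chunk])
                i += n
                break
        else:
            out.append(text[i])  # not a known morpheme: keep as-is
            i += 1
    return ''.join(out)

def decode(encoded: str) -> str:
    """Map PUA codes back to morphemes; other characters unchanged."""
    return ''.join(REVERSE_MAP.get(ch, ch) for ch in encoded)

print(repr(encode("الكتاب")))  # '\ue000\ue001'
```

Because the mapping is one-to-one and unknown characters pass through, decode(encode(text)) recovers the (normalized) input, which is what the roundtrip check above verifies.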

Loading from Encrypted File

Morpheme maps are distributed as encrypted .enc files for IP protection:

from suhail import SarfCodec

# Load from encrypted file (no license key needed)
codec = SarfCodec.from_encrypted("morf_map.enc")

# Use as normal
encoded = codec.encode("بسم الله الرحمن الرحيم")
decoded = codec.decode(encoded)

Creating Encrypted Morf Map Files

To create encrypted files from your JSON morf_map:

from suhail import SarfCodec, encrypt_morf_map

# Option 1: Encrypt JSON file directly
encrypt_morf_map("morf_map.json", "morf_map.enc")

# Option 2: Encrypt from dict
morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}
codec = SarfCodec(morf_map)
codec.encrypt_to_file("morf_map.enc")

Encryption details:

  • AES-256-GCM encryption
  • Key embedded in compiled Rust binary
  • Cannot be decrypted without the deeplatent-nlp library
  • Checksum verification for tamper detection
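
The checksum step can be pictured as a digest stored alongside the encrypted payload and re-checked on load. A minimal stdlib sketch of that idea (the actual .enc layout and key handling are internal to the Rust binary; this file format is hypothetical):

```python
import hashlib

# Hypothetical container format: 32-byte SHA-256 digest + payload.
# Only illustrates the tamper-detection idea, not the real .enc layout.

def seal(payload: bytes) -> bytes:
    """Prepend a SHA-256 digest so corruption is detectable on load."""
    return hashlib.sha256(payload).digest() + payload

def unseal(blob: bytes) -> bytes:
    """Split off the digest and verify it before returning the payload."""
    digest, payload = blob[:32], blob[32:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("morpheme map is corrupted or tampered with")
    return payload
```

Note that AES-GCM's built-in authentication tag already rejects tampered ciphertext on its own; an explicit checksum like this adds a second, independent integrity check.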

Standalone Functions

For quick one-off operations without creating a codec:

from suhail import encode, decode, normalize

morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}

# Encode text
encoded = encode("الكتاب", morf_map)

# Decode text
decoded = decode(encoded, morf_map)

# Normalize Arabic text (without encoding)
normalized = normalize("الكِتَابُ", level="medium")

Normalization Levels

The codec supports three normalization levels:

from suhail import SarfCodec

# Light normalization (minimal changes)
codec = SarfCodec(morf_map, normalization="light")

# Medium normalization (default - recommended)
codec = SarfCodec(morf_map, normalization="medium")

# Aggressive normalization (maximum normalization)
codec = SarfCodec(morf_map, normalization="aggressive")
Level       Alef Variants   Taa Marbuta   Diacritics   Tatweel
light       Preserved       Preserved     Preserved    Removed
medium      Normalized      Preserved     Stripped     Removed
aggressive  Normalized      Normalized    Stripped     Removed
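
The rules in the table can be sketched in pure Python as follows (an illustration of the documented behavior, not the library's actual code; the aggressive-level taa marbuta target ه is an assumption, since the table only says "Normalized"):

```python
import re

# Illustrative sketch of the three normalization levels.
DIACRITICS = re.compile('[\u064B-\u0652]')   # tashkeel: fathatan..sukun
ALEF_VARIANTS = str.maketrans('أإآ', 'ااا')  # hamza/madda forms -> bare alef
TATWEEL = '\u0640'                           # kashida

def normalize(text: str, level: str = "medium") -> str:
    text = text.replace(TATWEEL, '')          # all levels remove tatweel
    if level in ("medium", "aggressive"):
        text = DIACRITICS.sub('', text)       # strip diacritics
        text = text.translate(ALEF_VARIANTS)  # fold alef variants
    if level == "aggressive":
        text = text.replace('ة', 'ه')         # taa marbuta -> haa (assumed target)
    return text

print(normalize("الكِتَابُ"))  # الكتاب
```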

Handling Diacritics (Tashkeel)

The codec properly handles Arabic diacritics:

from suhail import SarfCodec

codec = SarfCodec(morf_map)

# Text with full tashkeel
text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
encoded = codec.encode(text)
decoded = codec.decode(encoded)

# Diacritics are handled correctly
print(decoded)  # Normalized form

Utility Functions

from suhail import is_arabic, is_pua, normalize, version

# Check if character is Arabic
is_arabic('ب')  # True
is_arabic('a')  # False

# Check if character is in Private Use Area
is_pua('\uE000')  # True
is_pua('ب')       # False

# Normalize Arabic text
normalize("الكِتَابُ")  # 'الكتاب'

# Get version
version()  # '0.1.0'
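
is_arabic and is_pua amount to Unicode block-range checks. Rough pure-Python equivalents (the library's exact ranges may be broader, e.g. including the Arabic Supplement block):

```python
# Illustrative block-range checks using standard Unicode ranges.

def is_arabic(ch: str) -> bool:
    """True if ch falls in the basic Arabic block (U+0600-U+06FF)."""
    return '\u0600' <= ch <= '\u06FF'

def is_pua(ch: str) -> bool:
    """True if ch falls in the BMP Private Use Area (U+E000-U+F8FF)
    or one of the two supplementary PUA planes."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

print(is_arabic('ب'), is_pua('\uE000'))  # True True
```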

Codec Statistics

codec = SarfCodec(morf_map)

# Get number of morphemes
print(codec.num_morphemes)  # 114

# Get detailed statistics
stats = codec.stats()
print(stats)
# {'total_morphemes': 114, 'basic_pua_codes': 114, 'supplementary_pua_codes': 0}

Performance

Tested on 300,000 samples with 100% accuracy:

Test                           Samples   Success Rate   Speed
Random Arabic/English          100,000   100%           ~30,000/sec
Diacritized Arabic (tashkeel)  100,000   100%           ~5,000/sec
Plain Arabic                   100,000   100%           ~6,000/sec
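
Numbers like these are straightforward to reproduce on your own data: time a tight loop over your samples and divide. A small generic measurement helper (not the project's benchmark harness):

```python
import time

def ops_per_sec(fn, samples, repeat=3):
    """Time fn over the sample list and report operations per second
    (best of `repeat` runs, to reduce scheduler noise)."""
    best = float('inf')
    for _ in range(repeat):
        start = time.perf_counter()
        for s in samples:
            fn(s)
        best = min(best, time.perf_counter() - start)
    return len(samples) / best

# Example with a stand-in for codec.encode:
rate = ops_per_sec(str.upper, ["الكتاب المفيد"] * 10_000)
```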

Edge Cases Handled

Case                   Example           Handling
Diacritics             بِسْمِ                Properly normalized
Arabic-Indic digits    ٠١٢٣٤٥            Preserved
Alef variants          أ إ آ ا           Normalized to ا
Taa marbuta            ة                 Optionally normalized
Tatweel (kashida)      كـتـاب            Removed
French guillemets      « »               Preserved
Mixed Arabic/English   Hello مرحبا       Both handled
URLs and emails        email@test.com    Preserved

Building from Source

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/almaghrabima/suhail-pkg
cd suhail-pkg
pip install maturin
maturin develop --release

# Run tests
python test_comprehensive.py
python test_large_scale.py

Requirements

  • Python 3.9+
  • Rust 1.70+ (for building from source)

License

Proprietary. Contact for licensing options.
