High-performance Arabic-first tokenizer with morphology awareness

Suhail

High-performance Arabic tokenizer with morphology awareness. Built with Rust for speed, with Python bindings for ease of use.

Features

  • Arabic-Optimized: Designed specifically for Arabic and morphologically-rich languages
  • Fast: Rust core with Python bindings (roughly 5,000 to 30,000 operations/sec depending on input; see Performance)
  • Accurate: 100% roundtrip accuracy on 300,000 test samples
  • Edge Case Handling: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
  • Unicode Support: Full support for Arabic diacritics, PUA characters, and mixed scripts
  • IP Protection: AES-256-GCM encrypted morpheme maps (no license key required)

Installation

pip install deeplatent-nlp

Quick Start

Using SarfCodec (Recommended)

The SarfCodec class provides encode/decode functionality using a morpheme map:

from suhail import SarfCodec

# Create codec from morpheme map dictionary
morf_map = {
    'ال': '\uE000',      # definite article -> PUA
    'كتاب': '\uE001',    # kitab -> PUA
    'و': '\uE002',       # wa (and) -> PUA
    'ب': '\uE003',       # bi (with) -> PUA
}
codec = SarfCodec(morf_map)

# Encode Arabic text (morphemes -> PUA characters)
text = "الكتاب"
encoded = codec.encode(text)
print(f"Encoded: {repr(encoded)}")  # '\ue000\ue001'

# Decode back to Arabic (PUA -> morphemes)
decoded = codec.decode(encoded)
print(f"Decoded: {decoded}")  # 'الكتاب'

# Verify roundtrip
normalized, decoded, is_ok = codec.roundtrip(text)
print(f"Roundtrip OK: {is_ok}")  # True
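
To make the encode/decode idea concrete, here is a minimal pure-Python sketch of a morpheme-to-PUA mapping. It assumes a greedy longest-match scan, which may differ from what the Rust core actually does; `encode_sketch` and `decode_sketch` are hypothetical names and are not part of the package.

```python
def encode_sketch(text, morf_map):
    """Greedy longest-match replacement of morphemes with PUA characters.

    Illustration only: the real SarfCodec may segment text differently.
    """
    # Try longer morphemes first so a full stem wins over a one-letter prefix.
    keys = sorted(morf_map, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(morf_map[key])
                i += len(key)
                break
        else:
            # No morpheme matches at this position: copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


def decode_sketch(encoded, morf_map):
    """Expand each PUA character back to the morpheme it stands for."""
    reverse = {pua: morpheme for morpheme, pua in morf_map.items()}
    return "".join(reverse.get(ch, ch) for ch in encoded)


morf_map = {
    'ال': '\uE000',      # definite article -> PUA
    'كتاب': '\uE001',    # kitab -> PUA
}
text = "الكتاب"
encoded = encode_sketch(text, morf_map)
decoded = decode_sketch(encoded, morf_map)
print(repr(encoded), decoded == text)
```

Characters with no entry in the map pass through unchanged, which is why mixed Arabic/English input can still roundtrip in this sketch.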

Loading from JSON File

from suhail import SarfCodec

# Load morpheme map from JSON file
codec = SarfCodec.from_file("morf_map.json")

# Use as normal
encoded = codec.encode("بسم الله الرحمن الرحيم")
decoded = codec.decode(encoded)
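
The layout of morf_map.json is not shown here. Assuming it is a flat JSON object that maps each morpheme to one PUA character, mirroring the dictionary passed to SarfCodec, such a file could be produced roughly as follows. `build_morf_map` is a hypothetical helper, not a library function.

```python
import json
import os
import tempfile


def build_morf_map(morphemes, start=0xE000):
    """Assign consecutive BMP Private Use Area code points to morphemes."""
    if start + len(morphemes) - 1 > 0xF8FF:
        raise ValueError("too many morphemes for the BMP Private Use Area")
    return {m: chr(start + i) for i, m in enumerate(morphemes)}


morf_map = build_morf_map(["ال", "كتاب", "و", "ب"])

# Written to a temporary directory here; use your own path in practice.
path = os.path.join(tempfile.mkdtemp(), "morf_map.json")
with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps the Arabic keys readable in the file.
    json.dump(morf_map, f, ensure_ascii=False, indent=2)

# Read the file back to confirm it parses to the same mapping.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded == morf_map)
```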

Encrypted Morf Map (IP Protection)

Protect your morpheme mappings with AES-256-GCM encryption. Only this library can read the encrypted files, and no license key is required.

from suhail import SarfCodec, encrypt_morf_map

# Option 1: Encrypt existing codec's morf_map
codec = SarfCodec.from_file("morf_map.json")
codec.encrypt_to_file("morf_map.enc")

# Option 2: Encrypt JSON file directly
encrypt_morf_map("morf_map.json", "morf_map.enc")

# Load from encrypted file (no key needed)
codec = SarfCodec.from_encrypted("morf_map.enc")

# Works exactly like normal codec
encoded = codec.encode("الكتاب")
decoded = codec.decode(encoded)

The encrypted file format:

  • Uses AES-256-GCM encryption
  • Decryption key is embedded in the compiled Rust library
  • Cannot be decrypted without the deeplatent-nlp library
  • Includes checksum for integrity verification

Standalone Functions

For quick one-off operations without creating a codec:

from suhail import encode, decode, normalize

morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}

# Encode text
encoded = encode("الكتاب", morf_map)

# Decode text
decoded = decode(encoded, morf_map)

# Normalize Arabic text (without encoding)
normalized = normalize("الكِتَابُ", level="medium")

Normalization Levels

The codec supports three normalization levels:

from suhail import SarfCodec

# Light normalization (minimal changes)
codec = SarfCodec(morf_map, normalization="light")

# Medium normalization (default - recommended)
codec = SarfCodec(morf_map, normalization="medium")

# Aggressive normalization (maximum normalization)
codec = SarfCodec(morf_map, normalization="aggressive")
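
| Level      | Alef Variants | Taa Marbuta | Diacritics | Tatweel |
|------------|---------------|-------------|------------|---------|
| light      | Preserved     | Preserved   | Preserved  | Removed |
| medium     | Normalized    | Preserved   | Stripped   | Removed |
| aggressive | Normalized    | Normalized  | Stripped   | Removed |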
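
The three levels can be approximated in plain Python as a sketch of the table above. Two things here are assumptions rather than documented behavior: the diacritic range (U+064B to U+0652) and the mapping of taa marbuta to haa at the aggressive level. `normalize_sketch` is a hypothetical name, not the library's `normalize`.

```python
import re

ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
DIACRITICS = re.compile("[\u064B-\u0652]")          # fathatan through sukun (assumed range)
TATWEEL = "\u0640"
ALEF = "\u0627"
TAA_MARBUTA = "\u0629"
HAA = "\u0647"


def normalize_sketch(text, level="medium"):
    """Apply the per-level rules from the normalization table."""
    if level not in ("light", "medium", "aggressive"):
        raise ValueError(f"unknown normalization level: {level!r}")
    # Every level removes tatweel (kashida).
    text = text.replace(TATWEEL, "")
    if level == "light":
        return text
    # Medium and aggressive unify alef variants and strip diacritics.
    text = ALEF_VARIANTS.sub(ALEF, text)
    text = DIACRITICS.sub("", text)
    if level == "aggressive":
        # Assumption: "Normalized" taa marbuta means mapping it to haa.
        text = text.replace(TAA_MARBUTA, HAA)
    return text


print(normalize_sketch("الكِتَابُ", level="medium"))
```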

Handling Diacritics (Tashkeel)

The codec normalizes Arabic diacritics according to the configured level before encoding:

from suhail import SarfCodec

codec = SarfCodec(morf_map)

# Text with full tashkeel
text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
encoded = codec.encode(text)
decoded = codec.decode(encoded)

# With the default "medium" level, diacritics are stripped during normalization,
# so the decoded text is the normalized (undiacritized) form
print(decoded)

Utility Functions

from suhail import is_arabic, is_pua, normalize, version

# Check if character is Arabic
is_arabic('ب')  # True
is_arabic('a')  # False

# Check if character is in Private Use Area
is_pua('\uE000')  # True
is_pua('ب')       # False

# Normalize Arabic text
normalize("الكِتَابُ")  # 'الكتاب'

# Get version
version()  # e.g. '0.3.7'
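
As a rough guide to what such character checks involve, the sketch below tests code points against Unicode block ranges. The PUA ranges are the standard ones; which Arabic blocks the library's `is_arabic` covers is not stated, so the list here is an assumption. The `_sketch` names are hypothetical.

```python
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)

# Assumed coverage; the library may include more or fewer blocks.
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def _in_ranges(ch, ranges):
    """Return True if the single character ch falls in any (low, high) range."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_pua_sketch(ch):
    return _in_ranges(ch, PUA_RANGES)


def is_arabic_sketch(ch):
    return _in_ranges(ch, ARABIC_RANGES)


print(is_pua_sketch('\uE000'), is_arabic_sketch('ب'))
```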

Codec Statistics

codec = SarfCodec(morf_map)

# Get number of morphemes
print(codec.num_morphemes)  # 114

# Get detailed statistics
stats = codec.stats()
print(stats)
# {'total_morphemes': 114, 'basic_pua_codes': 114, 'supplementary_pua_codes': 0}

Performance

Tested on 300,000 samples with 100% accuracy:

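| Test                          | Samples | Success Rate | Speed        |
|-------------------------------|---------|--------------|--------------|
| Random Arabic/English         | 100,000 | 100%         | ~30,000/sec  |
| Diacritized Arabic (tashkeel) | 100,000 | 100%         | ~5,000/sec   |
| Plain Arabic                  | 100,000 | 100%         | ~6,000/sec   |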
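
The measurement method behind these figures is not described. The sketch below shows one way a success rate and a throughput figure could be computed, assuming a roundtrip counts as successful when the decoded text equals the normalized input. `roundtrip_rate` and `measure_throughput` are hypothetical helpers, and the stand-in functions take the place of a real codec.

```python
import time


def roundtrip_rate(encode, decode, normalize, samples):
    """Fraction of samples where decode(encode(s)) equals normalize(s)."""
    ok = sum(1 for s in samples if decode(encode(s)) == normalize(s))
    return ok / len(samples)


def measure_throughput(fn, samples, repeats=3):
    """Best observed calls per second of fn over the samples."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for s in samples:
            fn(s)
        best = min(best, time.perf_counter() - start)
    return len(samples) / best if best > 0 else float("inf")


# Stand-ins so the sketch runs without the library installed.
samples = ["Hello مرحبا", "الكتاب", "email@test.com"] * 1000
reverse = lambda s: s[::-1]
identity = lambda s: s

rate = roundtrip_rate(reverse, reverse, identity, samples)
speed = measure_throughput(lambda s: reverse(reverse(s)), samples)
print(f"success rate: {rate:.0%}, throughput: {speed:,.0f}/sec")
```

With a real codec, `encode`, `decode`, and `normalize` would be replaced by the corresponding library calls; the numbers depend heavily on hardware and sample length.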

Edge Cases Handled

| Case                 | Example        | Handling              |
|----------------------|----------------|-----------------------|
| Diacritics           | بِسْمِ           | Properly normalized   |
| Arabic-Indic digits  | ٠١٢٣٤٥         | Preserved             |
| Alef variants        | أ إ آ ا        | Normalized to ا       |
| Taa marbuta          | ة              | Optionally normalized |
| Tatweel (kashida)    | كـتـاب         | Removed               |
| French guillemets    | « »            | Preserved             |
| Mixed Arabic/English | Hello مرحبا    | Both handled          |
| URLs and emails      | email@test.com | Preserved             |

Building from Source

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/almaghrabima/suhail-pkg
cd suhail-pkg
pip install maturin
maturin develop --release

# Run tests
python test_comprehensive.py
python test_large_scale.py

Requirements

  • Python 3.9+
  • Rust 1.70+ (for building from source)

License

Proprietary. Contact for licensing options.
