High-performance Arabic-first tokenizer with morphology awareness
Suhail
High-performance Arabic tokenizer with morphology awareness. Built with Rust for speed, with Python bindings for ease of use.
Features
- Arabic-Optimized: Designed specifically for Arabic and morphologically-rich languages
- Fast: Rust core with Python bindings (up to ~30,000 operations/sec on mixed Arabic/English text, ~5,000 to 6,000/sec on Arabic-only text)
- Accurate: 100% roundtrip accuracy on 300,000 test samples
- Edge Case Handling: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- Unicode Support: Full support for Arabic diacritics, PUA characters, and mixed scripts
- IP Protection: AES-256-GCM encrypted morpheme maps (no license key required)
Installation
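pip install deeplatent-nlp
The distribution is published as deeplatent-nlp; the module imported in the examples below is suhail.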
Quick Start
Using SarfCodec (Recommended)
The SarfCodec class provides encode/decode functionality using a morpheme map:
from suhail import SarfCodec
# Create codec from morpheme map dictionary
morf_map = {
    'ال': '\uE000',    # definite article -> PUA
    'كتاب': '\uE001',  # kitab -> PUA
    'و': '\uE002',     # wa (and) -> PUA
    'ب': '\uE003',     # bi (with) -> PUA
}
codec = SarfCodec(morf_map)
# Encode Arabic text (morphemes -> PUA characters)
text = "الكتاب"
encoded = codec.encode(text)
print(f"Encoded: {repr(encoded)}") # '\ue000\ue001'
# Decode back to Arabic (PUA -> morphemes)
decoded = codec.decode(encoded)
print(f"Decoded: {decoded}") # 'الكتاب'
# Verify roundtrip
normalized, decoded, is_ok = codec.roundtrip(text)
print(f"Roundtrip OK: {is_ok}") # True
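To make the morpheme-to-PUA mapping concrete, here is a minimal pure-Python sketch of the idea. It is not Suhail's implementation (the Rust core is not shown in this README): it assumes a greedy longest-match scan, and the names `encode_sketch` and `decode_sketch` are hypothetical.

```python
def encode_sketch(text, morf_map):
    """Replace known morphemes with their PUA code, longest match first."""
    # Try longer morphemes first so a short morpheme never shadows a longer one.
    keys = sorted(morf_map, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(morf_map[key])
                i += len(key)
                break
        else:
            # No morpheme starts here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def decode_sketch(encoded, morf_map):
    """Invert the mapping; assumes every PUA code is a single character."""
    reverse = {code: morpheme for morpheme, code in morf_map.items()}
    return "".join(reverse.get(ch, ch) for ch in encoded)


morf_map = {"ال": "\uE000", "كتاب": "\uE001", "و": "\uE002", "ب": "\uE003"}
encoded = encode_sketch("الكتاب", morf_map)
decoded = decode_sketch(encoded, morf_map)
print(repr(encoded))
print(decoded)
```

Characters that match no morpheme pass through untouched, which is why mixed-script input survives a roundtrip in this sketch.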
Loading from Encrypted File
Morpheme maps are distributed as encrypted .enc files for IP protection:
from suhail import SarfCodec
# Load from encrypted file (no license key needed)
codec = SarfCodec.from_encrypted("morf_map.enc")
# Use as normal
encoded = codec.encode("بسم الله الرحمن الرحيم")
decoded = codec.decode(encoded)
Creating Encrypted Morf Map Files
To create an encrypted .enc file from your own morpheme map (a JSON file or a Python dict):
from suhail import SarfCodec, encrypt_morf_map
# Option 1: Encrypt JSON file directly
encrypt_morf_map("morf_map.json", "morf_map.enc")
# Option 2: Encrypt from dict
morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}
codec = SarfCodec(morf_map)
codec.encrypt_to_file("morf_map.enc")
Encryption details:
- AES-256-GCM encryption
- Key embedded in compiled Rust binary
- Files cannot be decrypted without the deeplatent-nlp library
- Checksum verification for tamper detection
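The checksum step can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the actual .enc layout: the README does not say which hash is used or where it is stored, so SHA-256 prepended to the payload is an assumption, and `add_checksum` / `verify_checksum` are hypothetical names. The AES-256-GCM layer itself is not reproduced here.

```python
import hashlib
import hmac

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def add_checksum(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest (assumed layout)."""
    return hashlib.sha256(payload).digest() + payload


def verify_checksum(blob: bytes) -> bytes:
    """Return the payload if the stored digest matches, else raise ValueError."""
    stored, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    actual = hashlib.sha256(payload).digest()
    # compare_digest runs in constant time with respect to the contents.
    if not hmac.compare_digest(stored, actual):
        raise ValueError("checksum mismatch: file is corrupted or was modified")
    return payload


blob = add_checksum(b'{"example": "payload"}')
payload = verify_checksum(blob)
```

Flipping any byte of `blob` makes `verify_checksum` raise instead of returning data, which is the behavior the tamper-detection bullet describes.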
Standalone Functions
For quick one-off operations without creating a codec:
from suhail import encode, decode, normalize
morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}
# Encode text
encoded = encode("الكتاب", morf_map)
# Decode text
decoded = decode(encoded, morf_map)
# Normalize Arabic text (without encoding)
normalized = normalize("الكِتَابُ", level="medium")
Normalization Levels
The codec supports three normalization levels:
from suhail import SarfCodec
# morf_map is a morpheme map dict, as defined in Quick Start
# Light normalization (minimal changes)
codec = SarfCodec(morf_map, normalization="light")
# Medium normalization (default - recommended)
codec = SarfCodec(morf_map, normalization="medium")
# Aggressive normalization (maximum normalization)
codec = SarfCodec(morf_map, normalization="aggressive")
| Level | Alef Variants | Taa Marbuta | Diacritics | Tatweel |
|---|---|---|---|---|
| light | Preserved | Preserved | Preserved | Removed |
| medium | Normalized | Preserved | Stripped | Removed |
| aggressive | Normalized | Normalized | Stripped | Removed |
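The table can be read as a small decision procedure. The sketch below mirrors it in pure Python; it is not the library's `normalize`. The diacritic range U+064B to U+0652 and the mapping of taa marbuta to haa (ه) are assumptions, since the README does not state which code points are covered or what taa marbuta is normalized to. `normalize_sketch` is a hypothetical name.

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun (assumed range)
TATWEEL = "\u0640"
# Alef with hamza above, hamza below, and madda all map to bare alef.
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
TAA_MARBUTA = "\u0629"
HAA = "\u0647"  # assumed target for taa marbuta


def normalize_sketch(text, level="medium"):
    if level not in ("light", "medium", "aggressive"):
        raise ValueError(f"unknown level: {level!r}")
    # Tatweel is removed at every level.
    text = text.replace(TATWEEL, "")
    if level == "light":
        return text
    # medium and aggressive: strip diacritics and unify alef variants.
    text = DIACRITICS.sub("", text)
    text = text.translate(ALEF_VARIANTS)
    if level == "aggressive":
        text = text.replace(TAA_MARBUTA, HAA)
    return text


for level in ("light", "medium", "aggressive"):
    print(level, normalize_sketch("الكِتَابُ", level))
```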
Handling Diacritics (Tashkeel)
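With the default medium normalization, diacritics are stripped before encoding, so decoding returns the normalized (undiacritized) text:
from suhail import SarfCodec
codec = SarfCodec(morf_map)  # morf_map as defined in Quick Start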
# Text with full tashkeel
text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
encoded = codec.encode(text)
decoded = codec.decode(encoded)
# Under medium normalization the diacritics are stripped before encoding
print(decoded) # Normalized (undiacritized) form
Utility Functions
from suhail import is_arabic, is_pua, normalize, version
# Check if character is Arabic
is_arabic('ب') # True
is_arabic('a') # False
# Check if character is in Private Use Area
is_pua('\uE000') # True
is_pua('ب') # False
# Normalize Arabic text
normalize("الكِتَابُ") # 'الكتاب'
# Get version
version() # '0.1.0'
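For readers who want to see what the character checks amount to, here is a rough pure-Python equivalent based on Unicode ranges. Which blocks the library actually counts as Arabic is not documented here, so restricting the check to the main Arabic block (U+0600 to U+06FF) is an assumption; the Private Use Area ranges are the ones defined by Unicode. The `_sketch` names are hypothetical.

```python
# Private Use Area ranges defined by the Unicode standard.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Basic Multilingual Plane PUA
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_arabic_sketch(ch: str) -> bool:
    """True if ch is one character in the main Arabic block (assumed scope)."""
    return len(ch) == 1 and 0x0600 <= ord(ch) <= 0x06FF


def is_pua_sketch(ch: str) -> bool:
    """True if ch is one character inside any Private Use Area range."""
    return len(ch) == 1 and any(lo <= ord(ch) <= hi for lo, hi in PUA_RANGES)


print(is_arabic_sketch("ب"), is_arabic_sketch("a"))
print(is_pua_sketch("\uE000"), is_pua_sketch("ب"))
```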
Codec Statistics
codec = SarfCodec(morf_map)  # here, a morpheme map with 114 entries
# Get number of morphemes
print(codec.num_morphemes) # 114
# Get detailed statistics
stats = codec.stats()
print(stats)
# {'total_morphemes': 114, 'basic_pua_codes': 114, 'supplementary_pua_codes': 0}
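The field names suggest that codes are counted by the Private Use Area range they fall in. The following sketch shows one way such counts could be derived from a morpheme map; this reading of `basic_pua_codes` and `supplementary_pua_codes` is inferred from the names only, and `stats_sketch` is a hypothetical helper, not part of the library.

```python
def stats_sketch(morf_map):
    """Count codes in the BMP PUA versus the supplementary PUA planes."""
    basic = 0
    supplementary = 0
    for code in morf_map.values():
        cp = ord(code)  # assumes each code is exactly one character
        if 0xE000 <= cp <= 0xF8FF:
            basic += 1
        elif 0xF0000 <= cp <= 0xFFFFD or 0x100000 <= cp <= 0x10FFFD:
            supplementary += 1
    return {
        "total_morphemes": len(morf_map),
        "basic_pua_codes": basic,
        "supplementary_pua_codes": supplementary,
    }


example_map = {"ال": "\uE000", "كتاب": "\uE001", "و": "\U000F0000"}
example_stats = stats_sketch(example_map)
print(example_stats)
```

The BMP Private Use Area holds 6,400 code points (U+E000 to U+F8FF), so a map would only need supplementary codes once it grows past that size.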
Performance
Roundtrip tests on 300,000 samples (three sets of 100,000) all succeeded; throughput depends on the kind of text:
| Test | Samples | Success Rate | Speed |
|---|---|---|---|
| Random Arabic/English | 100,000 | 100% | ~30,000/sec |
| Diacritized Arabic (tashkeel) | 100,000 | 100% | ~5,000/sec |
| Plain Arabic | 100,000 | 100% | ~6,000/sec |
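The benchmark procedure behind these figures is not described, so the harness below is only a sketch of how comparable numbers could be collected on your own data. `measure_throughput` and `success_rate` are hypothetical helpers, and the stand-in functions in the demo replace the real codec so that the block runs without the library installed.

```python
import time


def measure_throughput(func, samples):
    """Return calls per second for func applied to every sample once."""
    start = time.perf_counter()
    for sample in samples:
        func(sample)
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed if elapsed > 0 else float("inf")


def success_rate(is_ok, samples):
    """Return the fraction of samples for which is_ok(sample) is truthy."""
    return sum(1 for sample in samples if is_ok(sample)) / len(samples)


# Stand-ins so the sketch is self-contained.
samples = ["sample text"] * 10_000
rate = measure_throughput(str.upper, samples)
ratio = success_rate(lambda s: s.upper().lower() == s, samples)
print(f"{rate:,.0f} calls/sec, success rate {ratio:.0%}")
```

With the library installed, passing `lambda s: codec.roundtrip(s)[2]` as `is_ok` would use the `is_ok` flag returned by `roundtrip` in the Quick Start example.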
Edge Cases Handled
| Case | Example | Handling |
|---|---|---|
| Diacritics | بِسْمِ | Preserved at light; stripped at medium and aggressive |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا at medium and aggressive |
| Taa marbuta | ة | Normalized only at aggressive |
| Tatweel (kashida) | كـتـاب | Removed at all levels |
| French guillemets | « » | Preserved |
| Mixed Arabic/English | Hello مرحبا | Both handled |
| URLs and emails | email@test.com | Preserved |
Building from Source
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone and build
git clone https://github.com/almaghrabima/suhail-pkg
cd suhail-pkg
pip install maturin
maturin develop --release
# Run tests
python test_comprehensive.py
python test_large_scale.py
Requirements
- Python 3.9+
- Rust 1.70+ (for building from source)
License
Proprietary. Contact the maintainer for licensing options.