High-performance Arabic-first tokenizer with morphology awareness
Suhail
High-performance Arabic tokenizer with morphology awareness. Built with Rust for speed, with Python bindings for ease of use.
Features
- Arabic-Optimized: Designed specifically for Arabic and morphologically-rich languages
- Fast: Rust core with Python bindings (about 5,000 to 30,000 operations/sec depending on the input; see Performance)
- Accurate: 100% roundtrip accuracy on 300,000 test samples
- Edge Case Handling: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- Unicode Support: Full support for Arabic diacritics, PUA characters, and mixed scripts
- IP Protection: AES-256-GCM encrypted morpheme maps (no license key required)
Installation
pip install deeplatent-nlp
The PyPI distribution is named deeplatent-nlp; the Python import name is suhail.
Quick Start
Using SarfCodec (Recommended)
The SarfCodec class provides encode/decode functionality using a morpheme map:
from suhail import SarfCodec
# Create codec from morpheme map dictionary
morf_map = {
'ال': '\uE000', # definite article -> PUA
'كتاب': '\uE001', # kitab -> PUA
'و': '\uE002', # wa (and) -> PUA
'ب': '\uE003', # bi (with) -> PUA
}
codec = SarfCodec(morf_map)
# Encode Arabic text (morphemes -> PUA characters)
text = "الكتاب"
encoded = codec.encode(text)
print(f"Encoded: {repr(encoded)}") # '\ue000\ue001'
# Decode back to Arabic (PUA -> morphemes)
decoded = codec.decode(encoded)
print(f"Decoded: {decoded}") # 'الكتاب'
# Verify roundtrip
normalized, decoded, is_ok = codec.roundtrip(text)
print(f"Roundtrip OK: {is_ok}") # True
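To make the mapping concrete, here is a minimal pure-Python sketch of morpheme-to-PUA substitution using greedy longest-match. It is an illustration only: the segmentation strategy of Suhail's Rust core is not documented here, and `sketch_encode` and `sketch_decode` are hypothetical names, not part of the library.

```python
# Illustrative sketch, NOT the library's implementation.
# sketch_encode / sketch_decode are hypothetical helper names.

def sketch_encode(text: str, morf_map: dict) -> str:
    """Replace known morphemes with their PUA code, trying the longest match first."""
    max_len = max(map(len, morf_map), default=0)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            code = morf_map.get(text[i:i + size])
            if code is not None:
                out.append(code)
                i += size
                break
        else:
            # No morpheme starts at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def sketch_decode(encoded: str, morf_map: dict) -> str:
    """Invert the map; assumes every PUA code is a single code point."""
    inverse = {code: morpheme for morpheme, code in morf_map.items()}
    return "".join(inverse.get(ch, ch) for ch in encoded)


morf_map = {'ال': '\uE000', 'كتاب': '\uE001', 'و': '\uE002', 'ب': '\uE003'}
encoded = sketch_encode("الكتاب", morf_map)
print(repr(encoded))                     # '\ue000\ue001'
print(sketch_decode(encoded, morf_map))  # الكتاب
```

Longest-match matters here because the single-letter entry for ب would otherwise split كتاب before the full morpheme is tried.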
Loading from JSON File
from suhail import SarfCodec
# Load morpheme map from JSON file
codec = SarfCodec.from_file("morf_map.json")
# Use as normal
encoded = codec.encode("بسم الله الرحمن الرحيم")
decoded = codec.decode(encoded)
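The layout of morf_map.json is not spelled out above. The sketch below assumes it is a flat JSON object that mirrors the Python dictionary (morpheme as key, single PUA character as value); treat that as an assumption and check it against a file the library accepts. `save_morf_map` and `load_morf_map` are hypothetical helpers that use only the standard library.

```python
import json
import tempfile
from pathlib import Path

morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}


def save_morf_map(mapping: dict, path) -> None:
    # ensure_ascii=False keeps the Arabic keys readable in the file;
    # the PUA values are written as raw UTF-8 characters.
    Path(path).write_text(
        json.dumps(mapping, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_morf_map(path) -> dict:
    return json.loads(Path(path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "morf_map.json"
    save_morf_map(morf_map, path)
    reloaded = load_morf_map(path)

print(reloaded == morf_map)  # True
```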
Encrypted Morf Map (IP Protection)
Protect your morpheme mappings with AES-256-GCM encryption. Encrypted files are meant to be read only through this library; no license key is required.
from suhail import SarfCodec, encrypt_morf_map
# Option 1: Encrypt existing codec's morf_map
codec = SarfCodec.from_file("morf_map.json")
codec.encrypt_to_file("morf_map.enc")
# Option 2: Encrypt JSON file directly
encrypt_morf_map("morf_map.json", "morf_map.enc")
# Load from encrypted file (no key needed)
codec = SarfCodec.from_encrypted("morf_map.enc")
# Works exactly like normal codec
encoded = codec.encode("الكتاب")
decoded = codec.decode(encoded)
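The encrypted file format:
- Uses AES-256-GCM encryption
- Relies on a decryption key embedded in the compiled Rust library
- Is not meant to be decrypted without the deeplatent-nlp library (because the key ships inside the binary, this deters casual inspection rather than a determined attacker)
- Includes a checksum for integrity verification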
Standalone Functions
For quick one-off operations without creating a codec:
from suhail import encode, decode, normalize
morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}
# Encode text
encoded = encode("الكتاب", morf_map)
# Decode text
decoded = decode(encoded, morf_map)
# Normalize Arabic text (without encoding)
normalized = normalize("الكِتَابُ", level="medium")
Normalization Levels
The codec supports three normalization levels:
from suhail import SarfCodec
morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}
# Light normalization (minimal changes)
codec = SarfCodec(morf_map, normalization="light")
# Medium normalization (default - recommended)
codec = SarfCodec(morf_map, normalization="medium")
# Aggressive normalization (maximum normalization)
codec = SarfCodec(morf_map, normalization="aggressive")
| Level | Alef Variants | Taa Marbuta | Diacritics | Tatweel |
|---|---|---|---|---|
| light | Preserved | Preserved | Preserved | Removed |
| medium | Normalized | Preserved | Stripped | Removed |
| aggressive | Normalized | Normalized | Stripped | Removed |
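The table can be read as a small set of character rules. The following is a pure-Python sketch of those rules, not the library's code: `normalize_sketch` is a hypothetical name, the diacritic range (U+064B to U+0652 plus U+0670) is an assumption about what counts as tashkeel, and mapping taa marbuta to haa (ة to ه) is assumed because the table does not say what it is normalized to.

```python
import re

TATWEEL = "\u0640"
# Assumption: tashkeel = fathatan..sukun (U+064B-U+0652) plus dagger alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Alef with madda / hamza above / hamza below -> bare alef.
ALEF_VARIANTS = str.maketrans({"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"})
# Assumption: taa marbuta is folded to haa; the source table does not name the target.
TAA_MARBUTA = str.maketrans({"\u0629": "\u0647"})


def normalize_sketch(text: str, level: str = "medium") -> str:
    if level not in ("light", "medium", "aggressive"):
        raise ValueError(f"unknown level: {level!r}")
    text = text.replace(TATWEEL, "")          # removed at every level
    if level == "light":
        return text
    text = text.translate(ALEF_VARIANTS)      # medium and aggressive
    text = DIACRITICS.sub("", text)           # medium and aggressive
    if level == "aggressive":
        text = text.translate(TAA_MARBUTA)    # aggressive only
    return text


print(normalize_sketch("الكِتَابُ", "medium"))
print(normalize_sketch("كـتـاب", "light"))
```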
Handling Diacritics (Tashkeel)
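The codec accepts fully diacritized text; whether the diacritics survive depends on the normalization level:
from suhail import SarfCodec
morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}
codec = SarfCodec(morf_map)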
# Text with full tashkeel
text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
encoded = codec.encode(text)
decoded = codec.decode(encoded)
# At the default medium level tashkeel is stripped, so decoded is the normalized text, not the original string
print(decoded)
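Tashkeel marks are separate combining code points that follow their base letter, which is why a diacritized string is longer than it looks and why a roundtrip should be compared against the normalized text rather than the raw input. The sketch below shows this with the standard library only; `split_marks` is a hypothetical helper and is unrelated to how the library detects diacritics.

```python
import unicodedata


def split_marks(text: str):
    """Separate base characters from combining marks by Unicode combining class."""
    base = "".join(ch for ch in text if unicodedata.combining(ch) == 0)
    marks = "".join(ch for ch in text if unicodedata.combining(ch) != 0)
    return base, marks


# "bismi": baa + kasra, seen + sukun, meem + kasra (written with escapes for clarity)
word = "\u0628\u0650\u0633\u0652\u0645\u0650"
base, marks = split_marks(word)
print(len(word), len(base), len(marks))  # 6 3 3
```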
Utility Functions
from suhail import is_arabic, is_pua, normalize, version
# Check if character is Arabic
is_arabic('ب') # True
is_arabic('a') # False
# Check if character is in Private Use Area
is_pua('\uE000') # True
is_pua('ب') # False
# Normalize Arabic text
normalize("الكِتَابُ") # 'الكتاب'
# Get version
version() # installed version string, e.g. '0.1.0'
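Checks like these usually come down to code point ranges. The sketch below uses the Unicode block boundaries for Arabic and the three Private Use Areas; the exact ranges `is_arabic` and `is_pua` accept are not documented above, so the `_sketch` functions are hypothetical stand-ins that may differ from the library at the edges.

```python
# Unicode blocks treated as "Arabic" in this sketch (an assumption about scope).
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)

PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_arabic_sketch(ch: str) -> bool:
    return _in_ranges(ch, ARABIC_RANGES)


def is_pua_sketch(ch: str) -> bool:
    return _in_ranges(ch, PUA_RANGES)


print(is_arabic_sketch('ب'), is_arabic_sketch('a'))    # True False
print(is_pua_sketch('\uE000'), is_pua_sketch('ب'))     # True False
```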
Codec Statistics
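from suhail import SarfCodec
# morf_map is a morpheme map as in Quick Start; the example output assumes a map with 114 entries
codec = SarfCodec(morf_map)
# Get number of morphemes
print(codec.num_morphemes) # e.g. 114
# Get detailed statistics
stats = codec.stats()
print(stats)
# e.g. {'total_morphemes': 114, 'basic_pua_codes': 114, 'supplementary_pua_codes': 0}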
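One plausible reading of these fields is that basic_pua_codes counts map values in the BMP Private Use Area (U+E000 to U+F8FF, 6,400 code points) and supplementary_pua_codes counts values in planes 15 and 16. That interpretation is an assumption, and `stats_sketch` below is a hypothetical reimplementation in plain Python rather than the library's `stats()`.

```python
def stats_sketch(morf_map: dict) -> dict:
    """Count map values by PUA range; assumes each value is a single code point."""
    codes = [ord(code) for code in morf_map.values()]
    basic = sum(0xE000 <= cp <= 0xF8FF for cp in codes)
    supplementary = sum(
        0xF0000 <= cp <= 0xFFFFD or 0x100000 <= cp <= 0x10FFFD for cp in codes
    )
    return {
        "total_morphemes": len(morf_map),
        "basic_pua_codes": basic,
        "supplementary_pua_codes": supplementary,
    }


example = {'ال': '\uE000', 'كتاب': '\uE001', 'و': '\U000F0000'}
print(stats_sketch(example))
# {'total_morphemes': 3, 'basic_pua_codes': 2, 'supplementary_pua_codes': 1}
```

Under this reading, a map with more than 6,400 entries would have to spill into the supplementary ranges.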
Performance
Roundtrip tests on 300,000 samples (three sets of 100,000), with a 100% success rate:
| Test | Samples | Success Rate | Speed |
|---|---|---|---|
| Random Arabic/English | 100,000 | 100% | ~30,000/sec |
| Diacritized Arabic (tashkeel) | 100,000 | 100% | ~5,000/sec |
| Plain Arabic | 100,000 | 100% | ~6,000/sec |
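The harness behind these numbers is not shown. As a rough sketch of how success rate and throughput can be measured together, the hypothetical `measure_roundtrip` below times encode plus decode over a list of samples and compares each result with a normalized baseline. It uses a stand-in codec (string reversal) so that it runs without the library.

```python
import time


def measure_roundtrip(samples, encode, decode, normalize=lambda s: s):
    """Time encode+decode over samples and count exact roundtrips."""
    if not samples:
        raise ValueError("samples must be non-empty")
    passed = 0
    start = time.perf_counter()
    for text in samples:
        # A roundtrip passes when decoding returns the normalized input.
        if decode(encode(text)) == normalize(text):
            passed += 1
    elapsed = time.perf_counter() - start
    return {
        "samples": len(samples),
        "success_rate": passed / len(samples),
        "per_second": len(samples) / elapsed if elapsed > 0 else float("inf"),
    }


# Stand-in codec: reversing a string twice restores it.
report = measure_roundtrip(
    ["abc", "مرحبا", "Hello مرحبا"],
    encode=lambda s: s[::-1],
    decode=lambda s: s[::-1],
)
print(report["samples"], report["success_rate"])  # 3 1.0
```

With the library installed, `codec.encode`, `codec.decode`, and `normalize` could be passed in place of the stand-ins; throughput will vary with hardware and input mix.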
Edge Cases Handled
| Case | Example | Handling |
|---|---|---|
| Diacritics | بِسْمِ | Stripped at medium and aggressive levels, preserved at light |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا at medium and aggressive levels |
| Taa marbuta | ة | Normalized only at the aggressive level |
| Tatweel (kashida) | كـتـاب | Removed at every level |
| French guillemets | « » | Preserved |
| Mixed Arabic/English | Hello مرحبا | Both handled |
| URLs and emails | email@test.com | Preserved |
Building from Source
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone and build
git clone https://github.com/almaghrabima/suhail-pkg
cd suhail-pkg
pip install maturin
maturin develop --release
# Run tests
python test_comprehensive.py
python test_large_scale.py
Requirements
- Python 3.9+
- Rust 1.70+ (for building from source)
License
Proprietary. Contact the maintainers for licensing options.