
Suhail

High-performance Arabic tokenizer with morphology awareness. Built with Rust for speed, with Python bindings for ease of use.

Features

  • Arabic-Optimized: Designed specifically for Arabic and other morphologically rich languages
  • Fast: Rust core with Python bindings (up to ~30,000 operations/sec)
  • Accurate: 100% roundtrip accuracy on 300,000+ test samples
  • Edge Case Handling: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
  • Unicode Support: Full support for Arabic diacritics, PUA characters, and mixed scripts

Installation

pip install deeplatent-nlp

Quick Start

Using SarfCodec (Recommended)

The SarfCodec class provides encode/decode functionality using a morpheme map:

from suhail import SarfCodec

# Create codec from morpheme map dictionary
morf_map = {
    'ال': '\uE000',      # definite article -> PUA
    'كتاب': '\uE001',    # kitab -> PUA
    'و': '\uE002',       # wa (and) -> PUA
    'ب': '\uE003',       # bi (with) -> PUA
}
codec = SarfCodec(morf_map)

# Encode Arabic text (morphemes -> PUA characters)
text = "الكتاب"
encoded = codec.encode(text)
print(f"Encoded: {repr(encoded)}")  # '\ue000\ue001'

# Decode back to Arabic (PUA -> morphemes)
decoded = codec.decode(encoded)
print(f"Decoded: {decoded}")  # 'الكتاب'

# Verify roundtrip
normalized, decoded, is_ok = codec.roundtrip(text)
print(f"Roundtrip OK: {is_ok}")  # True
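Conceptually, this style of codec substitutes each known morpheme with a single Private Use Area code point. The following is a minimal pure-Python sketch of that idea for illustration only; it is not Suhail's actual algorithm, which lives in the Rust core:

```python
# Illustrative sketch: greedy longest-match morpheme -> PUA substitution.
morf_map = {
    'ال': '\uE000',    # definite article
    'كتاب': '\uE001',  # kitab (book)
}
reverse_map = {pua: m for m, pua in morf_map.items()}

def sketch_encode(text: str) -> str:
    out, i = [], 0
    # Try longer morphemes first so 'كتاب' wins over any shorter prefix.
    morphemes = sorted(morf_map, key=len, reverse=True)
    while i < len(text):
        for m in morphemes:
            if text.startswith(m, i):
                out.append(morf_map[m])
                i += len(m)
                break
        else:
            out.append(text[i])  # pass unknown characters through
            i += 1
    return ''.join(out)

def sketch_decode(encoded: str) -> str:
    # Each PUA code point maps back to exactly one morpheme.
    return ''.join(reverse_map.get(ch, ch) for ch in encoded)

print(repr(sketch_encode('الكتاب')))  # '\ue000\ue001'
```

Because every morpheme collapses to one code point, decoding is a simple per-character reverse lookup, which is what makes the roundtrip lossless.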

Loading from JSON File

from suhail import SarfCodec

# Load morpheme map from JSON file
codec = SarfCodec.from_file("morf_map.json")

# Use as normal
encoded = codec.encode("بسم الله الرحمن الرحيم")
decoded = codec.decode(encoded)
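The JSON file is presumably a flat morpheme-to-PUA object mirroring the dictionary used above; the exact schema is an assumption here, so check the map your pipeline generates. A sketch of writing and reloading such a file:

```python
import json
import os
import tempfile

# Hypothetical morf_map.json contents: a flat {morpheme: PUA} object.
morf_map = {
    'ال': '\uE000',
    'كتاب': '\uE001',
}

path = os.path.join(tempfile.mkdtemp(), 'morf_map.json')
with open(path, 'w', encoding='utf-8') as f:
    # ensure_ascii=False keeps the Arabic keys readable in the file.
    json.dump(morf_map, f, ensure_ascii=False, indent=2)

with open(path, encoding='utf-8') as f:
    loaded = json.load(f)
print(loaded == morf_map)  # True
```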

Standalone Functions

For quick one-off operations without creating a codec:

from suhail import encode, decode, normalize

morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}

# Encode text
encoded = encode("الكتاب", morf_map)

# Decode text
decoded = decode(encoded, morf_map)

# Normalize Arabic text (without encoding)
normalized = normalize("الكِتَابُ", level="medium")

Normalization Levels

The codec supports three normalization levels:

from suhail import SarfCodec

# Light normalization (minimal changes)
codec = SarfCodec(morf_map, normalization="light")

# Medium normalization (default - recommended)
codec = SarfCodec(morf_map, normalization="medium")

# Aggressive normalization (maximum normalization)
codec = SarfCodec(morf_map, normalization="aggressive")
Level       Alef Variants  Taa Marbuta  Diacritics  Tatweel
light       Preserved      Preserved    Preserved   Removed
medium      Normalized     Preserved    Stripped    Removed
aggressive  Normalized     Normalized   Stripped    Removed
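As a rough guide to what the table above means, the sketch below approximates "medium" behavior in plain Python (alef unification, tashkeel stripping, tatweel removal). This is an assumption about the normalization rules for illustration, not the library's exact implementation:

```python
import re

TATWEEL = '\u0640'
DIACRITICS = re.compile(r'[\u064B-\u0652]')       # fathatan .. sukun
ALEF_VARIANTS = re.compile(r'[\u0622\u0623\u0625]')  # آ أ إ -> ا

def medium_normalize(text: str) -> str:
    text = text.replace(TATWEEL, '')           # remove kashida
    text = DIACRITICS.sub('', text)            # strip tashkeel
    text = ALEF_VARIANTS.sub('\u0627', text)   # unify alef variants
    return text

print(medium_normalize('الكِتَابُ'))  # 'الكتاب'
```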

Handling Diacritics (Tashkeel)

The codec properly handles Arabic diacritics:

from suhail import SarfCodec

codec = SarfCodec(morf_map)

# Text with full tashkeel
text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
encoded = codec.encode(text)
decoded = codec.decode(encoded)

# Diacritics are handled correctly
print(decoded)  # Normalized form

Utility Functions

from suhail import is_arabic, is_pua, normalize, version

# Check if character is Arabic
is_arabic('ب')  # True
is_arabic('a')  # False

# Check if character is in Private Use Area
is_pua('\uE000')  # True
is_pua('ب')       # False

# Normalize Arabic text
normalize("الكِتَابُ")  # 'الكتاب'

# Get version
version()  # '0.1.0'
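Checks like these are typically simple Unicode range tests. A sketch of what they likely do (the ranges covered by the real functions may be wider, e.g. Arabic Supplement or presentation forms):

```python
def sketch_is_arabic(ch: str) -> bool:
    # Basic Arabic block: U+0600..U+06FF.
    return '\u0600' <= ch <= '\u06FF'

def sketch_is_pua(ch: str) -> bool:
    # BMP Private Use Area: U+E000..U+F8FF.
    return '\uE000' <= ch <= '\uF8FF'

print(sketch_is_arabic('ب'), sketch_is_arabic('a'))  # True False
print(sketch_is_pua('\uE000'), sketch_is_pua('ب'))   # True False
```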

Codec Statistics

codec = SarfCodec(morf_map)

# Get the number of morphemes in the map
print(codec.num_morphemes)  # e.g. 114 for a 114-entry morpheme map

# Get detailed statistics
stats = codec.stats()
print(stats)
# e.g. {'total_morphemes': 114, 'basic_pua_codes': 114, 'supplementary_pua_codes': 0}

Performance

Tested on 300,000 samples with 100% accuracy:

Test                           Samples   Success Rate  Speed
Random Arabic/English          100,000   100%          ~30,000/sec
Diacritized Arabic (tashkeel)  100,000   100%          ~5,000/sec
Plain Arabic                   100,000   100%          ~6,000/sec

Edge Cases Handled

Case                  Example         Handling
Diacritics            بِسْمِ              Properly normalized
Arabic-Indic digits   ٠١٢٣٤٥          Preserved
Alef variants         أ إ آ ا         Normalized to ا
Taa marbuta           ة               Optionally normalized
Tatweel (kashida)     كـتـاب          Removed
French guillemets     « »             Preserved
Mixed Arabic/English  Hello مرحبا     Both handled
URLs and emails       email@test.com  Preserved
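Two of these cases are easy to verify directly in Python (a sketch, not the library's code): tatweel (U+0640) is a purely visual elongation character and can be dropped without losing content, while Arabic-Indic digits are distinct code points (U+0660..U+0669) that should pass through untouched rather than being converted to ASCII digits.

```python
# Tatweel only stretches letters visually; removing it keeps the word intact.
stretched = 'كـتـاب'
print(stretched.replace('\u0640', ''))  # 'كتاب'

# Arabic-Indic digits are their own code points and should be preserved.
print([hex(ord(d)) for d in '٠١٢'])
```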

Building from Source

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/almaghrabima/suhail-pkg
cd suhail-pkg
pip install maturin
maturin develop --release

# Run tests
python test_comprehensive.py
python test_large_scale.py

Requirements

  • Python 3.9+
  • Rust 1.70+ (for building from source)

License

Proprietary. Contact for licensing options.
