SahlNLP

A zero-dependency, ultra-fast Arabic NLP toolkit for text preprocessing, normalization, and analysis.

SahlNLP (سهل = easy in Arabic) is a lightweight Python library designed for Arabic text preprocessing, normalization, and advanced analysis. It targets AI engineers and web developers who need a fast, no-overhead solution for handling Arabic text.

Features

Zero external dependencies: uses only Python's standard library
High performance: pre-compiled regex patterns and a minimal memory footprint
Full type hints: IDE support and autocompletion
Comprehensive: cleaning, normalization, numeral conversion, number-to-words (tafkeet), dialect detection, keyword extraction, fuzzy matching, and PII masking
Well-tested: 156 tests with a 100% pass rate
Advanced algorithms from scratch: TF-IDF, Levenshtein distance, and weighted dialect classification built with zero external libraries

Installation

pip install sahlnlp

Quick Start

import sahlnlp

# Clean noisy Arabic text
sahlnlp.clean_all("مَرْحَباً بـكـــــم في <b>موقعنا</b> https://example.com")
# => "مرحبا بكم في موقعنا"

# Normalize for search indexing
sahlnlp.normalize_search("أحمد مُعَلِّمٌ في المدرسة")
# => "احمد معلم في المدرسه"

# Convert numbers to Arabic words
sahlnlp.tafkeet(150)
# => "مائة وخمسون"

# Convert Arabic-Indic digits to standard numerals
sahlnlp.indic_to_arabic("٣ أبريل ٢٠٢٥")
# => "3 أبريل 2025"

# Detect Arabic dialect (Gulf, Levantine, Egyptian, Maghrebi)
sahlnlp.detect_dialect("شلونك يا خوي")
# => {"Gulf": 1.0, "Levantine": 0.0, "Egyptian": 0.0, "Maghrebi": 0.0}

# Extract keywords using pure-Python TF-IDF
sahlnlp.extract_keywords("الذكاء الاصطناعي فرع مهم. الذكاء الاصطناعي متطور.", top_n=3)
# => [("الاصطناعي", 0.12), ("متطور", 0.10), ("فرع", 0.10)]

# Fuzzy match with Arabic keyboard-aware Levenshtein distance
sahlnlp.suggest_correction("مدرية", ["مدرسة", "مدينة", "مربية"])
# => "مدرسة"

Architecture & Data Flow

Package Structure

classDiagram
    direction LR

    class __init__ {
        +clean_all()
        +remove_tashkeel()
        +remove_tatweel()
        +remove_html_and_links()
        +remove_repeated_chars()
        +normalize_hamza()
        +normalize_taa()
        +normalize_yaa()
        +normalize_search()
        +indic_to_arabic()
        +arabic_to_indic()
        +tafkeet()
        +detect_dialect()
        +extract_keywords()
        +suggest_correction()
        +compute_tf()
        +compute_idf()
        +mask_sensitive_info()
    }

    class cleaner {
        remove_tashkeel(text)
        remove_tatweel(text)
        remove_html_and_links(text)
        remove_repeated_chars(text, max_repeat)
        clean_all(text, ...flags)
    }

    class normalizer {
        normalize_hamza(text)
        normalize_taa(text, to_haa)
        normalize_yaa(text)
        normalize_search(text)
    }

    class converter {
        indic_to_arabic(text)
        arabic_to_indic(text)
        tafkeet(number, case, currency)
    }

    class analyzer {
        detect_dialect(text)
        extract_keywords(text, top_n)
        suggest_correction(word, dictionary)
        compute_tf(tokens)
        compute_idf(documents)
    }

    class guardian {
        mask_sensitive_info(text, mode, mask_char)
    }

    class constants {
        Pre-compiled Regex
        Unicode maps
        Tafkeet dicts
        PII patterns
    }

    class dictionaries {
        Dialect lexicons
        Stop-words
        Keyboard map
    }

    __init__ ..> cleaner : imports
    __init__ ..> normalizer : imports
    __init__ ..> converter : imports
    __init__ ..> analyzer : imports
    __init__ ..> guardian : imports
    cleaner ..> constants : regex
    normalizer ..> constants : char maps
    normalizer ..> cleaner : remove_tashkeel
    converter ..> constants : dicts
    analyzer ..> dictionaries : lexicons
    guardian ..> constants : PII regex

`clean_all()` Execution Sequence

sequenceDiagram
    actor User
    participant API as __init__.py
    participant Cleaner as cleaner.py
    participant Regex as Pre-compiled Patterns

    User->>API: sahlnlp.clean_all(noisy_text)
    API->>Cleaner: clean_all(text, flags=True)

    alt remove_tashkeel_flag
        Cleaner->>Regex: RE_TASHKEEL.sub("", text)
        Regex-->>Cleaner: text without diacritics
    end

    alt remove_tatweel_flag
        Cleaner->>Regex: RE_TATWEEL.sub("", text)
        Regex-->>Cleaner: text without kashida
    end

    alt remove_html_flag
        Cleaner->>Regex: RE_URL.sub("", text)
        Regex-->>Cleaner: text without URLs
        Cleaner->>Regex: RE_HTML_TAGS.sub("", text)
        Regex-->>Cleaner: text without HTML
    end

    alt remove_repeated_flag
        Cleaner->>Regex: RE_REPEATED_CHAR.sub(fn, text)
        Regex-->>Cleaner: flooding reduced
    end

    Cleaner-->>API: cleaned text
    API-->>User: "مرحبا بكم في موقعنا"

`tafkeet()` Decision Flow

flowchart TD
    Start([Input: number, case, currency]) --> TypeCheck{int or float?}
    TypeCheck -- No --> ErrorType[/TypeError/]
    TypeCheck -- Yes --> SignCheck{number < 0?}
    SignCheck -- Yes --> ErrorVal[/ValueError/]
    SignCheck -- No --> FloatCheck{is float?}

    FloatCheck -- Yes --> SplitFloat[Split int_part + dec_part]
    SplitFloat --> RecurseInt[tafkeet int_part]

    FloatCheck -- No --> ZeroCheck{number == 0?}
    ZeroCheck -- Yes --> OutZero["صفر"]

    ZeroCheck -- No --> Range1{1-19?}
    Range1 -- Yes --> OutOnes["ARABIC_ONES[n]"]

    Range1 -- No --> Range2{20-99?}
    Range2 -- Yes --> SplitTens["tens = n//10, ones = n%10"]
    SplitTens --> TensCompound["ONES[ones] و TENS[tens]"]

    Range2 -- No --> Range3{100-999?}
    Range3 -- Yes --> Convert100["_convert_below_1000(n, case)"]

    Range3 -- No --> LargeNum[Split into groups of 3 digits]
    LargeNum --> ForEach[For each group_val + scale]
    ForEach --> GV1{val == 1?}
    GV1 -- Yes --> OutSingular["singular"]
    GV1 -- No --> GV2{val == 2?}
    GV2 -- Yes --> OutDual["dual (case-inflected)"]
    GV2 -- No --> GV3{"3-10?"}
    GV3 -- Yes --> OutPlural["number + plural form"]
    GV3 -- No --> OutAcc["number + singularاً"]

    OutOnes & OutZero & TensCompound & Convert100 & OutSingular & OutDual & OutPlural & OutAcc --> Join["Join parts with و"]
    Join --> SARCheck{currency == SAR?}
    SARCheck -- Yes --> OutSAR["+ ريالاً"]
    SARCheck -- No --> Done([Return])
    OutSAR --> Done
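
The large-number branch of the flowchart above can be sketched as follows. This is a minimal illustration, not the library's code: `split_groups` and `scale_form` are hypothetical helper names, and the form labels simply mirror the four outcomes in the diagram.

```python
def split_groups(n: int) -> list[tuple[int, int]]:
    """Split n into (group_value, scale_index) pairs, most significant first.

    scale_index 0 = units, 1 = thousands, 2 = millions, and so on.
    Groups whose value is zero are skipped.
    """
    groups = []
    scale = 0
    while n > 0:
        n, value = divmod(n, 1000)
        if value:
            groups.append((value, scale))
        scale += 1
    return list(reversed(groups))


def scale_form(value: int) -> str:
    """Pick the form of the scale word (thousand, million, ...) for a group value."""
    if value == 1:
        return "singular"
    if value == 2:
        return "dual"
    if 3 <= value <= 10:
        return "plural"
    return "accusative singular"


print(split_groups(250000))  # [(250, 1)]
print(split_groups(1011))    # [(1, 1), (11, 0)]
print(scale_form(250))       # accusative singular
```

For 250000 the only non-empty group is 250 at the thousands scale, which falls through to the accusative singular form, consistent with the "ألفاً" ending in the API reference example.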

Guardian PII Masking Pipeline

flowchart LR
    Input(["Raw Text"]) --> IBAN["IBAN\nDetector"]
    IBAN --> Phone["Phone\nDetector"]
    Phone --> ID["National ID\nDetector"]
    ID --> Email["Email\nDetector"]
    Email --> NameT["Name (Title)\nDetector"]
    NameT --> NameTh["Name (Theophoric)\nDetector"]
    NameTh --> Mode{mode?}
    Mode -- tag --> TagOut(["[PHONE] [ID]\n[EMAIL] [NAME]"])
    Mode -- mask --> MaskOut(["05*****567\n1********90\n********"])

    style Input fill:#e1f5fe
    style TagOut fill:#c8e6c9
    style MaskOut fill:#c8e6c9
    style IBAN fill:#fff9c4
    style Phone fill:#fff9c4
    style ID fill:#fff9c4
    style Email fill:#fff9c4
    style NameT fill:#fff9c4
    style NameTh fill:#fff9c4
    style Mode fill:#ffccbc

API Reference

Text Cleaning (`sahlnlp.cleaner`)

`remove_tashkeel(text)`

Remove all Arabic diacritical marks (tashkeel).

sahlnlp.remove_tashkeel("كِتَاب")
# => "كتاب"

`remove_tatweel(text)`

Remove tatweel/kashida characters (ـ).

sahlnlp.remove_tatweel("الســــلام")
# => "السلام"

`remove_html_and_links(text)`

Remove HTML tags and URLs from text.

sahlnlp.remove_html_and_links("زوروا <b>http://example.com</b>")
# => "زوروا "

`remove_repeated_chars(text, max_repeat=2)`

Reduce character flooding to a maximum number of repetitions.

sahlnlp.remove_repeated_chars("مرحباًاااا")
# => "مرحباًاا"

`clean_all(text, ...)`

Master cleaning function. Applies every cleaning operation in one call; each step can be turned off through its flag.

sahlnlp.clean_all(
    "مَرْحَباً",
    remove_tashkeel_flag=True,
    remove_tatweel_flag=True,
    remove_html_flag=True,
    remove_repeated_flag=True,
    max_repeat=2,
)
# => "مرحبا"
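
To make the flag-driven pipeline concrete, here is a rough standard-library sketch of the same idea. The regex patterns, the `clean_sketch` name, its short flag names, and the final whitespace collapse are all assumptions for illustration; the library's own patterns and behavior may differ.

```python
import re

# Hypothetical patterns, compiled once at import time.
RE_TASHKEEL = re.compile(r"[\u064B-\u0652]")  # fathatan through sukun
RE_TATWEEL = re.compile(r"\u0640")            # kashida
RE_URL = re.compile(r"https?://\S+")
RE_HTML_TAGS = re.compile(r"<[^>]+>")
RE_REPEATED_CHAR = re.compile(r"(.)\1+")      # any run of one character
RE_SPACES = re.compile(r"\s+")


def clean_sketch(text, *, tashkeel=True, tatweel=True, html=True,
                 repeated=True, max_repeat=2):
    """Apply each cleaning step whose flag is enabled, in a fixed order."""
    if tashkeel:
        text = RE_TASHKEEL.sub("", text)
    if tatweel:
        text = RE_TATWEEL.sub("", text)
    if html:
        text = RE_URL.sub("", text)        # URLs first, then leftover tags
        text = RE_HTML_TAGS.sub("", text)
    if repeated:
        # Shorten each run to at most max_repeat copies of the character.
        text = RE_REPEATED_CHAR.sub(
            lambda m: m.group(1) * min(len(m.group(0)), max_repeat), text
        )
    # Assumption: collapse whitespace left behind by the removals.
    return RE_SPACES.sub(" ", text).strip()


print(clean_sketch("aaaa <b>bold</b> https://example.com"))  # aa bold
```

Compiling the patterns at module level is what keeps repeated calls cheap: each call only runs substitutions and never rebuilds a pattern.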

Text Normalization (`sahlnlp.normalizer`)

`normalize_hamza(text)`

Convert all Alef variations (أ, إ, آ) to bare Alef (ا).

sahlnlp.normalize_hamza("أحمد إبراهيم آدم")
# => "احمد ابراهيم ادم"

`normalize_taa(text, to_haa=True)`

Convert Taa Marbuta (ة) to Haa (ه), or vice versa.

sahlnlp.normalize_taa("مدرسة")          # => "مدرسه"
sahlnlp.normalize_taa("مدرسه", to_haa=False)  # => "مدرسة"

`normalize_yaa(text)`

Convert Alef Maksura (ى) to Yaa (ي).

sahlnlp.normalize_yaa("موسى")
# => "موسي"

`normalize_search(text)`

Aggressive normalization for search engine indexing. Strips diacritics and applies the hamza, taa marbuta, and yaa normalizations so that spelling variants index to the same form.

sahlnlp.normalize_search("أحمد مُعَلِّمٌ في المدرسة")
# => "احمد معلم في المدرسه"
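
One way to picture this kind of normalization is a single character translation table applied after diacritic removal. The sketch below assumes that approach; `CHAR_MAP` and `normalize_search_sketch` are hypothetical names, and the library's own character maps may cover more variants.

```python
import re

RE_TASHKEEL = re.compile(r"[\u064B-\u0652]")  # Arabic diacritics

# Hypothetical mapping table; code points are spelled out to avoid ambiguity.
CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0629": "\u0647",  # taa marbuta           -> haa
    "\u0649": "\u064A",  # alef maksura          -> yaa
})


def normalize_search_sketch(text: str) -> str:
    """Strip diacritics, then fold letter variants in one pass."""
    return RE_TASHKEEL.sub("", text).translate(CHAR_MAP)


print(normalize_search_sketch("أحمد مُعَلِّمٌ في المدرسة"))
```

`str.translate` visits each character once, so folding all variants costs a single pass regardless of how many mappings the table holds.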

Number Conversion (`sahlnlp.converter`)

`indic_to_arabic(text)`

Convert Arabic-Indic digits (٠١٢٣...) to standard numerals (0123...).

sahlnlp.indic_to_arabic("٣ أبريل ٢٠٢٥")
# => "3 أبريل 2025"

`arabic_to_indic(text)`

Convert standard numerals (0123...) to Arabic-Indic digits (٠١٢٣...).

sahlnlp.arabic_to_indic("3 أبريل 2025")
# => "٣ أبريل ٢٠٢٥"
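
Digit conversion in both directions is a plain character-for-character mapping. A minimal sketch using `str.maketrans`, with hypothetical function names, might look like this:

```python
# Arabic-Indic digits occupy U+0660..U+0669 (٠ through ٩).
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
WESTERN = "0123456789"

TO_WESTERN = str.maketrans(ARABIC_INDIC, WESTERN)
TO_INDIC = str.maketrans(WESTERN, ARABIC_INDIC)


def indic_to_western(text: str) -> str:
    """Replace Arabic-Indic digits with 0-9; other characters are untouched."""
    return text.translate(TO_WESTERN)


def western_to_indic(text: str) -> str:
    """Replace 0-9 with Arabic-Indic digits; other characters are untouched."""
    return text.translate(TO_INDIC)


print(indic_to_western("٣ أبريل ٢٠٢٥"))
print(western_to_indic("3 أبريل 2025"))
```

Because the two tables are exact inverses, converting in one direction and back returns the original digits.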

`tafkeet(number, case='nominative', currency=None)`

Convert a number to grammatically correct Arabic words with full إعراب support.

sahlnlp.tafkeet(0)        # => "صفر"
sahlnlp.tafkeet(11)       # => "أحد عشر"
sahlnlp.tafkeet(101)      # => "مائة وواحد"
sahlnlp.tafkeet(1011)     # => "ألف وأحد عشر"
sahlnlp.tafkeet(250000)   # => "مائتان وخمسون ألفاً"

# Case inflection (إعراب)
sahlnlp.tafkeet(20, case='nominative')   # => "عشرون" (مرفوع)
sahlnlp.tafkeet(20, case='accusative')   # => "عشرين" (منصوب)
sahlnlp.tafkeet(2000, case='nominative')  # => "ألفان"
sahlnlp.tafkeet(2000, case='accusative')  # => "ألفين"

# Currency (SAR)
sahlnlp.tafkeet(150, currency='SAR')     # => "مائة وخمسون ريالاً"
sahlnlp.tafkeet(1.5, currency='SAR')     # => "واحد ريالاً وخمسة هللة"
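
The core of number-to-words conversion is the below-1000 case, which the larger scales reuse. The following is a simplified sketch covering the nominative case only; the word tables and the `below_1000` helper are written for illustration and are not taken from the library.

```python
ONES = [
    "", "واحد", "اثنان", "ثلاثة", "أربعة", "خمسة", "ستة", "سبعة", "ثمانية",
    "تسعة", "عشرة", "أحد عشر", "اثنا عشر", "ثلاثة عشر", "أربعة عشر",
    "خمسة عشر", "ستة عشر", "سبعة عشر", "ثمانية عشر", "تسعة عشر",
]
TENS = ["", "", "عشرون", "ثلاثون", "أربعون", "خمسون", "ستون", "سبعون",
        "ثمانون", "تسعون"]
HUNDREDS = ["", "مائة", "مائتان", "ثلاثمائة", "أربعمائة", "خمسمائة",
            "ستمائة", "سبعمائة", "ثمانمائة", "تسعمائة"]
ZERO = "صفر"
JOINER = " \u0648"  # a space followed by the conjunction waw


def below_1000(n: int) -> str:
    """Spell out 0..999 in the nominative case."""
    if not 0 <= n < 1000:
        raise ValueError("n must be between 0 and 999")
    if n == 0:
        return ZERO
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(HUNDREDS[hundreds])
    if 0 < rest < 20:
        parts.append(ONES[rest])        # 1..19 have dedicated words
    elif rest:
        tens, ones = divmod(rest, 10)
        if ones:
            parts.append(ONES[ones])    # the unit is spoken before the ten
        parts.append(TENS[tens])
    return JOINER.join(parts)


for n in (11, 25, 101, 150):
    print(n, below_1000(n))
```

Supporting the accusative case would mean swapping in inflected forms for the tens and the duals, which this sketch leaves out.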

Advanced Analysis (`sahlnlp.analyzer`) — Built from scratch, zero dependencies

`detect_dialect(text)`

Detect the most likely Arabic dialect using weighted lexicon-based classification. Supports Gulf, Levantine, Egyptian, and Maghrebi dialects.

sahlnlp.detect_dialect("شلونك يا خوي")
# => {"Gulf": 1.0, "Levantine": 0.0, "Egyptian": 0.0, "Maghrebi": 0.0}

sahlnlp.detect_dialect("عاوز اروح ازاي")
# => {"Gulf": 0.0, "Levantine": 0.0, "Egyptian": 1.0, "Maghrebi": 0.0}
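
Weighted lexicon-based classification can be sketched in a few lines: sum the weights of matched words per dialect, then normalize so the scores add up to 1. The mini-lexicons and weights below are invented for illustration and are far smaller than anything a real classifier would use.

```python
# Hypothetical lexicons: word -> weight (higher means more distinctive).
LEXICONS = {
    "Gulf": {"شلونك": 2.0, "خوي": 1.0},
    "Levantine": {"كيفك": 2.0, "هلق": 2.0},
    "Egyptian": {"عاوز": 2.0, "ازاي": 2.0},
    "Maghrebi": {"بزاف": 2.0, "واش": 1.0},
}


def detect_dialect_sketch(text: str) -> dict[str, float]:
    """Return a normalized score per dialect; all zeros if nothing matches."""
    tokens = text.split()
    raw = {
        dialect: sum(lexicon.get(token, 0.0) for token in tokens)
        for dialect, lexicon in LEXICONS.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {dialect: 0.0 for dialect in raw}
    return {dialect: score / total for dialect, score in raw.items()}


print(detect_dialect_sketch("شلونك يا خوي"))
```

Words outside every lexicon contribute nothing, so neutral vocabulary does not dilute the scores.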

`extract_keywords(text, top_n=5)`

Extract top keywords using a pure-Python TF-IDF implementation. Splits text on punctuation for IDF calculation and filters Arabic stop-words.

sahlnlp.extract_keywords("الذكاء الاصطناعي فرع من علوم الحاسوب. الذكاء مهم.", top_n=3)
# => [("الحاسوب", ...), ("علوم", ...), ("الاصطناعي", ...)]

`suggest_correction(word, dictionary, use_keyboard=True)`

Find the closest matching word using Levenshtein distance with optional Arabic keyboard proximity penalties (adjacent keys get reduced substitution cost).

sahlnlp.suggest_correction("مدرية", ["مدرسة", "مدينة", "مربية"])
# => "مدرسة"

sahlnlp.suggest_correction("مكتية", ["مكتبة", "مكتب", "مكية"])
# => "مكتبة"
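
The keyboard-aware distance can be illustrated with a standard dynamic-programming Levenshtein in which substituting two neighboring keys costs less than a full edit. In this sketch the adjacency data covers only the home row of the common Arabic layout, and the 0.5 cost is an assumed value, not the library's.

```python
# Home row of the common Arabic keyboard layout, left to right as typed.
HOME_ROW = "شسيبلاتنمكط"
ADJACENT = {frozenset(pair) for pair in zip(HOME_ROW, HOME_ROW[1:])}


def sub_cost(a: str, b: str, use_keyboard: bool) -> float:
    """Cost of replacing a with b; neighboring keys are cheaper (assumed 0.5)."""
    if a == b:
        return 0.0
    if use_keyboard and frozenset((a, b)) in ADJACENT:
        return 0.5
    return 1.0


def distance(s: str, t: str, use_keyboard: bool = True) -> float:
    """Levenshtein distance computed row by row with weighted substitution."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1.0,                           # deletion
                cur[j - 1] + 1.0,                        # insertion
                prev[j - 1] + sub_cost(a, b, use_keyboard),
            ))
        prev = cur
    return prev[-1]


def suggest(word: str, dictionary: list[str], use_keyboard: bool = True) -> str:
    """Return the dictionary entry with the smallest distance to word."""
    return min(dictionary, key=lambda cand: distance(word, cand, use_keyboard))


print(suggest("مدرية", ["مدرسة", "مدينة", "مربية"]))
```

Keeping only the previous row of the table holds memory at O(len(t)) instead of the full matrix.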

`compute_tf(tokens)` / `compute_idf(documents)`

Lower-level TF and IDF functions for custom pipelines.

from sahlnlp import compute_tf, compute_idf

tf = compute_tf(["كتاب", "كتاب", "قلم"])   # {"كتاب": 0.667, "قلم": 0.333}
idf = compute_idf([["كتاب", "قلم"], ["كتاب", "حبر"]])
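
For readers who want to see what such functions compute, here is a standard-library sketch. The TF definition (count divided by total tokens) matches the values in the comment above; the smoothed IDF formula is an assumption, since the library's exact formula is not stated here.

```python
import math
from collections import Counter


def compute_tf_sketch(tokens: list[str]) -> dict[str, float]:
    """Term frequency: occurrences of each token divided by the token count."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}


def compute_idf_sketch(documents: list[list[str]]) -> dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.

    The smoothing variant is an assumption made for this sketch.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


tf = compute_tf_sketch(["كتاب", "كتاب", "قلم"])
idf = compute_idf_sketch([["كتاب", "قلم"], ["كتاب", "حبر"]])
scores = {term: tf[term] * idf.get(term, 0.0) for term in tf}
print(scores)
```

A term that appears in every document gets the minimum IDF of 1.0 under this formula, so rarer terms are ranked higher at equal frequency.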

Development

# Clone the repository
git clone https://github.com/mralwaleed/SahlNLP.git
cd SahlNLP

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

Security & Privacy — PII Masking (`sahlnlp.guardian`)

`mask_sensitive_info(text, mode="tag", mask_char="*")`

Detect and redact Personally Identifiable Information from Arabic text. Supports Saudi phone numbers, national IDs, IBANs, emails, and Arabic names (using contextual title-based detection).

Tag mode — replaces PII with descriptive labels:

sahlnlp.mask_sensitive_info(
    "السيد أحمد رقمه 0551234567 وهويته 1234567890 وآيبان SA0380000000608010167519",
    mode="tag",
)
# => "[NAME] رقمه [PHONE] وهويته [ID] وآيبان [IBAN]"

Mask mode — replaces PII with * while preserving first/last characters:

sahlnlp.mask_sensitive_info("اتصل على 0551234567", mode="mask")
# => "اتصل على 05*****567"

Detected entities:

| Entity | Pattern | Example |
|---|---|---|
| Saudi Phone | `+9665...`, `05...`, `5...` | `0551234567` |
| National ID | 10 digits starting with 1 or 2 | `1234567890` |
| Saudi IBAN | `SA` + 22 digits | `SA0380000000608010167519` |
| Email | Standard RFC 5322 | `user@example.com` |
| Arabic Names | Title-prefix heuristic (`السيد`, `الدكتور`, etc.) + `عبد`/`بن` patterns | `السيد أحمد محمد` |
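
As an illustration of the two modes for a single entity type, the sketch below handles Saudi phone numbers only. The regex and the `mask_phone_sketch` name are assumptions; the number of characters kept (two leading, three trailing) is inferred from the `05*****567` example above.

```python
import re

# Hypothetical pattern: +9665XXXXXXXX, 05XXXXXXXX or 5XXXXXXXX,
# not embedded in a longer run of digits.
RE_SA_PHONE = re.compile(r"(?<!\d)(?:\+9665|05|5)\d{8}(?!\d)")


def mask_phone_sketch(text: str, mode: str = "tag", mask_char: str = "*") -> str:
    """Replace phone numbers with a [PHONE] tag or a partially masked value."""
    if mode not in ("tag", "mask"):
        raise ValueError("mode must be 'tag' or 'mask'")

    def replace(match: re.Match) -> str:
        value = match.group(0)
        if mode == "tag":
            return "[PHONE]"
        # Keep the first two and last three characters, mask the rest.
        return value[:2] + mask_char * (len(value) - 5) + value[-3:]

    return RE_SA_PHONE.sub(replace, text)


print(mask_phone_sketch("اتصل على 0551234567", mode="tag"))
print(mask_phone_sketch("اتصل على 0551234567", mode="mask"))
```

The digit lookarounds keep the phone pattern from firing inside longer numbers such as a national ID or an IBAN, which is one reason a pipeline might also run the longest patterns first.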

License

This project is licensed under the MIT License — see the LICENSE file for details.

SahlNLP - Arabic Documentation

A lightweight, fast Python library for processing Arabic text without any external libraries.

Features

Zero external dependencies: uses only the Python standard library
High performance: pre-compiled regex patterns and a small memory footprint
Full type hints: excellent editor support and autocompletion
Comprehensive: cleaning, normalization, numeral conversion, tafkeet (number-to-words), dialect detection, keyword extraction, fuzzy matching, and masking of sensitive information
Advanced algorithms from scratch: TF-IDF, Levenshtein distance, dialect classification, and PII masking, all built without external libraries
Fully tested: 194 tests with a 100% pass rate

Installation

pip install sahlnlp

Quick example

import sahlnlp

# Clean the text
sahlnlp.clean_all("مَرْحَباً بـكـــــم")
# => "مرحبا بكم"

# Normalize for search
sahlnlp.normalize_search("أحمد مُعَلِّمٌ في المدرسة")
# => "احمد معلم في المدرسه"

# Convert numbers to words
sahlnlp.tafkeet(150)
# => "مائة وخمسون"

# Convert Arabic-Indic digits
sahlnlp.indic_to_arabic("٣ أبريل ٢٠٢٥")
# => "3 أبريل 2025"

Development

pip install -e ".[dev]"
pytest tests/ -v

Release history

0.4.0 (this version): Apr 16, 2026
0.3.0: Apr 15, 2026
0.2.0: Apr 15, 2026


