SahlNLP

A zero-dependency, ultra-fast Arabic NLP toolkit for text preprocessing, normalization, and analysis.

Python 3.9+ | License: MIT

SahlNLP (سهل = easy in Arabic) is a lightweight Python library designed for Arabic text preprocessing, normalization, and advanced analysis. It targets AI engineers and web developers who need a fast, no-overhead solution for handling Arabic text.


Features

  • Zero external dependencies: uses only Python's standard library
  • High performance: pre-compiled regex patterns and a minimal memory footprint
  • Full type hints: good IDE support and autocompletion
  • Comprehensive: cleaning, normalization, numeral conversion, number-to-words (tafkeet), dialect detection, keyword extraction, fuzzy matching, and PII masking
  • Well-tested: 194 tests with a 100% pass rate
  • Algorithms implemented from scratch: TF-IDF, Levenshtein distance, weighted dialect classification, and PII masking, all without external libraries

Installation

pip install sahlnlp

Quick Start

import sahlnlp

# Clean noisy Arabic text
sahlnlp.clean_all("مَرْحَباً بـكـــــم في <b>موقعنا</b> https://example.com")
# => "مرحبا بكم في موقعنا"

# Normalize for search indexing
sahlnlp.normalize_search("أحمد مُعَلِّمٌ في المدرسة")
# => "احمد معلم في المدرسه"

# Convert numbers to Arabic words
sahlnlp.tafkeet(150)
# => "مائة وخمسون"

# Convert Arabic-Indic digits to standard numerals
sahlnlp.indic_to_arabic("٣ أبريل ٢٠٢٥")
# => "3 أبريل 2025"

# Detect Arabic dialect (Gulf, Levantine, Egyptian, Maghrebi)
sahlnlp.detect_dialect("شلونك يا خوي")
# => {"Gulf": 1.0, "Levantine": 0.0, "Egyptian": 0.0, "Maghrebi": 0.0}

# Extract keywords using pure-Python TF-IDF
sahlnlp.extract_keywords("الذكاء الاصطناعي فرع مهم. الذكاء الاصطناعي متطور.", top_n=3)
# => [("الاصطناعي", 0.12), ("متطور", 0.10), ("فرع", 0.10)]

# Fuzzy match with Arabic keyboard-aware Levenshtein distance
sahlnlp.suggest_correction("مدرية", ["مدرسة", "مدينة", "مربية"])
# => "مدرسة"

API Reference

Text Cleaning (sahlnlp.cleaner)

remove_tashkeel(text)

Remove all Arabic diacritical marks (tashkeel).

sahlnlp.remove_tashkeel("كِتَاب")
# => "كتاب"
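
For readers curious how diacritic stripping can work with the standard library alone, here is a minimal sketch. It is not SahlNLP's implementation: the function name `strip_tashkeel` and the exact character range (U+064B to U+0652 plus the superscript alef U+0670) are assumptions made for illustration.

```python
import re

# Assumed diacritic set: fathatan (U+064B) through sukun (U+0652),
# plus the superscript alef (U+0670). The library may cover more or fewer marks.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    """Delete every character matched by the assumed TASHKEEL pattern."""
    return TASHKEEL.sub("", text)


print(strip_tashkeel("كِتَاب"))
```

Compiling the pattern once at import time is the usual way to get the "pre-compiled regex" behavior the feature list mentions.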

remove_tatweel(text)

Remove tatweel/kashida characters (ـ).

sahlnlp.remove_tatweel("الســــلام")
# => "السلام"

remove_html_and_links(text)

Remove HTML tags and URLs from text.

sahlnlp.remove_html_and_links("زوروا <b>http://example.com</b>")
# => "زوروا "
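
A rough sketch of tag and URL removal using two regular expressions follows. The patterns and the name `strip_html_and_links` are hypothetical and much simpler than a real HTML parser; the library's own patterns may differ.

```python
import re

TAG = re.compile(r"<[^>]+>")                  # anything shaped like <...>
URL = re.compile(r"(?:https?://|www\.)\S+")   # scheme or "www." up to the next whitespace


def strip_html_and_links(text: str) -> str:
    """Remove tags first so a URL wrapped in markup is fully exposed, then remove URLs."""
    return URL.sub("", TAG.sub("", text))


print(repr(strip_html_and_links("زوروا <b>http://example.com</b>")))
```

Note that the trailing space survives, as in the example above; collapsing whitespace would be a separate step.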

remove_repeated_chars(text, max_repeat=2)

Reduce character flooding to a maximum number of repetitions.

sahlnlp.remove_repeated_chars("مرحباًاااا")
# => "مرحباًاا"
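
Character flooding can be reduced with a regex backreference. The sketch below is one possible approach under that assumption, not the library's code; `squeeze_repeats` is a made-up name.

```python
import re


def squeeze_repeats(text: str, max_repeat: int = 2) -> str:
    """Collapse any run longer than max_repeat down to exactly max_repeat characters."""
    if max_repeat < 1:
        raise ValueError("max_repeat must be at least 1")
    # (.) captures one character; \1{n,} requires at least n further copies of it,
    # so the whole match is a run of n + 1 or more identical characters.
    pattern = re.compile(r"(.)\1{%d,}" % max_repeat, re.DOTALL)
    return pattern.sub(lambda m: m.group(1) * max_repeat, text)


print(squeeze_repeats("coooool"))  # cool
```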

clean_all(text, ...)

Master cleaning function. Applies all cleaning operations with toggle flags.

sahlnlp.clean_all(
    "مَرْحَباً",
    remove_tashkeel_flag=True,
    remove_tatweel_flag=True,
    remove_html_flag=True,
    remove_repeated_flag=True,
    max_repeat=2,
)
# => "مرحبا"

Text Normalization (sahlnlp.normalizer)

normalize_hamza(text)

Convert all Alef variations (أ, إ, آ) to bare Alef (ا).

sahlnlp.normalize_hamza("أحمد إبراهيم آدم")
# => "احمد ابراهيم ادم"

normalize_taa(text, to_haa=True)

Convert Taa Marbuta (ة) to Haa (ه), or vice versa.

sahlnlp.normalize_taa("مدرسة")          # => "مدرسه"
sahlnlp.normalize_taa("مدرسه", to_haa=False)  # => "مدرسة"

normalize_yaa(text)

Convert Alef Maksura (ى) to Yaa (ي).

sahlnlp.normalize_yaa("موسى")
# => "موسي"

normalize_search(text)

Aggressive normalization for search engine indexing. Combines the hamza, taa, and yaa normalization steps above and, as the example shows, also strips tashkeel.

sahlnlp.normalize_search("أحمد مُعَلِّمٌ في المدرسة")
# => "احمد معلم في المدرسه"
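
The normalization steps in this section amount to a character mapping plus diacritic removal. A minimal sketch using `str.translate` is shown below; the name `fold_for_search` and the exact mapping are assumptions, and the library may apply additional steps.

```python
import re

# Assumed diacritic range, as in common Arabic preprocessing.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")

# One translation table covering the three letter-level normalizations.
FOLD = str.maketrans({
    "أ": "ا", "إ": "ا", "آ": "ا",  # Alef variants -> bare Alef
    "ة": "ه",                      # Taa Marbuta -> Haa
    "ى": "ي",                      # Alef Maksura -> Yaa
})


def fold_for_search(text: str) -> str:
    """Strip diacritics, then fold letter variants in a single pass."""
    return TASHKEEL.sub("", text).translate(FOLD)


print(fold_for_search("أحمد مُعَلِّمٌ في المدرسة"))
```

Because this folding is lossy (for example, ة and ه become indistinguishable), it suits index keys better than display text.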

Number Conversion (sahlnlp.converter)

indic_to_arabic(text)

Convert Arabic-Indic digits (٠١٢٣...) to standard numerals (0123...).

sahlnlp.indic_to_arabic("٣ أبريل ٢٠٢٥")
# => "3 أبريل 2025"

arabic_to_indic(text)

Convert standard numerals (0123...) to Arabic-Indic digits (٠١٢٣...).

sahlnlp.arabic_to_indic("3 أبريل 2025")
# => "٣ أبريل ٢٠٢٥"
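
Both digit conversions can be expressed as translation tables. This sketch uses hypothetical names (`to_western`, `to_indic`) and handles only U+0660 to U+0669; whether the library also handles the Extended Arabic-Indic (Persian) digits is not stated here.

```python
# Build the ten Arabic-Indic digits from their code points (U+0660..U+0669).
INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
WESTERN_DIGITS = "0123456789"

TO_WESTERN = str.maketrans(INDIC_DIGITS, WESTERN_DIGITS)
TO_INDIC = str.maketrans(WESTERN_DIGITS, INDIC_DIGITS)


def to_western(text: str) -> str:
    """Replace Arabic-Indic digits with 0-9; all other characters pass through."""
    return text.translate(TO_WESTERN)


def to_indic(text: str) -> str:
    """Replace 0-9 with Arabic-Indic digits; all other characters pass through."""
    return text.translate(TO_INDIC)


print(to_western("٣ أبريل ٢٠٢٥"))  # 3 أبريل 2025
```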

tafkeet(number)

Convert a number to its written Arabic word form. Supports integers and floats.

sahlnlp.tafkeet(0)        # => "صفر"
sahlnlp.tafkeet(11)       # => "أحد عشر"
sahlnlp.tafkeet(100)      # => "مائة"
sahlnlp.tafkeet(150)      # => "مائة وخمسون"
sahlnlp.tafkeet(1000)     # => "ألف"
sahlnlp.tafkeet(1001)     # => "ألف وواحد"
sahlnlp.tafkeet(1000000)  # => "مليون"
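
To illustrate the idea behind tafkeet, the sketch below spells out integers from 0 to 999 only. It is a simplified assumption, not the library's algorithm: it uses one fixed (masculine, nominative) form per number and ignores floats, thousands, and grammatical agreement. The name `words_below_1000` is hypothetical.

```python
ONES = ["", "واحد", "اثنان", "ثلاثة", "أربعة", "خمسة", "ستة", "سبعة", "ثمانية", "تسعة"]
TEENS = ["عشرة", "أحد عشر", "اثنا عشر", "ثلاثة عشر", "أربعة عشر",
         "خمسة عشر", "ستة عشر", "سبعة عشر", "ثمانية عشر", "تسعة عشر"]
TENS = ["", "", "عشرون", "ثلاثون", "أربعون", "خمسون", "ستون", "سبعون", "ثمانون", "تسعون"]
HUNDREDS = ["", "مائة", "مائتان", "ثلاثمائة", "أربعمائة", "خمسمائة",
            "ستمائة", "سبعمائة", "ثمانمائة", "تسعمائة"]


def words_below_1000(n: int) -> str:
    """Spell out 0..999; larger or negative values are out of scope for this sketch."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n == 0:
        return "صفر"
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(HUNDREDS[hundreds])
    if 10 <= rest <= 19:
        parts.append(TEENS[rest - 10])   # 10..19 have their own forms
    elif rest:
        tens, ones = divmod(rest, 10)
        if ones:
            parts.append(ONES[ones])     # Arabic reads ones before tens
        if tens:
            parts.append(TENS[tens])
    return " و".join(parts)              # parts are joined with "wa" (and)


print(words_below_1000(150))
```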

Advanced Analysis (sahlnlp.analyzer) — Built from scratch, zero dependencies

detect_dialect(text)

Detect the most likely Arabic dialect using weighted lexicon-based classification. Supports Gulf, Levantine, Egyptian, and Maghrebi dialects.

sahlnlp.detect_dialect("شلونك يا خوي")
# => {"Gulf": 1.0, "Levantine": 0.0, "Egyptian": 0.0, "Maghrebi": 0.0}

sahlnlp.detect_dialect("عاوز اروح ازاي")
# => {"Gulf": 0.0, "Levantine": 0.0, "Egyptian": 1.0, "Maghrebi": 0.0}
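
Weighted lexicon-based classification can be pictured as summing per-dialect word weights and normalizing the totals. The tiny lexicon and weights below are invented for illustration, as is the name `score_dialects`; the library's lexicon and weighting scheme are not shown in this document.

```python
# Hypothetical mini-lexicon: dialect -> {marker word: weight}.
LEXICON = {
    "Gulf": {"شلونك": 2.0, "وايد": 1.5},
    "Levantine": {"كيفك": 2.0, "هلق": 1.5},
    "Egyptian": {"عاوز": 2.0, "ازاي": 2.0},
    "Maghrebi": {"بزاف": 2.0, "واش": 1.5},
}


def score_dialects(text: str) -> dict:
    """Return each dialect's share of the total matched weight (all 0.0 if nothing matches)."""
    tokens = text.split()
    raw = {
        dialect: sum(weights.get(token, 0.0) for token in tokens)
        for dialect, weights in LEXICON.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {dialect: 0.0 for dialect in LEXICON}
    return {dialect: score / total for dialect, score in raw.items()}


print(score_dialects("شلونك يا خوي"))
```

Normalizing by the total is what makes the scores sum to 1.0 when at least one marker word is found.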

extract_keywords(text, top_n=5)

Extract top keywords using a pure-Python TF-IDF implementation. Splits text on punctuation for IDF calculation and filters Arabic stop-words.

sahlnlp.extract_keywords("الذكاء الاصطناعي فرع من علوم الحاسوب. الذكاء مهم.", top_n=3)
# => [("الحاسوب", ...), ("علوم", ...), ("الاصطناعي", ...)]
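
The description above (split on punctuation for IDF, filter stop-words, rank by TF-IDF) can be sketched as follows. Everything here is an assumption for illustration: the stop-word list is a small made-up sample, the smoothed IDF formula is one common variant, and the function names are hypothetical. Scores and ranking will therefore differ from the library's output.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"من", "في", "على", "إلى", "عن"}  # hypothetical sample, not the library's list


def term_frequencies(tokens):
    """TF as count divided by the total number of tokens."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def inverse_doc_frequencies(documents):
    """Smoothed IDF: log((1 + N) / (1 + df)) + 1, an assumed variant."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def top_keywords(text, top_n=5):
    """Treat each punctuation-delimited sentence as a document and rank terms by TF * IDF."""
    sentences = re.split(r"[.!?؟،؛\n]+", text)
    docs = [[t for t in s.split() if t not in STOP_WORDS] for s in sentences]
    docs = [doc for doc in docs if doc]
    tokens = [term for doc in docs for term in doc]
    if not tokens:
        return []
    tf = term_frequencies(tokens)
    idf = inverse_doc_frequencies(docs)
    scored = {term: tf[term] * idf[term] for term in tf}
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))[:top_n]


print(top_keywords("الذكاء الاصطناعي فرع من علوم الحاسوب. الذكاء مهم.", top_n=3))
```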

suggest_correction(word, dictionary, use_keyboard=True)

Find the closest matching word using Levenshtein distance with optional Arabic keyboard proximity penalties (adjacent keys get reduced substitution cost).

sahlnlp.suggest_correction("مدرية", ["مدرسة", "مدينة", "مربية"])
# => "مدرسة"

sahlnlp.suggest_correction("مكتية", ["مكتبة", "مكتب", "مكية"])
# => "مكتبة"
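
A keyboard-aware edit distance can be sketched as ordinary Levenshtein dynamic programming where substituting two neighboring keys costs less than 1. The adjacency map below is a simplification (horizontal neighbors on an assumed standard Arabic layout), the 0.5 cost is an arbitrary choice, and `keyboard_distance` and `closest_word` are hypothetical names, not the library's internals.

```python
# Assumed key rows; the bottom row is split where the lam-alef ligature key sits.
ROWS = [
    "ض ص ث ق ف غ ع ه خ ح ج د".split(),
    "ش س ي ب ل ا ت ن م ك ط".split(),
    "ئ ء ؤ ر".split(),
    "ى ة و ز ظ".split(),
]
ADJACENT = set()
for row in ROWS:
    for left, right in zip(row, row[1:]):
        ADJACENT.add((left, right))
        ADJACENT.add((right, left))


def keyboard_distance(a: str, b: str, use_keyboard: bool = True) -> float:
    """Levenshtein distance; adjacent-key substitutions cost 0.5 when use_keyboard is True."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif use_keyboard and (ca, cb) in ADJACENT:
                sub = 0.5
            else:
                sub = 1.0
            cur.append(min(prev[j] + 1.0,         # deletion
                           cur[j - 1] + 1.0,      # insertion
                           prev[j - 1] + sub))    # substitution or match
        prev = cur
    return prev[-1]


def closest_word(word, candidates, use_keyboard=True):
    """Return the candidate with the smallest distance (first one wins on ties)."""
    return min(candidates, key=lambda c: keyboard_distance(word, c, use_keyboard))


print(closest_word("مدرية", ["مدرسة", "مدينة", "مربية"]))
```

The reduced cost matters most when two candidates are otherwise tied: in the second example above, مكتبة and مكية are both one plain edit away from the input, and the adjacent-key discount breaks the tie.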

compute_tf(tokens) / compute_idf(documents)

Lower-level TF and IDF functions for custom pipelines.

from sahlnlp import compute_tf, compute_idf

tf = compute_tf(["كتاب", "كتاب", "قلم"])   # {"كتاب": 0.667, "قلم": 0.333}
idf = compute_idf([["كتاب", "قلم"], ["كتاب", "حبر"]])

Development

# Clone the repository
git clone https://github.com/your-username/SahlNLP.git
cd SahlNLP

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

Security & Privacy — PII Masking (sahlnlp.guardian)

mask_sensitive_info(text, mode="tag", mask_char="*")

Detect and redact Personally Identifiable Information from Arabic text. Supports Saudi phone numbers, national IDs, IBANs, emails, and Arabic names (using contextual title-based detection).

Tag mode — replaces PII with descriptive labels:

sahlnlp.mask_sensitive_info(
    "السيد أحمد رقمه 0551234567 وهويته 1234567890 وآيبان SA0380000000608010167519",
    mode="tag",
)
# => "[NAME] رقمه [PHONE] وهويته [ID] وآيبان [IBAN]"

Mask mode — replaces PII with * while preserving first/last characters:

sahlnlp.mask_sensitive_info("اتصل على 0551234567", mode="mask")
# => "اتصل على 05*****567"

Detected entities:

Entity | Pattern | Example
Saudi phone | +9665..., 05..., 5... | 0551234567
National ID | 10 digits starting with 1 or 2 | 1234567890
Saudi IBAN | SA + 22 digits | SA0380000000608010167519
Email | Standard RFC 5322 | user@example.com
Arabic names | Title-prefix heuristic (السيد, الدكتور, etc.) plus عبد/بن patterns | السيد أحمد محمد
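
The pattern-based entities in the table can be approximated with a single regex that uses named groups. This is a sketch under stated assumptions, not the guardian module's code: the patterns are simplified (the email pattern is far looser than RFC 5322), name detection is omitted, the "keep the first 2 and last 3 characters" rule is inferred from the phone example above, and `redact` is a hypothetical name.

```python
import re

# Order matters: the IBAN is tried first so its digits are never read as a phone or ID.
PII = re.compile(
    r"(?P<IBAN>(?<![A-Za-z0-9])SA\d{22}(?!\d))"
    r"|(?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
    r"|(?P<PHONE>(?<!\d)(?:\+9665|05|5)\d{8}(?!\d))"
    r"|(?P<ID>(?<!\d)[12]\d{9}(?!\d))"
)


def redact(text: str, mode: str = "tag", mask_char: str = "*") -> str:
    """Replace matches with [LABEL] in tag mode, or partially hide them in mask mode."""
    if mode not in ("tag", "mask"):
        raise ValueError("mode must be 'tag' or 'mask'")

    def replace(match):
        if mode == "tag":
            return f"[{match.lastgroup}]"   # name of the group that matched
        value = match.group(0)
        if len(value) <= 5:
            return mask_char * len(value)
        # Assumed rule: keep the first 2 and last 3 characters.
        return value[:2] + mask_char * (len(value) - 5) + value[-3:]

    return PII.sub(replace, text)


print(redact("اتصل على 0551234567", mode="mask"))
```

Using one combined pattern with a callback avoids a second pass re-matching text that an earlier replacement has already produced.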

License

This project is licensed under the MIT License — see the LICENSE file for details.


SahlNLP - Arabic Documentation (English translation)

A lightweight, fast Python library for processing Arabic text with no external libraries.

Features

  • Zero external dependencies: uses only the Python standard library
  • High performance: pre-compiled regex patterns and a minimal memory footprint
  • Full type hints: good editor support and autocompletion
  • Comprehensive: cleaning, normalization, numeral conversion, tafkeet, dialect detection, keyword extraction, fuzzy matching, and masking of sensitive information
  • Algorithms implemented from scratch: TF-IDF, Levenshtein distance, dialect classification, and PII masking, built without external libraries
  • Fully tested: 194 tests with a 100% pass rate

Installation

pip install sahlnlp

Quick Example

import sahlnlp

# Clean text
sahlnlp.clean_all("مَرْحَباً بـكـــــم")
# => "مرحبا بكم"

# Normalize for search
sahlnlp.normalize_search("أحمد مُعَلِّمٌ في المدرسة")
# => "احمد معلم في المدرسه"

# Convert numbers to words
sahlnlp.tafkeet(150)
# => "مائة وخمسون"

# Convert Arabic-Indic digits
sahlnlp.indic_to_arabic("٣ أبريل ٢٠٢٥")
# => "3 أبريل 2025"

Development

pip install -e ".[dev]"
pytest tests/ -v
