SahlNLP

A zero-dependency, ultra-fast Arabic NLP toolkit for text preprocessing, normalization, and analysis.

Requires Python 3.9+. License: MIT.

SahlNLP (سهل = easy in Arabic) is a lightweight Python library designed for Arabic text preprocessing, normalization, and advanced analysis. It targets AI engineers and web developers who need a fast, no-overhead solution for handling Arabic text.


Features

  • Zero external dependencies — only uses Python's built-in standard library
  • High performance — pre-compiled regex patterns, minimal memory footprint
  • Full type hints — excellent IDE support and autocompletion
  • Comprehensive — cleaning, normalization, numeral conversion, number-to-words (tafkeet), dialect detection, keyword extraction, and fuzzy matching
  • Well-tested — 156 tests with 100% pass rate
  • Advanced algorithms from scratch — TF-IDF, Levenshtein distance, and weighted dialect classification built with zero external libraries

Installation

pip install sahlnlp

Quick Start

import sahlnlp

# Clean noisy Arabic text
sahlnlp.clean_all("مَرْحَباً بـكـــــم في <b>موقعنا</b> https://example.com")
# => "مرحبا بكم في موقعنا"

# Normalize for search indexing
sahlnlp.normalize_search("أحمد مُعَلِّمٌ في المدرسة")
# => "احمد معلم في المدرسه"

# Convert numbers to Arabic words
sahlnlp.tafkeet(150)
# => "مائة وخمسون"

# Convert Arabic-Indic digits to standard (Western) numerals
sahlnlp.indic_to_arabic("٣ أبريل ٢٠٢٥")
# => "3 أبريل 2025"

# Detect Arabic dialect (Gulf, Levantine, Egyptian, Maghrebi)
sahlnlp.detect_dialect("شلونك يا خوي")
# => {"Gulf": 1.0, "Levantine": 0.0, "Egyptian": 0.0, "Maghrebi": 0.0}

# Extract keywords using pure-Python TF-IDF
sahlnlp.extract_keywords("الذكاء الاصطناعي فرع مهم. الذكاء الاصطناعي متطور.", top_n=3)
# => [("الاصطناعي", 0.12), ("متطور", 0.10), ("فرع", 0.10)]

# Fuzzy match with Arabic keyboard-aware Levenshtein distance
sahlnlp.suggest_correction("مدرية", ["مدرسة", "مدينة", "مربية"])
# => "مدرسة"

API Reference

Text Cleaning (sahlnlp.cleaner)

remove_tashkeel(text)

Remove all Arabic diacritical marks (tashkeel).

sahlnlp.remove_tashkeel("كِتَاب")
# => "كتاب"
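A function like this can be approximated with a single pre-compiled character class from the standard library. The sketch below is not SahlNLP's actual code: the name `strip_diacritics` and the chosen code point range (U+064B to U+0652, plus the superscript alef U+0670) are assumptions made for illustration.

```python
import re

# Assumed range: fathatan (U+064B) through sukun (U+0652),
# plus the superscript alef (U+0670).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic short-vowel marks (hypothetical stand-in for remove_tashkeel)."""
    return _DIACRITICS.sub("", text)
```

Compiling the pattern once at import time is what the "pre-compiled regex patterns" feature refers to in general terms; the library's own pattern may cover more or fewer marks.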

remove_tatweel(text)

Remove tatweel/kashida characters (ـ).

sahlnlp.remove_tatweel("الســــلام")
# => "السلام"

remove_html_and_links(text)

Remove HTML tags and URLs from text.

sahlnlp.remove_html_and_links("زوروا <b>http://example.com</b>")
# => "زوروا "
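As a rough illustration of the idea, tags and URLs can each be matched by a regular expression. This is a sketch under stated assumptions, not the library's implementation: `strip_markup_and_urls` and both patterns are hypothetical, and a regex is only a heuristic for HTML (the standard library's `html.parser` is the more robust route for real markup).

```python
import re

# Hypothetical patterns: anything between angle brackets counts as a tag,
# and a URL runs from its scheme (or "www.") to the next whitespace.
_TAG = re.compile(r"<[^>]+>")
_URL = re.compile(r"(?:https?://|www\.)\S+")


def strip_markup_and_urls(text: str) -> str:
    """Drop tags first so a closing tag is not swallowed into a URL match."""
    without_tags = _TAG.sub("", text)
    return _URL.sub("", without_tags)
```

Note that the space before the removed tag survives, which agrees with the trailing space in the documented output above.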

remove_repeated_chars(text, max_repeat=2)

Reduce character flooding to a maximum number of repetitions.

sahlnlp.remove_repeated_chars("مرحباًاااا")
# => "مرحباًاا"
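Character flooding can be collapsed with a backreference. The following is a minimal sketch assuming the behaviour described above (runs longer than `max_repeat` are cut down to exactly `max_repeat`); the name `limit_repeats` is hypothetical and the library may handle edge cases differently.

```python
import re


def limit_repeats(text: str, max_repeat: int = 2) -> str:
    """Shorten any run of one repeated character to at most max_repeat copies."""
    if max_repeat < 1:
        raise ValueError("max_repeat must be at least 1")
    # (.) captures one character; \1{n,} requires n or more further copies,
    # so the whole match is a run of at least n + 1 identical characters.
    pattern = re.compile(r"(.)\1{%d,}" % max_repeat, flags=re.DOTALL)
    return pattern.sub(lambda match: match.group(1) * max_repeat, text)
```

Runs are counted per code point, so a diacritic between two identical letters breaks the run.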

clean_all(text, ...)

Master cleaning function. Applies all cleaning operations with toggle flags.

sahlnlp.clean_all(
    "مَرْحَباً",
    remove_tashkeel_flag=True,
    remove_tatweel_flag=True,
    remove_html_flag=True,
    remove_repeated_flag=True,
    max_repeat=2,
)
# => "مرحبا"

Text Normalization (sahlnlp.normalizer)

normalize_hamza(text)

Convert all Alef variations (أ, إ, آ) to bare Alef (ا).

sahlnlp.normalize_hamza("أحمد إبراهيم آدم")
# => "احمد ابراهيم ادم"

normalize_taa(text, to_haa=True)

Convert Taa Marbuta (ة) to Haa (ه), or vice versa.

sahlnlp.normalize_taa("مدرسة")          # => "مدرسه"
sahlnlp.normalize_taa("مدرسه", to_haa=False)  # => "مدرسة"

normalize_yaa(text)

Convert Alef Maksura (ى) to Yaa (ي).

sahlnlp.normalize_yaa("موسى")
# => "موسي"

normalize_search(text)

Aggressive normalization for search engine indexing. Combines all normalization steps.

sahlnlp.normalize_search("أحمد مُعَلِّمٌ في المدرسة")
# => "احمد معلم في المدرسه"
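The normalization steps in this section compose naturally into one pass: strip diacritics, then fold letter variants with a translation table. The sketch below assumes that order and that exact set of foldings; `fold_for_search` is a hypothetical name, and SahlNLP's real `normalize_search` may apply additional steps.

```python
import re

# Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
_DIACRITICS = re.compile(r"[\u064B-\u0652]")

# One translation table covering the three foldings described above.
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0629": "\u0647",  # taa marbuta           -> haa
    "\u0649": "\u064A",  # alef maksura          -> yaa
})


def fold_for_search(text: str) -> str:
    """Strip diacritics, then fold hamza, taa marbuta and alef maksura variants."""
    return _DIACRITICS.sub("", text).translate(_FOLD)
```

Folding of this kind trades precision for recall: distinct words can collapse to the same form, so it suits index and query keys rather than text shown to users.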

Number Conversion (sahlnlp.converter)

indic_to_arabic(text)

Convert Arabic-Indic digits (٠١٢٣...) to standard numerals (0123...).

sahlnlp.indic_to_arabic("٣ أبريل ٢٠٢٥")
# => "3 أبريل 2025"

arabic_to_indic(text)

Convert standard numerals (0123...) to Arabic-Indic digits (٠١٢٣...).

sahlnlp.arabic_to_indic("3 أبريل 2025")
# => "٣ أبريل ٢٠٢٥"
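Both directions of digit conversion reduce to a character-for-character mapping, which `str.translate` handles directly. This is an illustrative sketch with hypothetical function names, assuming only the ten Arabic-Indic digits U+0660 to U+0669 are involved (the Extended Arabic-Indic digits used for Persian and Urdu, U+06F0 to U+06F9, are not covered).

```python
# Arabic-Indic digits occupy ten consecutive code points starting at U+0660.
ARABIC_INDIC = "".join(chr(0x0660 + digit) for digit in range(10))
WESTERN = "0123456789"

_TO_WESTERN = str.maketrans(ARABIC_INDIC, WESTERN)
_TO_ARABIC_INDIC = str.maketrans(WESTERN, ARABIC_INDIC)


def to_western_digits(text: str) -> str:
    """Replace Arabic-Indic digits with 0-9, leaving other characters untouched."""
    return text.translate(_TO_WESTERN)


def to_arabic_indic_digits(text: str) -> str:
    """Replace 0-9 with Arabic-Indic digits, leaving other characters untouched."""
    return text.translate(_TO_ARABIC_INDIC)
```

Because the two tables are exact inverses, converting in one direction and back returns the original text.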

tafkeet(number)

Convert a number to its written Arabic word form. Supports integers and floats.

sahlnlp.tafkeet(0)        # => "صفر"
sahlnlp.tafkeet(11)       # => "أحد عشر"
sahlnlp.tafkeet(100)      # => "مائة"
sahlnlp.tafkeet(150)      # => "مائة وخمسون"
sahlnlp.tafkeet(1000)     # => "ألف"
sahlnlp.tafkeet(1001)     # => "ألف وواحد"
sahlnlp.tafkeet(1000000)  # => "مليون"
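To show how number-to-words conversion can be assembled from lookup tables, here is a minimal sketch limited to integers from 0 to 999. It is not the library's algorithm: `below_thousand_to_words` is a hypothetical helper, it uses masculine forms only, and it ignores thousands, millions, and floats. The ordering rule it encodes is that units come before tens ("five and twenty"), joined with the conjunction و.

```python
ONES = ["", "واحد", "اثنان", "ثلاثة", "أربعة", "خمسة", "ستة", "سبعة", "ثمانية", "تسعة"]
TEENS = ["عشرة", "أحد عشر", "اثنا عشر", "ثلاثة عشر", "أربعة عشر",
         "خمسة عشر", "ستة عشر", "سبعة عشر", "ثمانية عشر", "تسعة عشر"]
TENS = ["", "", "عشرون", "ثلاثون", "أربعون", "خمسون", "ستون", "سبعون", "ثمانون", "تسعون"]
HUNDREDS = ["", "مائة", "مائتان", "ثلاثمائة", "أربعمائة", "خمسمائة",
            "ستمائة", "سبعمائة", "ثمانمائة", "تسعمائة"]


def below_thousand_to_words(number: int) -> str:
    """Spell out 0..999 in Arabic words (masculine forms, illustrative only)."""
    if not 0 <= number <= 999:
        raise ValueError("this sketch only handles 0 to 999")
    if number == 0:
        return "صفر"
    hundreds, remainder = divmod(number, 100)
    parts = []
    if hundreds:
        parts.append(HUNDREDS[hundreds])
    if 10 <= remainder <= 19:
        # 10 to 19 have fixed compound forms.
        parts.append(TEENS[remainder - 10])
    elif remainder:
        tens, ones = divmod(remainder, 10)
        if ones:
            parts.append(ONES[ones])  # units are spoken before tens
        if tens:
            parts.append(TENS[tens])
    # Join the groups with the conjunction "wa" attached to the following word.
    return " و".join(parts)
```

For the values it covers, the sketch agrees with the documented outputs for 0, 11, 100, and 150; larger numbers need additional rules for plural and dual forms of the scale words.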

Advanced Analysis (sahlnlp.analyzer) — Built from scratch, zero dependencies

detect_dialect(text)

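Score text against four Arabic dialects (Gulf, Levantine, Egyptian, and Maghrebi) using weighted lexicon-based classification. As the examples show, the result is a dictionary mapping each dialect to a score; the highest-scoring dialect is the most likely one.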

sahlnlp.detect_dialect("شلونك يا خوي")
# => {"Gulf": 1.0, "Levantine": 0.0, "Egyptian": 0.0, "Maghrebi": 0.0}

sahlnlp.detect_dialect("عاوز اروح ازاي")
# => {"Gulf": 0.0, "Levantine": 0.0, "Egyptian": 1.0, "Maghrebi": 0.0}
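Weighted lexicon-based classification can be sketched in a few lines: each dialect has marker words with weights, the weights of the words found in the text are summed per dialect, and the sums are normalized so they add up to 1. Everything specific here is an assumption for illustration: the tiny `LEXICON`, its weights, the whitespace tokenization, and the name `score_dialects` are not taken from SahlNLP.

```python
from typing import Dict

# Hypothetical marker words and weights; a real lexicon would be far larger.
LEXICON: Dict[str, Dict[str, float]] = {
    "Gulf": {"شلونك": 2.0, "خوي": 1.0},
    "Levantine": {"شو": 1.0, "هلق": 2.0},
    "Egyptian": {"عاوز": 2.0, "ازاي": 2.0},
    "Maghrebi": {"بزاف": 2.0, "واش": 1.0},
}


def score_dialects(text: str) -> Dict[str, float]:
    """Return each dialect's share of the total marker weight found in text."""
    tokens = text.split()
    raw = {
        dialect: sum(weights.get(token, 0.0) for token in tokens)
        for dialect, weights in LEXICON.items()
    }
    total = sum(raw.values())
    if total == 0:
        # No marker word matched, so no dialect gets any score.
        return {dialect: 0.0 for dialect in LEXICON}
    return {dialect: score / total for dialect, score in raw.items()}
```

With this scheme, text that mixes markers from two dialects yields fractional scores, and text with no markers yields all zeros rather than a guess.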

extract_keywords(text, top_n=5)

Extract top keywords using a pure-Python TF-IDF implementation. Splits text on punctuation for IDF calculation and filters Arabic stop-words.

sahlnlp.extract_keywords("الذكاء الاصطناعي فرع من علوم الحاسوب. الذكاء مهم.", top_n=3)
# => [("الحاسوب", ...), ("علوم", ...), ("الاصطناعي", ...)]
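The description above (split on punctuation, filter stop-words, rank by TF-IDF) can be sketched with the standard library alone. This is one possible reading, not the library's code: the stop-word list, the sentence delimiters, the smoothed IDF formula `log((1 + N) / (1 + df)) + 1`, and the name `rank_keywords` are all assumptions, so the scores and ordering will differ from SahlNLP's output.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Tiny hypothetical stop-word list, for illustration only.
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "هو", "هي"}

# Sentence delimiters: Latin punctuation, Arabic question mark and semicolon, newline.
_SENTENCE_SPLIT = re.compile(r"[.!?\u061F\u061B\n]+")
# A word is a run of Arabic letters (U+0621 to U+064A).
_WORD = re.compile(r"[\u0621-\u064A]+")


def rank_keywords(text: str, top_n: int = 5) -> List[Tuple[str, float]]:
    """Rank words by TF-IDF, treating each sentence as one document."""
    sentences = []
    for chunk in _SENTENCE_SPLIT.split(text):
        tokens = [t for t in _WORD.findall(chunk) if t not in STOP_WORDS]
        if tokens:
            sentences.append(tokens)
    all_tokens = [token for sentence in sentences for token in sentence]
    if not all_tokens:
        return []
    term_counts = Counter(all_tokens)
    # Document frequency: number of sentences that contain the term.
    doc_counts = Counter(t for sentence in sentences for t in set(sentence))
    n_docs = len(sentences)
    scores = {
        term: (count / len(all_tokens))
        * (math.log((1 + n_docs) / (1 + doc_counts[term])) + 1.0)
        for term, count in term_counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

The smoothing keeps a word that appears in every sentence from scoring zero; an unsmoothed `log(N / df)` would instead push such words to the bottom of the ranking.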

suggest_correction(word, dictionary, use_keyboard=True)

Find the closest matching word using Levenshtein distance with optional Arabic keyboard proximity penalties (adjacent keys get reduced substitution cost).

sahlnlp.suggest_correction("مدرية", ["مدرسة", "مدينة", "مربية"])
# => "مدرسة"

sahlnlp.suggest_correction("مكتية", ["مكتبة", "مكتب", "مكية"])
# => "مكتبة"
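A keyboard-aware edit distance is the classic Levenshtein dynamic program with a cheaper substitution cost for neighbouring keys. The sketch below assumes a simplified model of the common Arabic keyboard layout in which only horizontally adjacent keys count as neighbours, and a reduced cost of 0.5; `ROWS`, the cost values, and the function names are hypothetical and not taken from SahlNLP.

```python
from typing import Iterable

# Simplified rows of the common Arabic layout (the lam-alef key is omitted).
ROWS = ["ضصثقفغعهخحجد", "شسيبلاتنمكط", "ئءؤرىةوزظ"]
# Unordered pairs of keys that sit side by side on the same row.
ADJACENT = {frozenset(pair) for row in ROWS for pair in zip(row, row[1:])}


def substitution_cost(a: str, b: str, use_keyboard: bool = True) -> float:
    """0 for identical characters, 0.5 for neighbouring keys, otherwise 1."""
    if a == b:
        return 0.0
    if use_keyboard and frozenset((a, b)) in ADJACENT:
        return 0.5
    return 1.0


def weighted_levenshtein(source: str, target: str, use_keyboard: bool = True) -> float:
    """Edit distance with unit insert/delete cost and weighted substitution."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [float(i)]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1.0,       # delete s from source
                current[j - 1] + 1.0,    # insert t into source
                previous[j - 1] + substitution_cost(s, t, use_keyboard),
            ))
        previous = current
    return previous[-1]


def closest_word(word: str, candidates: Iterable[str], use_keyboard: bool = True) -> str:
    """Return the candidate with the smallest weighted distance to word."""
    return min(candidates, key=lambda c: weighted_levenshtein(word, c, use_keyboard))
```

In the first documented example, the typo replaces س with its neighbour ي, so under this model the intended word is at distance 0.5 while the other two candidates are at distance 2.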

compute_tf(tokens) / compute_idf(documents)

Lower-level TF and IDF functions for custom pipelines.

from sahlnlp import compute_tf, compute_idf

tf = compute_tf(["كتاب", "كتاب", "قلم"])   # {"كتاب": 0.667, "قلم": 0.333}
idf = compute_idf([["كتاب", "قلم"], ["كتاب", "حبر"]])

Development

# Clone the repository
git clone https://github.com/your-username/SahlNLP.git
cd SahlNLP

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

License

This project is licensed under the MIT License — see the LICENSE file for details.


SahlNLP - Arabic Documentation

A lightweight, fast Python library for processing Arabic text, with no external libraries.

Features

  • Zero external dependencies: uses only the Python standard library
  • High performance: pre-compiled regex patterns and a small memory footprint
  • Full type hints: excellent editor support and autocompletion
  • Comprehensive: cleaning, normalization, numeral conversion, number-to-words (tafkeet), dialect detection, keyword extraction, and fuzzy matching
  • Advanced algorithms from scratch: TF-IDF, Levenshtein distance, and dialect classification, built without external libraries
  • Fully tested: 156 tests with a 100% pass rate

Installation

pip install sahlnlp

Quick example

import sahlnlp

# Clean text
sahlnlp.clean_all("مَرْحَباً بـكـــــم")
# => "مرحبا بكم"

# Normalize for search
sahlnlp.normalize_search("أحمد مُعَلِّمٌ في المدرسة")
# => "احمد معلم في المدرسه"

# Convert numbers to words
sahlnlp.tafkeet(150)
# => "مائة وخمسون"

# Convert Arabic-Indic digits to standard numerals
sahlnlp.indic_to_arabic("٣ أبريل ٢٠٢٥")
# => "3 أبريل 2025"

Development

pip install -e ".[dev]"
pytest tests/ -v
