Prepare Arabic (and mixed Arabic/English) documents for RAG and search: normalization, sentence-aware chunking, and a provider-agnostic vector index.

arabic-rag-kit

The missing first mile for Arabic RAG: normalize, chunk, and index Arabic (and mixed Arabic/English) documents — with a dependency-free core.

Why this exists

Most RAG and search tooling is built and tested against English. Arabic brings problems those tools quietly get wrong:

  • Diacritics (tashkeel), tatweel, and letter variants (أ/إ/آ vs ا) fragment what should be the same token, tanking retrieval recall.
  • Invisible characters — zero-width joiners and bidirectional control marks — sneak into copied text and corrupt indexes and embeddings.
  • Arabic-Indic digits (٠١٢٣) and Arabic punctuation (؟ ؛ ،) are invisible to English-centric normalizers and sentence splitters, so chunks break in the wrong places.

arabic-rag-kit handles these correctly, with a zero-dependency core so you can drop it into any pipeline. Embeddings and file loaders are opt-in extras — the library never forces a vendor or an API key on you.
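
To make the fragmentation concrete, here is a small standard-library sketch (it does not call arabic-rag-kit) that prints the code points behind four spellings of the same name. The variable names are illustrative only.

```python
import unicodedata

# Four spellings a reader sees as the same name.
variants = ["مُحَمَّد", "محمّد", "محمـــد", "محمد"]

for word in variants:
    # List the Unicode name of every code point in the spelling.
    names = [unicodedata.name(ch) for ch in word]
    print(len(word), word == "محمد", names)

# Exact-match lookup treats every spelling as a different string.
print(len(set(variants)))
```

Only the last spelling compares equal to the plain form, which is why an index built on raw text fragments.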

Use cases

Reach for arabic-rag-kit whenever Arabic text enters a search or LLM pipeline:

  • RAG over Arabic documents — clean, split, and chunk PDFs/Word/text before embedding, so retrieval actually finds the right passage. This is the core use case the library is named for.
  • Better search recall — normalize both the indexed text and the query so that مُحَمَّد, محمّد, and محمـــد all match محمد. Diacritics, tatweel, and alef variants stop fragmenting your index.
  • Deduplication / clustering — use an aggressive normalization profile as a canonical "match key" to detect near-duplicate Arabic strings.
  • Cleaning scraped / copy-pasted text — strip the invisible zero-width and bidirectional control characters that break tokenizers, embeddings, and diffs.
  • Data prep for fine-tuning or classification: consistent normalization and digit handling (٢٠٢٦ → 2026) as a preprocessing step.
  • Sentence segmentation — split Arabic text on ؟ ؛ ، and the Arabic full stop (not just Latin punctuation) for summarization, translation, or display.
  • A lightweight in-memory vector search — prototype semantic search with any embedding function, no database or vendor lock-in.

If your text is English-only, you don't need this. If it's Arabic or mixed Arabic/English, these are exactly the sharp edges that quietly hurt quality.

Install

# Core: normalization + chunking. Zero third-party dependencies.
pip install arabic-rag-kit

# Add the numpy-backed vector index:
pip install "arabic-rag-kit[search]"

# Add the sentence-transformers embedder helper:
pip install "arabic-rag-kit[embeddings]"

# Add PDF/DOCX loaders:
pip install "arabic-rag-kit[docs]"

# Everything:
pip install "arabic-rag-kit[all]"

Requires Python 3.11+.

Quickstart

The 30-second RAG pipeline

The whole point of the library, end to end: turn a raw Arabic document into clean, retrievable chunks and search them. This example needs the [search] and [embeddings] extras.

from arabic_rag_kit import chunk_text, VectorIndex
from arabic_rag_kit.search import sentence_transformers_embedder

document = (
    "تأسست شركة جي بي إم في عام ١٩٩٠. "
    "تقدم الشركة حلولاً في مجال الذكاء الاصطناعي والحوسبة السحابية. "
    "كيف يمكن للعملاء البدء؟ عبر التواصل مع فريق المبيعات."
)

# 1) Normalize + split into overlapping, sentence-aware chunks in one step.
chunks = chunk_text(document, chunk_size=90, chunk_overlap=20, normalize=True)

# 2) Embed and index (bring any embedding function you like).
index = VectorIndex(sentence_transformers_embedder())
index.add(
    [c.text for c in chunks],
    metadatas=[{"chunk_id": c.index} for c in chunks],
)

# 3) Ask a question — retrieval finds the right passage.
for hit in index.search("كيف يبدأ العملاء؟", k=2):
    print(round(hit.score, 3), hit.metadata, hit.text)

The sections below break the same three steps down.

1. Normalize

from arabic_rag_kit import normalize

raw = "الْعَرَبِيَّةُ لُغَةٌ جَمِيلَة… كتـــاب رقم ١٢٣"
normalize(raw)
# -> "العربية لغة جميلة… كتاب رقم 123"

Why it matters: the same word, written with diacritics, with a shadda, or with tatweel, becomes one canonical form. Alef variants fold the same way, so search and dedup work:

normalize("مُحَمَّد")   # diacritics  -> "محمد"
normalize("محمّد")      # shadda      -> "محمد"
normalize("محمـــد")    # tatweel     -> "محمد"

# Perfect for building a "match key" — normalize the index AND the query:
normalize("مُحَمَّد") == normalize("محمد")   # -> True
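
For readers curious what the diacritic and tatweel steps amount to, the following is a minimal standard-library sketch, not the library's implementation. The code-point ranges come from the options table in this README; strip_marks is a hypothetical helper name, and alef folding is deliberately left out.

```python
import re

# Tashkeel/harakat U+064B to U+0652 plus U+0670 (superscript alef).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Tatweel (kashida) U+0640.
_TATWEEL = "\u0640"

def strip_marks(text: str) -> str:
    """Hypothetical helper: drop diacritics and tatweel, nothing else."""
    return _DIACRITICS.sub("", text).replace(_TATWEEL, "")

print(strip_marks("مُحَمَّد"), strip_marks("محمّد"), strip_marks("محمـــد"))
```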

Every step is toggleable. Meaning-changing folds (hamza, ta-marbuta, alef maqsura) are off by default so you don't distort text unless you ask:

normalize("مسؤول", normalize_hamza=True)          # -> "مسوول"   (ؤ → و)
normalize("جامعة", normalize_ta_marbuta=True)      # -> "جامعه"   (ة → ه)
normalize("مصطفى", normalize_alef_maqsura=True)    # -> "مصطفي"   (ى → ي)
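
The three opt-in folds are plain character substitutions. A sketch with str.translate, assuming only the mappings shown in the comments above (the table names are hypothetical and this is not the library's code):

```python
# Each table mirrors one opt-in option; escapes are used so the
# mapping is unambiguous regardless of how the editor renders Arabic.
HAMZA = str.maketrans({"\u0624": "\u0648", "\u0626": "\u064A"})  # ؤ → و, ئ → ي
TA_MARBUTA = str.maketrans({"\u0629": "\u0647"})                 # ة → ه
ALEF_MAQSURA = str.maketrans({"\u0649": "\u064A"})               # ى → ي

print("مسؤول".translate(HAMZA))
print("جامعة".translate(TA_MARBUTA))
print("مصطفى".translate(ALEF_MAQSURA))
```

Because each fold rewrites a letter rather than removing a mark, two different words can end up with the same spelling, which is the reason these stay off unless requested.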

Digits from both Arabic-Indic sets are converted to ASCII, invisible control characters are stripped, and runs of whitespace are collapsed and trimmed:

normalize("سنة ٢٠٢٦ و ۱۹۹۹")     # -> "سنة 2026 و 1999"
normalize("  الذكاء   الاصطناعي\n\tمفيد  ")   # -> "الذكاء الاصطناعي مفيد"
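
A rough standard-library sketch of these three steps follows. The set of invisible characters is an assumption (common zero-width and bidirectional controls); the library's exact list may differ, and clean is a hypothetical name.

```python
# Map both digit blocks to ASCII: Arabic-Indic U+0660..U+0669 and
# Extended Arabic-Indic U+06F0..U+06F9.
DIGITS = {base + i: str(i) for base in (0x0660, 0x06F0) for i in range(10)}

# Assumed set of invisible characters; mapping to None deletes them.
INVISIBLE = dict.fromkeys(
    [0x200B, 0x200C, 0x200D, 0x200E, 0x200F, 0xFEFF,
     *range(0x202A, 0x202F), *range(0x2066, 0x206A)]
)

def clean(text: str) -> str:
    """Convert digits, delete invisible characters, collapse whitespace."""
    text = text.translate(DIGITS).translate(INVISIBLE)
    return " ".join(text.split())

print(clean("سنة ٢٠٢٦ و ۱۹۹۹"))
```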

Reuse one configured instance across a whole corpus. For a search index you often want the aggressive profile so more variants collapse together:

from arabic_rag_kit import Normalizer, NormalizerConfig

search_key = Normalizer(NormalizerConfig(
    normalize_hamza=True,
    normalize_ta_marbuta=True,
    normalize_alef_maqsura=True,
))
search_key("المُؤسَّسة على الطُّلّاب")   # -> "الموسسه علي الطلاب"

2. Split & chunk (sentence-aware)

split_sentences understands Arabic punctuation and does not break on decimals or abbreviations:

from arabic_rag_kit import split_sentences

split_sentences("الإصدار 3.14 متاح الآن. راجع e.g. الوثائق! هل لديك سؤال؟")
# -> ['الإصدار 3.14 متاح الآن.', 'راجع e.g. الوثائق!', 'هل لديك سؤال؟']
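
To illustrate the idea (not the library's actual algorithm), here is a simplified splitter that handles only . ! ? and the Arabic question mark, skips periods inside decimals, and consults a tiny, hypothetical abbreviation list:

```python
import re

# Sentence-final marks handled here: . ! ? and the Arabic question mark (U+061F),
# but only when followed by whitespace or the end of the text.
_TERMINATOR = re.compile(r"[.!?\u061F]+(?=\s|$)")
# Tiny illustrative list; a real splitter would need a much larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc."}

def split_sentences_sketch(text: str) -> list[str]:
    sentences: list[str] = []
    start = 0
    for match in _TERMINATOR.finditer(text):
        candidate = text[start:match.end()]
        # The period in "3.14" never matches: it is followed by a digit.
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # "e.g." ends a token, not a sentence
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

for s in split_sentences_sketch("الإصدار 3.14 متاح الآن. راجع e.g. الوثائق! هل لديك سؤال؟"):
    print(s)
```

The sketch would wrongly merge sentences that genuinely end in an abbreviation, which is one reason a purpose-built splitter is preferable.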

chunk_text packs sentences into overlapping chunks for embedding:

from arabic_rag_kit import chunk_text

text = (
    "الذكاء الاصطناعي يغير طريقة عملنا. "
    "أنظمة استرجاع المعلومات تعتمد على تقطيع جيد للنص. "
    "كيف نضمن جودة التقطيع؟ عبر احترام حدود الجمل العربية."
)

for c in chunk_text(text, chunk_size=80, chunk_overlap=20):
    print(f"[{c.index}] ({c.start_char}:{c.end_char}) {c.text}")
# [0] (0:34)   الذكاء الاصطناعي يغير طريقة عملنا.
# [1] (35:107) أنظمة استرجاع المعلومات تعتمد على تقطيع جيد للنص. كيف نضمن جودة التقطيع؟
# [2] (85:138) كيف نضمن جودة التقطيع؟ عبر احترام حدود الجمل العربية.
# note how chunk [2] starts at 85 — before [1] ends at 107 — that is the overlap.

Each Chunk has text, index, start_char, and end_char. Chunks never exceed chunk_size, prefer to break on Arabic/Latin sentence boundaries, and the offsets index straight back into the source (text[c.start_char:c.end_char] == c.text) — handy for highlighting the retrieved passage. Pass normalize=True to normalize and chunk in one call (offsets then refer to the normalized text).
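
The offset invariant makes highlighting a matter of slicing. A small sketch, using a hypothetical Span dataclass as a stand-in for the library's Chunk (only the fields used here) and a made-up highlight helper:

```python
from dataclasses import dataclass

@dataclass
class Span:
    """Stand-in for Chunk: just the fields this example needs."""
    text: str
    start_char: int
    end_char: int

def highlight(source: str, span: Span, open_mark: str = "[[", close_mark: str = "]]") -> str:
    # The invariant described above: offsets slice straight back into the source.
    if source[span.start_char:span.end_char] != span.text:
        raise ValueError("offsets do not match this source text")
    return source[:span.start_char] + open_mark + span.text + close_mark + source[span.end_char:]

source = "الذكاء الاصطناعي يغير طريقة عملنا. كيف نضمن جودة التقطيع؟"
passage = "كيف نضمن جودة التقطيع؟"
start = source.index(passage)
span = Span(passage, start, start + len(passage))
print(highlight(source, span))
```

When chunks were produced with normalize=True, pass the normalized text as source, since that is what the offsets refer to.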

3. Index & search (optional [search] extra)

VectorIndex never hardcodes an embedding provider: you hand it any embed_fn (text → vector). Option A: the built-in multilingual helper, which needs the [embeddings] extra in addition to [search].

from arabic_rag_kit import VectorIndex
from arabic_rag_kit.search import sentence_transformers_embedder

index = VectorIndex(sentence_transformers_embedder())
index.add(
    ["القاهرة عاصمة مصر", "باريس عاصمة فرنسا"],
    metadatas=[{"country": "مصر"}, {"country": "فرنسا"}],
)

for hit in index.search("ما هي عاصمة مصر؟", k=1):
    print(hit.text, round(hit.score, 3), hit.metadata)
    # -> القاهرة عاصمة مصر  <cosine score>  {'country': 'مصر'}

Option B: bring your own model / API. Any callable that returns a vector works — OpenAI, Cohere, a local model, whatever. No vendor lock-in, no API key baked into the library:

from openai import OpenAI
client = OpenAI()

def openai_embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

index = VectorIndex(openai_embed)

Each result is a SearchResult with .text, .score (cosine similarity), .metadata, and .index. Attach metadata (source file, page, chunk id, URL) on add and read it back on every hit to cite or filter your answers.

4. Load documents (optional [docs] extra)

Loaders return plain text — feed it straight into the pipeline above:

from arabic_rag_kit.loaders import load_txt, load_pdf, load_docx
from arabic_rag_kit import chunk_text, VectorIndex
from arabic_rag_kit.search import sentence_transformers_embedder

raw = load_pdf("tender_ar.pdf")        # needs [docs];  or load_docx / load_txt
chunks = chunk_text(raw, chunk_size=1000, chunk_overlap=200, normalize=True)

index = VectorIndex(sentence_transformers_embedder())
index.add(
    [c.text for c in chunks],
    metadatas=[{"source": "tender_ar.pdf", "chunk_id": c.index} for c in chunks],
)

answer_context = index.search("ما هي شروط التأهيل؟", k=4)

# The three loaders and what each one needs:
load_txt("notes_ar.txt")     # stdlib, always available
load_pdf("report_ar.pdf")    # needs [docs]  (pypdf)
load_docx("memo_ar.docx")    # needs [docs]  (python-docx)

Command line

Installing the package also installs an arabic-rag-kit command — handy for quick jobs and shell pipelines, no Python file needed. Each subcommand reads from an argument, an --input file, or standard input:

# Normalize a string
arabic-rag-kit normalize "الْعَرَبِيَّةُ ١٢٣ كتـــاب"
# -> العربية 123 كتاب

# Normalize a whole file with the aggressive profile, write the result out
arabic-rag-kit normalize -i doc_ar.txt -o clean.txt --hamza --ta-marbuta --alef-maqsura

# Pipe text in from anything
cat report_ar.txt | arabic-rag-kit normalize > report_clean.txt

# Split into sentences (one per line)
arabic-rag-kit sentences "جملة أولى. جملة ثانية؟"

# Chunk a document and emit JSON with offsets (great for feeding a script)
arabic-rag-kit chunk -i doc_ar.txt --size 500 --overlap 100 --normalize --json

Run arabic-rag-kit --help (or arabic-rag-kit normalize --help) to see every flag. The normalization flags mirror the Python options: --hamza, --ta-marbuta, --alef-maqsura turn on the off-by-default folds, while --no-diacritics, --no-tatweel, --no-alef, --no-digits, --no-control, --no-whitespace turn off the on-by-default steps.

API overview

| Symbol | Import | Extra | What it does |
| --- | --- | --- | --- |
| normalize(text, **opts) | arabic_rag_kit | | One-shot Arabic normalization |
| Normalizer / NormalizerConfig | arabic_rag_kit | | Reusable, configured normalizer |
| split_sentences(text) | arabic_rag_kit | | Arabic/Latin sentence splitting |
| chunk_text(text, chunk_size, chunk_overlap, normalize) | arabic_rag_kit | | Sentence-aware chunking |
| Chunk | arabic_rag_kit | | text, index, start_char, end_char |
| VectorIndex | arabic_rag_kit | [search] | Cosine-similarity vector index |
| sentence_transformers_embedder(model_name) | arabic_rag_kit.search | [embeddings] | Ready-made embed_fn |
| load_txt / load_pdf / load_docx | arabic_rag_kit.loaders | [docs]* | File loaders (*txt is stdlib) |
| arabic-rag-kit (CLI) | shell command | | normalize / sentences / chunk |

Normalization options (defaults)

| Option | Default | Effect |
| --- | --- | --- |
| remove_diacritics | True | Strip tashkeel/harakat (U+064B–U+0652, U+0670) |
| remove_tatweel | True | Remove kashida elongation (U+0640) |
| normalize_alef | True | أ إ آ ٱ → ا |
| normalize_hamza | False | ؤ → و, ئ → ي |
| normalize_ta_marbuta | False | ة → ه |
| normalize_alef_maqsura | False | ى → ي |
| convert_digits | True | ٠–٩ and ۰–۹ → 0–9 |
| strip_control_chars | True | Remove zero-width & bidi controls |
| collapse_whitespace | True | Collapse runs of whitespace and trim |

Configuration reference

Which normalization profile should I use?

  • Default profile (just call normalize(text)) — safe for display and general cleanup. It removes diacritics, tatweel, and control characters, folds alef variants, and converts digits, without changing letters that carry meaning (hamza, ta marbuta, alef maqsura stay as written).
  • Aggressive / "search key" profile — turn on normalize_hamza, normalize_ta_marbuta, and normalize_alef_maqsura too. This collapses more spelling variants into one form, which maximizes recall for search, matching, and deduplication. The trade-off is that the output is no longer a "correct" spelling — use it as an internal index/match key, not for display.
  • Golden rule: apply the same profile to your documents and your queries. A query normalized differently from the index will silently miss matches.

from arabic_rag_kit import Normalizer, NormalizerConfig

display   = Normalizer()                                   # safe default
search_key = Normalizer(NormalizerConfig(
    normalize_hamza=True, normalize_ta_marbuta=True, normalize_alef_maqsura=True,
))
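
The golden rule is easy to demonstrate without the library. In this sketch the two key functions are hypothetical stand-ins for the default and aggressive profiles (only the ta marbuta fold is modelled), and a dict plays the role of the index:

```python
# ة → ه, standing in for normalize_ta_marbuta=True.
_TA_MARBUTA = str.maketrans({"\u0629": "\u0647"})

def default_key(text: str) -> str:
    return text  # the default profile leaves ta marbuta as written

def aggressive_key(text: str) -> str:
    return text.translate(_TA_MARBUTA)

docs = ["جامعة القاهرة"]
# The index is built with the aggressive profile.
index = {aggressive_key(doc): doc for doc in docs}

query = "جامعة القاهرة"
miss = index.get(default_key(query))     # profiles differ: no match, no error
hit = index.get(aggressive_key(query))   # same profile on both sides: match
print(miss, hit)
```

Nothing raises in the mismatched case; the lookup simply returns nothing, which is what makes the mistake easy to overlook.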

Choosing chunk_size and chunk_overlap

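  • chunk_size is measured in characters, not tokens. A rough guide: ~4 characters per token, so chunk_size=1000 ≈ 250 tokens. That ratio is an English-text rule of thumb and varies by tokenizer and language, so check a sample of your Arabic text against your embedding model's tokenizer and pick a size comfortably under its limit.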
  • chunk_overlap repeats the tail of one chunk at the head of the next, so context that straddles a boundary is not lost. A common starting point is 10–20% of chunk_size (e.g. chunk_size=1000 with chunk_overlap=200). It must be smaller than chunk_size, or chunk_text raises ValueError.
  • Smaller chunks → more precise retrieval but more vectors to store; larger chunks → fewer vectors but coarser hits. Start at 1000 / 200 and tune.
  • Set normalize=True to normalize and chunk in one call; the returned offsets then index into the normalized text.
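
The arithmetic behind these guidelines can be sketched as follows. The estimate helper is hypothetical, the 4-characters-per-token ratio is the rough guide quoted above, and the chunk count assumes fixed-width windows, so sentence-aware chunking will usually produce somewhat more chunks.

```python
import math

def estimate(n_chars: int, chunk_size: int = 1000, chunk_overlap: int = 200,
             chars_per_token: float = 4.0) -> dict:
    """Back-of-envelope sizing for a document of n_chars characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk advances by the part that is not repeated.
    step = chunk_size - chunk_overlap
    if n_chars <= chunk_size:
        chunks = 1
    else:
        chunks = math.ceil((n_chars - chunk_overlap) / step)
    return {"tokens_per_chunk": chunk_size / chars_per_token, "chunks": chunks}

print(estimate(50_000))
# -> {'tokens_per_chunk': 250.0, 'chunks': 63}
```

Halving chunk_size roughly doubles the number of vectors to store, which is the precision versus storage trade-off described in the list.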

Development

pip install -e ".[dev]"
ruff check .
pytest

See CONTRIBUTING.md.

Built by GBM

Created and maintained by Hasan Odeh at Gulf Business Machines (GBM). Born out of real Arabic RAG work, and open-sourced because Arabic NLP deserves better tooling. Contributions welcome.

License

MIT © Gulf Business Machines (GBM)
