Prepare Arabic (and mixed Arabic/English) documents for RAG and search: normalization, sentence-aware chunking, and a provider-agnostic vector index.

Maintainers

HasanOdeh84

arabic-rag-kit

The missing first mile for Arabic RAG: normalize, chunk, and index Arabic (and mixed Arabic/English) documents — with a dependency-free core.

Why this exists

Most RAG and search tooling is built and tested against English. Arabic brings problems those tools quietly get wrong:

Diacritics (tashkeel), tatweel, and letter variants (أ/إ/آ vs ا) fragment what should be the same token, tanking retrieval recall.
Invisible characters — zero-width joiners and bidirectional control marks — sneak into copied text and corrupt indexes and embeddings.
Arabic-Indic digits (٠١٢٣) and Arabic punctuation (؟ ؛ ،) are invisible to English-centric normalizers and sentence splitters, so chunks break in the wrong places.
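To see why these variants fragment an index, it helps to look at the code points. The sketch below uses only the Python standard library (not arabic-rag-kit) and builds four spellings of the same word that a reader would consider identical but that compare as different strings:

```python
import unicodedata

plain = "\u0645\u062D\u0645\u062F"                              # the bare four-letter word
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"  # same word with tashkeel
stretched = "\u0645\u062D\u0645\u0640\u0640\u0640\u062F"        # same word with tatweel
joined = "\u0645\u062D\u200D\u0645\u062F"                       # same word with a zero-width joiner

for label, word in [("plain", plain), ("vocalized", vocalized),
                    ("stretched", stretched), ("joined", joined)]:
    # List the characters that are not part of the bare word.
    extras = [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in word if ch not in plain]
    print(f"{label:10} len={len(word)} equal={word == plain} extras={extras}")
```

Only the first line reports `equal=True`; the other three differ in length and content even though they render as (almost) the same word.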

arabic-rag-kit handles these correctly, with a zero-dependency core so you can drop it into any pipeline. Embeddings and file loaders are opt-in extras — the library never forces a vendor or an API key on you.

Use cases

Reach for arabic-rag-kit whenever Arabic text enters a search or LLM pipeline:

RAG over Arabic documents — clean, split, and chunk PDFs/Word/text before embedding, so retrieval actually finds the right passage. This is the core use case the library is named for.
Better search recall — normalize both the indexed text and the query so that مُحَمَّد, محمّد, and محمـــد all match محمد. Diacritics, tatweel, and alef variants stop fragmenting your index.
Deduplication / clustering — use an aggressive normalization profile as a canonical "match key" to detect near-duplicate Arabic strings.
Cleaning scraped / copy-pasted text — strip the invisible zero-width and bidirectional control characters that break tokenizers, embeddings, and diffs.
Data prep for fine-tuning or classification — consistent normalization and digit handling (٢٠٢٦ → 2026) as a preprocessing step.
Sentence segmentation — split Arabic text on ؟ ؛ ، and the Arabic full stop (not just Latin punctuation) for summarization, translation, or display.
A lightweight in-memory vector search — prototype semantic search with any embedding function, no database or vendor lock-in.

If your text is English-only, you don't need this. If it's Arabic or mixed Arabic/English, these are exactly the sharp edges that quietly hurt quality.

Install

# Core: normalization + chunking. Zero third-party dependencies.
pip install arabic-rag-kit

# Add the numpy-backed vector index:
pip install "arabic-rag-kit[search]"

# Add the sentence-transformers embedder helper:
pip install "arabic-rag-kit[embeddings]"

# Add PDF/DOCX loaders:
pip install "arabic-rag-kit[docs]"

# Everything:
pip install "arabic-rag-kit[all]"

Requires Python 3.11+.

Quickstart

The 30-second RAG pipeline

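This is the whole point of the library, end to end: turn a raw Arabic document into clean, retrievable chunks and search them. As written, the example uses VectorIndex and sentence_transformers_embedder, so it needs the [search] and [embeddings] extras (or [all]):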

from arabic_rag_kit import chunk_text, VectorIndex
from arabic_rag_kit.search import sentence_transformers_embedder

document = (
    "تأسست شركة جي بي إم في عام ١٩٩٠. "
    "تقدم الشركة حلولاً في مجال الذكاء الاصطناعي والحوسبة السحابية. "
    "كيف يمكن للعملاء البدء؟ عبر التواصل مع فريق المبيعات."
)

# 1) Normalize + split into overlapping, sentence-aware chunks in one step.
chunks = chunk_text(document, chunk_size=90, chunk_overlap=20, normalize=True)

# 2) Embed and index (bring any embedding function you like).
index = VectorIndex(sentence_transformers_embedder())
index.add(
    [c.text for c in chunks],
    metadatas=[{"chunk_id": c.index} for c in chunks],
)

# 3) Ask a question — retrieval finds the right passage.
for hit in index.search("كيف يبدأ العملاء؟", k=2):
    print(round(hit.score, 3), hit.metadata, hit.text)

The sections below break the same three steps down.

1. Normalize

from arabic_rag_kit import normalize

raw = "الْعَرَبِيَّةُ لُغَةٌ جَمِيلَة… كتـــاب رقم ١٢٣"
normalize(raw)
# -> "العربية لغة جميلة… كتاب رقم 123"

Why it matters: four spellings of the same word (the three variants below plus the plain form) become one. Diacritics, tatweel, and alef variants all fold to a single canonical form, so search and dedup work:

normalize("مُحَمَّد")   # diacritics  -> "محمد"
normalize("محمّد")      # shadda      -> "محمد"
normalize("محمـــد")    # tatweel     -> "محمد"

# Perfect for building a "match key" — normalize the index AND the query:
normalize("مُحَمَّد") == normalize("محمد")   # -> True

Every step is toggleable. Meaning-changing folds (hamza, ta-marbuta, alef maqsura) are off by default so you don't distort text unless you ask:

normalize("مسؤول", normalize_hamza=True)          # -> "مسوول"   (ؤ → و)
normalize("جامعة", normalize_ta_marbuta=True)      # -> "جامعه"   (ة → ه)
normalize("مصطفى", normalize_alef_maqsura=True)    # -> "مصطفي"   (ى → ي)

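Both Arabic-Indic digit sets (٠–٩ and the Eastern ۰–۹) are converted to ASCII digits, and runs of whitespace, including newlines and tabs, are collapsed and trimmed. Invisible control characters are stripped as well: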

normalize("سنة ٢٠٢٦ و ۱۹۹۹")     # -> "سنة 2026 و 1999"
normalize("  الذكاء   الاصطناعي\n\tمفيد  ")   # -> "الذكاء الاصطناعي مفيد"

Reuse one configured instance across a whole corpus. For a search index you often want the aggressive profile so more variants collapse together:

from arabic_rag_kit import Normalizer, NormalizerConfig

search_key = Normalizer(NormalizerConfig(
    normalize_hamza=True,
    normalize_ta_marbuta=True,
    normalize_alef_maqsura=True,
))
search_key("المُؤسَّسة على الطُّلّاب")   # -> "الموسسه علي الطلاب"
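One way to picture what the aggressive profile buys you for deduplication: fold the meaning-changing letters with a translation table and group strings by the result. This is a standalone sketch that uses the letter mappings from the normalization options table, not the library's implementation; `fold_key` and `group_duplicates` are hypothetical names.

```python
from collections import defaultdict

# Letter folds, written as code points to avoid any ambiguity.
FOLDS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0671": "\u0627",  # alef wasla            -> bare alef
    "\u0624": "\u0648",  # waw with hamza        -> waw
    "\u0626": "\u064A",  # yeh with hamza        -> yeh
    "\u0629": "\u0647",  # ta marbuta            -> heh
    "\u0649": "\u064A",  # alef maqsura          -> yeh
})

def fold_key(text: str) -> str:
    """Return an internal match key; the result is not a correct spelling."""
    return text.translate(FOLDS)

def group_duplicates(items: list[str]) -> list[list[str]]:
    """Group strings that share a key, keeping only groups with 2+ members."""
    groups = defaultdict(list)
    for item in items:
        groups[fold_key(item)].append(item)
    return [group for group in groups.values() if len(group) > 1]

for group in group_duplicates(["جامعة", "جامعه", "مصطفى", "مصطفي", "كتاب"]):
    print(group)
```

The key is only ever compared with other keys; the original strings are what you store and display.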

2. Split & chunk (sentence-aware)

split_sentences understands Arabic punctuation and does not break on decimals or abbreviations:

from arabic_rag_kit import split_sentences

split_sentences("الإصدار 3.14 متاح الآن. راجع e.g. الوثائق! هل لديك سؤال؟")
# -> ['الإصدار 3.14 متاح الآن.', 'راجع e.g. الوثائق!', 'هل لديك سؤال؟']
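For intuition, a boundary rule of this kind can be approximated with a regular expression: treat a terminator as a sentence end only when whitespace follows it, so the dot inside 3.14 is never a boundary. The sketch below is a simplified stand-in, not the library's splitter. It assumes `.`, `!`, `?`, the Arabic question mark (U+061F) and the Arabic full stop (U+06D4) as terminators, and a deliberately tiny, hypothetical abbreviation list.

```python
import re

TERMINATORS = ".!?\u061F\u06D4"
ABBREVIATIONS = {"e.g.", "i.e.", "etc."}  # hypothetical, far from complete

# A candidate boundary is one terminator followed by whitespace.
BOUNDARY = re.compile(rf"[{re.escape(TERMINATORS)}]\s+")

def simple_split(text: str) -> list[str]:
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start() + 1]  # include the terminator
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # "e.g. " is not a sentence end; keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(simple_split("Version 3.14 is out. See e.g. the docs! Any questions?"))
```

A real splitter needs a much larger abbreviation list and more care around quotes and ellipses, which is the kind of detail split_sentences is there to handle for you.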

chunk_text packs sentences into overlapping chunks for embedding:

from arabic_rag_kit import chunk_text

text = (
    "الذكاء الاصطناعي يغير طريقة عملنا. "
    "أنظمة استرجاع المعلومات تعتمد على تقطيع جيد للنص. "
    "كيف نضمن جودة التقطيع؟ عبر احترام حدود الجمل العربية."
)

for c in chunk_text(text, chunk_size=80, chunk_overlap=20):
    print(f"[{c.index}] ({c.start_char}:{c.end_char}) {c.text}")
# [0] (0:34)   الذكاء الاصطناعي يغير طريقة عملنا.
# [1] (35:107) أنظمة استرجاع المعلومات تعتمد على تقطيع جيد للنص. كيف نضمن جودة التقطيع؟
# [2] (85:138) كيف نضمن جودة التقطيع؟ عبر احترام حدود الجمل العربية.
# note how chunk [2] starts at 85 — before [1] ends at 107 — that is the overlap.

Each Chunk has text, index, start_char, and end_char. Chunks never exceed chunk_size, prefer to break on Arabic/Latin sentence boundaries, and the offsets index straight back into the source (text[c.start_char:c.end_char] == c.text) — handy for highlighting the retrieved passage. Pass normalize=True to normalize and chunk in one call (offsets then refer to the normalized text).
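The packing idea can be sketched in a few lines. This is an illustrative greedy packer under stated assumptions, not the library's algorithm, and its overlap rule will not necessarily reproduce the offsets shown above: it takes pre-split sentence spans, fills each chunk up to `chunk_size` characters, and starts the next chunk by re-using whole trailing sentences that fit inside `chunk_overlap`. `Span` and `pack` are hypothetical names.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int
    end: int  # exclusive, so text[start:end] is the span's text

def pack(sentences, chunk_size, chunk_overlap):
    """Greedily pack sentence spans into overlapping chunk spans."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks, i = [], 0
    while i < len(sentences):
        start, j = sentences[i].start, i
        # Grow the chunk while the next sentence still fits in chunk_size.
        # (In this sketch a single over-long sentence is emitted whole.)
        while j + 1 < len(sentences) and sentences[j + 1].end - start <= chunk_size:
            j += 1
        chunks.append(Span(start, sentences[j].end))
        if j + 1 == len(sentences):
            break
        # Re-use trailing sentences as overlap, as long as they fit the overlap
        # budget and still leave room for the next unseen sentence.
        k = j + 1
        while (k - 1 > i
               and sentences[j].end - sentences[k - 1].start <= chunk_overlap
               and sentences[j + 1].end - sentences[k - 1].start <= chunk_size):
            k -= 1
        i = k
    return chunks

text = "Aaaa bbbb. Cccc dddd. Eeee ffff. Gggg hhhh."
sentences = [Span(m.start(), m.end()) for m in re.finditer(r"\S[^.!?]*[.!?]", text)]
for c in pack(sentences, chunk_size=25, chunk_overlap=12):
    print(c, repr(text[c.start:c.end]))
```

Because every chunk is a pair of offsets into the original string, the slice `text[c.start:c.end]` always recovers the chunk text, which is the same property that makes highlighting straightforward.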

3. Index & search (optional `[search]` extra)

VectorIndex never hardcodes an embedding provider: you hand it any embed_fn (text → vector). Option A is the built-in multilingual helper, sentence_transformers_embedder, which needs the [embeddings] extra in addition to [search]:

from arabic_rag_kit import VectorIndex
from arabic_rag_kit.search import sentence_transformers_embedder

index = VectorIndex(sentence_transformers_embedder())
index.add(
    ["القاهرة عاصمة مصر", "باريس عاصمة فرنسا"],
    metadatas=[{"country": "مصر"}, {"country": "فرنسا"}],
)

for hit in index.search("ما هي عاصمة مصر؟", k=1):
    print(hit.text, round(hit.score, 3), hit.metadata)
    # -> القاهرة عاصمة مصر  <cosine score>  {'country': 'مصر'}

Option B: bring your own model / API. Any callable that returns a vector works — OpenAI, Cohere, a local model, whatever. No vendor lock-in, no API key baked into the library:

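from openai import OpenAI
from arabic_rag_kit import VectorIndex

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def openai_embed(text: str) -> list[float]:
    # One API call per text; returns the embedding vector as a list of floats.
    response = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return response.data[0].embedding

index = VectorIndex(openai_embed)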

Each result is a SearchResult with .text, .score (cosine similarity), .metadata, and .index. Attach metadata (source file, page, chunk id, URL) on add and read it back on every hit to cite or filter your answers.
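To make the embed_fn contract concrete, here is a dependency-free sketch of a cosine-similarity index. It is a toy, not the library's VectorIndex: `TinyIndex` and `char_bigram_embed` are hypothetical names, results are plain tuples rather than SearchResult objects, and the hashed character-bigram "embedding" exists only so the example runs without a model.

```python
import math
from collections import Counter
from typing import Callable

def char_bigram_embed(text: str, dims: int = 64) -> list[float]:
    """Toy embedding: hashed character-bigram counts (stand-in for a real model)."""
    vec = [0.0] * dims
    for (a, b), count in Counter(zip(text, text[1:])).items():
        vec[(ord(a) * 31 + ord(b)) % dims] += count
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

class TinyIndex:
    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self.embed_fn = embed_fn
        self.rows: list[tuple[str, dict, list[float]]] = []

    def add(self, texts, metadatas=None):
        metadatas = metadatas or [{} for _ in texts]
        for text, meta in zip(texts, metadatas):
            self.rows.append((text, meta, self.embed_fn(text)))

    def search(self, query: str, k: int = 3):
        q = self.embed_fn(query)
        scored = [(cosine(q, vec), text, meta) for text, meta, vec in self.rows]
        # Sort on the score only, so metadata dicts are never compared.
        return sorted(scored, key=lambda row: row[0], reverse=True)[:k]

index = TinyIndex(char_bigram_embed)
index.add(
    ["cairo is the capital of egypt", "paris is the capital of france"],
    metadatas=[{"country": "egypt"}, {"country": "france"}],
)
for score, text, meta in index.search("capital of egypt", k=1):
    print(round(score, 3), text, meta)
```

Swapping `char_bigram_embed` for a real model changes the quality of the ranking but not the shape of the code, which is the point of passing the embedder in as a callable.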

4. Load documents (optional `[docs]` extra)

Loaders return plain text — feed it straight into the pipeline above:

from arabic_rag_kit.loaders import load_txt, load_pdf, load_docx
from arabic_rag_kit import chunk_text, VectorIndex
from arabic_rag_kit.search import sentence_transformers_embedder

raw = load_pdf("tender_ar.pdf")        # needs [docs];  or load_docx / load_txt
chunks = chunk_text(raw, chunk_size=1000, chunk_overlap=200, normalize=True)

index = VectorIndex(sentence_transformers_embedder())
index.add(
    [c.text for c in chunks],
    metadatas=[{"source": "tender_ar.pdf", "chunk_id": c.index} for c in chunks],
)

answer_context = index.search("ما هي شروط التأهيل؟", k=4)

load_txt("notes_ar.txt")     # stdlib, always available
load_pdf("report_ar.pdf")    # needs [docs]  (pypdf)
load_docx("memo_ar.docx")    # needs [docs]  (python-docx)

Command line

Installing the package also installs an arabic-rag-kit command — handy for quick jobs and shell pipelines, no Python file needed. Each subcommand reads from an argument, an --input file, or standard input:

# Normalize a string
arabic-rag-kit normalize "الْعَرَبِيَّةُ ١٢٣ كتـــاب"
# -> العربية 123 كتاب

# Normalize a whole file with the aggressive profile, write the result out
arabic-rag-kit normalize -i doc_ar.txt -o clean.txt --hamza --ta-marbuta --alef-maqsura

# Pipe text in from anything
cat report_ar.txt | arabic-rag-kit normalize > report_clean.txt

# Split into sentences (one per line)
arabic-rag-kit sentences "جملة أولى. جملة ثانية؟"

# Chunk a document and emit JSON with offsets (great for feeding a script)
arabic-rag-kit chunk -i doc_ar.txt --size 500 --overlap 100 --normalize --json

Run arabic-rag-kit --help (or arabic-rag-kit normalize --help) to see every flag. The normalization flags mirror the Python options: --hamza, --ta-marbuta, --alef-maqsura turn on the off-by-default folds, while --no-diacritics, --no-tatweel, --no-alef, --no-digits, --no-control, --no-whitespace turn off the on-by-default steps.

API overview

| Symbol | Import | Extra | What it does |
| --- | --- | --- | --- |
| `normalize(text, **opts)` | `arabic_rag_kit` | none | One-shot Arabic normalization |
| `Normalizer` / `NormalizerConfig` | `arabic_rag_kit` | none | Reusable, configured normalizer |
| `split_sentences(text)` | `arabic_rag_kit` | none | Arabic/Latin sentence splitting |
| `chunk_text(text, chunk_size, chunk_overlap, normalize)` | `arabic_rag_kit` | none | Sentence-aware chunking |
| `Chunk` | `arabic_rag_kit` | none | `text, index, start_char, end_char` |
| `VectorIndex` | `arabic_rag_kit` | `[search]` | Cosine-similarity vector index |
| `sentence_transformers_embedder(model_name)` | `arabic_rag_kit.search` | `[embeddings]` | Ready-made `embed_fn` |
| `load_txt` / `load_pdf` / `load_docx` | `arabic_rag_kit.loaders` | `[docs]` (not needed for `load_txt`, which is stdlib) | File loaders |
| `arabic-rag-kit` (CLI) | shell command | none | `normalize` / `sentences` / `chunk` |

Normalization options (defaults)

| Option | Default | Effect |
| --- | --- | --- |
| `remove_diacritics` | `True` | Strip tashkeel/harakat (U+064B–U+0652, U+0670) |
| `remove_tatweel` | `True` | Remove kashida elongation (U+0640) |
| `normalize_alef` | `True` | `أ إ آ ٱ` → `ا` |
| `normalize_hamza` | `False` | `ؤ` → `و`, `ئ` → `ي` |
| `normalize_ta_marbuta` | `False` | `ة` → `ه` |
| `normalize_alef_maqsura` | `False` | `ى` → `ي` |
| `convert_digits` | `True` | `٠–٩` and `۰–۹` → `0–9` |
| `strip_control_chars` | `True` | Remove zero-width & bidi controls |
| `collapse_whitespace` | `True` | Collapse runs of whitespace and trim |
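As a mental model for the defaults in the table, each step is a small Unicode operation. The sketch below chains them using the standard library only. It is an approximation written for illustration, not the library's code: the exact control-character set is an assumption (zero-width characters, the bidi marks, embeddings, overrides and isolates, and the byte-order mark), and `sketch_normalize` is a hypothetical name.

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652\u0670]")  # ranges from the table
TATWEEL = "\u0640"
ALEF_VARIANTS = str.maketrans("\u0623\u0625\u0622\u0671", "\u0627" * 4)
# Assumed control set: U+200B-U+200F, U+202A-U+202E, U+2066-U+2069, U+FEFF.
CONTROLS = re.compile("[\u200B-\u200F\u202A-\u202E\u2066-\u2069\uFEFF]")
# Arabic-Indic (U+0660..) and Extended Arabic-Indic (U+06F0..) digits -> ASCII.
DIGITS = {base + i: str(i) for base in (0x0660, 0x06F0) for i in range(10)}

def sketch_normalize(text: str) -> str:
    text = CONTROLS.sub("", text)         # strip_control_chars
    text = DIACRITICS.sub("", text)       # remove_diacritics
    text = text.replace(TATWEEL, "")      # remove_tatweel
    text = text.translate(ALEF_VARIANTS)  # normalize_alef
    text = text.translate(DIGITS)         # convert_digits
    return " ".join(text.split())         # collapse_whitespace

print(sketch_normalize("  كتـــاب   رقم ١٢٣ \n"))
```

The off-by-default folds (hamza, ta marbuta, alef maqsura) are left out on purpose, which mirrors the default profile.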

Configuration reference

Which normalization profile should I use?

Default profile (just call normalize(text)) — safe for display and general cleanup. It removes diacritics, tatweel, and control characters, folds alef variants, and converts digits, without changing letters that carry meaning (hamza, ta marbuta, alef maqsura stay as written).
Aggressive / "search key" profile — turn on normalize_hamza, normalize_ta_marbuta, and normalize_alef_maqsura too. This collapses more spelling variants into one form, which maximizes recall for search, matching, and deduplication. The trade-off is that the output is no longer a "correct" spelling — use it as an internal index/match key, not for display.
Golden rule: apply the same profile to your documents and your queries. A query normalized differently from the index will silently miss matches.

from arabic_rag_kit import Normalizer, NormalizerConfig

display   = Normalizer()                                   # safe default
search_key = Normalizer(NormalizerConfig(
    normalize_hamza=True, normalize_ta_marbuta=True, normalize_alef_maqsura=True,
))
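A tiny illustration of the golden rule, using a plain dict as the "index" and a diacritics-only stand-in for the normalizer (`strip_tashkeel` is a hypothetical helper, not the library's API). The same query hits or misses depending only on whether it went through the same key function as the documents:

```python
import re

TASHKEEL = re.compile("[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    return TASHKEEL.sub("", text)

# The same word, written with full diacritics, as it might arrive in a query.
vocalized_query = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"

# Documents were normalized at index time, so the stored key has no diacritics.
index = {strip_tashkeel(vocalized_query): "doc-1"}

print(index.get(vocalized_query))                  # raw query: no match
print(index.get(strip_tashkeel(vocalized_query)))  # same profile as the index: match
```

With embeddings the failure is softer (a lower similarity score rather than a hard miss), but the cause and the fix are the same.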

Choosing `chunk_size` and `chunk_overlap`

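chunk_size is measured in characters, not tokens. A rough guide for embedding models is ~4 characters per token, so chunk_size=1000 ≈ 250 tokens. That ratio comes from English text; Arabic can produce more tokens per character depending on the model's tokenizer, so check with your tokenizer if you are close to the limit. Pick a size comfortably under your embedding model's limit.
chunk_overlap repeats the end of one chunk at the start of the next, so context that straddles a boundary is not lost. A common starting point is 10–20% of chunk_size (e.g. 1000 / 200). It must be smaller than chunk_size, or chunk_text raises ValueError.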
Smaller chunks → more precise retrieval but more vectors to store; larger chunks → fewer vectors but coarser hits. Start at 1000 / 200 and tune.
Set normalize=True to normalize and chunk in one call; the returned offsets then index into the normalized text.
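The sizing guidance above amounts to a little arithmetic. A small helper, assuming the ~4 characters-per-token rule of thumb from the text; the function name, the 80% headroom, and the 20% overlap default are illustrative choices, not library API:

```python
def suggest_chunking(token_limit: int, chars_per_token: float = 4.0,
                     headroom: float = 0.8, overlap_ratio: float = 0.2) -> tuple[int, int]:
    """Return (chunk_size, chunk_overlap) in characters for a model's token limit."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # Stay under the limit by a safety margin, then convert tokens to characters.
    chunk_size = int(token_limit * chars_per_token * headroom)
    chunk_overlap = int(chunk_size * overlap_ratio)
    return chunk_size, chunk_overlap

print(suggest_chunking(512))                       # a 512-token embedding model
print(suggest_chunking(512, chars_per_token=3.0))  # a more pessimistic ratio
```

Lowering `chars_per_token` is the knob to turn if your tokenizer splits Arabic text more finely than the rule of thumb assumes.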

Development

pip install -e ".[dev]"
ruff check .
pytest

See CONTRIBUTING.md.

Built by GBM

Created and maintained by Hasan Odeh at Gulf Business Machines (GBM). Born out of real Arabic RAG work, and open-sourced because Arabic NLP deserves better tooling. Contributions welcome.

License

MIT © Gulf Business Machines (GBM)

Project details

Release history

0.2.0 (this version), Jul 2, 2026

0.1.0, Jul 2, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arabic_rag_kit-0.2.0.tar.gz (25.2 kB)

Uploaded Jul 2, 2026 Source

Built Distribution

arabic_rag_kit-0.2.0-py3-none-any.whl (21.8 kB)

Uploaded Jul 2, 2026 Python 3

File details

Details for the file arabic_rag_kit-0.2.0.tar.gz.

File metadata

Download URL: arabic_rag_kit-0.2.0.tar.gz
Upload date: Jul 2, 2026
Size: 25.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arabic_rag_kit-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`ea3a4cf16cf35e2a8077f044c335bab5536b0b8f84da0bf798bf96d65bc967e2`
MD5	`98242b05d9fcb3314eecf63da695870d`
BLAKE2b-256	`3d78ddd7a40214fba2b434617a75448185518186a98f7543144faacce57a5d9c`
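To check a downloaded file against the SHA256 digest in the table, something like the following works with the standard library alone. The local path is an assumption; point it at wherever you saved the sdist.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, block_size: int = 1 << 16) -> str:
    """Hash a file in blocks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

EXPECTED = "ea3a4cf16cf35e2a8077f044c335bab5536b0b8f84da0bf798bf96d65bc967e2"
sdist = Path("arabic_rag_kit-0.2.0.tar.gz")  # assumed download location

if sdist.exists():
    print("OK" if sha256_of(sdist) == EXPECTED else "MISMATCH")
else:
    print(f"{sdist} not found; download it first")
```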

Provenance

The following attestation bundles were made for arabic_rag_kit-0.2.0.tar.gz:

Publisher: publish.yml on GBMUAE/arabic-rag-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arabic_rag_kit-0.2.0.tar.gz
- Subject digest: ea3a4cf16cf35e2a8077f044c335bab5536b0b8f84da0bf798bf96d65bc967e2
- Sigstore transparency entry: 2048714863
- Sigstore integration time: Jul 2, 2026
Source repository:
- Permalink: GBMUAE/arabic-rag-kit@9fff748e39257f1b5db147a3f45d43ce32dff1ee
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/GBMUAE
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9fff748e39257f1b5db147a3f45d43ce32dff1ee
- Trigger Event: release

File details

Details for the file arabic_rag_kit-0.2.0-py3-none-any.whl.

File metadata

Download URL: arabic_rag_kit-0.2.0-py3-none-any.whl
Upload date: Jul 2, 2026
Size: 21.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arabic_rag_kit-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b74d246afad5a8ef450589dbe345abbf697770eaf43efbd45534694c366547ac`
MD5	`d9884e64f2612c75b6637cd3b2108a6e`
BLAKE2b-256	`4cd79a620ca560309edc19aeec2edf0ebeb866a2c87d23f386edd29922c702db`

Provenance

The following attestation bundles were made for arabic_rag_kit-0.2.0-py3-none-any.whl:

Publisher: publish.yml on GBMUAE/arabic-rag-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arabic_rag_kit-0.2.0-py3-none-any.whl
- Subject digest: b74d246afad5a8ef450589dbe345abbf697770eaf43efbd45534694c366547ac
- Sigstore transparency entry: 2048715066
- Sigstore integration time: Jul 2, 2026
Source repository:
- Permalink: GBMUAE/arabic-rag-kit@9fff748e39257f1b5db147a3f45d43ce32dff1ee
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/GBMUAE
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9fff748e39257f1b5db147a3f45d43ce32dff1ee
- Trigger Event: release
