Arabic-first Retrieval-Augmented Generation (RAG) toolkit with local-first fallbacks

Maintainers

azion-labs

Arabic RAG Tools

Python 3.9+ License: MIT

Overview

A comprehensive toolkit for building Retrieval-Augmented Generation (RAG) systems that specialize in professional-grade Arabic text processing. It addresses the main problems conventional RAG systems face with Arabic, such as intelligent text chunking, searching words that carry diacritics, and handling right-to-left (RTL) text.

Key Features

Smart Arabic text chunking: handles Arabic inflections, prefixes, and suffixes efficiently
Arabic text normalization: removes diacritics, unifies alef forms, and handles tatweel
Normalized search, original display: normalization is used for matching only, while answers and sources are shown in the text's original orthography (hamzas and diacritics preserved)
Arabic embedding models: supports specialized models such as CAMeL and AraBART, plus multilingual models
Multi-agent systems: built-in tools for coordinating research, validation, and writing roles
Flexible model choice: supports OpenAI, Anthropic, and local models
Local by default: retrieval and answering run locally, with no API keys needed to get started
Multiple databases: supports in-memory storage, FAISS, and ChromaDB
Configuration from the environment: ArabicRAGPipeline.from_env() reads the settings documented in .env.example
Practical examples: real examples applied to Saudi documents and a complete processing system

Arabic RAG Toolkit

Arabic-first building blocks for retrieval, chunking, normalization, and answer generation with a local-first default path that works before you wire in hosted AI services.

Overview

A suite of tools for building Retrieval-Augmented Generation (RAG) systems optimized for Arabic text processing. The toolkit addresses three problems that general-purpose RAG systems handle poorly in Arabic: intelligent text chunking, diacritic-aware search, and proper right-to-left (RTL) text handling.

Key Features

Arabic-Aware Text Chunking: Intelligently handles Arabic morphology, prefixes, and suffixes
Arabic Text Normalization: Removes diacritics, normalizes alef variants, and handles tatweel
Normalized Matching, Original Display: Normalization is used for matching only; answers and sources keep the original orthography (hamzas and diacritics preserved)
Arabic Embedding Models: Supports CAMeL, AraBART, and multilingual embedding models
Multi-Agent Utilities: Built-in research, validation, and writing agents
Model Flexibility: Support for OpenAI, Anthropic, and local LLMs
Local-First Defaults: Works without API keys on day one using local fallbacks
Multiple Vector Stores: In-memory, FAISS, and ChromaDB support
Environment-Based Config: ArabicRAGPipeline.from_env() reads the settings documented in .env.example
Typed Package: Ships a py.typed marker for type-checker support
Practical Examples: Real-world examples with Saudi regulatory documents
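
The "Normalized Matching, Original Display" idea above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the toolkit's own code: `normalize_ar` and `search` are hypothetical names, and the character ranges cover only the common diacritics, tatweel, and the three hamza/madda alef variants.

```python
import re

# Hypothetical helpers for illustration; not the toolkit's implementation.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, short vowels, shadda, sukun, dagger alef
_TATWEEL = "\u0640"
_ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize_ar(text: str) -> str:
    """Strip diacritics and tatweel, and fold alef variants to bare alef."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_ALEF_VARIANTS)


def search(query: str, documents: list[str]) -> list[str]:
    """Match on normalized text, but return documents exactly as written."""
    needle = normalize_ar(query)
    return [doc for doc in documents if needle in normalize_ar(doc)]


documents = ["أَحْمَدُ مُدِيرُ الشَّرِكَةِ", "القانون التجاري السعودي"]
for hit in search("احمد", documents):
    print(hit)  # the stored string is printed unchanged, diacritics included
```

The normalized form is computed only for comparison and is never stored or displayed, which is what lets the original hamzas and diacritics survive in results.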

Why This Repo Exists

Most general-purpose RAG demos ignore Arabic normalization and chunking details.
New users should be able to run the project locally before configuring external APIs.
Hosted providers and external vector stores should be optional upgrades, not installation blockers.

Quick Start

Requirements

Python 3.9+
pip

Installation

git clone https://github.com/azizalzahrani/arabic-rag-toolkit.git
cd arabic-rag-toolkit
pip install .

PyPI Install

Version 0.1.1 is published on PyPI:

pip install arabic-rag-toolkit

Environment Setup

cp .env.example .env
# Optional: edit .env if you want to use OpenAI / Anthropic / Chroma / FAISS

Optional Extras

# Development tools
pip install -e ".[dev]"

# Sentence-transformers embeddings
pip install ".[embeddings]"

# OpenAI + Chroma example stack
pip install ".[openai,chroma,embeddings]"
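
Because the hosted providers and external vector stores are optional, it can help to check which of them are importable before choosing a backend. The sketch below uses only the standard library. The import names are the real module names of the packages listed under Requirements; the dictionary keys, `available_extras`, and `pick_vector_store` are hypothetical labels chosen for this example and are not part of the toolkit.

```python
from importlib.util import find_spec

# Label -> import name of the optional dependency. Labels are illustrative only.
OPTIONAL_MODULES = {
    "embeddings": "sentence_transformers",
    "chroma": "chromadb",
    "faiss": "faiss",
    "openai": "openai",
    "anthropic": "anthropic",
}


def available_extras() -> dict[str, bool]:
    """Report which optional dependencies can be imported in this environment."""
    return {label: find_spec(module) is not None for label, module in OPTIONAL_MODULES.items()}


def pick_vector_store(preferred: str = "chroma") -> str:
    """Fall back to the in-memory store when the preferred backend is missing."""
    return preferred if available_extras().get(preferred, False) else "memory"


print(available_extras())
print(pick_vector_store("faiss"))
```

The returned string could then be passed as the `vector_store` argument shown in the usage examples.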

Usage Examples

Example 1: Simple RAG System

from arabic_rag.pipeline import ArabicRAGPipeline

# Set up a RAG pipeline that runs locally without API keys
pipeline = ArabicRAGPipeline(
    vector_store="memory",
    llm_provider="local"
)

# Add documents
documents = [
    "نظام الشركات السعودي ينص على أن رأس مال الشركة المساهمة لا يقل عن خمسة ملايين ريال سعودي",
    "يجب أن يكون لدى الشركة مجلس إدارة يتكون من ثلاثة أعضاء على الأقل",
    "للمساهمين الحق في حضور الجمعية العامة والتصويت على القرارات"
]
pipeline.add_documents(documents)

# Retrieve relevant passages, then generate an answer from them
query = "كم هو الحد الأدنى لرأس مال الشركة المساهمة؟"
results = pipeline.retrieve(query)
answer = pipeline.generate_answer(results, query)
print(f"Answer: {answer}")
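
To make the "memory" store and local retrieval less of a black box, here is a toy version of the idea in pure Python. It is a sketch under stated assumptions and not how the toolkit is implemented: character trigram counts stand in for real embeddings, and `InMemoryStore`, `char_ngrams`, and `cosine` are hypothetical names.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams; a crude stand-in for an embedding."""
    compact = " ".join(text.split())
    return Counter(compact[i:i + n] for i in range(max(len(compact) - n + 1, 1)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Hypothetical toy store: documents and their vectors live in a list."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, documents: list[str]) -> None:
        self._items.extend((doc, char_ngrams(doc)) for doc in documents)

    def retrieve(self, query: str, top_k: int = 2) -> list[tuple[float, str]]:
        """Return the top_k (score, document) pairs, best match first."""
        query_vec = char_ngrams(query)
        scored = [(cosine(query_vec, vec), doc) for doc, vec in self._items]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


store = InMemoryStore()
store.add([
    "رأس مال الشركة المساهمة لا يقل عن خمسة ملايين ريال سعودي",
    "للمساهمين الحق في حضور الجمعية العامة والتصويت على القرارات",
])
for score, doc in store.retrieve("كم هو الحد الأدنى لرأس مال الشركة المساهمة؟", top_k=1):
    print(f"{score:.2f}  {doc}")
```

A real embedding model captures meaning rather than surface overlap, which is why the sentence-transformers extra is the upgrade path once the local flow works.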

Example 2: Multi-Agent RAG

from arabic_rag.agents.multi_agent_crew import setup_crew
from arabic_rag.pipeline import ArabicRAGPipeline

# Set up the base pipeline
pipeline = ArabicRAGPipeline(
    vector_store="memory",
    llm_provider="local",
    verbose=True,
)

# Set up the agent crew
crew = setup_crew(pipeline)

# Run the research task
task = "ابحث عن المتطلبات القانونية لتسجيل شركة جديدة في السعودية وقدم ملخصاً شاملاً"
result = crew.execute_task(task, top_k=3)
print(result["final_answer"])
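
The research, validation, and writing roles can be pictured as three plain steps run in sequence. The sketch below is an assumption-laden illustration, not the behavior of `setup_crew`: `run_crew` and `CrewResult` are hypothetical, the validator is a naive shared-word check, and the writer simply joins the accepted passages.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CrewResult:
    evidence: list[str] = field(default_factory=list)
    rejected: list[str] = field(default_factory=list)
    final_answer: str = ""


def run_crew(task: str, retrieve: Callable[[str, int], list[str]], top_k: int = 3) -> CrewResult:
    """Research -> validate -> write, as three plain steps."""
    # Research: gather candidate passages for the task.
    candidates = retrieve(task, top_k)

    # Validate: keep passages that share at least one whole word with the task.
    task_words = set(task.split())
    result = CrewResult()
    for passage in candidates:
        bucket = result.evidence if task_words & set(passage.split()) else result.rejected
        bucket.append(passage)

    # Write: compose the answer only from validated evidence.
    if result.evidence:
        result.final_answer = " ".join(result.evidence)
    else:
        result.final_answer = "No supporting passage was found."
    return result


def fake_retrieve(task: str, top_k: int) -> list[str]:
    """Stand-in retriever so the sketch runs without a pipeline."""
    corpus = ["تسجيل شركة جديدة يتطلب سجلاً تجارياً", "حالة الطقس اليوم"]
    return corpus[:top_k]


outcome = run_crew("متطلبات تسجيل شركة جديدة", fake_retrieve)
print(outcome.final_answer)
```

Keeping the validator between retrieval and writing means the final answer can only cite passages that passed the check, whatever that check is in a real system.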

Example 3: Arabic Text Processing

from arabic_rag.preprocessor import ArabicTextPreprocessor
from arabic_rag.chunker import ArabicTextChunker

# Normalize the text
preprocessor = ArabicTextPreprocessor()
text = "اَلسَّلامُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكاتُهُ"
normalized = preprocessor.normalize(text)
print(f"Normalized text: {normalized}")  # السلام عليكم ورحمة الله وبركاته

# Arabic-aware chunking
chunker = ArabicTextChunker()
document = "القانون التجاري السعودي يحدد الأطر القانونية لجميع العمليات التجارية. المادة الأولى تنص على حقوق التجار..."
chunks = chunker.chunk(document)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk}")
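
For intuition about what a sentence-aware chunker does, the following sketch packs whole sentences into size-limited chunks and repeats the last sentence across a boundary when it fits. It is not `ArabicTextChunker`: `chunk_sentences` is a hypothetical function, it ignores morphology entirely, and it measures overlap in sentences, which may differ from how the toolkit interprets CHUNK_OVERLAP.

```python
import re

# Split after Latin or Arabic sentence punctuation: . ! ? plus Arabic ؟ ؛ ۔
_SENTENCE_END = re.compile(r"(?<=[.!?\u061F\u061B\u06D4])\s+")


def chunk_sentences(text: str, chunk_size: int = 300, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks of up to chunk_size characters.

    A single sentence longer than chunk_size becomes its own chunk uncut.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > chunk_size:
            chunks.append(" ".join(current))
            carried = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would push the next chunk over the limit.
            if len(" ".join(carried + [sentence])) > chunk_size:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "القانون التجاري يحدد الأطر القانونية. المادة الأولى تنص على حقوق التجار. المادة الثانية تحدد الالتزامات."
for number, chunk in enumerate(chunk_sentences(sample, chunk_size=80), start=1):
    print(f"Chunk {number}: {chunk}")
```

Splitting on sentence punctuation rather than at a fixed character offset avoids cutting an Arabic word between its prefix and stem, which is the simplest form of the problem the chunker addresses.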

Project Structure

arabic-rag-toolkit/
├── .github/
│   └── workflows/
│       ├── tests.yml           # GitHub Actions CI (Python 3.9 - 3.13)
│       └── release.yml         # PyPI trusted publishing on tags
├── README.md
├── CONTRIBUTING.md
├── RELEASING.md
├── LICENSE
├── MANIFEST.in
├── pyproject.toml
├── setup.py
├── requirements.txt
├── .gitignore
├── .env.example                # Variables read by from_env()
├── arabic_rag/
│   ├── __init__.py
│   ├── py.typed                # Type-checking support marker
│   ├── chunker.py              # Arabic text chunking
│   ├── embeddings.py           # Arabic embedding models
│   ├── retriever.py            # Document retrieval
│   ├── generator.py            # Answer generation
│   ├── pipeline.py             # End-to-end RAG pipeline
│   ├── preprocessor.py         # Arabic text normalization
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── research_agent.py   # Research agent
│   │   ├── validator_agent.py  # Validation agent
│   │   ├── writer_agent.py     # Writer agent
│   │   └── multi_agent_crew.py # Agent crew orchestration
│   └── utils/
│       ├── __init__.py
│       └── arabic_utils.py     # Arabic helper utilities
├── examples/
│   ├── basic_rag.py            # Simple RAG example
│   ├── multi_agent_rag.py      # Multi-agent example
│   └── saudi_regulations.py    # Saudi document processing
├── tests/
│   ├── __init__.py
│   ├── test_api_compatibility.py
│   ├── test_chunker.py
│   ├── test_embeddings.py
│   ├── test_generator.py
│   ├── test_preprocessor.py
│   └── test_pipeline.py
└── docs/
    └── ARCHITECTURE_AR.md      # Architecture documentation

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      User Query (Arabic)                        │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
                ┌────────────────────────────┐
                │   Arabic Preprocessor      │
                │  (Normalize, Remove Tash.) │
                └────────────────┬───────────┘
                                 │
                    ┌────────────┴────────────┐
                    ▼                         ▼
         ┌──────────────────┐      ┌──────────────────┐
         │  Arabic Chunker  │      │ Arabic Embeddings│
         │ (RTL-Aware)      │      │ (CAMeL/AraBART)  │
         └──────────┬───────┘      └────────┬─────────┘
                    │                       │
                    └───────────┬───────────┘
                                ▼
                      ┌──────────────────┐
                      │  Vector Store    │
                      │ (FAISS/ChromaDB) │
                      └────────┬─────────┘
                               │
                               ▼
                      ┌──────────────────┐
                      │   Retriever      │
                      └────────┬─────────┘
                               │
           ┌───────────────────┼───────────────────┐
           ▼                   ▼                   ▼
     ┌────────────┐      ┌────────────┐      ┌────────────┐
     │ Researcher │      │ Validator  │      │   Writer   │
     │   Agent    │      │   Agent    │      │   Agent    │
     └─────┬──────┘      └─────┬──────┘      └─────┬──────┘
           │                   │                   │
           └───────────────────┼───────────────────┘
                               ▼
                      ┌──────────────────┐
                      │   LLM Response   │
                      │(OpenAI/Anthropic)│
                      └────────┬─────────┘
                               │
                               ▼
                      ┌──────────────────┐
                      │  Final Answer    │
                      │    (Arabic)      │
                      └──────────────────┘

Configuration

The variables in .env.example are read by ArabicRAGPipeline.from_env(). Export them in your shell (or load the file with python-dotenv) and build the pipeline in one line:

from arabic_rag import ArabicRAGPipeline

pipeline = ArabicRAGPipeline.from_env()

`.env.example`

# LLM APIs
OPENAI_API_KEY=your-openai-key-here
ANTHROPIC_API_KEY=your-anthropic-key-here

# Embedding Model (requires the [embeddings] extra)
EMBEDDING_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

# Vector Store Choice
VECTOR_STORE=memory  # Options: memory, chroma, faiss

# LLM Provider
LLM_PROVIDER=local  # Options: local, openai, anthropic

# Model Names (optional; defaults: gpt-4o-mini / claude-sonnet-4-6)
# OPENAI_MODEL=gpt-4o-mini
# ANTHROPIC_MODEL=claude-sonnet-4-6

# Vector Store Path
VECTOR_STORE_PATH=./data/vector_store

# Chunk Settings
CHUNK_SIZE=300
CHUNK_OVERLAP=50

# Retrieval and Generation
TOP_K=5
TEMPERATURE=0.3
MAX_TOKENS=2000
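
As a rough picture of what environment-based configuration involves, the sketch below reads the variables listed above into a typed settings object. It is an illustration only and does not reproduce `ArabicRAGPipeline.from_env()`: `Settings` and `load_settings` are hypothetical, the fallback values are copied from the sample file rather than from the library, and the validation rules are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    vector_store: str
    llm_provider: str
    chunk_size: int
    chunk_overlap: int
    top_k: int
    temperature: float
    max_tokens: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default), using the sample values as fallbacks."""
    env = os.environ if env is None else env
    settings = Settings(
        vector_store=env.get("VECTOR_STORE", "memory"),
        llm_provider=env.get("LLM_PROVIDER", "local"),
        chunk_size=int(env.get("CHUNK_SIZE", "300")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "50")),
        top_k=int(env.get("TOP_K", "5")),
        temperature=float(env.get("TEMPERATURE", "0.3")),
        max_tokens=int(env.get("MAX_TOKENS", "2000")),
    )
    # Assumed sanity checks; the toolkit may validate differently.
    if settings.vector_store not in {"memory", "chroma", "faiss"}:
        raise ValueError(f"Unknown VECTOR_STORE: {settings.vector_store!r}")
    if settings.llm_provider not in {"local", "openai", "anthropic"}:
        raise ValueError(f"Unknown LLM_PROVIDER: {settings.llm_provider!r}")
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings


print(load_settings({}))
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test without touching the real process environment.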

Requirements

Python 3.9+
numpy for the core local/offline path
sentence-transformers for real embedding models
chromadb or faiss-cpu for external vector stores
openai or anthropic only if you want hosted LLM generation

Package metadata and optional extras are defined in pyproject.toml.

Contributing

We welcome contributions! Please read our CONTRIBUTING.md guide for details on our code of conduct and the process for submitting pull requests.

Development Setup

# Clone the repository
git clone https://github.com/azizalzahrani/arabic-rag-toolkit.git
cd arabic-rag-toolkit

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies, then the package in editable mode with the dev tools
pip install -r requirements.txt
pip install -e ".[dev]"

# Run tests
pytest -v

Releases

Maintainer release steps are documented in RELEASING.md.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support & Contact

Issues: GitHub Issues
Author: @azizalzahrani
Questions: Open a GitHub issue with a minimal reproduction and expected behavior

Acknowledgments

CAMeL Lab for Arabic NLP research
Hugging Face for transformer models
All contributors and users

Roadmap

Arabic-specific fine-tuned embedding models
Support for dialects (Egyptian, Levantine, Gulf)
Integration with more Arabic NLP libraries (Farasa, RichArabic)
Multilingual RAG support
Web UI for document management
Benchmark suite for Arabic RAG systems

Version: 0.1.1 | Last Updated: 2026-06-13 | Status: Active Development

Release history

0.1.1 (this version), released Jun 12, 2026

Download files

Source Distribution

arabic_rag_toolkit-0.1.1.tar.gz (55.3 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

arabic_rag_toolkit-0.1.1-py3-none-any.whl (48.3 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file arabic_rag_toolkit-0.1.1.tar.gz.

File metadata

Download URL: arabic_rag_toolkit-0.1.1.tar.gz
Upload date: Jun 12, 2026
Size: 55.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arabic_rag_toolkit-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`0e1b40347093514a5ca8f672a3e318888c850e79a7dd6775a6199d9c8dadbfc8`
MD5	`6db2d245e606c3dd8d9740283b42e75a`
BLAKE2b-256	`0270feac38fcc13a92ae62664b16186d64b9e9920762011371f5d8ec96c5c3b6`
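
A downloaded archive can be checked against the SHA256 digest above with the standard library. This is a minimal sketch: `sha256_of` and `verify` are hypothetical helper names, and the file path assumes the archive sits in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA256 with the published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


EXPECTED = "0e1b40347093514a5ca8f672a3e318888c850e79a7dd6775a6199d9c8dadbfc8"
archive = Path("arabic_rag_toolkit-0.1.1.tar.gz")  # assumed download location
if archive.exists():
    print("OK" if verify(archive, EXPECTED) else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

SHA256 is the digest to rely on here; MD5 is listed for completeness but is not collision resistant.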

Provenance

The following attestation bundles were made for arabic_rag_toolkit-0.1.1.tar.gz:

Publisher: release.yml on azizalzahrani/arabic-rag-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arabic_rag_toolkit-0.1.1.tar.gz
- Subject digest: 0e1b40347093514a5ca8f672a3e318888c850e79a7dd6775a6199d9c8dadbfc8
- Sigstore transparency entry: 1805963459
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: azizalzahrani/arabic-rag-toolkit@d32b25cebe147975a2634c1c89ab6cae21ff6a71
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/azizalzahrani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d32b25cebe147975a2634c1c89ab6cae21ff6a71
- Trigger Event: push

File details

Details for the file arabic_rag_toolkit-0.1.1-py3-none-any.whl.

File metadata

Download URL: arabic_rag_toolkit-0.1.1-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 48.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arabic_rag_toolkit-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`91cd93c267658c0ac043c51ae43b8f40b702c8cd43a2007c3f788d47b57c3ab0`
MD5	`a50eaf6412aac706c1d9bddf5610f0b1`
BLAKE2b-256	`5593425d8d5f54d0f0cbbb2ca0ad0241075bc3821433c425009f497af4ee9932`

Provenance

The following attestation bundles were made for arabic_rag_toolkit-0.1.1-py3-none-any.whl:

Publisher: release.yml on azizalzahrani/arabic-rag-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arabic_rag_toolkit-0.1.1-py3-none-any.whl
- Subject digest: 91cd93c267658c0ac043c51ae43b8f40b702c8cd43a2007c3f788d47b57c3ab0
- Sigstore transparency entry: 1805963507
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: azizalzahrani/arabic-rag-toolkit@d32b25cebe147975a2634c1c89ab6cae21ff6a71
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/azizalzahrani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d32b25cebe147975a2634c1c89ab6cae21ff6a71
- Trigger Event: push
