Intelligent RAG system with advanced OCR, semantic chunking, and production-ready batch processing CLI

🚀 Atlas-RAG

Production-ready document processing CLI for RAG applications

Process documents, extract text with advanced OCR, chunk intelligently, and prepare data for RAG systems - all from the command line.

🎯 What is Atlas-RAG?

Atlas-RAG is a command-line tool for processing documents into chunks ready for Retrieval-Augmented Generation (RAG) systems. It handles the dirty work of document ingestion, OCR, and intelligent chunking so you can focus on building your RAG application.

Key capabilities:

📄 Universal document loading (PDF, DOCX, images, HTML, Markdown, etc.)
🔍 Advanced OCR with automatic fallback (EasyOCR → PaddleOCR → pytesseract)
✂️ Intelligent semantic chunking using LangChain
📦 Production-ready batch processing with auto-retry
💾 Multiple export formats (JSON, JSONL, CSV)
🗄️ Direct ingestion into Qdrant vector store

✨ Features

📄 Universal Document Processing

Supported formats: PDF, DOCX, ODT, TXT, HTML, Markdown, Images (JPEG, PNG)
Smart OCR cascade:
1. EasyOCR (best quality, multi-language)
2. PaddleOCR (fast, good for complex layouts)
3. pytesseract (fallback, most tolerant)
Quality detection: Automatically rejects unreadable documents
Multi-language: French, English, German, Spanish, Italian, Portuguese, and more
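
The cascade and quality check described above can be pictured as a simple fallback loop. The sketch below is illustrative only and is not Atlas-RAG's implementation: the engine callables, the `readable_ratio` heuristic, and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical engine signature: takes a file path, returns extracted text.
OcrEngine = Callable[[str], str]


def readable_ratio(text: str) -> float:
    """Share of non-whitespace characters that are letters or digits (assumed heuristic)."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c.isalnum() for c in visible) / len(visible)


def ocr_with_fallback(
    path: str,
    engines: Sequence[Tuple[str, OcrEngine]],
    threshold: float = 0.5,
) -> Optional[Tuple[str, str]]:
    """Try each engine in order; return (engine_name, text) for the first readable result."""
    for name, engine in engines:
        try:
            text = engine(path)
        except Exception:
            # An engine that crashes is treated like one that produced nothing.
            continue
        if readable_ratio(text) >= threshold:
            return name, text
    return None  # every engine failed: the document is rejected as unreadable
```

Passing the engines as an ordered list keeps the priority (EasyOCR, then PaddleOCR, then pytesseract) in one place and lets stub functions stand in for the real engines during tests.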

✂️ Intelligent Chunking

Semantic chunking: Context-aware text splitting using LangChain's RecursiveCharacterTextSplitter, which prefers to split at paragraph and line breaks before falling back to words and characters
Multiple strategies:
- semantic - Smart splitting by meaning (default)
- sentence - Split by sentences
- token - Fixed token-based splitting
Configurable: Token limits (50-2000), overlap (0-500), model selection
Rich metadata: Source file, chunk index, token count, strategy, timestamps
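
To make the token limit, overlap, and metadata settings concrete, here is a minimal sketch of fixed-size token chunking with overlap. It is not the tool's code: whitespace-separated words stand in for model tokens, and the metadata field names (`source_file`, `chunk_index`, `token_count`, `strategy`, `created_at`) are assumed from the description above rather than taken from the real output schema.

```python
from datetime import datetime, timezone
from typing import Dict, List


def chunk_tokens(text: str, source: str, max_tokens: int = 400, overlap: int = 50) -> List[Dict]:
    """Split text into windows of at most max_tokens tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()  # assumption: whitespace tokens stand in for model tokens
    step = max_tokens - overlap
    chunks: List[Dict] = []
    for index, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + max_tokens]
        chunks.append({
            "text": " ".join(window),
            "source_file": source,
            "chunk_index": index,
            "token_count": len(window),
            "strategy": "token",
            "created_at": datetime.now(timezone.utc).isoformat(),
        })
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

Each window starts `max_tokens - overlap` tokens after the previous one, so the trailing tokens of one chunk reappear at the start of the next and context is not cut at a hard boundary.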

🔄 Production-Ready Batch Processing

Automatic retry: Up to 3 attempts with exponential backoff (1s, 2s, 4s...)
Error handling modes:
- interactive - Prompt user on each error (default)
- auto-continue - Continue on errors (CI/CD mode)
- auto-stop - Stop on first error (validation mode)
- auto-skip - Skip failed files automatically
Complete history: Every run saved to ~/.atlasrag/history/
Retry capability: atlas-rag retry to rerun failed files only
Per-file output: One chunk file per document for better traceability
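
The retry policy above (up to 3 attempts, delays doubling from 1s) follows the usual exponential backoff pattern. The helper below is a generic sketch of that pattern, not Atlas-RAG's internal code; the function name and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with delays of base_delay * 2**n seconds."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

With three attempts only the 1s and 2s waits actually occur; a 4s wait would appear only if the attempt limit were raised. Injecting `sleep` makes the helper testable without real delays.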

💾 Flexible Export & Storage

Export formats: JSON, JSONL (streaming), CSV (Excel-compatible)
Vector store integration: Direct ingestion into Qdrant
No database required: Pure file-based export for easy sharing

⚙️ Configuration System

Hierarchical config: CLI flags > Environment variables > YAML file > Defaults
Example config: config.example.yml with detailed documentation
Easy customization: Override any setting via command line

🚀 Quick Start

Installation

# Clone repository
git clone git@github.com:horiz-data/atlas-rag.git
cd atlas-rag

# Install with pip
pip install -e .

# Verify installation
atlas-rag --version

Basic Usage

# Process a single document
atlas-rag chunk document.pdf --show

# Process with advanced OCR for scanned documents
atlas-rag chunk scanned.pdf --advanced-ocr -o chunks.json

# Batch process a folder
atlas-rag batch ./documents --output ./chunks/

# Batch in CI/CD mode (continue on errors)
atlas-rag batch ./documents --output ./chunks/ --auto-continue

💡 Usage Examples

Single Document Processing

# Simple text file
atlas-rag chunk document.txt --show

# PDF with semantic chunking (default)
atlas-rag chunk report.pdf -o report_chunks.json

# Scanned image with OCR
atlas-rag chunk contract.jpeg --advanced-ocr --show

# Custom chunking parameters
atlas-rag chunk document.pdf \
  --strategy semantic \
  --max-tokens 500 \
  --overlap 100 \
  -o output.jsonl

Batch Processing

# Process all files in a directory
atlas-rag batch ./documents --output ./chunks/

# Process only PDFs recursively
atlas-rag batch ./documents \
  --pattern "*.pdf" \
  --recursive \
  --output ./chunks/

# CI/CD mode - continue on errors
atlas-rag batch ./documents \
  --output ./chunks/ \
  --auto-continue \
  --save-history

# Per-file output (default):
# chunks/
# ├── doc1_chunks.jsonl  (25 chunks)
# ├── doc2_chunks.jsonl  (42 chunks)
# └── doc3_chunks.jsonl  (18 chunks)

# Single-file output (all chunks combined):
atlas-rag batch ./documents \
  --output ./all_chunks.jsonl \
  --single-file
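
Once a batch run has written its JSONL files, downstream code can read them line by line. The sketch below assumes only that each non-empty line is one JSON object, which is what the JSONL format implies; the `*_chunks.jsonl` names come from the listing above, while the helper names and any record fields are illustrative.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_chunks(path: Path) -> Iterator[Dict]:
    """Yield one record per non-empty line of a JSONL file, or of every *.jsonl file in a directory."""
    files = sorted(path.glob("*.jsonl")) if path.is_dir() else [path]
    for file in files:
        with file.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def chunk_counts(directory: Path) -> Dict[str, int]:
    """Count chunks per output file, for example to sanity-check a batch run."""
    return {
        file.name: sum(1 for _ in iter_chunks(file))
        for file in sorted(directory.glob("*.jsonl"))
    }
```

Because `iter_chunks` is a generator, large outputs are streamed one record at a time instead of being loaded into memory at once.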

Retry Failed Files

# Show last failed run
atlas-rag retry --show

# Retry all failed files from last run
atlas-rag retry

# Retry specific run by ID
atlas-rag retry run_20251028_133403

Vector Store Integration

# Ingest chunks into Qdrant
atlas-rag ingest chunks.jsonl \
  --collection my-docs \
  --url http://localhost:6333

# Get system info
atlas-rag info

Evaluate Chunking Quality

# Evaluate several chunking strategies
atlas-rag eval document.pdf \
  --strategies semantic sentence token \
  --metrics coverage overlap coherence

# Compare strategies with visualization
atlas-rag eval document.pdf --compare --output eval_results.json

📚 Documentation

Document	Description
Getting Started	Installation and first steps
CLI Guide	Complete command reference
Security	Security features and best practices
Full Documentation	Complete documentation index

⚙️ Configuration

Create ~/.atlasrag/config.yml or use CLI flags:

# OCR settings
ocr:
  use_advanced_ocr: false
  enable_fallback: true

# Chunking settings
chunking:
  strategy: semantic
  max_tokens: 400
  overlap: 50

# Output settings
output:
  format: jsonl
  include_metadata: true
  pretty_print: true

Configuration hierarchy: CLI flags > Environment variables > YAML config > Defaults
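
One way to picture this precedence is as a series of dictionary merges applied from lowest to highest priority. The sketch below is not the tool's loader: how environment variables and CLI flags are parsed into nested dictionaries is left out, and both function names are hypothetical.

```python
from collections.abc import Mapping
from typing import Any, Dict


def deep_merge(base: Mapping, override: Mapping) -> Dict[str, Any]:
    """Return a copy of base updated with override, merging nested mappings key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(defaults: Mapping, yaml_file: Mapping, env: Mapping, cli: Mapping) -> Dict[str, Any]:
    """Apply layers from lowest to highest priority: defaults < YAML < environment < CLI."""
    config: Dict[str, Any] = {}
    for layer in (defaults, yaml_file, env, cli):
        config = deep_merge(config, layer)
    return config
```

A setting given in several layers takes its value from the highest-priority layer that defines it, while keys that a layer does not mention keep the value from the layers below.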

🧪 Testing

# Run all tests
make test

# Run CLI tests
make test-cli

# Quick validation
atlas-rag --version
atlas-rag chunk tests/data/sample.txt --show

Test Coverage: 129 tests, 96% coverage

📊 Performance

Processing Speed

Text documents: ~100-200 docs/minute
PDFs with OCR: ~5-10 docs/minute (depends on page count)
Batch processing: Parallel-ready with retry mechanism

Quality Metrics

OCR accuracy: 95%+ with EasyOCR on clear scans
Chunk quality: 90% readability threshold enforced
Semantic coherence: LangChain's RecursiveCharacterTextSplitter optimized for context

🛠️ CLI Commands

Command	Description
`atlas-rag chunk`	Process a single document
`atlas-rag batch`	Batch process multiple files
`atlas-rag retry`	Retry failed files from history
`atlas-rag ingest`	Ingest chunks into Qdrant
`atlas-rag eval`	Evaluate chunking quality
`atlas-rag info`	System information

Run atlas-rag COMMAND --help for detailed options.

🐛 Troubleshooting

Common Issues

NumPy incompatibility

# For OCR support, use NumPy 1.x
pip install "numpy<2.0"

Missing system dependencies

# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler

"Document unreadable" errors

Try lowering the quality threshold: --ocr-threshold 0.2
Use advanced OCR: --advanced-ocr
Check that the document is not corrupted

Import errors

# Reinstall dependencies
pip install -e .

More help: Getting Started Guide

🔧 Development

# Install dev dependencies
make install-dev

# Format code
make format

# Run linters
make lint

# Install pre-commit hooks
make pre-commit-install

# Run all CI checks
make ci-all

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines (coming soon).

📧 Support

Documentation: docs/
Issues: GitHub Issues
Discussions: GitHub Discussions

🙏 Acknowledgments

Built with:

LangChain - Text splitting and document loading
EasyOCR - OCR engine
PaddleOCR - Alternative OCR engine
Unstructured - Document parsing
Typer - CLI framework
Rich - Terminal formatting

Version: 0.1.0 | Status: Beta | License: MIT

Release history

Version	Release date
0.1.5	Jan 8, 2026
0.1.3	Oct 30, 2025
0.1.2	Oct 29, 2025
0.1.1	Oct 29, 2025
0.1.0 (this version)	Oct 28, 2025

Download files

Source Distribution

ragctl-0.1.0.tar.gz (214.6 kB), uploaded Oct 28, 2025, Source

Built Distribution

ragctl-0.1.0-py3-none-any.whl (266.9 kB), uploaded Oct 28, 2025, Python 3

File details

Details for the file ragctl-0.1.0.tar.gz.

File metadata

Download URL: ragctl-0.1.0.tar.gz
Upload date: Oct 28, 2025
Size: 214.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for ragctl-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1c0a675fd06f75f592da6991d74648c6a650506afc35d14f35393feb3f4b548a`
MD5	`0e933cba5ad001e8d7f5cd51028d59ad`
BLAKE2b-256	`be1a23fd06dc04ad10bec073951cbe0389d88beed894952deaef2e95f09c68db`

File details

Details for the file ragctl-0.1.0-py3-none-any.whl.

File metadata

Download URL: ragctl-0.1.0-py3-none-any.whl
Upload date: Oct 28, 2025
Size: 266.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for ragctl-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9287a2e613aa443356516e6d1dbd97642ef484d78e2d45eaa8e9eb311f4d1309`
MD5	`9ac8346265b1e07bb35aabf9877a4832`
BLAKE2b-256	`c8b57434a1d982edf1a88f2e5b4887e61005a330a5ed3541ff378ecc8cec1da9`
