ragctl - Production-ready RAG toolkit with advanced OCR, semantic chunking, and intelligent document processing
ragctl
Production-ready document processing CLI for RAG applications
Process documents, extract text with advanced OCR, chunk intelligently, and prepare data for RAG systems - all from the command line with ragctl.
What is ragctl?
ragctl is a command-line tool for processing documents into chunks ready for Retrieval-Augmented Generation (RAG) systems. It handles the dirty work of document ingestion, OCR, and intelligent chunking so you can focus on building your RAG application.
Key capabilities:
- Universal document loading (PDF, DOCX, images, HTML, Markdown, etc.)
- Advanced OCR with automatic fallback (EasyOCR → PaddleOCR → pytesseract)
- Intelligent semantic chunking using LangChain
- Production-ready batch processing with auto-retry
- Multiple export formats (JSON, JSONL, CSV)
- Direct ingestion into Qdrant vector store
Features
Universal Document Processing
- Supported formats: PDF, DOCX, ODT, TXT, HTML, Markdown, Images (JPEG, PNG)
- Smart OCR cascade:
  - EasyOCR (best quality, multi-language)
  - PaddleOCR (fast, good for complex layouts)
  - pytesseract (fallback, most tolerant)
- Quality detection: Automatically rejects unreadable documents
- Multi-language: French, English, German, Spanish, Italian, Portuguese, and more
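The OCR cascade described above amounts to trying each engine in order and keeping the first usable result. A minimal sketch of that pattern, assuming each engine is wrapped as a plain callable; `run_cascade`, the three stand-in engines, and the `min_chars` check are illustrative names, not ragctl's actual API:

```python
from typing import Callable, Optional, Sequence, Tuple

# Each engine is a (name, callable) pair; the callable takes a file path
# and returns extracted text. These wrappers are hypothetical stand-ins.
Engine = Tuple[str, Callable[[str], str]]


def run_cascade(
    path: str, engines: Sequence[Engine], min_chars: int = 1
) -> Optional[Tuple[str, str]]:
    """Try engines in order; return (engine_name, text) for the first usable result."""
    for name, extract in engines:
        try:
            text = extract(path)
        except Exception:
            # Engine crashed or is unavailable: fall through to the next one.
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    return None  # every engine failed or produced too little text


def broken_engine(path: str) -> str:
    raise RuntimeError("engine unavailable")


def empty_engine(path: str) -> str:
    return "   "


def working_engine(path: str) -> str:
    return "Invoice 2024-001"


result = run_cascade(
    "scan.png",
    [("first", broken_engine), ("second", empty_engine), ("third", working_engine)],
)
print(result)
```

Returning the engine name alongside the text makes it possible to record which engine produced each document, which may help when auditing OCR quality.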
Intelligent Chunking
- Semantic chunking: Context-aware text splitting using LangChain RecursiveCharacterTextSplitter
- Multiple strategies:
  - semantic - Smart splitting by meaning (default)
  - sentence - Split by sentences
  - token - Fixed token-based splitting
- Configurable: Token limits (50-2000), overlap (0-500), model selection
- Rich metadata: Source file, chunk index, token count, strategy, timestamps
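To make the token limit and overlap settings concrete, here is a rough sketch of fixed-size splitting with overlap. It uses whitespace-separated words as a stand-in for real tokens, and `chunk_tokens` is a hypothetical helper rather than ragctl's own splitter (which, per the list above, builds on LangChain):

```python
from typing import List


def chunk_tokens(text: str, max_tokens: int = 400, overlap: int = 50) -> List[str]:
    """Split text into windows of at most max_tokens whitespace tokens,
    with `overlap` tokens shared between consecutive windows."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_tokens(sample, max_tokens=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With `max_tokens=4` and `overlap=1`, each chunk starts on the last word of the previous one, so context at chunk boundaries is preserved.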
Production-Ready Batch Processing
- Automatic retry: Up to 3 attempts with exponential backoff (1s, 2s, 4s...)
- Interactive error handling:
  - interactive - Prompt user on each error (default)
  - auto-continue - Continue on errors (CI/CD mode)
  - auto-stop - Stop on first error (validation mode)
  - auto-skip - Skip failed files automatically
- Complete history: Every run saved to ~/.ragctl/history/
- Retry capability: ragctl retry to rerun failed files only
- Per-file output: One chunk file per document for better traceability
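The automatic retry behaviour can be pictured as a loop that doubles its wait after each failure. This is a sketch of the general pattern under the stated limits (3 attempts, 1 second base delay), not ragctl's internal code; `with_retry` and its parameters are hypothetical names:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with delays base_delay * 2**attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


calls = {"n": 0}
delays = []


def flaky() -> str:
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary failure")
    return "ok"


# Record the delays instead of actually sleeping, so the demo runs instantly.
outcome = with_retry(flaky, sleep=delays.append)
print(outcome, delays)
```

Passing `sleep` as a parameter is a design choice for testability: the demo records the requested delays rather than waiting for them.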
Flexible Export & Storage
- Export formats: JSON, JSONL (streaming), CSV (Excel-compatible)
- Vector store integration: Direct ingestion into Qdrant
- No database required: Pure file-based export for easy sharing
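As an illustration of how the JSONL and CSV formats differ, the sketch below serializes the same two chunk records both ways using only the standard library. The field names (`source`, `chunk_index`, `token_count`, `text`) are assumptions chosen to mirror the metadata listed earlier; the actual ragctl schema may differ:

```python
import csv
import io
import json

# Hypothetical chunk records; real ragctl output may use other field names.
chunks = [
    {"source": "report.pdf", "chunk_index": 0, "token_count": 3, "text": "First chunk text"},
    {"source": "report.pdf", "chunk_index": 1, "token_count": 4, "text": "Second chunk, with comma"},
]

# JSONL: one JSON object per line, so a consumer can stream records one at a time.
jsonl_text = "\n".join(json.dumps(c, ensure_ascii=False) for c in chunks) + "\n"

# CSV: a header row plus one row per chunk; the csv module quotes embedded commas.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(chunks[0].keys()))
writer.writeheader()
writer.writerows(chunks)
csv_text = buf.getvalue()

print(jsonl_text)
print(csv_text)
```

JSONL keeps nested or typed values intact, while CSV flattens everything to strings, which is worth keeping in mind when choosing a format for downstream tools.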
Configuration System
- Hierarchical config: CLI flags > Environment variables > YAML file > Defaults
- Example config: config.example.yml with detailed documentation
- Easy customization: Override any setting via command line
Quick Start
Installation
From PyPI (Recommended)
# Install from PyPI
pip install ragctl
# Verify installation
ragctl --version
From Source
# Clone repository
git clone git@github.com:datallmhub/ragctl.git
cd ragctl
# Install with pip
pip install -e .
# Verify installation
ragctl --version
Basic Usage
# Process a single document
ragctl chunk document.pdf --show
# Process with advanced OCR for scanned documents
ragctl chunk scanned.pdf --advanced-ocr -o chunks.json
# Batch process a folder
ragctl batch ./documents --output ./chunks/
# Preview files before processing (dry-run)
ragctl batch ./documents --dry-run
# Batch that continues past errors (CI/CD mode)
ragctl batch ./documents --output ./chunks/ --auto-continue
# Quiet mode (errors only)
ragctl batch ./documents -q
# Verbose mode (debug info)
ragctl chunk document.pdf -v
Usage Examples
Single Document Processing
# Simple text file
ragctl chunk document.txt --show
# PDF with semantic chunking (default)
ragctl chunk report.pdf -o report_chunks.json
# Scanned image with OCR
ragctl chunk contract.jpeg --advanced-ocr --show
# Custom chunking parameters
ragctl chunk document.pdf \
--strategy semantic \
--max-tokens 500 \
--overlap 100 \
-o output.jsonl
Batch Processing
# Process all files in a directory
ragctl batch ./documents --output ./chunks/
# Preview files without processing (dry-run)
ragctl batch ./documents --dry-run
# Output:
# Would process 15 files:
# ├── report.pdf (2.3 MB)
# ├── contract.docx (156 KB)
# └── notes.txt (12 KB)
# Total: 15 files, 45.2 MB
# Process only PDFs recursively
ragctl batch ./documents \
--pattern "*.pdf" \
--recursive \
--output ./chunks/
# CI/CD mode - continue on errors (quiet mode)
ragctl batch ./documents \
--output ./chunks/ \
--auto-continue \
--quiet \
--save-history
# Verbose mode for debugging
ragctl batch ./documents -v --output ./chunks/
# Per-file output (default):
# chunks/
# ├── doc1_chunks.jsonl (25 chunks)
# ├── doc2_chunks.jsonl (42 chunks)
# └── doc3_chunks.jsonl (18 chunks)
# Single-file output (all chunks combined):
ragctl batch ./documents \
--output ./all_chunks.jsonl \
--single-file
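Once a batch run has written per-file output, a short script can summarize it. The sketch below counts records in every `*.jsonl` file of an output directory, assuming one JSON object per line; `count_chunks` is a hypothetical helper, and the demo writes its own sample files to a temporary directory rather than relying on real ragctl output:

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def count_chunks(output_dir: Path) -> Dict[str, int]:
    """Map each *.jsonl file in output_dir to its number of JSON records."""
    counts = {}
    for path in sorted(output_dir.glob("*.jsonl")):
        n = 0
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed record
                    n += 1
        counts[path.name] = n
    return counts


# Demo with made-up files named after the per-file layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "doc1_chunks.jsonl").write_text(
        '{"text": "a"}\n{"text": "b"}\n', encoding="utf-8"
    )
    (out / "doc2_chunks.jsonl").write_text(
        '{"text": "c"}\n{"text": "d"}\n{"text": "e"}\n', encoding="utf-8"
    )
    counts = count_chunks(out)

print(counts)
```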
Retry Failed Files
# Show last failed run
ragctl retry --show
# Retry all failed files from last run
ragctl retry
# Retry specific run by ID
ragctl retry run_20251028_133403
Vector Store Integration
# Ingest chunks into Qdrant
ragctl ingest chunks.jsonl \
--collection my-docs \
--url http://localhost:6333
# Get system info
ragctl info
Evaluate Chunking Quality
# Evaluate chunking strategy
ragctl eval document.pdf \
--strategies semantic sentence token \
--metrics coverage overlap coherence
# Compare strategies with visualization
ragctl eval document.pdf --compare --output eval_results.json
Documentation
| Document | Description |
|---|---|
| Getting Started | Installation and first steps |
| CLI Guide | Complete command reference |
| Security | Security features and best practices |
| Full Documentation | Complete documentation index |
Configuration
Create ~/.ragctl/config.yml or use CLI flags:
# OCR settings
ocr:
use_advanced_ocr: false
enable_fallback: true
# Chunking settings
chunking:
strategy: semantic
max_tokens: 400
overlap: 50
# Output settings
output:
format: jsonl
include_metadata: true
pretty_print: true
Configuration hierarchy: CLI flags > Environment variables > YAML config > Defaults
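The precedence order can be sketched as a lookup that checks each source from highest to lowest priority. This is only an illustration of the idea: the `resolve` function and the `RAGCTL_` environment variable prefix are assumptions, not documented ragctl behaviour, and the defaults simply reuse the values from the YAML example above:

```python
from typing import Any, Mapping, Optional

DEFAULTS = {"strategy": "semantic", "max_tokens": 400, "overlap": 50}


def resolve(
    key: str,
    cli: Mapping[str, Optional[Any]],
    env: Mapping[str, str],
    yaml_cfg: Mapping[str, Any],
    defaults: Mapping[str, Any] = DEFAULTS,
    env_prefix: str = "RAGCTL_",  # hypothetical prefix, chosen for this sketch
) -> Any:
    """Return the value for key using CLI > environment > YAML > defaults."""
    if cli.get(key) is not None:
        return cli[key]
    env_name = env_prefix + key.upper()
    if env_name in env:
        # Environment values are strings; coerce to the type of the default.
        return type(defaults[key])(env[env_name])
    if key in yaml_cfg:
        return yaml_cfg[key]
    return defaults[key]


cli_flags = {"max_tokens": None, "overlap": 100}  # None means "flag not given"
environment = {"RAGCTL_MAX_TOKENS": "800", "RAGCTL_OVERLAP": "10"}
yaml_config = {"max_tokens": 600, "strategy": "sentence"}

resolved = {k: resolve(k, cli_flags, environment, yaml_config) for k in DEFAULTS}
print(resolved)
```

Here `overlap` comes from the CLI flag, `max_tokens` from the environment (because no flag was given), and `strategy` from the YAML file.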
Testing
# Run all tests
make test
# Run CLI tests
make test-cli
# Quick validation
ragctl --version
ragctl chunk tests/data/sample.txt --show
Test Coverage: 496 tests, 41% coverage
Performance
Processing Speed
- Text documents: ~100-200 docs/minute
- PDFs with OCR: ~5-10 docs/minute (depends on page count)
- Batch processing: Parallel-ready with retry mechanism
Quality Metrics
- OCR accuracy: 95%+ with EasyOCR on clear scans
- Chunk quality: 90% readability threshold enforced
- Semantic coherence: LangChain's RecursiveCharacterTextSplitter optimized for context
CLI Commands
| Command | Description |
|---|---|
| ragctl chunk | Process a single document |
| ragctl batch | Batch process multiple files |
| ragctl retry | Retry failed files from history |
| ragctl ingest | Ingest chunks into Qdrant |
| ragctl eval | Evaluate chunking quality |
| ragctl info | System information |
Run ragctl COMMAND --help for detailed options.
Troubleshooting
Common Issues
NumPy incompatibility
# For OCR support, use NumPy 1.x
pip install "numpy<2.0"
Missing system dependencies
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils
# macOS
brew install tesseract poppler
"Document unreadable" errors
- Try lowering the quality threshold: --ocr-threshold 0.2
- Use advanced OCR: --advanced-ocr
- Check that the document is not corrupted
Import errors
# Reinstall dependencies
pip install -e .
More help: Getting Started Guide
Development
# Install dev dependencies
make install-dev
# Format code
make format
# Run linters
make lint
# Install pre-commit hooks
make pre-commit-install
# Run all CI checks
make ci-all
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
Support
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Acknowledgments
Built with:
- LangChain - Text splitting and document loading
- EasyOCR - OCR engine
- PaddleOCR - Alternative OCR engine
- Unstructured - Document parsing
- Typer - CLI framework
- Rich - Terminal formatting
Version: 0.1.5 | Status: Beta | License: MIT