High-quality document processing for RAG pipelines, supporting multiple formats and processing backends
ingest
High-quality document processing CLI for RAG pipelines. Process PDFs, Office documents, images, and more into markdown, JSON, HTML, or RAG-optimized chunks.
Features
✅ Standalone CLI Command - Simple ingest command
✅ 4 Output Formats - markdown, json, html, chunks (RAG-optimized)
✅ 3 Converters - pdf, table, and ocr, each for specialized processing
✅ LLM Enhancement - Optional AI boost (81% → 91% table accuracy)
✅ Multi-Worker - Parallel batch processing
✅ 20+ Options - Full control over processing
Quick Start
Installation
Using uv (recommended - fastest!)
# Basic installation (fast, lightweight)
uv pip install ingest-cli
# With marker-pdf for high-quality processing
# (quote the extras so shells such as zsh do not expand the brackets)
uv pip install "ingest-cli[marker]"
# With LLM support
uv pip install "ingest-cli[llm]"
# Full installation (everything)
uv pip install "ingest-cli[full]"
Using pip
# Basic installation
pip install ingest-cli
# With marker-pdf
pip install "ingest-cli[marker]"
# Full installation
pip install "ingest-cli[full]"
Install from source
git clone https://github.com/therealtimex/ingest.git
cd ingest
# Basic installation (lightweight, no marker-pdf)
uv pip install -e .
# With marker-pdf for high-quality processing
uv pip install -e ".[marker]"
# Full installation with all features
uv pip install -e ".[full]"
Basic Usage
# Process a document
ingest document.pdf
# Process for RAG
ingest ./documents --output-format chunks --batch-mode
# Extract tables with LLM
ingest report.pdf --converter-type table --use-llm
# View help
ingest --help
Common Use Cases
1. RAG System Preparation
ingest ./knowledge_base \
  --output-format chunks \
  --batch-mode \
  --workers 4
Output: Pre-chunked JSON optimized for embeddings and retrieval.
2. Table Extraction
ingest financial_reports/ \
  --converter-type table \
  --use-llm \
  --output-format json \
  --batch-mode
Output: High-accuracy table data in JSON format.
3. OCR Scanned Documents
ingest scanned_docs/ \
  --force-ocr \
  --output-format markdown \
  --batch-mode
Output: Clean markdown from scanned PDFs.
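The three use cases above differ only in which flags are combined, so they can be scripted from Python. The sketch below is a hypothetical helper, not part of ingest-cli: `build_ingest_command` and its defaults are assumptions, and it uses only flags that appear in this README. The `pdf` and `ocr` values for `--converter-type` are inferred from the feature list rather than from a documented example.

```python
import shlex
from typing import List, Optional

# Values taken from the Features and Output Formats sections of this README.
OUTPUT_FORMATS = {"markdown", "json", "html", "chunks"}
CONVERTER_TYPES = {"pdf", "table", "ocr"}  # "pdf" and "ocr" as flag values are assumed


def build_ingest_command(
    source: str,
    output_format: str = "markdown",  # default chosen for this sketch only
    converter_type: Optional[str] = None,
    use_llm: bool = False,
    force_ocr: bool = False,
    batch_mode: bool = False,
    workers: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for the ingest CLI (hypothetical helper)."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format!r}")
    if converter_type is not None and converter_type not in CONVERTER_TYPES:
        raise ValueError(f"unknown converter type: {converter_type!r}")
    if workers is not None and workers < 1:
        raise ValueError("workers must be at least 1")

    cmd = ["ingest", source, "--output-format", output_format]
    if converter_type is not None:
        cmd += ["--converter-type", converter_type]
    if use_llm:
        cmd.append("--use-llm")
    if force_ocr:
        cmd.append("--force-ocr")
    if batch_mode:
        cmd.append("--batch-mode")
    if workers is not None:
        cmd += ["--workers", str(workers)]
    return cmd


if __name__ == "__main__":
    # Rebuilds the "RAG System Preparation" command and prints it for inspection.
    rag_cmd = build_ingest_command(
        "./knowledge_base", output_format="chunks", batch_mode=True, workers=4
    )
    print(shlex.join(rag_cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once `ingest` is installed and on `PATH`; printing it first makes it easy to compare against the commands shown above.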
Output Formats
- markdown: Clean markdown with proper formatting
- json: Structured JSON with full metadata
- html: Web-ready HTML with embedded images
- chunks: RAG-optimized pre-chunked JSON for vector databases
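This README does not document the schema of the `chunks` output, so the following is only a sketch of how pre-chunked JSON might be fed to an embedding step. It assumes the payload is either a list of chunk objects or an object with a `chunks` list, and that each chunk stores its text under a `text` key. Both the layout and the key name are hypothetical; inspect a real output file and adjust `text_key` accordingly.

```python
import json
from typing import Any, Iterator


def iter_chunk_texts(payload: Any, text_key: str = "text") -> Iterator[str]:
    """Yield non-empty chunk texts from parsed JSON.

    The accepted layouts (a bare list, or {"chunks": [...]}) and the
    default key name are assumptions, not the documented ingest schema.
    """
    chunks = payload if isinstance(payload, list) else payload.get("chunks", [])
    for chunk in chunks:
        text = chunk.get(text_key)
        if text:
            yield text


def load_chunk_texts(path: str, text_key: str = "text") -> list:
    """Read one JSON file and return its chunk texts as a list."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_chunk_texts(json.load(handle), text_key))
```

Keeping the key name as a parameter confines the schema assumption to one place, so the helper can be adapted after checking a real output file.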
Performance
| Workers | VRAM | Throughput (H100) |
|---|---|---|
| 1 | 5 GB | ~30 pages/sec |
| 4 | 20 GB | ~120 pages/sec |
| 8 | 40 GB | ~240 pages/sec |
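The table works out to roughly 5 GB of VRAM and about 30 pages/sec per worker on an H100. The sketch below turns that into a rough sizing calculation. It assumes the linear scaling holds between and beyond the listed rows, which the table suggests but does not guarantee, and the function names are illustrative only.

```python
# Per-worker figures read off the Performance table (H100 only, approximate).
VRAM_PER_WORKER_GB = 5
PAGES_PER_SEC_PER_WORKER = 30


def max_workers_for_vram(vram_gb: float) -> int:
    """Largest worker count whose estimated VRAM use fits in vram_gb."""
    return int(vram_gb // VRAM_PER_WORKER_GB)


def estimated_seconds(pages: int, workers: int) -> float:
    """Rough wall-clock estimate, assuming throughput scales linearly."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return pages / (workers * PAGES_PER_SEC_PER_WORKER)


if __name__ == "__main__":
    workers = max_workers_for_vram(24)  # e.g. a 24 GB budget
    print(workers, estimated_seconds(12_000, workers))
```

Other GPUs, CPU mode, and LLM-enhanced runs will not match these numbers, so the result is best treated as a planning estimate and not a benchmark.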
Requirements
- Python 3.10+
- Optional: GPU for faster processing (CPU mode available)
Environment Variables
# PyTorch device
export TORCH_DEVICE=cuda # or cpu, mps
# LLM API keys (optional, for enhanced accuracy)
export GOOGLE_API_KEY="your-gemini-key"
export ANTHROPIC_API_KEY="your-claude-key"
export OPENAI_API_KEY="your-openai-key"
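Before a run with `--use-llm`, it can help to confirm which of these variables are actually set. The check below is a small sketch and not part of ingest-cli. The variable names come from this section, while the helper names and the choice to reject unlisted device values are assumptions.

```python
import os
from typing import List, Mapping, Optional

# Names as listed in the Environment Variables section.
LLM_KEY_VARS = ("GOOGLE_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")
KNOWN_DEVICES = ("cuda", "cpu", "mps")


def configured_llm_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of LLM key variables that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in LLM_KEY_VARS if env.get(name)]


def selected_torch_device(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return TORCH_DEVICE if set; raise on a value this README does not list."""
    env = os.environ if env is None else env
    value = env.get("TORCH_DEVICE")
    if value is None:
        return None
    if value not in KNOWN_DEVICES:
        raise ValueError(f"unexpected TORCH_DEVICE: {value!r}")
    return value


if __name__ == "__main__":
    print("LLM keys set:", configured_llm_keys() or "none")
    print("TORCH_DEVICE:", selected_torch_device())
```

Only the variable names are reported, never their values, so the output is safe to paste into logs or bug reports.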
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Support
- Issues: GitHub Issues
- Repository: github.com/therealtimex/ingest
Built with ❤️ by RealTimeX
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file ingest_cli-1.0.2.tar.gz.
File metadata
- Download URL: ingest_cli-1.0.2.tar.gz
- Upload date:
- Size: 17.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cb7638de77d34764bdc01340ee10614b45328d70a3428a8793fd88e42db4f9fb |
| MD5 | c4a8678d75743be36bbb07815d59891b |
| BLAKE2b-256 | ca5f6725bfc4c87f910c8fec5448125a0d896ce6b5c87ec5def83215f92032a5 |
File details
Details for the file ingest_cli-1.0.2-py3-none-any.whl.
File metadata
- Download URL: ingest_cli-1.0.2-py3-none-any.whl
- Upload date:
- Size: 15.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5f3c80fef32f99e91009ff683044c6b141bd72058be691d639afb2f851be9de2 |
| MD5 | c858340ea66a5a28116eeb44fe339090 |
| BLAKE2b-256 | e15dd404a9b75f5876e81255cae434ab262f47ce56756ddfa5c7d4c958309561 |
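A downloaded file can be checked against the SHA256 digests listed above with the standard library alone. The sketch below is one way to do that. The digests are copied from the two hash tables, while the helper names are illustrative and not part of ingest-cli.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file hash tables above.
PUBLISHED_SHA256 = {
    "ingest_cli-1.0.2.tar.gz":
        "cb7638de77d34764bdc01340ee10614b45328d70a3428a8793fd88e42db4f9fb",
    "ingest_cli-1.0.2-py3-none-any.whl":
        "5f3c80fef32f99e91009ff683044c6b141bd72058be691d639afb2f851be9de2",
}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path: str) -> bool:
    """Compare a local file against the digest published for its file name."""
    name = Path(path).name
    if name not in PUBLISHED_SHA256:
        raise KeyError(f"no published digest for {name!r}")
    return sha256_of(path) == PUBLISHED_SHA256[name]
```

For example, `matches_published_digest("ingest_cli-1.0.2.tar.gz")` returns `True` only if the local copy is byte-identical to the published source distribution.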