High-quality document processing for RAG pipelines, supporting multiple formats and processing backends

ingest

High-quality document processing CLI for RAG pipelines. Process PDFs, Office documents, images, and more into markdown, JSON, HTML, or RAG-optimized chunks.

Features

Standalone CLI Command - A single ingest command
4 Output Formats - markdown, json, html, chunks (RAG-optimized)
3 Converters - Specialized pdf, table, and ocr processing
LLM Enhancement - Optional LLM pass that raises table accuracy from 81% to 91%
Multi-Worker - Parallel batch processing
20+ Options - Fine-grained control over processing

Quick Start

Installation

Using uv (recommended - fastest!)

# Basic installation (fast, lightweight)
uv pip install ingest-cli

# With marker-pdf for high-quality processing
# (extras are quoted so shells such as zsh do not treat the brackets as a glob)
uv pip install "ingest-cli[marker]"

# With LLM support
uv pip install "ingest-cli[llm]"

# Full installation (everything)
uv pip install "ingest-cli[full]"

Using pip

# Basic installation
pip install ingest-cli

# With marker-pdf
pip install "ingest-cli[marker]"

# Full installation
pip install "ingest-cli[full]"

Install from source

git clone https://github.com/therealtimex/ingest.git
cd ingest

# Basic installation (lightweight, no marker-pdf)
uv pip install -e .

# With marker-pdf for high-quality processing
uv pip install -e ".[marker]"

# Full installation with all features
uv pip install -e ".[full]"

Basic Usage

# Process a document
ingest document.pdf

# Process a directory into RAG-optimized chunks
ingest ./documents --output-format chunks --batch-mode

# Extract tables with LLM
ingest report.pdf --converter-type table --use-llm

# View help
ingest --help
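
When the CLI is driven from a script, it can be useful to confirm that the ingest executable is installed before starting a batch. The following is a minimal Python sketch that relies only on the `ingest --help` invocation shown above; the helper name `ingest_available` is hypothetical and not part of the package.

```python
import shutil
import subprocess


def ingest_available(command: str = "ingest") -> bool:
    """Return True if `command` is on PATH and `command --help` exits with 0."""
    executable = shutil.which(command)
    if executable is None:
        # Nothing with that name is installed in the current environment.
        return False
    try:
        result = subprocess.run(
            [executable, "--help"],
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("ingest available:", ingest_available())
```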

Common Use Cases

1. RAG System Preparation

ingest ./knowledge_base \
    --output-format chunks \
    --batch-mode \
    --workers 4

Output: Pre-chunked JSON optimized for embeddings and retrieval.
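
The same invocation can be assembled from Python when ingestion is one step in a larger pipeline. This is a sketch under stated assumptions: it uses only the flags shown in this README (`--output-format`, `--batch-mode`, `--workers`), the function names `build_ingest_command` and `run_ingest` are hypothetical, and it assumes the CLI follows the usual convention of a non-zero exit code on failure.

```python
import subprocess

# The four output formats listed in this README.
OUTPUT_FORMATS = ("markdown", "json", "html", "chunks")


def build_ingest_command(source, output_format="chunks", batch_mode=True, workers=None):
    """Build an argv list for the ingest CLI using only flags shown in the README."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    if workers is not None and workers < 1:
        raise ValueError("workers must be a positive integer")

    command = ["ingest", str(source), "--output-format", output_format]
    if batch_mode:
        command.append("--batch-mode")
    if workers is not None:
        command += ["--workers", str(workers)]
    return command


def run_ingest(source, **options):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero code."""
    return subprocess.run(build_ingest_command(source, **options), check=True)


if __name__ == "__main__":
    # Prints the argv list without running anything.
    print(build_ingest_command("./knowledge_base", workers=4))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.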

2. Table Extraction

ingest financial_reports/ \
    --converter-type table \
    --use-llm \
    --output-format json \
    --batch-mode

Output: High-accuracy table data in JSON format.

3. OCR Scanned Documents

ingest scanned_docs/ \
    --force-ocr \
    --output-format markdown \
    --batch-mode

Output: Clean markdown from scanned PDFs.

Output Formats

  • markdown: Clean markdown with proper formatting
  • json: Structured JSON with full metadata
  • html: Web-ready HTML with embedded images
  • chunks: RAG-optimized pre-chunked JSON for vector databases

Performance

| Workers | VRAM  | Throughput (H100) |
|---------|-------|-------------------|
| 1       | 5 GB  | ~30 pages/sec     |
| 4       | 20 GB | ~120 pages/sec    |
| 8       | 40 GB | ~240 pages/sec    |
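
The figures above scale linearly: roughly 5 GB of VRAM and ~30 pages/sec per worker on an H100. The sketch below turns that into a rough sizing estimate. It is an extrapolation from the three rows in the table, not a measured model, and the function names are hypothetical; real throughput will depend on the hardware and the documents.

```python
# Per-worker figures read off the performance table (H100); treated as linear.
VRAM_GB_PER_WORKER = 5
PAGES_PER_SEC_PER_WORKER = 30


def max_workers_for_vram(available_vram_gb: float) -> int:
    """Largest worker count whose estimated VRAM use fits within the budget."""
    return int(available_vram_gb // VRAM_GB_PER_WORKER)


def estimated_seconds(total_pages: int, workers: int) -> float:
    """Rough processing time in seconds, assuming linear scaling with workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return total_pages / (workers * PAGES_PER_SEC_PER_WORKER)


if __name__ == "__main__":
    workers = max_workers_for_vram(24)  # e.g. a 24 GB budget
    print(workers, estimated_seconds(12_000, workers))
```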

Requirements

  • Python 3.10+
  • Optional: GPU for faster processing (CPU mode available)

Environment Variables

# PyTorch device
export TORCH_DEVICE=cuda  # or cpu, mps

# LLM API keys (optional, for enhanced accuracy)
export GOOGLE_API_KEY="your-gemini-key"
export ANTHROPIC_API_KEY="your-claude-key"
export OPENAI_API_KEY="your-openai-key"
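
Before a long batch run, a script can report which of these variables are actually set. This is a minimal sketch that only reads the variable names listed above; the helper `summarize_environment` is hypothetical, and how the CLI behaves when `TORCH_DEVICE` is unset or when several keys are present is not documented here.

```python
import os

# Variable names taken from the README; nothing else is inspected.
LLM_KEY_VARIABLES = ("GOOGLE_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def summarize_environment(env=None):
    """Report the requested PyTorch device and which LLM key variables are non-empty."""
    env = os.environ if env is None else env
    return {
        # None means the variable is unset; the CLI's default is not assumed here.
        "torch_device": env.get("TORCH_DEVICE"),
        # Only names are returned, never the key values themselves.
        "llm_keys": [name for name in LLM_KEY_VARIABLES if env.get(name)],
    }


if __name__ == "__main__":
    print(summarize_environment())
```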

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Built with ❤️ by RealTimeX
