Skip to main content

RAG ingestion pipeline — Chef → Chunker → Refinery → Porter

Project description

openingestion

Modular RAG ingestion pipeline — from raw documents to retrieval-ready chunks.

Fetcher → Chef → Chunker → Refinery → Porter

Version: 0.1.2 · Python: 3.10 – 3.13 · License: MIT


Overview

openingestion orchestrates the full journey from raw documents to enriched, retrieval-ready chunks through five composable stages:

Stage Classes Input → Output
Fetcher LocalFileFetcher, WebFetcher, SharepointFetcher Source → FetchedDocument[]
Chef MinerUChef, DoclingChef File/dir → ContentBlock[]
Chunker TokenChunker, SentenceChunker, SemanticChunker, SlumberChunker, BlockChunker, PageChunker, SectionChunker, RecursiveChunker ContentBlock[]RagChunk[]
Refinery RagRefinery, ContextualRagRefinery, VisionRefinery RagChunk[] → enriched RagChunk[]
Porter JSONPorter, to_dicts, to_langchain, to_llamaindex RagChunk[] → target format

Installation

From PyPI

pip install openingestion

The base install (loguru + chonkie-core) is intentionally minimal — no heavy ML dependencies. Parsers, tokenizers, and advanced chunkers are opt-in extras.

From source (editable)

git clone https://github.com/Isopope/openIngestion.git
cd openIngestion
pip install -e .

Windows / PowerShell

openingestion requires Python 3.10 – 3.13 (3.14+ not yet supported due to MinerU).

py -3.13 -m venv .venv
. .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e .

Optional extras

Extra Installs Use case
mineru mineru[pipeline]==3.0.4 GPU-accelerated PDF parsing (full layout analysis)
docling docling CPU-only PDF parsing (IBM Docling, no GPU required)
semantic sentence-transformers, numpy, scipy SemanticChunker — embedding-based splitting
slumber openai, pydantic, tenacity, tqdm SlumberChunker + OpenAIGenie — LLM-guided chunking
tiktoken tiktoken Exact OpenAI tokenizer (cl100k_base, o200k_base …)
hf-tokenizers tokenizers Fast HuggingFace tokenizers (Rust, BPE/WordPiece)
transformers transformers HuggingFace AutoTokenizer (full model loading)
langchain langchain-core output_format="langchain"
llamaindex llama-index-core output_format="llamaindex"
web playwright WebFetcher — render websites to PDF/HTML
sharepoint msal, office365-rest-python-client SharepointFetcher — Microsoft 365 / SharePoint

Convenience bundles:

# CPU pipeline (Docling + semantic + tiktoken)
pip install -e ".[cpu]"

# GPU pipeline (MinerU + semantic + tiktoken)
pip install -e ".[mineru,gpu]"

# Everything
pip install -e ".[all]"

Individual extras:

pip install -e ".[mineru]"          # MinerU parser (GPU recommended)
pip install -e ".[docling]"         # Docling parser (CPU)
pip install -e ".[semantic]"        # SemanticChunker
pip install -e ".[slumber]"         # SlumberChunker + OpenAI
pip install -e ".[web]"             # WebFetcher (then: playwright install chromium)
pip install -e ".[sharepoint]"      # SharepointFetcher

Quick start

from openingestion import ingest, ingest_from_output, ingest_from_json

# Parse a raw PDF with MinerU (requires [mineru] extra)
chunks = ingest("rapport.pdf")

# Skip re-parsing — reuse an existing MinerU output directory
chunks = ingest_from_output("./output/rapport/auto/")

# Load directly from a content_list.json
chunks = ingest_from_json("./output/rapport/auto/rapport_content_list.json")

# Use Docling instead of MinerU (CPU, no GPU needed)
chunks = ingest("rapport.pdf", parser="docling", strategy="by_sentence")

# Full control
chunks = ingest(
    "rapport.pdf",
    parser="mineru",                # or "docling"
    strategy="by_token",            # by_block | by_token | by_sentence | by_semantic | by_slumber
    max_tokens=512,
    overlap_tokens=64,
    image_mode="path",              # path | base64 | skip | ignore
    infer_captions=True,
    output_format="chunks",         # chunks | dicts | langchain | llamaindex
)

# Export to LangChain Documents
docs = ingest("rapport.pdf", output_format="langchain")

# Export to LlamaIndex TextNodes
nodes = ingest("rapport.pdf", output_format="llamaindex")

Fetchers

from openingestion.fetcher import LocalFileFetcher, WebFetcher, SharepointFetcher

# Local filesystem
fetcher = LocalFileFetcher(ext=[".pdf"])
docs = fetcher(dir="./inputs/")

# Website → PDF (requires [web] extra + playwright install chromium)
fetcher = WebFetcher(output_dir="./downloads/", mode="pdf")
docs = fetcher.fetch(urls=["https://example.com"])

# SharePoint / Microsoft 365 (requires [sharepoint] extra)
fetcher = SharepointFetcher(
    client_id="...", client_secret="...", tenant_id="...",
    output_dir="./downloads/",
)
docs = fetcher.fetch(site_url="https://tenant.sharepoint.com/sites/MySite")

Refineries

from openingestion.refinery import RagRefinery, ContextualRagRefinery, VisionRefinery
from openingestion.genie import OpenAIGenie

# Standard enrichment: token count, content hash, image paths
refinery = RagRefinery(output_dir="./output/doc/auto/", image_mode="path")
chunks = refinery.enrich(chunks)

# Contextual RAG: LLM-generated doc summary + per-chunk context
genie = OpenAIGenie(model="gpt-4o-mini", api_key="sk-...")
ctx_refinery = ContextualRagRefinery(genie=genie, generate_doc_summary=True)
chunks = ctx_refinery.enrich(chunks)

# Vision: extract text from scanned tables / images via GPT-4o
vision_refinery = VisionRefinery(genie=genie, only_if_empty=True)
chunks = vision_refinery.enrich(chunks)

Export

from openingestion.porter import JSONPorter

# JSONL (one chunk per line)
JSONPorter(lines=True)(chunks, file="output.jsonl")

# Pretty JSON array
JSONPorter(lines=False, indent=2)(chunks, file="output.json")

Architecture

Each stage follows a uniform Abstract Base Class pattern:

  • Abstract method: process() / chunk() / enrich() / export()
  • Batch processing: process_batch() / chunk_batch() / enrich_batch() / export_batch()
  • Callable shortcut: instance(input) == instance.main_method(input)
  • Unified logging via loguru

The three core data models flow through the whole pipeline:

FetchedDocument  →  ContentBlock  →  RagChunk
   (Fetcher)          (Chef)       (Chunker + Refinery)

BlockKind (TEXT, TITLE, TABLE, IMAGE, LIST, EQUATION, DISCARDED) is preserved from Chef through to the final export.


License

MIT — see LICENSE.

Note on optional dependencies: the [mineru] extra installs MinerU which is licensed under AGPL-3.0. Its licence terms apply when that extra is installed.

See specv3.md for full technical specifications.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

openingestion-0.1.3.tar.gz (82.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

openingestion-0.1.3-py3-none-any.whl (98.8 kB view details)

Uploaded Python 3

File details

Details for the file openingestion-0.1.3.tar.gz.

File metadata

  • Download URL: openingestion-0.1.3.tar.gz
  • Upload date:
  • Size: 82.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for openingestion-0.1.3.tar.gz
Algorithm Hash digest
SHA256 d99f5c0dc63f3ef332fb51ec9e6bc494237e81d4e3764d0fee0a766231e7cfcd
MD5 c6c1daabeb27b363d15dfa6e80907155
BLAKE2b-256 8ab4c42b065f53caf65a22aa4f3620d989db1dcfd48adcd01b3e68b81ba4af46

See more details on using hashes here.

File details

Details for the file openingestion-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: openingestion-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 98.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for openingestion-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 70ffd8e50ef5b09c0ce9a23fc8d5e59fdf74b3f043d43d8682adc9d257d32ff3
MD5 559c8c47d23a9be680b7a05611b0dc18
BLAKE2b-256 ae4067a28dec68dc9b42e97f9d948b0a29ebd3e5d18b14da02a17cd3c198316d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page