openingestion

Modular RAG ingestion pipeline — from raw documents to retrieval-ready chunks.

Fetcher → Chef → Chunker → Refinery → Porter

Version: 0.1.4 · Python: 3.10 – 3.13 · License: MIT


Overview

openingestion orchestrates the full journey from raw documents to enriched, retrieval-ready chunks through five composable stages:

Stage    | Classes                                                                                                                      | Input → Output
Fetcher  | LocalFileFetcher, WebFetcher, SharepointFetcher                                                                              | Source → FetchedDocument[]
Chef     | MinerUChef, DoclingChef                                                                                                      | File/dir → ContentBlock[]
Chunker  | TokenChunker, SentenceChunker, SemanticChunker, SlumberChunker, BlockChunker, PageChunker, SectionChunker, RecursiveChunker | ContentBlock[] → RagChunk[]
Refinery | RagRefinery, ContextualRagRefinery, VisionRefinery                                                                           | RagChunk[] → enriched RagChunk[]
Porter   | JSONPorter, to_dicts, to_langchain, to_llamaindex                                                                            | RagChunk[] → target format
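
The ingest() helper shown in Quick start wires these stages for you, but they can also be composed by hand. The sketch below is illustrative only: the class names come from the table above and each stage is used through its callable shortcut (see Architecture), while the Chef/Chunker import paths, constructor arguments, and the exact input each callable accepts are assumptions, not documented API.

from openingestion.fetcher import LocalFileFetcher
from openingestion.refinery import RagRefinery
from openingestion.porter import JSONPorter

# Module paths for Chef and Chunker are assumed for this sketch.
from openingestion.chef import DoclingChef
from openingestion.chunker import TokenChunker

fetcher = LocalFileFetcher(ext=[".pdf"])
chef = DoclingChef()                                        # constructor arguments assumed
chunker = TokenChunker(max_tokens=512, overlap_tokens=64)   # parameter names assumed
refinery = RagRefinery(image_mode="path")

all_chunks = []
for doc in fetcher(dir="./inputs/"):                # Source → FetchedDocument[]
    blocks = chef(doc)                              # → ContentBlock[]
    all_chunks.extend(refinery(chunker(blocks)))    # → enriched RagChunk[]

JSONPorter(lines=True)(all_chunks, file="output.jsonl")     # → JSONL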

Installation

From PyPI

pip install openingestion

The base install (loguru + chonkie-core) is intentionally minimal — no heavy ML dependencies. Parsers, tokenizers, and advanced chunkers are opt-in extras.
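
Because the heavier parsers are opt-in, callers sometimes pick a parser based on what is actually installed. A minimal sketch using only the standard library; the fallback logic is illustrative (openingestion does not do this for you), and it assumes MinerU's import name is mineru:

from importlib.util import find_spec

from openingestion import ingest

# Prefer the GPU-backed MinerU parser when its extra is present, otherwise fall back to Docling.
parser = "mineru" if find_spec("mineru") else "docling"
chunks = ingest("rapport.pdf", parser=parser)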

From source (editable)

git clone https://github.com/Isopope/openIngestion.git
cd openIngestion
pip install -e .

Windows / PowerShell

openingestion requires Python 3.10 – 3.13 (3.14+ not yet supported due to MinerU).

py -3.13 -m venv .venv
. .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e .

Optional extras

Extra         | Installs                              | Use case
mineru        | mineru[pipeline]==3.1.0               | GPU-accelerated PDF, Office, and image parsing (full layout analysis)
docling       | docling                               | CPU-only PDF parsing (IBM Docling, no GPU required)
semantic      | sentence-transformers, numpy, scipy   | SemanticChunker — embedding-based splitting
slumber       | openai, pydantic, tenacity, tqdm      | SlumberChunker + OpenAIGenie — LLM-guided chunking
tiktoken      | tiktoken                              | Exact OpenAI tokenizer (cl100k_base, o200k_base …)
hf-tokenizers | tokenizers                            | Fast HuggingFace tokenizers (Rust, BPE/WordPiece)
transformers  | transformers                          | HuggingFace AutoTokenizer (full model loading)
langchain     | langchain-core                        | output_format="langchain"
llamaindex    | llama-index-core                      | output_format="llamaindex"
web           | playwright                            | WebFetcher — render websites to PDF/HTML
sharepoint    | msal, office365-rest-python-client    | SharepointFetcher — Microsoft 365 / SharePoint

Convenience bundles:

# CPU pipeline (Docling + semantic + tiktoken)
pip install -e ".[cpu]"

# GPU pipeline (MinerU + semantic + tiktoken)
pip install -e ".[mineru,gpu]"

# Everything
pip install -e ".[all]"

Individual extras:

pip install -e ".[mineru]"          # MinerU parser (GPU recommended)
pip install -e ".[docling]"         # Docling parser (CPU)
pip install -e ".[semantic]"        # SemanticChunker
pip install -e ".[slumber]"         # SlumberChunker + OpenAI
pip install -e ".[web]"             # WebFetcher (then: playwright install chromium)
pip install -e ".[sharepoint]"      # SharepointFetcher

Quick start

from openingestion import ingest, ingest_from_output, ingest_from_json

# Parse a raw PDF with MinerU (requires [mineru] extra)
chunks = ingest("rapport.pdf")

# Parse native Office documents with MinerU 3.1+
chunks = ingest("support.docx")
chunks = ingest("deck.pptx")
chunks = ingest("tableau.xlsx")

# Skip re-parsing — reuse an existing MinerU output directory
chunks = ingest_from_output("./output/rapport/auto/")

# Load directly from a content_list.json
chunks = ingest_from_json("./output/rapport/auto/rapport_content_list.json")

# Use Docling instead of MinerU (CPU, no GPU needed)
chunks = ingest("rapport.pdf", parser="docling", strategy="by_sentence")

# Full control
chunks = ingest(
    "rapport.pdf",
    parser="mineru",                # or "docling"
    strategy="by_token",            # by_block | by_token | by_sentence | by_semantic | by_slumber
    max_tokens=512,
    overlap_tokens=64,
    image_mode="path",              # path | base64 | skip | ignore
    infer_captions=True,
    output_format="chunks",         # chunks | dicts | langchain | llamaindex
)

# Export to LangChain Documents
docs = ingest("rapport.pdf", output_format="langchain")

# Export to LlamaIndex TextNodes
nodes = ingest("rapport.pdf", output_format="llamaindex")

Fetchers

from openingestion.fetcher import LocalFileFetcher, WebFetcher, SharepointFetcher

# Local filesystem
fetcher = LocalFileFetcher(ext=[".pdf"])
docs = fetcher(dir="./inputs/")

# Website → PDF (requires [web] extra + playwright install chromium)
fetcher = WebFetcher(output_dir="./downloads/", mode="pdf")
docs = fetcher.fetch(urls=["https://example.com"])

# SharePoint / Microsoft 365 (requires [sharepoint] extra)
fetcher = SharepointFetcher(
    client_id="...", client_secret="...", tenant_id="...",
    output_dir="./downloads/",
)
docs = fetcher.fetch(site_url="https://tenant.sharepoint.com/sites/MySite")
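
Fetched documents can then be handed to the rest of the pipeline. A small sketch, assuming each FetchedDocument exposes the local path of the downloaded file (the path attribute below is illustrative, not a documented field):

from openingestion import ingest
from openingestion.fetcher import LocalFileFetcher

fetcher = LocalFileFetcher(ext=[".pdf"])
all_chunks = []
for doc in fetcher(dir="./inputs/"):
    # doc.path is a hypothetical attribute; adapt it to the real FetchedDocument field.
    all_chunks.extend(ingest(doc.path, parser="docling", strategy="by_sentence"))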

Refineries

from openingestion.refinery import RagRefinery, ContextualRagRefinery, VisionRefinery
from openingestion.genie import OpenAIGenie

# Standard enrichment: token count, content hash, image paths
refinery = RagRefinery(output_dir="./output/doc/auto/", image_mode="path")
chunks = refinery.enrich(chunks)

# Contextual RAG: LLM-generated doc summary + per-chunk context
genie = OpenAIGenie(model="gpt-4o-mini", api_key="sk-...")
ctx_refinery = ContextualRagRefinery(genie=genie, generate_doc_summary=True)
chunks = ctx_refinery.enrich(chunks)

# Vision: extract text from scanned tables / images via GPT-4o
vision_refinery = VisionRefinery(genie=genie, only_if_empty=True)
chunks = vision_refinery.enrich(chunks)

Export

from openingestion.porter import JSONPorter

# JSONL (one chunk per line)
JSONPorter(lines=True)(chunks, file="output.jsonl")

# Pretty JSON array
JSONPorter(lines=False, indent=2)(chunks, file="output.json")
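
Besides JSONPorter, the overview table lists to_dicts, to_langchain, and to_llamaindex under the Porter stage. A minimal sketch, assuming they are importable from openingestion.porter and take the chunk list directly (these signatures are assumptions; the output_format argument of ingest() shown in Quick start is the documented route):

from openingestion.porter import to_dicts, to_langchain, to_llamaindex

records = to_dicts(chunks)      # plain dicts, e.g. for a vector-store client
docs = to_langchain(chunks)     # requires the [langchain] extra
nodes = to_llamaindex(chunks)   # requires the [llamaindex] extra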

Architecture

Each stage follows a uniform Abstract Base Class pattern (a sketch follows the list below):

  • Abstract method: process() / chunk() / enrich() / export()
  • Batch processing: process_batch() / chunk_batch() / enrich_batch() / export_batch()
  • Callable shortcut: instance(input) == instance.main_method(input)
  • Unified logging via loguru
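
As an illustration of that contract, here is a sketch of what a stage built on this pattern could look like. The base class below is written from scratch for the example; it is not the actual openingestion base class, and the exact signatures are assumptions:

from abc import ABC, abstractmethod

from loguru import logger


class ExampleChunker(ABC):
    """Illustrative stage following the abstract-method / batch / callable pattern."""

    @abstractmethod
    def chunk(self, blocks):
        """Split a list of ContentBlock objects into RagChunk objects."""

    def chunk_batch(self, batches):
        # Batch processing: apply chunk() to each item, logging progress via loguru.
        results = []
        for i, blocks in enumerate(batches, start=1):
            logger.info(f"chunking batch {i}/{len(batches)}")
            results.append(self.chunk(blocks))
        return results

    def __call__(self, blocks):
        # Callable shortcut: instance(input) == instance.chunk(input)
        return self.chunk(blocks)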

The three core data models flow through the whole pipeline:

FetchedDocument  →  ContentBlock  →  RagChunk
   (Fetcher)          (Chef)       (Chunker + Refinery)

BlockKind (TEXT, TITLE, TABLE, IMAGE, LIST, EQUATION, DISCARDED) is preserved from Chef through to the final export.
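
Downstream code can therefore route or filter chunks by their originating block kind before indexing. In the sketch below, both the import path of BlockKind and the attribute that carries it on a RagChunk are assumptions:

from openingestion import BlockKind   # import path assumed

# block_kind is an illustrative attribute name, not a documented field.
visual = [c for c in chunks if c.block_kind in (BlockKind.TABLE, BlockKind.IMAGE)]
kept = [c for c in chunks if c.block_kind is not BlockKind.DISCARDED]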


License

MIT — see LICENSE.

Note on optional dependencies: the [mineru] extra installs MinerU, which is licensed under AGPL-3.0; its license terms apply when that extra is installed.

See specv3.md for full technical specifications.
