openingestion

Modular RAG ingestion pipeline — from raw documents to retrieval-ready chunks.

Fetcher → Chef → Chunker → Refinery → Porter

Version: 0.1.4 · Python: 3.10 – 3.13 · License: MIT


Overview

openingestion orchestrates the full journey from raw documents to enriched, retrieval-ready chunks through five composable stages:

Stage    | Classes                                                                                                                      | Input → Output
Fetcher  | LocalFileFetcher, WebFetcher, SharepointFetcher                                                                              | Source → FetchedDocument[]
Chef     | MinerUChef, DoclingChef                                                                                                      | File/dir → ContentBlock[]
Chunker  | TokenChunker, SentenceChunker, SemanticChunker, SlumberChunker, BlockChunker, PageChunker, SectionChunker, RecursiveChunker | ContentBlock[] → RagChunk[]
Refinery | RagRefinery, ContextualRagRefinery, VisionRefinery                                                                           | RagChunk[] → enriched RagChunk[]
Porter   | JSONPorter, to_dicts, to_langchain, to_llamaindex                                                                            | RagChunk[] → target format
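
The ingest() helper shown in Quick start wires these stages for you, but they can also be composed by hand. The sketch below is illustrative only: the class names come from the table above and each stage is used through its callable shortcut (see Architecture), while the Chef/Chunker import paths, constructor arguments, and the exact input each callable accepts are assumptions, not documented API.

from openingestion.fetcher import LocalFileFetcher
from openingestion.refinery import RagRefinery
from openingestion.porter import JSONPorter

# Module paths for Chef and Chunker are assumed for this sketch.
from openingestion.chef import DoclingChef
from openingestion.chunker import TokenChunker

fetcher = LocalFileFetcher(ext=[".pdf"])
chef = DoclingChef()                                        # constructor arguments assumed
chunker = TokenChunker(max_tokens=512, overlap_tokens=64)   # parameter names assumed
refinery = RagRefinery(image_mode="path")

all_chunks = []
for doc in fetcher(dir="./inputs/"):                # Source → FetchedDocument[]
    blocks = chef(doc)                              # → ContentBlock[]
    all_chunks.extend(refinery(chunker(blocks)))    # → enriched RagChunk[]

JSONPorter(lines=True)(all_chunks, file="output.jsonl")     # → JSONL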

Installation

From PyPI

pip install openingestion

The base install (loguru + chonkie-core) is intentionally minimal — no heavy ML dependencies. Parsers, tokenizers, and advanced chunkers are opt-in extras.
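
Because the heavier parsers are opt-in, callers sometimes pick a parser based on what is actually installed. A minimal sketch using only the standard library; the fallback logic is illustrative (openingestion does not do this for you), and it assumes MinerU's import name is mineru:

from importlib.util import find_spec

from openingestion import ingest

# Prefer the GPU-backed MinerU parser when its extra is present, otherwise fall back to Docling.
parser = "mineru" if find_spec("mineru") else "docling"
chunks = ingest("rapport.pdf", parser=parser)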

From source (editable)

git clone https://github.com/Isopope/openIngestion.git
cd openIngestion
pip install -e .

Windows / PowerShell

openingestion requires Python 3.10 – 3.13 (3.14+ not yet supported due to MinerU).

py -3.13 -m venv .venv
. .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e .

Optional extras

Extra         | Installs                              | Use case
mineru        | mineru[pipeline]==3.1.0               | GPU-accelerated PDF, Office, and image parsing (full layout analysis)
docling       | docling                               | CPU-only PDF parsing (IBM Docling, no GPU required)
semantic      | sentence-transformers, numpy, scipy   | SemanticChunker — embedding-based splitting
slumber       | openai, pydantic, tenacity, tqdm      | SlumberChunker + OpenAIGenie — LLM-guided chunking
tiktoken      | tiktoken                              | Exact OpenAI tokenizer (cl100k_base, o200k_base …)
hf-tokenizers | tokenizers                            | Fast HuggingFace tokenizers (Rust, BPE/WordPiece)
transformers  | transformers                          | HuggingFace AutoTokenizer (full model loading)
langchain     | langchain-core                        | output_format="langchain"
llamaindex    | llama-index-core                      | output_format="llamaindex"
web           | playwright                            | WebFetcher — render websites to PDF/HTML
sharepoint    | msal, office365-rest-python-client    | SharepointFetcher — Microsoft 365 / SharePoint

Convenience bundles:

# CPU pipeline (Docling + semantic + tiktoken)
pip install -e ".[cpu]"

# GPU pipeline (MinerU + semantic + tiktoken)
pip install -e ".[mineru,gpu]"

# Everything
pip install -e ".[all]"

Individual extras:

pip install -e ".[mineru]"          # MinerU parser (GPU recommended)
pip install -e ".[docling]"         # Docling parser (CPU)
pip install -e ".[semantic]"        # SemanticChunker
pip install -e ".[slumber]"         # SlumberChunker + OpenAI
pip install -e ".[web]"             # WebFetcher (then: playwright install chromium)
pip install -e ".[sharepoint]"      # SharepointFetcher

Quick start

from openingestion import ingest, ingest_from_output, ingest_from_json

# Parse a raw PDF with MinerU (requires [mineru] extra)
chunks = ingest("rapport.pdf")

# Parse native Office documents with MinerU 3.1+
chunks = ingest("support.docx")
chunks = ingest("deck.pptx")
chunks = ingest("tableau.xlsx")

# Skip re-parsing — reuse an existing MinerU output directory
chunks = ingest_from_output("./output/rapport/auto/")

# Load directly from a content_list.json
chunks = ingest_from_json("./output/rapport/auto/rapport_content_list.json")

# Use Docling instead of MinerU (CPU, no GPU needed)
chunks = ingest("rapport.pdf", parser="docling", strategy="by_sentence")

# Full control
chunks = ingest(
    "rapport.pdf",
    parser="mineru",                # or "docling"
    strategy="by_token",            # by_block | by_token | by_sentence | by_semantic | by_slumber
    max_tokens=512,
    overlap_tokens=64,
    image_mode="path",              # path | base64 | skip | ignore
    infer_captions=True,
    output_format="chunks",         # chunks | dicts | langchain | llamaindex
)

# Export to LangChain Documents
docs = ingest("rapport.pdf", output_format="langchain")

# Export to LlamaIndex TextNodes
nodes = ingest("rapport.pdf", output_format="llamaindex")

Fetchers

from openingestion.fetcher import LocalFileFetcher, WebFetcher, SharepointFetcher

# Local filesystem
fetcher = LocalFileFetcher(ext=[".pdf"])
docs = fetcher(dir="./inputs/")

# Website → PDF (requires [web] extra + playwright install chromium)
fetcher = WebFetcher(output_dir="./downloads/", mode="pdf")
docs = fetcher.fetch(urls=["https://example.com"])

# SharePoint / Microsoft 365 (requires [sharepoint] extra)
fetcher = SharepointFetcher(
    client_id="...", client_secret="...", tenant_id="...",
    output_dir="./downloads/",
)
docs = fetcher.fetch(site_url="https://tenant.sharepoint.com/sites/MySite")
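
Fetched documents can then be handed to the rest of the pipeline. A small sketch, assuming each FetchedDocument exposes the local path of the downloaded file (the path attribute below is illustrative, not a documented field):

from openingestion import ingest
from openingestion.fetcher import LocalFileFetcher

fetcher = LocalFileFetcher(ext=[".pdf"])
all_chunks = []
for doc in fetcher(dir="./inputs/"):
    # doc.path is a hypothetical attribute; adapt it to the real FetchedDocument field.
    all_chunks.extend(ingest(doc.path, parser="docling", strategy="by_sentence"))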

Refineries

from openingestion.refinery import RagRefinery, ContextualRagRefinery, VisionRefinery
from openingestion.genie import OpenAIGenie

# Standard enrichment: token count, content hash, image paths
refinery = RagRefinery(output_dir="./output/doc/auto/", image_mode="path")
chunks = refinery.enrich(chunks)

# Contextual RAG: LLM-generated doc summary + per-chunk context
genie = OpenAIGenie(model="gpt-4o-mini", api_key="sk-...")
ctx_refinery = ContextualRagRefinery(genie=genie, generate_doc_summary=True)
chunks = ctx_refinery.enrich(chunks)

# Vision: extract text from scanned tables / images via GPT-4o
vision_refinery = VisionRefinery(genie=genie, only_if_empty=True)
chunks = vision_refinery.enrich(chunks)

Export

from openingestion.porter import JSONPorter

# JSONL (one chunk per line)
JSONPorter(lines=True)(chunks, file="output.jsonl")

# Pretty JSON array
JSONPorter(lines=False, indent=2)(chunks, file="output.json")
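
Besides JSONPorter, the overview table lists to_dicts, to_langchain, and to_llamaindex under the Porter stage. A minimal sketch, assuming they are importable from openingestion.porter and take the chunk list directly (these signatures are assumptions; the output_format argument of ingest() shown in Quick start is the documented route):

from openingestion.porter import to_dicts, to_langchain, to_llamaindex

records = to_dicts(chunks)      # plain dicts, e.g. for a vector-store client
docs = to_langchain(chunks)     # requires the [langchain] extra
nodes = to_llamaindex(chunks)   # requires the [llamaindex] extra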

Architecture

Each stage follows a uniform Abstract Base Class pattern (a sketch follows the list below):

  • Abstract method: process() / chunk() / enrich() / export()
  • Batch processing: process_batch() / chunk_batch() / enrich_batch() / export_batch()
  • Callable shortcut: instance(input) == instance.main_method(input)
  • Unified logging via loguru
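
As an illustration of that contract, here is a sketch of what a stage built on this pattern could look like. The base class below is written from scratch for the example; it is not the actual openingestion base class, and the exact signatures are assumptions:

from abc import ABC, abstractmethod

from loguru import logger


class ExampleChunker(ABC):
    """Illustrative stage following the abstract-method / batch / callable pattern."""

    @abstractmethod
    def chunk(self, blocks):
        """Split a list of ContentBlock objects into RagChunk objects."""

    def chunk_batch(self, batches):
        # Batch processing: apply chunk() to each item, logging progress via loguru.
        results = []
        for i, blocks in enumerate(batches, start=1):
            logger.info(f"chunking batch {i}/{len(batches)}")
            results.append(self.chunk(blocks))
        return results

    def __call__(self, blocks):
        # Callable shortcut: instance(input) == instance.chunk(input)
        return self.chunk(blocks)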

The three core data models flow through the whole pipeline:

FetchedDocument  →  ContentBlock  →  RagChunk
   (Fetcher)          (Chef)       (Chunker + Refinery)

BlockKind (TEXT, TITLE, TABLE, IMAGE, LIST, EQUATION, DISCARDED) is preserved from Chef through to the final export.
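
Downstream code can therefore route or filter chunks by their originating block kind before indexing. In the sketch below, both the import path of BlockKind and the attribute that carries it on a RagChunk are assumptions:

from openingestion import BlockKind   # import path assumed

# block_kind is an illustrative attribute name, not a documented field.
visual = [c for c in chunks if c.block_kind in (BlockKind.TABLE, BlockKind.IMAGE)]
kept = [c for c in chunks if c.block_kind is not BlockKind.DISCARDED]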


License

MIT — see LICENSE.

Note on optional dependencies: the [mineru] extra installs MinerU, which is licensed under AGPL-3.0; its license terms apply when that extra is installed.

See specv3.md for full technical specifications.
