RAG ingestion pipeline — Chef → Chunker → Refinery → Porter

openingestion

RAG ingestion pipeline built on top of MinerU / Docling.

Fetcher → Chef → Chunker → Refinery → Porter

Installation

1. Clone and install in editable mode

git clone <repo-url>
cd openingestion
pip install -e .

The editable install (-e) is required so that from openingestion import … statements resolve correctly from scripts and notebooks, because the repository root is itself the Python package.

1bis. Windows / PowerShell setup

The project requires Python >= 3.10. On Windows, a simple setup looks like this:

py -3.14 -m venv .venv
. .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e .

For a first CPU-only run (no GPU), add Docling:

python -m pip install -e ".[docling]"

2. Optional extras

# MinerU parser (GPU recommended)
pip install -e ".[mineru]"

# Docling parser (CPU, no GPU required)
pip install -e ".[docling]"

# Semantic chunking (sentence-transformers + scipy)
pip install -e ".[semantic]"

# LLM-guided chunking: SlumberChunker + OpenAIGenie
pip install -e ".[slumber]"

# Exact OpenAI tokenizer (cl100k_base, o200k_base…)
pip install -e ".[tiktoken]"

# Fast HuggingFace tokenizer (Rust, BPE/WordPiece…)
pip install -e ".[hf-tokenizers]"

# HuggingFace AutoTokenizer (full transformers)
pip install -e ".[transformers]"

# Several extras at once (add hf-tokenizers and transformers to the list if needed)
pip install -e ".[mineru,docling,semantic,slumber,tiktoken]"
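Since each extra pulls in its own dependencies, it can be handy to check which ones are importable before picking a parser or chunker. The sketch below uses only the standard library. The mapping from extra names to import names is an assumption based on the comments above (the real contents of each extra are defined in the project's packaging metadata), and available_extras is a hypothetical helper, not part of openingestion.

```python
from importlib.util import find_spec

# Assumed mapping: extra name -> top-level module expected to be importable.
# These import names are guesses from the comments above; adjust as needed.
ASSUMED_EXTRA_MODULES = {
    "docling": "docling",
    "semantic": "sentence_transformers",
    "tiktoken": "tiktoken",
    "hf-tokenizers": "tokenizers",
    "transformers": "transformers",
}


def available_extras(extra_modules):
    """Return {extra_name: bool} telling whether each module can be found."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras(ASSUMED_EXTRA_MODULES).items():
        print(f"{extra:15s} {'installed' if ok else 'missing'}")
```

find_spec only locates a module without importing it, so the check stays cheap even for heavy packages.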

Quick start

from openingestion import ingest

# From a raw PDF (MinerU runs in the background)
chunks = ingest("rapport.pdf")

# From an existing MinerU output directory (no re-parsing)
chunks = ingest("./output/rapport/auto/")

# With Docling (CPU, no GPU)
chunks = ingest("rapport.pdf", parser="docling", strategy="by_token")

# LangChain format
docs = ingest("rapport.pdf", output_format="langchain")
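The calls above accept either a raw PDF or an already-parsed output directory. How ingest tells the two apart is not documented here; the following is only a minimal sketch of one plausible way to distinguish them, using a hypothetical classify_source helper that is not part of openingestion.

```python
from pathlib import Path


def classify_source(source):
    """Guess how a source path would need to be handled (illustrative only)."""
    path = Path(source)
    if path.is_dir():
        # An existing parser output directory: reuse it, no re-parsing.
        return "parsed_output_dir"
    if path.suffix.lower() == ".pdf":
        # A raw PDF: it has to go through a parser first.
        return "raw_pdf"
    return "unknown"
```

A check like this can be useful in a batch script to skip documents whose output directory already exists.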

Architecture

| Stage | Classes | Role |
|---|---|---|
| Chef | MinerUChef, DoclingChef | Parses the document → ContentBlock[] |
| Chunker | TokenChunker, SentenceChunker, SemanticChunker | Groups blocks → RagChunk[] |
| Refinery | RagRefinery, ContextualRagRefinery | Enriches chunks (tokens, hash, images, LLM context) |
| Porter | JSONPorter, to_langchain, to_llamaindex | Exports to the target format |

See specv3.md for the detailed technical specifications.
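To make the flow between stages concrete, here is a self-contained toy version of the Chunker → Refinery → Porter steps. It does not use openingestion: ToyBlock, ToyChunk, chunk_by_tokens, refine and to_json are hypothetical stand-ins, the field names are assumptions, and tokens are approximated by whitespace splitting rather than a real tokenizer.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToyBlock:
    """Stand-in for a parsed content block (what a Chef would produce)."""
    text: str
    page: int


@dataclass
class ToyChunk:
    """Stand-in for a chunk (Chunker output, later enriched by the Refinery)."""
    text: str
    pages: list
    metadata: dict = field(default_factory=dict)


def chunk_by_tokens(blocks, max_tokens):
    """Greedily group consecutive blocks while staying within max_tokens.

    A single block larger than the budget becomes its own chunk.
    """
    chunks, texts, pages, used = [], [], [], 0
    for block in blocks:
        n = len(block.text.split())  # whitespace tokens (approximation)
        if texts and used + n > max_tokens:
            chunks.append(ToyChunk(" ".join(texts), sorted(set(pages))))
            texts, pages, used = [], [], 0
        texts.append(block.text)
        pages.append(block.page)
        used += n
    if texts:
        chunks.append(ToyChunk(" ".join(texts), sorted(set(pages))))
    return chunks


def refine(chunks):
    """Attach a token count and a content hash to each chunk."""
    for chunk in chunks:
        chunk.metadata["token_count"] = len(chunk.text.split())
        chunk.metadata["sha256"] = hashlib.sha256(
            chunk.text.encode("utf-8")
        ).hexdigest()
    return chunks


def to_json(chunks):
    """Serialize chunks to a JSON string (the Porter step)."""
    return json.dumps([asdict(c) for c in chunks], ensure_ascii=False, indent=2)


blocks = [
    ToyBlock("Introduction to the report", 1),
    ToyBlock("Key figures for 2023", 1),
    ToyBlock("Detailed analysis follows here", 2),
]
chunks = refine(chunk_by_tokens(blocks, max_tokens=8))
print(to_json(chunks))
```

With a budget of 8 tokens, the first two blocks end up in one chunk and the third in a second one. Keeping each stage as a plain function over a list makes it easy to swap one stage (for example the chunking strategy) without touching the others, which is the design the stage table suggests.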
