RAG ingestion pipeline — Chef → Chunker → Refinery → Porter
openingestion
RAG ingestion pipeline built on top of MinerU / Docling.
Fetcher → Chef → Chunker → Refinery → Porter
Installation
1. Clone and install in editable mode
git clone <repo-url>
cd openingestion
pip install -e .
The editable install (`-e`) is required so that `from openingestion import …` imports resolve correctly from scripts and notebooks, because the repository root is the Python package.
1bis. Windows / PowerShell setup
The project requires Python >= 3.10. On Windows, a simple setup looks like this:
py -3.14 -m venv .venv
. .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e .
For a first CPU-only run without a GPU, add Docling:
python -m pip install -e ".[docling]"
2. Optional extras
# MinerU parser (GPU recommended)
pip install -e ".[mineru]"
# Docling parser (CPU, no GPU required)
pip install -e ".[docling]"
# Semantic chunking (sentence-transformers + scipy)
pip install -e ".[semantic]"
# LLM-guided chunking: SlumberChunker + OpenAIGenie
pip install -e ".[slumber]"
# Exact OpenAI tokenizer (cl100k_base, o200k_base…)
pip install -e ".[tiktoken]"
# Fast HuggingFace tokenizer (Rust, BPE/WordPiece…)
pip install -e ".[hf-tokenizers]"
# HuggingFace AutoTokenizer (full transformers)
pip install -e ".[transformers]"
# Several extras at once (this command leaves out hf-tokenizers and transformers)
pip install -e ".[mineru,docling,semantic,slumber,tiktoken]"
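To see which extras are usable in the current environment, a small check like the one below can help. It is only a sketch: the mapping from extra name to importable module is an assumption (it is not read from the project's metadata), and the `slumber` extra is left out because its dependencies are not listed here.

```python
from importlib.util import find_spec

# Extra name -> top-level modules it is ASSUMED to provide.
# These import names are guesses for illustration, not project metadata.
EXTRA_MODULES = {
    "mineru": ["mineru"],
    "docling": ["docling"],
    "semantic": ["sentence_transformers", "scipy"],
    "tiktoken": ["tiktoken"],
    "hf-tokenizers": ["tokenizers"],
    "transformers": ["transformers"],
}


def is_importable(module: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    try:
        return find_spec(module) is not None
    except (ImportError, ValueError):
        return False


def available_extras() -> dict[str, bool]:
    """Map each extra to True when all of its assumed modules are present."""
    return {
        extra: all(is_importable(m) for m in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra:15} {'installed' if ok else 'missing'}")
```

Because `find_spec` only locates modules, the check stays fast even when heavy packages such as `transformers` are installed.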
Quick start
from openingestion import ingest
# From a raw PDF (MinerU runs in the background)
chunks = ingest("rapport.pdf")
# From an existing MinerU output directory (no re-parsing)
chunks = ingest("./output/rapport/auto/")
# With Docling (CPU, no GPU)
chunks = ingest("rapport.pdf", parser="docling", strategy="by_token")
# LangChain format
docs = ingest("rapport.pdf", output_format="langchain")
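When several PDFs need the same settings, the calls above can be wrapped in a loop. The helper below is a hypothetical sketch, not part of openingestion: `ingest_folder` is an invented name, and the ingestion function is passed in as a parameter so the helper makes no assumption about the library beyond the `ingest(path, **options)` call shape shown above.

```python
from pathlib import Path
from typing import Any, Callable, Union


def ingest_folder(
    folder: Union[str, Path],
    ingest_fn: Callable[..., Any],
    **ingest_kwargs: Any,
) -> dict[str, Any]:
    """Run ingest_fn on every PDF directly under folder.

    Hypothetical helper: returns a dict mapping each PDF file name to
    whatever ingest_fn returned for it. Keyword arguments are forwarded
    unchanged to every call.
    """
    results: dict[str, Any] = {}
    # Sort for a deterministic processing order across platforms.
    for pdf in sorted(Path(folder).glob("*.pdf")):
        results[pdf.name] = ingest_fn(str(pdf), **ingest_kwargs)
    return results
```

With the package installed, a call could look like `ingest_folder("./pdfs", ingest, parser="docling", strategy="by_token")`, where `ingest` is imported from `openingestion` as in the examples above.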
Architecture
| Stage | Class | Role |
|---|---|---|
| Chef | MinerUChef, DoclingChef | Parses the document → ContentBlock[] |
| Chunker | TokenChunker, SentenceChunker, SemanticChunker… | Groups blocks → RagChunk[] |
| Refinery | RagRefinery, ContextualRagRefinery | Enriches chunks (tokens, hash, images, LLM context) |
| Porter | JSONPorter, to_langchain, to_llamaindex | Exports to the target format |
See specv3.md for the detailed technical specifications.
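To make the flow of data between stages concrete, here is a toy, standard-library-only sketch of a Chef → Chunker → Refinery → Porter chain. Every name in it (`Block`, `Chunk`, `chef`, `chunker`, `refinery`, `porter`) is a hypothetical stand-in; the real classes listed in the table have their own fields and behavior, and whitespace splitting stands in for a real tokenizer.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Block:
    """Stand-in for a parsed content block (hypothetical fields)."""
    text: str
    page: int


@dataclass
class Chunk:
    """Stand-in for a retrieval chunk (hypothetical fields)."""
    text: str
    pages: list[int]
    metadata: dict = field(default_factory=dict)


def chef(pages: list[str]) -> list[Block]:
    """Parse: split each page into paragraph blocks on blank lines."""
    return [
        Block(text=para.strip(), page=page_no)
        for page_no, page in enumerate(pages, start=1)
        for para in page.split("\n\n")
        if para.strip()
    ]


def _flush(buf: list[Block]) -> Chunk:
    """Merge buffered blocks into one chunk, remembering source pages."""
    return Chunk(
        text="\n\n".join(b.text for b in buf),
        pages=sorted({b.page for b in buf}),
    )


def chunker(blocks: list[Block], max_tokens: int = 8) -> list[Chunk]:
    """Group: pack consecutive blocks until a whitespace-token budget is hit."""
    chunks: list[Chunk] = []
    buf: list[Block] = []
    used = 0
    for block in blocks:
        n = len(block.text.split())
        if buf and used + n > max_tokens:
            chunks.append(_flush(buf))
            buf, used = [], 0
        buf.append(block)
        used += n
    if buf:
        chunks.append(_flush(buf))
    return chunks


def refinery(chunks: list[Chunk]) -> list[Chunk]:
    """Enrich: attach a token count and a content hash to each chunk."""
    for c in chunks:
        c.metadata["token_count"] = len(c.text.split())
        c.metadata["sha256"] = hashlib.sha256(c.text.encode("utf-8")).hexdigest()
    return chunks


def porter(chunks: list[Chunk]) -> str:
    """Export: serialize the chunks to a JSON string."""
    return json.dumps([asdict(c) for c in chunks], ensure_ascii=False)
```

Chaining the stages is then a single expression, `porter(refinery(chunker(chef(pages))))`. It shows why each stage only needs to agree with its neighbors on one intermediate type.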
Download files
File details
Details for the file openingestion-0.1.1.tar.gz.
File metadata
- Download URL: openingestion-0.1.1.tar.gz
- Upload date:
- Size: 78.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bb4eac7a3df93bd36933412c39c39e73ee89ac23b905ffb45614daa0312338eb |
| MD5 | 2912170c724e5d1f17803f9d350c3272 |
| BLAKE2b-256 | 2f64fa4778fedefca74363345d3d2c94ed0b1796e506fc7e32768abf675da3f9 |
File details
Details for the file openingestion-0.1.1-py3-none-any.whl.
File metadata
- Download URL: openingestion-0.1.1-py3-none-any.whl
- Upload date:
- Size: 96.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a143393f8c0278bc3bd47b8933eb779fab717e0d86476530b53016b6d9411cf |
| MD5 | ee675148aa86935f7a2c0232d164b4ee |
| BLAKE2b-256 | c5e4c5de2e655fb008d3237874de417823345d87eaceec1c153320a377a6f13f |