RAG ingestion pipeline — Chef → Chunker → Refinery → Porter

openingestion

RAG ingestion pipeline built on top of MinerU / Docling.

Fetcher → Chef → Chunker → Refinery → Porter

Installation

1. Clone and install in editable mode

git clone <repo-url>
cd openingestion
pip install -e .

The editable install (-e) is required so that from openingestion import … statements resolve correctly from scripts and notebooks, because the repository root is itself the Python package.

1bis. Windows / PowerShell setup

The project requires Python >= 3.10. On Windows, a simple setup looks like this:

py -3.14 -m venv .venv
. .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e .
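
The Python >= 3.10 requirement can be checked from inside the activated environment before installing. The following is a minimal sketch; the helper name `meets_minimum` is hypothetical and is not part of openingestion.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in this README


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version satisfies the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if meets_minimum():
        print(f"Python {current} is supported")
    else:
        print(f"Python {current} is too old, need >= 3.10")
```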

For a first CPU-only run (no GPU), add Docling:

python -m pip install -e ".[docling]"

2. Optional extras

# MinerU parser (GPU recommended)
pip install -e ".[mineru]"

# Docling parser (CPU, no GPU needed)
pip install -e ".[docling]"

# Semantic chunking (sentence-transformers + scipy)
pip install -e ".[semantic]"

# LLM-guided chunking: SlumberChunker + OpenAIGenie
pip install -e ".[slumber]"

# Exact OpenAI tokenizer (cl100k_base, o200k_base…)
pip install -e ".[tiktoken]"

# Fast HuggingFace tokenizer (Rust, BPE/WordPiece…)
pip install -e ".[hf-tokenizers]"

# HuggingFace AutoTokenizer (full transformers)
pip install -e ".[transformers]"

# Several extras at once (hf-tokenizers and transformers are not included here)
pip install -e ".[mineru,docling,semantic,slumber,tiktoken]"
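
To see which optional dependencies are already importable in the current environment, something like the following can help. This is only a sketch: the extra-to-module mapping is an assumption based on the comments above and on the usual import names of those libraries (the authoritative dependency lists live in the project's packaging metadata), the mineru and slumber extras are left out because their import names are not stated here, and `is_importable` / `available_extras` are hypothetical helpers, not part of openingestion.

```python
import importlib.util

# Assumed mapping: extra name -> top-level modules it is expected to provide.
ASSUMED_EXTRA_MODULES = {
    "docling": ["docling"],
    "semantic": ["sentence_transformers", "scipy"],
    "tiktoken": ["tiktoken"],
    "hf-tokenizers": ["tokenizers"],
    "transformers": ["transformers"],
}


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def available_extras(mapping=ASSUMED_EXTRA_MODULES) -> dict:
    """Map each extra to True only if all of its assumed modules are importable."""
    return {
        extra: all(is_importable(name) for name in modules)
        for extra, modules in mapping.items()
    }


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"{extra:15s} {'installed' if present else 'missing'}")
```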

Quick start

from openingestion import ingest

# From a raw PDF (MinerU runs behind the scenes)
chunks = ingest("rapport.pdf")

# From an existing MinerU output directory (no re-parsing)
chunks = ingest("./output/rapport/auto/")

# With Docling (CPU, no GPU)
chunks = ingest("rapport.pdf", parser="docling", strategy="by_token")

# LangChain format
docs = ingest("rapport.pdf", output_format="langchain")
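
When ingesting several documents, it can be convenient to wrap the call so that one failing file does not abort the whole run. The following is a minimal sketch, assuming only that `ingest` accepts a path plus keyword arguments as in the calls above. `ingest_many` is a hypothetical helper, and the ingestion function is passed in as a parameter so the pattern can be tried without any parser installed.

```python
from typing import Any, Callable, Iterable


def ingest_many(paths: Iterable[str], ingest_fn: Callable[..., Any], **kwargs):
    """Call ingest_fn on each path; collect results and errors separately."""
    results = {}
    errors = {}
    for path in paths:
        key = str(path)
        try:
            results[key] = ingest_fn(key, **kwargs)
        except Exception as exc:  # keep going, but remember what failed
            errors[key] = exc
    return results, errors
```

With the library installed, this would presumably be called as `ingest_many(pdf_paths, ingest, parser="docling")`, with the keyword arguments forwarded unchanged to each call.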

Architecture

Stage | Class | Role
Chef | MinerUChef, DoclingChef | Parses the document → ContentBlock[]
Chunker | TokenChunker, SentenceChunker, SemanticChunker | Groups blocks → RagChunk[]
Refinery | RagRefinery, ContextualRagRefinery | Enriches chunks (tokens, hash, images, LLM context)
Porter | JSONPorter, to_langchain, to_llamaindex | Exports to the target format
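
The four stages in the table can be pictured as functions chained together. The sketch below is a toy model of that data flow, not openingestion's implementation: every class, field, and function name is hypothetical, tokens are approximated by whitespace-separated words, and the grouping rule is a simple greedy token budget.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToyBlock:  # hypothetical stand-in for a parsed content block
    text: str
    page: int


@dataclass
class ToyChunk:  # hypothetical stand-in for a RAG chunk
    text: str
    pages: list
    meta: dict = field(default_factory=dict)


def toy_chef(pages):
    """'Parse': one block per non-empty paragraph, remembering its page number."""
    blocks = []
    for page_no, page in enumerate(pages, start=1):
        for para in page.split("\n\n"):
            if para.strip():
                blocks.append(ToyBlock(para.strip(), page_no))
    return blocks


def toy_chunker(blocks, max_tokens=8):
    """Greedily group consecutive blocks under a word budget.

    A single block larger than the budget still becomes its own chunk.
    """
    chunks, texts, pages, count = [], [], [], 0
    for block in blocks:
        n = len(block.text.split())
        if texts and count + n > max_tokens:
            chunks.append(ToyChunk(" ".join(texts), sorted(set(pages))))
            texts, pages, count = [], [], 0
        texts.append(block.text)
        pages.append(block.page)
        count += n
    if texts:
        chunks.append(ToyChunk(" ".join(texts), sorted(set(pages))))
    return chunks


def toy_refinery(chunks):
    """Enrich each chunk with a word count and a content hash."""
    for chunk in chunks:
        chunk.meta["token_count"] = len(chunk.text.split())
        chunk.meta["sha256"] = hashlib.sha256(chunk.text.encode("utf-8")).hexdigest()
    return chunks


def toy_porter(chunks) -> str:
    """Export the chunks as a JSON string."""
    return json.dumps([asdict(chunk) for chunk in chunks], ensure_ascii=False)


def toy_pipeline(pages, max_tokens=8) -> str:
    return toy_porter(toy_refinery(toy_chunker(toy_chef(pages), max_tokens)))
```

Keeping each stage as a plain function over lists makes it easy to swap one stage (for example a different grouping rule) without touching the others, which is the property the Chef / Chunker / Refinery / Porter split suggests.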

See specv3.md for the detailed technical specifications.
