Modular AI document-processing pipeline: 4 building blocks (Parsing, Retrieval, Question, Generation) × N formats.

docpipeline

Modular AI document-processing pipeline: a 4 blocks × N formats architecture inspired by the Faseya specification document.

The code is never the problem; the clarity of the organization is. The value is built around the LLM (parsing upstream, reconstruction downstream), not inside the LLM itself.

Philosophy

Total modularity: each block has a clear input, a clear output, and no uncontrolled interdependencies
LLM only where justified: extraction, classification, and conversion run 100% without an LLM (heuristics + specialized libraries). The LLM is reserved for translation, summarization, and the natural-language SQL agent
Standardized DataFrames: parsers produce consistent output (page, line, bbox, style, …) that later stages can reuse
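
As a rough illustration of the "clear input, clear output" principle, the sketch below chains four independent stages through plain typed functions. Every name in it (`Span`, `parse`, `retrieve`, `build_question`, `generate`) is hypothetical and not part of the docpipeline API. The record fields mirror the columns listed above (page, line, bbox, style), a dataclass stands in for a DataFrame row, and the generation step is a stub that makes no LLM call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One standardized parser record (hypothetical schema)."""
    page: int
    line: int
    bbox: tuple[float, float, float, float]
    style: str
    text: str


def parse(raw_lines: list[str]) -> list[Span]:
    """Parsing: raw text lines in, standardized spans out."""
    return [
        Span(page=1, line=i, bbox=(0.0, 12.0 * i, 100.0, 12.0 * (i + 1)),
             style="body", text=text)
        for i, text in enumerate(raw_lines)
    ]


def retrieve(spans: list[Span], keyword: str) -> list[Span]:
    """Retrieval: spans in, the subset mentioning the keyword out."""
    return [s for s in spans if keyword.lower() in s.text.lower()]


def build_question(spans: list[Span], question: str) -> str:
    """Question: selected spans plus a user question in, a prompt string out."""
    context = "\n".join(f"[p{s.page} l{s.line}] {s.text}" for s in spans)
    return f"Context:\n{context}\n\nQuestion: {question}"


def generate(prompt: str) -> str:
    """Generation: stub standing in for an LLM call (assumption)."""
    return f"(stub answer for a {len(prompt)}-character prompt)"


spans = parse(["Policy number 123", "Premium: 400 EUR", "Deductible: 50 EUR"])
hits = retrieve(spans, "premium")
prompt = build_question(hits, "What is the premium?")
print(generate(prompt))
```

Because each stage only depends on the type it receives, any one of them can be swapped or tested in isolation.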

Architecture

                    ┌─────────────────────────────────────────┐
                    │  4 CROSS-CUTTING BLOCKS                 │
                    ├─────────────────────────────────────────┤
                    │  Parsing → Retrieval → Question → Gen   │
                    └─────────────────────────────────────────┘
                                       │
        ┌──────────────┬───────────────┼──────────────┬──────────────┐
        ▼              ▼               ▼              ▼              ▼
      PDF           Word            Excel       Translation     Excel SQL
   pipeline       pipeline        pipeline       pipeline      Agent (NL)

docpipeline/
├── parsing/
│   ├── pdf/         classifier (3 levels), extractor (text + style + images), tables
│   ├── word/        native XML parsing (TOC, spans, tables)
│   └── excel/       ingestion → SQLite/Parquet
├── conversion/      PDF → Word (Smart, Text, OCR + Enhancer)
├── retrieval/       progressive filtering (keyword → regex → embeddings)
├── generation/      unified LLM client (OpenAI + Anthropic) + summarization
├── translation/     domain glossary + Word translation that preserves styles
└── excel_agent/     SQL agent: NL question → SQL → result
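
The retrieval entry above describes progressive filtering (keyword → regex → embeddings). The sketch below shows one way such a funnel could look using only the standard library. The function names are hypothetical, and the last stage ranks by word overlap as a stand-in for embedding similarity, so it is not the package's implementation.

```python
import re


def keyword_stage(chunks, keywords):
    """Cheap first pass: keep chunks containing at least one keyword."""
    lowered = [k.lower() for k in keywords]
    return [c for c in chunks if any(k in c.lower() for k in lowered)]


def regex_stage(chunks, pattern):
    """Second pass: keep chunks matching a more precise pattern."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [c for c in chunks if compiled.search(c)]


def overlap_stage(chunks, query, top_k=1):
    """Final pass: rank by Jaccard word overlap (stand-in for embeddings)."""
    query_words = set(re.findall(r"\w+", query.lower()))

    def score(chunk):
        words = set(re.findall(r"\w+", chunk.lower()))
        return len(query_words & words) / (len(query_words | words) or 1)

    return sorted(chunks, key=score, reverse=True)[:top_k]


chunks = [
    "The premium is 400 EUR per year.",
    "Premium payments are due in January.",
    "The deductible is 50 EUR per claim.",
    "Contact the broker for details.",
]
step1 = keyword_stage(chunks, ["premium", "deductible"])
step2 = regex_stage(step1, r"\d+\s*EUR")
step3 = overlap_stage(step2, "what is the premium per year", top_k=1)
print(step3)
```

Ordering the stages from cheapest to most expensive means the costly similarity step only sees the few chunks that survived the earlier filters.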

Installation

pip install docpipeline

or from source:

git clone https://github.com/BosterJack/docpipeline.git
cd docpipeline
pip install -e .

Optional system dependencies

OCR: install Tesseract, then pip install pytesseract
PaddleOCR: pip install paddleocr (GPU-friendly alternative)
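
Since pytesseract is only a wrapper around the Tesseract executable, it can help to check that the binary is reachable before running an OCR conversion. The helper below is a small sketch with a hypothetical name (`ocr_backend_available`); it is not provided by docpipeline.

```python
import shutil


def ocr_backend_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable is found on PATH."""
    return shutil.which(binary) is not None


if not ocr_backend_available():
    print("Tesseract not found on PATH; OCR conversion would not be usable.")
```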

Quick start

PDF classification (no LLM)

from docpipeline.parsing.pdf import classify_pdf

result = classify_pdf("contrat.pdf")
print(result.category.value)   # word_native | design_tool | scanned | other
print(result.confidence)        # 0.95
print(result.signals)           # ['meta:word_creator']
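
The `meta:word_creator` signal suggests that part of the classification relies on the PDF's creator and producer metadata. The sketch below illustrates that kind of heuristic on a plain metadata dictionary. It is an assumption-laden illustration, not the library's classifier: the function name, the other signal names, the text-density threshold, and every confidence value except the 0.95 shown above are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    WORD_NATIVE = "word_native"
    DESIGN_TOOL = "design_tool"
    SCANNED = "scanned"
    OTHER = "other"


@dataclass
class Classification:
    category: Category
    confidence: float
    signals: list[str] = field(default_factory=list)


def classify_from_metadata(metadata: dict, text_chars_per_page: float) -> Classification:
    """Classify a PDF from its metadata and average text density (hypothetical rules)."""
    creator = f"{metadata.get('creator', '')} {metadata.get('producer', '')}".lower()
    if "microsoft" in creator and "word" in creator:
        return Classification(Category.WORD_NATIVE, 0.95, ["meta:word_creator"])
    if any(tool in creator for tool in ("indesign", "illustrator", "canva")):
        return Classification(Category.DESIGN_TOOL, 0.9, ["meta:design_creator"])
    # Almost no extractable text usually means the pages are images.
    if text_chars_per_page < 20:
        return Classification(Category.SCANNED, 0.7, ["content:no_text_layer"])
    return Classification(Category.OTHER, 0.5, [])


result = classify_from_metadata({"creator": "Microsoft® Word 2019"}, 1800.0)
print(result.category.value, result.confidence, result.signals)
```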

PDF → Word conversion

from docpipeline.conversion import convert_pdf_to_word

result = convert_pdf_to_word("contrat.pdf", "contrat.docx")
print(result.engine_used)       # TextConverter (pdf2docx)
print(result.enhanced)          # True (post-processing applied)
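
The project describes three conversion engines (Smart, Text, OCR) chosen automatically from the classification result. One plausible way to express such a dispatch is a lookup table, sketched below. The mapping itself is an assumption: only the pairing of a Word-native PDF with the Text engine is suggested by the example output above, and the names `ENGINE_BY_CATEGORY` and `choose_engine` are hypothetical.

```python
# Hypothetical mapping from classifier category to conversion engine.
ENGINE_BY_CATEGORY = {
    "word_native": "text",   # text layer is reliable, convert it directly
    "design_tool": "smart",  # layout-heavy documents (assumed pairing)
    "scanned": "ocr",        # no text layer, so OCR is required
}


def choose_engine(category: str, default: str = "smart") -> str:
    """Pick an engine name for a category, falling back to a default."""
    return ENGINE_BY_CATEGORY.get(category, default)


print(choose_engine("scanned"))
```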

Native Word parsing

from docpipeline.parsing.word import parse_word

doc = parse_word("contrat.docx")
print(doc.toc)                  # Table of contents
print(doc.tables[0])            # First table (DataFrame)
print(len(doc.spans))           # Spans with stable IDs
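
To make "native XML parsing" concrete: a .docx file is a ZIP archive whose main text lives in `word/document.xml`. The sketch below reads paragraph text from that part with the standard library only. It deliberately does not use python-docx (which the project lists for this step), and both function names are hypothetical. The helper that builds a test file writes only the one part the reader needs, so its output is not a complete document that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by paragraphs (w:p) and text runs (w:t).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraphs(docx_file) -> list[str]:
    """Return the text of each non-empty paragraph in the main document part."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_minimal_docx(paragraphs: list[str]) -> io.BytesIO:
    """Build an in-memory archive holding only word/document.xml (for the demo)."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


print(read_paragraphs(make_minimal_docx(["Article 1", "Scope of cover"])))
```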

Excel → natural-language SQL agent

from docpipeline.excel_agent import ExcelSQLAgent

agent = ExcelSQLAgent("sinistres.xlsx")  # requires OPENAI_API_KEY
result = agent.ask("Which row has the highest amount?")
print(result.sql)               # SELECT * FROM sinistres ORDER BY ...
print(result.answer)            # DataFrame holding the result
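
The "NL question → SQL → result" flow can be pictured with the standard `sqlite3` module alone. In the sketch below the LLM step is replaced by a stub that returns a canned query, the rows are typed in by hand instead of being read from an Excel file, and the column name `montant` is invented for the example. None of these functions belong to docpipeline.

```python
import sqlite3


def load_rows(conn, table, columns, rows):
    """Ingestion stand-in: create a table and insert the given rows."""
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE {table} ({cols})")
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)


def question_to_sql(question: str, table: str) -> str:
    """Hypothetical stub for the LLM step: always returns the same query."""
    return f"SELECT * FROM {table} ORDER BY montant DESC LIMIT 1"


def ask(conn, question: str, table: str):
    """Translate the question, refuse anything but SELECT, then run it."""
    sql = question_to_sql(question, table)
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are executed")
    return sql, conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
load_rows(conn, "sinistres", ["id", "montant"],
          [(1, 1200.0), (2, 8400.5), (3, 310.0)])
sql, rows = ask(conn, "Which row has the highest amount?", "sinistres")
print(sql)
print(rows)
```

Checking that the generated statement is a SELECT before executing it is a cheap safeguard when the SQL comes from a model rather than from a developer.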

Word translation with style preservation

from docpipeline.translation import translate_word, Glossary, GlossaryEntry

glossary = Glossary([
    GlossaryEntry("IA", "fr", {"en": ["Individual Accident"]}, "insurance"),
    GlossaryEntry("BI", "fr", {"en": ["Business Interruption"]}, "insurance"),
])

translate_word("contrat.docx", target_lang="en", glossary=glossary)
# Produces contrat_en.docx with spans/styles/colors preserved
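
A domain glossary is typically used to pin down ambiguous abbreviations before the text reaches the model. The sketch below shows one possible approach: find the glossary terms present in a passage and turn them into hints that could be prepended to a translation prompt. The `Entry` class only imitates the argument order seen in `GlossaryEntry` above, and `glossary_hints` is a hypothetical helper, not the library's mechanism.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Stand-in mirroring the GlossaryEntry arguments shown above."""
    term: str
    source_lang: str
    translations: dict
    domain: str


def glossary_hints(text: str, entries: list[Entry], target_lang: str) -> list[str]:
    """List the preferred translation of each glossary term found in the text."""
    hints = []
    for entry in entries:
        options = entry.translations.get(target_lang)
        if not options:
            continue
        # Whole-word, case-sensitive match so "IA" does not hit "garantie".
        if re.search(rf"\b{re.escape(entry.term)}\b", text):
            hints.append(f'"{entry.term}" -> "{options[0]}" ({entry.domain})')
    return hints


entries = [
    Entry("IA", "fr", {"en": ["Individual Accident"]}, "insurance"),
    Entry("BI", "fr", {"en": ["Business Interruption"]}, "insurance"),
]
hints = glossary_hints("La garantie IA couvre les accidents.", entries, "en")
print(hints)
```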

The LLM: where and why?

Block	LLM?	Why
PDF classification	❌	Metadata heuristics + PyMuPDF content analysis
Text/image extraction	❌	Native PyMuPDF
PDF → Excel tables	❌	pdfplumber + fragment detection
Word parsing	❌	Native XML via python-docx
Excel → SQLite	❌	pandas + sqlite3
PDF → Word conversion	❌	PyMuPDF (Smart) / pdf2docx (Text) / Tesseract (OCR)
Retrieval	❌	keyword + regex + embeddings (optional)
Translation	✅	Cross-language semantics + contextual glossary
Excel SQL agent	✅	Understanding natural-language questions
Summarization	✅	Content synthesis

Tests

pytest tests/ -v
# 59 passed

Interactive demo with real files:

python -X utf8 demo.py        # all demos
python -X utf8 demo.py 1      # a single demo (1 to 7)

Credits

Architecture inspired by the internal Faseya IA specification document
PDF → Word converters ported from CHRISTMardochee/pdf2word; the code was integrated and customized (neutral colors, multi-language OCR, automatic engine selection by classification)

License

MIT; see LICENSE
