docpipe
Unified document parsing, structured extraction, vector ingestion, and RAG pipeline SDK.
Overview
docpipe connects document parsing (Docling), LLM-based structured extraction (LangExtract + LangChain), vector ingestion (pgvector via LangChain), and RAG querying into a single composable pipeline.
Four independent pipelines, composable together:
- Parse — Unstructured docs → parsed text/markdown via Docling
- Extract — Text → structured entities via LLM (LangExtract or LangChain)
- Ingest — Parsed chunks → embeddings → your vector DB (LangChain + pgvector)
- RAG — Questions → grounded answers with source citations (5 retrieval strategies)
docpipe never stores your data. It connects to your infrastructure and gets out of the way.
Install
pip install docpipe-sdk # Core only
pip install "docpipe-sdk[docling]" # + Document parsing (PDF, DOCX, images, ...)
pip install "docpipe-sdk[langextract]" # + Google LangExtract
pip install "docpipe-sdk[openai]" # + OpenAI embeddings & LLM
pip install "docpipe-sdk[google]" # + Google Gemini
pip install "docpipe-sdk[ollama]" # + Ollama (local models)
pip install "docpipe-sdk[pgvector]" # + PostgreSQL vector store
pip install "docpipe-sdk[rag]" # + Hybrid search (BM25)
pip install "docpipe-sdk[rerank]" # + Local reranking (FlashRank)
pip install "docpipe-sdk[server]" # + FastAPI server
pip install "docpipe-sdk[all]" # Everything
Quick Start
Parse a document
import docpipe
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)
print(doc.text)
Extract structured data
schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)
for r in results:
    print(r.entity_class, r.text, r.attributes)
Full parse + extract pipeline
result = docpipe.run("invoice.pdf", schema)
print(result.parsed.markdown)
print(result.extractions)
Ingest into your vector DB
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)
Incremental ingestion (skip unchanged files)
config = docpipe.IngestionConfig(
    ...,
    incremental=True,  # skips files already in the DB by SHA-256 hash
)
docpipe.ingest("invoice.pdf", config=config)
# → Skipped 'invoice.pdf' (unchanged, incremental mode)
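The skip-by-hash idea behind incremental mode can be sketched with the standard library alone. This is an illustration under assumptions, not docpipe's implementation: `file_sha256`, `should_ingest`, and the in-memory `known_hashes` set are hypothetical names, and docpipe keeps its hashes in the database rather than in a Python set.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read block by block so large documents are never loaded whole.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_ingest(path: str, known_hashes: set[str]) -> bool:
    """Hypothetical helper: True when this file's content hash is not yet known."""
    return file_sha256(path) not in known_hashes
```

Because the hash is computed from the file's bytes, a renamed but unchanged file would still be skipped, while any edit to the content produces a new digest and triggers re-ingestion.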
RAG — ask questions against your documents
rag_config = docpipe.RAGConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",  # naive | hyde | multi_query | parent_document | hybrid
)
result = docpipe.rag("What is the total amount on the invoice?", config=rag_config)
print(result.answer) # grounded answer with inline citations
print(result.sources) # ["invoice.pdf"]
print(result.chunks) # retrieved chunks with scores
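To make the `hyde` strategy used above more concrete, here is a toy sketch of the idea: an LLM drafts a hypothetical answer, and that draft (not the question) is embedded and used for retrieval. Everything in it is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, `generate` is any callable standing in for the LLM, and none of these names come from docpipe.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    question: str,
    chunks: list[str],
    generate: Callable[[str], str],
    k: int = 2,
) -> list[str]:
    """Rank chunks by similarity to a hypothetical answer rather than to the question."""
    hypothetical = generate(question)  # step 1: LLM drafts a plausible answer
    query_vec = embed(hypothetical)    # step 2: embed the draft
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]                  # step 3: retrieve the nearest chunks
```

The intuition is that a drafted answer tends to share more vocabulary with the relevant passage than a short question does, at the cost of one extra LLM call per query.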
Structured RAG output
from pydantic import BaseModel

class InvoiceSummary(BaseModel):
    total: float
    currency: str
    vendor: str

result = docpipe.rag(
    "Summarize the invoice",
    config=docpipe.RAGConfig(..., output_model=InvoiceSummary),
)
summary = result.structured  # InvoiceSummary(total=4250.0, currency='USD', vendor='Acme')
With reranking
rag_config = docpipe.RAGConfig(
    ...,
    strategy="naive",
    reranker="flashrank",  # local, no API key (pip install docpipe-sdk[rerank])
    rerank_top_n=5,
)
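Reranking is a second pass over the retrieved chunks: each one is rescored against the query and only the best `rerank_top_n` are kept. The sketch below shows that shape only. The scorer is a deliberately simple token-overlap stand-in (FlashRank uses a trained ranking model instead), and `overlap_score` and `rerank` are hypothetical names, not docpipe or FlashRank APIs.

```python
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Stand-in relevance score: fraction of query tokens present in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    passages: list[str],
    top_n: int = 5,
    score: Callable[[str, str], float] = overlap_score,
) -> list[str]:
    """Rescore every retrieved passage and keep the top_n highest-scoring ones."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked[:top_n]
```

A common pattern is to retrieve generously (for example 20 to 50 chunks) and let the reranker trim the list, since the reranker sees the query and passage together and can judge relevance more precisely than vector distance alone.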
Evaluate RAG quality
from docpipe import EvalConfig, EvalQuestion, EvalPipeline

questions = [
    EvalQuestion(
        question="What is the invoice total?",
        expected_answer="$4,250",
        expected_sources=["invoice.pdf"],
    ),
]
cfg = EvalConfig(
    rag_config=rag_config,
    questions=questions,
    metrics=["hit_rate", "answer_similarity"],
)
result = EvalPipeline(cfg).run()
print(result.metrics.hit_rate)           # e.g. 0.9
print(result.metrics.answer_similarity)  # e.g. 0.85
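This README does not spell out how the two metrics are computed, so the following is a sketch under common definitions, not docpipe's actual code. It assumes `hit_rate` is the share of questions for which at least one expected source was retrieved, and it uses `difflib.SequenceMatcher` as a simple stand-in for whatever similarity measure `answer_similarity` uses (an embedding-based score is just as plausible).

```python
from difflib import SequenceMatcher


def hit_rate(
    retrieved_sources: list[list[str]],
    expected_sources: list[list[str]],
) -> float:
    """Fraction of questions where at least one expected source was retrieved."""
    if not expected_sources:
        return 0.0
    hits = sum(
        1
        for got, want in zip(retrieved_sources, expected_sources)
        if set(got) & set(want)
    )
    return hits / len(expected_sources)


def answer_similarity(answer: str, expected: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a semantic score."""
    return SequenceMatcher(None, answer.lower(), expected.lower()).ratio()
```

Under this definition a single-question evaluation can only yield a hit rate of 0.0 or 1.0, so fractional values only appear once the question set has several entries.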
RAG Strategies
| Strategy | How it works | Best for |
|---|---|---|
| naive | Vector similarity search | Well-formed queries, fast responses |
| hyde | LLM generates hypothetical answer → embed → retrieve | Complex / technical queries (highest accuracy) |
| multi_query | Expand into N query variants → union results | Vague or short queries |
| parent_document | Retrieve seed chunks → expand context by source | Long documents, context coherence |
| hybrid | Dense vector + BM25 keyword via EnsembleRetriever | Exact terms, proper nouns, IDs |
CLI
# Parse
docpipe parse invoice.pdf --format markdown
# Extract
docpipe extract "some text" --schema schema.yaml --model gemini-2.5-flash
# Ingest (with incremental mode)
docpipe ingest invoice.pdf \
--db "postgresql://..." --table invoices \
--embedding-provider openai --embedding-model text-embedding-3-small \
--incremental
# RAG query
docpipe rag query "What is the total?" \
--db "postgresql://..." --table invoices \
--strategy hyde \
--llm-provider openai --llm-model gpt-4o \
--embedding-provider openai --embedding-model text-embedding-3-small \
--reranker flashrank
# Evaluate RAG quality
docpipe evaluate run \
--questions qa.json \
--db "postgresql://..." --table invoices \
--llm-provider openai --llm-model gpt-4o \
--embedding-provider openai --embedding-model text-embedding-3-small \
--metrics hit_rate,answer_similarity
# Start API server
docpipe serve --port 8000
# List installed plugins
docpipe plugins list
qa.json format for evaluation
[
  {
    "question": "What is the invoice total?",
    "expected_answer": "$4,250",
    "expected_sources": ["invoice.pdf"]
  }
]
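Before running an evaluation it can be useful to check that a questions file has the shape shown above. The helper below is a hypothetical sketch: it assumes all three keys are required and that `expected_sources` is a list, which matches the example but is not confirmed as docpipe's validation rule.

```python
import json

# Keys taken from the example above; treating all of them as required is an assumption.
REQUIRED_KEYS = {"question", "expected_answer", "expected_sources"}


def load_questions(raw: str) -> list[dict]:
    """Parse qa.json content and check that each entry has the expected shape."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("qa.json must contain a JSON array")
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"entry {index} must be a JSON object")
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
        if not isinstance(item["expected_sources"], list):
            raise ValueError(f"entry {index}: expected_sources must be a list")
    return items
```

Typical use would be `load_questions(Path("qa.json").read_text())`, failing fast on a malformed file instead of partway through an evaluation run.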
API Server
Start the FastAPI server:
docpipe serve --host 0.0.0.0 --port 8000
# or via Docker
docker run -p 8000:8000 --env-file .env docpipe
Endpoints:
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check + plugin listing |
| POST | /parse | Parse a document |
| POST | /extract | Extract structured data |
| POST | /run | Parse + extract |
| POST | /ingest | Ingest into vector DB |
| POST | /search | Vector similarity search |
| POST | /rag/query | RAG question answering |
| POST | /evaluate/run | Evaluate RAG quality |
| GET | /plugins | List registered plugins |
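As a quick smoke test, the `/health` endpoint can be called with the standard library. This sketch assumes the server is reachable at `http://localhost:8000` and that the response body is JSON (the FastAPI default); the response fields are not documented here, so the code returns whatever it receives. The POST endpoints are left out because their request bodies are not specified in this README.

```python
import json
import urllib.request


def build_request(base_url: str, path: str = "/health") -> urllib.request.Request:
    """Build a GET request for one of the server's read-only endpoints."""
    return urllib.request.Request(base_url.rstrip("/") + path, method="GET")


def check_health(base_url: str = "http://localhost:8000", timeout: float = 5.0):
    """Call /health and return the decoded JSON body (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started via `docpipe serve --port 8000`, calling `check_health()` should return the health payload, and `build_request(base_url, "/plugins")` targets the plugin listing in the same way.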
Docker
# API server
docker run -p 8000:8000 --env-file .env docpipe
# Parse in container
docker run -v ./data:/data docpipe parse /data/invoice.pdf --format markdown
# Ingest from container
docker run --env-file .env -v ./data:/data docpipe ingest /data/invoice.pdf \
  --db "postgresql://user:pass@mydb.example.com:5432/mydb" \
  --table invoices \
  --embedding-provider openai --embedding-model text-embedding-3-small
Plugin System
Register custom parsers or extractors via Python entry points:
# In your package's pyproject.toml
[project.entry-points."docpipe.parsers"]
my_parser = "my_package:MyParser"
[project.entry-points."docpipe.extractors"]
my_extractor = "my_package:MyExtractor"
Implement the BaseParser or BaseExtractor protocol (structural subtyping — no inheritance required):
class MyParser:
    name = "my_parser"

    def parse(self, source: str, **kwargs) -> docpipe.ParsedDocument: ...
    async def aparse(self, source: str, **kwargs) -> docpipe.ParsedDocument: ...
    def is_available(self) -> bool: ...
    def supported_formats(self) -> list[str]: ...
See CONTRIBUTING.md for a full walkthrough.
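Structural subtyping means a class qualifies as a parser by having the right attributes and methods, without subclassing anything. The sketch below illustrates this with `typing.Protocol`. It is self-contained and therefore uses stand-ins: `ParsedDocument` here is a minimal dataclass assumed from the `text` and `markdown` attributes in the Quick Start, `ParserProtocol` mirrors the method list above, and `PlainTextParser` is a hypothetical example plugin.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class ParsedDocument:
    """Stand-in for docpipe.ParsedDocument (fields assumed from the Quick Start)."""
    text: str
    markdown: str


@runtime_checkable
class ParserProtocol(Protocol):
    """Mirror of the parser interface listed above; not docpipe's own class."""
    name: str

    def parse(self, source: str, **kwargs) -> ParsedDocument: ...
    async def aparse(self, source: str, **kwargs) -> ParsedDocument: ...
    def is_available(self) -> bool: ...
    def supported_formats(self) -> list[str]: ...


class PlainTextParser:
    """Hypothetical plugin: note that it does not inherit from ParserProtocol."""
    name = "plain_text"

    def parse(self, source: str, **kwargs) -> ParsedDocument:
        with open(source, encoding="utf-8") as handle:
            text = handle.read()
        return ParsedDocument(text=text, markdown=text)

    async def aparse(self, source: str, **kwargs) -> ParsedDocument:
        # Simple delegation; a real plugin might use non-blocking I/O here.
        return self.parse(source, **kwargs)

    def is_available(self) -> bool:
        return True

    def supported_formats(self) -> list[str]:
        return ["txt"]


# isinstance only checks that the members exist, not their signatures.
print(isinstance(PlainTextParser(), ParserProtocol))
```

Because the runtime check only looks for member names, a static type checker such as mypy or pyright is still the better tool for catching a wrong signature in a plugin.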
Supported Providers
| Component | Providers |
|---|---|
| Parsing | Docling (PDF, DOCX, XLSX, PPTX, HTML, images, audio, video) |
| Extraction | LangExtract (Google), LangChain with_structured_output |
| Embeddings | OpenAI, Google Gemini, Ollama, HuggingFace |
| Vector store | PostgreSQL + pgvector |
| LLM (RAG) | OpenAI, Google Gemini, Ollama, Anthropic |
| Reranking | FlashRank (local), Cohere |
License
MIT — see LICENSE.