docpipe
Unified document parsing, structured extraction, vector ingestion, and RAG pipeline SDK.
Overview
docpipe connects document parsing (Docling / GLM-OCR), LLM-based structured extraction (LangExtract + LangChain), vector ingestion (pgvector or optional turbovec), and RAG querying into a single composable pipeline.
Four pipelines, composable together:
- Parse — Unstructured docs → parsed text/markdown
- Extract — Text → structured entities via LLM
- Ingest — Chunks → embeddings → your vector store
- RAG — Questions → grounded answers with citations (six retrieval strategies)
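To make the Ingest step (chunks → embeddings → vector store) concrete, here is a minimal sketch in plain Python. It does not use docpipe's API: `chunk_text`, `toy_embed`, `cosine`, and the in-memory `store` list are hypothetical stand-ins for a real chunker, embedding provider, and vector store, and the sample text is made up.

```python
import math
from collections import Counter


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embed(text: str) -> Counter:
    """Hypothetical embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# "Ingest": chunk the text, embed each chunk, keep (chunk, vector) pairs.
text = "invoice number 42 " * 5 + "total amount due is 1250 dollars"
chunks = chunk_text(text, size=40, overlap=10)
store = [(chunk, toy_embed(chunk)) for chunk in chunks]
```

In a real deployment the `store` list would be a table in your own database, which is the sense in which the data stays in your infrastructure.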
docpipe never stores your data. It connects to your infrastructure and gets out of the way.
Full documentation (install extras, Docker, API reference, RAG strategies, observability, turbovec, plugins): docpipe docs · Marketing site
Install
pip install docpipe-sdk
# API server + OpenTelemetry (optional)
pip install "docpipe-sdk[server,observability]"
Optional extras (docling, openai, google, pgvector, turbovec, rag, rerank, http, all, …) are listed on the Install guide.
For unreleased commits: pip install git+https://github.com/thesunnysinha/docpipe.git
Quick start
import docpipe

# Parse
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)

# Extract
schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)

# Ingest + RAG (configure your DB + providers)
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)

rag_config = docpipe.RAGConfig(
    connection_string=config.connection_string,
    table_name=config.table_name,
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
)
result = docpipe.query("What is the total on the invoice?", config=rag_config)
print(result.answer)
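The quick start selects `strategy="hyde"`. As a rough illustration of the general HyDE (Hypothetical Document Embeddings) idea, the sketch below asks a model for a hypothetical answer, embeds that answer instead of the question, and retrieves the nearest chunks. It shows the technique in general, not docpipe's internal implementation: `fake_llm`, `embed`, `cosine`, `hyde_retrieve`, and the corpus are all hypothetical and made up for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical embedding: bag-of-words counts with punctuation stripped."""
    return Counter(word.strip(".,?:").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fake_llm(question: str) -> str:
    """Stand-in for a real LLM call that drafts a plausible answer."""
    return "The invoice total amount due is 1250 dollars."


def hyde_retrieve(question: str, corpus: list[str], top_k: int = 1) -> list[str]:
    # 1. Draft a hypothetical answer to the question.
    hypothetical = fake_llm(question)
    # 2. Embed the draft rather than the question itself.
    query_vec = embed(hypothetical)
    # 3. Rank stored chunks by similarity to the draft.
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:top_k]


corpus = [
    "Shipping address: 12 Harbor Road.",
    "Total amount due: 1250 dollars.",
    "Payment terms are net 30 days.",
]
top = hyde_retrieve("What is the total on the invoice?", corpus)
```

The draft answer shares vocabulary with the stored chunk ("amount", "due", "dollars") that the original question does not, which is the motivation for embedding the draft.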
CLI: docpipe parse, docpipe ingest, docpipe rag query, docpipe serve — see CLI & API server.
Docker: docker pull ghcr.io/thesunnysinha/docpipe:latest — compose examples and env vars are in the Docker guide and .env.example.
Learn more
| Topic | Where |
|---|---|
| Install extras & providers | docs |
| REST API (/ingest, /rag/query, /rag/stream, …) | docs |
| RAG strategies (naive, hyde, hybrid, auto, …) | docs |
| Observability (OTEL, Prometheus, JSON logs) | docs · .env.example |
| turbovec (local file indices) | docs |
| Custom parsers / extractors | CONTRIBUTING.md |
| Environment variables | .env.example · config reference |
License
MIT — see LICENSE.