docpipe

Unified document parsing, structured extraction, vector ingestion, and RAG pipeline SDK.

Overview

docpipe connects document parsing (Docling / GLM-OCR), LLM-based structured extraction (LangExtract + LangChain), vector ingestion (pgvector), and RAG querying into a single composable pipeline.

It provides four pipelines that can be used on their own or composed together:

  1. Parse — Unstructured docs → parsed text/markdown via Docling or GLM-OCR
  2. Extract — Text → structured entities via LLM (LangExtract or LangChain)
  3. Ingest — Parsed chunks → embeddings → your vector DB (LangChain + pgvector)
  4. RAG — Questions → grounded answers with source citations (6 retrieval strategies)

docpipe never stores your data. It connects to your infrastructure and gets out of the way.


Install

pip install docpipe-sdk                  # Core only
pip install "docpipe-sdk[docling]"       # + Document parsing via Docling (PDF, DOCX, images, ...)
pip install "docpipe-sdk[glm-ocr]"       # + Document parsing via GLM-OCR (state-of-the-art OCR)
pip install "docpipe-sdk[langextract]"   # + Google LangExtract
pip install "docpipe-sdk[openai]"        # + OpenAI embeddings & LLM
pip install "docpipe-sdk[anthropic]"     # + Anthropic Claude
pip install "docpipe-sdk[google]"        # + Google Gemini
pip install "docpipe-sdk[ollama]"        # + Ollama (local models)
pip install "docpipe-sdk[pgvector]"      # + PostgreSQL vector store
pip install "docpipe-sdk[rag]"           # + Hybrid search (BM25 + langchain-classic)
pip install "docpipe-sdk[rerank]"        # + Local reranking (FlashRank)
pip install "docpipe-sdk[server]"        # + FastAPI server
pip install "docpipe-sdk[all]"           # Everything

Quick Start

Parse a document

import docpipe

# Default: Docling parser
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)
print(doc.text)

# GLM-OCR parser (state-of-the-art OCR, best for scanned/image-heavy docs)
doc = docpipe.parse("scanned_report.pdf", parser="glm-ocr")
print(doc.markdown)

Extract structured data

schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)
for r in results:
    print(r.entity_class, r.text, r.attributes)

Full parse + extract pipeline

result = docpipe.run("invoice.pdf", schema)
print(result.parsed.markdown)
print(result.extractions)

Ingest into your vector DB

config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)

Incremental ingestion (skip unchanged files)

config = docpipe.IngestionConfig(
    ...,               # same connection and embedding settings as above
    incremental=True,  # skip files whose SHA-256 hash is already in the DB
)
docpipe.ingest("invoice.pdf", config=config)
# → Skipped 'invoice.pdf' (unchanged, incremental mode)
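
The idea behind hash-based incremental ingestion can be sketched with the standard library alone. This is an illustration of the technique, not docpipe's implementation: `file_sha256`, `should_ingest`, and the in-memory `seen_hashes` mapping are hypothetical names, and a real system would keep the recorded digests in the database.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_ingest(path: Path, seen_hashes: dict) -> bool:
    """Hypothetical helper: True if the file is new or its content changed.

    `seen_hashes` maps a file name to the digest recorded at the last ingest.
    """
    current = file_sha256(path)
    if seen_hashes.get(path.name) == current:
        return False  # unchanged since the last ingest, so skip it
    seen_hashes[path.name] = current  # record the new digest
    return True
```

Calling `should_ingest` twice on the same unchanged file returns `True` and then `False`; editing the file makes it return `True` again.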

RAG — ask questions against your documents

rag_config = docpipe.RAGConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",   # naive | hyde | multi_query | parent_document | hybrid | auto
)
result = docpipe.rag("What is the total amount on the invoice?", config=rag_config)
print(result.answer)   # grounded answer with inline citations
print(result.sources)  # ["invoice.pdf"]
print(result.chunks)   # retrieved chunks with scores

Structured RAG output

from pydantic import BaseModel

class InvoiceSummary(BaseModel):
    total: float
    currency: str
    vendor: str

result = docpipe.rag(
    "Summarize the invoice",
    config=docpipe.RAGConfig(..., output_model=InvoiceSummary),
)
summary = result.structured  # InvoiceSummary(total=4250.0, currency='USD', vendor='Acme')

With reranking

rag_config = docpipe.RAGConfig(
    ...,
    strategy="naive",
    reranker="flashrank",   # local, no API key (pip install "docpipe-sdk[rerank]")
    rerank_top_n=5,
)
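
Conceptually, reranking rescores the retrieved candidates with a more precise scorer and keeps only the best `rerank_top_n`. The sketch below shows that shape with a toy word-overlap scorer standing in for a real reranking model; `rerank` and `overlap_score` are hypothetical names and are not part of docpipe or FlashRank.

```python
from typing import Callable, List


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: share of query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


def rerank(
    query: str,
    chunks: List[str],
    scorer: Callable[[str, str], float] = overlap_score,
    top_n: int = 5,
) -> List[str]:
    """Sort candidate chunks by score (highest first) and keep the top_n."""
    ranked = sorted(chunks, key=lambda chunk: scorer(query, chunk), reverse=True)
    return ranked[:top_n]


candidates = ["shipping address", "the invoice total is 4250", "invoice date"]
best = rerank("invoice total", candidates, top_n=2)
```

A real reranker would replace `overlap_score` with a model-based scorer; the surrounding retrieve, rescore, truncate flow stays the same.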

Evaluate RAG quality

from docpipe import EvalConfig, EvalQuestion, EvalPipeline

questions = [
    EvalQuestion(
        question="What is the invoice total?",
        expected_answer="$4,250",
        expected_sources=["invoice.pdf"],
    ),
]
cfg = EvalConfig(rag_config=rag_config, questions=questions,
                 metrics=["hit_rate", "answer_similarity"])
result = EvalPipeline(cfg).run()
print(result.metrics.hit_rate)          # e.g. 0.9
print(result.metrics.answer_similarity) # e.g. 0.85
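
For intuition, a hit-rate metric is commonly defined as the fraction of questions for which at least one expected source shows up among the retrieved sources. The sketch below assumes that definition; docpipe's exact computation may differ, and `hit_rate` here is a standalone illustrative function.

```python
from typing import List


def hit_rate(expected: List[List[str]], retrieved: List[List[str]]) -> float:
    """Fraction of questions with at least one expected source retrieved.

    `expected[i]` and `retrieved[i]` are the source lists for question i.
    """
    if len(expected) != len(retrieved):
        raise ValueError("expected and retrieved must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for exp, got in zip(expected, retrieved) if set(exp) & set(got))
    return hits / len(expected)


score = hit_rate(
    expected=[["invoice.pdf"], ["contract.pdf"]],
    retrieved=[["invoice.pdf", "memo.pdf"], ["memo.pdf"]],
)
```

Under this definition the score above is 0.5, because only the first question retrieved one of its expected sources.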

RAG Strategies

Strategy | How it works | Best for
naive | Vector similarity search | Well-formed queries, fast responses
hyde | LLM generates hypothetical answer → embed → retrieve | Complex / technical queries (highest accuracy)
multi_query | Expand into N query variants → union results | Vague or short queries
parent_document | Retrieve seed chunks → expand context by source | Long documents, context coherence
hybrid | Dense vector + BM25 keyword via EnsembleRetriever | Exact terms, proper nouns, IDs
auto | LLM classifies question → dispatches to optimal strategy | Mixed workloads, unknown query types
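
To make the hybrid row concrete: combining a dense ranking with a BM25 ranking is often done with weighted reciprocal rank fusion (RRF). The sketch below assumes that fusion rule and a constant of 60; it illustrates the technique only and is not docpipe's or LangChain's code. `reciprocal_rank_fusion` is a hypothetical name.

```python
from collections import defaultdict
from typing import List, Optional


def reciprocal_rank_fusion(
    rankings: List[List[str]],
    weights: Optional[List[float]] = None,
    c: int = 60,
) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns weight / (rank + c) from every list it appears in,
    with ranks starting at 1; documents are returned by descending total.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["a", "b", "c"]   # order from vector similarity
bm25 = ["c", "a", "d"]    # order from keyword search
fused = reciprocal_rank_fusion([dense, bm25])
# → ['a', 'c', 'b', 'd']
```

Documents that rank well in both lists ("a" and "c") rise to the top, which is why this kind of fusion helps with exact terms that dense retrieval alone can miss.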

CLI

# Parse
docpipe parse invoice.pdf --format markdown

# Extract
docpipe extract "some text" --schema schema.yaml --model gemini-2.5-flash

# Ingest (with incremental mode)
docpipe ingest invoice.pdf \
    --db "postgresql://..." --table invoices \
    --embedding-provider openai --embedding-model text-embedding-3-small \
    --incremental

# RAG query
docpipe rag query "What is the total?" \
    --db "postgresql://..." --table invoices \
    --strategy hyde \
    --llm-provider openai --llm-model gpt-4o \
    --embedding-provider openai --embedding-model text-embedding-3-small \
    --reranker flashrank

# Evaluate RAG quality
docpipe evaluate run \
    --questions qa.json \
    --db "postgresql://..." --table invoices \
    --llm-provider openai --llm-model gpt-4o \
    --embedding-provider openai --embedding-model text-embedding-3-small \
    --metrics hit_rate,answer_similarity

# Start API server
docpipe serve --port 8000

# List installed plugins
docpipe plugins list

qa.json format for evaluation

[
  {
    "question": "What is the invoice total?",
    "expected_answer": "$4,250",
    "expected_sources": ["invoice.pdf"]
  }
]
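
Before running an evaluation it can help to check that the file has the expected shape. The sketch below validates the three keys shown above using only the standard library. `QAItem` and `parse_questions` are hypothetical names (not docpipe's `EvalQuestion`), and treating `expected_sources` as optional is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAItem:
    """Hypothetical stand-in for one evaluation question."""
    question: str
    expected_answer: str
    expected_sources: List[str] = field(default_factory=list)


def parse_questions(raw: str) -> List[QAItem]:
    """Parse qa.json content and fail early on a malformed entry."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("qa.json must contain a JSON array")
    items = []
    for index, entry in enumerate(data):
        missing = {"question", "expected_answer"} - set(entry)
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
        items.append(
            QAItem(
                question=entry["question"],
                expected_answer=entry["expected_answer"],
                expected_sources=list(entry.get("expected_sources", [])),
            )
        )
    return items
```

For a file on disk, pass its contents: `parse_questions(open("qa.json", encoding="utf-8").read())`.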

API Server

Start the FastAPI server:

docpipe serve --host 0.0.0.0 --port 8000

Endpoints:

Method | Path | Description
GET | /health | Health check + plugin listing
POST | /parse | Parse a document
POST | /extract | Extract structured data
POST | /run | Parse + extract
POST | /ingest | Ingest into vector DB
POST | /search | Vector similarity search
POST | /rag/query | RAG question answering
POST | /evaluate/run | Evaluate RAG quality
GET | /plugins | List registered plugins
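
A running server can be probed from Python with the standard library. The sketch below only targets `GET /health`, since the request bodies of the POST endpoints are not described here. `health_request` and `check_health` are hypothetical helpers, and the assumption that the response body is JSON is unconfirmed.

```python
import json
import urllib.request


def health_request(base_url: str) -> urllib.request.Request:
    """Build a GET request for the /health endpoint of a docpipe server."""
    url = base_url.rstrip("/") + "/health"
    return urllib.request.Request(
        url, method="GET", headers={"Accept": "application/json"}
    )


def check_health(base_url: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(health_request(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server started as shown above, `check_health("http://localhost:8000")` returns the decoded response, or raises `urllib.error.URLError` if the server is unreachable.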

Docker

The official image is published to GitHub Container Registry and updated automatically on every release.

docker pull ghcr.io/thesunnysinha/docpipe:latest

Run the API server

docker run -p 8000:8000 --env-file .env \
    ghcr.io/thesunnysinha/docpipe:latest

Parse or ingest a document

# Parse
docker run -v ./data:/data \
    ghcr.io/thesunnysinha/docpipe:latest \
    parse /data/invoice.pdf --format markdown

# Ingest
docker run --env-file .env -v ./data:/data \
    ghcr.io/thesunnysinha/docpipe:latest \
    ingest /data/invoice.pdf \
    --db "postgresql://..." --table invoices \
    --embedding-provider openai --embedding-model text-embedding-3-small

Docker Compose — server + pgvector (zero config)

cp .env.example .env   # fill in your API key
docker compose up -d

# docker-compose.yml
services:
  docpipe:
    image: ghcr.io/thesunnysinha/docpipe:latest
    ports:
      - "8000:8000"
    env_file: .env
    volumes:
      - ./data:/data
    depends_on:
      db:
        condition: service_healthy

  db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: docpipe
      POSTGRES_PASSWORD: docpipe
      POSTGRES_DB: docpipe
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U docpipe"]
      interval: 5s
      retries: 5

volumes:
  pgdata:

A full-stack variant with Adminer (DB UI) is in docker-compose.full.yml.

Available tags

Tag | Description
latest | Most recent build from main
0.4.1, 0.4 | Specific release versions
sha-<hash> | Exact commit build

Plugin System

Register custom parsers or extractors via Python entry points:

# In your package's pyproject.toml
[project.entry-points."docpipe.parsers"]
my_parser = "my_package:MyParser"

[project.entry-points."docpipe.extractors"]
my_extractor = "my_package:MyExtractor"

Implement the BaseParser or BaseExtractor protocol (structural subtyping — no inheritance required):

import docpipe

class MyParser:
    name = "my_parser"

    # Synchronous and asynchronous parsing entry points
    def parse(self, source: str, **kwargs) -> docpipe.ParsedDocument: ...
    async def aparse(self, source: str, **kwargs) -> docpipe.ParsedDocument: ...

    # Capability checks
    def is_available(self) -> bool: ...
    def supported_formats(self) -> list[str]: ...

See CONTRIBUTING.md for a full walkthrough.
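
Structural subtyping means a class qualifies by having the right attributes and methods, not by inheriting from a base class. The sketch below demonstrates this with `typing.Protocol`. `ParserLike`, `EchoParser`, and `NotAParser` are hypothetical names that mirror a subset of the methods listed above; they are not docpipe's actual `BaseParser`.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class ParserLike(Protocol):
    """Hypothetical protocol with a subset of the parser methods."""
    name: str

    def parse(self, source: str, **kwargs) -> object: ...
    def is_available(self) -> bool: ...
    def supported_formats(self) -> List[str]: ...


class EchoParser:
    """Satisfies ParserLike without inheriting from it."""
    name = "echo"

    def parse(self, source: str, **kwargs) -> object:
        return source

    def is_available(self) -> bool:
        return True

    def supported_formats(self) -> List[str]:
        return ["txt"]


class NotAParser:
    """Has a name but none of the required methods."""
    name = "broken"


# isinstance() on a runtime-checkable protocol checks that the members exist,
# not that their signatures match.
matches = isinstance(EchoParser(), ParserLike)
rejected = isinstance(NotAParser(), ParserLike)
```

Here `matches` is `True` and `rejected` is `False`. Static type checkers additionally compare signatures, which the runtime check does not.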


Supported Providers

Component | Providers
Parsing | Docling (PDF, DOCX, XLSX, PPTX, HTML, images), GLM-OCR (state-of-the-art multimodal OCR)
Extraction | LangExtract (Google), LangChain with_structured_output
Embeddings | OpenAI, Google Gemini, Ollama, HuggingFace
Vector store | PostgreSQL + pgvector
LLM (RAG) | OpenAI, Anthropic Claude, Google Gemini, Ollama
Reranking | FlashRank (local), Cohere

License

MIT — see LICENSE.
