docpipe

Unified document parsing, structured extraction, vector ingestion, and RAG pipeline SDK.

Overview

docpipe connects document parsing (Docling / GLM-OCR), LLM-based structured extraction (LangExtract + LangChain), vector ingestion (pgvector or optional turbovec), and RAG querying into a single composable pipeline.

Four composable pipelines:

  1. Parse — Unstructured docs → parsed text/markdown
  2. Extract — Text → structured entities via LLM
  3. Ingest — Chunks → embeddings → your vector store
  4. RAG — Questions → grounded answers with citations (six retrieval strategies)

docpipe never stores your data. It connects to your infrastructure and gets out of the way.
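
To make the Ingest and RAG stages above more concrete, here is a toy, dependency-free sketch of the chunk → embed → store → retrieve flow. It does not use docpipe and is not how docpipe implements these stages: word-count vectors stand in for a real embedding model, a Python list stands in for the vector store, and every function name (`chunk`, `embed`, `ingest`, `retrieve`) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text):
    """Split text into sentence-sized chunks."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(text, store):
    """Chunk the text and append (index, chunk, vector) rows to the store."""
    for index, piece in enumerate(chunk(text)):
        store.append((index, piece, embed(piece)))


def retrieve(question, store, k=1):
    """Return the k chunks most similar to the question, with their indices."""
    query_vec = embed(question)
    ranked = sorted(store, key=lambda row: cosine(query_vec, row[2]), reverse=True)
    return [(index, piece) for index, piece, _ in ranked[:k]]


store = []  # in-memory stand-in for a vector store
ingest(
    "Invoice 1042 was issued to Acme. "
    "The total amount due is 300 dollars. "
    "Payment terms are net thirty days.",
    store,
)
hits = retrieve("What is the total amount due?", store)
print(hits)
```

The chunk index returned alongside each hit is what a citation would point back to; a real pipeline would pass the retrieved chunks to an LLM to compose the answer.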

Full documentation (install extras, Docker, API reference, RAG strategies, observability, turbovec, plugins): docpipe docs · Marketing site


Install

pip install docpipe-sdk
# API server + OpenTelemetry (optional)
pip install "docpipe-sdk[server,observability]"

Optional extras (docling, openai, google, pgvector, turbovec, rag, rerank, http, all, …) are listed on the Install guide.

For unreleased commits: pip install git+https://github.com/thesunnysinha/docpipe.git
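
After installing, it can help to confirm which version actually landed in the active environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name `docpipe-sdk` is taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "docpipe-sdk") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found is not None else "docpipe-sdk is not installed")
```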


Quick start

import docpipe

# Parse
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)

# Extract
schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)

# Ingest + RAG (configure your DB + providers)
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)

rag_config = docpipe.RAGConfig(
    connection_string=config.connection_string,
    table_name=config.table_name,
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
)
result = docpipe.query("What is the total on the invoice?", config=rag_config)
print(result.answer)
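
The quick start repeats the connection and embedding settings in both configs. One way to keep them in sync, and to keep credentials out of source code, is to build the shared keyword arguments once. This is a sketch under assumptions: the helper `shared_settings` and the environment variable names `DATABASE_URL` and `DOCPIPE_TABLE` are hypothetical (check .env.example for the names docpipe actually reads); the dictionary keys mirror the fields used in the quick start.

```python
import os


def shared_settings(env=None):
    """Build the settings that both configs in the quick start have in common."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names; not confirmed by the docpipe docs.
        "connection_string": env["DATABASE_URL"],
        "table_name": env.get("DOCPIPE_TABLE", "invoices"),
        "embedding_provider": "openai",
        "embedding_model": "text-embedding-3-small",
    }


settings = shared_settings({"DATABASE_URL": "postgresql://user:pass@localhost:5432/mydb"})
print(settings["table_name"])

# Assumed usage, following the quick start's field names:
#   config = docpipe.IngestionConfig(**settings)
#   rag_config = docpipe.RAGConfig(**settings, llm_provider="openai",
#                                  llm_model="gpt-4o", strategy="hyde")
```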

CLI: docpipe parse, docpipe ingest, docpipe rag query, docpipe serve — see CLI & API server.

Docker: docker pull ghcr.io/thesunnysinha/docpipe:latest — compose examples and env vars are in the Docker guide and .env.example.


Learn more

| Topic | Where |
| --- | --- |
| Install extras & providers | docs |
| REST API (/ingest, /rag/query, /rag/stream, …) | docs |
| RAG strategies (naive, hyde, hybrid, auto, …) | docs |
| Observability (OTEL, Prometheus, JSON logs) | docs · .env.example |
| turbovec (local file indices) | docs |
| Custom parsers / extractors | CONTRIBUTING.md |
| Jingo sidecar integration | Jingo |
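
As an illustration of the REST API entry above, the following sketch builds (but does not send) a POST request to /rag/query using only the standard library. Only the endpoint path comes from this README; the base URL, the port, and the JSON field name `question` are assumptions, and the helper name `build_rag_query_request` is hypothetical. Consult the API reference for the real request schema.

```python
import json
from urllib.request import Request


def build_rag_query_request(base_url: str, question: str) -> Request:
    """Build a JSON POST request for the /rag/query endpoint."""
    # "question" is an assumed field name, not confirmed by the docs.
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/rag/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The host and port are placeholders for wherever `docpipe serve` is running.
req = build_rag_query_request("http://localhost:8000/", "What is the total on the invoice?")
print(req.get_method(), req.full_url)
```

To send it against a running server, pass `req` to `urllib.request.urlopen` and decode the JSON response.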

License

MIT — see LICENSE.
