docpipe

Unified document parsing, structured extraction, vector ingestion, and RAG pipeline SDK.

Overview

docpipe connects document parsing (Docling / GLM-OCR), LLM-based structured extraction (LangExtract + LangChain), vector ingestion (pgvector or optional turbovec), and RAG querying into a single composable pipeline.

Four pipelines, composable together:

  1. Parse — Unstructured docs → parsed text/markdown
  2. Extract — Text → structured entities via LLM
  3. Ingest — Chunks → embeddings → your vector store
  4. RAG — Questions → grounded answers with citations (six retrieval strategies)

docpipe never stores your data. It connects to your infrastructure and gets out of the way.
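To make the hand-offs between the four stages concrete, here is a toy sketch in plain Python. It does not use docpipe at all: every function and class below is a hypothetical stand-in (string splitting instead of a parser, an LLM, or a vector store), and only the order of the stages mirrors the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str
    text: str


def parse(raw: str) -> str:
    """Stage 1 (Parse): collapse messy whitespace into clean lines of text."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract(text: str) -> dict[str, str]:
    """Stage 2 (Extract): pick out 'key: value' fields; a real pipeline would ask an LLM."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def ingest(text: str, source: str, store: list[Chunk]) -> int:
    """Stage 3 (Ingest): one chunk per line, appended to a caller-owned store."""
    chunks = [Chunk(source, line) for line in text.splitlines()]
    store.extend(chunks)
    return len(chunks)


def query(question: str, store: list[Chunk]) -> Chunk:
    """Stage 4 (RAG): return the chunk sharing the most words with the question."""
    words = set(question.lower().replace("?", "").split())
    return max(
        store,
        key=lambda c: len(words & set(c.text.lower().replace(":", "").split())),
    )


store: list[Chunk] = []  # owned by the caller, not by the pipeline functions
text = parse("Invoice   42\n\n  Total:   118.00 EUR \nVendor: Acme Ltd")
fields = extract(text)
count = ingest(text, "invoice.txt", store)
hit = query("What is the total?", store)
print(fields, count, hit)
```

The store is passed in by the caller, which is one simple way to model a pipeline that writes to infrastructure you own and keeps nothing itself.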

Full documentation (install extras, Docker, API reference, RAG strategies, observability, turbovec, plugins): docpipe docs · Marketing site


Install

pip install docpipe-sdk
# API server + OpenTelemetry (optional)
pip install "docpipe-sdk[server,observability]"

Optional extras (docling, openai, google, pgvector, turbovec, rag, rerank, http, all, …) are listed in the Install guide.

For unreleased commits: pip install git+https://github.com/thesunnysinha/docpipe.git
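The distribution is installed as docpipe-sdk but imported as docpipe, so it can help to confirm what is actually installed. A minimal sketch using only the standard library (the helper name `installed_version` is illustrative, not part of docpipe):

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "docpipe-sdk") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"docpipe-sdk: {found or 'not installed'}")
```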


Quick start

import docpipe

# Parse
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)

# Extract
schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)

# Ingest + RAG (configure your DB + providers)
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)

rag_config = docpipe.RAGConfig(
    connection_string=config.connection_string,
    table_name=config.table_name,
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
)
result = docpipe.query("What is the total on the invoice?", config=rag_config)
print(result.answer)
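The quick start selects `strategy="hyde"`. As background, HyDE (hypothetical document embeddings) generally means asking an LLM to draft a plausible answer first, then retrieving with the embedding of that draft instead of the embedding of the question. The sketch below illustrates the general idea only and says nothing about how docpipe implements it: `fake_llm`, the bag-of-words `embed`, and the two sample documents are all made up for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fake_llm(question: str) -> str:
    """Stand-in for an LLM call: a plausible answer passage whose figures may be wrong."""
    return "The invoice total amount due is 100.00 EUR including tax."


def retrieve(query_text: str, docs: list[str]) -> str:
    """Return the document most similar to the query text."""
    q = embed(query_text)
    return max(docs, key=lambda d: cosine(q, embed(d)))


docs = [
    "Amount due including tax: 118.00 EUR.",
    "What is the return policy? Contact support.",
]
question = "What is the total on the invoice?"

naive_hit = retrieve(question, docs)           # embed the question itself
hyde_hit = retrieve(fake_llm(question), docs)  # embed the hypothetical answer
print("naive:", naive_hit)
print("hyde: ", hyde_hit)
```

In this contrived case the question shares only filler words with the wrong document, while the drafted answer shares vocabulary with the right one, even though the drafted amount is incorrect. The draft is used only for retrieval; the final answer would come from the retrieved text.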

CLI: docpipe parse, docpipe ingest, docpipe rag query, docpipe serve — see CLI & API server.

Docker: docker pull ghcr.io/thesunnysinha/docpipe:latest — compose examples and env vars are in the Docker guide and .env.example.
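The quick start hardcodes database credentials in the connection string. One way to avoid that is to assemble the string from environment variables, as sketched below. The variable names follow the common libpq convention (`PGUSER`, `PGPASSWORD`, …) and are an assumption here; they are not taken from docpipe's .env.example, so check that file for the names docpipe itself reads.

```python
import os
from collections.abc import Mapping
from urllib.parse import quote


def pg_connection_string(env: Mapping[str, str] = os.environ) -> str:
    """Build a postgresql:// URL, percent-encoding user, password, and database."""
    user = quote(env["PGUSER"], safe="")
    password = quote(env["PGPASSWORD"], safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = quote(env["PGDATABASE"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# A plain dict stands in for the real environment in this example.
example_env = {"PGUSER": "user", "PGPASSWORD": "p@ss/word", "PGDATABASE": "mydb"}
print(pg_connection_string(example_env))
# → postgresql://user:p%40ss%2Fword@localhost:5432/mydb
```

The result can be passed as `connection_string` wherever the quick start uses a literal. Percent-encoding matters because characters such as `@` or `/` in a password would otherwise be read as URL delimiters.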


Learn more

| Topic | Where |
| --- | --- |
| Install extras & providers | docs |
| REST API (/ingest, /rag/query, /rag/stream, …) | docs |
| RAG strategies (naive, hyde, hybrid, auto, …) | docs |
| Observability (OTEL, Prometheus, JSON logs) | docs · .env.example |
| turbovec (local file indices) | docs |
| Custom parsers / extractors | CONTRIBUTING.md |
| Environment variables | .env.example · config reference |

License

MIT — see LICENSE.
