Unified document parsing, structured extraction, vector ingestion, and RAG pipeline SDK

docpipe

Overview

docpipe connects document parsing (Docling / GLM-OCR), LLM-based structured extraction (LangExtract + LangChain), vector ingestion (pgvector or optional turbovec), and RAG querying into a single composable pipeline.

Four pipelines, composable together:

  1. Parse — Unstructured docs → parsed text/markdown
  2. Extract — Text → structured entities via LLM
  3. Ingest — Chunks → embeddings → your vector store
  4. RAG — Questions → grounded answers with citations (six retrieval strategies)
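
The Ingest and RAG stages above can be pictured with a toy, in-memory sketch. This is not docpipe's implementation: the names `chunk_text`, `embed`, `ToyStore`, and `answer` are hypothetical, the "embedding" is a bag-of-words count rather than a model call, and a Python list stands in for a vector store such as pgvector. It only illustrates the data flow: chunks are embedded and stored, then a question is matched against them so the answer can cite the chunk it came from.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=8):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase word counts (a real pipeline calls a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyStore:
    """Hypothetical stand-in for a vector store: a list of (id, text, vector)."""

    def __init__(self):
        self.rows = []

    def ingest(self, doc_name, text):
        for i, chunk in enumerate(chunk_text(text)):
            self.rows.append((f"{doc_name}#{i}", chunk, embed(chunk)))

    def search(self, question, k=1):
        q = embed(question)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[2]), reverse=True)
        return ranked[:k]


def answer(store, question):
    """Return the best-matching chunk together with its citation."""
    chunk_id, text, _ = store.search(question, k=1)[0]
    return {"answer": text, "citation": chunk_id}


store = ToyStore()
store.ingest(
    "invoice.txt",
    "Invoice 1001 issued to Acme Corp on 3 May. "
    "The total amount due is 420 dollars payable in 30 days.",
)
result = answer(store, "What is the total amount due?")
print(result["citation"], "->", result["answer"])
```

A real pipeline would replace `embed` with an embedding model and pass the retrieved chunks to an LLM to write the final answer; the citation bookkeeping stays the same in spirit.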

docpipe never stores your data. It connects to your infrastructure and gets out of the way.

Full documentation (install extras, Docker, API reference, RAG strategies, observability, turbovec, plugins) is available in the docpipe docs.


Install

pip install docpipe-sdk
# API server + OpenTelemetry (optional)
pip install "docpipe-sdk[server,observability]"

Optional extras (docling, openai, google, pgvector, turbovec, rag, rerank, http, all, …) are listed in the Install guide.

For unreleased commits: pip install git+https://github.com/thesunnysinha/docpipe.git
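
After installing, it can help to confirm which version the environment actually picked up. A small sketch using only the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration, and `docpipe-sdk` is the distribution name taken from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("docpipe-sdk")
print("docpipe-sdk:", found if found else "not installed")
```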


Quick start

import docpipe

# Parse
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)

# Extract
schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)

# Ingest + RAG (configure your DB + providers)
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)

rag_config = docpipe.RAGConfig(
    connection_string=config.connection_string,
    table_name=config.table_name,
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
)
result = docpipe.query("What is the total on the invoice?", config=rag_config)
print(result.answer)
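
The quick start selects `strategy="hyde"`. HyDE (Hypothetical Document Embeddings) generally means asking an LLM to draft a hypothetical answer and retrieving with the embedding of that draft instead of the raw question. Below is a minimal sketch of the idea, assuming a stubbed `fake_llm` in place of a real model and word-set overlap in place of real embeddings. None of these names come from docpipe, and how docpipe implements the strategy internally is not shown here.

```python
import re


def tokens(text):
    """Toy stand-in for an embedding: the set of lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Overlap between two token sets, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0


def fake_llm(question):
    """Hypothetical stub: a real system would call an LLM here."""
    return "The invoice total amount due is some number of dollars."


def retrieve(query_text, chunks):
    """Return the chunk most similar to query_text."""
    q = tokens(query_text)
    return max(chunks, key=lambda c: jaccard(q, tokens(c)))


def hyde_retrieve(question, chunks, llm=fake_llm):
    """HyDE: search with a hypothetical answer rather than the question itself."""
    hypothetical = llm(question)
    return retrieve(hypothetical, chunks)


chunks = [
    "How much do I owe? Contact billing with any question.",
    "Total amount due: 420 dollars.",
]
question = "How much do I owe?"
print("naive:", retrieve(question, chunks))
print("hyde: ", hyde_retrieve(question, chunks))
```

In this toy data the raw question matches the chunk that merely repeats its wording, while the hypothetical answer shares vocabulary with the chunk that actually contains the total. That gap between question wording and answer wording is the motivation for HyDE.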

CLI: docpipe parse, docpipe ingest, docpipe rag query, docpipe serve — see CLI & API server.

Docker: docker pull ghcr.io/thesunnysinha/docpipe:latest — compose examples and env vars are in the Docker guide and .env.example.


Learn more

Topic                                            Where
Install extras & providers                       docs
REST API (/ingest, /rag/query, /rag/stream, …)   docs
RAG strategies (naive, hyde, hybrid, auto, …)    docs
Observability (OTEL, Prometheus, JSON logs)      docs · .env.example
turbovec (local file indices)                    docs
Custom parsers / extractors                      CONTRIBUTING.md
Environment variables                            .env.example · config reference

License

MIT — see LICENSE.

Download files

Source Distribution

docpipe_sdk-0.5.3.tar.gz (179.2 kB)

Uploaded Source

Built Distribution

docpipe_sdk-0.5.3-py3-none-any.whl (63.4 kB)

Uploaded Python 3

File details

Details for the file docpipe_sdk-0.5.3.tar.gz.

File metadata

  • Download URL: docpipe_sdk-0.5.3.tar.gz
  • Upload date:
  • Size: 179.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for docpipe_sdk-0.5.3.tar.gz
Algorithm Hash digest
SHA256 11d688dc6cd14535599c8e88b3399eaa139d1f9a820de2df0e8964ac0be0f06c
MD5 8e59a4daa6ac5b12ad282438aa760120
BLAKE2b-256 e4940c228ec938bacfb3979add7f0c223dfdcb0dc6d48d3df372e158c129656c

File details

Details for the file docpipe_sdk-0.5.3-py3-none-any.whl.

File metadata

  • Download URL: docpipe_sdk-0.5.3-py3-none-any.whl
  • Upload date:
  • Size: 63.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for docpipe_sdk-0.5.3-py3-none-any.whl
Algorithm Hash digest
SHA256 dd52a2c882529710e97c5eab635a1cd3549c21af91044a1e8a78504fdbe443a8
MD5 500ac933200af8c57b06dff1787cd342
BLAKE2b-256 bb4c324ed9d728245280fd3efb0d75d0205a20d38307e28548040a53be74e2bf
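
The SHA256 digests listed above can be checked against a downloaded file before installing it. The following is a sketch using the standard library's `hashlib`; the helper names `sha256_of_file` and `matches` are illustrative, and the file paths are assumed to point at local copies of the two distributions.

```python
import hashlib

# SHA256 digests copied from the "File hashes" listings above.
EXPECTED_SHA256 = {
    "docpipe_sdk-0.5.3.tar.gz":
        "11d688dc6cd14535599c8e88b3399eaa139d1f9a820de2df0e8964ac0be0f06c",
    "docpipe_sdk-0.5.3-py3-none-any.whl":
        "dd52a2c882529710e97c5eab635a1cd3549c21af91044a1e8a78504fdbe443a8",
}


def sha256_of_file(path, block_size=65536):
    """Hash a file in blocks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 digest equals expected_hex (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.lower()
```

For example, `matches("docpipe_sdk-0.5.3.tar.gz", EXPECTED_SHA256["docpipe_sdk-0.5.3.tar.gz"])` should return `True` for an unmodified download. pip can also enforce this itself through its hash-checking mode (`--require-hashes` with `--hash=sha256:...` entries in a requirements file).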
