docpipe

Unified document parsing, structured extraction, and vector ingestion pipeline.

Overview

docpipe connects document parsing (Docling), LLM-based structured extraction (LangExtract + LangChain), and vector ingestion (pgvector via LangChain) into a single composable pipeline.

The three stages are independent and can be run on their own or composed:

  1. Parse: Unstructured docs → parsed text/markdown (Docling)
  2. Extract: Text → structured entities via LLM (LangExtract or LangChain)
  3. Ingest: Parsed chunks → embeddings → your vector DB (LangChain + pgvector)
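To make the composition idea concrete, here is a minimal sketch of three stages that can run separately or chained. It does not call docpipe: every name in it (`ParsedDoc`, `parse_stage`, `extract_stage`, `ingest_stage`, `run_pipeline`) is a hypothetical stand-in, and the stage bodies are toy logic in place of Docling, an LLM, and a vector store.

```python
from dataclasses import dataclass, field

# All names below are hypothetical stand-ins, not part of the docpipe API.


@dataclass
class ParsedDoc:
    text: str


@dataclass
class PipelineResult:
    parsed: ParsedDoc
    entities: list = field(default_factory=list)
    chunks_ingested: int = 0


def parse_stage(raw: str) -> ParsedDoc:
    """Stand-in for parsing: collapse runs of whitespace."""
    return ParsedDoc(text=" ".join(raw.split()))


def extract_stage(doc: ParsedDoc) -> list:
    """Stand-in for LLM extraction: collect purely numeric tokens."""
    return [
        {"type": "number", "value": token}
        for token in doc.text.split()
        if token.isdigit()
    ]


def ingest_stage(doc: ParsedDoc, store: list, chunk_size: int = 20) -> int:
    """Stand-in for ingestion: split into fixed-size chunks and store them."""
    chunks = [
        doc.text[i:i + chunk_size]
        for i in range(0, len(doc.text), chunk_size)
    ]
    store.extend(chunks)
    return len(chunks)


def run_pipeline(raw: str, store: list, extract: bool = True,
                 ingest: bool = True) -> PipelineResult:
    """Always parse; extraction and ingestion are optional, independent steps."""
    doc = parse_stage(raw)
    result = PipelineResult(parsed=doc)
    if extract:
        result.entities = extract_stage(doc)
    if ingest:
        result.chunks_ingested = ingest_stage(doc, store)
    return result
```

Because extraction and ingestion both consume only the parsed document, either can be switched off without affecting the other, which is the property the list above describes.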

Install

# Core only (the distribution is published as docpipe-sdk; the import name is docpipe)
pip install docpipe-sdk

# With all backends
pip install "docpipe-sdk[all]"

# Pick what you need
pip install "docpipe-sdk[docling]"      # Document parsing
pip install "docpipe-sdk[langextract]"  # Google LangExtract
pip install "docpipe-sdk[openai]"       # OpenAI embeddings + LLM
pip install "docpipe-sdk[pgvector]"     # PostgreSQL vector store
pip install "docpipe-sdk[server]"       # FastAPI server

Quick Start

Python API

import docpipe

# Parse a document
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)

# Extract structured data
schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)

# Full pipeline
result = docpipe.run("invoice.pdf", schema)

# Ingest into your vector DB
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)
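The connection string above embeds credentials directly in code. One way to avoid that, sketched here with the standard library only, is to assemble the URL from environment variables. The helper name `build_connection_string` is hypothetical, and the choice of the libpq-style variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) is an assumption, not something docpipe is known to read itself.

```python
import os
from urllib.parse import quote


def build_connection_string(env=None) -> str:
    """Build a postgresql:// URL from environment-style settings.

    Hypothetical helper: user and password are percent-encoded so that
    characters such as '@' or '/' cannot break the URL structure.
    """
    env = os.environ if env is None else env
    user = quote(env["PGUSER"], safe="")
    password = quote(env["PGPASSWORD"], safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env["PGDATABASE"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

The returned string could then be passed as `connection_string` to `docpipe.IngestionConfig` in place of the literal shown above.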

CLI

docpipe parse invoice.pdf --format markdown
docpipe extract "John Doe, age 30" --schema schema.yaml --model gemini-2.5-flash
docpipe run invoice.pdf --schema schema.yaml --model gemini-2.5-flash
docpipe ingest invoice.pdf --db "postgresql://..." --table invoices \
    --embedding-provider openai --embedding-model text-embedding-3-small
docpipe search "total amount" --db "postgresql://..." --table invoices \
    --embedding-provider openai --embedding-model text-embedding-3-small
docpipe serve
docpipe plugins list
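When the CLI is driven from a script, passing arguments as a list avoids shell-quoting problems with the database URL. The following is a rough sketch under that assumption: the helper names are hypothetical, and the flags are only the ones shown in the commands above.

```python
import subprocess


def build_ingest_command(path, db_url, table, provider="openai",
                         model="text-embedding-3-small"):
    """Return the argv list for the `docpipe ingest` call shown above."""
    return [
        "docpipe", "ingest", path,
        "--db", db_url,
        "--table", table,
        "--embedding-provider", provider,
        "--embedding-model", model,
    ]


def run_ingest(path, db_url, table):
    """Run the command without a shell; raises CalledProcessError on failure."""
    return subprocess.run(build_ingest_command(path, db_url, table), check=True)
```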

Docker

# API server
docker run -p 8000:8000 --env-file .env docpipe

# CLI
docker run -v "$(pwd)/data:/data" docpipe parse /data/invoice.pdf

Plugin System

Third-party packages can register parsers and extractors as plugins through Python entry points:

# In your package's pyproject.toml
[project.entry-points."docpipe.parsers"]
my_parser = "my_package:MyParser"

[project.entry-points."docpipe.extractors"]
my_extractor = "my_package:MyExtractor"

License

MIT


