Unified document parsing, structured extraction, and vector ingestion pipeline
docpipe
Unified document parsing, structured extraction, and vector ingestion pipeline.
Overview
docpipe connects document parsing (Docling), LLM-based structured extraction (LangExtract + LangChain), and vector ingestion (pgvector via LangChain) in one composable toolkit.
It provides three independent stages that can be used on their own or chained together:
- Parse: Unstructured docs → parsed text/markdown (Docling)
- Extract: Text → structured entities via LLM (LangExtract or LangChain)
- Ingest: Parsed chunks → embeddings → your vector DB (LangChain + pgvector)
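To illustrate what "independent but composable" means here, the following is a minimal sketch using stand-in functions. The names `ParsedDoc`, `parse`, `extract`, and `chunk` are hypothetical and are not docpipe's real implementation; they only show that extraction and ingestion each consume the parser's output without depending on one another.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical stand-ins for the three stages; NOT docpipe's real classes.
@dataclass
class ParsedDoc:
    text: str


def parse(raw: str) -> ParsedDoc:
    """Stand-in parser: collapses runs of whitespace into single spaces."""
    return ParsedDoc(text=" ".join(raw.split()))


def extract(doc: ParsedDoc) -> List[str]:
    """Stand-in extractor: treats capitalized words as 'entities'."""
    return [w.strip(",.") for w in doc.text.split() if w[:1].isupper()]


def chunk(doc: ParsedDoc, size: int = 20) -> List[str]:
    """Stand-in for the first ingestion step: fixed-size character chunks."""
    return [doc.text[i:i + size] for i in range(0, len(doc.text), size)]


# Each downstream stage only needs the parser's output, so either can be skipped.
doc = parse("John  Doe,\n age 30, lives in Paris.")
entities = extract(doc)   # extraction without ingestion
chunks = chunk(doc)       # ingestion prep without extraction
```

A real pipeline would replace each stand-in with the corresponding backend, but the data flow would presumably stay the same: parse once, then extract, ingest, or both.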
Install
# Core only
pip install docpipe
# With all backends
pip install "docpipe[all]"
# Pick what you need
pip install "docpipe[docling]" # Document parsing
pip install "docpipe[langextract]" # Google LangExtract
pip install "docpipe[openai]" # OpenAI embeddings + LLM
pip install "docpipe[pgvector]" # PostgreSQL vector store
pip install "docpipe[server]" # FastAPI server
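After installing a subset of extras, it can be handy to check which optional backends are actually importable. The sketch below uses only the standard library. The mapping from extra name to top-level module is an assumption based on the usual import names of those packages, not something this README states.

```python
from importlib.util import find_spec

# Assumed mapping: docpipe extra -> top-level module that extra is expected
# to make importable. Adjust if the real dependencies differ.
EXTRA_MODULES = {
    "docling": "docling",
    "langextract": "langextract",
    "openai": "openai",
    "pgvector": "pgvector",
    "server": "fastapi",
}


def available_extras(mapping=EXTRA_MODULES):
    """Return {extra: True/False} depending on whether its module can be found."""
    return {extra: find_spec(module) is not None for extra, module in mapping.items()}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra:12s} {'installed' if ok else 'missing'}")
```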
Quick Start
Python API
import docpipe
# Parse a document
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)
# Extract structured data
schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)
# Full pipeline
result = docpipe.run("invoice.pdf", schema)
# Ingest into your vector DB
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)
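The connection string above hard-codes credentials for brevity. One possible way to avoid that is to assemble the URL from environment variables and percent-encode the parts that may contain reserved characters. This is a sketch, not part of docpipe: `build_connection_string` is a hypothetical helper, and reading the libpq-style variables `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, and `PGDATABASE` is an assumption about your environment.

```python
import os
from urllib.parse import quote


def build_connection_string(user, password, host="localhost", port=5432, database="mydb"):
    """Build a postgresql:// URL, percent-encoding user, password, and database."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


# Hypothetical usage: read settings from the environment with local defaults.
connection_string = build_connection_string(
    user=os.environ.get("PGUSER", "user"),
    password=os.environ.get("PGPASSWORD", "pass"),
    host=os.environ.get("PGHOST", "localhost"),
    port=int(os.environ.get("PGPORT", "5432")),
    database=os.environ.get("PGDATABASE", "mydb"),
)
```

The resulting string could then be passed as `connection_string` to `docpipe.IngestionConfig` in place of the literal shown above.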
CLI
docpipe parse invoice.pdf --format markdown
docpipe extract "John Doe, age 30" --schema schema.yaml --model gemini-2.5-flash
docpipe run invoice.pdf --schema schema.yaml --model gemini-2.5-flash
docpipe ingest invoice.pdf --db "postgresql://..." --table invoices \
  --embedding-provider openai --embedding-model text-embedding-3-small
docpipe search "total amount" --db "postgresql://..." --table invoices \
  --embedding-provider openai --embedding-model text-embedding-3-small
docpipe serve
docpipe plugins list
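For batch work, the CLI can be driven from a small script. The sketch below builds one `docpipe parse <file> --format markdown` command per PDF in a directory, using only the subcommand and flag shown above. `build_parse_commands` and `run_all` are hypothetical helpers, and the dry-run default is a design choice so that nothing executes until you opt in.

```python
import subprocess
from pathlib import Path


def build_parse_commands(directory, pattern="*.pdf"):
    """Return one argv list per matching file, sorted by path for stable output."""
    return [
        ["docpipe", "parse", str(path), "--format", "markdown"]
        for path in sorted(Path(directory).glob(pattern))
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if the CLI exits non-zero.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(build_parse_commands("."))
```

Passing argv lists to `subprocess.run` avoids shell quoting problems with file names that contain spaces.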
Docker
# API server
docker run -p 8000:8000 --env-file .env docpipe
# CLI
docker run -v "$(pwd)/data:/data" docpipe parse /data/invoice.pdf
Plugin System
Third-party packages can register as plugins via entry points:
# In your package's pyproject.toml
[project.entry-points."docpipe.parsers"]
my_parser = "my_package:MyParser"
[project.entry-points."docpipe.extractors"]
my_extractor = "my_package:MyExtractor"
License
MIT