docproc
Turn messy documents into clean markdown for AI pipelines.
Document → Markdown → AI
docproc is a document-to-markdown extraction engine. It converts PDFs, DOCX, PPTX, and XLSX into clean structured markdown while preserving equations, figures, and embedded images. It is designed to power LLM pipelines, RAG systems, and document processing workflows.
Features
- PDF → Markdown — Native text extraction plus vision-based handling of embedded images
- DOCX → Markdown — Full document structure and formatting
- PPTX → Markdown — Slides to structured content
- XLSX → Markdown — Spreadsheets to readable tables
- Equation preservation — LaTeX and math kept intact (with optional LLM refinement)
- Figure extraction — Every image, diagram, and label described by a vision model
- Clean structured output — Ready for LLMs, RAG, and downstream pipelines
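To illustrate the spreadsheet-to-table idea (a toy sketch, not docproc's actual extraction code), rows from an XLSX sheet can be rendered as a markdown table like so:

```python
def rows_to_markdown(header, rows):
    """Render spreadsheet rows as a markdown pipe table (illustrative only)."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)

table = rows_to_markdown(["Item", "Qty"], [["Widget", 3], ["Gadget", 7]])
print(table)
```

The real converter also has to cope with merged cells, empty columns, and multiple sheets; this sketch only shows the target output shape.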
Example
Before: A PDF with mixed text, equations, and diagrams.
After: A single .md file with extracted text, LaTeX math blocks, and every figure explained by the vision model—ready to embed, chunk, or feed into an LLM.
docproc --file paper.pdf -o paper.md
Installation
pip install git+https://github.com/rithulkamesh/docproc.git
Or with uv:
uv tool install git+https://github.com/rithulkamesh/docproc.git
From source:
git clone https://github.com/rithulkamesh/docproc.git && cd docproc
uv sync --python 3.12
Usage
One-time config (generates docproc.yaml from your .env):
docproc init-config --env .env
Extract a document to markdown:
docproc --file input.pdf -o output.md
Optional flags: --config <path> to point at a specific config file, -v for verbose output. Shell completions: docproc completions bash or docproc completions zsh.
Python library
Install the package, then use the Docproc facade with instance-scoped config (PEP 561 typing via py.typed):
from docproc import Docproc
Docproc.from_config_path("docproc.yaml").extract_to_file("input.pdf", "output.md")
# Or minimal OpenAI in code (uses OPENAI_API_KEY):
Docproc.with_openai().extract_to_file("input.pdf", "output.md")
# String output for RAG / LLM pipelines:
md = Docproc.from_env().extract("paper.pdf")
Lower-level API: extract_document_to_text, parse_config, docprocConfig. Runnable samples: examples/.
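Once extract returns a markdown string, chunking for a RAG index is up to the caller. A minimal paragraph-based chunker (plain Python; chunk_markdown is a hypothetical helper, not part of docproc's API) might look like:

```python
def chunk_markdown(md: str, max_chars: int = 800) -> list[str]:
    """Naive paragraph-based chunker for feeding markdown into a RAG index.

    Paragraphs are packed greedily; a single paragraph longer than
    max_chars still becomes its own (oversized) chunk.
    """
    chunks, current = [], ""
    for para in md.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# In practice md would come from e.g. Docproc.from_env().extract("paper.pdf")
md = "# Title\n\n" + "Lorem ipsum. " * 100
chunks = chunk_markdown(md)
```

A real pipeline would likely split on headings and respect code fences, but the idea is the same: the extracted markdown is plain text you can slice however your retriever expects.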
Why docproc?
Naive PDF parsers often drop equations, misread layouts, and leave images as black boxes. docproc uses native extractors where possible (PyMuPDF, python-docx, etc.) and runs a vision model on every embedded image—so diagrams, charts, and equations become text or LaTeX that your AI stack can actually use. Optional LLM refinement cleans markdown and normalizes math. The result is document content that fits cleanly into RAG pipelines and LLM context windows instead of noisy, incomplete text.
Architecture
docproc ships as a CLI and an importable Python library; there is no bundled server or database for extraction. The pipeline is:
- Load — Read the file (PDF/DOCX/PPTX/XLSX) and extract full text from the native layer.
- Vision — For PDFs, run a vision model on every embedded image; get descriptions, LaTeX, or structured captions.
- Refine (optional) — LLM pass to tidy markdown, normalize LaTeX, and strip boilerplate.
- Sanitize — Dedupe and clean; write a single .md file.
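As a toy sketch of that four-stage flow (function names and bodies are illustrative stubs; the real pipeline calls native extractors and model providers at the Load, Vision, and Refine steps):

```python
def load(path: str) -> str:
    """Stub for native extraction (PyMuPDF, python-docx, etc. in reality)."""
    return "Intro text\n[image1]\nIntro text"

def vision(text: str) -> str:
    """Stub: replace each embedded image with a model-generated description."""
    return text.replace("[image1]", "Figure: a bar chart of results")

def refine(text: str) -> str:
    """Stub for the optional LLM pass that tidies markdown."""
    return text.strip()

def sanitize(text: str) -> str:
    """Dedupe repeated lines before writing the final .md file."""
    seen, out = set(), []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            out.append(line)
    return "\n".join(out)

md = sanitize(refine(vision(load("input.pdf"))))
```

Each stage consumes and produces text, which is what makes the Refine stage optional: skipping it just removes one function from the chain.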
Configuration lives in docproc.yaml (or generated via docproc init-config --env .env). AI providers: OpenAI, Azure, Anthropic, Ollama, LiteLLM. See docs/ARCHITECTURE.md and docs/CONFIGURATION.md for details.
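The authoritative schema is in docs/CONFIGURATION.md; for orientation only, a docproc.yaml might look roughly like this (keys and values here are hypothetical, not the documented schema):

```yaml
provider: openai   # or azure, anthropic, ollama, litellm
model: gpt-4o      # illustrative model name
refine: true       # enable the optional LLM refinement pass
```

Running docproc init-config --env .env generates a real config from your environment, so you rarely need to write this file by hand.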
Demo (docproc // edu)
The demo/ directory is a full study workspace: upload docs, chat over them, generate notes and flashcards, and create and take assessments. It’s a separate Go + React app that calls this CLI when a document is uploaded. See demo/README.md.
Docs
| Doc | Description |
|---|---|
| docs/README.md | Index |
| docs/CONFIGURATION.md | Config schema, providers, ingest, RAG |
| docs/ARCHITECTURE.md | Pipeline, CLI, Python library |
| docs/AZURE_SETUP.md | Azure OpenAI and Vision setup |
| docs/ASSESSMENTS_AI.md | Assessments and grading in the demo |
Environment: DOCPROC_CONFIG for config path (default: docproc.yaml). Provider keys: OPENAI_API_KEY, AZURE_OPENAI_*, ANTHROPIC_API_KEY, etc. See .env.example.
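The DOCPROC_CONFIG lookup described above amounts to a simple environment-variable fallback; an illustrative sketch (not docproc's actual code):

```python
import os

def resolve_config_path() -> str:
    """Return DOCPROC_CONFIG if set, otherwise the default docproc.yaml."""
    return os.environ.get("DOCPROC_CONFIG", "docproc.yaml")

os.environ.pop("DOCPROC_CONFIG", None)  # simulate an unset variable
path = resolve_config_path()
```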
Contributing
Pull requests are welcome. Run the tests before submitting.
License
MIT. See LICENSE.md.