Professional document conversion library (PDF ↔ LaTeX)

DocStream

CI · License: MIT · Python 3.11+ · Code style: ruff

DocStream is a professional open-source document conversion library that turns any PDF into structured LaTeX + PDF output — powered by AI (Gemini & Groq) and Pandoc Lua templates.


How It Works

DocStream uses a 3-stage pipeline:

Stage 1 — EXTRACTION        Stage 2 — STRUCTURING       Stage 3 — RENDERING
─────────────────────       ──────────────────────       ───────────────────────────
                                                         
  PDF file                    List[Block]                  DocumentAST
     │                             │                            │
     ▼                             ▼                            ▼
  PDFExtractor               DocumentStructurer          DocumentRenderer
  (PyMuPDF)                  (Gemini Flash)              (Pandoc + XeLaTeX)
     │                       (Groq fallback)                    │
     │  font metadata,            │                             │  Lua writer
     │  bounding boxes,           │  JSON → AST                │  (report/ieee/resume)
     │  tables, OCR               │  validation                 │
     ▼                             ▼                            ▼
  List[Block]               DocumentAST                  .tex  +  .pdf

Stage 1 — Extraction (PDFExtractor)

  • Reads each PDF page with PyMuPDF
  • Extracts text blocks with font size, bold/italic flags, bounding boxes, and page numbers
  • Detects scanned PDFs (fewer than 100 extractable characters) and falls back to Tesseract OCR
  • Detects tables with find_tables() and converts them to Markdown
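The scan-detection heuristic above can be sketched in a few lines. `is_scanned` and its signature are illustrative here, not part of DocStream's public API:

```python
def is_scanned(page_texts: list[str], min_chars: int = 100) -> bool:
    """Stage 1 heuristic: if the whole document yields fewer than
    min_chars extractable characters, treat it as a scan and use OCR."""
    total = sum(len(t.strip()) for t in page_texts)
    return total < min_chars
```

In the real extractor the page texts come from PyMuPDF; the 100-character threshold matches the bullet above.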

Stage 2 — Structuring (DocumentStructurer)

  • Sends extracted blocks to Gemini 1.5 Flash (primary) or Groq Llama-3 (fallback)
  • Parses the AI JSON response into a validated DocumentAST
  • Retries with exponential backoff (2 retries per provider)
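A minimal sketch of the retry-and-fallback behaviour described above. `with_retries`, `structure_with_fallback`, and the provider callables are hypothetical names, not DocStream's API:

```python
import time

def with_retries(call, retries: int = 2, base_delay: float = 1.0):
    """Invoke `call`, retrying up to `retries` extra times with
    exponential backoff (base_delay, 2 * base_delay, ...)."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(base_delay * 2 ** attempt)

def structure_with_fallback(gemini_call, groq_call, base_delay: float = 1.0):
    """Try Gemini first; if all its retries fail, fall back to Groq."""
    try:
        return with_retries(gemini_call, base_delay=base_delay)
    except Exception:
        return with_retries(groq_call, base_delay=base_delay)
```

Each provider gets its own retry budget, so a transient Gemini outage degrades to Groq rather than failing the conversion outright.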

Stage 3 — Rendering (DocumentRenderer)

  • Converts DocumentAST to Pandoc JSON format
  • Runs pandoc -f json -t <template.lua> to generate LaTeX
  • Compiles with xelatex -interaction=nonstopmode (twice for cross-references)
  • Parses .log for ! error lines and surfaces them clearly
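The rendering step can be sketched as two helpers: one builds the commands the bullets describe, one scans the .log for TeX error lines. Function names, and any flags beyond those quoted above, are assumptions:

```python
def render_commands(tex_path: str, template: str) -> list[list[str]]:
    """Commands implied by the bullets: pandoc generates LaTeX from
    Pandoc JSON via a Lua writer, then xelatex runs twice so
    cross-references resolve."""
    json_path = tex_path.replace(".tex", ".json")  # hypothetical input path
    pandoc = ["pandoc", "-f", "json", "-t", f"{template}.lua",
              json_path, "-o", tex_path]
    xelatex = ["xelatex", "-interaction=nonstopmode", tex_path]
    return [pandoc, xelatex, xelatex]

def latex_errors(log_text: str) -> list[str]:
    """xelatex prefixes fatal errors with '!'; collect those lines."""
    return [line for line in log_text.splitlines() if line.startswith("!")]
```

Each command would then be run with something like `subprocess.run(cmd, check=True)`, with `latex_errors` applied to the .log when compilation fails.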

Architecture

docstream/
├── docstream/
│   ├── __init__.py           ← Public API: convert(), extract(), structure(), render()
│   ├── cli.py                ← CLI entry point (argparse)
│   ├── core/
│   │   ├── extractor.py      ← PDFExtractor (PyMuPDF + Tesseract OCR fallback)
│   │   ├── structurer.py     ← DocumentStructurer (Gemini Flash + Groq fallback)
│   │   └── renderer.py       ← DocumentRenderer (Pandoc + XeLaTeX)
│   ├── templates/
│   │   ├── report.lua        ← Pandoc Lua writer: academic report
│   │   ├── ieee.lua          ← Pandoc Lua writer: IEEE two-column
│   │   └── resume.lua        ← Pandoc Lua writer: compact resume
│   ├── models/
│   │   └── document.py       ← Pydantic models (DocumentAST, Block, ConversionResult…)
│   └── exceptions.py         ← Exception hierarchy
├── tests/                    ← pytest suite (64 tests)
├── pyproject.toml            ← uv-managed, ruff + mypy configured
└── Makefile                  ← make install / test / lint / docs
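The shapes of the core models follow from the pipeline. DocStream itself uses Pydantic, but a plain-dataclass sketch conveys the same idea; all field names here are assumptions based on the Stage 1 and Stage 2 descriptions:

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """One extracted text block (Stage 1 output)."""
    text: str
    page: int
    font_size: float = 11.0
    bold: bool = False
    italic: bool = False
    bbox: tuple[float, float, float, float] = (0.0, 0.0, 0.0, 0.0)

@dataclass
class DocumentAST:
    """Structured document (Stage 2 output)."""
    title: str
    sections: list[dict] = field(default_factory=list)
```

In the real library these would be Pydantic models, which gives the JSON validation mentioned in Stage 2 for free.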

Installation

# Recommended: using uv
uv add docstream

# Or using pip
pip install docstream

System dependencies

# Pandoc (required for LaTeX generation)
sudo apt install pandoc -y

# XeLaTeX (required for PDF compilation)
sudo apt install texlive-xetex texlive-latex-extra texlive-fonts-recommended -y

# Tesseract (optional — only needed for scanned PDFs)
sudo apt install tesseract-ocr -y

API keys

cp .env.example .env
# Edit .env:
#   GEMINI_API_KEY=your-gemini-key
#   GROQ_API_KEY=your-groq-key   (optional fallback)

Python API

One-liner conversion

from docstream import convert

result = convert("paper.pdf", template="ieee", output_dir="./out")
print(result.pdf_path)   # ./out/document.pdf
print(result.tex_path)   # ./out/document.tex

Step-by-step pipeline

from docstream import extract, structure, render

# Stage 1 — extract raw blocks from PDF
blocks = extract("paper.pdf")
print(f"Extracted {len(blocks)} blocks")

# Stage 2 — structure blocks into an AST with AI
ast = structure(blocks)
print(f"Title: {ast.title}, Sections: {len(ast.sections)}")

# Stage 3 — render AST to LaTeX + PDF
result = render(ast, template="report", output_dir="./out")
if result.success:
    print(f"PDF saved to {result.pdf_path}")
else:
    print(f"Rendering failed: {result.error}")

With explicit API keys

from docstream import extract, structure

blocks = extract("paper.pdf")
ast = structure(blocks, gemini_key="your-key", groq_key="your-groq-key")

Error handling

from docstream import convert
from docstream.exceptions import ExtractionError, StructuringError, RenderingError

try:
    result = convert("document.pdf", template="report")
except ExtractionError as e:
    print(f"Could not read PDF: {e}")
except StructuringError as e:
    print(f"AI structuring failed: {e}")
except RenderingError as e:
    print(f"LaTeX compilation failed: {e}")

Available templates

Name     Description
-------  --------------------------------------------
report   Academic report — article class, 1in margins
ieee     IEEE two-column conference format
resume   Clean resume — compact, no section numbers

CLI

Convert a PDF

# Full pipeline: PDF → LaTeX + PDF
docstream convert paper.pdf --template ieee --output ./out

# Short flags
docstream convert paper.pdf -t report -o ./output

Extract raw blocks

# Print extracted blocks as JSON to stdout
docstream extract paper.pdf

# Save to file
docstream extract paper.pdf --output blocks.json

List templates

docstream templates list

Version

docstream --version

Development

# Install all dependencies
make install

# Run tests
make test

# Lint + format check
make lint

# Auto-fix formatting
make format

# Type check
make typecheck

# All checks at once
make check

Contributing

Contributions are welcome! See CONTRIBUTING.md for setup, code style, and PR process.


License

MIT © 2024 DocStream Contributors

Download files

Source distribution: docstream-0.1.0.tar.gz (173.3 kB)
Built distribution: docstream-0.1.0-py3-none-any.whl (41.3 kB)

File details

Details for the file docstream-0.1.0.tar.gz.

File metadata

  • Download URL: docstream-0.1.0.tar.gz
  • Upload date:
  • Size: 173.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for docstream-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a703b49ce37b82d0af7ad2e858416b9f43730f8c423ac77af704bd1be9f78602
MD5 6a151e79688dd564a4e8b83a2ff28c8a
BLAKE2b-256 f6fd85550294a3d4bb1ae422aa63e0659460f3cc082a7c3766cd164aed713c4b

See more details on using hashes here.

File details

Details for the file docstream-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: docstream-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 41.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for docstream-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b82e0b68c7d5f8af0872b27c018c60cff5b79348126dcbc82607b9073e2c5d90
MD5 f9ac53e1393ed1f036a098e73c051b68
BLAKE2b-256 c691627866584ea12ae1a03a81527734b78a33c8c54f78a3c18a8533da7c809a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page