Professional document conversion library (PDF ↔ LaTeX)
DocStream
DocStream is a professional open-source document conversion library that turns any PDF into structured LaTeX + PDF output — powered by AI (Gemini & Groq) and Pandoc Lua templates.
How It Works
DocStream uses a 3-stage pipeline:
Stage 1: EXTRACTION       Stage 2: STRUCTURING      Stage 3: RENDERING
─────────────────────     ──────────────────────    ───────────────────────────
PDF file                  List[Block]               DocumentAST
    │                         │                         │
    ▼                         ▼                         ▼
PDFExtractor              DocumentStructurer        DocumentRenderer
(PyMuPDF)                 (Gemini Flash)            (Pandoc + XeLaTeX)
    │                     (Groq fallback)               │
    │  font metadata,         │                         │  Lua writer
    │  bounding boxes,        │  JSON → AST             │  (report/ieee/resume)
    │  tables, OCR            │  validation             │
    ▼                         ▼                         ▼
List[Block]               DocumentAST               .tex + .pdf
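To make the data flow in the diagram concrete, here is a minimal sketch of the two intermediate types and a toy Stage 2. It is not DocStream's code: the real models are Pydantic classes, the field names below are assumptions based on the metadata listed for Stage 1, and `naive_structure` is a font-size heuristic standing in for the AI call purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    # Hypothetical fields mirroring the metadata Stage 1 is described as extracting.
    text: str
    page: int
    font_size: float
    bold: bool = False
    italic: bool = False
    bbox: Tuple[float, float, float, float] = (0.0, 0.0, 0.0, 0.0)


@dataclass
class Section:
    heading: str
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class DocumentAST:
    title: str
    sections: List[Section] = field(default_factory=list)


def naive_structure(blocks: List[Block]) -> DocumentAST:
    """Toy stand-in for Stage 2: largest font is the title, bold blocks are headings."""
    if not blocks:
        return DocumentAST(title="")
    title_block = max(blocks, key=lambda b: b.font_size)
    ast = DocumentAST(title=title_block.text)
    for block in blocks:
        if block is title_block:
            continue
        if block.bold:
            ast.sections.append(Section(heading=block.text))
        else:
            if not ast.sections:
                # Body text before the first heading goes into an untitled section.
                ast.sections.append(Section(heading=""))
            ast.sections[-1].paragraphs.append(block.text)
    return ast
```

The point of the sketch is the shape of the hand-off: Stage 1 produces a flat list of annotated blocks, and Stage 2 turns it into a tree that Stage 3 can render.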
Stage 1 — Extraction (PDFExtractor)
- Reads each PDF page with PyMuPDF
- Extracts text blocks with font size, bold/italic flags, bounding boxes, and page numbers
- Detects scanned PDFs (fewer than 100 characters of extractable text) and falls back to Tesseract OCR
- Detects tables with `find_tables()` and converts them to Markdown
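The last two bullets can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: `needs_ocr` and `table_to_markdown` are hypothetical names, the 100-character threshold is applied to the whole document here (the source does not say whether it is per page or per document), and the table rows are assumed to arrive as lists of cell strings with `None` for empty cells.

```python
from typing import Optional, Sequence

OCR_CHAR_THRESHOLD = 100  # taken from the description above; its scope is an assumption


def needs_ocr(page_texts: Sequence[str], threshold: int = OCR_CHAR_THRESHOLD) -> bool:
    """Treat the PDF as scanned when almost no text can be extracted."""
    total = sum(len(text.strip()) for text in page_texts)
    return total < threshold


def _cell(value: Optional[str]) -> str:
    # Empty cells become blanks; newlines and pipes would break a Markdown row.
    if value is None:
        return ""
    return str(value).replace("\n", " ").replace("|", "\\|").strip()


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows as a Markdown table, using the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    lines = ["| " + " | ".join(_cell(c) for c in header) + " |"]
    lines.append("|" + "|".join("---" for _ in header) + "|")
    for row in body:
        lines.append("| " + " | ".join(_cell(c) for c in row) + " |")
    return "\n".join(lines)
```

In a real extractor the page texts would come from PyMuPDF and the OCR branch would call Tesseract; both are left out so the sketch runs without external dependencies.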
Stage 2 — Structuring (DocumentStructurer)
- Sends extracted blocks to Gemini 1.5 Flash (primary) or Groq Llama-3 (fallback)
- Parses the AI JSON response into a validated `DocumentAST`
- Retries with exponential backoff (2 retries per provider)
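The retry-then-fallback behaviour described above can be sketched as follows. This is a hedged illustration rather than DocStream's code: `structure_with_fallback` and `parse_ast_json` are hypothetical names, each provider is modelled as a plain callable returning the raw response text, "2 retries" is read as one initial attempt plus two retries, and the one-second base delay is an assumption.

```python
import json
import time
from typing import Any, Callable, Dict, List, Sequence, Tuple


def parse_ast_json(raw: str) -> Dict[str, Any]:
    """Minimal validation: the response must be a JSON object with a title."""
    data = json.loads(raw)
    if not isinstance(data, dict) or "title" not in data:
        raise ValueError("response is not a JSON object with a 'title' field")
    return data


def structure_with_fallback(
    providers: Sequence[Tuple[str, Callable[[], str]]],
    retries: int = 2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Try each provider in order; retry each with exponentially growing delays."""
    errors: List[str] = []
    for name, call in providers:
        for attempt in range(retries + 1):
            try:
                return parse_ast_json(call())
            except Exception as exc:  # network errors and invalid JSON alike
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                if attempt < retries:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, ...
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Treating an invalid response as a retryable failure is a design choice of this sketch, not something the source states.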
Stage 3 — Rendering (DocumentRenderer)
- Converts `DocumentAST` to Pandoc JSON format
- Runs `pandoc -f json -t <template.lua>` to generate LaTeX
- Compiles with `xelatex -interaction=nonstopmode` (twice for cross-references)
- Parses `.log` for `!` error lines and surfaces them clearly
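The log-parsing step in the last bullet could look roughly like this. It is a sketch under assumptions: `extract_latex_errors` is a hypothetical name, and the code relies only on the usual TeX convention that error messages start with `!` and that the offending source line is reported shortly afterwards as `l.<number>`.

```python
from typing import List


def extract_latex_errors(log_text: str) -> List[str]:
    """Collect '!' error lines from a TeX log, with the source line when reported."""
    errors: List[str] = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("!"):
            continue
        message = line[1:].strip()
        # TeX usually prints "l.<number> <context>" a few lines after the error.
        for follow in lines[i + 1 : i + 6]:
            if follow.startswith("l."):
                message += f" (at {follow.split()[0]})"
                break
        errors.append(message)
    return errors
```

With `-interaction=nonstopmode` the compiler keeps going after an error, so a single log can contain several `!` lines. Returning a list rather than the first match lets the caller report all of them.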
Architecture
docstream/
├── docstream/
│ ├── __init__.py ← Public API: convert(), extract(), structure(), render()
│ ├── cli.py ← CLI entry point (argparse)
│ ├── core/
│ │ ├── extractor.py ← PDFExtractor (PyMuPDF + Tesseract OCR fallback)
│ │ ├── structurer.py ← DocumentStructurer (Gemini Flash + Groq fallback)
│ │ └── renderer.py ← DocumentRenderer (Pandoc + XeLaTeX)
│ ├── templates/
│ │ ├── report.lua ← Pandoc Lua writer: academic report
│ │ ├── ieee.lua ← Pandoc Lua writer: IEEE two-column
│ │ └── resume.lua ← Pandoc Lua writer: compact resume
│ ├── models/
│ │ └── document.py ← Pydantic models (DocumentAST, Block, ConversionResult…)
│ └── exceptions.py ← Exception hierarchy
├── tests/ ← pytest suite (64 tests)
├── pyproject.toml ← uv-managed, ruff + mypy configured
└── Makefile ← make install / test / lint / docs
Installation
# Recommended: using uv
uv add docstream
# Or using pip
pip install docstream
System dependencies
# Pandoc (required for LaTeX generation)
sudo apt install pandoc -y
# XeLaTeX (required for PDF compilation)
sudo apt install texlive-xetex texlive-latex-extra texlive-fonts-recommended -y
# Tesseract (optional — only needed for scanned PDFs)
sudo apt install tesseract-ocr -y
API keys
cp .env.example .env
# Edit .env:
# GEMINI_API_KEY=your-gemini-key
# GROQ_API_KEY=your-groq-key (optional fallback)
Python API
One-liner conversion
from docstream import convert
result = convert("paper.pdf", template="ieee", output_dir="./out")
print(result.pdf_path) # ./out/document.pdf
print(result.tex_path) # ./out/document.tex
Step-by-step pipeline
from docstream import extract, structure, render
# Stage 1 — extract raw blocks from PDF
blocks = extract("paper.pdf")
print(f"Extracted {len(blocks)} blocks")
# Stage 2 — structure blocks into an AST with AI
ast = structure(blocks)
print(f"Title: {ast.title}, Sections: {len(ast.sections)}")
# Stage 3 — render AST to LaTeX + PDF
result = render(ast, template="report", output_dir="./out")
if result.success:
    print(f"PDF saved to {result.pdf_path}")
else:
    print(f"Rendering failed: {result.error}")
With explicit API keys
from docstream import extract, structure
blocks = extract("paper.pdf")
ast = structure(blocks, gemini_key="your-key", groq_key="your-groq-key")
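If the keys live in the environment rather than in code, they can be read once and passed through the keyword arguments shown above. The helper below is a sketch: `keys_from_env` is a hypothetical name, and treating the Gemini key as required and the Groq key as optional is an assumption drawn from the `.env` comments.

```python
import os
from typing import Dict, Mapping, Optional


def keys_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Read API keys from the environment, keyed by the names structure() accepts."""
    env = os.environ if env is None else env
    gemini = env.get("GEMINI_API_KEY") or None
    groq = env.get("GROQ_API_KEY") or None  # optional fallback provider
    if gemini is None:
        # Assumption: the primary provider's key is mandatory.
        raise RuntimeError("GEMINI_API_KEY is not set")
    return {"gemini_key": gemini, "groq_key": groq}
```

The result could then be unpacked into the call, for example `structure(blocks, **keys_from_env())`, assuming `structure()` accepts `groq_key=None` when no fallback key is configured.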
Error handling
from docstream import convert
from docstream.exceptions import ExtractionError, StructuringError, RenderingError
try:
    result = convert("document.pdf", template="report")
except ExtractionError as e:
    print(f"Could not read PDF: {e}")
except StructuringError as e:
    print(f"AI structuring failed: {e}")
except RenderingError as e:
    print(f"LaTeX compilation failed: {e}")
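One way such a per-stage hierarchy can be organised is sketched below. The three exception names come from the import above. The shared base class `DocStreamError`, the `stage` attribute, and `describe_failure` are hypothetical and only illustrate the pattern, so check `docstream/exceptions.py` for the actual layout.

```python
class DocStreamError(Exception):
    """Hypothetical common base class; the real base name may differ."""

    stage = "unknown"


class ExtractionError(DocStreamError):
    stage = "extraction"


class StructuringError(DocStreamError):
    stage = "structuring"


class RenderingError(DocStreamError):
    stage = "rendering"


def describe_failure(exc: Exception) -> str:
    """Turn any pipeline exception into a one-line message naming the stage."""
    if isinstance(exc, DocStreamError):
        return f"{exc.stage} stage failed: {exc}"
    return f"unexpected error: {exc}"
```

A shared base class lets batch scripts catch every pipeline failure with a single `except` clause while still reporting which stage broke.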
Available templates
| Name | Description |
|---|---|
| `report` | Academic report: article class, 1in margins |
| `ieee` | IEEE two-column conference format |
| `resume` | Clean resume: compact, no section numbers |
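Each template name corresponds to a Lua writer file under `templates/`, which Pandoc accepts as the argument to `-t`. The sketch below shows how a name might be validated and turned into a Pandoc command line. `pandoc_command` is a hypothetical helper, and the `-o` output argument is an assumption about how the `.tex` file gets written.

```python
from pathlib import Path
from typing import List

TEMPLATES = ("report", "ieee", "resume")


def pandoc_command(template: str, templates_dir: Path, output_tex: Path) -> List[str]:
    """Build the argv for rendering Pandoc JSON through a Lua writer."""
    if template not in TEMPLATES:
        raise ValueError(
            f"unknown template {template!r}; expected one of {', '.join(TEMPLATES)}"
        )
    writer = templates_dir / f"{template}.lua"
    # Pandoc reads the JSON document from stdin when no input file is given.
    return ["pandoc", "-f", "json", "-t", str(writer), "-o", str(output_tex)]
```

Rejecting unknown names before invoking Pandoc gives a clearer error than a missing-file failure from the subprocess.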
CLI
Convert a PDF
# Full pipeline: PDF → LaTeX + PDF
docstream convert paper.pdf --template ieee --output ./out
# Short flags
docstream convert paper.pdf -t report -o ./output
Extract raw blocks
# Print extracted blocks as JSON to stdout
docstream extract paper.pdf
# Save to file
docstream extract paper.pdf --output blocks.json
List templates
docstream templates list
Version
docstream --version
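For readers curious how the commands above map onto `argparse`, here is a minimal parser that accepts the same invocations. It is a sketch, not the contents of `cli.py`: the default template, the default output directory, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="docstream")
    # 0.1.0 is the release number of the published package files.
    parser.add_argument("--version", action="version", version="%(prog)s 0.1.0")
    commands = parser.add_subparsers(dest="command", required=True)

    convert = commands.add_parser("convert", help="PDF to LaTeX + PDF")
    convert.add_argument("pdf")
    convert.add_argument(
        "-t", "--template",
        choices=["report", "ieee", "resume"],
        default="report",  # assumed default
    )
    convert.add_argument("-o", "--output", default=".")  # assumed default

    extract = commands.add_parser("extract", help="print extracted blocks as JSON")
    extract.add_argument("pdf")
    extract.add_argument("--output", default=None)  # None means write to stdout

    templates = commands.add_parser("templates", help="template utilities")
    template_commands = templates.add_subparsers(dest="templates_command", required=True)
    template_commands.add_parser("list", help="list available templates")
    return parser
```

For example, `build_parser().parse_args(["convert", "paper.pdf", "-t", "ieee"])` yields a namespace with `command="convert"` and `template="ieee"`.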
Development
# Install all dependencies
make install
# Run tests
make test
# Lint + format check
make lint
# Auto-fix formatting
make format
# Type check
make typecheck
# All checks at once
make check
Contributing
Contributions are welcome! See CONTRIBUTING.md for setup, code style, and PR process.
License
MIT © 2024 DocStream Contributors