Professional document conversion library (PDF ↔ LaTeX)

DocStream

CI · License: MIT · Python 3.11+ · Code style: ruff

DocStream is a professional open-source document conversion library that turns any PDF into structured LaTeX + PDF output — powered by AI (Gemini & Groq) and Pandoc Lua templates.


How It Works

DocStream uses a 3-stage pipeline:

Stage 1 — EXTRACTION        Stage 2 — STRUCTURING       Stage 3 — RENDERING
─────────────────────       ──────────────────────       ───────────────────────────
                                                         
  PDF file                    List[Block]                  DocumentAST
     │                             │                            │
     ▼                             ▼                            ▼
  PDFExtractor               DocumentStructurer          DocumentRenderer
  (PyMuPDF)                  (Gemini Flash)              (Pandoc + XeLaTeX)
     │                       (Groq fallback)                    │
     │  font metadata,            │                             │  Lua writer
     │  bounding boxes,           │  JSON → AST                │  (report/ieee/resume)
     │  tables, OCR               │  validation                 │
     ▼                             ▼                            ▼
  List[Block]               DocumentAST                  .tex  +  .pdf

Stage 1 — Extraction (PDFExtractor)

  • Reads each PDF page with PyMuPDF
  • Extracts text blocks with font size, bold/italic flags, bounding boxes, and page numbers
  • Detects scanned PDFs (fewer than 100 extractable characters) and falls back to Tesseract OCR
  • Detects tables with find_tables() and converts them to Markdown
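The scan-detection heuristic above can be sketched in a few lines. `is_scanned` and its signature are illustrative here, not part of DocStream's public API:

```python
def is_scanned(page_texts: list[str], min_chars: int = 100) -> bool:
    """Stage 1 heuristic: if the whole document yields fewer than
    min_chars extractable characters, treat it as a scan and use OCR."""
    total = sum(len(t.strip()) for t in page_texts)
    return total < min_chars
```

In the real extractor the page texts come from PyMuPDF; the 100-character threshold matches the bullet above.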

Stage 2 — Structuring (DocumentStructurer)

  • Sends extracted blocks to Gemini 1.5 Flash (primary) or Groq Llama-3 (fallback)
  • Parses the AI JSON response into a validated DocumentAST
  • Retries with exponential backoff (2 retries per provider)
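A minimal sketch of the retry-and-fallback behaviour described above. `with_retries`, `structure_with_fallback`, and the provider callables are hypothetical names, not DocStream's API:

```python
import time

def with_retries(call, retries: int = 2, base_delay: float = 1.0):
    """Invoke `call`, retrying up to `retries` extra times with
    exponential backoff (base_delay, 2 * base_delay, ...)."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(base_delay * 2 ** attempt)

def structure_with_fallback(gemini_call, groq_call, base_delay: float = 1.0):
    """Try Gemini first; if all its retries fail, fall back to Groq."""
    try:
        return with_retries(gemini_call, base_delay=base_delay)
    except Exception:
        return with_retries(groq_call, base_delay=base_delay)
```

Each provider gets its own retry budget, so a transient Gemini outage degrades to Groq rather than failing the conversion outright.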

Stage 3 — Rendering (DocumentRenderer)

  • Converts DocumentAST to Pandoc JSON format
  • Runs pandoc -f json -t <template.lua> to generate LaTeX
  • Compiles with xelatex -interaction=nonstopmode (twice for cross-references)
  • Parses .log for ! error lines and surfaces them clearly
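The rendering step can be sketched as two helpers: one builds the commands the bullets describe, one scans the .log for TeX error lines. Function names, and any flags beyond those quoted above, are assumptions:

```python
def render_commands(tex_path: str, template: str) -> list[list[str]]:
    """Commands implied by the bullets: pandoc generates LaTeX from
    Pandoc JSON via a Lua writer, then xelatex runs twice so
    cross-references resolve."""
    json_path = tex_path.replace(".tex", ".json")  # hypothetical input path
    pandoc = ["pandoc", "-f", "json", "-t", f"{template}.lua",
              json_path, "-o", tex_path]
    xelatex = ["xelatex", "-interaction=nonstopmode", tex_path]
    return [pandoc, xelatex, xelatex]

def latex_errors(log_text: str) -> list[str]:
    """xelatex prefixes fatal errors with '!'; collect those lines."""
    return [line for line in log_text.splitlines() if line.startswith("!")]
```

Each command would then be run with something like `subprocess.run(cmd, check=True)`, with `latex_errors` applied to the .log when compilation fails.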

Architecture

docstream/
├── docstream/
│   ├── __init__.py           ← Public API: convert(), extract(), structure(), render()
│   ├── cli.py                ← CLI entry point (argparse)
│   ├── core/
│   │   ├── extractor.py      ← PDFExtractor (PyMuPDF + Tesseract OCR fallback)
│   │   ├── structurer.py     ← DocumentStructurer (Gemini Flash + Groq fallback)
│   │   └── renderer.py       ← DocumentRenderer (Pandoc + XeLaTeX)
│   ├── templates/
│   │   ├── report.lua        ← Pandoc Lua writer: academic report
│   │   ├── ieee.lua          ← Pandoc Lua writer: IEEE two-column
│   │   └── resume.lua        ← Pandoc Lua writer: compact resume
│   ├── models/
│   │   └── document.py       ← Pydantic models (DocumentAST, Block, ConversionResult…)
│   └── exceptions.py         ← Exception hierarchy
├── tests/                    ← pytest suite (64 tests)
├── pyproject.toml            ← uv-managed, ruff + mypy configured
└── Makefile                  ← make install / test / lint / docs
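The shapes of the core models follow from the pipeline. DocStream itself uses Pydantic, but a plain-dataclass sketch conveys the same idea; all field names here are assumptions based on the Stage 1 and Stage 2 descriptions:

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """One extracted text block (Stage 1 output)."""
    text: str
    page: int
    font_size: float = 11.0
    bold: bool = False
    italic: bool = False
    bbox: tuple[float, float, float, float] = (0.0, 0.0, 0.0, 0.0)

@dataclass
class DocumentAST:
    """Structured document (Stage 2 output)."""
    title: str
    sections: list[dict] = field(default_factory=list)
```

In the real library these would be Pydantic models, which gives the JSON validation mentioned in Stage 2 for free.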

Installation

# Recommended: using uv
uv add docstream

# Or using pip
pip install docstream

System dependencies

# Pandoc (required for LaTeX generation)
sudo apt install pandoc -y

# XeLaTeX (required for PDF compilation)
sudo apt install texlive-xetex texlive-latex-extra texlive-fonts-recommended -y

# Tesseract (optional — only needed for scanned PDFs)
sudo apt install tesseract-ocr -y

API keys

cp .env.example .env
# Edit .env:
#   GEMINI_API_KEY=your-gemini-key
#   GROQ_API_KEY=your-groq-key   (optional fallback)

Python API

One-liner conversion

from docstream import convert

result = convert("paper.pdf", template="ieee", output_dir="./out")
print(result.pdf_path)   # ./out/document.pdf
print(result.tex_path)   # ./out/document.tex

Step-by-step pipeline

from docstream import extract, structure, render

# Stage 1 — extract raw blocks from PDF
blocks = extract("paper.pdf")
print(f"Extracted {len(blocks)} blocks")

# Stage 2 — structure blocks into an AST with AI
ast = structure(blocks)
print(f"Title: {ast.title}, Sections: {len(ast.sections)}")

# Stage 3 — render AST to LaTeX + PDF
result = render(ast, template="report", output_dir="./out")
if result.success:
    print(f"PDF saved to {result.pdf_path}")
else:
    print(f"Rendering failed: {result.error}")

With explicit API keys

from docstream import extract, structure

blocks = extract("paper.pdf")
ast = structure(blocks, gemini_key="your-key", groq_key="your-groq-key")

Error handling

from docstream import convert
from docstream.exceptions import ExtractionError, StructuringError, RenderingError

try:
    result = convert("document.pdf", template="report")
except ExtractionError as e:
    print(f"Could not read PDF: {e}")
except StructuringError as e:
    print(f"AI structuring failed: {e}")
except RenderingError as e:
    print(f"LaTeX compilation failed: {e}")

Available templates

Name     Description
-------  --------------------------------------------
report   Academic report — article class, 1in margins
ieee     IEEE two-column conference format
resume   Clean resume — compact, no section numbers

CLI

Convert a PDF

# Full pipeline: PDF → LaTeX + PDF
docstream convert paper.pdf --template ieee --output ./out

# Short flags
docstream convert paper.pdf -t report -o ./output

Extract raw blocks

# Print extracted blocks as JSON to stdout
docstream extract paper.pdf

# Save to file
docstream extract paper.pdf --output blocks.json

List templates

docstream templates list

Version

docstream --version

Development

# Install all dependencies
make install

# Run tests
make test

# Lint + format check
make lint

# Auto-fix formatting
make format

# Type check
make typecheck

# All checks at once
make check

Contributing

Contributions are welcome! See CONTRIBUTING.md for setup, code style, and PR process.


License

MIT © 2024 DocStream Contributors

Download files

Source distribution: docstream-0.1.0.tar.gz (173.3 kB)
Built distribution: docstream-0.1.0-py3-none-any.whl (41.3 kB)

File details

Details for the file docstream-0.1.0.tar.gz.

File metadata

  • Download URL: docstream-0.1.0.tar.gz
  • Upload date:
  • Size: 173.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for docstream-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a703b49ce37b82d0af7ad2e858416b9f43730f8c423ac77af704bd1be9f78602
MD5 6a151e79688dd564a4e8b83a2ff28c8a
BLAKE2b-256 f6fd85550294a3d4bb1ae422aa63e0659460f3cc082a7c3766cd164aed713c4b

See more details on using hashes here.

File details

Details for the file docstream-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: docstream-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 41.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for docstream-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b82e0b68c7d5f8af0872b27c018c60cff5b79348126dcbc82607b9073e2c5d90
MD5 f9ac53e1393ed1f036a098e73c051b68
BLAKE2b-256 c691627866584ea12ae1a03a81527734b78a33c8c54f78a3c18a8533da7c809a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page