PDF Transcriber

Convert math-heavy PDFs to Markdown using Marker OCR with optional LLM enhancement.

Installation

Recommended (isolated environment)

uv tool install pdf-transcriber

Alternative (pip)

pip install pdf-transcriber

Verify installation

pdf-transcriber-cli check

Three Ways to Use

| Interface | Command | Use Case |
| --- | --- | --- |
| CLI | `pdf-transcriber-cli transcribe paper.pdf` | Direct terminal usage |
| Skill | `/transcribe paper.pdf` | Claude Code slash command |
| MCP Server | `pdf-transcriber` | Claude Code background integration |

1. CLI (Direct Terminal Usage)

# Basic transcription
pdf-transcriber-cli transcribe ~/Downloads/paper.pdf

# High quality mode
pdf-transcriber-cli transcribe ~/Downloads/paper.pdf -q high-quality

# Disable LLM (faster, less accurate)
pdf-transcriber-cli transcribe ~/Downloads/paper.pdf --no-llm

# Skip automatic linting
pdf-transcriber-cli transcribe ~/Downloads/paper.pdf --no-lint

# Health check
pdf-transcriber-cli check
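
For batch jobs, the commands above can be driven from a script. The following is a minimal sketch that uses only the `transcribe` subcommand and the flags shown above; the helper names `build_command` and `transcribe_all` are illustrative and not part of the package, and it assumes that the flags can be combined and that the CLI exits with a non-zero status on failure.

```python
import subprocess
from pathlib import Path


def build_command(pdf, quality=None, use_llm=True, lint=True):
    """Build a pdf-transcriber-cli argument list from the flags shown above."""
    cmd = ["pdf-transcriber-cli", "transcribe", str(pdf)]
    if quality is not None:
        cmd += ["-q", quality]      # e.g. "high-quality"
    if not use_llm:
        cmd.append("--no-llm")      # Marker OCR only
    if not lint:
        cmd.append("--no-lint")     # skip automatic linting
    return cmd


def transcribe_all(folder, **options):
    """Run the CLI once per PDF in `folder` and return the PDFs that failed.

    Assumption: a non-zero exit status signals a failed transcription.
    """
    failed = []
    for pdf in sorted(Path(folder).expanduser().glob("*.pdf")):
        result = subprocess.run(build_command(pdf, **options))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling `transcribe_all("~/Downloads", quality="high-quality")` would then process every PDF in that folder one at a time.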

2. Claude Code Skill (Slash Command)

# Install the skill
pdf-transcriber-cli install-skill

# Then in Claude Code:
/transcribe ~/Downloads/paper.pdf

3. MCP Server (Claude Code Integration)

Note: This is a standard MCP (Model Context Protocol) server. While examples show Claude Code configuration, it works with any MCP-compatible agent orchestrator (Cursor, Cline, custom agents, etc.).

Add to ~/.claude/settings.json:

{
  "mcpServers": {
    "pdf-transcriber": {
      "command": "pdf-transcriber",
      "env": {
        "PDF_TRANSCRIBER_OUTPUT_DIR": "~/Documents/transcriptions"
      }
    }
  }
}
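
If the settings file already lists other MCP servers, the entry has to be merged in rather than pasted over them. The sketch below shows one way to do that on an already parsed settings dictionary; `add_pdf_transcriber` and the `other-server` entry are hypothetical names used only for illustration.

```python
import json


def add_pdf_transcriber(settings, output_dir):
    """Return a copy of `settings` with the pdf-transcriber MCP entry added."""
    merged = dict(settings)
    servers = dict(merged.get("mcpServers", {}))  # keep existing servers
    servers["pdf-transcriber"] = {
        "command": "pdf-transcriber",
        "env": {"PDF_TRANSCRIBER_OUTPUT_DIR": output_dir},
    }
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "other"}}}
updated = add_pdf_transcriber(existing, "~/Documents/transcriptions")
print(json.dumps(updated, indent=2))
```

The function leaves its input untouched, so the result can be inspected before anything is written back to disk.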

LLM-Enhanced OCR Setup

PDF Transcriber can use a local vision-language model (VLM) to significantly improve OCR accuracy, especially for:

Complex mathematical notation
Handwritten annotations
Low-quality scans
Tables and figures

Quick Start

1. Install Ollama: https://ollama.ai
2. Start Ollama:
```
ollama serve
```
3. Pull a vision model (in another terminal, since `ollama serve` runs in the foreground):
```
ollama pull qwen2.5vl:3b
```
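
Before transcribing, it can help to confirm that the server is reachable and the model has been pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models; the function names are illustrative, and the response shape (`{"models": [{"name": ...}]}`) is stated here as an assumption rather than taken from this project.

```python
import json
import urllib.request


def model_available(tags_response, model):
    """Check whether `model` appears in a parsed /api/tags response."""
    names = [m.get("name", "") for m in tags_response.get("models", [])]
    return model in names


def fetch_tags(base_url="http://localhost:11434", timeout=5):
    """Ask a running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Usage (requires a running server):
#   model_available(fetch_tags(), "qwen2.5vl:3b")
```

`pdf-transcriber-cli check` performs its own Ollama health check; this sketch is only meant for scripts that need the same information.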

LLM enhancement is enabled by default. To disable:

# CLI
pdf-transcriber-cli transcribe paper.pdf --no-llm

# Environment variable
export PDF_TRANSCRIBER_USE_LLM=false

Recommended Vision Models

| Model | Size | RAM Required | Quality | Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| `qwen2.5vl:3b` | 3.2 GB | 8 GB | Good | Fast | Default - laptops, CI |
| `qwen2.5vl:7b` | 5.5 GB | 16 GB | Better | Medium | Workstations |
| `qwen3-vl:4b` | 3.5 GB | 10 GB | Best (newest) | Medium | Best quality/size |

Important: Only vision models (VLMs) work. Text-only models like llama3 won't process images.

Choosing a Model

8GB RAM / M1 MacBook: qwen2.5vl:3b (default)
16GB RAM / M2/M3 Pro: qwen2.5vl:7b or qwen3-vl:4b
24GB+ / NVIDIA GPU: llava:13b or larger
CI/Automated pipelines: qwen2.5vl:3b or disable LLM (--no-llm)
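
The RAM guidance above can be written as a small lookup. This is a sketch of the list, not logic taken from the package; `pick_model` is a hypothetical helper, and returning `None` below 8 GB (to fall back to `--no-llm`) is an assumption.

```python
def pick_model(ram_gb):
    """Map available RAM in GB to a model tag, following the list above."""
    if ram_gb >= 24:
        return "llava:13b"
    if ram_gb >= 16:
        return "qwen2.5vl:7b"   # qwen3-vl:4b is the listed alternative
    if ram_gb >= 8:
        return "qwen2.5vl:3b"   # the default model
    return None                 # assumption: too little RAM, use --no-llm
```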

To use a different model:

# Pull the model first
ollama pull qwen3-vl:4b

# Then select it via environment variable
export PDF_TRANSCRIBER_OLLAMA_MODEL=qwen3-vl:4b

Without LLM Enhancement

If you don't want to run a local LLM:

pdf-transcriber-cli transcribe paper.pdf --no-llm

This uses Marker OCR alone, which still works well for clean, typeset PDFs.

Configuration

All settings can be configured via environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| `PDF_TRANSCRIBER_OUTPUT_DIR` | Where transcriptions are saved | `./transcriptions` |
| `PDF_TRANSCRIBER_QUALITY` | fast, balanced, high-quality | `balanced` |
| `PDF_TRANSCRIBER_USE_GPU` | Enable GPU acceleration | Auto-detected |
| `PDF_TRANSCRIBER_USE_LLM` | Enable LLM-enhanced OCR | `true` |
| `PDF_TRANSCRIBER_OLLAMA_URL` | Ollama server URL | `http://localhost:11434` |
| `PDF_TRANSCRIBER_OLLAMA_MODEL` | Vision model for OCR | `qwen2.5vl:3b` |
| `PDF_TRANSCRIBER_CHUNK_SIZE` | Pages per chunk (large PDFs) | `25` |
| `PDF_TRANSCRIBER_AUTO_CHUNK_THRESHOLD` | Auto-chunk above this page count | `100` |
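
To illustrate how these variables and defaults fit together, here is a minimal sketch that reads them into a settings object and derives page chunks from the two chunking values. It is not the package's own configuration code: `Settings`, `load_settings`, and `chunk_ranges` are hypothetical names, the boolean parsing is an assumption, and `PDF_TRANSCRIBER_USE_GPU` is left out because its default is auto-detected.

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    output_dir: str
    quality: str
    use_llm: bool
    ollama_url: str
    ollama_model: str
    chunk_size: int
    auto_chunk_threshold: int


def load_settings(env=None):
    """Read PDF_TRANSCRIBER_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env

    def get(name, default):
        return env.get("PDF_TRANSCRIBER_" + name, default)

    return Settings(
        output_dir=get("OUTPUT_DIR", "./transcriptions"),
        quality=get("QUALITY", "balanced"),
        # Assumption: only the string "true" (any case) enables the LLM.
        use_llm=get("USE_LLM", "true").strip().lower() == "true",
        ollama_url=get("OLLAMA_URL", "http://localhost:11434"),
        ollama_model=get("OLLAMA_MODEL", "qwen2.5vl:3b"),
        chunk_size=int(get("CHUNK_SIZE", "25")),
        auto_chunk_threshold=int(get("AUTO_CHUNK_THRESHOLD", "100")),
    )


def chunk_ranges(page_count, settings):
    """Split a document into 1-based, inclusive (start, end) page ranges."""
    if page_count <= settings.auto_chunk_threshold:
        return [(1, page_count)]    # at or below the threshold: one pass
    size = settings.chunk_size
    return [(start, min(start + size - 1, page_count))
            for start in range(1, page_count + 1, size)]
```

With the defaults, a 100-page PDF stays in one piece, while a 101-page PDF is split into four chunks of 25 pages and a final one-page chunk.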

CLI Commands

| Command | Description |
| --- | --- |
| `transcribe <pdf>` | Transcribe a PDF to Markdown |
| `check` | Health check (config, paths, Ollama) |
| `install-skill` | Install Claude Code `/transcribe` skill |

MCP Tools

When running as an MCP server, these tools are available:

| Tool | Description |
| --- | --- |
| `transcribe_pdf` | Convert PDF to Markdown |
| `clear_transcription_cache` | Free ~2GB memory from cached OCR models |
| `lint_paper` | Fix common OCR artifacts |

MCP Server vs CLI + Skill: When to Use What

Context Usage Comparison

| Approach | Context Overhead | Best For |
| --- | --- | --- |
| MCP Server | ~1,200 tokens (3 tools) | Frequent transcription, linting workflows |
| CLI + Skill | ~200 tokens (skill definition only) | Occasional use, context-constrained sessions |
| CLI only | 0 tokens | Automation, CI pipelines |

Recommendation

Frequent transcription: use the MCP Server, so the tools are always available
Occasional transcription: use CLI + Skill for minimal context overhead
CI/CD pipelines: use the CLI only, with no dependency on an agent orchestrator

Quality Presets

| Preset | DPI | Resolution | Use Case |
| --- | --- | --- | --- |
| `fast` | 100 | ~1275×1650px | Quick previews, simple documents |
| `balanced` | 150 | ~1913×2475px | Default - best quality/speed |
| `high-quality` | 200 | ~2550×3300px | Complex math, small text |
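
Pixel dimensions follow from page size times rendering density. As a worked check, assuming a US Letter page (8.5 × 11 in), the resolutions listed above match a render at 1.5 times the listed DPI. That factor is inferred from the numbers in the table and is not a documented setting; `page_pixels` is a hypothetical helper.

```python
import math

LETTER_INCHES = (8.5, 11.0)  # assumption: US Letter page


def page_pixels(dpi, scale=1.0, size_inches=LETTER_INCHES):
    """Pixel dimensions of a page rendered at dpi * scale, rounded half up."""
    return tuple(math.floor(side * dpi * scale + 0.5) for side in size_inches)


for preset, dpi in [("fast", 100), ("balanced", 150), ("high-quality", 200)]:
    print(preset, page_pixels(dpi, scale=1.5))
```

For other page sizes the pixel counts scale accordingly, so the table is best read as approximate.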

Linting

Transcriptions are linted automatically to fix common OCR artifacts. The original (pre-lint) version is saved as {name}.original.md.

Available Lint Rules

Markdown Structure Rules

| Rule | Auto-Fix | Description |
| --- | --- | --- |
| `excessive_blank_lines` | ✅ | Collapses runs of more than 2 consecutive blank lines |
| `trailing_whitespace` | ✅ | Removes spaces/tabs at end of lines |
| `leading_whitespace` | ✅ | Fixes inconsistent leading whitespace |
| `header_whitespace` | ✅ | Normalizes spacing around headers |
| `sparse_table_row` | ⚠️ | Warns about table rows with more than 50% empty cells |
| `orphaned_list_marker` | ⚠️ | Warns about list markers with no content |
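
To make the first two rules concrete, here is a rough sketch of what such fixes can look like with regular expressions. It is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, and collapsing to exactly two blank lines is an assumed target.

```python
import re


def strip_trailing_whitespace(text):
    """Remove spaces and tabs at the end of every line."""
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


def collapse_blank_lines(text, max_blank=2):
    """Reduce runs of more than `max_blank` blank lines to `max_blank`."""
    # N blank lines between two text lines appear as N + 1 newlines in a row.
    pattern = r"\n{%d,}" % (max_blank + 2)
    return re.sub(pattern, "\n" * (max_blank + 1), text)
```

Running the whitespace pass first matters, because a line holding only spaces is not seen as blank by the second function.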

PDF Artifact Rules

| Rule | Auto-Fix | Description |
| --- | --- | --- |
| `page_number` | ✅ | Removes standalone page numbers like "42" |
| `page_marker` | ✅ | Removes page break markers |
| `orphaned_label` | ✅ | Removes orphaned LaTeX labels like `def:Tilt` |
| `hyphenation_artifact` | ✅ | Rejoins words split across lines (`hy-\nphenated`) |
| `html_artifacts` | ✅ | Converts HTML tags to markdown equivalents |
| `html_math_notation` | ✅ | Converts `<sup>2</sup>` to `$^2$` in math context |
| `footnote_spacing` | ✅ | Fixes spacing around footnote markers |
| `malformed_footnote` | ⚠️ | Warns about malformed footnote references |
| `garbled_text` | ⚠️ | Warns about corrupted/nonsense text fragments |
| `repeated_line` | ⚠️ | Warns about likely running headers/footers |

Math Notation Rules

| Rule | Auto-Fix | Description |
| --- | --- | --- |
| `unicode_math_symbols` | ✅ | Converts Unicode math (α, →, ∈) to LaTeX (`\alpha`, `\to`, `\in`) |
| `unwrapped_math_expressions` | ✅ | Wraps bare math expressions in `$...$` |
| `broken_math_delimiters` | ✅ | Fixes unbalanced `$` delimiters |
| `space_in_math_variable` | ✅ | Removes spaces in variable names (`x _1` → `x_1`) |
| `display_math_whitespace` | ✅ | Normalizes whitespace around `$$...$$` blocks |
| `repetition_hallucination` | ⚠️ | Warns about repeated sequences (OCR hallucination) |
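
The following sketch shows the kind of transformation three of these rules describe, using only the examples given in the table. It is an approximation with hypothetical function names: the real rules presumably restrict themselves to math context, which this version does not attempt, and the symbol map covers just the three symbols listed.

```python
import re

UNICODE_TO_LATEX = {"α": r"\alpha", "→": r"\to", "∈": r"\in"}
_SYMBOLS = re.compile("|".join(map(re.escape, UNICODE_TO_LATEX)))


def convert_unicode_math(text):
    """Replace mapped Unicode symbols with LaTeX commands."""
    def replace(match):
        command = UNICODE_TO_LATEX[match.group()]
        following = text[match.end():match.end() + 1]
        # Add a space only when a letter follows, so the command name
        # does not run into it (for example "\in A", not "\inA").
        return command + (" " if following.isalpha() else "")
    return _SYMBOLS.sub(replace, text)


def fix_variable_spacing(text):
    """Remove the stray space in subscripts such as 'x _1'."""
    return re.sub(r"(\w) +_(\w)", r"\1_\2", text)


def has_balanced_dollars(text):
    """Report whether the number of unescaped '$' characters is even."""
    return len(re.findall(r"(?<!\\)\$", text)) % 2 == 0
```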

Running Specific Rules

To run only specific lint rules (via MCP or programmatically):

# Run only math-related rules
lint_paper(paper_path, rules=["unicode_math_symbols", "broken_math_delimiters"])

# Run only whitespace cleanup
lint_paper(paper_path, rules=["excessive_blank_lines", "trailing_whitespace"])

# Preview issues without fixing
lint_paper(paper_path, fix=False)

Customizing for Your Workflow

If you're seeing specific patterns in your PDFs, you can run targeted lint passes:

For math-heavy papers:

lint_paper(path, rules=[
    "unicode_math_symbols",
    "unwrapped_math_expressions",
    "broken_math_delimiters",
    "space_in_math_variable"
])

For scanned books with page numbers:

lint_paper(path, rules=[
    "page_number",
    "page_marker",
    "repeated_line",  # catches running headers
    "hyphenation_artifact"
])

For cleaning up whitespace only:

lint_paper(path, rules=[
    "excessive_blank_lines",
    "trailing_whitespace",
    "display_math_whitespace"
])

Adding Custom Lint Rules

If you're seeing specific patterns in your PDFs that aren't caught by existing rules, you can add custom rules.

Rules are generator functions that take the file content and yield LintIssue objects:

# my_rules.py
import re
from pdf_transcriber.core.linter.models import LintIssue, Severity, Fix

def my_custom_rule(content: str):
    """
    Detect and fix a specific pattern in your PDFs.

    Rules are generators that yield LintIssue objects.
    """
    pattern = re.compile(r'PATTERN_TO_MATCH')

    for match in pattern.finditer(content):
        line_num = content[:match.start()].count('\n') + 1

        yield LintIssue(
            rule="my_custom_rule",
            severity=Severity.AUTO_FIX,  # or WARNING for manual review
            line=line_num,
            message="Description of the issue",
            fix=Fix(
                old=match.group(),
                new="replacement text"
            )
        )

To register your rule, add it to rules/__init__.py:

# In RULES dict (import my_rules at the top of the file first):
"my_custom_rule": my_rules.my_custom_rule,

# If auto-fixable, add to DEFAULT_AUTO_FIX:
DEFAULT_AUTO_FIX.add("my_custom_rule")

Severity levels:

| Level | Use Case |
| --- | --- |
| `Severity.AUTO_FIX` | Safe to fix automatically (provide a `Fix`) |
| `Severity.WARNING` | Needs human review |
| `Severity.ERROR` | Must be addressed before use |
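
To try the rule pattern without the package installed, the sketch below fills in the template with a concrete pattern. The `Severity`, `Fix`, and `LintIssue` classes here are local stand-ins that mirror only the fields used in the template above, so the real classes may differ; `ligature_rule` and its ligature example are hypothetical.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, Optional


# Stand-ins for pdf_transcriber.core.linter.models, for illustration only.
class Severity(Enum):
    AUTO_FIX = "auto_fix"
    WARNING = "warning"
    ERROR = "error"


@dataclass
class Fix:
    old: str
    new: str


@dataclass
class LintIssue:
    rule: str
    severity: Severity
    line: int
    message: str
    fix: Optional[Fix] = None


LIGATURES = {"ﬁ": "fi", "ﬂ": "fl"}


def ligature_rule(content: str) -> Iterator[LintIssue]:
    """Yield one auto-fixable issue per typographic ligature in the text."""
    for match in re.finditer("[ﬁﬂ]", content):
        yield LintIssue(
            rule="ligature_rule",
            severity=Severity.AUTO_FIX,
            # 1-based line number: newlines before the match, plus one.
            line=content[:match.start()].count("\n") + 1,
            message=f"Ligature {match.group()!r} found",
            fix=Fix(old=match.group(), new=LIGATURES[match.group()]),
        )


issues = list(ligature_rule("A ﬁne result.\nThe ﬂow is smooth."))
for issue in issues:
    print(issue.line, issue.message, "->", issue.fix.new)
```

Because the rule is a generator over plain text, it can be tested on short strings like this before it is registered.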

Disabling Automatic Linting

To skip linting during transcription:

# CLI
pdf-transcriber-cli transcribe paper.pdf --no-lint

# MCP tool
transcribe_pdf(pdf_path, lint=False)

You can then lint manually later with `lint_paper`, passing only the rules you want.

License

MIT

Contributing

Issues and PRs welcome at https://github.com/AugustSchmidt/pdf-transcriber

Version 1.0.0, released Feb 4, 2026

