
CLI for PDF text extraction using Marker's layout-aware pipeline


Marker OCR CLI


A command-line tool for OCR processing using Marker's layout-aware pipeline. Extract text, equations, tables, and figures from PDFs with high accuracy.

Installation

Requires Python 3.11+. A GPU is recommended.

pip install marker-ocr-cli

Or from source:

git clone https://github.com/r-uben/marker-ocr-cli.git
cd marker-ocr-cli
uv sync

Quick start

# Process a single file
marker-ocr paper.pdf

# Process a directory
marker-ocr ./papers/ -o ./results/

# Preview what would be processed (no model loading)
marker-ocr ./papers/ --dry-run

# Process specific pages
marker-ocr paper.pdf --pages 0-5

# Force OCR on all pages
marker-ocr paper.pdf --force-ocr
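
For batch jobs driven from Python, the same invocations can be assembled programmatically. The following is a minimal sketch, not part of the package: `build_command` and `run` are hypothetical helper names, and only flags documented in this README are used. It assumes `marker-ocr` is on your `PATH`.

```python
import shlex
import subprocess
from pathlib import Path


def build_command(input_path, output_dir=None, pages=None, force_ocr=False, dry_run=False):
    """Assemble a marker-ocr argument list (hypothetical helper, documented flags only)."""
    cmd = ["marker-ocr", str(input_path)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if pages is not None:
        cmd += ["--pages", pages]
    if force_ocr:
        cmd.append("--force-ocr")
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run(cmd):
    """Execute the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the command here; call run(cmd) to actually execute it.
    cmd = build_command(Path("./papers"), output_dir=Path("./results"), dry_run=True)
    print(shlex.join(cmd))
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.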

Options

Usage: marker-ocr [OPTIONS] INPUT_PATH

Options:
  -o, --output-dir PATH           Output directory (default: <input_dir>/marker_ocr_output/)
  --pages TEXT                    Page range (e.g., '0-5' or '1,3,5')
  --force-ocr                     Force OCR on all pages regardless of embedded text

  --reprocess                     Reprocess already-processed files
  --dry-run                       List files without loading models
  -q, --quiet                     Suppress all output except errors
  -v, --verbose                   Enable verbose/debug output
  --info                          Show system and device info
  --version                       Show version
  --help                          Show this message
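
To make the `--pages` syntax concrete, here is a small sketch of how a spec such as '0-5' or '1,3,5' could be expanded into page indices. This is an illustration, not the tool's own parser: `parse_pages` is a hypothetical name, and treating ranges as inclusive on both ends is an assumption.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec like '0-5' or '1,3,5' into a sorted list of page indices.

    Assumption: ranges are inclusive on both ends ('0-5' covers six pages).
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("0-5"))
    print(parse_pages("1,3,5"))
```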

Output structure

marker_ocr_output/
├── document_name/
│   ├── document_name.md        # OCR markdown (clean text only)
│   └── figures/                # extracted figures
│       ├── figure_1.png
│       └── figure_2.png
├── another_document/
│   └── ...
└── metadata.json               # processing stats, checksums, file list
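
Downstream scripts can walk this layout directly. The sketch below assumes only the directory structure shown above; `summarize_output` is a hypothetical helper, and because the schema of `metadata.json` is not documented here, the file is loaded as-is without interpreting its fields.

```python
import json
from pathlib import Path


def summarize_output(root):
    """Map each document folder to its markdown file and figure names.

    Returns (documents, metadata); metadata is None if metadata.json is absent.
    """
    root = Path(root)
    documents = {}
    for doc_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        markdown = doc_dir / f"{doc_dir.name}.md"
        figures_dir = doc_dir / "figures"
        figures = sorted(f.name for f in figures_dir.glob("*.png")) if figures_dir.is_dir() else []
        documents[doc_dir.name] = {
            "markdown": markdown if markdown.is_file() else None,
            "figures": figures,
        }
    metadata_path = root / "metadata.json"
    metadata = None
    if metadata_path.is_file():
        # Schema is not documented in this README, so no keys are assumed.
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    return documents, metadata


if __name__ == "__main__":
    out = Path("marker_ocr_output")
    if out.is_dir():
        docs, meta = summarize_output(out)
        for name, info in docs.items():
            print(name, info["markdown"], len(info["figures"]))
```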

How it works

Marker uses a pipeline of specialized models rather than a single end-to-end model:

  • Surya -- layout detection and reading order
  • Surya OCR -- text recognition
  • Texify -- equation detection and LaTeX conversion

Splitting the work across specialized models is intended to be faster and more accurate than a single end-to-end model, particularly for academic papers with complex layouts, equations, and tables.

Development

# Install dev dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check
uv run mypy marker_ocr/ --ignore-missing-imports

Limitations

  • Supported formats: PDF only (Marker processes PDFs natively)
  • Models: need roughly 4-5 GB of VRAM; weights are downloaded automatically on first run
  • GPU recommended for reasonable speed (supports CUDA and MPS)
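
Before starting a long run, it can help to check which accelerator PyTorch can see. The snippet below is a generic sketch, not how marker-ocr itself selects a device (use `marker-ocr --info` for the tool's own report). `pick_device` is a hypothetical name, and the code assumes PyTorch may or may not be installed.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports as available."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no accelerator to query.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f"Detected device: {pick_device()}")
```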

License

MIT License - see LICENSE for details.
