CLI for PDF text extraction using Marker's layout-aware pipeline

Marker OCR CLI

A command-line tool for OCR processing using Marker's layout-aware pipeline. Extract text, equations, tables, and figures from PDFs with high accuracy.

Installation

Requires Python 3.11+. A GPU is recommended but not required.

pip install marker-ocr-cli

Or from source:

git clone https://github.com/r-uben/marker-ocr-cli.git
cd marker-ocr-cli
uv sync

Quick start

# Process a single file
marker-ocr paper.pdf

# Process a directory
marker-ocr ./papers/ -o ./results/

# Preview what would be processed (no model loading)
marker-ocr ./papers/ --dry-run

# Process specific pages
marker-ocr paper.pdf --pages 0-5

# Force OCR on all pages
marker-ocr paper.pdf --force-ocr
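
When batch runs are driven from a Python script, the invocations above can be assembled programmatically. The following is a minimal sketch, not part of the package: the helper name `build_command` is hypothetical, it assumes `marker-ocr` is on `PATH`, and it uses only flags documented in this README.

```python
from __future__ import annotations

from pathlib import Path


def build_command(
    input_path: str | Path,
    output_dir: str | Path | None = None,
    pages: str | None = None,
    force_ocr: bool = False,
    dry_run: bool = False,
) -> list[str]:
    """Assemble a marker-ocr argument list (hypothetical helper)."""
    cmd = ["marker-ocr", str(input_path)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if pages is not None:
        cmd += ["--pages", pages]
    if force_ocr:
        cmd.append("--force-ocr")
    if dry_run:
        cmd.append("--dry-run")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.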

Options

Usage: marker-ocr [OPTIONS] INPUT_PATH

Options:
  -o, --output-dir PATH           Output directory (default: <input_dir>/marker_ocr_output/)
  --pages TEXT                    Page range (e.g., '0-5' or '1,3,5')
  --force-ocr                     Force OCR on all pages regardless of embedded text

  --device [auto|cpu|cuda|mps]    Inference device (default: cpu on Apple Silicon)
  --reprocess                     Reprocess already-processed files
  --dry-run                       List files without loading models
  -q, --quiet                     Suppress all output except errors
  -v, --verbose                   Enable verbose/debug output
  --info                          Show system and device info
  --version                       Show version
  --help                          Show this message
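
The `--pages` option accepts either a range or a comma-separated list. As an illustration of how such a spec expands into page indices, here is a small sketch. It is not the tool's actual parser: the function name `parse_pages` is hypothetical, and it assumes ranges are inclusive on both ends.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as '0-5' or '1,3,5' into sorted page indices.

    Assumption: ranges are inclusive, so '0-5' covers six pages.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

Under that assumption, `parse_pages("0-5")` yields the indices 0 through 5 and `parse_pages("1,3,5")` yields exactly those three pages.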

Output structure

marker_ocr_output/
├── document_name/
│   ├── document_name.md        # OCR markdown (clean text only)
│   └── figures/                # extracted figures
│       ├── figure_1.png
│       └── figure_2.png
├── another_document/
│   └── ...
└── metadata.json               # processing stats, checksums, file list
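
Downstream scripts often need to walk this tree. The sketch below collects each document's markdown file and figures, assuming the layout shown above (`<name>/<name>.md` plus an optional `figures/` folder of PNG files). The function name `collect_outputs` is hypothetical, and the sketch does not read `metadata.json` because its schema is not documented here.

```python
from __future__ import annotations

from pathlib import Path


def collect_outputs(output_dir: str | Path) -> dict[str, dict]:
    """Map each document name to its markdown file and extracted figures."""
    results: dict[str, dict] = {}
    for doc_dir in sorted(Path(output_dir).iterdir()):
        markdown = doc_dir / f"{doc_dir.name}.md"
        if not doc_dir.is_dir() or not markdown.is_file():
            continue  # skips metadata.json and folders without markdown
        figures_dir = doc_dir / "figures"
        figures = sorted(figures_dir.glob("*.png")) if figures_dir.is_dir() else []
        results[doc_dir.name] = {"markdown": markdown, "figures": figures}
    return results
```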

How it works

Marker uses a pipeline of specialized models rather than a single end-to-end model:

  • Surya -- layout detection and reading order
  • Surya OCR -- text recognition
  • Texify -- equation detection and LaTeX conversion

Splitting the work across specialized models is intended to be faster and more accurate than a single end-to-end model, especially on academic papers with complex layouts, equations, and tables.

Development

# Install dev dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check
uv run mypy marker_ocr/ --ignore-missing-imports

Limitations

  • Supported formats: PDF only (Marker processes PDFs natively)
  • Models: need ~4-5 GB of VRAM; model weights are downloaded automatically on first run
  • GPU recommended for reasonable speed (supports CUDA and MPS)
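
To decide which `--device` value makes sense on a given machine, PyTorch can be queried directly. This is a rough sketch and not the CLI's own detection logic (the `--info` flag reports system and device details). The function name `available_devices` is hypothetical, and the sketch falls back to CPU only when PyTorch is not importable.

```python
def available_devices() -> list[str]:
    """Return the --device values that PyTorch reports as usable here."""
    devices = ["cpu"]
    try:
        import torch
    except ImportError:
        return devices  # PyTorch missing: only CPU can be reported
    if torch.cuda.is_available():
        devices.append("cuda")
    # torch.backends.mps is absent on older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        devices.append("mps")
    return devices
```

Availability alone does not guarantee enough memory for the models, so the VRAM figure above still applies.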

License

MIT License - see LICENSE for details.
