CLI for PDF text extraction using Marker's layout-aware pipeline

Marker OCR CLI

A command-line tool for OCR processing using Marker's layout-aware pipeline. Extract text, equations, tables, and figures from PDFs with high accuracy.

Installation

Requires Python 3.11+. A GPU is recommended.

pip install marker-ocr-cli

Or from source:

git clone https://github.com/r-uben/marker-ocr-cli.git
cd marker-ocr-cli
uv sync

Quick start

# Process a single file
marker-ocr paper.pdf

# Process a directory
marker-ocr ./papers/ -o ./results/

# Preview what would be processed (no model loading)
marker-ocr ./papers/ --dry-run

# Process specific pages
marker-ocr paper.pdf --pages 0-5

# Force OCR on all pages
marker-ocr paper.pdf --force-ocr
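
For scripted use, the commands above can be assembled and run from Python. The following is a minimal sketch that only combines flags documented in this README; `build_command` and `run` are hypothetical helper names, not part of the package.

```python
from __future__ import annotations

import shlex
import subprocess
from pathlib import Path


def build_command(
    input_path: str | Path,
    output_dir: str | Path | None = None,
    pages: str | None = None,
    force_ocr: bool = False,
    dry_run: bool = False,
) -> list[str]:
    """Assemble a marker-ocr argument list from the documented flags."""
    # Input path first, then options, mirroring the Quick start examples.
    cmd = ["marker-ocr", str(input_path)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if pages is not None:
        cmd += ["--pages", pages]
    if force_ocr:
        cmd.append("--force-ocr")
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run(cmd: list[str]) -> int:
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True).returncode


if __name__ == "__main__":
    cmd = build_command("./papers/", output_dir="./results/", dry_run=True)
    # Print the command instead of running it, so this sketch has no side effects.
    print(shlex.join(cmd))
```

Calling `run(cmd)` executes the tool and requires `marker-ocr` to be on the PATH.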

Options

Usage: marker-ocr [OPTIONS] INPUT_PATH

Options:
  -o, --output-dir PATH           Output directory (default: <input_dir>/marker_ocr_output/)
  --pages TEXT                    Page range (e.g., '0-5' or '1,3,5')
  --force-ocr                     Force OCR on all pages regardless of embedded text
  --reprocess                     Reprocess already-processed files
  --dry-run                       List files without loading models
  -q, --quiet                     Suppress all output except errors
  -v, --verbose                   Enable verbose/debug output
  --info                          Show system and device info
  --version                       Show version
  --help                          Show this message
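
To illustrate the `--pages` syntax, here is a sketch of how such a value could be expanded into individual page indices. It assumes ranges are inclusive and that indices are used exactly as written; this README confirms neither, and `parse_pages` is a hypothetical helper rather than the tool's own parser.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as '0-5' or '1,3,5' into a sorted list of indices.

    Assumption: ranges are inclusive on both ends.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("0-5"))    # [0, 1, 2, 3, 4, 5]
    print(parse_pages("1,3,5"))  # [1, 3, 5]
```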

Output structure

marker_ocr_output/
├── document_name/
│   ├── document_name.md        # OCR markdown (clean text only)
│   └── figures/                # extracted figures
│       ├── figure_1.png
│       └── figure_2.png
├── another_document/
│   └── ...
└── metadata.json               # processing stats, checksums, file list
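
The layout above is straightforward to consume from a script. The sketch below walks an output directory and collects each document's markdown file and figure names. It relies only on the directory structure shown here; the function names are hypothetical, and because the schema of `metadata.json` is not documented in this README, the file is loaded without assuming any keys.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def summarize_output(output_dir: str | Path) -> dict[str, dict[str, Any]]:
    """Map each document folder to its markdown path and figure file names."""
    root = Path(output_dir)
    summary: dict[str, dict[str, Any]] = {}
    for doc_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        markdown = doc_dir / f"{doc_dir.name}.md"
        figures_dir = doc_dir / "figures"
        figures = (
            sorted(f.name for f in figures_dir.glob("*.png"))
            if figures_dir.is_dir()
            else []
        )
        summary[doc_dir.name] = {
            "markdown": markdown if markdown.is_file() else None,
            "figures": figures,
        }
    return summary


def load_metadata(output_dir: str | Path) -> Any:
    """Return the parsed metadata.json, or None if the file is absent."""
    path = Path(output_dir) / "metadata.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    out = Path("marker_ocr_output")
    if out.is_dir():
        for name, info in summarize_output(out).items():
            print(name, info["markdown"], len(info["figures"]))
```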

How it works

Marker uses a pipeline of specialized models rather than a single end-to-end model:

Surya -- layout detection and reading order
Surya OCR -- text recognition
Texify -- equation detection and LaTeX conversion

This approach is faster and more accurate than single-model solutions, especially for academic papers with complex layouts, equations, and tables.

Development

# Install dev dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check
uv run mypy marker_ocr/ --ignore-missing-imports

Limitations

Supported formats: PDF only (Marker processes PDFs natively)
Models: require ~4-5 GB of VRAM and are downloaded automatically on first run
Hardware: GPU recommended for reasonable speed (CUDA and MPS are supported)
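
Before processing a large batch, it can help to check which accelerator is available. The sketch below assumes PyTorch is installed in the same environment and falls back to CPU otherwise; `pick_device` is a hypothetical helper, and this is not necessarily how `marker-ocr --info` gathers its data.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports.

    Assumption: PyTorch is the backend in use; if it cannot be imported,
    the function reports 'cpu'.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```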

License

MIT License - see LICENSE for details.

Release history

0.2.0 -- Mar 12, 2026
0.1.0 -- Mar 12, 2026 (this version)
