CLI for PDF text extraction using Marker's layout-aware pipeline
Marker OCR CLI
A command-line tool for OCR processing using Marker's layout-aware pipeline. Extract text, equations, tables, and figures from PDFs with high accuracy.
Installation
Requires Python 3.11+. A GPU is recommended.
pip install marker-ocr-cli
Or from source:
git clone https://github.com/r-uben/marker-ocr-cli.git
cd marker-ocr-cli
uv sync
Quick start
# Process a single file
marker-ocr paper.pdf
# Process a directory
marker-ocr ./papers/ -o ./results/
# Preview what would be processed (no model loading)
marker-ocr ./papers/ --dry-run
# Process specific pages
marker-ocr paper.pdf --pages 0-5
# Force OCR on all pages
marker-ocr paper.pdf --force-ocr
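For scripted use, the commands above can also be driven from Python. The following is a minimal sketch that only assembles and runs the flags documented in this README (`-o`, `--pages`, `--force-ocr`, `--dry-run`). The helper names `build_command` and `run_marker_ocr` are hypothetical and are not part of the package.

```python
import subprocess


def build_command(input_path, output_dir=None, pages=None,
                  force_ocr=False, dry_run=False):
    """Assemble a marker-ocr argument list from the documented options."""
    cmd = ["marker-ocr", str(input_path)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if pages is not None:
        cmd += ["--pages", pages]
    if force_ocr:
        cmd.append("--force-ocr")
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run_marker_ocr(input_path, **options):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_command(input_path, **options), check=True)
```

For example, `run_marker_ocr("./papers/", output_dir="./results/", dry_run=True)` would issue the same preview command shown above. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.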
Options
Usage: marker-ocr [OPTIONS] INPUT_PATH

Options:
  -o, --output-dir PATH  Output directory (default: <input_dir>/marker_ocr_output/)
  --pages TEXT           Page range (e.g., '0-5' or '1,3,5')
  --force-ocr            Force OCR on all pages regardless of embedded text
  --reprocess            Reprocess already-processed files
  --dry-run              List files without loading models
  -q, --quiet            Suppress all output except errors
  -v, --verbose          Enable verbose/debug output
  --info                 Show system and device info
  --version              Show version
  --help                 Show this message
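The `--pages` value accepts both ranges and comma-separated lists. To illustrate how such a spec expands, here is a hypothetical parser. It assumes ranges are inclusive on both ends and that page numbers are used exactly as written; neither is confirmed by this README, and this is not the parser `marker-ocr` itself uses.

```python
def parse_page_spec(spec):
    """Expand a spec such as '0-5' or '1,3,5' into a sorted list of page numbers.

    Assumption: ranges are inclusive on both ends.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

Under these assumptions, `'0-5'` selects six pages and `'1,3,5'` selects three. A helper like this can be useful for validating a page spec before launching a long run.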
Output structure
marker_ocr_output/
├── document_name/
│   ├── document_name.md    # OCR markdown (clean text only)
│   └── figures/            # extracted figures
│       ├── figure_1.png
│       └── figure_2.png
├── another_document/
│   └── ...
└── metadata.json           # processing stats, checksums, file list
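This layout is straightforward to consume from a script. The sketch below collects each document's markdown file and figure paths, and loads `metadata.json` without assuming anything about its keys, since its schema is not documented here. The function name `collect_outputs` is hypothetical.

```python
import json
from pathlib import Path


def collect_outputs(output_dir):
    """Return (documents, metadata) for a marker_ocr_output/ directory.

    documents maps each document folder name to its markdown path and a
    sorted list of figure paths. metadata is the parsed metadata.json,
    or None if the file is absent.
    """
    root = Path(output_dir)
    documents = {}
    for doc_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        markdown = doc_dir / f"{doc_dir.name}.md"
        if not markdown.is_file():
            continue  # not a processed-document folder
        figures_dir = doc_dir / "figures"
        figures = sorted(figures_dir.glob("*.png")) if figures_dir.is_dir() else []
        documents[doc_dir.name] = {"markdown": markdown, "figures": figures}

    metadata_path = root / "metadata.json"
    metadata = None
    if metadata_path.is_file():
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    return documents, metadata
```

Folders that lack a matching `<name>.md` file are skipped, so unrelated directories inside the output folder do not show up in the result.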
How it works
Marker uses a pipeline of specialized models rather than a single end-to-end model:
- Surya -- layout detection and reading order
- Surya OCR -- text recognition
- Texify -- equation detection and LaTeX conversion
This approach is faster and more accurate than single-model solutions, especially for academic papers with complex layouts, equations, and tables.
Development
# Install dev dependencies
uv sync --extra dev
# Run tests
uv run pytest
# Lint
uv run ruff check .
# Format
uv run ruff format .
# Type check
uv run mypy marker_ocr/ --ignore-missing-imports
Limitations
- Supported formats: PDF only (Marker processes PDFs natively)
- Models: need roughly 4-5 GB of VRAM and are downloaded automatically on first run
- GPU recommended for reasonable speed (supports CUDA and MPS)
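Before starting a large batch, it can help to check which accelerator is visible. The tool's own `--info` flag reports system and device details; the sketch below is an independent check that assumes the models run on PyTorch. The function name `pick_device` is hypothetical and does not reflect how `marker-ocr` selects a device internally.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so no accelerator is usable
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

If this prints `cpu`, expect processing to be considerably slower than on a CUDA or MPS device.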
License
MIT License - see LICENSE for details.