CLI utility to convert PDFs and supported document formats to Markdown/JSON/HTML with Marker API.
PDF to markdown CLI
Command-line utility for converting PDFs and other supported documents into Markdown, JSON, or HTML using the Marker API.
Why use this tool
- Converts single files or entire directories
- Automatically splits large PDFs into chunks and merges results
- Persists request state locally so interrupted runs can recover
- Rewrites and copies extracted images into deterministic output folders
- Supports OCR/LLM tuning flags from the Marker API
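The chunk-and-merge behavior listed above can be pictured with a small Python sketch. It is not taken from the tool's source. The function names, the 1-based inclusive page ranges, and the blank-line join used for merging are all assumptions made for illustration. Only the default chunk size of 25 pages comes from the documented `--chunk-size` option.

```python
def chunk_page_ranges(total_pages: int, chunk_size: int = 25) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (start, end) page ranges covering the document.

    Hypothetical helper: the real CLI may number or split pages differently.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def merge_chunks(chunks: list[str]) -> str:
    """Join per-chunk Markdown in order, skipping empty chunks.

    Assumption: chunks are separated by a single blank line.
    """
    return "\n\n".join(part.strip() for part in chunks if part.strip())


if __name__ == "__main__":
    # A 60-page PDF with the default chunk size yields three ranges.
    print(chunk_page_ranges(60))
```

Under these assumptions, the last range is simply shorter than the others when the page count is not a multiple of the chunk size.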
Supported formats
Input
- PDF (.pdf)
- Word (.doc, .docx, .odt)
- PowerPoint (.ppt, .pptx, .odp)
- Spreadsheets (.xls, .xlsx, .ods)
- EPUB/HTML (.epub, .html)
- Images (.png, .jpg, .jpeg, .webp, .gif, .tiff)
Output
- Markdown (.md, default)
- JSON (.json)
- HTML (.html)
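To preview which files in a directory match the input formats listed above, a short Python sketch like the following may help. The extension set is copied from that list. The helper name `find_convertible`, the recursive search, and the case-insensitive matching are assumptions, and the CLI's own directory handling may differ.

```python
from pathlib import Path

# Extensions copied from the "Input" list above.
SUPPORTED_INPUT = {
    ".pdf",
    ".doc", ".docx", ".odt",
    ".ppt", ".pptx", ".odp",
    ".xls", ".xlsx", ".ods",
    ".epub", ".html",
    ".png", ".jpg", ".jpeg", ".webp", ".gif", ".tiff",
}


def find_convertible(root: Path) -> list[Path]:
    """Return supported files under root, sorted for a stable order.

    Hypothetical helper: recursion and case-insensitive matching are
    assumptions, not documented behavior of pdf-to-md.
    """
    if root.is_file():
        return [root] if root.suffix.lower() in SUPPORTED_INPUT else []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_INPUT
    )


if __name__ == "__main__":
    for path in find_convertible(Path("./docs")):
        print(path)
```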
Installation
pip install pdf-to-markdown-cli
From source:
git clone https://github.com/SokolskyNikita/pdf-to-markdown-cli.git
cd pdf-to-markdown-cli
pip install -e .
Quick start
export MARKER_PDF_KEY="your_api_key"
pdf-to-md ./examples/equations.pdf
Process a directory:
pdf-to-md ./docs
Use JSON or HTML output:
pdf-to-md ./examples/equations.pdf --json
pdf-to-md ./examples/equations.pdf --html
CLI options
- input: input file or directory path
- --json: output JSON instead of Markdown
- --html: output HTML instead of Markdown
- --langs: comma-separated OCR languages (default: English)
- --llm: enable LLM-enhanced processing
- --strip: redo OCR
- --noimg: disable image extraction
- --force: force OCR on all pages
- --pages: include page delimiters
- --max: enable all OCR enhancement flags (--llm --strip --force)
- -mp, --max-pages: process only the first N pages
- --no-chunk: disable PDF chunking
- -cs, --chunk-size: PDF pages per chunk (default: 25)
- -o, --output-dir: absolute output directory path
- -v, --verbose: debug logging
- --version: show installed package version
Development
Run tests:
python -m unittest discover -s tests -v
For contributions or questions, open a GitHub issue.