PDF to Markdown CLI

License: MIT

Command-line utility for converting PDFs and other supported documents into Markdown, JSON, or HTML using the Marker API.

Why use this tool

  • Converts single files or entire directories
  • Automatically splits large PDFs into chunks and merges results
  • Persists request state locally so interrupted runs can recover
  • Rewrites and copies extracted images into deterministic output folders
  • Supports OCR/LLM tuning flags from the Marker API
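
To illustrate the split-and-merge idea, here is a minimal sketch of how a large PDF could be divided into page ranges and how the per-chunk results could be joined again. It uses the 25-page default chunk size listed under CLI options. The function names `chunk_page_ranges` and `merge_chunks` are hypothetical, and the tool's real splitting and merging logic may differ.

```python
def chunk_page_ranges(total_pages: int, chunk_size: int = 25) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (start, end) page ranges covering the document.

    Hypothetical helper: illustrates fixed-size chunking, not the tool's own code.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def merge_chunks(chunks: list[str]) -> str:
    """Join per-chunk Markdown in page order, separated by one blank line.

    Assumption: chunks are already sorted by their starting page.
    """
    return "\n\n".join(part.strip("\n") for part in chunks)


if __name__ == "__main__":
    # A 60-page PDF with the default chunk size yields three ranges.
    print(chunk_page_ranges(60))  # [(1, 25), (26, 50), (51, 60)]
```

The last range is shorter than the others whenever the page count is not a multiple of the chunk size.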

Supported formats

Input

  • PDF (.pdf)
  • Word (.doc, .docx, .odt)
  • PowerPoint (.ppt, .pptx, .odp)
  • Spreadsheets (.xls, .xlsx, .ods)
  • EPUB/HTML (.epub, .html)
  • Images (.png, .jpg, .jpeg, .webp, .gif, .tiff)

Output

  • Markdown (.md, default)
  • JSON (.json)
  • HTML (.html)
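
When preparing a directory for conversion, it can help to check in advance which files have a supported input extension. The sketch below copies the extension list from the Input section above. The helper names are hypothetical, and the recursive directory walk is an assumption: the CLI's own directory traversal may behave differently.

```python
from pathlib import Path

# Input extensions copied from the "Supported formats" list.
SUPPORTED_INPUT_EXTENSIONS = {
    ".pdf",
    ".doc", ".docx", ".odt",
    ".ppt", ".pptx", ".odp",
    ".xls", ".xlsx", ".ods",
    ".epub", ".html",
    ".png", ".jpg", ".jpeg", ".webp", ".gif", ".tiff",
}


def is_supported(path: Path) -> bool:
    """Return True if the file extension is in the supported list (case-insensitive)."""
    return path.suffix.lower() in SUPPORTED_INPUT_EXTENSIONS


def find_supported_files(directory: Path) -> list[Path]:
    """Hypothetical helper: recursively collect supported files, sorted by path."""
    return sorted(p for p in directory.rglob("*") if p.is_file() and is_supported(p))
```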

Installation

pip install pdf-to-markdown-cli

From source:

git clone https://github.com/SokolskyNikita/pdf-to-markdown-cli.git
cd pdf-to-markdown-cli
pip install -e .

Quick start

export MARKER_PDF_KEY="your_api_key"
pdf-to-md ./examples/equations.pdf
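
If you call the CLI from a Python script, a small wrapper can fail early when the API key is missing. This is a sketch only: `run_conversion` is a hypothetical name, and it assumes `pdf-to-md` is installed and on your `PATH`. How the CLI itself reports a missing key is not covered here.

```python
import os
import subprocess
import sys


def run_conversion(input_path: str) -> int:
    """Run `pdf-to-md` on one path and return its exit code.

    Returns 1 without invoking the CLI if MARKER_PDF_KEY is unset or empty.
    """
    if not os.environ.get("MARKER_PDF_KEY"):
        print("MARKER_PDF_KEY is not set", file=sys.stderr)
        return 1
    # check=False: hand the exit code back to the caller instead of raising.
    # Raises FileNotFoundError if pdf-to-md is not installed.
    completed = subprocess.run(["pdf-to-md", input_path], check=False)
    return completed.returncode
```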

Process a directory:

pdf-to-md ./docs

Use JSON or HTML output:

pdf-to-md ./examples/equations.pdf --json
pdf-to-md ./examples/equations.pdf --html

CLI options

  • input: input file or directory path
  • --json: output JSON instead of Markdown
  • --html: output HTML instead of Markdown
  • --langs: comma-separated OCR languages (default: English)
  • --llm: enable LLM-enhanced processing
  • --strip: strip existing OCR text and redo OCR
  • --noimg: disable image extraction
  • --force: force OCR on all pages
  • --pages: include page delimiters
  • --max: enable all OCR enhancement flags (--llm --strip --force)
  • -mp, --max-pages: process only the first N pages
  • --no-chunk: disable PDF chunking
  • -cs, --chunk-size: PDF pages per chunk (default: 25)
  • -o, --output-dir: absolute output directory path
  • -v, --verbose: debug logging
  • --version: show installed package version

Development

Run tests:

python -m unittest discover -s tests -v

For contributions or questions, open a GitHub issue.
