Simple CLI tool to convert PDFs to Markdown using Marker AI

These details have been verified by PyPI

Maintainers

carloscasalar

Project description

pdf2md-ocr

Simple CLI tool to convert PDFs to Markdown using Marker AI.

Quick Start

Recommended (no installation needed):

uvx pdf2md-ocr input.pdf -o output.md

Traditional installation:

pip install pdf2md-ocr
pdf2md-ocr input.pdf -o output.md

Usage

# Convert PDF to Markdown (output uses the same name with a .md extension)
pdf2md-ocr document.pdf

# Specify output file
pdf2md-ocr document.pdf -o result.md

# Convert specific page range (page numbering starts at 1)
pdf2md-ocr document.pdf --start-page 2 --end-page 5

# Convert from page 3 to the end
pdf2md-ocr document.pdf --start-page 3

# Convert from the beginning to page 10
pdf2md-ocr document.pdf --end-page 10

# Show cache location and size
pdf2md-ocr document.pdf --show-cache-info

# Show help
pdf2md-ocr --help

# Show version
pdf2md-ocr --version
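When driving the CLI from a script, the invocations above can be assembled programmatically. The following is a minimal sketch, not part of pdf2md-ocr itself: `build_command` is a hypothetical helper name, and it only uses the `-o`, `--start-page`, and `--end-page` options documented above.

```python
import shlex
from typing import List, Optional


def build_command(
    pdf: str,
    output: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> List[str]:
    """Build an argument list for the pdf2md-ocr CLI (hypothetical helper)."""
    cmd = ["pdf2md-ocr", str(pdf)]
    if output is not None:
        cmd += ["-o", str(output)]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if end_page is not None:
        cmd += ["--end-page", str(end_page)]
    return cmd


if __name__ == "__main__":
    # Print the shell-quoted command instead of running it.
    print(shlex.join(build_command("document.pdf", output="result.md",
                                   start_page=2, end_page=5)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, assuming pdf2md-ocr is installed and on `PATH`.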

Page Range Options

--start-page N: Starting page number (1-based, inclusive). If omitted, starts from page 1.
--end-page M: Ending page number (1-based, inclusive). If omitted, goes to the last page.

Both options are optional and can be combined:

Use only --start-page to convert from a specific page to the end.
Use only --end-page to convert from the beginning to a specific page.
Use both to convert a specific range.

Important: Page numbering starts at 1 (not 0).
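The 1-based, inclusive semantics described above can be illustrated with a short sketch. This is not the tool's actual implementation: `resolve_page_range` is a hypothetical helper, and rejecting out-of-range values with `ValueError` is an assumption about how such input might be handled.

```python
from typing import List, Optional


def resolve_page_range(
    total_pages: int,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> List[int]:
    """Map 1-based inclusive bounds to a list of 0-based page indices."""
    first = 1 if start_page is None else start_page        # default: page 1
    last = total_pages if end_page is None else end_page   # default: last page
    if first < 1:
        raise ValueError("page numbering starts at 1")
    if last > total_pages:
        raise ValueError("end page is beyond the last page")  # assumed behavior
    if first > last:
        raise ValueError("start page must not exceed end page")
    # Page N (1-based) is index N - 1; range() excludes its stop value,
    # so stopping at `last` keeps the end page inclusive.
    return list(range(first - 1, last))


if __name__ == "__main__":
    print(resolve_page_range(10, start_page=2, end_page=5))  # pages 2..5
```

For a 10-page document, `--start-page 2 --end-page 5` corresponds to the four pages 2, 3, 4, and 5.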

First Run

The first time you run pdf2md-ocr, it will download ~2-3GB of AI models. These models are cached locally and reused for all future conversions.

To see where models are cached:

pdf2md-ocr input.pdf --show-cache-info

This shows the cache location and size once the conversion finishes. Typical cache locations:

macOS: ~/Library/Caches/datalab/models/
Linux: ~/.cache/datalab/models/
Windows: %LOCALAPPDATA%\datalab\models\

To clear the cache, delete the cache directory reported by --show-cache-info, or run make clean-cache if you are developing locally.

Subsequent runs will be much faster since the models are already cached.
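To check how much disk space the models use without running a conversion, the typical locations listed above can be inspected directly. The sketch below is an illustration only: `typical_cache_dir` and `dir_size_bytes` are hypothetical helpers, and the paths are the typical ones from this section, so the real cache may live elsewhere on a given system.

```python
import os
import sys
from pathlib import Path, PurePosixPath, PureWindowsPath


def typical_cache_dir(platform=sys.platform, env=os.environ):
    """Return the typical model cache path for a platform (assumed layout)."""
    if platform == "darwin":
        return PurePosixPath(env.get("HOME", "~")) / "Library/Caches/datalab/models"
    if platform.startswith("win"):
        base = PureWindowsPath(env.get("LOCALAPPDATA", "%LOCALAPPDATA%"))
        return base / "datalab" / "models"
    # Everything else is treated as Linux-like.
    return PurePosixPath(env.get("HOME", "~")) / ".cache/datalab/models"


def dir_size_bytes(path) -> int:
    """Sum the sizes of all regular files under path (0 if it is missing)."""
    root = Path(str(path))
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    cache = typical_cache_dir()
    print(f"{cache}: {dir_size_bytes(cache) / 1e9:.2f} GB")
```

The location printed by `--show-cache-info` remains the authoritative one.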

Requirements

Python 3.10 or higher
~2-3GB disk space for AI models (one-time download)

Development

For development, a Makefile is provided with common tasks:

# Install dependencies
make install-dev

# Run tests
make test

# Run tests with verbose output
make test-verbose

# Clean build artifacts
make clean

# Clear AI model cache (frees ~3GB disk space)
make clean-cache

# Build distribution packages
make build

# See all available commands
make help

How It Works

This tool is a minimal wrapper around the excellent marker-pdf library, which uses AI models to:

Detect text, tables, and equations in PDFs
Extract content with proper formatting
Convert to clean Markdown

License

GPL-3.0-or-later

This project is licensed under the GNU General Public License v3.0 or later to comply with the marker-pdf library license (GPL-3.0-or-later).

Release history

1.0.1: Jan 18, 2026
1.0.0: Dec 13, 2025 (this version)
0.0.5: Dec 13, 2025
0.0.4: Nov 20, 2025
0.0.3: Nov 16, 2025
0.0.2: Nov 16, 2025
0.0.1: Nov 8, 2025

Download files

Source Distribution

pdf2md_ocr-1.0.0.tar.gz (185.0 kB)

Uploaded Dec 13, 2025 Source

Built Distribution

pdf2md_ocr-1.0.0-py3-none-any.whl (6.3 kB)

Uploaded Dec 13, 2025 Python 3

File details

Details for the file pdf2md_ocr-1.0.0.tar.gz.

File metadata

Download URL: pdf2md_ocr-1.0.0.tar.gz
Upload date: Dec 13, 2025
Size: 185.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pdf2md_ocr-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`5b8180d827ff40d0ae4d2e5fa06ea32b7b7faa5315d45050eaca2709c3571b30`
MD5	`c7d7a507b71fb32cdc4858e46604d8c2`
BLAKE2b-256	`cf6a42ce6a20e791fef495d24652dc01337a9128c513b55ba0b73dcebb457ce4`
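A downloaded file can be checked against the digests in the table above with Python's standard `hashlib` module. This is a minimal sketch: `file_digest` and `matches` are hypothetical helper names, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    if algorithm == "blake2b-256":
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        h = hashlib.blake2b(digest_size=32)
    else:
        h = hashlib.new(algorithm)  # e.g. "sha256" or "md5"
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """True if the file's digest equals the published hex digest."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `matches("pdf2md_ocr-1.0.0.tar.gz", "5b8180d827ff40d0ae4d2e5fa06ea32b7b7faa5315d45050eaca2709c3571b30")` should return `True` for an intact copy of the source distribution saved under that name in the current directory.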

Provenance

The following attestation bundles were made for pdf2md_ocr-1.0.0.tar.gz:

Publisher: publish-to-pypi.yml on carloscasalar/pdf2md-ocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2md_ocr-1.0.0.tar.gz
- Subject digest: 5b8180d827ff40d0ae4d2e5fa06ea32b7b7faa5315d45050eaca2709c3571b30
- Sigstore transparency entry: 763414658
- Sigstore integration time: Dec 13, 2025
Source repository:
- Permalink: carloscasalar/pdf2md-ocr@4d30006d1fc4c7a22093fb939dcc0fc7408245a5
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/carloscasalar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@4d30006d1fc4c7a22093fb939dcc0fc7408245a5
- Trigger Event: push

File details

Details for the file pdf2md_ocr-1.0.0-py3-none-any.whl.

File metadata

Download URL: pdf2md_ocr-1.0.0-py3-none-any.whl
Upload date: Dec 13, 2025
Size: 6.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pdf2md_ocr-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cc6789ea0496416e5fa488ebc1d3b9a506c60357a021019ad9472f3c9594fafb`
MD5	`684b0dac47ca53d549211a76cd96dada`
BLAKE2b-256	`44b261f33f75019314e8ad9fe8a38ee93c9b81d006cea999c7e4c1706d21904a`

Provenance

The following attestation bundles were made for pdf2md_ocr-1.0.0-py3-none-any.whl:

Publisher: publish-to-pypi.yml on carloscasalar/pdf2md-ocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2md_ocr-1.0.0-py3-none-any.whl
- Subject digest: cc6789ea0496416e5fa488ebc1d3b9a506c60357a021019ad9472f3c9594fafb
- Sigstore transparency entry: 763414660
- Sigstore integration time: Dec 13, 2025
Source repository:
- Permalink: carloscasalar/pdf2md-ocr@4d30006d1fc4c7a22093fb939dcc0fc7408245a5
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/carloscasalar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@4d30006d1fc4c7a22093fb939dcc0fc7408245a5
- Trigger Event: push
