pdf2md-ocr

Simple CLI tool to convert PDFs to Markdown using Marker AI.

Quick Start

Recommended (runs without installing pdf2md-ocr; requires uv):

uvx pdf2md-ocr input.pdf -o output.md

Traditional installation:

pip install pdf2md-ocr
pdf2md-ocr input.pdf -o output.md

Usage

# Convert PDF to Markdown (output uses the same name with a .md extension)
pdf2md-ocr document.pdf

# Specify output file
pdf2md-ocr document.pdf -o result.md

# Show cache location and size
pdf2md-ocr document.pdf --show-cache-info

# Show help
pdf2md-ocr --help

# Show version
pdf2md-ocr --version
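To convert several PDFs in one go, the commands above can be driven from a small script. The following is only a sketch: the helper names (md_path_for, build_command, convert_all) are hypothetical and not part of pdf2md-ocr, and it relies solely on the documented `-o` option. The output path is always passed explicitly so the result does not depend on where the tool writes by default.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def md_path_for(pdf_path: str) -> Path:
    """Return the input path with its extension swapped to .md."""
    return Path(pdf_path).with_suffix(".md")


def build_command(pdf_path: str, output: Optional[str] = None) -> List[str]:
    """Build the argument list for one pdf2md-ocr invocation."""
    cmd = ["pdf2md-ocr", pdf_path]
    if output is not None:
        cmd += ["-o", output]
    return cmd


def convert_all(pdf_paths: Iterable[str], runner: Callable = subprocess.run) -> None:
    """Convert each PDF in turn, stopping at the first failing command."""
    for pdf in pdf_paths:
        # check=True makes subprocess.run raise if the CLI exits non-zero.
        runner(build_command(pdf, str(md_path_for(pdf))), check=True)
```

Calling convert_all(["a.pdf", "b.pdf"]) would run the CLI once per file. The runner parameter exists only so the function can be exercised without the tool installed.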

First Run

The first time you run pdf2md-ocr, it will download ~2-3GB of AI models. These models are cached locally and reused for all future conversions.

To see where models are cached:

pdf2md-ocr input.pdf --show-cache-info

This prints the cache location and size after the conversion finishes. Typical cache locations:

  • macOS: ~/Library/Caches/datalab/models/
  • Linux: ~/.cache/datalab/models/
  • Windows: %LOCALAPPDATA%\datalab\models\

To clear the cache, delete the cache directory reported by --show-cache-info, or run make clean-cache if you are developing locally.

Subsequent runs will be much faster since the models are already cached.
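If you want to inspect the cache without running a conversion, the typical locations listed above can be checked from Python. This is a rough sketch under stated assumptions: the function names are hypothetical, the paths are just the typical ones from this section, and your environment may place the cache elsewhere, so treat the output of --show-cache-info as authoritative.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def typical_cache_dir(platform: str, home: Path, env: Mapping[str, str]) -> Optional[Path]:
    """Map a sys.platform value to the typical model cache directory."""
    if platform == "darwin":
        return home / "Library" / "Caches" / "datalab" / "models"
    if platform.startswith("linux"):
        return home / ".cache" / "datalab" / "models"
    if platform == "win32":
        local = env.get("LOCALAPPDATA")
        return Path(local) / "datalab" / "models" if local else None
    return None  # unknown platform: no guess


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if it is missing)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def report() -> None:
    """Print the typical cache directory for this machine and its size."""
    cache = typical_cache_dir(sys.platform, Path.home(), os.environ)
    if cache is None:
        print("No typical cache location known for this platform")
    else:
        print(f"{cache}: {dir_size_bytes(cache) / 1024**3:.2f} GiB")
```

A size of 0 simply means the directory does not exist yet, for example before the first run.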

Requirements

  • Python 3.10 or higher
  • ~2-3 GB of disk space for AI models (one-time download)
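Both requirements can be checked before the first run. The sketch below assumes the upper end of the model size stated in this README (3 GB) as the free-space threshold; the names MIN_PYTHON, MODEL_BYTES, and preflight are hypothetical and not part of the tool.

```python
import shutil
import sys
from pathlib import Path
from typing import List, Tuple

MIN_PYTHON = (3, 10)
MODEL_BYTES = 3 * 1024**3  # assumption: upper end of the ~2-3 GB download


def preflight(version: Tuple[int, int], free_bytes: int) -> List[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.10")
    if free_bytes < MODEL_BYTES:
        problems.append(f"only {free_bytes / 1024**3:.1f} GiB free, need about 3")
    return problems


def check_this_machine() -> List[str]:
    """Run the checks against the current interpreter and home volume."""
    free = shutil.disk_usage(Path.home()).free
    return preflight(sys.version_info[:2], free)
```

The check measures free space on the volume holding your home directory, which is where the typical cache locations live.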

Development

For development, a Makefile is provided with common tasks:

# Install dependencies
make install-dev

# Run tests
make test

# Run tests with verbose output
make test-verbose

# Clean build artifacts
make clean

# Clear AI model cache (frees ~2-3 GB of disk space)
make clean-cache

# Build distribution packages
make build

# See all available commands
make help

How It Works

This tool is a minimal wrapper around the excellent marker-pdf library, which uses AI models to:

  1. Detect text, tables, and equations in PDFs
  2. Extract content with proper formatting
  3. Convert to clean Markdown
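For orientation, the three steps above roughly correspond to a single converter call in marker-pdf. The sketch below follows the Python usage shown in the marker-pdf 1.x documentation; it is an assumption about how a wrapper like this could look, not the actual source of pdf2md-ocr, and the function name convert_pdf_to_markdown is hypothetical.

```python
from pathlib import Path


def convert_pdf_to_markdown(pdf_path: str, output_path: str) -> None:
    """Convert one PDF to Markdown with marker-pdf and write it to disk."""
    # Imported lazily: marker pulls in heavy ML dependencies and
    # downloads its models on first use.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    # Load the models (from the local cache once they are downloaded).
    converter = PdfConverter(artifact_dict=create_model_dict())
    # Run detection and extraction over the whole document.
    rendered = converter(pdf_path)
    # Pull the Markdown text out of the rendered result; images are ignored here.
    markdown, _, _images = text_from_rendered(rendered)
    Path(output_path).write_text(markdown, encoding="utf-8")
```

Running this requires marker-pdf to be installed, and the first call triggers the model download described under First Run.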

License

GPL-3.0-or-later

This project is licensed under the GNU General Public License v3.0 or later to comply with the marker-pdf library license (GPL-3.0-or-later).
