A Python tool for converting PDF files to Markdown

pdf2dotmd

A Python CLI tool that converts PDF files to Markdown format with intelligent layout analysis.

Features

  • Layout-aware text extraction — reconstructs logical reading order from PDF spatial data
  • Multi-column detection — handles two-column and multi-column layouts
  • Table extraction — converts PDF tables to Markdown pipe tables
  • Heading inference — detects headings from font size hierarchy
  • Header/footer filtering — automatically removes repeated page headers and footers
  • Image extraction — extracts embedded images to an assets/ directory
  • Ignore images mode — --ignore-images flag for text-only output
  • Page range selection — convert specific pages only
  • Batch conversion — process multiple PDF files with wildcards
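As an illustration of the table extraction feature, here is a minimal sketch of turning extracted rows into a Markdown pipe table. It is not taken from pdf2dotmd's source: the function name `rows_to_pipe_table` is hypothetical, and the input shape (a list of rows, first row treated as the header, `None` for empty cells) is an assumption modeled on what pdfplumber's table extraction returns.

```python
def rows_to_pipe_table(rows):
    """Render a list of rows (first row = header) as a Markdown pipe table.

    Hypothetical helper for illustration; not pdf2dotmd's actual code.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Empty cells become "", newlines collapse to spaces,
        # and literal pipes are escaped so they do not split the cell.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    # Pad short rows so every row has the same number of cells.
    norm = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = norm[0], norm[1:]

    out = ["| " + " | ".join(header) + " |",
           "| " + " | ".join(["---"] * width) + " |"]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out)


print(rows_to_pipe_table([["Name", "Qty"], ["Bolt", 4], ["Nut", None]]))
```

Escaping `|` inside cells matters because an unescaped pipe would be read as a column separator by most Markdown renderers.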

Installation

pip install pdf2dotmd

Usage

# Output to stdout
pdf2dotmd input.pdf

# Output to file
pdf2dotmd input.pdf -o output.md

# Skip images, output single Markdown file
pdf2dotmd input.pdf --ignore-images

# Batch conversion
pdf2dotmd *.pdf -o output_dir/

# Convert only specific pages
pdf2dotmd input.pdf -p 1-3
pdf2dotmd input.pdf -p 1-5,8,10-12

# Verbose logging
pdf2dotmd input.pdf -v
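
For readers wondering how a page spec such as 1-5,8,10-12 can be interpreted, the sketch below expands it into a sorted list of page numbers. It assumes 1-based, inclusive ranges, which the examples above suggest but do not state; `parse_page_spec` is a hypothetical name and not necessarily how pdf2dotmd parses the -p option.

```python
def parse_page_spec(spec):
    """Expand a spec like "1-5,8,10-12" into a sorted list of page numbers.

    Assumption: pages are 1-based and ranges are inclusive.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_page_spec("1-5,8,10-12"))
```

Collecting into a set first means overlapping ranges (for example 1-3,2-4) yield each page only once.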

How It Works

  1. Character extraction — uses pdfplumber to extract individual characters with position data
  2. Line grouping — clusters characters into text lines by y-coordinate proximity
  3. Block formation — groups lines into paragraphs based on horizontal alignment and vertical spacing
  4. Column detection — identifies multi-column layouts by analyzing horizontal text density gaps
  5. Reading order — sorts blocks top-to-bottom, left-to-right, handling spanning titles
  6. Header/footer removal — detects repeated elements across pages
  7. Heading inference — maps font sizes to heading levels (H1-H6)
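
Steps 2 and 7 can be sketched in a few lines of plain Python. This is an illustrative approximation rather than the project's code: the character dicts use the `text`, `x0`, and `top` keys that pdfplumber exposes, while the function names, the `y_tolerance` default, and the rule "most common font size is body text" are assumptions made for this example.

```python
from collections import Counter


def group_chars_into_lines(chars, y_tolerance=2.0):
    """Cluster character dicts into text lines by proximity of their 'top' value."""
    lines = []  # each entry: {"top": top of the line's first char, "chars": [...]}
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # A char joins the current line if it sits within y_tolerance
        # of that line's first character; otherwise it starts a new line.
        if lines and abs(ch["top"] - lines[-1]["top"]) <= y_tolerance:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"top": ch["top"], "chars": [ch]})
    # Within each line, restore left-to-right order before joining.
    return ["".join(c["text"] for c in sorted(line["chars"], key=lambda c: c["x0"]))
            for line in lines]


def map_sizes_to_heading_levels(sizes, max_level=6):
    """Map font sizes larger than the body size to heading levels 1..max_level.

    Assumption: the most frequent size is body text. Real font sizes are
    floats, so rounding them before calling this is advisable.
    """
    body = Counter(sizes).most_common(1)[0][0]
    larger = sorted({s for s in sizes if s > body}, reverse=True)
    return {size: min(i + 1, max_level) for i, size in enumerate(larger)}


chars = [
    {"text": "H", "x0": 10, "top": 100.0},
    {"text": "i", "x0": 16, "top": 100.4},
    {"text": "o", "x0": 16, "top": 120.0},
    {"text": "y", "x0": 10, "top": 120.3},
]
print(group_chars_into_lines(chars))
print(map_sizes_to_heading_levels([10, 10, 10, 18, 14, 10, 14]))
```

The tolerance absorbs small baseline differences between glyphs on the same line; setting it too high would merge adjacent lines in tightly spaced text.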

Limitations

  • Scanned PDFs — OCR is not supported; scanned/image-only PDFs will produce empty output
  • Encrypted PDFs — password-protected PDFs are not supported
  • Complex layouts — highly irregular layouts may not parse perfectly
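
Because scanned PDFs produce empty output, it can help to check for them before converting. The sketch below is one possible heuristic, not a feature of pdf2dotmd: it flags a document as likely scanned when no page has any text characters but at least one page contains an image. It relies only on page objects having `chars` and `images` attributes, as pdfplumber pages do; the function name `looks_scanned` is hypothetical, and the demo uses stand-in objects so it runs without a PDF file.

```python
from types import SimpleNamespace


def looks_scanned(pages):
    """Return True if no page has text characters but some page has an image.

    Heuristic only: a PDF with hidden OCR text or vector-drawn text
    will not be classified reliably.
    """
    pages = list(pages)
    has_text = any(len(page.chars) > 0 for page in pages)
    has_images = any(len(page.images) > 0 for page in pages)
    return not has_text and has_images


# Stand-ins for pdfplumber pages, used here so the example is self-contained.
scanned_doc = [SimpleNamespace(chars=[], images=[{"name": "Im0"}])]
text_doc = [SimpleNamespace(chars=[{"text": "A"}], images=[])]

print(looks_scanned(scanned_doc))
print(looks_scanned(text_doc))
```

With pdfplumber installed, the same function could be called on the `pages` of an opened document.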

License

MIT

Download files

Source Distribution

pdf2dotmd-0.0.1.tar.gz (14.9 kB view details)

Uploaded Source

Built Distribution

pdf2dotmd-0.0.1-py3-none-any.whl (17.1 kB view details)

Uploaded Python 3

File details

Details for the file pdf2dotmd-0.0.1.tar.gz.

File metadata

  • Download URL: pdf2dotmd-0.0.1.tar.gz
  • Size: 14.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pdf2dotmd-0.0.1.tar.gz
Algorithm Hash digest
SHA256 7414fcb40bf684e948638feabfb88cbc0a2bc31504a74ccdfba937003d1417fb
MD5 9a33aa673db2b095df369e86a060bcf7
BLAKE2b-256 38d7c27bffbe25145dc322038cc938090426339e2132cb1e148e25af835271fc

Provenance

The following attestation bundles were made for pdf2dotmd-0.0.1.tar.gz:

Publisher: publish-pypi.yml on hnrobert/pdf2dotmd

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pdf2dotmd-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: pdf2dotmd-0.0.1-py3-none-any.whl
  • Size: 17.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pdf2dotmd-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 14a95d2fee9fbbf76660e72c5d3de5897525635461c5e2e06807da7711b9df68
MD5 b23c9fa6a53eee05a87ba1d5f04e9029
BLAKE2b-256 d4b9fbf9a73892655f9d5fce59cc6de6f74cc0c868766ef407343089b5149326

Provenance

The following attestation bundles were made for pdf2dotmd-0.0.1-py3-none-any.whl:

Publisher: publish-pypi.yml on hnrobert/pdf2dotmd
