A Python tool for converting PDF files to Markdown
pdf2dotmd
A Python CLI tool that converts PDF files to Markdown format with intelligent layout analysis.
Features
- Layout-aware text extraction — reconstructs logical reading order from PDF spatial data
- Multi-column detection — handles two-column and multi-column layouts
- Table extraction — converts PDF tables to Markdown pipe tables
- Heading inference — detects headings from font size hierarchy
- Header/footer filtering — automatically removes repeated page headers and footers
- Image extraction — extracts embedded images to an assets/ directory
- Ignore images mode — --ignore-images flag for text-only output
- Page range selection — convert specific pages only
- Batch conversion — process multiple PDF files with wildcards
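To make the table extraction feature concrete, here is a minimal sketch of rendering extracted rows as a Markdown pipe table. It assumes the rows arrive as a list of lists with None for empty cells, which is the shape pdfplumber's table extraction returns. The function name to_pipe_table is hypothetical and is not taken from pdf2dotmd's code.

```python
def to_pipe_table(rows):
    """Render rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # None becomes an empty cell; pipes and newlines would break the table syntax.
        text = "" if cell is None else str(cell)
        return text.replace("|", "\\|").replace("\n", " ").strip()

    # Pad short rows so every line has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_pipe_table([["Name", "Qty"], ["Bolt", 4], ["Nut", None]]))
```

Treating the first row as the header is an assumption; PDF tables do not always mark a header row explicitly.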
Installation
pip install pdf2dotmd
Usage
# Output to stdout
pdf2dotmd input.pdf
# Output to file
pdf2dotmd input.pdf -o output.md
# Skip images, output single Markdown file
pdf2dotmd input.pdf --ignore-images
# Batch conversion
pdf2dotmd *.pdf -o output_dir/
# Convert only specific pages
pdf2dotmd input.pdf -p 1-3
pdf2dotmd input.pdf -p 1-5,8,10-12
# Verbose logging
pdf2dotmd input.pdf -v
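The page specification passed to -p (for example 1-5,8,10-12) can be read as a comma-separated list of single pages and inclusive ranges. The sketch below shows one way such a string could be expanded. It assumes pages are 1-based and that ranges running past the end of the document are clamped; neither behavior is confirmed by this README, and parse_page_spec is a hypothetical name.

```python
def parse_page_spec(spec, page_count):
    """Expand a 1-based spec like "1-5,8,10-12" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: clamp to the document length instead of raising an error.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_spec("1-5,8,10-12", page_count=20))
```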
How It Works
- Character extraction — uses pdfplumber to extract individual characters with position data
- Line grouping — clusters characters into text lines by y-coordinate proximity
- Block formation — groups lines into paragraphs based on horizontal alignment and vertical spacing
- Column detection — identifies multi-column layouts by analyzing horizontal text density gaps
- Reading order — sorts blocks top-to-bottom, left-to-right, handling spanning titles
- Header/footer removal — detects repeated elements across pages
- Heading inference — maps font sizes to heading levels (H1-H6)
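Two of the steps above, line grouping and heading inference, can be sketched in plain Python. This is an illustration under stated assumptions, not the package's actual implementation: characters are dicts with text, x0, top, and size keys (the same keys pdfplumber exposes for each character), the 2-point tolerance is arbitrary, and the most frequent font size is assumed to be body text. All function names are hypothetical.

```python
from collections import Counter


def group_into_lines(chars, y_tolerance=2.0):
    """Cluster character dicts into text lines by vertical proximity."""
    lines = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Join the current line if the character is vertically close to its first character.
        if lines and abs(ch["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, order characters left to right.
    return [sorted(line, key=lambda c: c["x0"]) for line in lines]


def infer_heading_levels(lines, max_level=6):
    """Map font sizes larger than the body size to heading levels 1..max_level."""
    sizes = Counter(round(ch["size"], 1) for line in lines for ch in line)
    if not sizes:
        return {}
    body_size = sizes.most_common(1)[0][0]  # assumption: most frequent size is body text
    larger = sorted((s for s in sizes if s > body_size), reverse=True)
    return {size: level for level, size in enumerate(larger[:max_level], start=1)}


def render_lines(lines, levels):
    """Emit Markdown, prefixing lines whose first character has a heading-sized font."""
    out = []
    for line in lines:
        text = "".join(ch["text"] for ch in line)
        level = levels.get(round(line[0]["size"], 1))
        out.append("#" * level + " " + text if level else text)
    return "\n".join(out)
```

Comparing each character against the first character of the current line keeps the grouping simple, but slanted or heavily jittered baselines would need a more careful clustering step.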
Limitations
- Scanned PDFs — OCR is not supported; scanned/image-only PDFs will produce empty output
- Encrypted PDFs — password-protected PDFs are not supported
- Complex layouts — highly irregular layouts may not parse perfectly
License
MIT
File details
Details for the file pdf2dotmd-0.0.1.tar.gz.
File metadata
- Download URL: pdf2dotmd-0.0.1.tar.gz
- Upload date:
- Size: 14.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7414fcb40bf684e948638feabfb88cbc0a2bc31504a74ccdfba937003d1417fb |
| MD5 | 9a33aa673db2b095df369e86a060bcf7 |
| BLAKE2b-256 | 38d7c27bffbe25145dc322038cc938090426339e2132cb1e148e25af835271fc |
Provenance
The following attestation bundles were made for pdf2dotmd-0.0.1.tar.gz:
Publisher: publish-pypi.yml on hnrobert/pdf2dotmd
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2dotmd-0.0.1.tar.gz
- Subject digest: 7414fcb40bf684e948638feabfb88cbc0a2bc31504a74ccdfba937003d1417fb
- Sigstore transparency entry: 1194545539
- Permalink: hnrobert/pdf2dotmd@aaca652f2a9849ee7df043f36cb1b5a8805fc226
- Branch / Tag: refs/heads/main
- Owner: https://github.com/hnrobert
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@aaca652f2a9849ee7df043f36cb1b5a8805fc226
- Trigger Event: workflow_dispatch
File details
Details for the file pdf2dotmd-0.0.1-py3-none-any.whl.
File metadata
- Download URL: pdf2dotmd-0.0.1-py3-none-any.whl
- Upload date:
- Size: 17.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 14a95d2fee9fbbf76660e72c5d3de5897525635461c5e2e06807da7711b9df68 |
| MD5 | b23c9fa6a53eee05a87ba1d5f04e9029 |
| BLAKE2b-256 | d4b9fbf9a73892655f9d5fce59cc6de6f74cc0c868766ef407343089b5149326 |
Provenance
The following attestation bundles were made for pdf2dotmd-0.0.1-py3-none-any.whl:
Publisher: publish-pypi.yml on hnrobert/pdf2dotmd
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2dotmd-0.0.1-py3-none-any.whl
- Subject digest: 14a95d2fee9fbbf76660e72c5d3de5897525635461c5e2e06807da7711b9df68
- Sigstore transparency entry: 1194545555
- Permalink: hnrobert/pdf2dotmd@aaca652f2a9849ee7df043f36cb1b5a8805fc226
- Branch / Tag: refs/heads/main
- Owner: https://github.com/hnrobert
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@aaca652f2a9849ee7df043f36cb1b5a8805fc226
- Trigger Event: workflow_dispatch