file2md

Convert PDF and DOCX files to clean, grep-friendly Markdown optimized for AI tools and IDE workflows.

Python 3.11+ · License: MIT

Features

  • PDF conversion — text extraction with PyMuPDF, page separators, scanned PDF detection
  • DOCX conversion — headings, lists, tables, bold/italic preserved as Markdown
  • Grep-friendly output — paragraph reflow, hyphenation fix, whitespace normalization
  • Drag-and-drop web UI — upload and convert files in your browser
  • Full CLI — single file, batch, and web server modes
  • Table extraction — PDF tables converted to GitHub-flavored Markdown
  • Header/footer removal — heuristic detection of repeating headers/footers
  • Metadata headers — source filename and conversion timestamp in output
  • YAML frontmatter — optional structured metadata for downstream tools

Installation

# CLI only (lightweight)
pip install file2md

# With web UI
pip install file2md[web]

# Development (all dependencies)
pip install file2md[all]

Quick Start

Web UI

file2md serve
# Open http://127.0.0.1:8000 and drag and drop your files onto the page

CLI — Single File

# Basic conversion
file2md convert document.pdf -o document.md

# With all enhancements
file2md convert report.pdf -o report.md --clean --frontmatter --page-labels --extract-tables

CLI — Batch

# Convert all PDFs and DOCXs in a directory
file2md batch ./documents --out-dir ./markdown --recursive

CLI Reference

file2md convert

Convert a single PDF or DOCX file to Markdown.

| Flag | Description |
| --- | --- |
| -o / --output | Output file path (defaults to input name with .md) |
| --clean | Normalize whitespace, reflow paragraphs, fix hyphenation |
| --frontmatter | Add YAML frontmatter (source, timestamp, converter) |
| --page-labels | Add ## Page N headings (PDF only) |
| --extract-tables | Detect and convert tables to GFM (PDF) |
| --max-chars N | Truncate output at N characters |
| --overwrite | Overwrite existing output file |
| --quiet | Suppress warnings |
| --verbose | Show detailed progress |
| --json | Machine-readable JSON output |

file2md batch

Batch convert all PDF/DOCX files in a directory.

| Flag | Description |
| --- | --- |
| --out-dir | Output directory (required) |
| --recursive | Process subdirectories |
| All flags from convert | Same options available |
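
Before running a large batch, it can help to preview which files would be picked up. The sketch below is not file2md's own code: it assumes batch mode selects inputs by the .pdf and .docx extensions, and the helper names `find_inputs` and `planned_output` (and the idea that outputs mirror the input tree under --out-dir) are hypothetical.

```python
from pathlib import Path

# Assumption: batch mode matches inputs by extension; the real rules may differ.
SUPPORTED_SUFFIXES = {".pdf", ".docx"}


def find_inputs(root: Path, recursive: bool = False) -> list[Path]:
    """List files under root that look like PDF or DOCX inputs."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        p
        for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def planned_output(src: Path, root: Path, out_dir: Path) -> Path:
    """Hypothetical output path: mirror the relative path under out_dir as .md."""
    return out_dir / src.relative_to(root).with_suffix(".md")
```

Calling `find_inputs(Path("./documents"), recursive=True)` lists the candidates without converting anything.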

file2md serve

Start the web UI server.

| Flag | Description |
| --- | --- |
| --host | Host to bind to (default: 127.0.0.1) |
| --port | Port to listen on (default: 8000) |

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 2 | Unsupported file type |
| 3 | Extraction failed |
| 4 | Scanned PDF detected (no OCR) |
Output Conventions

Metadata

Every converted file includes a metadata comment:

<!-- source: document.pdf | converted: 2026-02-26T12:00:00Z | converter: file2md v0.1.0 -->

With --frontmatter:

---
source: document.pdf
converted: 2026-02-26T12:00:00Z
converter: file2md v0.1.0
---
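
Downstream tools can read the metadata back out of a converted file. The sketch below parses the comment form shown above; it assumes the fields are always separated by " | " and written as "key: value", which matches the example but is not guaranteed by file2md. The function name `parse_metadata_comment` is hypothetical.

```python
import re

# Captures everything between the HTML comment delimiters.
COMMENT_RE = re.compile(r"<!--\s*(.*?)\s*-->")


def parse_metadata_comment(line: str) -> dict[str, str]:
    """Parse '<!-- key: value | key: value -->' into a dict."""
    match = COMMENT_RE.search(line)
    if match is None:
        return {}
    fields: dict[str, str] = {}
    for part in match.group(1).split(" | "):
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


example = (
    "<!-- source: document.pdf | converted: 2026-02-26T12:00:00Z "
    "| converter: file2md v0.1.0 -->"
)
metadata = parse_metadata_comment(example)
```

For files written with --frontmatter, a YAML parser is the more natural choice.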

PDF Page Separators

Pages are separated by ---. With --page-labels:

## Page 1

Content of page 1...

---

## Page 2

Content of page 2...
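
Page labels make it straightforward to split a converted document back into pages, for instance to index each page separately. This is a minimal sketch that relies only on the "## Page N" heading format shown above; the function name `split_pages` is hypothetical. It keys on the headings instead of the bare --- separators, since a document's own content may also contain horizontal rules.

```python
import re

# Matches a page heading on its own line and captures the page number.
PAGE_HEADING = re.compile(r"^## Page (\d+)[ \t]*$", re.MULTILINE)


def split_pages(markdown: str) -> dict[int, str]:
    """Map page number to page body for output produced with --page-labels."""
    # With one capture group, split() yields [preamble, "1", body1, "2", body2, ...].
    parts = PAGE_HEADING.split(markdown)
    pages: dict[int, str] = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        lines = body.strip().splitlines()
        # Drop the trailing page separator, if present.
        if lines and lines[-1].strip() == "---":
            lines.pop()
        pages[int(number)] = "\n".join(lines).rstrip()
    return pages


sample = (
    "## Page 1\n\nContent of page 1...\n\n---\n\n"
    "## Page 2\n\nContent of page 2...\n"
)
pages = split_pages(sample)
```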

Clean Mode (--clean)

  • Paragraph reflow — undoes hard line wraps from PDF extraction
  • Hyphenation fix — merges hyphen-\nated words across lines
  • Header/footer removal — detects and removes repeating page headers/footers
  • Whitespace normalization — collapses extra spaces, limits blank lines
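
To make the steps above concrete, here is a rough sketch of hyphenation repair, paragraph reflow, and whitespace normalization using plain regular expressions. It is an illustration of the general technique, not file2md's implementation in normalize.py, and all function names are hypothetical. Header/footer removal is left out. Note the simplifications: this version also joins genuine compounds that happen to break at a hyphen, and it reflows every block, including headings and lists.

```python
import re


def fix_hyphenation(text: str) -> str:
    """Join a word split across lines, e.g. 'hyphen-' + 'ated' becomes 'hyphenated'."""
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


def reflow_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; blank lines are treated as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)
    joined = (
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
        if p.strip()
    )
    return "\n\n".join(joined)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces or tabs and allow at most one blank line in a row."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def clean(text: str) -> str:
    """Apply the three steps in order: hyphenation, reflow, whitespace."""
    return normalize_whitespace(reflow_paragraphs(fix_hyphenation(text)))


raw = (
    "This is a hyphen-\nated word in a\nhard-wrapped   paragraph."
    "\n\n\n\nSecond paragraph."
)
cleaned = clean(raw)
```

Hyphenation is fixed first so that reflow does not turn a split word into two words separated by a space.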

Architecture

src/file2md/
├── convert.py          # Main entry point — dispatches by file type
├── pdf.py              # PDF → Markdown (PyMuPDF)
├── docx_converter.py   # DOCX → Markdown (python-docx)
├── normalize.py        # Text cleanup (reflow, hyphenation, headers/footers)
├── cli.py              # Click CLI (convert, batch, serve)
├── web.py              # FastAPI web server
├── utils.py            # Shared types, validation, metadata
└── templates/
    └── index.html      # Drag-and-drop web UI
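
The "dispatches by file type" step in convert.py can be pictured as a lookup from file extension to converter function. The sketch below is an assumption about the general shape, not the project's actual code: `convert_file`, `UnsupportedFileType`, and the placeholder converters are all hypothetical names.

```python
from pathlib import Path
from typing import Callable


class UnsupportedFileType(Exception):
    """Raised for inputs that are neither PDF nor DOCX (hypothetical)."""


def convert_pdf(path: Path) -> str:
    # Placeholder: a real converter would extract text from the PDF.
    return f"pdf:{path.name}"


def convert_docx(path: Path) -> str:
    # Placeholder: a real converter would walk the DOCX paragraphs and tables.
    return f"docx:{path.name}"


CONVERTERS: dict[str, Callable[[Path], str]] = {
    ".pdf": convert_pdf,
    ".docx": convert_docx,
}


def convert_file(path: Path) -> str:
    """Pick a converter from the file extension, case-insensitively."""
    try:
        converter = CONVERTERS[path.suffix.lower()]
    except KeyError:
        raise UnsupportedFileType(path.suffix) from None
    return converter(path)
```

A CLI layer could map `UnsupportedFileType` to exit code 2.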

Development

git clone https://github.com/faizkhairi/file2md.git
cd file2md
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows
pip install -e ".[all]"

# Run tests
pytest

# Lint
ruff check src/ tests/

# Build
python -m build

Known Limitations

  • No OCR — scanned/image-only PDFs are detected and rejected with a clear error (exit code 4). OCR support is planned for a future release.
  • Complex PDF layouts — multi-column documents, sidebars, and footnotes may produce text in unexpected order.
  • Nested DOCX lists — only flat bullet/numbered lists are supported. Nested and mixed lists are not preserved.
  • Merged table cells — may produce duplicated or empty cells in the Markdown output.
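
For the first limitation, one common way to recognize an image-only PDF is to check whether any page yields extractable text. The sketch below shows that idea under stated assumptions: the `min_chars` threshold and both function names are hypothetical, and file2md's own detection heuristic may differ. The PyMuPDF import is deferred so the pure check can be used without it installed.

```python
def looks_scanned(page_texts: list[str], min_chars: int = 1) -> bool:
    """True when the document has pages but none has min_chars of visible text."""
    if not page_texts:
        return False
    # Whitespace is ignored so that stray newlines do not count as content.
    return all(len("".join(text.split())) < min_chars for text in page_texts)


def page_texts_from_pdf(path: str) -> list[str]:
    """Extract plain text per page with PyMuPDF (requires the pymupdf package)."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


# Example use (not executed here):
#   if looks_scanned(page_texts_from_pdf("scan.pdf")):
#       print("No extractable text; run OCR first.")
```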

Troubleshooting

"All pages appear to be scanned images" — The PDF contains only images, no extractable text. You need to OCR the PDF first using a tool like ocrmypdf before converting.

Tables not appearing (PDF) — Use the --extract-tables flag. Table detection is off by default to keep output clean for text-heavy documents.
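
For reference, a GitHub-flavored Markdown table is a header row, a separator row, and one row per record. The sketch below renders a list of rows in that shape; it illustrates the target format only and is not how file2md builds its tables. The function name `rows_to_gfm` is hypothetical, and missing cells (as can happen with merged cells) are rendered empty.

```python
from typing import Optional, Sequence


def rows_to_gfm(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows as a GFM table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def cell(value: Optional[str]) -> str:
        text = "" if value is None else str(value)
        # Escape pipes and flatten newlines so each cell stays on one line.
        return text.replace("|", "\\|").replace("\n", " ").strip()

    def line(row: Sequence[Optional[str]]) -> str:
        padded = list(row) + [None] * (width - len(row))
        return "| " + " | ".join(cell(c) for c in padded) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in range(width)) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in body)])


table = rows_to_gfm([["Code", "Meaning"], ["0", "Success"], ["4", None]])
```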

Output has hard line breaks — Use the --clean flag to enable paragraph reflow, which joins lines that were artificially broken by PDF formatting.

License

MIT
