Convert PDF and DOCX to clean, grep-friendly Markdown for AI/IDE workflows
file2md
Convert PDF and DOCX files to clean, grep-friendly Markdown optimized for AI tools and IDE workflows.
Features
- PDF conversion — text extraction with PyMuPDF, page separators, scanned PDF detection
- DOCX conversion — headings, lists, tables, bold/italic preserved as Markdown
- Grep-friendly output — paragraph reflow, hyphenation fix, whitespace normalization
- Drag-and-drop web UI — upload and convert files in your browser
- Full CLI — single file, batch, and web server modes
- Table extraction — PDF tables converted to GitHub-flavored Markdown
- Header/footer removal — heuristic detection of repeating headers/footers
- Metadata headers — source filename and conversion timestamp in output
- YAML frontmatter — optional structured metadata for downstream tools
Installation
# CLI only (lightweight)
pip install file2md
# With web UI
pip install "file2md[web]"
# Development (all dependencies)
pip install "file2md[all]"
Quick Start
Web UI
file2md serve
# Open http://127.0.0.1:8000 and drag your files
CLI — Single File
# Basic conversion
file2md convert document.pdf -o document.md
# With all enhancements
file2md convert report.pdf -o report.md --clean --frontmatter --page-labels --extract-tables
CLI — Batch
# Convert all PDF and DOCX files in a directory
file2md batch ./documents --out-dir ./markdown --recursive
CLI Reference
file2md convert
Convert a single PDF or DOCX file to Markdown.
| Flag | Description |
|---|---|
| -o / --output | Output file path (defaults to input name with .md) |
| --clean | Normalize whitespace, reflow paragraphs, fix hyphenation |
| --frontmatter | Add YAML frontmatter (source, timestamp, converter) |
| --page-labels | Add ## Page N headings (PDF only) |
| --extract-tables | Detect and convert tables to GFM (PDF) |
| --max-chars N | Truncate output at N characters |
| --overwrite | Overwrite existing output file |
| --quiet | Suppress warnings |
| --verbose | Show detailed progress |
| --json | Machine-readable JSON output |
file2md batch
Batch convert all PDF/DOCX files in a directory.
| Flag | Description |
|---|---|
| --out-dir | Output directory (required) |
| --recursive | Process subdirectories |
| All flags from convert | Same options available |
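The batch behavior described above can be pictured as a file-discovery step followed by one conversion per file. The sketch below is an illustration only, not file2md's actual code: `plan_batch` is a hypothetical helper, and mirroring the input directory layout under the output directory is an assumption.

```python
from pathlib import Path

# Extensions the README says are supported.
SUPPORTED = {".pdf", ".docx"}


def plan_batch(src_dir, out_dir, recursive=False):
    """Return (input, output) path pairs for every PDF/DOCX under src_dir.

    Hypothetical helper: file2md may name or place outputs differently.
    """
    src, out = Path(src_dir), Path(out_dir)
    pattern = "**/*" if recursive else "*"
    pairs = []
    for path in sorted(src.glob(pattern)):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            # Assumption: mirror the relative layout so same-named files in
            # different subdirectories do not collide in the output directory.
            rel = path.relative_to(src).with_suffix(".md")
            pairs.append((path, out / rel))
    return pairs
```

Note that `report.pdf` and `report.docx` in the same directory would both map to `report.md` in this sketch; a real implementation needs a rule for that case.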
file2md serve
Start the web UI server.
| Flag | Description |
|---|---|
| --host | Host to bind to (default: 127.0.0.1) |
| --port | Port to listen on (default: 8000) |
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 2 | Unsupported file type |
| 3 | Extraction failed |
| 4 | Scanned PDF detected (no OCR) |
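When calling the CLI from a script, the exit codes above can be mapped back to readable messages. This is a minimal sketch: the `convert` and `describe_exit` helpers are hypothetical names, and only the `convert` subcommand, the `-o` flag, and the four listed codes come from this README.

```python
import subprocess

# Exit codes as documented in the table above.
EXIT_MEANINGS = {
    0: "success",
    2: "unsupported file type",
    3: "extraction failed",
    4: "scanned PDF detected (no OCR)",
}


def describe_exit(code):
    """Translate a file2md exit code into a short message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def convert(path, output):
    """Run `file2md convert` and return (ok, message).

    Assumes the file2md executable is on PATH; raises FileNotFoundError
    if it is not installed.
    """
    result = subprocess.run(
        ["file2md", "convert", str(path), "-o", str(output)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, describe_exit(result.returncode)
```

A caller could then branch on the message, for example routing files that report exit code 4 to an OCR step before retrying.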
Output Conventions
Metadata
Every converted file includes a metadata comment:
<!-- source: document.pdf | converted: 2026-02-26T12:00:00Z | converter: file2md v0.1.0 -->
With --frontmatter:
---
source: document.pdf
converted: 2026-02-26T12:00:00Z
converter: file2md v0.1.0
---
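Downstream tools can read either metadata form with a few lines of Python. The sketch below assumes the exact layouts shown above (pipe-separated `key: value` pairs in the comment, flat `key: value` lines in the frontmatter). Both function names are hypothetical, and a real YAML parser would be the safer choice for anything beyond flat keys.

```python
import re

# Matches the first HTML comment in the text.
META_COMMENT = re.compile(r"<!--\s*(.*?)\s*-->")


def _parse_pair(text, fields):
    # Split on the first colon only, so timestamps keep their colons.
    key, sep, value = text.partition(":")
    if sep:
        fields[key.strip()] = value.strip()


def parse_metadata_comment(markdown):
    """Parse '<!-- source: ... | converted: ... | converter: ... -->'."""
    match = META_COMMENT.search(markdown)
    fields = {}
    if match:
        # Assumption: values never contain a literal '|'.
        for part in match.group(1).split("|"):
            _parse_pair(part, fields)
    return fields


def parse_frontmatter(markdown):
    """Parse a flat 'key: value' block delimited by '---' lines."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        _parse_pair(line, fields)
    return {}  # No closing delimiter: treat as absent.
```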
PDF Page Separators
Pages are separated by ---. With --page-labels:
## Page 1
Content of page 1...
---
## Page 2
Content of page 2...
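Because the page headings are regular, output produced with --page-labels can be split back into pages. A minimal sketch, assuming the `## Page N` and `---` conventions shown above; `split_pages` is a hypothetical helper, not part of file2md.

```python
import re

PAGE_HEADING = re.compile(r"^## Page (\d+)\s*$")


def split_pages(markdown):
    """Return {page_number: text} for output produced with --page-labels."""
    pages = {}
    current = None
    for line in markdown.splitlines():
        heading = PAGE_HEADING.match(line)
        if heading:
            current = int(heading.group(1))
            pages[current] = []
        elif current is not None and line.strip() != "---":
            # Caveat: this also drops any '---' line inside page content,
            # such as a horizontal rule from the source document.
            pages[current].append(line)
    return {number: "\n".join(body).strip() for number, body in pages.items()}
```

Anything before the first page heading, such as the metadata comment or frontmatter, is ignored by this sketch.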
Clean Mode (--clean)
- Paragraph reflow: undoes hard line wraps from PDF extraction
- Hyphenation fix: merges hyphen-\nated words across lines
- Header/footer removal: detects and removes repeating page headers/footers
- Whitespace normalization: collapses extra spaces, limits blank lines
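The cleanup steps listed above can be approximated with a few regular expressions. The sketch below shows one plausible approach and is not the code in normalize.py: all function names are hypothetical, and the 60% repetition threshold is an assumption.

```python
import re
from collections import Counter


def fix_hyphenation(text):
    """Join words split as 'hyphen-' + newline + 'ated'."""
    # Only lowercase letters on both sides, to avoid joining real compounds
    # that happen to end a line before a capitalized word.
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def reflow(text):
    """Join hard-wrapped lines per paragraph and normalize whitespace."""
    # Any run of blank lines counts as one paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # split()/join collapses newlines and repeated spaces to single spaces.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def clean(text):
    return reflow(fix_hyphenation(text))


def strip_repeated_edges(pages, min_share=0.6):
    """Drop first/last lines that repeat on at least min_share of pages."""
    split = [page.splitlines() for page in pages]
    for index in (0, -1):  # 0 = header candidate, -1 = footer candidate
        counts = Counter(lines[index].strip() for lines in split if lines)
        repeated = {
            text
            for text, n in counts.items()
            if text and n >= 2 and n / len(split) >= min_share
        }
        for lines in split:
            if lines and lines[index].strip() in repeated:
                del lines[index]
    return ["\n".join(lines) for lines in split]
```

This naive reflow would also flatten lists, headings, and tables into running text, so a practical version has to skip those blocks.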
Architecture
src/file2md/
├── convert.py # Main entry point — dispatches by file type
├── pdf.py # PDF → Markdown (PyMuPDF)
├── docx_converter.py # DOCX → Markdown (python-docx)
├── normalize.py # Text cleanup (reflow, hyphenation, headers/footers)
├── cli.py # Click CLI (convert, batch, serve)
├── web.py # FastAPI web server
├── utils.py # Shared types, validation, metadata
└── templates/
└── index.html # Drag-and-drop web UI
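The "dispatches by file type" role of convert.py can be illustrated with a small registry keyed by file extension. This is a sketch of the general pattern only: `dispatch`, `UnsupportedFileType`, and the registry shape are hypothetical and not taken from the project's source.

```python
from pathlib import Path


class UnsupportedFileType(Exception):
    """Hypothetical error for extensions with no registered converter."""


def dispatch(path, converters):
    """Call the converter registered for the file's extension.

    `converters` maps a lowercase suffix such as '.pdf' to a callable
    that takes the path and returns Markdown text.
    """
    suffix = Path(path).suffix.lower()
    try:
        converter = converters[suffix]
    except KeyError:
        raise UnsupportedFileType(
            f"unsupported file type: {suffix or '(none)'}"
        ) from None
    return converter(path)
```

A CLI layer could catch the exception and translate it into the "Unsupported file type" exit code.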
Development
git clone https://github.com/faizkhairi/file2md.git
cd file2md
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
pip install -e ".[all]"
# Run tests
pytest
# Lint
ruff check src/ tests/
# Build
python -m build
Known Limitations
- No OCR — scanned/image-only PDFs are detected and rejected with a clear error (exit code 4). OCR support is planned for a future release.
- Complex PDF layouts — multi-column documents, sidebars, and footnotes may produce text in unexpected order.
- Nested DOCX lists — only flat bullet/numbered lists are supported. Nested and mixed lists are not preserved.
- Merged table cells — may produce duplicated or empty cells in the Markdown output.
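To see why merged cells surface as empty or duplicated cells, consider how extracted rows are typically rendered as a GitHub-flavored Markdown table. The sketch below assumes rows arrive as lists of cell values, with `None` where a merged cell has no value of its own. `rows_to_gfm` is a hypothetical helper, not file2md's implementation.

```python
def rows_to_gfm(rows):
    """Render a list of rows as a GFM table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        cells = [
            # None (for example a merged cell) becomes an empty cell;
            # pipes are escaped and newlines flattened so the row stays intact.
            ("" if cell is None else str(cell))
            .replace("|", "\\|")
            .replace("\n", " ")
            .strip()
            for cell in row
        ]
        cells += [""] * (width - len(cells))  # Pad short rows.
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "|" + "---|" * width]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```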
Troubleshooting
"All pages appear to be scanned images" — The PDF contains only images, no extractable text. You need to OCR the PDF first using a tool like ocrmypdf before converting.
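A scanned-PDF check of this kind usually comes down to measuring how much text each page yields. The sketch below works on already-extracted page strings (for instance from PyMuPDF's `page.get_text()`). The `looks_scanned` name and the 20-character threshold are assumptions, not file2md's actual heuristic.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True when every page has almost no extractable text.

    page_texts: one string of extracted text per page.
    min_chars: assumed threshold of non-whitespace characters per page.
    """
    if not page_texts:
        return False  # An empty document is not treated as scanned here.
    return all(
        len("".join(text.split())) < min_chars  # Count non-whitespace only.
        for text in page_texts
    )
```

A single page with real text is enough for this sketch to return False, which matches the "All pages" wording of the error message.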
Tables not appearing (PDF) — Use the --extract-tables flag. Table detection is off by default to keep output clean for text-heavy documents.
Output has hard line breaks — Use the --clean flag to enable paragraph reflow, which joins lines that were artificially broken by PDF formatting.
License