Convert PDF and DOCX to clean, grep-friendly Markdown for AI/IDE workflows

file2md

Convert PDF and DOCX files to clean, grep-friendly Markdown optimized for AI tools and IDE workflows.

Features

PDF conversion — text extraction with PyMuPDF, page separators, scanned PDF detection
DOCX conversion — headings, lists, tables, bold/italic preserved as Markdown
Grep-friendly output — paragraph reflow, hyphenation fix, whitespace normalization
Drag-and-drop web UI — upload and convert files in your browser
Full CLI — single file, batch, and web server modes
Table extraction — PDF tables converted to GitHub-flavored Markdown
Header/footer removal — heuristic detection of repeating headers/footers
Metadata headers — source filename and conversion timestamp in output
YAML frontmatter — optional structured metadata for downstream tools

Installation

# CLI only (lightweight)
pip install file2md

# With web UI (quote the extra so shells such as zsh do not expand the brackets)
pip install "file2md[web]"

# Development (all dependencies)
pip install "file2md[all]"

Quick Start

Web UI

file2md serve
# Open http://127.0.0.1:8000 in your browser and drop your files onto the page

CLI — Single File

# Basic conversion
file2md convert document.pdf -o document.md

# With all enhancements
file2md convert report.pdf -o report.md --clean --frontmatter --page-labels --extract-tables

CLI — Batch

# Convert all PDF and DOCX files in a directory
file2md batch ./documents --out-dir ./markdown --recursive

CLI Reference

`file2md convert`

Convert a single PDF or DOCX file to Markdown.

Flag	Description
`-o / --output`	Output file path (defaults to input name with `.md`)
`--clean`	Normalize whitespace, reflow paragraphs, fix hyphenation
`--frontmatter`	Add YAML frontmatter (source, timestamp, converter)
`--page-labels`	Add `## Page N` headings (PDF only)
`--extract-tables`	Detect and convert tables to GFM (PDF)
`--max-chars N`	Truncate output at N characters
`--overwrite`	Overwrite existing output file
`--quiet`	Suppress warnings
`--verbose`	Show detailed progress
`--json`	Machine-readable JSON output

`file2md batch`

Batch convert all PDF/DOCX files in a directory.

Flag	Description
`--out-dir`	Output directory (required)
`--recursive`	Process subdirectories
All flags from `convert`	Same options available
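
As a rough picture of what a batch run has to enumerate, the sketch below collects PDF and DOCX files and pairs each with a target path under the output directory. The function name `plan_batch` and the rule that targets mirror the input tree with a `.md` suffix are assumptions for illustration, not file2md's documented behavior.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".docx"}  # the two formats the README names


def plan_batch(in_dir, out_dir, recursive=False):
    """Return (source, target) pairs for every PDF/DOCX under in_dir.

    Assumption: targets mirror the input tree and swap the suffix for .md.
    file2md's real naming rules may differ.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    candidates = in_dir.rglob("*") if recursive else in_dir.glob("*")
    pairs = []
    for src in sorted(candidates):
        if src.is_file() and src.suffix.lower() in SUPPORTED_SUFFIXES:
            rel = src.relative_to(in_dir)
            pairs.append((src, out_dir / rel.with_suffix(".md")))
    return pairs
```

Under this naming assumption, `a.pdf` and `a.docx` in the same folder would map to the same target, so a real run would need a collision rule.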

`file2md serve`

Start the web UI server.

Flag	Description
`--host`	Host to bind to (default: `127.0.0.1`)
`--port`	Port to listen on (default: `8000`)

Exit Codes

Code	Meaning
`0`	Success
`2`	Unsupported file type
`3`	Extraction failed
`4`	Scanned PDF detected (no OCR)

Output Conventions

Metadata

Every converted file includes a metadata comment:

<!-- source: document.pdf | converted: 2026-02-26T12:00:00Z | converter: file2md v0.1.0 -->

With --frontmatter:

---
source: document.pdf
converted: 2026-02-26T12:00:00Z
converter: file2md v0.1.0
---
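
Downstream tools can read either form back with a few lines of Python. The sketch below is one possible reader, not part of file2md. It assumes the frontmatter holds only flat `key: value` lines, as in the example above, and that values in the comment form contain no `|` character.

```python
import re

COMMENT_RE = re.compile(r"<!--\s*(.*?)\s*-->")


def parse_metadata(markdown):
    """Return the metadata fields from frontmatter or from the HTML comment."""
    lines = markdown.lstrip().splitlines()
    pairs = []
    if lines and lines[0].strip() == "---":
        # Frontmatter: collect "key: value" lines up to the closing ---.
        for line in lines[1:]:
            if line.strip() == "---":
                break
            pairs.append(line)
    else:
        # Comment form: fields are separated by "|".
        match = COMMENT_RE.search(markdown)
        if match:
            pairs = match.group(1).split("|")
    meta = {}
    for pair in pairs:
        # Split at the first colon only, so timestamps keep their colons.
        key, sep, value = pair.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta
```

For nested or quoted YAML, a real YAML parser would be the safer choice.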

PDF Page Separators

Pages are separated by a line containing only `---`. With `--page-labels`, each page also starts with a heading:

## Page 1

Content of page 1...

---

## Page 2

Content of page 2...
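
Because the page headings follow a fixed pattern, labeled output can be split back into pages. The following sketch assumes the exact `## Page N` heading shown above. `split_pages` is a hypothetical helper, not a file2md function.

```python
import re

PAGE_HEADING_RE = re.compile(r"^## Page (\d+)[ \t]*$", re.MULTILINE)


def split_pages(markdown):
    """Map page number -> page text for output written with --page-labels."""
    matches = list(PAGE_HEADING_RE.finditer(markdown))
    pages = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        lines = markdown[m.end():end].strip().splitlines()
        # Drop the --- separator line that precedes the next page heading.
        if lines and lines[-1].strip() == "---":
            lines = lines[:-1]
        pages[int(m.group(1))] = "\n".join(lines).strip()
    return pages
```

Splitting on the headings rather than on `---` avoids confusing a page separator with a horizontal rule inside the page content.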

Clean Mode (`--clean`)

Paragraph reflow — undoes hard line wraps from PDF extraction
Hyphenation fix — merges hyphen-\nated words across lines
Header/footer removal — detects and removes repeating page headers/footers
Whitespace normalization — collapses extra spaces, limits blank lines
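
The four steps above can be approximated in plain Python. This is an illustrative sketch only and does not reproduce the logic in `normalize.py`. All function names and the `min_share` threshold are assumptions.

```python
import re
from collections import Counter


def fix_hyphenation(text):
    """Join a word that was split by a hyphen at a line break."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def reflow(text):
    """Join hard-wrapped lines inside each paragraph into one line."""
    paragraphs = re.split(r"\n\s*\n", text)
    joined = (
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    )
    return "\n\n".join(p for p in joined if p)


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and allow at most one blank line."""
    text = re.sub(r"[ \t]+", " ", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def strip_repeated_edges(pages, min_share=0.6):
    """Drop first/last lines that repeat on most pages (header/footer guess)."""
    counts = Counter()
    for page in pages:
        lines = [line.strip() for line in page.splitlines() if line.strip()]
        if lines:
            counts.update({lines[0], lines[-1]})  # set: count once per page
    repeated = {
        line for line, n in counts.items()
        if len(pages) > 1 and n / len(pages) >= min_share
    }
    cleaned = []
    for page in pages:
        lines = page.splitlines()
        idx = [i for i, line in enumerate(lines) if line.strip()]
        drop = set()
        if idx and lines[idx[0]].strip() in repeated:
            drop.add(idx[0])
        if idx and lines[idx[-1]].strip() in repeated:
            drop.add(idx[-1])
        kept = [line for i, line in enumerate(lines) if i not in drop]
        cleaned.append("\n".join(kept).strip())
    return cleaned


def clean(text):
    """Apply hyphenation fix, reflow, then whitespace normalization."""
    return normalize_whitespace(reflow(fix_hyphenation(text)))
```

This naive version has known gaps: it would also merge genuine compounds broken at a line end (for example `well-` / `known`), it reflows list items and table rows along with prose, and footers that carry a changing page number never match as repeats.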

Architecture

src/file2md/
├── convert.py          # Main entry point — dispatches by file type
├── pdf.py              # PDF → Markdown (PyMuPDF)
├── docx_converter.py   # DOCX → Markdown (python-docx)
├── normalize.py        # Text cleanup (reflow, hyphenation, headers/footers)
├── cli.py              # Click CLI (convert, batch, serve)
├── web.py              # FastAPI web server
├── utils.py            # Shared types, validation, metadata
└── templates/
    └── index.html      # Drag-and-drop web UI
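
To illustrate what "dispatches by file type" can look like, here is a minimal sketch of suffix-based dispatch. Every name in it (`convert_file`, `UnsupportedFileType`, the two stand-in converters) is hypothetical. The actual functions in `convert.py`, `pdf.py`, and `docx_converter.py` are not documented in this README.

```python
from pathlib import Path


class UnsupportedFileType(ValueError):
    """Raised for anything other than .pdf or .docx (hypothetical name)."""


def pdf_to_markdown(path):
    # Stand-in for the PyMuPDF-based converter; returns a placeholder.
    return f"<markdown from PDF {path.name}>"


def docx_to_markdown(path):
    # Stand-in for the python-docx-based converter; returns a placeholder.
    return f"<markdown from DOCX {path.name}>"


# One table maps each supported suffix to its converter.
CONVERTERS = {".pdf": pdf_to_markdown, ".docx": docx_to_markdown}


def convert_file(path):
    """Pick a converter from the file suffix, case-insensitively."""
    path = Path(path)
    try:
        converter = CONVERTERS[path.suffix.lower()]
    except KeyError:
        raise UnsupportedFileType(
            f"unsupported file type: {path.suffix or path.name}"
        ) from None
    return converter(path)
```

A CLI layer could catch `UnsupportedFileType` and translate it to exit code `2`.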

Development

git clone https://github.com/faizkhairi/file2md.git
cd file2md
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows
pip install -e ".[all]"

# Run tests
pytest

# Lint
ruff check src/ tests/

# Build
python -m build

Known Limitations

No OCR — scanned/image-only PDFs are detected and rejected with a clear error (exit code 4). OCR support is planned for a future release.
Complex PDF layouts — multi-column documents, sidebars, and footnotes may produce text in unexpected order.
Nested DOCX lists — only flat bullet/numbered lists are supported. Nested and mixed lists are not preserved.
Merged table cells — may produce duplicated or empty cells in the Markdown output.
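
The scanned-PDF check in the first limitation can be approximated by testing whether any page has a usable text layer. The sketch below is an assumption about how such a check might work, not file2md's actual heuristic. The `min_chars` threshold is arbitrary, and `extract_page_texts` needs PyMuPDF installed.

```python
def looks_scanned(page_texts, min_chars=20):
    """True when no page yields a meaningful amount of extractable text."""
    texts = list(page_texts)
    return bool(texts) and all(len(t.strip()) < min_chars for t in texts)


def extract_page_texts(pdf_path):
    """Read the text layer of each page with PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF; imported lazily so looks_scanned works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Calling `looks_scanned(extract_page_texts("document.pdf"))` before conversion lets a script send image-only files to an OCR tool first.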

Troubleshooting

"All pages appear to be scanned images" — The PDF contains only images, no extractable text. You need to OCR the PDF first using a tool like ocrmypdf before converting.

Tables not appearing (PDF) — Use the --extract-tables flag. Table detection is off by default to keep output clean for text-heavy documents.

Output has hard line breaks — Use the --clean flag to enable paragraph reflow, which joins lines that were artificially broken by PDF formatting.

License

MIT
