any2md
Convert PDF, DOCX, HTML, and TXT files — or web pages by URL — to clean, LLM-optimized Markdown with YAML frontmatter.
One command. Any format. Consistent, structured output ready for language models.
Quick Start
pip install any2md
any2md report.pdf
any2md https://example.com/article
any2md --help
Output lands in ./Text/ by default:
---
title: "Quarterly Financial Report"
source_file: "report.pdf"
pages: 12
type: pdf
---
# Quarterly Financial Report
Document content here...
Features
| Feature | Description |
|---|---|
| Multi-format | PDF, DOCX, HTML (.html, .htm), TXT |
| URL fetching | Pass any http/https URL as input |
| YAML frontmatter | Title, source, page/word count, type |
| Batch processing | Single file, directory scan, or mixed inputs |
| Auto-routing | Dispatches to the correct converter by extension |
| Smart skip | Won't overwrite existing files unless --force is passed |
| Filename sanitization | Spaces, special characters, unicode dashes handled |
| TXT structure detection | Infers headings, lists, code blocks from plain text |
| Title extraction | Pulls the first H1–H3 heading automatically |
| Link stripping | --strip-links removes hyperlinks, keeps text |
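
To make the filename sanitization row concrete, here is a minimal sketch of what such a step could look like. The helper name `sanitize_filename` and its exact rules are assumptions for illustration; any2md's actual behavior may differ.

```python
import re
import unicodedata

# Hypothetical mapping: common unicode dashes and the minus sign become "-".
_UNICODE_DASHES = dict.fromkeys(
    map(ord, "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"), "-"
)


def sanitize_filename(stem: str) -> str:
    """Return a filesystem-friendly version of a file stem (illustrative only)."""
    # Normalize compatibility characters, then map unicode dashes to "-".
    text = unicodedata.normalize("NFKC", stem).translate(_UNICODE_DASHES)
    # Collapse each run of whitespace into a single underscore.
    text = re.sub(r"\s+", "_", text.strip())
    # Drop anything that is not a word character, dot, or hyphen.
    text = re.sub(r"[^\w.\-]", "", text)
    # Fall back to a placeholder if nothing usable is left.
    return text or "untitled"
```

Under these assumed rules, a stem such as "Q3 Report 2024" would become "Q3_Report_2024".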
Installation
Requires Python 3.8+.
pip install any2md
From source
git clone https://github.com/rocklambros/any2md.git
cd any2md
pip install .
Dependencies
| Library | Purpose |
|---|---|
| PyMuPDF + pymupdf4llm | PDF extraction |
| mammoth + markdownify | DOCX conversion |
| trafilatura + BeautifulSoup | HTML/URL extraction |
Usage
Basic conversion
# Single file
any2md report.pdf
# Multiple files
any2md report.pdf proposal.docx "meeting notes.pdf"
# HTML file
any2md page.html
# Web page by URL
any2md https://example.com/article
# Plain text file
any2md notes.txt
# Mixed batch — PDFs, DOCX, HTML, TXT, and URLs together
any2md doc.pdf page.html notes.txt https://example.com
Directory scanning
# Scan a specific directory
any2md --input-dir ./documents
# Convert everything in the current directory (default behavior)
any2md
Options
# Custom output directory
any2md -o ./converted report.pdf
# Overwrite existing files
any2md --force
# Strip hyperlinks from output
any2md --strip-links doc.pdf
# Combine options
any2md -f -o ./out --strip-links docs/*.pdf docs/*.docx
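
For readers wondering what --strip-links does to the text, the sketch below shows one plausible way to remove inline Markdown links while keeping the link text. The function name `strip_links` and the regular expression are assumptions, not any2md's actual code, and reference-style links are not handled.

```python
import re

# Matches inline links like [text](url) or [text](url "title"),
# but leaves images such as ![alt](src) alone via the negative lookbehind.
_INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\([^)\s]+(?:\s+\"[^\"]*\")?\)")


def strip_links(markdown: str) -> str:
    """Replace each inline link with its visible text (illustrative only)."""
    return _INLINE_LINK.sub(r"\1", markdown)
```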
Alternative invocations
# Module mode (works without installing via pip)
python -m any2md report.pdf
# Legacy script (backward compatibility)
python3 mdconv.py report.pdf
Output Format
Every converted file has YAML frontmatter followed by cleaned Markdown. The frontmatter fields vary by source format:
PDF — includes page count:
---
title: "Quarterly Financial Report"
source_file: "Q3 Report 2024.pdf"
pages: 12
type: pdf
---
DOCX — includes word count:
---
title: "Project Proposal"
source_file: "proposal.docx"
word_count: 3847
type: docx
---
HTML file — includes word count:
---
title: "Page Title"
source_file: "page.html"
word_count: 1234
type: html
---
TXT — structure inferred via heuristics, includes word count:
---
title: "Meeting Notes"
source_file: "notes.txt"
word_count: 892
type: txt
---
URL — records source URL instead of filename:
---
title: "Article Title"
source_url: "https://example.com/article"
word_count: 567
type: html
---
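
The field rules above (pages for PDFs, word_count otherwise, source_url for URLs) can be sketched as a small helper. `build_frontmatter` is a hypothetical name, and quoting strings with `json.dumps` is an assumption chosen because JSON strings are valid double-quoted YAML scalars.

```python
import json


def build_frontmatter(title: str, doc_type: str, source: str, count: int) -> str:
    """Build a frontmatter block in the shape shown above (illustrative only).

    `source` is a filename or an http(s) URL; `count` is a page count for
    PDFs and a word count for every other type.
    """
    is_url = source.startswith(("http://", "https://"))
    source_key = "source_url" if is_url else "source_file"
    count_key = "pages" if doc_type == "pdf" else "word_count"
    lines = [
        "---",
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"{source_key}: {json.dumps(source, ensure_ascii=False)}",
        f"{count_key}: {count}",
        f"type: {doc_type}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```

Calling `build_frontmatter("Project Proposal", "docx", "proposal.docx", 3847)` reproduces the DOCX example above.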
CLI Reference
usage: any2md [-h] [--input-dir PATH] [--force] [--output-dir PATH] [--strip-links] [files ...]
Convert PDF, DOCX, HTML, and TXT files to LLM-optimized Markdown.
positional arguments:
  files                  Files or URLs to convert. Supports PDF, DOCX, HTML,
                         TXT files and http(s) URLs. If omitted, converts all
                         supported files in the current directory.

options:
  -h, --help             show this help message and exit
  --input-dir, -i PATH   Directory to scan for supported files (PDF, DOCX, HTML, TXT)
  --force, -f            Overwrite existing .md files
  --output-dir, -o PATH  Output directory (default: ./Text)
  --strip-links          Remove markdown links, keeping only the link text
Architecture
User Input (files, URLs, flags)
     │
     ▼
cli.py ─── parse args, classify URLs vs file paths
     │
     ▼
converters/__init__.py ─── dispatch by extension
     │
┌────┼────┬────┐
▼    ▼    ▼    ▼
pdf  docx html txt ─── format-specific extraction
│    │    │    │
└────┼────┴────┘
     ▼
utils.py ─── clean, title-extract, sanitize, frontmatter
     │
     ▼
Output ─── YAML frontmatter + Markdown → output_dir/
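
The first two stages of the diagram (classifying URLs versus file paths, then dispatching by extension) can be sketched as follows. The names `is_url` and `route`, and the contents of `SUPPORTED_EXTENSIONS` here, are assumptions based on the feature list rather than any2md's actual source.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed contents, taken from the formats listed under Features.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".html", ".htm", ".txt"}


def is_url(arg: str) -> bool:
    """Treat only well-formed http(s) URLs as URLs; everything else is a path."""
    parsed = urlparse(arg)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def route(arg: str) -> str:
    """Return the name of the converter an input would be sent to."""
    if is_url(arg):
        return "html"  # fetched pages go through the HTML pipeline
    ext = Path(arg).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported input: {arg}")
    # .htm and .html share one converter; the others map one-to-one.
    return "html" if ext == ".htm" else ext.lstrip(".")
```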
Extraction pipelines
| Format | Pipeline |
|---|---|
| PDF | pymupdf4llm.to_markdown() → clean → frontmatter |
| DOCX | mammoth (DOCX → HTML) → markdownify (HTML → Markdown) → clean → frontmatter |
| HTML/URL | BS4 pre-clean → trafilatura extract (fallback: markdownify) → clean → frontmatter |
| TXT | structurize() heuristics (headings, lists, code blocks) → clean → frontmatter |
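
Every pipeline ends with the same clean and frontmatter steps, which include pulling a title from the first H1 to H3 heading. A minimal sketch of that title step is below; `extract_title` and the fallback argument are hypothetical, and this version does not try to skip headings that appear inside code blocks.

```python
import re

# An ATX heading of level 1 to 3, with optional trailing "#" characters.
_HEADING = re.compile(r"^#{1,3}[ \t]+(.+?)[ \t]*#*[ \t]*$", re.MULTILINE)


def extract_title(markdown: str, fallback: str) -> str:
    """Return the text of the first H1-H3 heading, or `fallback` if none exists."""
    match = _HEADING.search(markdown)
    return match.group(1).strip() if match else fallback
```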
Adding a new format
- Create any2md/converters/newformat.py with a convert_newformat(path, output_dir, force, strip_links_flag) → bool function
- Add the extension and function to CONVERTERS in any2md/converters/__init__.py
- Add the extension to SUPPORTED_EXTENSIONS
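
The steps above can be sketched as a single self-contained example. Only the function signature and the names CONVERTERS and SUPPORTED_EXTENSIONS come from this README; the ".newformat" extension, the meaning of the boolean return value (True when a file is written, False when it is skipped), and the body are assumptions.

```python
from pathlib import Path


def convert_newformat(path, output_dir, force, strip_links_flag) -> bool:
    """Hypothetical converter that treats the input as UTF-8 text."""
    src = Path(path)
    out_dir = Path(output_dir)
    out_path = out_dir / f"{src.stem}.md"
    # Assumed skip rule: leave existing output alone unless force is set.
    if out_path.exists() and not force:
        return False
    body = src.read_text(encoding="utf-8")
    # strip_links_flag is accepted for signature compatibility only;
    # this sketch performs no link stripping.
    frontmatter = (
        "---\n"
        f'title: "{src.stem}"\n'
        f'source_file: "{src.name}"\n'
        f"word_count: {len(body.split())}\n"
        "type: newformat\n"
        "---\n\n"
    )
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path.write_text(frontmatter + body.rstrip() + "\n", encoding="utf-8")
    return True


# Registration as it might look in any2md/converters/__init__.py (assumed shape).
CONVERTERS = {".newformat": convert_newformat}
SUPPORTED_EXTENSIONS = set(CONVERTERS)
```

In the real package the new entry would be added to the existing CONVERTERS mapping and SUPPORTED_EXTENSIONS collection rather than redefining them.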
License
MIT