Convert PDF, DOCX, and HTML files — or web pages by URL — to clean, LLM-optimized Markdown with YAML frontmatter.

any2md

One command. Any format. Consistent, structured output ready for language models.

Quick Start

pip install any2md

any2md report.pdf
any2md https://example.com/article
any2md --help

Output lands in ./Text/ by default:

---
title: "Quarterly Financial Report"
source_file: "report.pdf"
pages: 12
type: pdf
---

# Quarterly Financial Report

Document content here...
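
To consume the converted files from Python, the frontmatter can be separated from the body with a few lines of standard-library code. The following is a minimal sketch, assuming the frontmatter is always a flat list of `key: value` lines between two `---` delimiters, as in the example above. The function name `read_frontmatter` is illustrative and not part of any2md, and this is not a general YAML parser.

```python
from typing import Dict, Tuple


def read_frontmatter(markdown_text: str) -> Tuple[Dict[str, str], str]:
    """Split a converted document into (frontmatter fields, Markdown body).

    Assumption: frontmatter is flat `key: value` lines between `---` lines.
    All values are returned as strings, with surrounding double quotes removed.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown_text  # no frontmatter found

    fields: Dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the Markdown body.
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return fields, body
        # Split on the first colon only, so URLs in values stay intact.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip().strip('"')
    return {}, markdown_text  # opening delimiter without a closing one
```

Applied to each `.md` file under `./Text/`, this yields the `title`, `type`, and count fields as strings, which is usually enough for filtering or indexing before the body is passed to a model.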

Features

| Feature | Description |
| --- | --- |
| Multi-format | PDF, DOCX, HTML (.html, .htm) |
| URL fetching | Pass any http/https URL as input |
| YAML frontmatter | Title, source, page/word count, type |
| Batch processing | Single file, directory scan, or mixed inputs |
| Auto-routing | Dispatches to the correct converter by extension |
| Smart skip | Won't overwrite existing files unless --force |
| Filename sanitization | Spaces, special characters, unicode dashes handled |
| Title extraction | Pulls the first H1–H3 heading automatically |
| Link stripping | --strip-links removes hyperlinks, keeps text |
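
As an illustration of the title extraction feature, the sketch below finds the first level 1 to 3 heading in a Markdown string. It is an assumption about how such a step could work, not any2md's actual code: the name `extract_title`, the `fallback` parameter, and the restriction to `#`-style headings are all hypothetical.

```python
import re

# Matches "# Title", "## Title", or "### Title" at the start of a line.
# Deeper headings ("####" and beyond) do not match.
HEADING_RE = re.compile(r"^#{1,3}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_title(markdown_text: str, fallback: str = "Untitled") -> str:
    """Return the text of the first H1-H3 heading, or `fallback` if none exists."""
    match = HEADING_RE.search(markdown_text)
    return match.group(1) if match else fallback
```

A regex this simple also matches `#` comment lines inside fenced code blocks, so a production version would likely need to skip fenced regions first.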

Installation

Requires Python 3.8+.

pip install any2md

From source

git clone https://github.com/rocklambros/any2md.git
cd any2md
pip install .

Dependencies

| Library | Purpose |
| --- | --- |
| PyMuPDF + pymupdf4llm | PDF extraction |
| mammoth + markdownify | DOCX conversion |
| trafilatura + BeautifulSoup | HTML/URL extraction |

Usage

Basic conversion

# Single file
any2md report.pdf

# Multiple files
any2md report.pdf proposal.docx "meeting notes.pdf"

# HTML file
any2md page.html

# Web page by URL
any2md https://example.com/article

# Mixed batch — PDFs, DOCX, HTML, and URLs together
any2md doc.pdf page.html https://example.com

Directory scanning

# Scan a specific directory
any2md --input-dir ./documents

# Convert everything in the current directory (default behavior)
any2md
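
The directory scan can be pictured as a filter over file extensions. The sketch below is a rough approximation under stated assumptions: it looks only at the top level of the directory and compares extensions case-insensitively, neither of which is confirmed by this README. `SUPPORTED` and `find_supported_files` are illustrative names.

```python
from pathlib import Path
from typing import List

# Extensions taken from the Features table; the constant name is illustrative.
SUPPORTED = {".pdf", ".docx", ".html", ".htm"}


def find_supported_files(directory: Path) -> List[Path]:
    """Return supported files directly inside `directory`, sorted by path."""
    return sorted(
        entry
        for entry in directory.iterdir()
        # Skip subdirectories and anything with an unsupported extension.
        if entry.is_file() and entry.suffix.lower() in SUPPORTED
    )
```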

Options

# Custom output directory
any2md -o ./converted report.pdf

# Overwrite existing files
any2md --force

# Strip hyperlinks from output
any2md --strip-links doc.pdf

# Combine options
any2md -f -o ./out --strip-links docs/*.pdf docs/*.docx
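
To make the effect of `--strip-links` concrete, here is a small regex-based sketch that replaces inline Markdown links with their link text. It is an approximation, not the tool's implementation: it handles only the inline `[text](target)` form, and leaving image syntax untouched is an assumption on my part.

```python
import re

# Inline link: [text](target) or [text](target "title").
# The negative lookbehind leaves images such as ![alt](src) alone (an assumption).
INLINE_LINK_RE = re.compile(r'(?<!!)\[([^\]]+)\]\([^)\s]+(?:\s+"[^"]*")?\)')


def strip_links(markdown_text: str) -> str:
    """Replace each inline Markdown link with its visible text."""
    return INLINE_LINK_RE.sub(r"\1", markdown_text)
```

For example, `See [the docs](https://example.com/docs).` becomes `See the docs.`, which removes URL tokens that rarely help a language model.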

Alternative invocations

# Module mode (works without installing via pip)
python -m any2md report.pdf

# Legacy script (backward compatibility)
python3 mdconv.py report.pdf

Output Format

Every converted file has YAML frontmatter followed by cleaned Markdown. The frontmatter fields vary by source format:

PDF — includes page count:

---
title: "Quarterly Financial Report"
source_file: "Q3 Report 2024.pdf"
pages: 12
type: pdf
---

DOCX — includes word count:

---
title: "Project Proposal"
source_file: "proposal.docx"
word_count: 3847
type: docx
---

HTML file — includes word count:

---
title: "Page Title"
source_file: "page.html"
word_count: 1234
type: html
---

URL — records source URL instead of filename:

---
title: "Article Title"
source_url: "https://example.com/article"
word_count: 567
type: html
---
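
The four variants above follow one pattern: `title`, then a source field, then a count, then `type`. The sketch below generates that layout so the rules are explicit. It is a hypothetical reconstruction from the examples, not the code in `utils.py`; `build_frontmatter` and its parameters are illustrative names, and JSON string quoting is used here as a simple way to produce double-quoted values.

```python
import json


def build_frontmatter(title: str, source: str, count: int, doc_type: str) -> str:
    """Render frontmatter in the layout shown in the examples above.

    `source` is a file name or an http(s) URL; `count` is the page count for
    PDFs and the word count for every other type.
    """
    is_url = source.startswith(("http://", "https://"))
    source_key = "source_url" if is_url else "source_file"
    count_key = "pages" if doc_type == "pdf" else "word_count"
    lines = [
        "---",
        # json.dumps adds double quotes and escapes embedded quotes.
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"{source_key}: {json.dumps(source, ensure_ascii=False)}",
        f"{count_key}: {count}",
        f"type: {doc_type}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```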

CLI Reference

usage: any2md [-h] [--input-dir PATH] [--force] [--output-dir PATH] [--strip-links] [files ...]

Convert PDF, DOCX, and HTML files to LLM-optimized Markdown.

positional arguments:
  files                 Files or URLs to convert. Supports PDF, DOCX, HTML
                        files and http(s) URLs. If omitted, converts all
                        supported files in the current directory.

options:
  -h, --help            show this help message and exit
  --input-dir, -i PATH  Directory to scan for supported files (PDF, DOCX, HTML)
  --force, -f           Overwrite existing .md files
  --output-dir, -o PATH Output directory (default: ./Text)
  --strip-links         Remove markdown links, keeping only the link text
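
For readers who want to script against the same interface, the help text above maps closely onto Python's `argparse`. The parser below is a sketch reconstructed from that help output; it is not copied from `cli.py`, and details such as the attribute names are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options listed in the CLI reference."""
    parser = argparse.ArgumentParser(
        prog="any2md",
        description="Convert PDF, DOCX, and HTML files to LLM-optimized Markdown.",
    )
    # Zero or more inputs; an empty list means "scan the current directory".
    parser.add_argument("files", nargs="*", help="Files or URLs to convert.")
    parser.add_argument("--input-dir", "-i", metavar="PATH",
                        help="Directory to scan for supported files (PDF, DOCX, HTML)")
    parser.add_argument("--force", "-f", action="store_true",
                        help="Overwrite existing .md files")
    parser.add_argument("--output-dir", "-o", metavar="PATH", default="./Text",
                        help="Output directory (default: ./Text)")
    parser.add_argument("--strip-links", action="store_true",
                        help="Remove markdown links, keeping only the link text")
    return parser
```

Parsing `["-f", "-o", "./out", "--strip-links", "a.pdf"]` with this parser sets `force` and `strip_links` to `True`, `output_dir` to `"./out"`, and `files` to `["a.pdf"]`.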

Architecture

User Input (files, URLs, flags)
         │
         ▼
      cli.py ─── parse args, classify URLs vs file paths
         │
         ▼
converters/__init__.py ─── dispatch by extension
         │
    ┌────┼────┐
    ▼    ▼    ▼
 pdf  docx  html ─── format-specific extraction
    │    │    │
    └────┼────┘
         ▼
      utils.py ─── clean, title-extract, sanitize, frontmatter
         │
         ▼
      Output ─── YAML frontmatter + Markdown → output_dir/
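
The first two stages of the diagram, classifying inputs and dispatching by extension, could be sketched as follows. The `ROUTES` table and `route` function are hypothetical stand-ins for whatever `cli.py` and `converters/__init__.py` actually contain. Sending URLs to the HTML converter follows the `type: html` shown for URL output.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

# Hypothetical routing table: file extension -> converter named in the diagram.
ROUTES = {".pdf": "pdf", ".docx": "docx", ".html": "html", ".htm": "html"}


def route(argument: str) -> Optional[str]:
    """Return the converter that would handle `argument`, or None if unsupported."""
    # Step 1: anything with an http/https scheme is treated as a URL.
    if urlparse(argument).scheme in ("http", "https"):
        return "html"
    # Step 2: everything else is a file path, dispatched by its extension.
    return ROUTES.get(Path(argument).suffix.lower())
```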

Extraction pipelines

| Format | Pipeline |
| --- | --- |
| PDF | pymupdf4llm.to_markdown() → clean → frontmatter |
| DOCX | mammoth (DOCX → HTML) → markdownify (HTML → Markdown) → clean → frontmatter |
| HTML/URL | BS4 pre-clean → trafilatura extract (fallback: markdownify) → clean → frontmatter |

Adding a new format

  1. Create any2md/converters/newformat.py with a convert_newformat(path, output_dir, force, strip_links_flag) → bool function
  2. Add the extension and function to CONVERTERS in any2md/converters/__init__.py
  3. Add the extension to SUPPORTED_EXTENSIONS
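
The three steps above might look like the following for a plain-text converter. This is a self-contained sketch, not code from the repository: `convert_txt` is hypothetical, the `CONVERTERS` and `SUPPORTED_EXTENSIONS` objects here are local stand-ins whose real types are an assumption, and a real converter would presumably reuse the helpers in `utils.py` instead of building frontmatter by hand.

```python
from pathlib import Path
from typing import Callable, Dict, Set


def convert_txt(path: Path, output_dir: Path, force: bool, strip_links_flag: bool) -> bool:
    """Hypothetical .txt converter; returns True if an output file was written."""
    target = output_dir / f"{path.stem}.md"
    if target.exists() and not force:
        return False  # smart skip: keep the existing file unless forced
    text = path.read_text(encoding="utf-8")
    # strip_links_flag is unused here: plain text contains no Markdown links.
    frontmatter = (
        "---\n"
        f'title: "{path.stem}"\n'
        f'source_file: "{path.name}"\n'
        f"word_count: {len(text.split())}\n"
        "type: txt\n"
        "---\n\n"
    )
    output_dir.mkdir(parents=True, exist_ok=True)
    target.write_text(frontmatter + text, encoding="utf-8")
    return True


# Local stand-ins for the registries named in steps 2 and 3 (types are assumed).
CONVERTERS: Dict[str, Callable[[Path, Path, bool, bool], bool]] = {}
SUPPORTED_EXTENSIONS: Set[str] = set()

CONVERTERS[".txt"] = convert_txt   # step 2: map the extension to the function
SUPPORTED_EXTENSIONS.add(".txt")   # step 3: advertise the extension
```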

License

MIT
