Convert PDF, DOCX, HTML, and TXT files — or web pages by URL — to clean, LLM-optimized Markdown with YAML frontmatter.

Maintainers

rocklambros

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Text Processing :: Markup :: Markdown

Project description

any2md

One command. Any format. Consistent, structured output ready for language models.

Quick Start

pip install any2md

any2md report.pdf
any2md https://example.com/article
any2md --help

Output lands in ./Text/ by default:

---
title: "Quarterly Financial Report"
source_file: "report.pdf"
pages: 12
type: pdf
---

# Quarterly Financial Report

Document content here...

Features

Feature	Description
Multi-format	PDF, DOCX, HTML (.html, .htm), TXT
URL fetching	Pass any http/https URL as input
YAML frontmatter	Title, source, page/word count, type
Batch processing	Single file, directory scan, or mixed inputs
Auto-routing	Dispatches to the correct converter by extension
Smart skip	Won't overwrite existing files unless `--force`
Filename sanitization	Spaces, special characters, unicode dashes handled
TXT structure detection	Infers headings, lists, code blocks from plain text
Title extraction	Pulls the first H1–H3 heading automatically
Link stripping	`--strip-links` removes hyperlinks, keeps text
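
The title extraction and filename sanitization rows can be pictured with a minimal sketch. This is not any2md's actual code: `extract_title`, `sanitize_filename`, the `"Untitled"` fallback, and the exact character rules are all assumptions made for illustration.

```python
import re

# Hypothetical helpers for illustration; any2md's real rules may differ.
# Matches a line that starts with one to three '#' characters followed by text.
HEADING_RE = re.compile(r"^#{1,3}[ \t]+(.+)$", re.MULTILINE)


def extract_title(markdown: str, fallback: str = "Untitled") -> str:
    """Return the text of the first H1-H3 heading, or a fallback (assumed)."""
    match = HEADING_RE.search(markdown)
    return match.group(1).strip() if match else fallback


def sanitize_filename(name: str) -> str:
    """Make a name filesystem-friendly under assumed rules."""
    # Normalize unicode dashes (hyphen, en dash, em dash, ...) to a plain hyphen.
    name = re.sub(r"[\u2010-\u2015]", "-", name)
    # Replace runs of whitespace with a single underscore.
    name = re.sub(r"\s+", "_", name.strip())
    # Drop anything that is not a word character, dot, or hyphen.
    return re.sub(r"[^\w.\-]", "", name)
```

Headings deeper than H3 are skipped by this pattern, so a document whose first heading is `####` falls through to the next shallower heading or to the fallback.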

Installation

Requires Python 3.8+.

pip install any2md

From source

git clone https://github.com/rocklambros/any2md.git
cd any2md
pip install .

Dependencies

Library	Purpose
PyMuPDF + pymupdf4llm	PDF extraction
mammoth + markdownify	DOCX conversion
trafilatura + BeautifulSoup	HTML/URL extraction

Usage

Basic conversion

# Single file
any2md report.pdf

# Multiple files
any2md report.pdf proposal.docx "meeting notes.pdf"

# HTML file
any2md page.html

# Web page by URL
any2md https://example.com/article

# Plain text file
any2md notes.txt

# Mixed batch — PDFs, DOCX, HTML, TXT, and URLs together
any2md doc.pdf page.html notes.txt https://example.com

Directory scanning

# Scan a specific directory
any2md --input-dir ./documents

# Convert everything in the current directory (default behavior)
any2md
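
For a rough idea of what a directory scan involves, here is a minimal sketch. The extension list comes from the Features table; the function name `find_supported_files` is hypothetical, and the non-recursive, case-insensitive matching is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import List

# Extensions listed in the Features table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".html", ".htm", ".txt"}


def find_supported_files(directory: Path) -> List[Path]:
    """List supported files directly inside `directory`, sorted by path.

    Assumption: the scan does not descend into subdirectories.
    """
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_supported_files(Path("./documents"))` would return the files that `--input-dir ./documents` is expected to pick up under these assumptions.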

Options

# Custom output directory
any2md -o ./converted report.pdf

# Overwrite existing files
any2md --force

# Strip hyperlinks from output
any2md --strip-links doc.pdf

# Combine options
any2md -f -o ./out --strip-links docs/*.pdf docs/*.docx
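
To illustrate what `--strip-links` does to the text, the sketch below removes inline Markdown links while keeping their labels. The regular expression and the name `strip_links` are assumptions; the tool's own implementation may handle more cases (reference-style links, URLs containing parentheses) or treat images differently.

```python
import re

# Inline links of the form [text](url) or [text](url "title").
# Images (![alt](src)) are deliberately left alone in this sketch.
LINK_RE = re.compile(r'(?<!!)\[([^\]]+)\]\([^)\s]+(?:\s+"[^"]*")?\)')


def strip_links(markdown: str) -> str:
    """Replace each inline link with its visible text."""
    return LINK_RE.sub(r"\1", markdown)
```

For example, `strip_links("See [the docs](https://example.com/docs).")` yields `"See the docs."`.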

Alternative invocations

# Module mode (works without installing via pip)
python -m any2md report.pdf

# Legacy script (backward compatibility)
python3 mdconv.py report.pdf
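
Module mode also makes it straightforward to drive the converter from another Python program. The sketch below only assembles and runs the command line using the flags documented in this README; `build_command` and `run_any2md` are hypothetical helper names, and the meaning of the exit code is not documented here, so it is returned as-is.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(
    inputs: List[str],
    output_dir: Optional[str] = None,
    force: bool = False,
    strip_links: bool = False,
) -> List[str]:
    """Assemble a `python -m any2md ...` command from documented flags."""
    cmd = [sys.executable, "-m", "any2md"]
    if force:
        cmd.append("--force")
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if strip_links:
        cmd.append("--strip-links")
    return cmd + list(inputs)


def run_any2md(inputs: List[str], **options) -> int:
    """Run the converter in a child process and return its exit code."""
    return subprocess.run(build_command(inputs, **options), check=False).returncode
```

Using `sys.executable` keeps the child process in the same interpreter and virtual environment as the caller.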

Output Format

Every converted file has YAML frontmatter followed by cleaned Markdown. The frontmatter fields vary by source format:

PDF — includes page count:

---
title: "Quarterly Financial Report"
source_file: "Q3 Report 2024.pdf"
pages: 12
type: pdf
---

DOCX — includes word count:

---
title: "Project Proposal"
source_file: "proposal.docx"
word_count: 3847
type: docx
---

HTML file — includes word count:

---
title: "Page Title"
source_file: "page.html"
word_count: 1234
type: html
---

TXT — structure inferred via heuristics, includes word count:

---
title: "Meeting Notes"
source_file: "notes.txt"
word_count: 892
type: txt
---

URL — records source URL instead of filename:

---
title: "Article Title"
source_url: "https://example.com/article"
word_count: 567
type: html
---
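
Downstream code often needs to separate the frontmatter from the body. The following is a minimal, dependency-free sketch that handles only the flat `key: value` layout shown in the examples above; `split_frontmatter` is a hypothetical name, all values are returned as strings, and a real YAML parser would be the safer choice for titles containing escaped quotes.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a converted file into (metadata, body).

    Returns an empty dict and the original text when no complete
    frontmatter block is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: Dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the Markdown body.
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return meta, body
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # closing delimiter never appeared


def source_of(meta: Dict[str, str]) -> str:
    """Return the origin, whether the input was a file or a URL."""
    return meta.get("source_url") or meta.get("source_file", "")
```

`source_of` reflects the one structural difference between the examples: URL inputs carry `source_url` where file inputs carry `source_file`.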

CLI Reference

usage: any2md [-h] [--input-dir PATH] [--force] [--output-dir PATH] [--strip-links] [files ...]

Convert PDF, DOCX, HTML, and TXT files to LLM-optimized Markdown.

positional arguments:
  files                  Files or URLs to convert. Supports PDF, DOCX, HTML,
                         TXT files and http(s) URLs. If omitted, converts all
                         supported files in the current directory.

options:
  -h, --help             show this help message and exit
  --input-dir, -i PATH   Directory to scan for supported files (PDF, DOCX, HTML, TXT)
  --force, -f            Overwrite existing .md files
  --output-dir, -o PATH  Output directory (default: ./Text)
  --strip-links          Remove markdown links, keeping only the link text
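
The interface above maps naturally onto `argparse`. The sketch below is a reconstruction from the help text, not the project's actual `cli.py`; the function name `build_parser` and the choice to keep paths as plain strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented any2md options."""
    parser = argparse.ArgumentParser(
        prog="any2md",
        description="Convert PDF, DOCX, HTML, and TXT files to LLM-optimized Markdown.",
    )
    # Zero or more inputs; an empty list means "scan the current directory".
    parser.add_argument("files", nargs="*", help="Files or URLs to convert.")
    parser.add_argument("--input-dir", "-i", metavar="PATH",
                        help="Directory to scan for supported files")
    parser.add_argument("--force", "-f", action="store_true",
                        help="Overwrite existing .md files")
    parser.add_argument("--output-dir", "-o", metavar="PATH", default="./Text",
                        help="Output directory (default: ./Text)")
    parser.add_argument("--strip-links", action="store_true",
                        help="Remove markdown links, keeping only the link text")
    return parser
```

Parsing `["-f", "-o", "./out", "--strip-links", "a.pdf"]` with this parser sets `force` and `strip_links` to `True`, `output_dir` to `"./out"`, and `files` to `["a.pdf"]`.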

Architecture

User Input (files, URLs, flags)
         │
         ▼
      cli.py ─── parse args, classify URLs vs file paths
         │
         ▼
converters/__init__.py ─── dispatch by extension
         │
    ┌────┼────┬────┐
    ▼    ▼    ▼    ▼
 pdf  docx  html  txt ─── format-specific extraction
    │    │    │    │
    └────┼────┴────┘
         ▼
      utils.py ─── clean, title-extract, sanitize, frontmatter
         │
         ▼
      Output ─── YAML frontmatter + Markdown → output_dir/
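
The first step in the diagram, separating URLs from file paths, might look like the sketch below. `is_url` and `classify_inputs` are hypothetical names, and treating only `http` and `https` schemes with a host as URLs is an assumption based on the "http/https URL" wording in the Features table.

```python
from pathlib import Path
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def is_url(item: str) -> bool:
    """True for http(s) inputs that include a host name."""
    parsed = urlparse(item)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def classify_inputs(items: Iterable[str]) -> Tuple[List[str], List[Path]]:
    """Split command-line inputs into (urls, file_paths), keeping order."""
    urls: List[str] = []
    paths: List[Path] = []
    for item in items:
        if is_url(item):
            urls.append(item)
        else:
            paths.append(Path(item))
    return urls, paths
```

With the mixed-batch example from the Usage section, `classify_inputs(["doc.pdf", "page.html", "notes.txt", "https://example.com"])` yields one URL and three paths.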

Extraction pipelines

Format	Pipeline
PDF	`pymupdf4llm.to_markdown()` → clean → frontmatter
DOCX	`mammoth` (DOCX → HTML) → `markdownify` (HTML → Markdown) → clean → frontmatter
HTML/URL	BS4 pre-clean → `trafilatura` extract (fallback: `markdownify`) → clean → frontmatter
TXT	`structurize()` heuristics (headings, lists, code blocks) → clean → frontmatter
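
Every pipeline ends with the same two steps, clean and frontmatter. A minimal sketch of that shared tail follows; the cleanup rules, the set of quoted keys, and the names `clean_markdown`, `build_frontmatter`, and `finish` are assumptions chosen to reproduce the sample output in this README, not the contents of `utils.py`.

```python
import re
from typing import Dict, Union

# Keys rendered with double quotes in the README's examples.
QUOTED_KEYS = {"title", "source_file", "source_url"}


def clean_markdown(text: str) -> str:
    """Trim trailing whitespace and collapse runs of blank lines (assumed rules)."""
    text = "\n".join(line.rstrip() for line in text.splitlines())
    return re.sub(r"\n{3,}", "\n\n", text).strip() + "\n"


def build_frontmatter(fields: Dict[str, Union[str, int]]) -> str:
    """Render fields as a YAML frontmatter block, in insertion order."""
    lines = ["---"]
    for key, value in fields.items():
        if key in QUOTED_KEYS:
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{key}: "{escaped}"')
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


def finish(raw_markdown: str, fields: Dict[str, Union[str, int]]) -> str:
    """Combine frontmatter and cleaned Markdown into the final document."""
    return build_frontmatter(fields) + "\n" + clean_markdown(raw_markdown)
```

Keeping this tail separate from the format-specific extractors is what lets each converter stay small: it only has to produce raw Markdown and a dictionary of metadata.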

Adding a new format

1. Create `any2md/converters/newformat.py` with a function `convert_newformat(path, output_dir, force, strip_links_flag) -> bool`.
2. Add the extension and function to `CONVERTERS` in `any2md/converters/__init__.py`.
3. Add the extension to `SUPPORTED_EXTENSIONS`.
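
The registration steps above can be sketched as a small dispatch table. Only the names `CONVERTERS`, `SUPPORTED_EXTENSIONS`, and the `convert_newformat` signature come from this README; the `.new` extension, the `convert_file` dispatcher, the converter body, and the meaning of the boolean return value (here: `True` when a file was written) are assumptions.

```python
from pathlib import Path
from typing import Callable, Dict

Converter = Callable[[Path, Path, bool, bool], bool]


def convert_newformat(path: Path, output_dir: Path, force: bool,
                      strip_links_flag: bool) -> bool:
    """Placeholder converter: copies text through unchanged."""
    target = output_dir / f"{path.stem}.md"
    if target.exists() and not force:
        return False  # assumed "smart skip": nothing written
    output_dir.mkdir(parents=True, exist_ok=True)
    # strip_links_flag is accepted but unused in this placeholder.
    target.write_text(path.read_text(encoding="utf-8"), encoding="utf-8")
    return True


# Steps 2 and 3: register the extension in both places.
CONVERTERS: Dict[str, Converter] = {".new": convert_newformat}
SUPPORTED_EXTENSIONS = {".new"}


def convert_file(path: Path, output_dir: Path, force: bool = False,
                 strip_links_flag: bool = False) -> bool:
    """Dispatch to the converter registered for the file's extension."""
    converter = CONVERTERS.get(path.suffix.lower())
    if converter is None:
        return False  # unsupported extension
    return converter(path, output_dir, force, strip_links_flag)
```

Because dispatch is a dictionary lookup on the lowercased suffix, adding a format does not require touching the dispatcher itself.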

License

MIT

Release history

- 0.6.0: Feb 20, 2026
- 0.5.0: Feb 20, 2026 (this version)
- 0.4.0: Feb 20, 2026
Download files

Source Distribution

any2md-0.5.0.tar.gz (10.7 kB)

Uploaded Feb 20, 2026 Source

Built Distribution

any2md-0.5.0-py3-none-any.whl (14.4 kB)

Uploaded Feb 20, 2026 Python 3

File details

Details for the file any2md-0.5.0.tar.gz.

File metadata

Download URL: any2md-0.5.0.tar.gz
Upload date: Feb 20, 2026
Size: 10.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for any2md-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`5c60cc09495524b3054e5128804a15e140f0fa67f26985c83eb0718f09191bd4`
MD5	`5342230f215d13f1bdbfc75b5613cfa9`
BLAKE2b-256	`379e07b26ce8577721a10f39df39a0663aac1d3a7bff8f07879aa8cfcc0258f3`
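
A downloaded archive can be checked against the published SHA256 digest with the standard library. The sketch below assumes the file has already been downloaded to a local path; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the expected digest is the one listed above for `any2md-0.5.0.tar.gz`.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for any2md-0.5.0.tar.gz.
EXPECTED_SHA256 = "5c60cc09495524b3054e5128804a15e140f0fa67f26985c83eb0718f09191bd4"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a local file's digest with the published value."""
    return sha256_of(path) == expected.lower()
```

A mismatch means the local file differs from what was uploaded and should not be installed.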

Provenance

The following attestation bundles were made for any2md-0.5.0.tar.gz:

Publisher: publish.yml on rocklambros/any2md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: any2md-0.5.0.tar.gz
- Subject digest: 5c60cc09495524b3054e5128804a15e140f0fa67f26985c83eb0718f09191bd4
- Sigstore transparency entry: 972803486
- Sigstore integration time: Feb 20, 2026
Source repository:
- Permalink: rocklambros/any2md@26d861d5b157f4a4f4618d2bff9d062046d58272
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/rocklambros
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@26d861d5b157f4a4f4618d2bff9d062046d58272
- Trigger Event: release

File details

Details for the file any2md-0.5.0-py3-none-any.whl.

File metadata

Download URL: any2md-0.5.0-py3-none-any.whl
Upload date: Feb 20, 2026
Size: 14.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for any2md-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7b0a1eacaafc60236c16491f407c06ee641a3388f69834cacd8374adfb1a983e`
MD5	`a050810a5ec7b329c83716ec7209e56c`
BLAKE2b-256	`5a043d1eb9bd98d39b91ce00846c11fc04170fe7a56700dc0a366209240326da`

Provenance

The following attestation bundles were made for any2md-0.5.0-py3-none-any.whl:

Publisher: publish.yml on rocklambros/any2md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: any2md-0.5.0-py3-none-any.whl
- Subject digest: 7b0a1eacaafc60236c16491f407c06ee641a3388f69834cacd8374adfb1a983e
- Sigstore transparency entry: 972803490
- Sigstore integration time: Feb 20, 2026
Source repository:
- Permalink: rocklambros/any2md@26d861d5b157f4a4f4618d2bff9d062046d58272
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/rocklambros
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@26d861d5b157f4a4f4618d2bff9d062046d58272
- Trigger Event: release
