High-performance HTML to Markdown converter powered by Rust with a clean Python API
html-to-markdown
High-performance HTML to Markdown converter: a Rust crate and CLI with Python bindings. Available via PyPI, Homebrew, and Cargo, with cross-platform support for Linux, macOS, and Windows.
Part of the Kreuzberg ecosystem for document intelligence.
📚 Full V2 Documentation - Comprehensive guide for Rust, Python, and CLI usage.
⚡ Benchmarks
Throughput (Python API)
Real Wikipedia documents on Apple M1 Pro:
| Document | Size | Latency | Throughput | Docs/sec |
|---|---|---|---|---|
| Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
| Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
| Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |
Throughput stays between 144 and 208 MB/s across the document sizes tested.
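The throughput and docs/sec columns follow directly from size and latency. A small worked check of the table above, assuming 1 MB = 1000 KB and one document converted at a time (the function names are illustrative only):

```python
# Rows from the table above: (document, size in KB, latency in ms).
ROWS = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]


def throughput_mb_per_s(size_kb: float, latency_ms: float) -> float:
    """KB per ms equals MB per s when 1 MB = 1000 KB (assumed here)."""
    return size_kb / latency_ms


def docs_per_second(latency_ms: float) -> float:
    """Sequential conversions per second for a single document."""
    return 1000.0 / latency_ms


for name, size_kb, latency_ms in ROWS:
    mb_s = round(throughput_mb_per_s(size_kb, latency_ms))
    docs = round(docs_per_second(latency_ms))
    print(f"{name}: {mb_s} MB/s, {docs} docs/sec")
```

Rounded to whole numbers, the results match the Throughput and Docs/sec columns.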
Memory Usage
| Document Size | Memory Delta | Peak RSS | Leak Detection |
|---|---|---|---|
| 10KB | < 2 MB | < 20 MB | ✅ None |
| 50KB | < 8 MB | < 35 MB | ✅ None |
| 500KB | < 40 MB | < 80 MB | ✅ None |
Memory usage is linear and stable across 50+ repeated conversions.
V2 is 19-30x faster than the v1 Python/BeautifulSoup implementation.
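A speedup figure like this is a ratio of latencies. The following is a rough sketch of how such a ratio could be measured for any two converter callables; it is not the project's benchmark harness, and `median_latency` and `speedup` are hypothetical helper names:

```python
import statistics
import time
from typing import Callable


def median_latency(fn: Callable[[str], object], payload: str, repeats: int = 50) -> float:
    """Median wall-clock seconds for fn(payload) over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(payload)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


if __name__ == "__main__":
    # Stand-in callables; swap in the v1 and v2 converters to compare them.
    html = "<p>hello</p>" * 1000
    slow = median_latency(lambda s: "".join(ch for ch in s), html)
    fast = median_latency(lambda s: s[:], html)
    print(f"speedup: {speedup(slow, fast):.1f}x")
```

The median is used so that a single slow run does not skew the comparison.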
- 📊 Benchmark Results - Detailed Python API comparison
- 📈 Performance Analysis - Rust core benchmarks and profiling
- 🔧 Benchmarking Guide - How to run benchmarks
- ✅ CommonMark Compliance - CommonMark specification compliance
Features
- 🚀 Blazing Fast: Pure Rust core with ultra-fast tl HTML parser
- 🐍 Python Bindings: Clean Python API via PyO3 with full type hints
- 🦀 Native CLI: Rust CLI binary with comprehensive options
- 📊 hOCR 1.2 Compliant: Full support for all 40+ elements and 20+ properties
- 📝 CommonMark Compliant: Follows CommonMark specification for list formatting
- 🎯 Type Safe: Full type hints and .pyi stubs for excellent IDE support
- 🌍 Cross-Platform: Wheels for Linux, macOS, Windows (x86_64 + ARM64)
- ✅ Well-Tested: 900+ tests with dual Python + Rust coverage
Installation
📦 Package Names: Due to a naming conflict on crates.io, the Rust crate is published as html-to-markdown-rs, while the Python package remains html-to-markdown on PyPI. The CLI binary name is html-to-markdown for both.
Python Package
pip install html-to-markdown
Rust Library
cargo add html-to-markdown-rs
CLI Binary
via Homebrew (macOS/Linux)
brew tap goldziher/tap
brew install html-to-markdown
via Cargo
cargo install html-to-markdown-cli
Direct Download
Download pre-built binaries from GitHub Releases.
Quick Start
Python API
Clean, type-safe configuration with dataclasses:
from html_to_markdown import convert, ConversionOptions
html = """
<h1>Welcome</h1>
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
<ul>
<li>Blazing fast</li>
<li>Type safe</li>
<li>Easy to use</li>
</ul>
"""
options = ConversionOptions(
    heading_style="atx",
    strong_em_symbol="*",
    bullets="*+-",
)
markdown = convert(html, options)
print(markdown)
Output:
# Welcome
This is **fast** Rust-powered conversion!
* Blazing fast
+ Type safe
- Easy to use
Rust API
use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle};
fn main() {
    let html = r#"
        <h1>Welcome</h1>
        <p>This is <strong>fast</strong> conversion!</p>
    "#;
    let options = ConversionOptions {
        heading_style: HeadingStyle::Atx,
        ..Default::default()
    };
    let markdown = convert(html, Some(options)).unwrap();
    println!("{}", markdown);
}
CLI Usage
# Convert file
html-to-markdown input.html > output.md
# From stdin
cat input.html | html-to-markdown > output.md
# With options
html-to-markdown --heading-style atx --list-indent-width 2 input.html
# Clean web-scraped content
html-to-markdown \
    --preprocess \
    --preset aggressive \
    --no-extract-metadata \
    scraped.html > clean.md
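For scripted use, the CLI can also be driven from Python. A minimal sketch, assuming the html-to-markdown binary is on PATH and using only flags shown in the examples above; `build_args` and `convert_file` are hypothetical helper names, not part of the package:

```python
import shutil
import subprocess
from typing import List, Optional


def build_args(
    input_path: str,
    heading_style: Optional[str] = None,
    list_indent_width: Optional[int] = None,
    preprocess: bool = False,
    preset: Optional[str] = None,
) -> List[str]:
    """Build an argv list from the flags used in the CLI examples."""
    args = ["html-to-markdown"]
    if heading_style is not None:
        args += ["--heading-style", heading_style]
    if list_indent_width is not None:
        args += ["--list-indent-width", str(list_indent_width)]
    if preprocess:
        args.append("--preprocess")
    if preset is not None:
        args += ["--preset", preset]
    args.append(input_path)
    return args


def convert_file(input_path: str, **kwargs) -> str:
    """Run the CLI and return the Markdown it writes to stdout."""
    if shutil.which("html-to-markdown") is None:
        raise FileNotFoundError("html-to-markdown binary not found on PATH")
    result = subprocess.run(
        build_args(input_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.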
Configuration
Python: Dataclass Configuration
from html_to_markdown import (
    convert,
    ConversionOptions,
    PreprocessingOptions,
)
# Conversion settings
options = ConversionOptions(
    heading_style="atx",  # "atx", "atx_closed", "underlined"
    list_indent_width=2,  # Discord/Slack: use 2
    bullets="*+-",  # Bullet characters
    strong_em_symbol="*",  # "*" or "_"
    escape_asterisks=True,  # Escape * in text
    code_language="python",  # Default code block language
    extract_metadata=True,  # Extract HTML metadata
    highlight_style="double-equal",  # "double-equal", "html", "bold"
)
# HTML preprocessing
preprocessing = PreprocessingOptions(
    enabled=True,
    preset="standard",  # "minimal", "standard", "aggressive"
    remove_navigation=True,
    remove_forms=True,
)
markdown = convert(html, options, preprocessing)
Python: Legacy API (v1 compatibility)
For backward compatibility with existing v1 code:
from html_to_markdown import convert_to_markdown
markdown = convert_to_markdown(
    html,
    heading_style="atx",
    list_indent_width=2,
    preprocess=True,
    preprocessing_preset="standard",
)
Common Use Cases
Discord/Slack Compatible Lists
from html_to_markdown import convert, ConversionOptions
options = ConversionOptions(list_indent_width=2)
markdown = convert(html, options)
Clean Web-Scraped HTML
from html_to_markdown import convert, PreprocessingOptions
preprocessing = PreprocessingOptions(
    enabled=True,
    preset="aggressive",  # Heavy cleaning
    remove_navigation=True,
    remove_forms=True,
)
markdown = convert(html, preprocessing=preprocessing)
hOCR 1.2 Support
Complete hOCR 1.2 specification compliance with support for all elements, properties, and metadata:
from html_to_markdown import convert, ConversionOptions
# Option 1: Document structure extraction (NEW in v2)
# Extracts all hOCR elements and converts to structured markdown
# Supports: paragraphs, sections, chapters, headers/footers, images, math, etc.
markdown = convert(hocr_html)
# Option 2: Legacy table extraction (spatial reconstruction)
# Reconstructs tables from word bounding boxes
options = ConversionOptions(
    hocr_extract_tables=True,
    hocr_table_column_threshold=50,
    hocr_table_row_threshold_ratio=0.5,
)
markdown = convert(hocr_html, options)
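The option names suggest that table reconstruction clusters word boxes spatially. The following is a generic sketch of that idea, not the library's algorithm. It assumes the column threshold is a pixel gap between column starts and the row threshold ratio is a fraction of the median word height, both of which are guesses from the names. `Word`, `group_rows`, `column_starts`, and `to_markdown_table` are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Word:
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def group_rows(words, row_threshold_ratio=0.5):
    """Start a new row when the vertical centre moves by more than
    ratio * median word height relative to the previous word."""
    if not words:
        return []
    limit = row_threshold_ratio * median(w.y1 - w.y0 for w in words)
    rows, last_centre = [], None
    for w in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        centre = (w.y0 + w.y1) / 2
        if last_centre is None or centre - last_centre > limit:
            rows.append([])
        rows[-1].append(w)
        last_centre = centre
    return rows


def column_starts(words, column_threshold=50):
    """A left edge more than column_threshold px past the previous
    column start opens a new column."""
    starts = []
    for x in sorted(w.x0 for w in words):
        if not starts or x - starts[-1] > column_threshold:
            starts.append(x)
    return starts


def to_markdown_table(words, column_threshold=50, row_threshold_ratio=0.5):
    starts = column_starts(words, column_threshold)
    lines = []
    for row in group_rows(words, row_threshold_ratio):
        cells = [[] for _ in starts]
        for w in sorted(row, key=lambda w: w.x0):
            # Place the word in the last column that starts at or left of it.
            col = max(i for i, s in enumerate(starts) if s <= w.x0)
            cells[col].append(w.text)
        lines.append("| " + " | ".join(" ".join(c) for c in cells) + " |")
    if lines:
        # Treat the first row as the header and add the separator after it.
        lines.insert(1, "|" + "---|" * len(starts))
    return "\n".join(lines)


words = [
    Word("Name", 10, 10, 60, 30), Word("Age", 200, 10, 240, 30),
    Word("Ada", 10, 40, 50, 60), Word("36", 200, 41, 225, 61),
]
print(to_markdown_table(words))
```

Larger thresholds merge nearby boxes into one cell or row; smaller ones split them, which is the trade-off the two numeric options appear to control.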
Full hOCR 1.2 Spec Coverage:
- ✅ All 40 Element Types - Logical structure (12), typesetting (6), float (13), inline (6), engine-specific (3)
- ✅ All 20+ Properties - bbox, baseline, textangle, poly, x_wconf, x_confs, x_font, x_fsize, order, cflow, cuts, x_bboxes, image, ppageno, lpageno, scan_res, and more
- ✅ All 5 Metadata Fields - ocr-system, ocr-capabilities, ocr-number-of-pages, ocr-langs, ocr-scripts
- ✅ 37 Tests - Complete coverage of all elements and properties
Semantic Markdown Conversion:
| Element Category | Examples | Markdown Output |
|---|---|---|
| Headings | ocr_title, ocr_chapter | # Heading |
| Sections | ocr_section, ocr_subsection | ##, ### |
| Structure | ocr_par, ocr_blockquote | Paragraphs, > quotes |
| Metadata | ocr_abstract, ocr_author | **Abstract**, *Author* |
| Floats | ocr_header, ocr_footer | *Header*, *Footer* |
| Images | ocr_image, ocr_photo | Markdown image with image property |
| Math | ocr_math, ocr_display | `formula`, ```equation``` |
| Layout | ocr_separator | --- horizontal rule |
| Inline | ocrx_word, ocr_dropcap | Text, **Letter** |
HTML Entity Handling: Automatically decodes the HTML entities for ", ', <, >, and & in title attributes so that properties parse correctly.
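To illustrate why decoding matters: hOCR stores properties in the title attribute as semicolon-separated entries, and quoted values arrive entity-encoded in the raw HTML. A simplified sketch using only the standard library, not the library's parser; `parse_hocr_title` is a hypothetical name and the sample attribute is made up:

```python
import html
from typing import Dict, List


def parse_hocr_title(title: str) -> Dict[str, List[str]]:
    """Split an hOCR title attribute into {property: [values...]}."""
    decoded = html.unescape(title)  # e.g. &quot; becomes ", &amp; becomes &
    props: Dict[str, List[str]] = {}
    for chunk in decoded.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


raw = "image &quot;page1.png&quot;; bbox 0 0 2480 3508; ppageno 0"
print(parse_hocr_title(raw))
```

This naive split would break on file names that contain spaces or semicolons; a full parser has to respect the quotes.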
Configuration Reference
V2 Defaults (CommonMark-compliant):
- list_indent_width: 2 (CommonMark standard)
- bullets: "*+-" (cycles through *, +, - for nested levels)
- escape_asterisks: false (minimal escaping)
- escape_underscores: false (minimal escaping)
- escape_misc: false (minimal escaping)
- newline_style: "spaces" (CommonMark: two trailing spaces)
- code_block_style: "backticks" (fenced code blocks with ```, better whitespace preservation)
- heading_style: "atx" (CommonMark: #)
- preprocessing.enabled: false (no preprocessing by default)
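As a rough illustration of how the list_indent_width and bullets defaults interact, read literally from the descriptions above, the sketch below renders a nested list in plain Python. It does not call the library, and `bullet_for_depth` and `render` are hypothetical names:

```python
def bullet_for_depth(depth: int, bullets: str = "*+-") -> str:
    """Pick the bullet for a nesting level, wrapping around after the last one."""
    return bullets[depth % len(bullets)]


def render(items, depth=0, indent_width=2, bullets="*+-"):
    """Render nested items as Markdown list lines.

    Each item is either a string or a (text, children) tuple.
    """
    lines = []
    for item in items:
        text, children = item if isinstance(item, tuple) else (item, [])
        indent = " " * (indent_width * depth)
        lines.append(f"{indent}{bullet_for_depth(depth, bullets)} {text}")
        lines.extend(render(children, depth + 1, indent_width, bullets))
    return lines


items = [("Fruit", ["Apple", ("Citrus", ["Lemon"])]), "Veg"]
print("\n".join(render(items)))
```

With three bullet characters, a fourth nesting level would wrap back to the first bullet.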
For complete configuration reference, see Full Documentation.
Upgrading from v1.x
Backward Compatibility
Existing v1 code works without changes:
from html_to_markdown import convert_to_markdown
markdown = convert_to_markdown(html, heading_style="atx") # Still works!
Modern API (Recommended)
For new projects, use the dataclass-based API:
from html_to_markdown import convert, ConversionOptions
options = ConversionOptions(heading_style="atx", list_indent_width=2)
markdown = convert(html, options)
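When migrating call sites, v1-style keyword arguments can be split into the two option groups used by the modern API. A sketch, assuming that preprocess and preprocessing_preset correspond to the enabled and preset fields; this mapping is inferred from the examples in this README rather than confirmed against the library, and `split_v1_kwargs` is a hypothetical helper:

```python
from typing import Dict, Tuple

# Assumed mapping from v1 keyword names to PreprocessingOptions field names.
PREPROCESSING_KEYS = {
    "preprocess": "enabled",
    "preprocessing_preset": "preset",
}


def split_v1_kwargs(kwargs: Dict[str, object]) -> Tuple[Dict[str, object], Dict[str, object]]:
    """Separate v1 keyword arguments into conversion and preprocessing groups."""
    conversion: Dict[str, object] = {}
    preprocessing: Dict[str, object] = {}
    for key, value in kwargs.items():
        if key in PREPROCESSING_KEYS:
            preprocessing[PREPROCESSING_KEYS[key]] = value
        else:
            conversion[key] = value
    return conversion, preprocessing


conv, pre = split_v1_kwargs(
    {
        "heading_style": "atx",
        "list_indent_width": 2,
        "preprocess": True,
        "preprocessing_preset": "standard",
    }
)
print(conv)
print(pre)
# With the package installed, the groups could then be passed on as:
#   from html_to_markdown import convert, ConversionOptions, PreprocessingOptions
#   markdown = convert(html, ConversionOptions(**conv), PreprocessingOptions(**pre))
```

Options removed in v2 would need to be dropped or translated by hand before this step.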
What Changed in v2
Core Rewrite:
- Complete Rust rewrite using tl HTML parser
- 19-30x performance improvement over v1
- CommonMark-compliant defaults (2-space indents, minimal escaping, ATX headings)
- No BeautifulSoup or lxml dependencies
Removed Features:
- code_language_callback - use code_language for default language
- strip / convert options - use strip_tags or preprocessing
- convert_to_markdown_stream() - not supported in v2
Planned:
- custom_converters - planned for future release
See CHANGELOG.md for complete v1 vs v2 comparison and migration guide.
Kreuzberg Ecosystem
html-to-markdown is part of the Kreuzberg ecosystem, a comprehensive framework for document intelligence and processing. While html-to-markdown focuses on converting HTML to Markdown with maximum performance, Kreuzberg provides a complete solution for:
- Document Extraction: Extract text, images, and metadata from 50+ document formats
- OCR Processing: Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR)
- Table Extraction: Vision-based and OCR-based table detection
- Document Classification: Automatic detection of contracts, forms, invoices, etc.
- RAG Pipelines: Integration with retrieval-augmented generation workflows
Learn more at kreuzberg.dev or join our Discord community.
Contributing
See CONTRIBUTING.md for development setup, testing, and contribution guidelines.
License
MIT License - see LICENSE for details.
Acknowledgments
Version 1 started as a fork of markdownify that was rewritten, extended, and enhanced with better typing and features. Version 2 is a complete Rust rewrite for high performance.