High-performance HTML to Markdown converter powered by Rust with a clean Python API

html-to-markdown

High-performance HTML to Markdown converter: a Rust crate and CLI with Python bindings. Available via PyPI, Homebrew, and Cargo. Cross-platform support for Linux, macOS, and Windows.

Part of the Kreuzberg ecosystem for document intelligence.

📚 Full V2 Documentation - Comprehensive guide for Rust, Python, and CLI usage.

⚡ Benchmarks

Throughput (Python API)

Real Wikipedia documents on Apple M1 Pro:

| Document | Size | Latency | Throughput | Docs/sec |
| --- | --- | --- | --- | --- |
| Lists (Timeline) | 129KB | 0.62ms | 208 MB/s | 1,613 |
| Tables (Countries) | 360KB | 2.02ms | 178 MB/s | 495 |
| Mixed (Python wiki) | 656KB | 4.56ms | 144 MB/s | 219 |

Throughput ranges from 144 to 208 MB/s across these document sizes.
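The Throughput and Docs/sec columns can be derived from Size and Latency. The short check below reproduces the published figures, assuming decimal units (1 MB = 1000 KB) and one conversion at a time; the helper names are illustrative only.

```python
# Sizes (KB) and latencies (ms) copied from the benchmark table.
rows = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]


def throughput_mb_per_s(size_kb: float, latency_ms: float) -> float:
    """KB per millisecond equals MB per second when 1 MB = 1000 KB."""
    return size_kb / latency_ms


def docs_per_second(latency_ms: float) -> float:
    """Number of sequential conversions that fit into one second."""
    return 1000.0 / latency_ms


for name, size_kb, latency_ms in rows:
    print(
        f"{name}: {throughput_mb_per_s(size_kb, latency_ms):.0f} MB/s, "
        f"{docs_per_second(latency_ms):.0f} docs/sec"
    )
```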

Memory Usage

| Document Size | Memory Delta | Peak RSS | Leak Detection |
| --- | --- | --- | --- |
| 10KB | < 2 MB | < 20 MB | ✅ None |
| 50KB | < 8 MB | < 35 MB | ✅ None |
| 500KB | < 40 MB | < 80 MB | ✅ None |

Memory usage is linear and stable across 50+ repeated conversions.

V2 is 19-30x faster than the v1 Python/BeautifulSoup implementation.

  • 📊 Benchmark Results - Detailed Python API comparison
  • 📈 Performance Analysis - Rust core benchmarks and profiling
  • 🔧 Benchmarking Guide - How to run benchmarks
  • ✅ CommonMark Compliance - CommonMark specification compliance

Features

  • 🚀 Blazing Fast: Pure Rust core with ultra-fast tl HTML parser
  • 🐍 Python Bindings: Clean Python API via PyO3 with full type hints
  • 🦀 Native CLI: Rust CLI binary with comprehensive options
  • 📊 hOCR 1.2 Compliant: Full support for all 40 element types and 20+ properties
  • 📝 CommonMark Compliant: Follows CommonMark specification for list formatting
  • 🎯 Type Safe: Full type hints and .pyi stubs for excellent IDE support
  • 🌍 Cross-Platform: Wheels for Linux, macOS, Windows (x86_64 + ARM64)
  • ✅ Well-Tested: 900+ tests with dual Python + Rust coverage

Installation

📦 Package Names: Due to a naming conflict on crates.io, the Rust crate is published as html-to-markdown-rs, while the Python package remains html-to-markdown on PyPI. The CLI binary name is html-to-markdown for both.

Python Package

pip install html-to-markdown

Rust Library

cargo add html-to-markdown-rs

CLI Binary

via Homebrew (macOS/Linux)

brew tap goldziher/tap
brew install html-to-markdown

via Cargo

cargo install html-to-markdown-cli

Direct Download

Download pre-built binaries from GitHub Releases.

Quick Start

Python API

Clean, type-safe configuration with dataclasses:

from html_to_markdown import convert, ConversionOptions

html = """
<h1>Welcome</h1>
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
<ul>
    <li>Blazing fast</li>
    <li>Type safe</li>
    <li>Easy to use</li>
</ul>
"""

options = ConversionOptions(
    heading_style="atx",
    strong_em_symbol="*",
    bullets="*+-",
)

markdown = convert(html, options)
print(markdown)

Output:

# Welcome

This is **fast** Rust-powered conversion!

* Blazing fast
* Type safe
* Easy to use

Rust API

use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle};

fn main() {
    let html = r#"
        <h1>Welcome</h1>
        <p>This is <strong>fast</strong> conversion!</p>
    "#;

    let options = ConversionOptions {
        heading_style: HeadingStyle::Atx,
        ..Default::default()
    };

    let markdown = convert(html, Some(options)).unwrap();
    println!("{}", markdown);
}

CLI Usage

# Convert file
html-to-markdown input.html > output.md

# From stdin
cat input.html | html-to-markdown > output.md

# With options
html-to-markdown --heading-style atx --list-indent-width 2 input.html

# Clean web-scraped content
html-to-markdown \
    --preprocess \
    --preset aggressive \
    --no-extract-metadata \
    scraped.html > clean.md
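When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags that appear in the examples above; `build_command` and `run_cli` are hypothetical helper names, not part of the package.

```python
import shutil
import subprocess


def build_command(input_path, heading_style=None, list_indent_width=None,
                  preprocess=False, preset=None, extract_metadata=True):
    """Assemble an argv list using only the flags shown in the CLI examples."""
    cmd = ["html-to-markdown"]
    if heading_style is not None:
        cmd += ["--heading-style", heading_style]
    if list_indent_width is not None:
        cmd += ["--list-indent-width", str(list_indent_width)]
    if preprocess:
        cmd.append("--preprocess")
    if preset is not None:
        cmd += ["--preset", preset]
    if not extract_metadata:
        cmd.append("--no-extract-metadata")
    cmd.append(input_path)
    return cmd


def run_cli(cmd):
    """Run the CLI and return its stdout; raises if the binary is missing or fails."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not on PATH")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


cmd = build_command("scraped.html", preprocess=True, preset="aggressive",
                    extract_metadata=False)
print(cmd)
```

Passing an argument list to `subprocess.run` avoids shell quoting issues with file names; `run_cli(cmd)` is only meaningful once the binary is installed.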

Configuration

Python: Dataclass Configuration

from html_to_markdown import (
    convert,
    ConversionOptions,
    PreprocessingOptions,
)

# Conversion settings
options = ConversionOptions(
    heading_style="atx",  # "atx", "atx_closed", "underlined"
    list_indent_width=2,  # Discord/Slack: use 2
    bullets="*+-",  # Bullet characters
    strong_em_symbol="*",  # "*" or "_"
    escape_asterisks=True,  # Escape * in text
    code_language="python",  # Default code block language
    extract_metadata=True,  # Extract HTML metadata
    highlight_style="double-equal",  # "double-equal", "html", "bold"
)

# HTML preprocessing
preprocessing = PreprocessingOptions(
    enabled=True,
    preset="standard",  # "minimal", "standard", "aggressive"
    remove_navigation=True,
    remove_forms=True,
)

markdown = convert(html, options, preprocessing)

Python: Legacy API (v1 compatibility)

For backward compatibility with existing v1 code:

from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(
    html,
    heading_style="atx",
    list_indent_width=2,
    preprocess=True,
    preprocessing_preset="standard",
)

Common Use Cases

Discord/Slack Compatible Lists

from html_to_markdown import convert, ConversionOptions

options = ConversionOptions(list_indent_width=2)
markdown = convert(html, options)

Clean Web-Scraped HTML

from html_to_markdown import convert, PreprocessingOptions

preprocessing = PreprocessingOptions(
    enabled=True,
    preset="aggressive",  # Heavy cleaning
    remove_navigation=True,
    remove_forms=True,
)

markdown = convert(html, preprocessing=preprocessing)

hOCR 1.2 Support

Complete hOCR 1.2 specification compliance with support for all elements, properties, and metadata:

from html_to_markdown import convert, ConversionOptions

# Option 1: Document structure extraction (NEW in v2)
# Extracts all hOCR elements and converts to structured markdown
# Supports: paragraphs, sections, chapters, headers/footers, images, math, etc.
markdown = convert(hocr_html)

# Option 2: Legacy table extraction (spatial reconstruction)
# Reconstructs tables from word bounding boxes
options = ConversionOptions(
    hocr_extract_tables=True,
    hocr_table_column_threshold=50,
    hocr_table_row_threshold_ratio=0.5,
)
markdown = convert(hocr_html, options)
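To give an intuition for what "spatial reconstruction" from word bounding boxes can look like, here is a minimal, stdlib-only sketch. It is not the library's algorithm: the interpretation of the two thresholds (a pixel gap between left edges for columns, a fraction of the median word height for rows) is an assumption, and all function names and sample words are made up for illustration.

```python
from statistics import median

# Each word: (text, x0, y0, x1, y1), i.e. the bbox of an ocrx_word.
Word = tuple[str, int, int, int, int]


def cluster(values, threshold):
    """Walk the sorted values; a gap larger than `threshold` starts a new group.
    Returns the first value of each group as its anchor."""
    anchors = []
    previous = None
    for v in sorted(values):
        if previous is None or v - previous > threshold:
            anchors.append(v)
        previous = v
    return anchors


def nearest(anchors, value):
    """Index of the anchor closest to `value`."""
    return min(range(len(anchors)), key=lambda i: abs(anchors[i] - value))


def words_to_table(words: list[Word], column_threshold=50, row_threshold_ratio=0.5):
    """Place words into a grid: rows by vertical center, columns by left edge."""
    heights = [y1 - y0 for _, _, y0, _, y1 in words]
    row_threshold = median(heights) * row_threshold_ratio
    row_anchors = cluster([(y0 + y1) / 2 for _, _, y0, _, y1 in words], row_threshold)
    col_anchors = cluster([x0 for _, x0, _, _, _ in words], column_threshold)
    grid = [["" for _ in col_anchors] for _ in row_anchors]
    # Visit words left to right so multi-word cells keep reading order.
    for text, x0, y0, x1, y1 in sorted(words, key=lambda w: w[1]):
        r = nearest(row_anchors, (y0 + y1) / 2)
        c = nearest(col_anchors, x0)
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


def to_markdown(grid):
    """Render the grid as a pipe table, treating the first row as the header."""
    header, *body = grid
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


words = [
    ("Name", 10, 10, 60, 30), ("Age", 200, 10, 240, 30),
    ("Ada", 10, 50, 50, 70), ("36", 200, 52, 225, 72),
]
print(to_markdown(words_to_table(words)))
```

In this toy model, a larger column threshold merges nearby columns and a larger row ratio tolerates more vertical jitter between words on the same line.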

Full hOCR 1.2 Spec Coverage:

  • All 40 Element Types - Logical structure (12), typesetting (6), float (13), inline (6), engine-specific (3)
  • All 20+ Properties - bbox, baseline, textangle, poly, x_wconf, x_confs, x_font, x_fsize, order, cflow, cuts, x_bboxes, image, ppageno, lpageno, scan_res, and more
  • All 5 Metadata Fields - ocr-system, ocr-capabilities, ocr-number-of-pages, ocr-langs, ocr-scripts
  • 37 Tests - Complete coverage of all elements and properties

Semantic Markdown Conversion:

| Element Category | Examples | Markdown Output |
| --- | --- | --- |
| Headings | ocr_title, ocr_chapter | # Heading |
| Sections | ocr_section, ocr_subsection | ##, ### |
| Structure | ocr_par, ocr_blockquote | Paragraphs, > quotes |
| Metadata | ocr_abstract, ocr_author | **Abstract**, *Author* |
| Floats | ocr_header, ocr_footer | *Header*, *Footer* |
| Images | ocr_image, ocr_photo | ![alt](path) with image property |
| Math | ocr_math, ocr_display | `formula`, ```equation``` |
| Layout | ocr_separator | --- horizontal rule |
| Inline | ocrx_word, ocr_dropcap | Text, **Letter** |

HTML Entity Handling: Automatically decodes &quot;, &apos;, &lt;, &gt;, &amp; in title attributes for proper property parsing.
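Why decoding matters: hOCR stores properties in the title attribute as semicolon-separated `name value` pairs, and quoted values such as file names arrive entity-encoded in raw HTML. The sketch below shows the idea with the standard library only; `parse_hocr_title` and the sample string are hypothetical and do not mirror the library's internal parser.

```python
from html import unescape


def parse_hocr_title(raw_title: str) -> dict[str, str]:
    """Decode entities, then split 'name value; name value' into a dict."""
    properties = {}
    for part in unescape(raw_title).split(";"):
        part = part.strip()
        if not part:
            continue
        # The first token is the property name, the rest is its value.
        name, _, value = part.partition(" ")
        properties[name] = value.strip()
    return properties


raw = "image &quot;page-1.png&quot;; bbox 0 0 1200 1600; ppageno 0"
print(parse_hocr_title(raw))
```

This naive split would break on a quoted value that itself contains a semicolon; a real parser has to respect the quotes.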

Configuration Reference

V2 Defaults (CommonMark-compliant):

  • list_indent_width: 2 (CommonMark standard)
  • bullets: "*+-" (cycles through *, +, - for nested levels)
  • escape_asterisks: false (minimal escaping)
  • escape_underscores: false (minimal escaping)
  • escape_misc: false (minimal escaping)
  • newline_style: "spaces" (CommonMark: two trailing spaces)
  • code_block_style: "backticks" (fenced code blocks with ```, better whitespace preservation)
  • heading_style: "atx" (CommonMark: #)
  • preprocessing.enabled: false (no preprocessing by default)
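To make the interaction between bullets and list_indent_width concrete, the following stdlib-only sketch renders a nested list the way the defaults above describe it: the bullet is picked by nesting level and each level is indented by list_indent_width spaces. `render_list` is a hypothetical illustration, not the library's renderer.

```python
def render_list(items, bullets="*+-", list_indent_width=2, depth=0):
    """Render nested lists; each item is either a string or a
    (text, children) tuple. The bullet is chosen by nesting depth."""
    bullet = bullets[depth % len(bullets)]
    indent = " " * (list_indent_width * depth)
    lines = []
    for item in items:
        text, children = item if isinstance(item, tuple) else (item, [])
        lines.append(f"{indent}{bullet} {text}")
        lines.extend(render_list(children, bullets, list_indent_width, depth + 1))
    return lines


nested = ["Fruit", ("Vegetables", ["Carrot", ("Leafy", ["Kale"])])]
print("\n".join(render_list(nested)))
```

With a single-character setting such as bullets="-", every level would use the same marker because the depth index wraps around.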

For complete configuration reference, see Full Documentation.

Upgrading from v1.x

Backward Compatibility

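Existing v1 code that calls convert_to_markdown works without changes, unless it relies on one of the removed features listed under What Changed in v2: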

from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(html, heading_style="atx")  # Still works!

Modern API (Recommended)

For new projects, use the dataclass-based API:

from html_to_markdown import convert, ConversionOptions

options = ConversionOptions(heading_style="atx", list_indent_width=2)
markdown = convert(html, options)

What Changed in v2

Core Rewrite:

  • Complete Rust rewrite using tl HTML parser
  • 19-30x performance improvement over v1
  • CommonMark-compliant defaults (2-space indents, minimal escaping, ATX headings)
  • No BeautifulSoup or lxml dependencies

Removed Features:

  • code_language_callback - use code_language for default language
  • strip / convert options - use strip_tags or preprocessing
  • convert_to_markdown_stream() - not supported in v2

Planned:

  • custom_converters - planned for future release
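When migrating a larger codebase, one option is to sort legacy keyword arguments before switching to the dataclass API. The helper below is a hypothetical sketch: the mapping of preprocess to enabled and preprocessing_preset to preset is inferred from the examples in this README, and it assumes the remaining legacy keywords share their names with ConversionOptions fields, which should be verified against the changelog.

```python
# Options listed under "Removed Features", with the suggested replacement.
REMOVED = {
    "code_language_callback": "code_language",
    "strip": "strip_tags or preprocessing",
    "convert": "strip_tags or preprocessing",
}

# Legacy keyword -> PreprocessingOptions field (inferred from the examples).
PREPROCESSING_KEYS = {"preprocess": "enabled", "preprocessing_preset": "preset"}


def split_legacy_kwargs(legacy):
    """Return (conversion_kwargs, preprocessing_kwargs) and reject removed options."""
    conversion, preprocessing = {}, {}
    for key, value in legacy.items():
        if key in REMOVED:
            raise ValueError(f"{key!r} was removed in v2; use {REMOVED[key]}")
        if key in PREPROCESSING_KEYS:
            preprocessing[PREPROCESSING_KEYS[key]] = value
        else:
            conversion[key] = value
    return conversion, preprocessing


conversion, preprocessing = split_legacy_kwargs({
    "heading_style": "atx",
    "list_indent_width": 2,
    "preprocess": True,
    "preprocessing_preset": "standard",
})
print(conversion)
print(preprocessing)
```

Under those assumptions, the two dictionaries could then be unpacked into ConversionOptions(**conversion) and PreprocessingOptions(**preprocessing).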

See CHANGELOG.md for complete v1 vs v2 comparison and migration guide.

Kreuzberg Ecosystem

html-to-markdown is part of the Kreuzberg ecosystem, a comprehensive framework for document intelligence and processing. While html-to-markdown focuses on converting HTML to Markdown with maximum performance, Kreuzberg provides a complete solution for:

  • Document Extraction: Extract text, images, and metadata from 50+ document formats
  • OCR Processing: Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR)
  • Table Extraction: Vision-based and OCR-based table detection
  • Document Classification: Automatic detection of contracts, forms, invoices, etc.
  • RAG Pipelines: Integration with retrieval-augmented generation workflows

Learn more at kreuzberg.dev or join our Discord community.

Contributing

See CONTRIBUTING.md for development setup, testing, and contribution guidelines.

License

MIT License - see LICENSE for details.

Acknowledgments

Version 1 started as a fork of markdownify, rewritten, extended, and enhanced with better typing and features. Version 2 is a complete Rust rewrite for high performance.

Support

If you find this library useful, consider sponsoring the project.

Your support helps maintain and improve this library!
