High-performance HTML to Markdown converter powered by Rust with a clean Python API

html-to-markdown

High-performance HTML to Markdown converter, built as a Rust crate and CLI with Python bindings. Available via PyPI, Homebrew, and Cargo, with cross-platform support for Linux, macOS, and Windows.

Part of the Kreuzberg ecosystem for document intelligence.

📚 Full V2 Documentation - Comprehensive guide for Rust, Python, and CLI usage.

⚡ Benchmarks

Throughput (Python API)

Real Wikipedia documents on Apple M1 Pro:

Document	Size	Latency	Throughput	Docs/sec
Lists (Timeline)	129KB	0.62ms	208 MB/s	1,613
Tables (Countries)	360KB	2.02ms	178 MB/s	495
Mixed (Python wiki)	656KB	4.56ms	144 MB/s	219

Throughput stays between 144 and 208 MB/s across these documents, so latency grows roughly in proportion to document size.
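The throughput and docs/sec columns can be derived from the size and latency columns. The following is a quick arithmetic check, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes); the table itself does not state which units it uses, so treat that as an assumption.

```python
# Rows copied from the table above: (document, size in KB, latency in ms).
ROWS = [
    ("Lists (Timeline)", 129, 0.62),
    ("Tables (Countries)", 360, 2.02),
    ("Mixed (Python wiki)", 656, 4.56),
]


def throughput_mb_per_s(size_kb: float, latency_ms: float) -> float:
    """KB per millisecond equals MB per second when both use decimal units."""
    return size_kb / latency_ms


def docs_per_second(latency_ms: float) -> float:
    """Number of back-to-back conversions that fit into one second."""
    return 1000.0 / latency_ms


for name, size_kb, latency_ms in ROWS:
    print(
        f"{name}: {throughput_mb_per_s(size_kb, latency_ms):.0f} MB/s, "
        f"{docs_per_second(latency_ms):.0f} docs/sec"
    )
```

Under these assumptions the computed values round to the figures shown in the table.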

Memory Usage

Document Size	Memory Delta	Peak RSS	Leak Detection
10KB	< 2 MB	< 20 MB	✅ None
50KB	< 8 MB	< 35 MB	✅ None
500KB	< 40 MB	< 80 MB	✅ None

Memory usage is linear and stable across 50+ repeated conversions.

V2 is 19-30x faster than the v1 Python/BeautifulSoup implementation.
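To check a speedup figure like this in your own environment, a small timing harness is usually enough. The sketch below is not the project's benchmark suite: `median_latency_ms`, `speedup`, and the two stand-in converters are hypothetical names introduced only for illustration, so the script runs without html-to-markdown installed.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(convert_fn: Callable[[str], str], html: str, repeats: int = 50) -> float:
    """Run convert_fn repeatedly and return the median wall-clock latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        convert_fn(html)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms if candidate_ms > 0 else float("inf")


# Hypothetical stand-ins for a slow and a fast converter (not real conversions).
def slow_stand_in(html: str) -> str:
    return "".join(ch for ch in html if ch not in "<>")


def fast_stand_in(html: str) -> str:
    return html.replace("<", "").replace(">", "")


doc = "<p>Hello</p>" * 2000
baseline = median_latency_ms(slow_stand_in, doc)
candidate = median_latency_ms(fast_stand_in, doc)
print(f"speedup: {speedup(baseline, candidate):.1f}x")
```

Replacing the stand-ins with the v1 and v2 conversion functions would give a comparable measurement; results depend on hardware and input documents.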

📊 Benchmark Results - Detailed Python API comparison
📈 Performance Analysis - Rust core benchmarks and profiling
🔧 Benchmarking Guide - How to run benchmarks
✅ CommonMark Compliance - CommonMark specification compliance

Features

🚀 Blazing Fast: Pure Rust core with ultra-fast tl HTML parser
🐍 Python Bindings: Clean Python API via PyO3 with full type hints
🦀 Native CLI: Rust CLI binary with comprehensive options
📊 hOCR 1.2 Compliant: Full support for all 40+ elements and 20+ properties
📝 CommonMark Compliant: Follows CommonMark specification for list formatting
🎯 Type Safe: Full type hints and .pyi stubs for excellent IDE support
🌍 Cross-Platform: Wheels for Linux, macOS, Windows (x86_64 + ARM64)
✅ Well-Tested: 900+ tests with dual Python + Rust coverage

Installation

📦 Package Names: Due to a naming conflict on crates.io, the Rust crate is published as html-to-markdown-rs, while the Python package remains html-to-markdown on PyPI. The CLI binary name is html-to-markdown for both.

Python Package

pip install html-to-markdown
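After installing, it can be useful to confirm which version is active, since v1 and v2 differ in API and defaults. A minimal sketch using only the standard library; the helper name `installed_version` is an illustrative assumption, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# The PyPI distribution name uses hyphens; the import name uses underscores.
found = installed_version("html-to-markdown")
print(found if found is not None else "html-to-markdown is not installed")
```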

Rust Library

cargo add html-to-markdown-rs

CLI Binary

via Homebrew (macOS/Linux)

brew tap goldziher/tap
brew install html-to-markdown

via Cargo

cargo install html-to-markdown-cli

Direct Download

Download pre-built binaries from GitHub Releases.

Quick Start

Python API

Clean, type-safe configuration with dataclasses:

from html_to_markdown import convert, ConversionOptions

html = """
<h1>Welcome</h1>
<p>This is <strong>fast</strong> Rust-powered conversion!</p>
<ul>
    <li>Blazing fast</li>
    <li>Type safe</li>
    <li>Easy to use</li>
</ul>
"""

options = ConversionOptions(
    heading_style="atx",
    strong_em_symbol="*",
    bullets="*+-",
)

markdown = convert(html, options)
print(markdown)

Output:

# Welcome

This is **fast** Rust-powered conversion!

* Blazing fast
* Type safe
* Easy to use

Rust API

use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle};

fn main() {
    let html = r#"
        <h1>Welcome</h1>
        <p>This is <strong>fast</strong> conversion!</p>
    "#;

    let options = ConversionOptions {
        heading_style: HeadingStyle::Atx,
        ..Default::default()
    };

    let markdown = convert(html, Some(options)).unwrap();
    println!("{}", markdown);
}

CLI Usage

# Convert file
html-to-markdown input.html > output.md

# From stdin
cat input.html | html-to-markdown > output.md

# With options
html-to-markdown --heading-style atx --list-indent-width 2 input.html

# Clean web-scraped content
html-to-markdown \
    --preprocess \
    --preset aggressive \
    --no-extract-metadata \
    scraped.html > clean.md
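The CLI can also be driven from a script. The sketch below builds the argument list from the `--heading-style` and `--list-indent-width` flags shown above and reads the Markdown from stdout; the helper names `build_command` and `convert_file` are hypothetical, and the binary is assumed to be on `PATH`.

```python
import shutil
import subprocess
from typing import List


def build_command(input_path: str, heading_style: str = "atx", list_indent_width: int = 2) -> List[str]:
    """Assemble the CLI invocation as an argument list (no shell quoting needed)."""
    return [
        "html-to-markdown",
        "--heading-style", heading_style,
        "--list-indent-width", str(list_indent_width),
        input_path,
    ]


def convert_file(input_path: str, **kwargs) -> str:
    """Run the CLI and return the Markdown it prints to stdout."""
    if shutil.which("html-to-markdown") is None:
        raise FileNotFoundError("html-to-markdown binary not found on PATH")
    result = subprocess.run(
        build_command(input_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(" ".join(build_command("input.html")))
```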

Configuration

Python: Dataclass Configuration

from html_to_markdown import (
    convert,
    ConversionOptions,
    PreprocessingOptions,
)

# Conversion settings
options = ConversionOptions(
    heading_style="atx",  # "atx", "atx_closed", "underlined"
    list_indent_width=2,  # Discord/Slack: use 2
    bullets="*+-",  # Bullet characters
    strong_em_symbol="*",  # "*" or "_"
    escape_asterisks=True,  # Escape * in text
    code_language="python",  # Default code block language
    extract_metadata=True,  # Extract HTML metadata
    highlight_style="double-equal",  # "double-equal", "html", "bold"
)

# HTML preprocessing
preprocessing = PreprocessingOptions(
    enabled=True,
    preset="standard",  # "minimal", "standard", "aggressive"
    remove_navigation=True,
    remove_forms=True,
)

markdown = convert(html, options, preprocessing)

Python: Legacy API (v1 compatibility)

For backward compatibility with existing v1 code:

from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(
    html,
    heading_style="atx",
    list_indent_width=2,
    preprocess=True,
    preprocessing_preset="standard",
)
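The legacy call takes one flat set of keyword arguments, while the dataclass API separates conversion settings from preprocessing settings. The sketch below shows one way such a split could look. It uses hypothetical stand-in dataclasses (`ConversionSettings`, `PreprocessingSettings`) and a hypothetical helper, and it is not the library's actual compatibility layer.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ConversionSettings:  # hypothetical stand-in for ConversionOptions
    heading_style: str = "atx"
    list_indent_width: int = 2


@dataclass
class PreprocessingSettings:  # hypothetical stand-in for PreprocessingOptions
    enabled: bool = False
    preset: str = "standard"


def split_legacy_kwargs(**kwargs) -> Tuple[ConversionSettings, PreprocessingSettings]:
    """Route v1-style keyword arguments into the two settings objects."""
    preprocessing = PreprocessingSettings(
        enabled=kwargs.pop("preprocess", False),
        preset=kwargs.pop("preprocessing_preset", "standard"),
    )
    # Whatever remains is treated as a conversion setting; unknown keys raise TypeError.
    conversion = ConversionSettings(**kwargs)
    return conversion, preprocessing


conversion, preprocessing = split_legacy_kwargs(
    heading_style="atx",
    list_indent_width=2,
    preprocess=True,
    preprocessing_preset="standard",
)
print(conversion, preprocessing)
```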

Common Use Cases

Discord/Slack Compatible Lists

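from html_to_markdown import convert, ConversionOptions

# Sample nested list; sub-items show the effect of the indent width
html = "<ul><li>Parent<ul><li>Child</li></ul></li></ul>"

options = ConversionOptions(list_indent_width=2)
markdown = convert(html, options)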

Clean Web-Scraped HTML

from html_to_markdown import convert, PreprocessingOptions

preprocessing = PreprocessingOptions(
    enabled=True,
    preset="aggressive",  # Heavy cleaning
    remove_navigation=True,
    remove_forms=True,
)

markdown = convert(html, preprocessing=preprocessing)
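To make the idea behind `remove_navigation` and `remove_forms` concrete, here is a standard-library sketch that drops `<nav>` and `<form>` subtrees before conversion. It illustrates the concept only; the class and function names are hypothetical, and the library's Rust preprocessing may work quite differently.

```python
from html.parser import HTMLParser


class StripElements(HTMLParser):
    """Re-emit HTML while skipping everything inside the listed elements."""

    def __init__(self, drop=("nav", "form")):
        super().__init__(convert_charrefs=False)
        self.drop = set(drop)
        self.depth = 0  # greater than 0 while inside a dropped element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self.depth == 0 and tag not in self.drop:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.drop:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")


def strip_elements(html_text: str, drop=("nav", "form")) -> str:
    parser = StripElements(drop)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


page = '<nav><a href="/">Home</a></nav><p>Hello <b>world</b></p><form><input name="q"></form>'
print(strip_elements(page))
```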

hOCR 1.2 Support

Complete hOCR 1.2 specification compliance with support for all elements, properties, and metadata:

from html_to_markdown import convert, ConversionOptions

# Option 1: Document structure extraction (NEW in v2)
# Extracts all hOCR elements and converts to structured markdown
# Supports: paragraphs, sections, chapters, headers/footers, images, math, etc.
markdown = convert(hocr_html)

# Option 2: Legacy table extraction (spatial reconstruction)
# Reconstructs tables from word bounding boxes
options = ConversionOptions(
    hocr_extract_tables=True,
    hocr_table_column_threshold=50,
    hocr_table_row_threshold_ratio=0.5,
)
markdown = convert(hocr_html, options)
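Spatial table reconstruction generally means grouping word bounding boxes into rows and columns. The sketch below shows one plausible reading of that idea. The way it interprets a column threshold (pixels between column starts) and a row threshold ratio (fraction of word height) is an assumption made for illustration and may not match how `hocr_table_column_threshold` and `hocr_table_row_threshold_ratio` behave in the library.

```python
# Each word is (text, x0, y0, x1, y1), as found in an hOCR bbox property.

def group_rows(words, row_ratio=0.5):
    """Group words whose vertical centers are close relative to word height."""
    rows = []
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        _, _, y0, _, y1 = word
        center, height = (y0 + y1) / 2, y1 - y0
        if rows and abs(center - rows[-1]["center"]) <= row_ratio * height:
            rows[-1]["words"].append(word)
        else:
            rows.append({"center": center, "words": [word]})
    return [sorted(r["words"], key=lambda w: w[1]) for r in rows]


def column_starts(words, column_threshold=50):
    """Start a new column when a left edge lies beyond the threshold."""
    starts = []
    for x0 in sorted(w[1] for w in words):
        if not starts or x0 - starts[-1] > column_threshold:
            starts.append(x0)
    return starts


def to_table(words, column_threshold=50, row_ratio=0.5):
    starts = column_starts(words, column_threshold)
    table = []
    for row in group_rows(words, row_ratio):
        cells = [""] * len(starts)
        for text, x0, *_ in row:
            # Assign the word to the right-most column that starts at or before it.
            col = max(i for i, s in enumerate(starts) if s <= x0)
            cells[col] = (cells[col] + " " + text).strip()
        table.append(cells)
    return table


words = [
    ("Name", 10, 10, 60, 30), ("Age", 200, 12, 240, 32),
    ("Ada", 12, 50, 50, 70), ("36", 205, 51, 230, 71),
]
print(to_table(words))
```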

Full hOCR 1.2 Spec Coverage:

✅ All 40 Element Types - Logical structure (12), typesetting (6), float (13), inline (6), engine-specific (3)
✅ All 20+ Properties - bbox, baseline, textangle, poly, x_wconf, x_confs, x_font, x_fsize, order, cflow, cuts, x_bboxes, image, ppageno, lpageno, scan_res, and more
✅ All 5 Metadata Fields - ocr-system, ocr-capabilities, ocr-number-of-pages, ocr-langs, ocr-scripts
✅ 37 Tests - Complete coverage of all elements and properties

Semantic Markdown Conversion:

Element Category	Examples	Markdown Output
Headings	`ocr_title`, `ocr_chapter`	`# Heading`
Sections	`ocr_section`, `ocr_subsection`	`##`, `###`
Structure	`ocr_par`, `ocr_blockquote`	Paragraphs, `> quotes`
Metadata	`ocr_abstract`, `ocr_author`	`Abstract`, `Author`
Floats	`ocr_header`, `ocr_footer`	`Header`, `Footer`
Images	`ocr_image`, `ocr_photo`	`![alt](path)` with image property
Math	`ocr_math`, `ocr_display`	`formula`, ```equation```
Layout	`ocr_separator`	`---` horizontal rule
Inline	`ocrx_word`, `ocr_dropcap`	Text, `Letter`

HTML Entity Handling: Automatically decodes HTML entities for ", ', <, >, and & (for example &quot; and &amp;) in title attributes for proper property parsing.
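hOCR stores its properties in the `title` attribute as semicolon-separated `name value` pairs, and entities such as `&quot;` contain a semicolon themselves, which is why decoding has to happen before splitting. A simplified standard-library sketch follows; the function name is hypothetical and values containing a literal `;` are not handled.

```python
import html


def parse_hocr_title(title: str) -> dict:
    """Decode entities first, then split the title into property name/value pairs."""
    decoded = html.unescape(title)
    properties = {}
    for chunk in decoded.split(";"):
        parts = chunk.strip().split(None, 1)
        if not parts:
            continue  # skip empty segments such as a trailing semicolon
        name = parts[0]
        properties[name] = parts[1] if len(parts) > 1 else ""
    return properties


raw_title = "image &quot;page1.png&quot;; bbox 0 0 100 50; x_wconf 95"
print(parse_hocr_title(raw_title))
```

Splitting before decoding would cut `&quot;` apart at its semicolon and corrupt the `image` value.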

Configuration Reference

V2 Defaults (CommonMark-compliant):

list_indent_width: 2 (CommonMark standard)
bullets: "*+-" (cycles through *, +, - for nested levels)
escape_asterisks: false (minimal escaping)
escape_underscores: false (minimal escaping)
escape_misc: false (minimal escaping)
newline_style: "spaces" (CommonMark: two trailing spaces)
code_block_style: "backticks" (fenced code blocks with ```, better whitespace preservation)
heading_style: "atx" (CommonMark: #)
preprocessing.enabled: false (no preprocessing by default)

For complete configuration reference, see Full Documentation.
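The `bullets` and `list_indent_width` defaults interact: the bullet character is chosen by nesting level and the indent width sets how far each level is shifted. The sketch below illustrates the described behavior with a hypothetical `render_list` helper; it is not the library's renderer.

```python
def render_list(items, bullets="*+-", indent_width=2, depth=0):
    """Render a nested Python list as Markdown list lines.

    Strings are list items; a nested list holds the children of the previous item.
    """
    lines = []
    marker = bullets[depth % len(bullets)]  # cycle through bullets by nesting level
    for item in items:
        if isinstance(item, list):
            lines.extend(render_list(item, bullets, indent_width, depth + 1))
        else:
            lines.append(" " * (indent_width * depth) + f"{marker} {item}")
    return lines


print("\n".join(render_list(["a", ["b", ["c"]], "d"])))
```

With the defaults this prints `* a`, then `+ b` indented by two spaces, then `- c` indented by four, then `* d`.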

Upgrading from v1.x

Backward Compatibility

Most existing v1 code works without changes; the exceptions are listed under Removed Features below:

from html_to_markdown import convert_to_markdown

markdown = convert_to_markdown(html, heading_style="atx")  # Still works!

Modern API (Recommended)

For new projects, use the dataclass-based API:

from html_to_markdown import convert, ConversionOptions

options = ConversionOptions(heading_style="atx", list_indent_width=2)
markdown = convert(html, options)

What Changed in v2

Core Rewrite:

Complete Rust rewrite using tl HTML parser
19-30x performance improvement over v1
CommonMark-compliant defaults (2-space indents, minimal escaping, ATX headings)
No BeautifulSoup or lxml dependencies

Removed Features:

code_language_callback - use code_language for default language
strip / convert options - use strip_tags or preprocessing
convert_to_markdown_stream() - not supported in v2

Planned:

custom_converters - planned for future release

See CHANGELOG.md for complete v1 vs v2 comparison and migration guide.

Kreuzberg Ecosystem

html-to-markdown is part of the Kreuzberg ecosystem, a comprehensive framework for document intelligence and processing. While html-to-markdown focuses on converting HTML to Markdown with maximum performance, Kreuzberg provides a complete solution for:

Document Extraction: Extract text, images, and metadata from 50+ document formats
OCR Processing: Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR)
Table Extraction: Vision-based and OCR-based table detection
Document Classification: Automatic detection of contracts, forms, invoices, etc.
RAG Pipelines: Integration with retrieval-augmented generation workflows

Learn more at kreuzberg.dev or join our Discord community.

Contributing

See CONTRIBUTING.md for development setup, testing, and contribution guidelines.

License

MIT License - see LICENSE for details.

Acknowledgments

Version 1 started as a fork of markdownify that was rewritten, extended, and enhanced with better typing and features. Version 2 is a complete Rust rewrite for high performance.

Support

If you find this library useful, consider:

Your support helps maintain and improve this library!

Project details

Maintainers

nhirschfeld

Release history

3.1.0

Apr 1, 2026

3.0.2

Apr 1, 2026

3.0.1

Mar 31, 2026

3.0.0

Mar 30, 2026

2.30.0

Mar 27, 2026

2.29.0

Mar 22, 2026

2.28.6

Mar 21, 2026

2.28.2

Mar 9, 2026

2.28.1

Mar 7, 2026

2.28.0

Mar 5, 2026

2.27.3

Mar 5, 2026

2.27.2

Mar 2, 2026

2.27.1

Mar 1, 2026

2.27.0

Mar 1, 2026

2.26.3

Feb 28, 2026

2.26.2

Feb 28, 2026

2.26.1

Feb 27, 2026

2.25.1

Feb 17, 2026

2.25.0

Feb 15, 2026

2.24.6

Feb 14, 2026

2.24.5

Feb 1, 2026

2.24.4

Feb 1, 2026

2.24.3

Jan 31, 2026

2.24.1

Jan 29, 2026

2.23.4

Jan 20, 2026

2.23.3

Jan 20, 2026

2.23.2

Jan 20, 2026

2.23.1

Jan 20, 2026

2.23.0

Jan 19, 2026

2.22.5

Jan 16, 2026

2.22.2

Jan 13, 2026

2.22.1

Jan 13, 2026

2.22.0

Jan 13, 2026

2.21.1

Jan 13, 2026

2.20.0

Jan 5, 2026

2.19.8

Jan 5, 2026

2.19.7

Jan 3, 2026

2.19.6

Jan 3, 2026

2.19.5

Jan 2, 2026

2.19.4

Jan 2, 2026

2.19.3

Jan 2, 2026

2.19.2

Dec 30, 2025

2.19.1

Dec 29, 2025

2.19.0

Dec 29, 2025

2.18.0

Dec 29, 2025

2.16.1

Dec 22, 2025

2.16.0

Dec 22, 2025

2.15.0

Dec 19, 2025

2.14.11

Dec 16, 2025

2.14.10

Dec 16, 2025

2.14.9

Dec 16, 2025

2.14.8

Dec 15, 2025

2.14.7

Dec 15, 2025

2.14.6

Dec 15, 2025

2.14.5

Dec 15, 2025

2.14.4

Dec 15, 2025

2.14.3

Dec 15, 2025

2.14.2

Dec 13, 2025

2.14.1

Dec 12, 2025

2.14.0

Dec 11, 2025

2.13.0

Dec 10, 2025

2.12.1

Dec 9, 2025

2.12.0

Dec 8, 2025

2.11.4

Dec 8, 2025

2.11.3

Dec 8, 2025

2.11.1

Dec 5, 2025

2.10.1

Dec 4, 2025

2.9.2

Nov 28, 2025

2.9.1

Nov 21, 2025

2.9.0

Nov 20, 2025

2.8.3

Nov 16, 2025

2.8.2

Nov 15, 2025

2.8.1

Nov 15, 2025

2.8.0

Nov 15, 2025

2.7.2

Nov 14, 2025

2.7.1

Nov 12, 2025

2.7.0

Nov 11, 2025

2.6.6

Nov 10, 2025

2.6.5

Nov 8, 2025

2.6.4

Nov 8, 2025

2.6.3

Nov 8, 2025

2.6.2

Nov 7, 2025

2.6.1

Nov 7, 2025

2.6.0

Nov 7, 2025

2.5.6

Oct 29, 2025

2.5.5

Oct 29, 2025

2.5.4

Oct 29, 2025

2.5.3

Oct 29, 2025

2.5.2

Oct 29, 2025

2.5.1

Oct 29, 2025

2.5.0

Oct 24, 2025

2.4.2

Oct 24, 2025

2.4.1

Oct 22, 2025

2.4.0

Oct 22, 2025

2.3.4

Oct 14, 2025

2.3.3

Oct 14, 2025

2.3.0

Oct 13, 2025

2.2.0

Oct 12, 2025

2.1.2

Oct 11, 2025

2.1.0

Oct 11, 2025

2.0.1

Oct 10, 2025

This version

2.0.0

Oct 10, 2025

1.16.0

Sep 27, 2025

1.15.0

Sep 26, 2025

1.14.1

Sep 25, 2025

1.14.0

Sep 22, 2025

1.13.0

Sep 16, 2025

1.12.1

Sep 15, 2025

1.12.0

Sep 15, 2025

1.11.0

Sep 13, 2025

1.10.0

Sep 8, 2025

1.9.1

Sep 2, 2025

1.9.0

Jul 29, 2025

1.8.0

Jul 12, 2025

1.6.0

Jul 11, 2025

1.5.0

Jul 10, 2025

1.4.0

Jun 23, 2025

1.3.3

Jun 3, 2025

1.3.2

Apr 24, 2025

1.3.1

Apr 19, 2025

1.3.0

Apr 2, 2025

1.2.1

Mar 27, 2025

1.2.0

Feb 3, 2025

1.1.0

Sep 9, 2024

1.0.0

Sep 8, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distributions

html_to_markdown-2.0.0-cp310-abi3-win_amd64.whl (4.6 MB)

Uploaded Oct 10, 2025; CPython 3.10+, Windows x86-64

html_to_markdown-2.0.0-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (4.9 MB)

Uploaded Oct 10, 2025; CPython 3.10+, manylinux: glibc 2.17+ x86-64

html_to_markdown-2.0.0-cp310-abi3-macosx_11_0_arm64.whl (4.4 MB)

Uploaded Oct 10, 2025; CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file html_to_markdown-2.0.0-cp310-abi3-win_amd64.whl.

File metadata

Download URL: html_to_markdown-2.0.0-cp310-abi3-win_amd64.whl
Upload date: Oct 10, 2025
Size: 4.6 MB
Tags: CPython 3.10+, Windows x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7
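The wheel file names follow the standard wheel naming convention (`distribution-version-python tag-abi tag-platform tag.whl`, with an optional build tag after the version), which is where the "CPython 3.10+" and platform tags come from. A small parsing sketch, with `WheelInfo` and `parse_wheel_filename` as hypothetical helper names:

```python
from typing import List, NamedTuple


class WheelInfo(NamedTuple):
    distribution: str
    version: str
    python_tags: List[str]
    abi_tags: List[str]
    platform_tags: List[str]


def parse_wheel_filename(filename: str) -> WheelInfo:
    """Split a wheel file name into its components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        del parts[2]  # drop the optional build tag between version and python tag
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    distribution, version, python_tag, abi_tag, platform_tag = parts
    # Each tag may be a dot-separated set, e.g. two manylinux aliases.
    return WheelInfo(
        distribution,
        version,
        python_tag.split("."),
        abi_tag.split("."),
        platform_tag.split("."),
    )


info = parse_wheel_filename(
    "html_to_markdown-2.0.0-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl"
)
print(info)
```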

File hashes

Hashes for html_to_markdown-2.0.0-cp310-abi3-win_amd64.whl
Algorithm	Hash digest
SHA256	`cb0e3648712cf8552fca522a318b0fb46ba7393051b63fa92b9d3c08fb8c056c`
MD5	`122e6258f9439306f958006ae863a5ff`
BLAKE2b-256	`312c88b3da650fb425f548104cd070ee5af047b3588995260601835ef372196e`
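A downloaded wheel can be checked against the published SHA256 digest with the standard library. The sketch below is one way to do it; the function names are illustrative, and the file path in the usage comment assumes the wheel sits in the current directory.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large wheels do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the computed digest with the published one, ignoring case."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# Usage (assumes the wheel has been downloaded next to this script):
# matches_published_hash(
#     "html_to_markdown-2.0.0-cp310-abi3-win_amd64.whl",
#     "cb0e3648712cf8552fca522a318b0fb46ba7393051b63fa92b9d3c08fb8c056c",
# )
```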

Provenance

The following attestation bundles were made for html_to_markdown-2.0.0-cp310-abi3-win_amd64.whl:

Publisher: publish.yaml on Goldziher/html-to-markdown

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: html_to_markdown-2.0.0-cp310-abi3-win_amd64.whl
- Subject digest: cb0e3648712cf8552fca522a318b0fb46ba7393051b63fa92b9d3c08fb8c056c
- Sigstore transparency entry: 599767912
- Sigstore integration time: Oct 10, 2025
Source repository:
- Permalink: Goldziher/html-to-markdown@7fd82739ac9464443fbfe75ded7b5028805a1659
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/Goldziher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@7fd82739ac9464443fbfe75ded7b5028805a1659
- Trigger Event: release

File details

Details for the file html_to_markdown-2.0.0-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

Download URL: html_to_markdown-2.0.0-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Upload date: Oct 10, 2025
Size: 4.9 MB
Tags: CPython 3.10+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for html_to_markdown-2.0.0-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm	Hash digest
SHA256	`3efb63268ff313149494e024f6d388c4a0a0ae3ce3d26e300ff476ec0ecfac2b`
MD5	`84f0a46ebe3c2b5d75d169533a9e0ab5`
BLAKE2b-256	`1411446a467639c4ad46bdfa21c7714086f70ad67db36be86ec743a9663222ad`

Provenance

The following attestation bundles were made for html_to_markdown-2.0.0-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl:

Publisher: publish.yaml on Goldziher/html-to-markdown

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: html_to_markdown-2.0.0-cp310-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
- Subject digest: 3efb63268ff313149494e024f6d388c4a0a0ae3ce3d26e300ff476ec0ecfac2b
- Sigstore transparency entry: 599767923
- Sigstore integration time: Oct 10, 2025
Source repository:
- Permalink: Goldziher/html-to-markdown@7fd82739ac9464443fbfe75ded7b5028805a1659
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/Goldziher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@7fd82739ac9464443fbfe75ded7b5028805a1659
- Trigger Event: release

File details

Details for the file html_to_markdown-2.0.0-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: html_to_markdown-2.0.0-cp310-abi3-macosx_11_0_arm64.whl
Upload date: Oct 10, 2025
Size: 4.4 MB
Tags: CPython 3.10+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for html_to_markdown-2.0.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`8951c3086ae376b98fd117642a909c41081fe3f6841e9568c6b52e28a2bb50f6`
MD5	`2be04cc3a49559ff28ffbe871e3aaf65`
BLAKE2b-256	`61903f001ef83441264320eb669dba9fa0dcbceb93d0385dedb9ae2f9cde0a2f`

Provenance

The following attestation bundles were made for html_to_markdown-2.0.0-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: publish.yaml on Goldziher/html-to-markdown

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: html_to_markdown-2.0.0-cp310-abi3-macosx_11_0_arm64.whl
- Subject digest: 8951c3086ae376b98fd117642a909c41081fe3f6841e9568c6b52e28a2bb50f6
- Sigstore transparency entry: 599767901
- Sigstore integration time: Oct 10, 2025
Source repository:
- Permalink: Goldziher/html-to-markdown@7fd82739ac9464443fbfe75ded7b5028805a1659
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/Goldziher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@7fd82739ac9464443fbfe75ded7b5028805a1659
- Trigger Event: release
