Rust implementation of Python markdownify with a Python API

markdownify-rs

Rust implementation of Python markdownify with output parity as the primary goal.

Python bindings

Build and install locally with maturin (uv):

uv venv
uv pip install maturin
.venv/bin/maturin develop --features python

Build via pip (PEP 517):

uv pip install .

Usage:

from markdownify_rs import markdownify

print(markdownify("<b>Hello</b>"))

BS4-style read-only HTML API:

from markdownify_rs import Soup

soup = Soup("<div class='main'><p>Hello</p></div>")
p = soup.find("p")
print(p.get_text())                  # "Hello"
print(soup.select("div.main > p"))   # [<Tag <p>Hello</p>>]
print(p.parent.name)                 # "div"

BS4 batch query API (GIL-free execution in Rust):

from markdownify_rs import QueryPlan, run_batch

plan = QueryPlan()
plan.add_select_count("links", "a[href]")
plan.add_select_one_text("title", "title")
plan.add_select_all_texts("link_texts", "a", separator=" ", strip=True)
plan.add_get_text("all_text", " ", True)

rows = run_batch(["<html><title>A</title><a href='/x'>x</a></html>"], plan)
print(rows[0]["links"], rows[0]["title"], rows[0]["link_texts"])  # 1, "A", ["x"]

Full Python bs4 API + QueryPlan guide:

  • PYTHON_BS4_API.md

Markdown -> HTML usage (Python-Markdown-style API):

from markdownify_rs import markdown, Markdown

print(markdown("# Hello", extensions=["tables", "footnotes"]))

md = Markdown(extensions=["toc", "admonition"])
print(md.convert("[TOC]\n\n# Title"))
print(md.toc_tokens)  # [{"level": 1, "id": "...", "name": "Title"}]

# Optional CommonMark/GFM-ish toggles (default False):
print(markdown("visit https://example.com", autolink=True))
print(markdown("- [ ] task", tasklist=True))
print(markdown("~~old~~", strikethrough=True))

# Parsing mode:
# - python_compat (default): best-effort Python-Markdown compatibility
# - fast: pure comrak/CommonMark fast path
print(markdown("1) one\n2) two", extensions=["sane_lists"], mode="python_compat"))
print(markdown("1) one\n2) two", extensions=["sane_lists"], mode="fast"))

Full Python markdown-to-HTML API quickstart (all public args):

  • PYTHON_MARKDOWN_TO_HTML_QUICKSTART.md

Batch usage (parallelized in Rust):

from markdownify_rs import markdownify_batch

outputs = markdownify_batch(["<b>Hello</b>", "<i>World</i>"])

Markdown-adjacent utilities (submodule):

from markdownify_rs.markdown_utils import (
    split_into_chunks,
    split_into_chunks_batch,
    coalesce_small_chunks,
    link_percentage,
    link_percentage_batch,
    filter_by_link_percentage,
    strip_links_with_substring,
    strip_links_with_substring_batch,
    remove_large_tables,
    remove_large_tables_batch,
    remove_lines_with_substring,
    remove_lines_with_substring_batch,
    fix_newlines,
    fix_newlines_batch,
    split_on_dividers,
    strip_html_and_contents,
    strip_html_and_contents_batch,
    strip_data_uri_images,
    text_pipeline_batch,
)

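# Illustrative inputs so the calls below have something to operate on.
text = "# Title\n\nIntro with a [link](javascript:void(0)).\n\n## Section\n\nBody."
text1, text2 = text, "Plain paragraph without links."

chunks = split_into_chunks(text, how="sections")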
chunks_batch = split_into_chunks_batch([text1, text2], how="sections")
cleaned = strip_links_with_substring(text, "javascript")
cleaned_batch = strip_links_with_substring_batch([text1, text2], "javascript")
filtered = filter_by_link_percentage([text1, text2], threshold=0.5)
pipelined = text_pipeline_batch(
    [text1, text2],
    steps=[
        ("strip_links_with_substring", {"substring": "javascript"}),
        ("remove_large_tables", {"max_cells": 200}),
        ("fix_newlines", {}),
    ],
)

Notes:

  • code_language_callback is not yet supported in the Python bindings.

CLI:

markdownify-rs input.html
cat input.html | markdownify-rs

Parity hacks (scraper vs. BeautifulSoup)

These are explicit, ad hoc behaviors added on top of scraper/html5ever to match python-markdownify (BeautifulSoup + html.parser) output. They are intentionally quirky and may be replaced with more “correct” behavior once parity is stable.

  • <br> parser quirk: With BeautifulSoup’s html.parser, if a non‑self‑closing <br> appears before a self‑closing <br/>, the later <br/> can be treated like an opening <br> whose contents run until that implicit <br> is closed (usually when its parent closes). We emulate this by removing the content between that <br/> and the closing tag that ends the implicit <br> (ignoring <br> tags inside comments/scripts), which matches python-markdownify’s output.
  • Leading whitespace reconstruction: html.parser preserves whitespace‑only text nodes that html5ever drops (notably between <html> children and at the start of <body>). We reconstruct the normalized leading whitespace prefix (using the same “single space vs. single newline” rules as BeautifulSoup’s endData) and merge it with the converter output, carrying it across non‑block tags and empty custom elements whose contents are only comments/whitespace.
  • Table header inference: To match python-markdownify, we do not force a “---” separator row for tables whose header row is effectively empty.
  • Top-level <td>/<th> wrapping: If input is a bare <td>/<th>, we wrap it in a <table><tr>…</tr></table> fragment to align with python-markdownify output.
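The first two quirks trace back to what Python's html.parser reports to BeautifulSoup. The sketch below uses only the standard library to record those raw events. It is not this project's Rust code, and the `EventRecorder` class and `record_events` helper are hypothetical names introduced for illustration. It shows that a whitespace-only text node between `<html>` and `<body>` is reported as data, and that `<br>` and `<br/>` arrive as different event types.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record the raw events emitted by Python's html.parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for opening tags such as <br> or <p>.
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called only for self-closing syntax such as <br/>.
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Whitespace-only text is reported like any other text.
        self.events.append(("data", data))


def record_events(html):
    """Feed one document and return the list of recorded events."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


events = record_events("<html> <body><p>a<br>b<br/>c</p></body></html>")
for event in events:
    print(event)
```

How BeautifulSoup turns these events into a tree (and therefore the exact shape of the quirks) is outside this sketch; it only shows the parser-level input that the parity hacks have to account for.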

Benchmarks

Datasets

  • Michigan Statutes (JSONL, 241 HTML documents).
    • Total HTML bytes: 101,029,525 (~96.35 MiB).
    • Largest document: 8,034,686 bytes (~7.66 MiB).
    • Source file size: 102,856,616 bytes (~98.09 MiB).
  • Law websites (CSV, 3,136 HTML documents).
    • Total HTML bytes: 111,747,114 (~106.57 MiB).
    • Largest document: 1,381,380 bytes (~1.32 MiB).
    • Source file size: 148,486,852 bytes (~141.61 MiB).

Run

# Michigan Statutes (JSONL)
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify --dist-name markdownify --label markdownify

# Law websites (CSV)
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify --dist-name markdownify --label markdownify

Python binding comparison (both run through Python, 2026-01-28, Apple M3, macOS 14.6 / Darwin 24.6.0, Python 3.13.0)

Michigan Statutes (JSONL)

  • markdownify_rs convert_all (241 docs): time 2.266594 s, throughput 42.508 MiB/s
  • markdownify_rs convert_all_batch (241 docs): time 0.538012 s, throughput 179.084 MiB/s
  • markdownify_rs convert_largest (8,034,686 bytes): time 187.941 ms, throughput 40.771 MiB/s
  • markdownify convert_all (241 docs): time 29.654787 s, throughput 3.249 MiB/s
  • markdownify convert_largest (8,034,686 bytes): time 4.496880 s, throughput 1.704 MiB/s
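The throughput figures above appear to be the total HTML bytes converted to MiB and divided by wall-clock seconds. The following is a minimal sketch of that arithmetic using the Michigan Statutes numbers; the helper name `throughput_mib_per_s` is an assumption for illustration and is not taken from the benchmark script.

```python
MIB = 1024 * 1024  # bytes per MiB


def throughput_mib_per_s(total_bytes, seconds):
    """Return throughput in MiB/s for a given byte count and duration."""
    return total_bytes / MIB / seconds


# Michigan Statutes: total HTML bytes and the largest document.
rs_all = throughput_mib_per_s(101_029_525, 2.266594)
rs_batch = throughput_mib_per_s(101_029_525, 0.538012)
rs_largest = throughput_mib_per_s(8_034_686, 0.187941)

print(f"convert_all:       {rs_all:.3f} MiB/s")
print(f"convert_all_batch: {rs_batch:.3f} MiB/s")
print(f"convert_largest:   {rs_largest:.3f} MiB/s")
```

The printed values should agree with the reported 42.508, 179.084, and 40.771 MiB/s to within rounding.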

Speedup summary (wall-clock time, lower is better)

Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs
--- | --- | --- | --- | --- | --- | ---
convert_all | 2.266594 s | 0.538012 s | 29.654787 s | 13.08x (+1208.34%) | 55.12x (+5411.92%) | 4.21x (+321.29%)
convert_largest | 187.941 ms | n/a | 4.496880 s | 23.93x (+2292.71%) | n/a | n/a
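The speedup columns can be reproduced from the wall-clock times: the ratio is the baseline time divided by the faster time, and the percentage is that ratio minus one, times 100. This is a small sketch of the arithmetic for the Michigan Statutes convert_all row; the function name `speedup` is hypothetical.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return (ratio, percent gain) of candidate relative to baseline."""
    ratio = baseline_seconds / candidate_seconds
    percent = (ratio - 1.0) * 100.0
    return ratio, percent


# convert_all on Michigan Statutes, times in seconds.
py_time, rs_time, batch_time = 29.654787, 2.266594, 0.538012

rs_vs_py = speedup(py_time, rs_time)
batch_vs_py = speedup(py_time, batch_time)
batch_vs_rs = speedup(rs_time, batch_time)

for label, (ratio, percent) in [
    ("rs vs py", rs_vs_py),
    ("batch vs py", batch_vs_py),
    ("batch vs rs", batch_vs_rs),
]:
    print(f"{label}: {ratio:.2f}x (+{percent:.2f}%)")
```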

Law websites (CSV)

  • markdownify_rs convert_all (3,136 docs): time 2.596691 s, throughput 41.041 MiB/s
  • markdownify_rs convert_all_batch (3,136 docs): time 0.672013 s, throughput 158.584 MiB/s
  • markdownify_rs convert_largest (1,381,380 bytes): time 54.482 ms, throughput 24.180 MiB/s
  • markdownify convert_all (3,136 docs): time 17.680570 s, throughput 6.028 MiB/s
  • markdownify convert_largest (1,381,380 bytes): time 280.459 ms, throughput 4.697 MiB/s

Speedup summary (wall-clock time, lower is better)

Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs
--- | --- | --- | --- | --- | --- | ---
convert_all | 2.596691 s | 0.672013 s | 17.680570 s | 6.81x (+580.89%) | 26.31x (+2530.99%) | 3.86x (+286.40%)
convert_largest | 54.482 ms | n/a | 280.459 ms | 5.15x (+414.77%) | n/a | n/a

Markdown -> HTML parity/speed report:

.venv/bin/python scripts/report_markdown_to_html_parity_speed.py \
  --corpus-dir /tmp/test_markdowns \
  --report BENCHMARKS_MARKDOWN_TO_HTML.md
