Rust implementation of Python markdownify with a Python API

markdownify-rs

Rust implementation of Python markdownify with output parity as the primary goal.

Python bindings

Build and install locally with maturin (uv):

uv venv
uv pip install maturin
.venv/bin/maturin develop --features python

Build via pip (PEP 517):

uv pip install .

Usage:

from markdownify_rs import markdownify

print(markdownify("<b>Hello</b>"))

BS4-style read-only HTML API:

from markdownify_rs import Soup

soup = Soup("<div class='main'><p>Hello</p></div>")
p = soup.find("p")
print(p.get_text())                  # "Hello"
print(soup.select("div.main > p"))   # [<Tag <p>Hello</p>>]
print(p.parent.name)                 # "div"

BS4 batch query API (GIL-free execution in Rust):

from markdownify_rs import QueryPlan, run_batch

plan = QueryPlan()
plan.add_select_count("links", "a[href]")
plan.add_select_one_text("title", "title")
plan.add_get_text("all_text", " ", True)

rows = run_batch(["<html><title>A</title><a href='/x'>x</a></html>"], plan)
print(rows[0]["links"], rows[0]["title"])  # 1, "A"

Full Python bs4 API + QueryPlan guide:

  • PYTHON_BS4_API.md

Markdown -> HTML usage (Python-Markdown-style API):

from markdownify_rs import markdown, Markdown

print(markdown("# Hello", extensions=["tables", "footnotes"]))

md = Markdown(extensions=["toc", "admonition"])
print(md.convert("[TOC]\n\n# Title"))
print(md.toc_tokens)  # [{"level": 1, "id": "...", "name": "Title"}]

# Optional CommonMark/GFM-ish toggles (default False):
print(markdown("visit https://example.com", autolink=True))
print(markdown("- [ ] task", tasklist=True))
print(markdown("~~old~~", strikethrough=True))

# Parsing mode:
# - python_compat (default): best-effort Python-Markdown compatibility
# - fast: pure comrak/CommonMark fast path
print(markdown("1) one\n2) two", extensions=["sane_lists"], mode="python_compat"))
print(markdown("1) one\n2) two", extensions=["sane_lists"], mode="fast"))

Full Python markdown-to-HTML API quickstart (all public args):

  • PYTHON_MARKDOWN_TO_HTML_QUICKSTART.md

Batch usage (parallelized in Rust):

from markdownify_rs import markdownify_batch

outputs = markdownify_batch(["<b>Hello</b>", "<i>World</i>"])

Markdown-adjacent utilities (submodule):

from markdownify_rs.markdown_utils import (
    split_into_chunks,
    split_into_chunks_batch,
    coalesce_small_chunks,
    link_percentage,
    link_percentage_batch,
    filter_by_link_percentage,
    strip_links_with_substring,
    strip_links_with_substring_batch,
    remove_large_tables,
    remove_large_tables_batch,
    remove_lines_with_substring,
    remove_lines_with_substring_batch,
    fix_newlines,
    fix_newlines_batch,
    split_on_dividers,
    strip_html_and_contents,
    strip_html_and_contents_batch,
    strip_data_uri_images,
    text_pipeline_batch,
)

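# Sample inputs (placeholders; substitute your own Markdown strings).
text = "# Title\n\nSome [link](javascript:void(0)) text.\n\n## Section\n\nMore text."
text1, text2 = text, "Plain paragraph with no links."

chunks = split_into_chunks(text, how="sections")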
chunks_batch = split_into_chunks_batch([text1, text2], how="sections")
cleaned = strip_links_with_substring(text, "javascript")
cleaned_batch = strip_links_with_substring_batch([text1, text2], "javascript")
filtered = filter_by_link_percentage([text1, text2], threshold=0.5)
pipelined = text_pipeline_batch(
    [text1, text2],
    steps=[
        ("strip_links_with_substring", {"substring": "javascript"}),
        ("remove_large_tables", {"max_cells": 200}),
        ("fix_newlines", {}),
    ],
)
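
The steps argument above is a list of (name, kwargs) pairs applied in order. As a rough illustration of that pattern only, the pure-Python sketch below dispatches such a list through a small registry. The names drop_lines_with, collapse_blank_lines, REGISTRY, and run_pipeline are hypothetical stand-ins invented for this example; they are not part of markdownify_rs, and the real utilities may behave differently.

```python
import re

# Hypothetical stand-ins, NOT the markdownify_rs implementations.
def drop_lines_with(text, substring):
    """Remove every line that contains `substring`."""
    return "\n".join(line for line in text.split("\n") if substring not in line)

def collapse_blank_lines(text):
    """Collapse runs of three or more newlines into exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)

# Map step names to callables so steps can be described as plain data.
REGISTRY = {
    "drop_lines_with": drop_lines_with,
    "collapse_blank_lines": collapse_blank_lines,
}

def run_pipeline(texts, steps):
    """Apply each (name, kwargs) step, in order, to every input text."""
    results = []
    for text in texts:
        for name, kwargs in steps:
            text = REGISTRY[name](text, **kwargs)
        results.append(text)
    return results

docs = ["keep\nads: buy now\n\n\n\nalso keep", "plain"]
out = run_pipeline(
    docs,
    [
        ("drop_lines_with", {"substring": "ads:"}),
        ("collapse_blank_lines", {}),
    ],
)
print(out)
```

Describing steps as data keeps the whole pipeline in a single call, which is presumably what allows the library to run every step for every document on the Rust side.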

Notes:

  • code_language_callback is not yet supported in the Python bindings.

CLI:

markdownify-rs input.html
cat input.html | markdownify-rs

Parity hacks (scraper vs. BeautifulSoup)

These are explicit, ad hoc behaviors added on top of scraper/html5ever to match python-markdownify (BeautifulSoup + html.parser) output. They are intentionally quirky and may be replaced with more “correct” behavior once parity is stable.

  • <br> parser quirk: With BeautifulSoup’s html.parser, if a non‑self‑closing <br> appears before a self‑closing <br/>, the later <br/> can be treated like an opening <br> whose contents run until that implicit <br> is closed (usually when its parent closes). We emulate this by removing the content between that <br/> and the closing tag that ends the implicit <br> (ignoring <br> tags inside comments/scripts), which matches python-markdownify’s output.
  • Leading whitespace reconstruction: html.parser preserves whitespace‑only text nodes that html5ever drops (notably between <html> children and at the start of <body>). We reconstruct the normalized leading whitespace prefix (using the same “single space vs. single newline” rules as BeautifulSoup’s endData) and merge it with the converter output, carrying it across non‑block tags and empty custom elements whose contents are only comments/whitespace.
  • Table header inference: For tables whose header row is effectively empty, we do not force a “---” separator row, which matches python-markdownify’s behavior.
  • Top-level <td>/<th> wrapping: If input is a bare <td>/<th>, we wrap it in a <table><tr>…</tr></table> fragment to align with python-markdownify output.
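
The first two quirks above come from how html.parser tokenizes input. The sketch below uses only Python's standard-library HTMLParser to record the raw events, so it shows the tokenizer level only: it does not reproduce BeautifulSoup's tree building or the emulation in this crate. EventRecorder and record are names made up for this illustration.

```python
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Collect tokenizer events as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style self-closing tags such as <br/>.
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

def record(html):
    """Feed `html` to a fresh recorder and return its event list."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events

# <br> and <br/> arrive through different callbacks.
br_events = record("<p>a<br>b<br/>c</p>")
# Whitespace-only text between tags is reported as data.
ws_events = record("<html> <body>\n<p>x</p></body></html>")

for event in br_events + ws_events:
    print(event)
```

The two forms of br reach the tree builder through different callbacks, and whitespace-only runs between html and body children arrive as ordinary data events, which is the information an html5ever-based parser no longer has by the time conversion starts.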

Benchmarks

Datasets

  • Michigan Statutes (JSONL, 241 HTML documents).
    • Total HTML bytes: 101,029,525 (~96.35 MiB).
    • Largest document: 8,034,686 bytes (~7.66 MiB).
    • Source file size: 102,856,616 bytes (~98.10 MiB).
  • Law websites (CSV, 3,136 HTML documents).
    • Total HTML bytes: 111,747,114 (~106.57 MiB).
    • Largest document: 1,381,380 bytes (~1.32 MiB).
    • Source file size: 148,486,852 bytes (~141.61 MiB).

Run

# Michigan Statutes (JSONL)
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify --dist-name markdownify --label markdownify

# Law websites (CSV)
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify --dist-name markdownify --label markdownify

Python binding comparison (both run through Python, 2026-01-28, Apple M3, macOS 14.6 / Darwin 24.6.0, Python 3.13.0)

Michigan Statutes (JSONL)

  • markdownify_rs convert_all (241 docs): time 2.266594 s, throughput 42.508 MiB/s
  • markdownify_rs convert_all_batch (241 docs): time 0.538012 s, throughput 179.084 MiB/s
  • markdownify_rs convert_largest (8,034,686 bytes): time 187.941 ms, throughput 40.771 MiB/s
  • markdownify convert_all (241 docs): time 29.654787 s, throughput 3.249 MiB/s
  • markdownify convert_largest (8,034,686 bytes): time 4.496880 s, throughput 1.704 MiB/s

Speedup summary (wall-clock time, lower is better)

| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.266594 s | 0.538012 s | 29.654787 s | 13.08x (+1208.34%) | 55.12x (+5411.92%) | 4.21x (+321.29%) |
| convert_largest | 187.941 ms | n/a | 4.496880 s | 23.93x (+2292.71%) | n/a | n/a |
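
The derived figures in this summary are consistent with simple ratios of the reported wall-clock times. The sketch below recomputes the convert_all row for the Michigan Statutes dataset. The formulas are an assumption about how the numbers were derived, not taken from scripts/bench_python.py, and the helper names are illustrative.

```python
MIB = 1024 * 1024  # bytes per MiB

def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s

def percent_faster(baseline_s, candidate_s):
    """Speedup expressed as a percentage gain over the baseline."""
    return (speedup(baseline_s, candidate_s) - 1.0) * 100.0

def throughput_mib_s(total_bytes, seconds):
    """Input bytes processed per second, in MiB/s."""
    return total_bytes / MIB / seconds

# Values copied from the Michigan Statutes convert_all results.
rs_time = 2.266594      # markdownify_rs, seconds
batch_time = 0.538012   # markdownify_rs batch, seconds
py_time = 29.654787     # markdownify, seconds
total_bytes = 101_029_525

print(f"rs vs py:    {speedup(py_time, rs_time):.2f}x "
      f"(+{percent_faster(py_time, rs_time):.2f}%)")
print(f"batch vs py: {speedup(py_time, batch_time):.2f}x "
      f"(+{percent_faster(py_time, batch_time):.2f}%)")
print(f"batch vs rs: {speedup(rs_time, batch_time):.2f}x "
      f"(+{percent_faster(rs_time, batch_time):.2f}%)")
print(f"rs throughput: {throughput_mib_s(total_bytes, rs_time):.3f} MiB/s")
```

Under these assumptions a speedup of N corresponds to a gain of (N - 1) × 100 percent, which is why 13.08x is listed next to +1208.34%.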

Law websites (CSV)

  • markdownify_rs convert_all (3,136 docs): time 2.596691 s, throughput 41.041 MiB/s
  • markdownify_rs convert_all_batch (3,136 docs): time 0.672013 s, throughput 158.584 MiB/s
  • markdownify_rs convert_largest (1,381,380 bytes): time 54.482 ms, throughput 24.180 MiB/s
  • markdownify convert_all (3,136 docs): time 17.680570 s, throughput 6.028 MiB/s
  • markdownify convert_largest (1,381,380 bytes): time 280.459 ms, throughput 4.697 MiB/s

Speedup summary (wall-clock time, lower is better)

| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.596691 s | 0.672013 s | 17.680570 s | 6.81x (+580.89%) | 26.31x (+2530.99%) | 3.86x (+286.40%) |
| convert_largest | 54.482 ms | n/a | 280.459 ms | 5.15x (+414.77%) | n/a | n/a |

Markdown -> HTML parity/speed report:

.venv/bin/python scripts/report_markdown_to_html_parity_speed.py \
  --corpus-dir /tmp/test_markdowns \
  --report BENCHMARKS_MARKDOWN_TO_HTML.md

Download files

Source Distribution

  • markdownify_rs-0.1.4.tar.gz (138.5 kB): Source

Built Distributions

  • markdownify_rs-0.1.4-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.9 MB): PyPy, manylinux: glibc 2.17+ ARM64
  • markdownify_rs-0.1.4-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.9 MB): PyPy, manylinux: glibc 2.17+ ARM64
  • markdownify_rs-0.1.4-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.9 MB): PyPy, manylinux: glibc 2.17+ ARM64
  • markdownify_rs-0.1.4-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.9 MB): PyPy, manylinux: glibc 2.17+ ARM64
  • markdownify_rs-0.1.4-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB): CPython 3.8+, manylinux: glibc 2.17+ x86-64
  • markdownify_rs-0.1.4-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.9 MB): CPython 3.8+, manylinux: glibc 2.17+ ARM64
  • markdownify_rs-0.1.4-cp38-abi3-macosx_11_0_arm64.whl (2.0 MB): CPython 3.8+, macOS 11.0+ ARM64
