Skip to main content

Rust implementation of Python markdownify with a Python API

Project description

markdownify-rs

Rust implementation of Python markdownify with output parity as the primary goal.

Python bindings

Build and install locally with maturin (uv):

uv venv
uv pip install maturin
.venv/bin/maturin develop --features python

Build via pip (PEP 517):

uv pip install .

Usage:

from markdownify_rs import markdownify

print(markdownify("<b>Hello</b>"))

Batch usage (parallelized in Rust):

from markdownify_rs import markdownify_batch

outputs = markdownify_batch(["<b>Hello</b>", "<i>World</i>"])

Markdown-adjacent utilities (submodule):

from markdownify_rs.markdown_utils import (
    split_into_chunks,
    split_into_chunks_batch,
    coalesce_small_chunks,
    strip_links_with_substring,
    remove_large_tables,
    split_on_dividers,
    link_percentage,
    strip_html_and_contents,
    strip_data_uri_images,
)

chunks = split_into_chunks(text, how="sections")
chunks_batch = split_into_chunks_batch([text1, text2], how="sections")
cleaned = strip_links_with_substring(text, "javascript")

Notes:

  • code_language_callback is not yet supported in the Python bindings.

CLI:

markdownify-rs input.html
cat input.html | markdownify-rs

Parity hacks (scraper vs. BeautifulSoup)

These are explicit, ad hoc behaviors added on top of scraper/html5ever to match python-markdownify (BeautifulSoup + html.parser) output. They are intentionally quirky and may be replaced with more “correct” behavior once parity is stable.

  • <br> parser quirk: With BeautifulSoup’s html.parser, if a non‑self‑closing <br> appears before a self‑closing <br/>, the later <br/> can be treated like an opening <br> whose contents run until that implicit <br> is closed (usually when its parent closes). We emulate this by removing the content between that <br/> and the closing tag that ends the implicit <br> (ignoring <br> tags inside comments/scripts), which matches python-markdownify’s output.
  • Leading whitespace reconstruction: html.parser preserves whitespace‑only text nodes that html5ever drops (notably between <html> children and at the start of <body>). We reconstruct the normalized leading whitespace prefix (using the same “single space vs. single newline” rules as BeautifulSoup’s endData) and merge it with the converter output, carrying it across non‑block tags and empty custom elements whose contents are only comments/whitespace.
  • Table header inference: For tables whose header row is effectively empty, we avoid forcing a “---” separator to match python-markdownify behavior.
  • Top-level <td>/<th> wrapping: If input is a bare <td>/<th>, we wrap it in a <table><tr>…</tr></table> fragment to align with python-markdownify output.

Benchmarks

Datasets

  • Michigan Statutes (JSONL, 241 HTML documents).
    • Total HTML bytes: 101,029,525 (~96.35 MiB).
    • Largest document: 8,034,686 bytes (~7.66 MiB).
    • Source file size: 102,856,616 bytes (~98.10 MiB).
  • Law websites (CSV, 3,136 HTML documents).
    • Total HTML bytes: 111,747,114 (~106.57 MiB).
    • Largest document: 1,381,380 bytes (~1.32 MiB).
    • Source file size: 148,486,852 bytes (~141.61 MiB).

Run

# Michigan Statutes (JSONL)
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify --dist-name markdownify --label markdownify

# Law websites (CSV)
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify --dist-name markdownify --label markdownify

Python binding comparison (both run through Python, 2026-01-28, Apple M3, macOS 14.6 / Darwin 24.6.0, Python 3.13.0)

Michigan Statutes (JSONL)

  • markdownify_rs convert_all (241 docs): time 2.266594 s, throughput 42.508 MiB/s
  • markdownify_rs convert_all_batch (241 docs): time 0.538012 s, throughput 179.084 MiB/s
  • markdownify_rs convert_largest (8,034,686 bytes): time 187.941 ms, throughput 40.771 MiB/s
  • markdownify convert_all (241 docs): time 29.654787 s, throughput 3.249 MiB/s
  • markdownify convert_largest (8,034,686 bytes): time 4.496880 s, throughput 1.704 MiB/s

Speedup summary (wall-clock time, lower is better)

Scenario markdownify_rs time markdownify_rs batch time markdownify time Speedup (rs vs py) Speedup (batch vs py) Batch vs rs
convert_all 2.266594 s 0.538012 s 29.654787 s 13.08x (+1208.34%) 55.12x (+5411.92%) 4.21x (+321.29%)
convert_largest 187.941 ms n/a 4.496880 s 23.93x (+2292.71%) n/a n/a

Law websites (CSV)

  • markdownify_rs convert_all (3,136 docs): time 2.596691 s, throughput 41.041 MiB/s
  • markdownify_rs convert_all_batch (3,136 docs): time 0.672013 s, throughput 158.584 MiB/s
  • markdownify_rs convert_largest (1,381,380 bytes): time 54.482 ms, throughput 24.180 MiB/s
  • markdownify convert_all (3,136 docs): time 17.680570 s, throughput 6.028 MiB/s
  • markdownify convert_largest (1,381,380 bytes): time 280.459 ms, throughput 4.697 MiB/s

Speedup summary (wall-clock time, lower is better)

Scenario markdownify_rs time markdownify_rs batch time markdownify time Speedup (rs vs py) Speedup (batch vs py) Batch vs rs
convert_all 2.596691 s 0.672013 s 17.680570 s 6.81x (+580.89%) 26.31x (+2530.99%) 3.86x (+286.40%)
convert_largest 54.482 ms n/a 280.459 ms 5.15x (+414.77%) n/a n/a

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

markdownify_rs-0.1.1.tar.gz (74.2 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

markdownify_rs-0.1.1-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ ARM64

markdownify_rs-0.1.1-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ ARM64

markdownify_rs-0.1.1-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ ARM64

markdownify_rs-0.1.1-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ ARM64

markdownify_rs-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8+manylinux: glibc 2.17+ x86-64

markdownify_rs-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded CPython 3.8+manylinux: glibc 2.17+ ARM64

markdownify_rs-0.1.1-cp38-abi3-macosx_11_0_arm64.whl (1.4 MB view details)

Uploaded CPython 3.8+macOS 11.0+ ARM64

File details

Details for the file markdownify_rs-0.1.1.tar.gz.

File metadata

  • Download URL: markdownify_rs-0.1.1.tar.gz
  • Upload date:
  • Size: 74.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.2

File hashes

Hashes for markdownify_rs-0.1.1.tar.gz
Algorithm Hash digest
SHA256 e7b6901cb53084a160f52a28073c945625df11948819118d88054b393cd0cf77
MD5 bd5698a15266a259dce5d49d74e2aebc
BLAKE2b-256 5a1896edfb18e61474a8e8adb5833c868d15c5752c7fee505bc453e3ac50e05a

See more details on using hashes here.

File details

Details for the file markdownify_rs-0.1.1-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for markdownify_rs-0.1.1-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 3ba5d7620d1fb752650c7a721c70a476a629de4251b4be95eab73236ee0e1a81
MD5 d0c914e50a6d2c6dd66360b2a9235153
BLAKE2b-256 561bf84d20a8d6751bd9b0eaff21198376aea2c56dd8a090cc1c0e35a3caed4a

See more details on using hashes here.

File details

Details for the file markdownify_rs-0.1.1-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for markdownify_rs-0.1.1-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 d68d5e0641844ea90fd3128d5f8f9fd147f89dba30bcc096f9746f7eeab7343e
MD5 ffd81e624cbc80425c1df1276d4f26d9
BLAKE2b-256 5246a634b3a51c00e1d5c9075cdfd5cd8745c7909ca3f90ccfba2cf7f4b890a8

See more details on using hashes here.

File details

Details for the file markdownify_rs-0.1.1-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for markdownify_rs-0.1.1-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 387d3f6fc9da1b44fac0348c68e3d2fd208c420fa6ad48f7647fd0a6fc5ce342
MD5 e0a9108f126987d7cda61c8f2998fd30
BLAKE2b-256 6836e57b2da5fddccdce5ad0b8731e27c7f9ac17ee9a3f3e4965cd3126f60252

See more details on using hashes here.

File details

Details for the file markdownify_rs-0.1.1-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for markdownify_rs-0.1.1-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 3db582e85123bf6b337cc3c1a4953104187242fca99c1a9345b53a526a939645
MD5 c2c3cc3b3ec7517109fd21da3aaeed1c
BLAKE2b-256 b4d74d9a9e4260e9d1ff5505d7c846bd7768c19706b85470f9df5e6632d76246

See more details on using hashes here.

File details

Details for the file markdownify_rs-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for markdownify_rs-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 251172440ae8d2e75938278fe3454f5fb3575aaa57e364ef8fc04f4385a5dad0
MD5 2daa369aeda120ea08a8df5dc4608a18
BLAKE2b-256 cb683ca1d56cb2e0f87c9c5643890b6654785e1c12a8c94791100b1fb39c1f43

See more details on using hashes here.

File details

Details for the file markdownify_rs-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for markdownify_rs-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 77023cf0f15a30b10ba9e10ebe22db29a4c73954ec4bc3148ecef8df2e07b2ec
MD5 33d57438165a971ed64211e59c656a13
BLAKE2b-256 6a76e6e84447ab67dd2a0a2dc9f4b9ec4e9f5171a6c9d25b2004743d0a67d49b

See more details on using hashes here.

File details

Details for the file markdownify_rs-0.1.1-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for markdownify_rs-0.1.1-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a0938e56dabeca70675d425098313fedc656767423bf90108014c58aef0ac53f
MD5 eb290d610cb1b4928329db40f49390e9
BLAKE2b-256 25dc08d359dbeda2079a482b13286b510f9423750bb453f4aa6fd47f3caf383b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page