Rust implementation of Python markdownify with a Python API

markdownify-rs

Rust implementation of Python markdownify with output parity as the primary goal.

Python bindings

Build and install locally with maturin (uv):

uv venv
uv pip install maturin
.venv/bin/maturin develop --features python

Build via pip (PEP 517):

uv pip install .

Usage:

from markdownify_rs import markdownify

print(markdownify("<b>Hello</b>"))

Batch usage (parallelized in Rust):

from markdownify_rs import markdownify_batch

outputs = markdownify_batch(["<b>Hello</b>", "<i>World</i>"])

Markdown-adjacent utilities (submodule):

from markdownify_rs.markdown_utils import (
    split_into_chunks,
    split_into_chunks_batch,
    coalesce_small_chunks,
    link_percentage,
    link_percentage_batch,
    filter_by_link_percentage,
    strip_links_with_substring,
    strip_links_with_substring_batch,
    remove_large_tables,
    remove_large_tables_batch,
    remove_lines_with_substring,
    remove_lines_with_substring_batch,
    fix_newlines,
    fix_newlines_batch,
    split_on_dividers,
    strip_html_and_contents,
    strip_html_and_contents_batch,
    strip_data_uri_images,
    text_pipeline_batch,
)

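# Sample inputs so the calls below run as written.
text = "# Title\n\nSee [docs](https://example.com)."
text1, text2 = text, "Plain paragraph."

chunks = split_into_chunks(text, how="sections")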
chunks_batch = split_into_chunks_batch([text1, text2], how="sections")
cleaned = strip_links_with_substring(text, "javascript")
cleaned_batch = strip_links_with_substring_batch([text1, text2], "javascript")
filtered = filter_by_link_percentage([text1, text2], threshold=0.5)
pipelined = text_pipeline_batch(
    [text1, text2],
    steps=[
        ("strip_links_with_substring", {"substring": "javascript"}),
        ("remove_large_tables", {"max_cells": 200}),
        ("fix_newlines", {}),
    ],
)
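
The steps argument above is a list of (name, kwargs) pairs applied in order to every document. As a rough illustration of that shape only, here is a pure-Python sketch of such a dispatcher. The step functions (collapse_newlines, drop_lines_containing) and run_pipeline are hypothetical stand-ins written for this example; they are not the library's implementations, and the real steps may behave differently.

```python
import re

# Hypothetical stand-in steps; NOT the markdownify_rs implementations.
def collapse_newlines(text):
    """Collapse runs of three or more newlines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)

def drop_lines_containing(text, substring):
    """Drop every line that contains the given substring."""
    return "\n".join(line for line in text.split("\n") if substring not in line)

# Registry mapping step names to callables (assumed design, for illustration).
STEPS = {
    "collapse_newlines": collapse_newlines,
    "drop_lines_containing": drop_lines_containing,
}

def run_pipeline(texts, steps):
    """Apply each (name, kwargs) step, in order, to every input text."""
    results = []
    for text in texts:
        for name, kwargs in steps:
            text = STEPS[name](text, **kwargs)
        results.append(text)
    return results

result = run_pipeline(
    ["a\n\n\n\nb", "keep\nads here\nend"],
    steps=[
        ("drop_lines_containing", {"substring": "ads"}),
        ("collapse_newlines", {}),
    ],
)
print(result)
```

Passing steps as data rather than as nested function calls keeps the whole pipeline description in one argument, which is presumably what lets the real batch function run every step without returning to Python between them.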

Notes:

  • code_language_callback is not yet supported in the Python bindings.

CLI:

markdownify-rs input.html
cat input.html | markdownify-rs

Parity hacks (scraper vs. BeautifulSoup)

These are explicit, ad hoc behaviors added on top of scraper/html5ever to match python-markdownify (BeautifulSoup + html.parser) output. They are intentionally quirky and may be replaced with more “correct” behavior once parity is stable.

  • <br> parser quirk: With BeautifulSoup’s html.parser, if a non‑self‑closing <br> appears before a self‑closing <br/>, the later <br/> can be treated as an opening <br> whose contents run until that implicit <br> is closed (usually when its parent closes). To match python-markdownify’s output, we remove the content between that <br/> and the closing tag that ends the implicit <br>, ignoring <br> tags inside comments and scripts.
  • Leading whitespace reconstruction: html.parser preserves whitespace‑only text nodes that html5ever drops (notably between <html> children and at the start of <body>). We reconstruct the normalized leading whitespace prefix (using the same “single space vs. single newline” rules as BeautifulSoup’s endData) and merge it with the converter output, carrying it across non‑block tags and empty custom elements whose contents are only comments/whitespace.
  • Table header inference: To match python-markdownify, we do not force a “---” separator row for tables whose header row is effectively empty.
  • Top-level <td>/<th> wrapping: If the input is a bare <td>/<th>, we wrap it in a <table><tr>…</tr></table> fragment to align with python-markdownify output.
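
The <br> quirk in the first bullet depends on the parser telling <br> and <br/> apart. The sketch below uses only Python's standard-library html.parser (the tokenizer underneath BeautifulSoup's html.parser backend) to show that the two forms arrive as different events. It does not reproduce BeautifulSoup's tree-building behavior or the content removal described above; the BrEvents class and br_events helper are names made up for this example.

```python
from html.parser import HTMLParser

class BrEvents(HTMLParser):
    """Record the tokenizer events emitted for a small HTML snippet."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for <br> (no trailing slash).
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for <br/> (XHTML-style self-closing form).
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

def br_events(html):
    """Return the list of (kind, value) events for the given HTML."""
    parser = BrEvents()
    parser.feed(html)
    parser.close()
    return parser.events

events = br_events("<p>a<br>b<br/>c</p>")
for event in events:
    print(event)
```

Because the tokenizer reports the two spellings through different callbacks, a tree builder on top of it can end up handling them differently, which is the kind of asymmetry the parity hack has to emulate on the html5ever side.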

Benchmarks

Datasets

  • Michigan Statutes (JSONL, 241 HTML documents).
    • Total HTML bytes: 101,029,525 (~96.35 MiB).
    • Largest document: 8,034,686 bytes (~7.66 MiB).
    • Source file size: 102,856,616 bytes (~98.10 MiB).
  • Law websites (CSV, 3,136 HTML documents).
    • Total HTML bytes: 111,747,114 (~106.57 MiB).
    • Largest document: 1,381,380 bytes (~1.32 MiB).
    • Source file size: 148,486,852 bytes (~141.61 MiB).

Run

# Michigan Statutes (JSONL)
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify --dist-name markdownify --label markdownify

# Law websites (CSV)
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify --dist-name markdownify --label markdownify
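
The contents of scripts/bench_python.py are not shown here. As a minimal sketch of what a convert_all style measurement could look like, assuming wall-clock timing with time.perf_counter and throughput computed over UTF-8 input bytes, consider the following. The bench function and its result keys are hypothetical, and str.upper stands in for a real converter so the example runs without any third-party package.

```python
import time

MIB = 1024 * 1024  # bytes per MiB

def bench(convert, docs):
    """Time convert() over all docs and report size, time, and throughput."""
    total_bytes = sum(len(doc.encode("utf-8")) for doc in docs)
    start = time.perf_counter()
    for doc in docs:
        convert(doc)
    seconds = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    mib_per_s = (total_bytes / MIB) / seconds if seconds > 0 else float("inf")
    return {
        "docs": len(docs),
        "bytes": total_bytes,
        "seconds": seconds,
        "mib_per_s": mib_per_s,
    }

# str.upper is only a placeholder converter for this sketch.
report = bench(str.upper, ["<b>Hello</b>", "<i>World</i>"])
print(report["docs"], report["bytes"])
```

To compare real converters, one would pass markdownify from each module as the convert argument and load the documents from the JSONL or CSV file first, keeping file loading outside the timed region.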

Python binding comparison (both run through Python, 2026-01-28, Apple M3, macOS 14.6 / Darwin 24.6.0, Python 3.13.0)

Michigan Statutes (JSONL)

  • markdownify_rs convert_all (241 docs): time 2.266594 s, throughput 42.508 MiB/s
  • markdownify_rs convert_all_batch (241 docs): time 0.538012 s, throughput 179.084 MiB/s
  • markdownify_rs convert_largest (8,034,686 bytes): time 187.941 ms, throughput 40.771 MiB/s
  • markdownify convert_all (241 docs): time 29.654787 s, throughput 3.249 MiB/s
  • markdownify convert_largest (8,034,686 bytes): time 4.496880 s, throughput 1.704 MiB/s

Speedup summary (wall-clock time, lower is better)

| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.266594 s | 0.538012 s | 29.654787 s | 13.08x (+1208.34%) | 55.12x (+5411.92%) | 4.21x (+321.29%) |
| convert_largest | 187.941 ms | n/a | 4.496880 s | 23.93x (+2292.71%) | n/a | n/a |
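
The throughput and speedup figures above follow from the reported byte counts and times. The sketch below reproduces the convert_all numbers for the Michigan Statutes dataset, assuming throughput is input bytes divided by 2^20 and by seconds, and that the percentage is (speedup - 1) × 100. The helper names are made up for this example.

```python
MIB = 1024 * 1024  # bytes per MiB

def throughput_mib_s(total_bytes, seconds):
    """Input size in MiB divided by wall-clock seconds."""
    return total_bytes / MIB / seconds

def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

# Figures reported above for Michigan Statutes, convert_all.
total_bytes = 101_029_525
rs_seconds = 2.266594
py_seconds = 29.654787

rs_throughput = throughput_mib_s(total_bytes, rs_seconds)
ratio = speedup(py_seconds, rs_seconds)
percent_faster = (ratio - 1) * 100

print(f"{rs_throughput:.3f} MiB/s")            # 42.508 MiB/s
print(f"{ratio:.2f}x (+{percent_faster:.2f}%)")  # 13.08x (+1208.34%)
```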

Law websites (CSV)

  • markdownify_rs convert_all (3,136 docs): time 2.596691 s, throughput 41.041 MiB/s
  • markdownify_rs convert_all_batch (3,136 docs): time 0.672013 s, throughput 158.584 MiB/s
  • markdownify_rs convert_largest (1,381,380 bytes): time 54.482 ms, throughput 24.180 MiB/s
  • markdownify convert_all (3,136 docs): time 17.680570 s, throughput 6.028 MiB/s
  • markdownify convert_largest (1,381,380 bytes): time 280.459 ms, throughput 4.697 MiB/s

Speedup summary (wall-clock time, lower is better)

| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.596691 s | 0.672013 s | 17.680570 s | 6.81x (+580.89%) | 26.31x (+2530.99%) | 3.86x (+286.40%) |
| convert_largest | 54.482 ms | n/a | 280.459 ms | 5.15x (+414.77%) | n/a | n/a |

Download files

Source Distribution

  • markdownify_rs-0.1.2.tar.gz (78.4 kB): Source

Built Distributions

  • markdownify_rs-0.1.2-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • markdownify_rs-0.1.2-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • markdownify_rs-0.1.2-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • markdownify_rs-0.1.2-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • markdownify_rs-0.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB): CPython 3.8+, manylinux (glibc 2.17+), x86-64
  • markdownify_rs-0.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB): CPython 3.8+, manylinux (glibc 2.17+), ARM64
  • markdownify_rs-0.1.2-cp38-abi3-macosx_11_0_arm64.whl (1.5 MB): CPython 3.8+, macOS 11.0+, ARM64
