Rust implementation of Python markdownify with a Python API

markdownify-rs

Rust implementation of Python markdownify with output parity as the primary goal.

Python bindings

Build and install locally with maturin (uv):

uv venv
uv pip install maturin
.venv/bin/maturin develop --features python

Build via pip (PEP 517):

uv pip install .

Usage:

from markdownify_rs import markdownify

print(markdownify("<b>Hello</b>"))

Markdown -> HTML usage (Python-Markdown-style API):

from markdownify_rs import markdown, Markdown

print(markdown("# Hello", extensions=["tables", "footnotes"]))

md = Markdown(extensions=["toc", "admonition"])
print(md.convert("[TOC]\n\n# Title"))
print(md.toc_tokens)  # [{"level": 1, "id": "...", "name": "Title"}]

# Optional CommonMark/GFM-ish toggles (default False):
print(markdown("visit https://example.com", autolink=True))
print(markdown("- [ ] task", tasklist=True))
print(markdown("~~old~~", strikethrough=True))

# Parsing mode:
# - python_compat (default): best-effort Python-Markdown compatibility
# - fast: pure comrak/CommonMark fast path
print(markdown("1) one\n2) two", extensions=["sane_lists"], mode="python_compat"))
print(markdown("1) one\n2) two", extensions=["sane_lists"], mode="fast"))

Full Python markdown-to-HTML API quickstart (all public args):

  • PYTHON_MARKDOWN_TO_HTML_QUICKSTART.md

Batch usage (parallelized in Rust):

from markdownify_rs import markdownify_batch

outputs = markdownify_batch(["<b>Hello</b>", "<i>World</i>"])

Markdown-adjacent utilities (submodule):

from markdownify_rs.markdown_utils import (
    split_into_chunks,
    split_into_chunks_batch,
    coalesce_small_chunks,
    link_percentage,
    link_percentage_batch,
    filter_by_link_percentage,
    strip_links_with_substring,
    strip_links_with_substring_batch,
    remove_large_tables,
    remove_large_tables_batch,
    remove_lines_with_substring,
    remove_lines_with_substring_batch,
    fix_newlines,
    fix_newlines_batch,
    split_on_dividers,
    strip_html_and_contents,
    strip_html_and_contents_batch,
    strip_data_uri_images,
    text_pipeline_batch,
)

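# Sample inputs (placeholders; substitute your own Markdown strings).
text = "# Title\n\nSome text with a [link](javascript:void(0)).\n"
text1, text2 = text, "## Other\n\nPlain paragraph.\n"

chunks = split_into_chunks(text, how="sections")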
chunks_batch = split_into_chunks_batch([text1, text2], how="sections")
cleaned = strip_links_with_substring(text, "javascript")
cleaned_batch = strip_links_with_substring_batch([text1, text2], "javascript")
filtered = filter_by_link_percentage([text1, text2], threshold=0.5)
pipelined = text_pipeline_batch(
    [text1, text2],
    steps=[
        ("strip_links_with_substring", {"substring": "javascript"}),
        ("remove_large_tables", {"max_cells": 200}),
        ("fix_newlines", {}),
    ],
)
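
The `steps` argument above is a list of `(name, kwargs)` pairs applied in order to every text. As a rough illustration of that pattern only, here is a pure-Python sketch of a name-to-function dispatch table. It is not how markdownify_rs implements `text_pipeline_batch` (that runs in Rust), and the step names `drop_lines_containing` and `collapse_blank_lines` are hypothetical stand-ins, not library functions.

```python
import re
from typing import Any, Callable, Dict, List, Tuple


def drop_lines_containing(text: str, substring: str) -> str:
    """Hypothetical step: remove every line that contains `substring`."""
    return "\n".join(line for line in text.split("\n") if substring not in line)


def collapse_blank_lines(text: str) -> str:
    """Hypothetical step: squeeze runs of 3+ newlines down to exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)


# Registry mapping step names to callables (an assumption for this sketch).
STEPS: Dict[str, Callable[..., str]] = {
    "drop_lines_containing": drop_lines_containing,
    "collapse_blank_lines": collapse_blank_lines,
}


def run_pipeline(texts: List[str], steps: List[Tuple[str, Dict[str, Any]]]) -> List[str]:
    """Apply each (name, kwargs) step, in order, to every input text."""
    results = []
    for text in texts:
        for name, kwargs in steps:
            text = STEPS[name](text, **kwargs)
        results.append(text)
    return results


cleaned = run_pipeline(
    ["a\nads here\n\n\n\nb"],
    steps=[
        ("drop_lines_containing", {"substring": "ads"}),
        ("collapse_blank_lines", {}),
    ],
)
print(cleaned)
```

Keeping each step as a plain `str -> str` function means the order of steps is the only coupling between them, which is presumably why the real API takes an ordered list rather than a set of flags.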

Notes:

  • code_language_callback is not yet supported in the Python bindings.

CLI:

markdownify-rs input.html
cat input.html | markdownify-rs

Parity hacks (scraper vs. BeautifulSoup)

These are explicit, ad hoc behaviors added on top of scraper/html5ever to match python-markdownify (BeautifulSoup + html.parser) output. They are intentionally quirky and may be replaced with more “correct” behavior once parity is stable.

  • <br> parser quirk: With BeautifulSoup’s html.parser, if a non‑self‑closing <br> appears before a self‑closing <br/>, the later <br/> can be treated like an opening <br> whose contents run until that implicit <br> is closed (usually when its parent closes). We emulate this by removing the content between that <br/> and the closing tag that ends the implicit <br> (ignoring <br> tags inside comments/scripts), which matches python-markdownify’s output.
  • Leading whitespace reconstruction: html.parser preserves whitespace‑only text nodes that html5ever drops (notably between <html> children and at the start of <body>). We reconstruct the normalized leading whitespace prefix (using the same “single space vs. single newline” rules as BeautifulSoup’s endData) and merge it with the converter output, carrying it across non‑block tags and empty custom elements whose contents are only comments/whitespace.
  • Table header inference: To match python-markdownify, we do not force a “---” separator row for tables whose header row is effectively empty.
  • Top-level <td>/<th> wrapping: If input is a bare <td>/<th>, we wrap it in a <table><tr>…</tr></table> fragment to align with python-markdownify output.
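
The top-level `<td>`/`<th>` wrapping can be illustrated with a short Python sketch. This is not the project's actual code (the converter is written in Rust); `wrap_bare_cell` and its regex are hypothetical names, and the sketch assumes the rule applies whenever the fragment begins with a `<td>` or `<th>` tag.

```python
import re

# Matches a fragment whose first tag is <td ...> or <th ...>.
# The \b keeps tags such as <thead> from matching.
_BARE_CELL = re.compile(r"\s*<(td|th)\b", re.IGNORECASE)


def wrap_bare_cell(html: str) -> str:
    """Wrap a fragment that starts with a table cell in a minimal table."""
    if _BARE_CELL.match(html):
        return f"<table><tr>{html}</tr></table>"
    return html


print(wrap_bare_cell("<td>cell</td>"))
print(wrap_bare_cell("<p>not a cell</p>"))
```

Wrapping before parsing gives an HTML5 parser a valid table context, so the cell is not dropped or re-parented the way a stray `<td>` otherwise would be.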

Benchmarks

Datasets

  • Michigan Statutes (JSONL, 241 HTML documents).
    • Total HTML bytes: 101,029,525 (~96.35 MiB).
    • Largest document: 8,034,686 bytes (~7.66 MiB).
    • Source file size: 102,856,616 bytes (~98.09 MiB).
  • Law websites (CSV, 3,136 HTML documents).
    • Total HTML bytes: 111,747,114 (~106.57 MiB).
    • Largest document: 1,381,380 bytes (~1.32 MiB).
    • Source file size: 148,486,852 bytes (~141.61 MiB).
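
The MiB figures follow from dividing the byte counts by 1,048,576 (1024 × 1024). A small sketch of that conversion, using byte counts copied from the list above (the helper name `to_mib` is just for illustration):

```python
MIB = 1024 * 1024  # bytes per mebibyte


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / MIB


sizes = {
    "Michigan Statutes, total HTML": 101_029_525,
    "Michigan Statutes, largest document": 8_034_686,
    "Law websites, total HTML": 111_747_114,
    "Law websites, largest document": 1_381_380,
}

for label, num_bytes in sizes.items():
    print(f"{label}: {num_bytes:,} bytes (~{to_mib(num_bytes):.2f} MiB)")
```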

Run

# Michigan Statutes (JSONL)
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
MARKDOWNIFY_BENCH_PATH=/path/to/mi_statutes.jsonl .venv/bin/python scripts/bench_python.py --module markdownify --dist-name markdownify --label markdownify

# Law websites (CSV)
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify_rs --dist-name markdownify-rs --label markdownify_rs
.venv/bin/python scripts/bench_python.py --format csv --path /path/to/deleted_pages.csv --module markdownify --dist-name markdownify --label markdownify

Python binding comparison (both run through Python, 2026-01-28, Apple M3, macOS 14.6 / Darwin 24.6.0, Python 3.13.0)

Michigan Statutes (JSONL)

  • markdownify_rs convert_all (241 docs): time 2.266594 s, throughput 42.508 MiB/s
  • markdownify_rs convert_all_batch (241 docs): time 0.538012 s, throughput 179.084 MiB/s
  • markdownify_rs convert_largest (8,034,686 bytes): time 187.941 ms, throughput 40.771 MiB/s
  • markdownify convert_all (241 docs): time 29.654787 s, throughput 3.249 MiB/s
  • markdownify convert_largest (8,034,686 bytes): time 4.496880 s, throughput 1.704 MiB/s

Speedup summary (wall-clock time, lower is better)

| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.266594 s | 0.538012 s | 29.654787 s | 13.08x (+1208.34%) | 55.12x (+5411.92%) | 4.21x (+321.29%) |
| convert_largest | 187.941 ms | n/a | 4.496880 s | 23.93x (+2292.71%) | n/a | n/a |
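
For readers who want to check the derived columns, the sketch below recomputes them from the raw timings. It assumes speedup is the baseline time divided by the candidate time, that the percentage is `(speedup - 1) * 100`, and that throughput is total MiB divided by seconds. The function names are illustrative and are not taken from the benchmark script.

```python
def speedup(baseline_s: float, candidate_s: float) -> tuple:
    """Return (ratio, percent faster) of candidate relative to baseline."""
    ratio = baseline_s / candidate_s
    return ratio, (ratio - 1) * 100


def throughput_mib_s(total_bytes: int, seconds: float) -> float:
    """Throughput in MiB/s for `total_bytes` processed in `seconds`."""
    return total_bytes / (1024 * 1024) / seconds


# Michigan Statutes convert_all timings from the results above.
rs_time, batch_time, py_time = 2.266594, 0.538012, 29.654787

ratio, pct = speedup(py_time, rs_time)
print(f"rs vs py: {ratio:.2f}x (+{pct:.2f}%)")

ratio, pct = speedup(py_time, batch_time)
print(f"batch vs py: {ratio:.2f}x (+{pct:.2f}%)")

print(f"rs throughput: {throughput_mib_s(101_029_525, rs_time):.3f} MiB/s")
```

The same formulas apply to the Law websites results.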

Law websites (CSV)

  • markdownify_rs convert_all (3,136 docs): time 2.596691 s, throughput 41.041 MiB/s
  • markdownify_rs convert_all_batch (3,136 docs): time 0.672013 s, throughput 158.584 MiB/s
  • markdownify_rs convert_largest (1,381,380 bytes): time 54.482 ms, throughput 24.180 MiB/s
  • markdownify convert_all (3,136 docs): time 17.680570 s, throughput 6.028 MiB/s
  • markdownify convert_largest (1,381,380 bytes): time 280.459 ms, throughput 4.697 MiB/s

Speedup summary (wall-clock time, lower is better)

| Scenario | markdownify_rs time | markdownify_rs batch time | markdownify time | Speedup (rs vs py) | Speedup (batch vs py) | Batch vs rs |
| --- | --- | --- | --- | --- | --- | --- |
| convert_all | 2.596691 s | 0.672013 s | 17.680570 s | 6.81x (+580.89%) | 26.31x (+2530.99%) | 3.86x (+286.40%) |
| convert_largest | 54.482 ms | n/a | 280.459 ms | 5.15x (+414.77%) | n/a | n/a |

Markdown -> HTML parity/speed report:

.venv/bin/python scripts/report_markdown_to_html_parity_speed.py \
  --corpus-dir /tmp/test_markdowns \
  --report BENCHMARKS_MARKDOWN_TO_HTML.md
