JustHTML

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

📖 Full documentation | 🛝 Try it in the Playground

Why use JustHTML?

  • Just... Correct ✅: Spec-perfect HTML5 parsing with browser-grade error recovery. Passes the official 9k+ html5lib-tests suite, with 100% line+branch coverage. (Correctness)

    from justhtml import JustHTML
    
    JustHTML("<p><b>Hi<i>there</b>!", fragment=True).to_html(pretty=False)
    # => <p><b>Hi<i>there</i></b><i>!</i></p>
    
    # Note: fragment=True parses snippets (no <html>/<body> needed)
    
  • Just... Python 🐍: Pure Python, zero dependencies. No C extensions or system libraries, easy to debug, and works anywhere Python runs, including PyPy and Pyodide. (Run in the browser)

    python -m pip show justhtml | grep -E '^Requires:'
    # Requires: [intentionally left blank]
    
  • Just... Secure 🔒: Safe-by-default sanitization at construction time. Built-in Bleach-style allowlist sanitization on JustHTML(...) (disable with safe=False). Can also sanitize inline CSS rules. (Sanitization & Security)

    from justhtml import JustHTML
    
    JustHTML(
        "<p>Hello<script>alert(1)</script> "
        "<a href=\"javascript:alert(1)\">bad</a> "
        "<a href=\"https://example.com/?a=1&b=2\">ok</a></p>",
        fragment=True,
    ).to_html()
    # => <p>Hello <a>bad</a> <a href="https://example.com/?a=1&amp;b=2">ok</a></p>
    
  • Just... Query 🔍: CSS selectors out of the box. One method (query()), familiar syntax (combinators, groups, pseudo-classes), and plain Python nodes as results. (CSS Selectors)

    from justhtml import JustHTML
    
    JustHTML(
        "<div><p class=\"x\">Hi</p><p>Bye</p></div>",
        fragment=True,
    ).query("div p.x")[0].to_html(pretty=False)
    # => <p class="x">Hi</p>
    
  • Just... Transform 🏗️: Built-in DOM transforms to drop or unwrap nodes, rewrite attributes, linkify text, and compose safe pipelines. (Transforms)

    from justhtml import JustHTML, Linkify, SetAttrs, Unwrap
    
    doc = JustHTML(
        "<p>Hello <span class=\"x\">world</span> example.com</p>",
        transforms=[
            Unwrap("span.x"),
            Linkify(),
            SetAttrs("a", rel="nofollow"),
        ],
        fragment=True,
        safe=False,
    )
    print(doc.to_html(pretty=False))
    # => <p>Hello world <a href="http://example.com" rel="nofollow">example.com</a></p>
    
  • Just... Fast Enough ⚡: Fast for the common case (the fastest pure-Python HTML5 parser available); for terabytes of input, use a C/Rust parser like html5ever. (Benchmarks)

    /usr/bin/time -f '%e s' bash -lc \
      "curl -Ls https://en.wikipedia.org/wiki/HTML | python -m justhtml - > /dev/null"
    # 0.41 s
    

Comparison

| Tool | HTML5 parsing [1][2] | Speed | CSS query | Sanitizes output | Notes |
|---|---|---|---|---|---|
| JustHTML (Pure Python) | ✅ 100% | ⚡ Fast | ✅ CSS selectors | ✅ Built-in (safe=True) | Correct, easy to install, and fast enough. |
| Chromium (browser engine) | ✅ 99% | 🚀 Very Fast | - | - | - |
| WebKit (browser engine) | ✅ 98% | 🚀 Very Fast | - | - | - |
| Firefox (browser engine) | ✅ 97% | 🚀 Very Fast | - | - | - |
| html5lib (Pure Python) | 🟡 88% | 🐢 Slow | 🟡 XPath (lxml) | 🔴 Deprecated | Unmaintained. Reference implementation; correct but quite slow. |
| html5_parser (Python wrapper of C-based Gumbo) | 🟡 84% | 🚀 Very Fast | 🟡 XPath (lxml) | ❌ Needs sanitization | Fast and mostly correct. |
| selectolax (Python wrapper of C-based Lexbor) | 🟡 68% | 🚀 Very Fast | ✅ CSS selectors | ❌ Needs sanitization | Very fast but less compliant. |
| html.parser (Python stdlib) | 🔴 4% | ⚡ Fast | ❌ None | ❌ Needs sanitization | Standard library. Chokes on malformed HTML. |
| BeautifulSoup (Pure Python) | 🔴 4% (default) | 🐢 Slow | 🟡 Custom API | ❌ Needs sanitization | Wraps html.parser (default). Can use lxml or html5lib. |
| lxml (Python wrapper of C-based libxml2) | 🔴 1% | 🚀 Very Fast | 🟡 XPath | ❌ Needs sanitization | Fast but not HTML5 compliant. Don't use the old lxml.html.clean module! |

[1]: Parser compliance scores are from a strict run of the html5lib-tests tree-construction fixtures (1,743 non-script tests). See docs/correctness.md for details.

[2]: Browser numbers are from justhtml-html5lib-tests-bench on the upstream html5lib-tests/tree-construction corpus (excluding 12 scripting-enabled cases).
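
To make the html.parser row more concrete, here is a minimal sketch that uses only the standard library. It feeds the misnested snippet from the "Just... Correct" example to html.parser and records the tag events it reports. The `EventLogger` class is a name made up for this illustration. html.parser is a tokenizer rather than a tree builder, so it reports tags as written and leaves any repair to the caller.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record tag events exactly as html.parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed("<p><b>Hi<i>there</b>!")
logger.close()

# The only end tag reported is the </b> that appears in the input;
# <i> and <p> are never closed, and nothing is re-nested.
print(logger.events)
# [('start', 'p'), ('start', 'b'), ('start', 'i'), ('end', 'b')]
```

An HTML5-compliant parser instead applies the spec's error-recovery rules, which is where the re-nested `<i>` elements in the JustHTML output come from.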

Installation

pip install justhtml

Next: Quickstart Guide, CSS Selectors, Sanitization & Security, or try the Playground.

Requires Python 3.10 or later.
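
If you want to confirm the requirement from a script before relying on the package, something along these lines may help. This is a small sketch using only the standard library; `check_environment` and `MIN_PYTHON` are hypothetical names for this example, not part of JustHTML.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # from "Requires Python 3.10 or later"


def check_environment():
    """Return (ok, message) describing whether JustHTML can be used here."""
    if sys.version_info < MIN_PYTHON:
        found = ".".join(map(str, sys.version_info[:3]))
        return False, f"Python {found} is too old; 3.10 or later is required"
    # find_spec returns None when no top-level module of that name is installed.
    if importlib.util.find_spec("justhtml") is None:
        return False, "justhtml is not installed; run: pip install justhtml"
    return True, "justhtml is importable"


ok, message = check_environment()
print(message)
```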

Quick Example

from justhtml import JustHTML

doc = JustHTML("<html><body><p class='intro'>Hello!</p></body></html>")

# Query with CSS selectors
for p in doc.query("p.intro"):
    print(p.name)        # "p"
    print(p.attrs)       # {"class": "intro"}
    print(p.to_html())   # <p class="intro">Hello!</p>

See the Quickstart Guide for more examples including tree traversal, streaming, and strict mode.
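
Because query() results expose plain attributes such as name and attrs, small extraction helpers can be ordinary Python. The sketch below assumes attrs behaves like a dict, as the printed output in the Quick Example suggests. `link_targets` is a hypothetical helper, not part of JustHTML, and it only relies on the query() method shown above.

```python
def link_targets(doc):
    """Collect href values from every <a> element that has one.

    `doc` is anything with a query(selector) method whose results expose
    a dict-like `attrs`, such as a JustHTML document.
    """
    return [a.attrs["href"] for a in doc.query("a") if "href" in a.attrs]
```

Called as `link_targets(JustHTML(html, fragment=True))`, this should skip links whose href was removed by the default sanitization, such as the javascript: link in the "Just... Secure" example.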

Command Line

If you installed JustHTML (for example with pip install justhtml or pip install -e .), you can use the justhtml command. If you don't have it available, use the equivalent python -m justhtml ... form instead.

# Pretty-print an HTML file
justhtml index.html

# Parse from stdin
curl -s https://example.com | justhtml -

# Select nodes and output text
justhtml index.html --selector "main p" --format text

# Select nodes and output Markdown (subset of GFM)
justhtml index.html --selector "article" --format markdown

# Select nodes and output HTML
justhtml index.html --selector "a" --format html

# Example: extract Markdown from GitHub README HTML
curl -s https://github.com/EmilStenstrom/justhtml/ | justhtml - --selector '.markdown-body' --format markdown | head -n 15

Output:

# JustHTML

[](#justhtml)

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

**[📖 Read the full documentation here](/EmilStenstrom/justhtml/blob/main/docs/index.md)**

## Why use JustHTML?

- **Just... Correct ✅** - Spec-perfect HTML5 parsing with browser-grade error recovery - passes the official 9k+ [html5lib-tests](https://github.com/html5lib/html5lib-tests) suite, with 100% line+branch coverage. ([Correctness](/EmilStenstrom/justhtml/blob/main/docs/correctness.md))
- **Just... Python 🐍** - Pure Python, zero dependencies - no C extensions or system libraries, easy to debug, and works anywhere Python runs (including PyPy and Pyodide). ([Quickstart](/EmilStenstrom/justhtml/blob/main/docs/quickstart.md))
- **Just... Secure 🔒** - Safe-by-default sanitization at construction time - built-in Bleach-style allowlist sanitization on `JustHTML(...)` (disable with `safe=False`), plus URL/CSS rules. ([Sanitization & Security](/EmilStenstrom/justhtml/blob/main/docs/sanitization.md))
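
When driving the command line from Python, the fallback described at the start of this section (use justhtml if it is on PATH, otherwise python -m justhtml) can be sketched as follows. The helper names `justhtml_argv` and `run_justhtml` are hypothetical. The only CLI options used are the ones shown above: a file path or - for stdin, --selector, and --format.

```python
import shutil
import subprocess
import sys


def justhtml_argv(source, selector=None, fmt=None):
    """Build an argv list for the justhtml CLI.

    Uses the `justhtml` executable when it is on PATH, otherwise falls
    back to running the module with the current interpreter.
    """
    if shutil.which("justhtml"):
        argv = ["justhtml"]
    else:
        argv = [sys.executable, "-m", "justhtml"]
    argv.append(source)
    if selector is not None:
        argv += ["--selector", selector]
    if fmt is not None:
        argv += ["--format", fmt]
    return argv


def run_justhtml(html_text, selector=None, fmt=None):
    """Pipe html_text to the CLI on stdin ("-") and return its stdout."""
    result = subprocess.run(
        justhtml_argv("-", selector, fmt),
        input=html_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, `run_justhtml(page, selector="article", fmt="markdown")` mirrors the Markdown command above, with the HTML supplied from a string instead of a file.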

Security

For security policy and vulnerability reporting, please see SECURITY.md.

Contributing

See CONTRIBUTING.md for development setup and guidelines.

Acknowledgments

JustHTML started as a Python port of html5ever, the HTML5 parser from Mozilla's Servo browser engine. While the codebase has since evolved significantly, html5ever's clean architecture and spec-compliant approach were invaluable as a starting point. Thank you to the Servo team for their excellent work.

Correctness and conformance work is heavily guided by the html5lib ecosystem and especially the official html5lib-tests fixtures used across implementations.

The sanitization API and threat-model expectations are informed by established Python sanitizers like Bleach and nh3.

The CSS selector query API is inspired by the ergonomics of lxml.cssselect.

License

MIT. Free for both commercial and non-commercial use.
