A pure Python HTML5 parser that just works.

These details have been verified by PyPI

Maintainers

EmilStenstrom

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

JustHTML

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

📖 Full documentation | 🛝 Try it in the Playground

Why use JustHTML?

Just... Correct ✅ — Spec-perfect HTML5 parsing with browser-grade error recovery — passes the official 9k+ html5lib-tests suite, with 100% line+branch coverage. (Correctness)

```
from justhtml import JustHTML

# fragment=True parses snippets (no <html>/<body> needed)
JustHTML("<p><b>Hi<i>there</b>!", fragment=True).to_html(pretty=False)
# => <p><b>Hi<i>there</i></b><i>!</i></p>
```

Just... Python 🐍 — Pure Python, zero dependencies — no C extensions or system libraries, easy to debug, and works anywhere Python runs, including PyPy and Pyodide. (Run in the browser)
```
python -m pip show justhtml | grep -E '^Requires:'
# Requires: [intentionally left blank]
```

Just... Secure 🔒 — Safe-by-default sanitization at construction time — built-in Bleach-style allowlist sanitization on JustHTML(...) (disable with safe=False). Can sanitize inline CSS rules. (Sanitization & Security)

```
from justhtml import JustHTML

JustHTML(
    "<p>Hello<script>alert(1)</script> "
    "<a href=\"javascript:alert(1)\">bad</a> "
    "<a href=\"https://example.com/?a=1&b=2\">ok</a></p>",
    fragment=True,
).to_html()
# => <p>Hello <a>bad</a> <a href="https://example.com/?a=1&amp;b=2">ok</a></p>
```
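To make the allowlist idea concrete, here is a minimal sketch built on the standard library's `html.parser`. It is not how JustHTML implements sanitization. The names `AllowlistSketch`, `sanitize_sketch`, and the `ALLOWED_*` constants are hypothetical, chosen for illustration only. Because it works on raw tokenizer events rather than a spec-compliant tree, treat it as a teaching aid and not as something to run on untrusted input.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical policy for this sketch only; a real policy would be larger.
ALLOWED_TAGS = {"p", "a", "b", "i"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https"}
DROP_WITH_CONTENT = {"script", "style"}


class AllowlistSketch(HTMLParser):
    """Re-emit only allowlisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            if name == "href":
                try:
                    scheme = urlsplit(value).scheme.lower()
                except ValueError:
                    continue  # unparseable URL: drop the attribute
                if scheme not in ALLOWED_SCHEMES:
                    continue  # drops javascript:, data:, and relative URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize_sketch(markup):
    parser = AllowlistSketch()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

The design choice worth noticing is that everything is dropped unless the policy names it, which is the opposite of a blocklist that tries to enumerate dangerous input.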

Just... Query 🔍 — CSS selectors out of the box — one method (query()), familiar syntax (combinators, groups, pseudo-classes), and plain Python nodes as results. (CSS Selectors)
```
from justhtml import JustHTML

JustHTML(
    "<div><p class=\"x\">Hi</p><p>Bye</p></div>",
    fragment=True,
).query("div p.x")[0].to_html(pretty=False)
# => <p class="x">Hi</p>
```

Just... Transform 🏗️ — Built-in DOM transforms for: drop/unwrap nodes, rewrite attributes, linkify text, and compose safe pipelines. (Transforms)

```
from justhtml import JustHTML, Linkify, SetAttrs, Unwrap

doc = JustHTML(
    "<p>Hello <span class=\"x\">world</span> example.com</p>",
    transforms=[
        Unwrap("span.x"),
        Linkify(),
        SetAttrs("a", rel="nofollow"),
    ],
    fragment=True,
    safe=False,
)
print(doc.to_html(pretty=False))
# => <p>Hello world <a href="http://example.com" rel="nofollow">example.com</a></p>
```

Just... Fast Enough ⚡ — Fast for the common case (fastest pure-Python HTML5 parser available); for terabytes, use a C/Rust parser like html5ever. (Benchmarks)
```
/usr/bin/time -f '%e s' bash -lc \
  "curl -Ls https://en.wikipedia.org/wiki/HTML | python -m justhtml - > /dev/null"
# 0.41 s
```

Comparison

Tool	HTML5 parsing [1][2]	Speed	CSS query	Sanitizes output	Notes
JustHTML Pure Python	✅ 100%	⚡ Fast	✅ CSS selectors	✅ Built-in (`safe=True`)	Correct, easy to install, and fast enough.
Chromium browser engine	✅ 99%	🚀 Very Fast	—	—	—
WebKit browser engine	✅ 98%	🚀 Very Fast	—	—	—
Firefox browser engine	✅ 97%	🚀 Very Fast	—	—	—
`html5lib` Pure Python	🟡 88%	🐢 Slow	🟡 XPath (lxml)	🔴 Deprecated	Unmaintained. Reference implementation; correct but quite slow.
`html5_parser` Python wrapper of C-based Gumbo	🟡 84%	🚀 Very Fast	🟡 XPath (lxml)	❌ Needs sanitization	Fast and mostly correct.
`selectolax` Python wrapper of C-based Lexbor	🟡 68%	🚀 Very Fast	✅ CSS selectors	❌ Needs sanitization	Very fast but less compliant.
`html.parser` Python stdlib	🔴 4%	⚡ Fast	❌ None	❌ Needs sanitization	Standard library. Chokes on malformed HTML.
`BeautifulSoup` Pure Python	🔴 4% (default)	🐢 Slow	🟡 Custom API	❌ Needs sanitization	Wraps `html.parser` (default). Can use lxml or html5lib.
`lxml` Python wrapper of C-based libxml2	🔴 1%	🚀 Very Fast	🟡 XPath	❌ Needs sanitization	Fast but not HTML5 compliant. Don't use the old lxml.html.clean module!

[1]: Parser compliance scores are from a strict run of the html5lib-tests tree-construction fixtures (1,743 non-script tests). See docs/correctness.md for details.

[2]: Browser numbers are from justhtml-html5lib-tests-bench on the upstream html5lib-tests/tree-construction corpus (excluding 12 scripting-enabled cases).
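As one way to read the `html.parser` row, the sketch below feeds the misnested snippet from the correctness example to the standard library tokenizer and records what it reports. `EventLogger` and `tokenize` are hypothetical names used only here. The point is that `html.parser` emits events as written and does not perform HTML5 tree construction, so the unclosed `<i>` and `<p>` are never repaired.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record tokenizer events exactly as html.parser reports them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(markup):
    parser = EventLogger()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.events


events = tokenize("<p><b>Hi<i>there</b>!")
for event in events:
    print(event)
# No ("end", "i") or ("end", "p") event is produced: nothing reopens <i>
# after </b>, which is the repair an HTML5 tree builder performs.
```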

Installation

```
pip install justhtml
```

Next: Quickstart Guide, CSS Selectors, Sanitization & Security, or try the Playground.

Requires Python 3.10 or later.

Quick Example

```
from justhtml import JustHTML

doc = JustHTML("<html><body><p class='intro'>Hello!</p></body></html>")

# Query with CSS selectors
for p in doc.query("p.intro"):
    print(p.name)        # "p"
    print(p.attrs)       # {"class": "intro"}
    print(p.to_html())   # <p class="intro">Hello!</p>
```

See the Quickstart Guide for more examples including tree traversal, streaming, and strict mode.

Command Line

If you installed JustHTML (for example with `pip install justhtml` or `pip install -e .`), you can use the `justhtml` command. If you don't have it available, use the equivalent `python -m justhtml ...` form instead.

```
# Pretty-print an HTML file
justhtml index.html

# Parse from stdin
curl -s https://example.com | justhtml -

# Select nodes and output text
justhtml index.html --selector "main p" --format text

# Select nodes and output Markdown (subset of GFM)
justhtml index.html --selector "article" --format markdown

# Select nodes and output HTML
justhtml index.html --selector "a" --format html

# Example: extract Markdown from GitHub README HTML
curl -s https://github.com/EmilStenstrom/justhtml/ | justhtml - --selector '.markdown-body' --format markdown | head -n 15
```

Output:

# JustHTML

[](#justhtml)

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

**[📖 Read the full documentation here](/EmilStenstrom/justhtml/blob/main/docs/index.md)**

## Why use JustHTML?

- **Just... Correct ✅** — Spec-perfect HTML5 parsing with browser-grade error recovery — passes the official 9k+ [html5lib-tests](https://github.com/html5lib/html5lib-tests) suite, with 100% line+branch coverage. ([Correctness](/EmilStenstrom/justhtml/blob/main/docs/correctness.md))
- **Just... Python 🐍** — Pure Python, zero dependencies — no C extensions or system libraries, easy to debug, and works anywhere Python runs (including PyPy and Pyodide). ([Quickstart](/EmilStenstrom/justhtml/blob/main/docs/quickstart.md))
- **Just... Secure 🔒** — Safe-by-default sanitization at construction time — built-in Bleach-style allowlist sanitization on `JustHTML(...)` (disable with `safe=False`), plus URL/CSS rules. ([Sanitization & Security](/EmilStenstrom/justhtml/blob/main/docs/sanitization.md))
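The command-line invocations above can also be scripted from Python. The following is a minimal sketch that uses the `python -m justhtml` form and only the `--selector` and `--format` flags shown in this section. The helper names `justhtml_argv` and `run_justhtml` are hypothetical, and `run_justhtml` assumes JustHTML is installed for the current interpreter.

```python
import subprocess
import sys


def justhtml_argv(source, selector=None, output_format=None):
    """Build the argv for `python -m justhtml` (source may be "-" for stdin)."""
    argv = [sys.executable, "-m", "justhtml", source]
    if selector is not None:
        argv += ["--selector", selector]
    if output_format is not None:
        argv += ["--format", output_format]
    return argv


def run_justhtml(source, selector=None, output_format=None, stdin_text=None):
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        justhtml_argv(source, selector, output_format),
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


# Usage, assuming index.html exists and justhtml is installed:
# print(run_justhtml("index.html", selector="main p", output_format="text"))
```

Passing the arguments as a list avoids shell quoting problems with selectors such as `main p`.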

Security

For security policy and vulnerability reporting, please see SECURITY.md.

Contributing

See CONTRIBUTING.md for development setup and guidelines.

Acknowledgments

JustHTML started as a Python port of html5ever, the HTML5 parser from Mozilla's Servo browser engine. While the codebase has since evolved significantly, html5ever's clean architecture and spec-compliant approach were invaluable as a starting point. Thank you to the Servo team for their excellent work.

Correctness and conformance work is heavily guided by the html5lib ecosystem and especially the official html5lib-tests fixtures used across implementations.

The sanitization API and threat-model expectations are informed by established Python sanitizers like Bleach and nh3.

The CSS selector query API is inspired by the ergonomics of lxml.cssselect.

License

MIT. Free for both commercial and non-commercial use.

Release history

- 1.20.0 (May 14, 2026)
- 1.19.0 (May 9, 2026)
- 1.18.0 (May 4, 2026)
- 1.17.0 (Apr 19, 2026)
- 1.16.0 (Apr 12, 2026)
- 1.15.0 (Apr 9, 2026)
- 1.14.0 (Apr 5, 2026)
- 1.13.0 (Mar 21, 2026)
- 1.12.0 (Mar 17, 2026)
- 1.11.0 (Mar 15, 2026)
- 1.10.0 (Mar 15, 2026)
- 1.9.1 (Mar 10, 2026)
- 1.9.0 (Mar 8, 2026)
- 1.8.0 (Mar 5, 2026)
- 1.7.0 (Feb 8, 2026)
- 1.6.0 (Feb 6, 2026)
- 1.5.0 (Feb 1, 2026)
- 1.4.0 (Jan 29, 2026)
- 1.3.0 (Jan 28, 2026)
- 1.2.0 (Jan 25, 2026)
- 1.1.0 (Jan 24, 2026)
- 1.0.0 (Jan 20, 2026), this version
- 0.40.0 (Jan 19, 2026)
- 0.39.0 (Jan 18, 2026)
- 0.38.0 (Jan 18, 2026)
- 0.37.0 (Jan 18, 2026)
- 0.36.0 (Jan 17, 2026)
- 0.35.0 (Jan 11, 2026)
- 0.34.0 (Jan 10, 2026)
- 0.33.0 (Jan 10, 2026)
- 0.32.0 (Jan 10, 2026)
- 0.31.0 (Jan 9, 2026)
- 0.30.0 (Jan 3, 2026)
- 0.29.0 (Jan 3, 2026)
- 0.28.0 (Jan 3, 2026)
- 0.27.0 (Jan 3, 2026)
- 0.26.0 (Jan 2, 2026)
- 0.25.0 (Jan 1, 2026)
- 0.24.0 (Jan 1, 2026)
- 0.23.0 (Dec 30, 2025)
- 0.22.0 (Dec 28, 2025)
- 0.21.0 (Dec 28, 2025)
- 0.20.0 (Dec 28, 2025)
- 0.19.0 (Dec 28, 2025)
- 0.18.0 (Dec 21, 2025)
- 0.17.0 (Dec 20, 2025)
- 0.16.0 (Dec 18, 2025)
- 0.15.0 (Dec 18, 2025)
- 0.14.0 (Dec 17, 2025)
- 0.13.1 (Dec 17, 2025)
- 0.13.0 (Dec 16, 2025)
- 0.12.0 (Dec 15, 2025)
- 0.11.0 (Dec 15, 2025)
- 0.10.0 (Dec 14, 2025)
- 0.9.0 (Dec 14, 2025)
- 0.8.0 (Dec 13, 2025)
- 0.7.0 (Dec 13, 2025)
- 0.6.0 (Dec 7, 2025)
- 0.5.2 (Dec 7, 2025)
- 0.5.1 (Dec 7, 2025)
- 0.5.0 (Dec 7, 2025)
- 0.4.0 (Dec 6, 2025)
- 0.3.0 (Dec 1, 2025)
- 0.2.0 (Dec 1, 2025)
- 0.1.0 (Nov 30, 2025)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

justhtml-1.0.0.tar.gz (326.5 kB)

Uploaded Jan 20, 2026 Source

Built Distribution

justhtml-1.0.0-py3-none-any.whl (118.4 kB)

Uploaded Jan 20, 2026 Python 3

File details

Details for the file justhtml-1.0.0.tar.gz.

File metadata

Download URL: justhtml-1.0.0.tar.gz
Upload date: Jan 20, 2026
Size: 326.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for justhtml-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`d0476350ac1cc468c223cd73fd2cfdf052b20d594bd04c9844d3c528e11bce5c`
MD5	`8c5b14abb0bd517bfe794154511b542a`
BLAKE2b-256	`cdbbef76235d0d81576c9ac02b16fadb830a71ccea0df9e50a9e84121149a658`

See more details on using hashes here.
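A downloaded file can be checked against the SHA256 digest in the table above with the standard library alone. This is a minimal sketch; `sha256_of` and `matches_published_sha256` are hypothetical helper names, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha256(path, published_hex):
    """Compare a file's digest with a published hex digest (case-insensitive)."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of(path), expected)


# Usage with the digest listed for the source distribution:
# matches_published_sha256(
#     "justhtml-1.0.0.tar.gz",
#     "d0476350ac1cc468c223cd73fd2cfdf052b20d594bd04c9844d3c528e11bce5c",
# )
```

The same digest appears as the subject digest in the provenance statement, so a match ties the local file to the attested release artifact.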

Provenance

The following attestation bundles were made for justhtml-1.0.0.tar.gz:

Publisher: publish.yml on EmilStenstrom/justhtml

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: justhtml-1.0.0.tar.gz
- Subject digest: d0476350ac1cc468c223cd73fd2cfdf052b20d594bd04c9844d3c528e11bce5c
- Sigstore transparency entry: 839422587
- Sigstore integration time: Jan 20, 2026
Source repository:
- Permalink: EmilStenstrom/justhtml@68975da501f4ad0fc331e81ace82a338ccb531e4
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/EmilStenstrom
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@68975da501f4ad0fc331e81ace82a338ccb531e4
- Trigger Event: release

File details

Details for the file justhtml-1.0.0-py3-none-any.whl.

File metadata

Download URL: justhtml-1.0.0-py3-none-any.whl
Upload date: Jan 20, 2026
Size: 118.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for justhtml-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`01cfd04dc4d8dacc935a41464711b4f5f283108727fc10b5f3829e00d9d7b81b`
MD5	`6620d7b1fccfa0bd47eea92b9668b88e`
BLAKE2b-256	`36659cf054f9637394ebea691ba54f62947bde6c5bd04be04db43a0ada8de338`

See more details on using hashes here.

Provenance

The following attestation bundles were made for justhtml-1.0.0-py3-none-any.whl:

Publisher: publish.yml on EmilStenstrom/justhtml

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: justhtml-1.0.0-py3-none-any.whl
- Subject digest: 01cfd04dc4d8dacc935a41464711b4f5f283108727fc10b5f3829e00d9d7b81b
- Sigstore transparency entry: 839422651
- Sigstore integration time: Jan 20, 2026
Source repository:
- Permalink: EmilStenstrom/justhtml@68975da501f4ad0fc331e81ace82a338ccb531e4
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/EmilStenstrom
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@68975da501f4ad0fc331e81ace82a338ccb531e4
- Trigger Event: release
