JustHTML
A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.
📖 Read the full documentation here
Why use JustHTML?
- Just... Correct ✅ — Spec-perfect HTML5 parsing with browser-grade error recovery — passes the official 9k+ html5lib-tests suite, with 100% line+branch coverage. (Correctness)
- Just... Python 🐍 — Pure Python, zero dependencies — no C extensions or system libraries, easy to debug, and works anywhere Python runs (including PyPy and Pyodide). (Quickstart)
- Just... Secure 🔒 — Safe-by-default output for untrusted HTML — built-in Bleach-style allowlist sanitization on to_html() / to_markdown() (override with safe=False), plus URL/CSS rules. (Sanitization & Security)
- Just... Query 🔍 — CSS selectors out of the box — one method (query()), familiar syntax (combinators, groups, pseudo-classes), and plain Python nodes as results. (CSS Selectors)
- Just... Fast Enough ⚡ — Fast for the common case (fastest pure-Python HTML5 parser available); for terabytes, use a C/Rust parser like html5ever. (Benchmarks)
Comparison to other parsers
| Parser | HTML5 Compliance | Pure Python? | Speed | Query API | Notes |
|---|---|---|---|---|---|
| JustHTML | ✅ 100% | ✅ Yes | ⚡ Fast | ✅ CSS selectors | It just works. Correct, easy to install, and fast enough. |
| html5lib | 🟡 88% | ✅ Yes | 🐢 Slow | ❌ None | The reference implementation. Very correct but quite slow. |
| html5_parser | 🟡 84% | ❌ No | 🚀 Very Fast | 🟡 XPath (lxml) | C-based (Gumbo). Fast and mostly correct. |
| selectolax | 🟡 68% | ❌ No | 🚀 Very Fast | ✅ CSS selectors | C-based (Lexbor). Very fast but less compliant. |
| BeautifulSoup | 🔴 4% | ✅ Yes | 🐢 Slow | 🟡 Custom API | Wrapper around html.parser. Not spec compliant. |
| html.parser | 🔴 4% | ✅ Yes | ⚡ Fast | ❌ None | Standard library. Chokes on malformed HTML. |
| lxml | 🔴 1% | ❌ No | 🚀 Very Fast | 🟡 XPath | C-based (libxml2). Fast but not HTML5 compliant. |
Compliance scores from a strict run of the html5lib-tests tree-construction fixtures (1,743 non-script tests). See benchmarks/correctness.py and docs/correctness.md for details.
Browser engine agreement (tree-construction, pass/(pass+fail), 2025-12-30):
| Engine | Tests Passed | Agreement | Notes |
|---|---|---|---|
| Chromium | 1763/1770 | 99.6% | DOMParser / contextual fragment (via Playwright) |
| WebKit | 1741/1770 | 98.4% | DOMParser / contextual fragment (via Playwright) |
| Firefox | 1727/1770 | 97.6% | DOMParser / contextual fragment (via Playwright) |
Browser numbers from justhtml-html5lib-tests-bench on the upstream html5lib-tests/tree-construction corpus (excluding 12 scripting-enabled cases).
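The Agreement column can be reproduced from the Tests Passed column with plain arithmetic. A minimal sketch, assuming the figures are rounded to one decimal place (the helper name `agreement` is only illustrative and is not part of JustHTML or the benchmark tool):

```python
# (engine, passed, total) copied from the browser agreement table.
RESULTS = [
    ("Chromium", 1763, 1770),
    ("WebKit", 1741, 1770),
    ("Firefox", 1727, 1770),
]


def agreement(passed: int, total: int) -> float:
    """Return pass / (pass + fail) as a percentage, rounded to one decimal."""
    return round(100 * passed / total, 1)


if __name__ == "__main__":
    for engine, passed, total in RESULTS:
        print(f"{engine}: {passed}/{total} = {agreement(passed, total)}%")
```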
Installation
Requires Python 3.10 or later.
pip install justhtml
Quick Example
from justhtml import JustHTML
doc = JustHTML("<html><body><p class='intro'>Hello!</p></body></html>")
# Query with CSS selectors
for p in doc.query("p.intro"):
    print(p.name)       # "p"
    print(p.attrs)      # {"class": "intro"}
    print(p.to_html())  # <p class="intro">Hello!</p>
See the Quickstart Guide for more examples including tree traversal, streaming, and strict mode.
Command Line
If you installed JustHTML (for example with pip install justhtml or pip install -e .), you can use the justhtml command.
If you don't have it available, use the equivalent python -m justhtml ... form instead.
# Pretty-print an HTML file
justhtml index.html
# Parse from stdin
curl -s https://example.com | justhtml -
# Select nodes and output text
justhtml index.html --selector "main p" --format text
# Select nodes and output Markdown (subset of GFM)
justhtml index.html --selector "article" --format markdown
# Select nodes and output HTML
justhtml index.html --selector "a" --format html
# Example: extract Markdown from GitHub README HTML
curl -s https://github.com/EmilStenstrom/justhtml/ | justhtml - --selector '.markdown-body' --format markdown | head -n 15
Output:
# JustHTML
[](#justhtml)
A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.
**[📖 Read the full documentation here](/EmilStenstrom/justhtml/blob/main/docs/index.md)**
## Why use JustHTML?
- **Just... Correct ✅** — Spec-perfect HTML5 parsing with browser-grade error recovery — passes the official 9k+ [html5lib-tests](https://github.com/html5lib/html5lib-tests) suite, with 100% line+branch coverage. ([Correctness](/EmilStenstrom/justhtml/blob/main/docs/correctness.md))
- **Just... Python 🐍** — Pure Python, zero dependencies — no C extensions or system libraries, easy to debug, and works anywhere Python runs (including PyPy and Pyodide). ([Quickstart](/EmilStenstrom/justhtml/blob/main/docs/quickstart.md))
- **Just... Secure 🔒** — Safe-by-default output for untrusted HTML — built-in Bleach-style allowlist sanitization on `to_html()` / `to_markdown()` (override with `safe=False`), plus URL/CSS rules. ([Sanitization & Security](/EmilStenstrom/justhtml/blob/main/docs/sanitization.md))
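When calling the CLI from a script, the fallback from `justhtml` to `python -m justhtml` mentioned at the start of this section can be automated. A minimal sketch using only the standard library; `justhtml_command` is a hypothetical helper name, and the arguments are the ones shown in the examples above:

```python
import shutil
import sys


def justhtml_command(args: list[str]) -> list[str]:
    """Build an argv list for the JustHTML CLI.

    Uses the `justhtml` executable when it is on PATH, otherwise falls
    back to the equivalent `python -m justhtml` form.
    """
    exe = shutil.which("justhtml")
    if exe is not None:
        return [exe, *args]
    return [sys.executable, "-m", "justhtml", *args]


if __name__ == "__main__":
    cmd = justhtml_command(["index.html", "--selector", "main p", "--format", "text"])
    print(cmd)
```

The returned list can be passed directly to `subprocess.run(cmd, capture_output=True, text=True)`.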
Contributing
See CONTRIBUTING.md for development setup and guidelines.
Acknowledgments
JustHTML started as a Python port of html5ever, the HTML5 parser from Mozilla's Servo browser engine. While the codebase has since evolved significantly, html5ever's clean architecture and spec-compliant approach were invaluable as a starting point. Thank you to the Servo team for their excellent work.
Correctness and conformance work is heavily guided by the html5lib ecosystem and especially the official html5lib-tests fixtures used across implementations.
The sanitization API and threat-model expectations are informed by established Python sanitizers like Bleach and nh3.
The CSS selector query API is inspired by the ergonomics of lxml.cssselect.
License
MIT. Free for both commercial and non-commercial use.