A pure Python HTML5 parser that just works.
JustHTML
JustHTML is a pure Python HTML5 parser that just works. It parses HTML and returns a DOM tree that you can traverse and manipulate.
Why JustHTML?
1. ✅ Correctness: 100% Spec Compliant
JustHTML is built to be correct. It implements the official WHATWG HTML5 specification exactly (tree builder and tokenizer), including all the complex error-handling rules that browsers use.
- Verified Compliance: Passes all 8,500+ tests in the official html5lib-tests suite (used by browser vendors) (see /tests/).
- 100% Coverage: Every single line and branch of code is covered by integration tests.
- Fuzz Tested: Has parsed 3 million randomized broken HTML documents to ensure it never crashes or hangs (see fuzz.py).
- Living Standard: It tracks the living standard, not a snapshot from 2012.
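As a rough illustration of the fuzz-testing idea in the list above, the sketch below feeds randomly assembled, deliberately broken markup to a parser and records any exception. It is not the project's fuzz.py: the fragment list and the names `random_document` and `fuzz` are hypothetical, and `parse` stands in for whatever callable wraps the parser under test.

```python
import random

# Hypothetical building blocks: small pieces that combine into broken markup.
FRAGMENTS = [
    "<div>", "</div>", "<p", "<b>", "</b>", "<!--", "-->",
    "<table>", "</td>", "&amp", "&#x", "text", "<", ">", '"', "'", "=",
]


def random_document(rng, max_parts=20):
    """Join a random number of fragments into one (usually invalid) document."""
    count = rng.randint(1, max_parts)
    return "".join(rng.choice(FRAGMENTS) for _ in range(count))


def fuzz(parse, runs=1000, seed=0):
    """Call parse() on `runs` random documents and collect (document, exception) pairs."""
    rng = random.Random(seed)  # seeded so that failures are reproducible
    failures = []
    for _ in range(runs):
        document = random_document(rng)
        try:
            parse(document)
        except Exception as exc:  # any exception counts as a crash
            failures.append((document, exc))
    return failures


if __name__ == "__main__":
    # A no-op parser never fails; swap in a real parser to test it.
    print(len(fuzz(lambda html: None, runs=100)))
```

With JustHTML installed, something like `fuzz(lambda html: JustHTML(html))` would exercise the real parser. This sketch only catches exceptions; detecting hangs would additionally need a timeout around each call.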
2. 🐍 Pure Python with zero dependencies
JustHTML has zero dependencies. It's pure Python.
- Easy Installation: No C extensions to compile, no system libraries (like libxml2) required. Works on PyPy, WASM (Pyodide), and anywhere Python runs.
- No dependency upgrade hassle: Some libraries depend on a large set of other libraries, all of which require upgrades to avoid security issues.
- Debuggable: It's just Python code. You can step through it with a debugger to understand exactly how your HTML is being parsed.
- Returns plain Python objects: Other parsers return lxml or etree trees, which means you have another API to learn. JustHTML returns a set of nested objects you can iterate over. Simple.
3. ⚡ Fast enough™ Performance
If you need to parse terabytes of data, use a C or Rust parser (like html5ever). They are 10x-20x faster (see benchmarks.py).
But for most use cases, JustHTML is fast enough. It parses the Wikipedia homepage in ~0.1s. It is the fastest pure-Python HTML5 parser available, outperforming html5lib and BeautifulSoup.
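To check whether "fast enough" holds for your own documents, a small timing harness is usually sufficient. The sketch below is not the project's benchmark script: `best_parse_time` and `stdlib_parse` are hypothetical helpers, and the standard library's `html.parser` is used only as a stand-in so the example runs without extra packages.

```python
import time
from html.parser import HTMLParser


def best_parse_time(parse, html, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse(html)
        timings.append(time.perf_counter() - start)
    return min(timings)


def stdlib_parse(html):
    """Stand-in parser: tokenizes with html.parser and discards the events."""
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


# Synthetic input; replace with the contents of a real page for a fair test.
sample = "<div><p>Hello, <b>world</b>!</p></div>" * 1000

if __name__ == "__main__":
    print(f"html.parser: {best_parse_time(stdlib_parse, sample):.4f}s")
```

With JustHTML installed, passing `lambda html: JustHTML(html)` as the `parse` argument would time it on the same input. Taking the minimum over several runs reduces noise from other processes.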
Comparison to other parsers
| Parser | Spec Compliant? | Pure Python? | Speed | Notes |
|---|---|---|---|---|
| JustHTML | ✅ Yes | ✅ Yes | ⚡ Fast | The sweet spot. Correct, easy to install, and fast enough. |
| html.parser | ❌ No | ✅ Yes | ⚡ Fast | Standard library. Chokes on malformed HTML. |
| lxml | ❌ No | ❌ No | 🚀 Very Fast | C-based. Fast but not spec-compliant (different output than browsers). |
| html5lib | ✅ Yes | ✅ Yes | 🐢 Slow | The reference implementation. Very correct but very slow. |
| BeautifulSoup | N/A | N/A | 🐢 Slow | Wrapper around other parsers. Slower and more memory hungry than the underlying parser. |
| gumbo / html5ever | ✅ Yes | ❌ No | 🚀 Very Fast | C/Rust based. Fast and correct, but requires compiling extensions. |
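One way to read the html.parser row: the standard-library parser reports tags in the order they appear in the source and applies none of the HTML5 error-recovery rules. The sketch below logs its events for misnested markup; `EventLogger` is a hypothetical helper written for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every start tag, end tag, and text chunk as html.parser sees it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


# <b> is closed while <i> is still open, and <p> is never closed.
logger = EventLogger()
logger.feed("<p><b>bold<i>both</b>italic</i>")
logger.close()

for event in logger.events:
    print(event)
```

The events mirror the source verbatim: `</b>` is reported before `</i>`, and no end tag for `<p>` ever appears. A parser that follows the HTML5 tree-construction rules instead repairs the nesting the way browsers do, so the word "italic" ends up inside a newly created `<i>` element.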
Installation
pip install justhtml
Example usage
Python API
from justhtml import JustHTML
html = "<html><body><div id='main'><p>Hello, <b>world</b>!</p></div></body></html>"
doc = JustHTML(html)
# 1. Traverse the tree
# The tree is made of SimpleDomNode objects.
# Each node has .name, .attrs, .children, and .parent
root = doc.root # #document
html_node = root.children[0] # html
body = html_node.children[1] # body (children[0] is head)
div = body.children[0] # div
print(f"Tag: {div.name}")
print(f"Attributes: {div.attrs}")
# 2. Pretty-print HTML
# You can serialize any node back to HTML
print(div.to_html())
# Output:
# <div id="main">
#   <p>
#     Hello,
#     <b>world</b>
#     !
#   </p>
# </div>
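Indexing into .children works for a known layout, but for arbitrary documents a recursive walk is often more convenient. The sketch below relies only on the .name and .children attributes described above. `Node`, `walk`, and `find_all` are hypothetical names, and `Node` is a stand-in class so the example runs without JustHTML installed; it is not the library's SimpleDomNode.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in node that mirrors only .name, .attrs, and .children."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def walk(node):
    """Yield the node and all of its descendants, depth-first, in document order."""
    yield node
    # Assumption: some node types may have no children attribute, so fall back to [].
    for child in getattr(node, "children", None) or []:
        yield from walk(child)


def find_all(root, name):
    """Return every node under root (including root) whose .name equals `name`."""
    return [n for n in walk(root) if getattr(n, "name", None) == name]


# Same shape as the example document: div > p > b
tree = Node("div", {"id": "main"}, [Node("p", {}, [Node("b")])])

if __name__ == "__main__":
    print([n.name for n in walk(tree)])
    print(find_all(tree, "b"))
```

Since the helpers only use attribute access, passing `doc.root` from the example above in place of `tree` should work the same way, provided the real nodes expose those attributes as described.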
Command Line Interface
You can also use JustHTML from the command line to pretty-print HTML files:
# Parse a file
python -m justhtml index.html
# Parse from stdin (great for piping)
curl -s https://example.com | python -m justhtml -
Develop locally and run the tests
- Clone the repository:
  git clone git@github.com:EmilStenstrom/justhtml.git
  cd justhtml
- Install the library locally (there are no dependencies!):
  pip install -e .
- Run the tests:
  python run_tests.py
  For verbose output showing diffs on failures:
  python run_tests.py -v
- Run the benchmarks:
  python benchmark.py
License
MIT. Free for commercial and non-commercial use.