A pure Python HTML5 parser that just works.

JustHTML

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

Why use JustHTML?

1. Just... Correct ✅

It implements the official WHATWG HTML5 specification exactly. If a browser can parse it, JustHTML can parse it. It handles all the complex error-handling rules that browsers use.

  • Verified Compliance: Passes all 8,500+ tests in the official html5lib-tests suite (used by browser vendors).
  • 100% Coverage: Every line and branch of code is covered by integration tests.
  • Fuzz Tested: Has parsed 3 million randomized broken HTML documents to ensure it never crashes or hangs (see benchmarks/fuzz.py).
  • Living Standard: It tracks the living standard, not a snapshot from 2012.

2. Just... Python 🐍

JustHTML has zero dependencies. It's pure Python.

  • Just Install: No C extensions to compile, no system libraries (like libxml2) required. Works on PyPy, on WASM via Pyodide (yes, it's in the test matrix), and anywhere else Python runs.
  • No dependency upgrade hassle: Some libraries depend on a large set of other libraries, all of which need regular upgrades to avoid security issues.
  • Debuggable: It's just Python code. You can step through it with a debugger to understand exactly how your HTML is being parsed.
  • Returns plain Python objects: Other parsers return lxml or etree trees, which means you have another API to learn. JustHTML returns a set of nested objects you can iterate over. Simple.
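
Because the result is a tree of ordinary objects, walking it takes only a few lines. The sketch below is illustrative only: `Node` is a hypothetical stand-in that mirrors the `.name`, `.attrs`, and `.children` attributes described in the usage section, so the example runs without JustHTML installed. With the real library you would presumably start from `doc.root`; how text nodes are represented there is not covered here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed node (not JustHTML's own class)."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def walk(node, depth=0):
    """Yield (depth, node) pairs in document order (depth-first, pre-order)."""
    yield depth, node
    # Leaf nodes may lack a children attribute or have an empty one; treat both alike.
    for child in getattr(node, "children", None) or []:
        yield from walk(child, depth + 1)


# Hand-built tree standing in for <div id="main"><p><b></b></p></div>.
tree = Node("div", {"id": "main"}, [Node("p", children=[Node("b")])])

outline = [("  " * depth) + node.name for depth, node in walk(tree)]
print("\n".join(outline))
```

The generator form keeps memory use flat and lets callers stop early, for example at the first node whose name matches.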

3. Just... Query 🔍

Find elements with CSS selectors. Just one method to learn - query() - and it uses CSS syntax you already know.

doc.query("div.container > p.intro")  # Familiar CSS syntax
doc.query("#main, .sidebar")          # Selector groups
doc.query("li:nth-child(2n+1)")       # Pseudo-classes

4. Just... Fast Enough ⚡

If you need to parse terabytes of data, use a C or Rust parser (like html5ever). They are 10x-20x faster.

But for most use cases, JustHTML is fast enough. It parses the Wikipedia homepage in ~0.1s. It is the fastest pure-Python HTML5 parser available, outperforming html5lib and BeautifulSoup.
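
To check whether a parser is fast enough for your own documents, a small timing harness is usually sufficient. The sketch below is not the project's benchmark script; `best_parse_time` and `stdlib_parse` are hypothetical helper names, and the standard library's `html.parser` is used as the workload only so that the example runs without any third-party package.

```python
import time
from html.parser import HTMLParser


def best_parse_time(parse, html, repeats=5):
    """Return the fastest wall-clock time (seconds) of `repeats` calls to parse(html)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse(html)
        timings.append(time.perf_counter() - start)
    return min(timings)


def stdlib_parse(html):
    """Stand-in workload: tokenize with the standard library's html.parser."""
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


sample = "<ul>" + "<li>item</li>" * 1000 + "</ul>"
elapsed = best_parse_time(stdlib_parse, sample)
print(f"best of 5: {elapsed:.4f}s")
```

Taking the minimum of several runs reduces noise from other processes. To time JustHTML itself, passing the `JustHTML` class as the `parse` argument should work, since the usage example constructs a document with `JustHTML(html)`.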

Comparison to other parsers

| Parser | HTML5 Compliance | Pure Python? | Speed | Query API | Notes |
|---|---|---|---|---|---|
| JustHTML | 100% | ✅ Yes | ⚡ Fast | ✅ CSS selectors | It just works. Correct, easy to install, and fast enough. |
| html5lib | 🟡 88% | ✅ Yes | 🐢 Slow | ❌ None | The reference implementation. Very correct but quite slow. |
| html5_parser | 🟡 84% | ❌ No | 🚀 Very Fast | 🟡 XPath (lxml) | C-based (Gumbo). Fast and mostly correct. |
| selectolax | 🟡 68% | ❌ No | 🚀 Very Fast | ✅ CSS selectors | C-based (Lexbor). Very fast but less compliant. |
| BeautifulSoup | 🔴 4% | ✅ Yes | 🐢 Slow | 🟡 Custom API | Wrapper around html.parser. Not spec compliant. |
| html.parser | 🔴 4% | ✅ Yes | ⚡ Fast | ❌ None | Standard library. Chokes on malformed HTML. |
| lxml | 🔴 1% | ❌ No | 🚀 Very Fast | 🟡 XPath | C-based (libxml2). Fast but not HTML5 compliant. |

Compliance scores from running the html5lib-tests suite (1,743 tree-construction tests). See benchmarks/correctness.py.
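
One plausible reason for the low score of `html.parser` in this table is that it only reports tags and text in the order they appear; it does not run the HTML5 tree-construction rules that repair misnested markup. The sketch below uses only the standard library to show this. `EventLogger` is a hypothetical name chosen for the illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record tag and text events exactly as html.parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


logger = EventLogger()
logger.feed("<b><i>x</b></i>")  # misnested: </b> arrives before </i>
logger.close()
print(logger.events)
```

The end tags come back in the same wrong order they were written, so any code building a tree from these events has to implement the repair rules itself. A spec-compliant parser applies those rules during tree construction.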

Installation

Requires Python 3.10 or later.

pip install justhtml

Example usage

Python API

from justhtml import JustHTML

html = "<html><body><div id='main'><p>Hello, <b>world</b>!</p></div></body></html>"
doc = JustHTML(html)

# 1. Traverse the tree
# The tree is made of SimpleDomNode objects.
# Each node has .name, .attrs, .children, and .parent
root = doc.root              # #document
html_node = root.children[0] # html
body = html_node.children[1] # body (children[0] is head)
div = body.children[0]       # div

print(f"Tag: {div.name}")
print(f"Attributes: {div.attrs}")

# 2. Query with CSS selectors
# Find elements using familiar CSS selector syntax
paragraphs = doc.query("p")           # All <p> elements
main_div = doc.query("#main")[0]      # Element with id="main"
bold = doc.query("div > p b")         # <b> anywhere inside a <p> that is a direct child of a <div>

# 3. Pretty-print HTML
# You can serialize any node back to HTML
print(div.to_html())
# Output:
# <div id="main">
#   <p>
#     Hello,
#     <b>world</b>
#     !
#   </p>
# </div>

# 4. Streaming API (extremely fast and memory efficient)
# For massive files or when you don't need the full DOM tree.
# NOTE: Does not build a tree and _only_ runs the html5-compatible tokenizer

from justhtml import stream

for event, data in stream(html):
    if event == "start":
        tag, attrs = data
        print(f"Start: {tag} with {attrs}")
    elif event == "text":
        print(f"Text: {data}")
    elif event == "end":
        print(f"End: {data}")

# 5. Strict mode (reject malformed HTML)
# Raises an exception on the first parse error with source highlighting
try:
    doc = JustHTML("<html><p>Hello", strict=True)
except Exception as e:
    print(e)
# Output (Python 3.11+):
#   File "<html>", line 1
#     <html><p>Hello
#                   ^
# StrictModeError: Expected closing tag </p> but reached end of file
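
The streaming loop above yields `(event, data)` pairs, which makes it easy to compute summaries without building a tree. The following is a minimal sketch under stated assumptions: `count_start_tags` is a hypothetical helper, the sample events are hand-written in the same shape as the loop above so the code runs without JustHTML installed, and representing `attrs` as a dict is an assumption.

```python
from collections import Counter


def count_start_tags(events):
    """Count start tags in an iterable of (event, data) pairs."""
    counts = Counter()
    for event, data in events:
        if event == "start":
            tag, _attrs = data  # start events carry (tag, attrs), as in the loop above
            counts[tag] += 1
    return counts


# Hand-written sample events; with JustHTML installed you would
# pass stream(html) instead of this list.
sample_events = [
    ("start", ("div", {"id": "main"})),
    ("start", ("p", {})),
    ("text", "Hello, "),
    ("start", ("b", {})),
    ("text", "world"),
    ("end", "b"),
    ("end", "p"),
    ("end", "div"),
]

tag_counts = count_start_tags(sample_events)
print(tag_counts.most_common())
```

Because the helper accepts any iterable, it consumes events one at a time and never holds the whole document in memory.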

Supported CSS Selectors

JustHTML supports a comprehensive subset of CSS selectors:

| Selector | Example | Description |
|---|---|---|
| Tag | div | Elements by tag name |
| Class | .intro | Elements with class |
| ID | #main | Element with ID |
| Universal | * | All elements |
| Attribute | [href] | Elements with attribute |
| Attr value | [type="text"] | Exact attribute match |
| Attr prefix | [href^="https"] | Attribute starts with |
| Attr suffix | [href$=".pdf"] | Attribute ends with |
| Attr contains | [href*="example"] | Attribute contains |
| Descendant | div p | <p> inside <div> |
| Child | div > p | Direct child |
| Adjacent | h1 + p | Immediately after |
| Sibling | h1 ~ p | Any sibling after |
| First child | :first-child | First child element |
| Last child | :last-child | Last child element |
| Nth child | :nth-child(2n+1) | Nth child (odd, even, formula) |
| Not | :not(.hidden) | Negation |
| Groups | h1, h2, h3 | Multiple selectors |
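
The `:nth-child` formula is the least obvious entry in this table. In CSS, `:nth-child(an+b)` matches an element whose 1-based position among its siblings equals `a*n + b` for some integer `n >= 0`. The sketch below works through that arithmetic in plain Python; it illustrates the CSS rule itself, not JustHTML's internals, and the function names are hypothetical.

```python
def nth_child_matches(a, b, index):
    """Return True if the 1-based sibling index equals a*n + b for some integer n >= 0."""
    if a == 0:
        return index == b
    n, remainder = divmod(index - b, a)
    return remainder == 0 and n >= 0


def matching_indices(a, b, sibling_count):
    """List the 1-based positions among sibling_count siblings that match an+b."""
    return [i for i in range(1, sibling_count + 1) if nth_child_matches(a, b, i)]


print(matching_indices(2, 1, 6))   # 2n+1, the same as "odd"
print(matching_indices(2, 0, 6))   # 2n, the same as "even"
print(matching_indices(-1, 3, 6))  # -n+3, the first three children
```

With six siblings, `2n+1` selects positions 1, 3, and 5, `2n` selects 2, 4, and 6, and `-n+3` selects 1, 2, and 3.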

Command Line Interface

You can also use JustHTML from the command line to pretty-print HTML files:

# Parse a file
python -m justhtml index.html

# Parse from stdin (great for piping)
curl -s https://example.com | python -m justhtml -

Develop locally and run the tests

  1. Clone the repository:

    git clone git@github.com:EmilStenstrom/justhtml.git
    cd justhtml
    
  2. Install the library locally:

    pip install -e ".[dev]"
    
  3. Run the tests:

    python run_tests.py
    

    For verbose output showing diffs on failures:

    python run_tests.py -v
    
  4. Run the benchmarks:

    python benchmarks/performance.py
    

License

MIT. Free for both commercial and non-commercial use.
