Skip to main content

A pure Python HTML5 parser that just works.

Project description

JustHTML

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

📖 Read the full documentation here

Why use JustHTML?

1. Just... Correct ✅

It implements the official WHATWG HTML5 specification exactly. If a browser can parse it, JustHTML can parse it. It handles all the complex error-handling rules that browsers use.

  • Verified Compliance: Passes all 9k+ tests in the official html5lib-tests suite (used by browser vendors).
  • 100% Coverage: Every line and branch of code is covered by integration tests.
  • Fuzz Tested: Has parsed 6 million randomized broken HTML documents to ensure it never crashes or hangs (see benchmarks/fuzz.py).
  • Living Standard: It tracks the living standard, not a snapshot from 2012.

2. Just... Python 🐍

JustHTML has zero dependencies. It's pure Python.

  • Just Install: No C extensions to compile, no system libraries (like libxml2) required. Works on PyPy, WASM (Pyodide) (yes, it's in the test matrix), and anywhere Python runs.
  • No dependency upgrade hassle: Some libraries depend on a large set of libraries, all which require upgrades to avoid security issues.
  • Debuggable: It's just Python code. You can step through it with a debugger to understand exactly how your HTML is being parsed.
  • Returns plain python objects: Other parsers return lxml or etree trees which means you have another API to learn. JustHTML returns a set of nested objects you can iterate over. Simple.

3. Just... Query 🔍

Find elements with CSS selectors. Just one method to learn - query() - and it uses CSS syntax you already know.

doc.query("div.container > p.intro")  # Familiar CSS syntax
doc.query("#main, .sidebar")          # Selector groups
doc.query("li:nth-child(2n+1)")       # Pseudo-classes

4. Just... Fast Enough ⚡

If you need to parse terabytes of data, use a C or Rust parser (like html5ever). They are 10x-20x faster.

But for most use cases, JustHTML is fast enough. It parses the Wikipedia homepage in ~0.1s. It is the fastest pure-Python HTML5 parser available, outperforming html5lib and BeautifulSoup.

Comparison to other parsers

Parser HTML5 Compliance Pure Python? Speed Query API Notes
JustHTML 100% ✅ Yes ⚡ Fast ✅ CSS selectors It just works. Correct, easy to install, and fast enough.
html5lib 🟡 88% ✅ Yes 🐢 Slow ❌ None The reference implementation. Very correct but quite slow.
html5_parser 🟡 84% ❌ No 🚀 Very Fast 🟡 XPath (lxml) C-based (Gumbo). Fast and mostly correct.
selectolax 🟡 68% ❌ No 🚀 Very Fast ✅ CSS selectors C-based (Lexbor). Very fast but less compliant.
BeautifulSoup 🔴 4% ✅ Yes 🐢 Slow 🟡 Custom API Wrapper around html.parser. Not spec compliant.
html.parser 🔴 4% ✅ Yes ⚡ Fast ❌ None Standard library. Chokes on malformed HTML.
lxml 🔴 1% ❌ No 🚀 Very Fast 🟡 XPath C-based (libxml2). Fast but not HTML5 compliant.

Compliance scores from running the html5lib-tests suite (1,743 tree-construction tests). See benchmarks/correctness.py.

Installation

Requires Python 3.10 or later.

pip install justhtml

Quick Example

from justhtml import JustHTML

doc = JustHTML("<html><body><p class='intro'>Hello!</p></body></html>")

# Query with CSS selectors
for p in doc.query("p.intro"):
    print(p.name)        # "p"
    print(p.attrs)       # {"class": "intro"}
    print(p.to_html())   # <p class="intro">Hello!</p>

See the Quickstart Guide for more examples including tree traversal, streaming, and strict mode.

Command Line

If you installed JustHTML (for example with pip install justhtml or pip install -e .), you can use the justhtml command. If you don't have it available, use the equivalent python -m justhtml ... form instead.

# Pretty-print an HTML file
justhtml index.html

# Parse from stdin
curl -s https://example.com | justhtml -

# Select nodes and output text
justhtml index.html --selector "main p" --format text

# Select nodes and output Markdown (subset of GFM)
justhtml index.html --selector "article" --format markdown

# Select nodes and output HTML
justhtml index.html --selector "a" --format html
# Example: extract Markdown from GitHub README HTML
curl -s https://github.com/EmilStenstrom/justhtml/ | justhtml - --selector '.markdown-body' --format markdown | head -n 15

Output:

# JustHTML

[](#justhtml)

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

**[📖 Read the full documentation here](/EmilStenstrom/justhtml/blob/main/docs/index.md)**

## Why use JustHTML?

[](#why-use-justhtml)

### 1. Just... Correct ✅

[](#1-just-correct-)

Contributing

See CONTRIBUTING.md for development setup and guidelines.

Acknowledgments

JustHTML started as a Python port of html5ever, the HTML5 parser from Mozilla's Servo browser engine. While the codebase has since evolved significantly, html5ever's clean architecture and spec-compliant approach were invaluable as a starting point. Thank you to the Servo team for their excellent work.

License

MIT. Free to use both for commercial and non-commercial use.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

justhtml-0.15.0.tar.gz (164.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

justhtml-0.15.0-py3-none-any.whl (76.7 kB view details)

Uploaded Python 3

File details

Details for the file justhtml-0.15.0.tar.gz.

File metadata

  • Download URL: justhtml-0.15.0.tar.gz
  • Upload date:
  • Size: 164.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for justhtml-0.15.0.tar.gz
Algorithm Hash digest
SHA256 335b9e01fb6fb1da533c7032b9d1678b76091569f2d78da01f047d537d027ecb
MD5 bdc0f492f1adc60abfdf5e200be34b77
BLAKE2b-256 1ad0c82bc420a8dd9a4d683a583d89abc088b9d26573577421e98e0808b504e6

See more details on using hashes here.

Provenance

The following attestation bundles were made for justhtml-0.15.0.tar.gz:

Publisher: publish.yml on EmilStenstrom/justhtml

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file justhtml-0.15.0-py3-none-any.whl.

File metadata

  • Download URL: justhtml-0.15.0-py3-none-any.whl
  • Upload date:
  • Size: 76.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for justhtml-0.15.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7ab6b86b78682249b798d21a5b29f973fbcfa3de928f064cf728010941cb300b
MD5 1a6c1d18fb90dd8d05eed2182c7e07e9
BLAKE2b-256 1a2cbe9118910dd92321a3b26d9407232b3a6b451db6a41fb41d5c8a72fcb108

See more details on using hashes here.

Provenance

The following attestation bundles were made for justhtml-0.15.0-py3-none-any.whl:

Publisher: publish.yml on EmilStenstrom/justhtml

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page