A pure Python HTML5 parser that just works.

JustHTML

JustHTML is a pure Python HTML5 parser that just works. It parses HTML and returns a DOM tree that you can traverse and manipulate.

Why JustHTML?

1. ✅ Correctness: 100% Spec Compliant

JustHTML is built to be correct. It implements the official WHATWG HTML5 specification exactly (tree builder and tokenizer), including all the complex error-handling rules that browsers use.

  • Verified Compliance: Passes all 8,500+ tests in the official html5lib-tests suite, which browser vendors also use (see /tests/).
  • 100% Coverage: Every single line and branch of code is covered by integration tests.
  • Fuzz Tested: Has parsed 3 million randomized broken HTML documents to ensure it never crashes or hangs (see fuzz.py).
  • Living Standard: It tracks the living standard, not a snapshot from 2012.
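
The fuzz-testing idea above can be illustrated with a minimal sketch. This is not the project's fuzz.py: the fragment pool and the `random_broken_html` and `fuzz` helpers are hypothetical names chosen for illustration, and the harness only catches exceptions, so it would not detect hangs.

```python
import random

# Hypothetical fragment pool (an assumption, not taken from JustHTML's fuzz.py).
FRAGMENTS = [
    "<div", "<p>", "</b>", "<!--", "-->", "<![CDATA[", "&amp", "\x00",
    "'", '"', "=", ">", "<table><tr>", "text", "</",
]


def random_broken_html(rng, max_parts=20):
    """Glue random fragments together into a (probably malformed) document."""
    count = rng.randint(1, max_parts)
    return "".join(rng.choice(FRAGMENTS) for _ in range(count))


def fuzz(parse, runs=1000, seed=0):
    """Call parse() on `runs` random documents.

    Returns a list of (input, exception) pairs, one per crash.
    """
    rng = random.Random(seed)  # seeded so failures are reproducible
    failures = []
    for _ in range(runs):
        doc = random_broken_html(rng)
        try:
            parse(doc)
        except Exception as exc:
            failures.append((doc, exc))
    return failures
```

With JustHTML installed, `fuzz(JustHTML)` should return an empty list if the parser never raises on malformed input, as claimed above. Any callable that accepts a string can be passed in the same way.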

2. 🐍 Pure Python with zero dependencies

JustHTML has zero dependencies. It's pure Python.

  • Easy Installation: No C extensions to compile, no system libraries (like libxml2) required. Works on PyPy, WASM (Pyodide), and anywhere Python runs.
  • No dependency upgrade hassle: Some libraries depend on a large set of other libraries, all of which require upgrades to avoid security issues.
  • Debuggable: It's just Python code. You can step through it with a debugger to understand exactly how your HTML is being parsed.
  • Returns plain Python objects: Other parsers return lxml or etree trees, which means you have another API to learn. JustHTML returns a set of nested objects you can iterate over. Simple.

3. ⚡ Fast enough™ Performance

If you need to parse terabytes of data, use a C or Rust parser (like html5ever). They are 10x-20x faster (see benchmarks.py).

But for most use cases, JustHTML is fast enough. It parses the Wikipedia homepage in ~0.1s. It is the fastest pure-Python HTML5 parser available, outperforming html5lib and BeautifulSoup.
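
To check whether a parser is fast enough for your own documents, a small timing helper is usually all you need. The sketch below is an illustration, not the project's benchmark script: `seconds_per_parse` and `stdlib_parse` are hypothetical helper names, and the standard library's `html.parser` is used as the baseline so the example runs without any installs.

```python
import timeit
from html.parser import HTMLParser


def seconds_per_parse(parse, html, repeat=5, number=10):
    """Return the best observed time, in seconds, for one parse(html) call."""
    timings = timeit.repeat(lambda: parse(html), repeat=repeat, number=number)
    # Each timing covers `number` calls; the minimum is the least noisy run.
    return min(timings) / number


def stdlib_parse(html):
    """Baseline: feed the document through the standard library parser."""
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


sample = "<html><body>" + "<p>Hello, <b>world</b>!</p>" * 1000 + "</body></html>"
print(f"html.parser: {seconds_per_parse(stdlib_parse, sample):.4f}s per parse")
```

Since a document is parsed by calling `JustHTML(html)`, passing `JustHTML` as the `parse` argument should measure it on the same input. Results depend heavily on the machine and the document, so treat the numbers as relative.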

Comparison to other parsers

| Parser | Spec Compliant? | Pure Python? | Speed | Notes |
|---|---|---|---|---|
| JustHTML | ✅ Yes | ✅ Yes | ⚡ Fast | The sweet spot. Correct, easy to install, and fast enough. |
| html.parser | ❌ No | ✅ Yes | ⚡ Fast | Standard library. Chokes on malformed HTML. |
| lxml | ❌ No | ❌ No | 🚀 Very Fast | C-based. Fast but not spec-compliant (different output than browsers). |
| html5lib | ✅ Yes | ✅ Yes | 🐢 Slow | The reference implementation. Very correct but very slow. |
| BeautifulSoup | N/A | N/A | 🐢 Slow | Wrapper around other parsers. Slower and more memory hungry than the underlying parser. |
| gumbo / html5ever | ✅ Yes | ❌ No | 🚀 Very Fast | C/Rust based. Fast and correct, but requires compiling extensions. |

Installation

pip install justhtml

Example usage

Python API

from justhtml import JustHTML

html = "<html><body><div id='main'><p>Hello, <b>world</b>!</p></div></body></html>"
doc = JustHTML(html)

# 1. Traverse the tree
# The tree is made of SimpleDomNode objects.
# Each node has .name, .attrs, .children, and .parent
root = doc.root              # #document
html_node = root.children[0] # html
body = html_node.children[1] # body (children[0] is head)
div = body.children[0]       # div

print(f"Tag: {div.name}")
print(f"Attributes: {div.attrs}")

# 2. Pretty-print HTML
# You can serialize any node back to HTML
print(div.to_html())
# Output:
# <div id="main">
#   <p>
#     Hello,
#     <b>world</b>
#     !
#   </p>
# </div>
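
Because every node exposes `.name` and `.children`, walking the whole tree takes only a short recursive function. The sketch below uses a hypothetical `FakeNode` stand-in so that it runs without JustHTML installed; `outline` is likewise an illustrative helper, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical stand-in exposing the .name and .children attributes."""
    name: str
    children: list = field(default_factory=list)


def outline(node, depth=0):
    """Return one indented line per node, visiting the tree depth-first."""
    lines = ["  " * depth + node.name]
    # Tolerate nodes whose .children is missing or None (an assumption).
    for child in getattr(node, "children", None) or []:
        lines.extend(outline(child, depth + 1))
    return lines


tree = FakeNode("#document", [
    FakeNode("html", [
        FakeNode("head"),
        FakeNode("body", [FakeNode("div")]),
    ]),
])
print("\n".join(outline(tree)))
```

With a real document, `outline(doc.root)` should work the same way, assuming every node in the tree (text nodes included) carries a string `.name`.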

Command Line Interface

You can also use JustHTML from the command line to pretty-print HTML files:

# Parse a file
python -m justhtml index.html

# Parse from stdin (great for piping)
curl -s https://example.com | python -m justhtml -

Develop locally and run the tests

  1. Clone the repository:

    git clone git@github.com:EmilStenstrom/justhtml.git
    cd justhtml
    
  2. Install the library locally (there are no dependencies!):

    pip install -e .
    
  3. Run the tests:

    python run_tests.py
    

    For verbose output showing diffs on failures:

    python run_tests.py -v
    
  4. Run the benchmarks:

    python benchmark.py
    

License

MIT. Free for both commercial and non-commercial use.
