A pure Python HTML5 parser that just works.

Project description

JustHTML

HTML from the real web is messy. It is often malformed, user supplied, scraped from unknown pages, or headed for a browser where small parsing differences can become security bugs.

JustHTML gives Python projects one small dependency for the common HTML jobs:

  • parse HTML like a browser, including broken markup
  • sanitize untrusted HTML by default
  • query with CSS selectors
  • transform, serialize, extract text, or convert to Markdown
  • run anywhere Python runs, with no C extension and no system package to install

pip install justhtml

Requires Python 3.10 or later.

Documentation | Comparison | Playground | Security policy

JustHTML turns messy, unsafe HTML into a sanitized, queryable DOM, then serializes it to text, Markdown, or HTML.

Why Use It?

Most Python HTML libraries optimize for one part of the problem.

html.parser is built in, but it does not implement the HTML5 parsing algorithm. BeautifulSoup is convenient, but its results depend heavily on the parser underneath. lxml and C/Rust-backed parsers are fast, but usually leave sanitization as a separate concern. html5lib and Bleach shaped the Python ecosystem, but neither is the obvious foundation for new projects anymore.
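
To make the first point concrete, here is a minimal sketch that uses only the standard library. It logs the events html.parser reports for two unclosed paragraphs. The names EventLogger and log_events are invented for this illustration and are not part of JustHTML.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the token-level events that html.parser reports."""

    def __init__(self) -> None:
        super().__init__()
        self.events: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(html: str) -> list[tuple[str, str]]:
    """Feed a complete document to the logger and return its events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()  # flush any text still buffered
    return parser.events


events = log_events("<p>one<p>two")
print(events)
# No ("end", "p") event is reported: html.parser hands over tokens
# and leaves tree building and error recovery to the caller.
```

A browser's HTML5 parser closes the first paragraph implicitly when it sees the second `<p>`, producing two sibling elements. With html.parser, that recovery logic would have to be written by hand.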

JustHTML is for applications that want a boring, inspectable, pure-Python default:

  • Correct parsing: browser-style HTML5 recovery, tested against the official html5lib fixtures.
  • Safe by default: JustHTML(html) sanitizes before you query or serialize.
  • One DOM: parse once, then sanitize, query, transform, serialize, extract text, or produce Markdown.
  • Easy deployment: zero runtime dependencies, no compiler, works on PyPy and Pyodide.
  • Honest tradeoff: if you are parsing terabytes of trusted HTML, use a C/Rust parser. If you need reliable handling of untrusted or malformed HTML inside a Python app, use JustHTML.

Real-world signal: Mozilla Support migrated from Bleach to JustHTML in Kitsune, the Django application behind support.mozilla.org.

Quick Start

from justhtml import JustHTML

doc = JustHTML(
    "<p>Hello<script>alert(1)</script> "
    "<a href='javascript:alert(1)'>bad</a> "
    "<a href='https://example.com'>ok</a></p>",
    fragment=True,
)

print(doc.to_html(pretty=False))
# => <p>Hello <a>bad</a> <a href="https://example.com">ok</a></p>
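
In the output above, the javascript: link loses its href while the https: link keeps it. The general technique behind that kind of result is a URL scheme allowlist. The sketch below shows the idea with the standard library only. It is not JustHTML's implementation, and the allowlist contents and the name is_allowed_href are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption for illustration: JustHTML's real policy may allow other schemes.
ALLOWED_SCHEMES = frozenset({"http", "https", "mailto"})

# C0 control characters and space, which browsers strip from both ends of a URL.
_C0_CONTROL_OR_SPACE = "".join(chr(c) for c in range(0x21))


def is_allowed_href(url: str) -> bool:
    """Return True if the URL is relative or uses an allowlisted scheme."""
    # Browsers drop tabs and newlines anywhere in a URL before parsing,
    # so "java\tscript:" has to be treated as "javascript:".
    cleaned = url.translate({9: None, 10: None, 13: None})
    cleaned = cleaned.strip(_C0_CONTROL_OR_SPACE)
    try:
        scheme = urlsplit(cleaned).scheme  # urlsplit lowercases the scheme
    except ValueError:
        return False  # unparseable input is rejected
    # An empty scheme means a relative URL such as "/docs/page".
    return scheme == "" or scheme in ALLOWED_SCHEMES


for href in ("https://example.com", "javascript:alert(1)",
             "java\tscript:alert(1)", "/docs/page"):
    print(f"{href!r}: {is_allowed_href(href)}")
```

An allowlist is the safer direction for this check: unknown schemes are rejected by default, so a newly introduced scheme cannot slip through.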

Sanitization is enabled by default. Disable it only for trusted input:

doc = JustHTML("<main><p class='intro'>Hello</p></main>", sanitize=False)
intro = doc.query_one("p.intro")

print(intro.to_text())
# => Hello

What You Can Do

from justhtml import JustHTML, Linkify, SetAttrs, Unwrap

doc = JustHTML(
    "<p>Hello <span>world</span> example.com</p>",
    fragment=True,
    sanitize=False,
    transforms=[
        Unwrap("span"),
        Linkify(),
        SetAttrs("a", rel="nofollow"),
    ],
)

print(doc.to_html(pretty=False))
# => <p>Hello world <a href="http://example.com" rel="nofollow">example.com</a></p>

JustHTML includes:

Command Line

# Pretty-print an HTML file
justhtml index.html

# Parse from stdin
curl -s https://example.com | justhtml -

# Extract text from selected nodes
justhtml index.html --selector "main p" --format text

# Convert selected HTML to Markdown
justhtml index.html --selector "article" --format markdown

Correctness

JustHTML is tested against the official html5lib tree-construction, serializer, and encoding fixtures, plus project-specific sanitizer, selector, transform, CLI, and regression tests.

The current test suite enforces 100% combined line and branch coverage, including the parser engine. The parser engine additionally requires exact agreement with the reference path across the html5lib tree suite. See Correctness Testing for details.

Documentation

Security

JustHTML sanitizes by default, but output safety still depends on where you put it. HTML body output is not automatically safe inside JavaScript, CSS, URL attributes, or other contexts.
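
The sketch below illustrates why the context matters, using only the standard library and independent of JustHTML. The same untrusted string needs a different encoding for an HTML body, an inline script, and a URL component. The function names are invented for this example.

```python
import html
import json
from urllib.parse import quote

untrusted = "</script><script>alert('x')</script>"


def for_html_body(text: str) -> str:
    """Escape text for use between tags or inside a quoted attribute."""
    return html.escape(text, quote=True)


def for_inline_script(value) -> str:
    """Serialize a value as a JavaScript literal for use inside <script>."""
    literal = json.dumps(value)
    # json.dumps leaves <, > and & untouched, so "</script>" inside a string
    # would still end the script block. Escape them as \uXXXX sequences.
    return (
        literal.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


def for_url_component(text: str) -> str:
    """Percent-encode text for use as a single query value or path segment."""
    return quote(text, safe="")


print(for_html_body(untrusted))
print(for_inline_script(untrusted))
print(for_url_component("a b&c=d/e"))
```

Each helper is correct only for its own context. HTML-escaped text placed inside a script block, or a JSON literal placed inside an attribute, can still be unsafe.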

For the supported-version policy and vulnerability reporting, see SECURITY.md.

License

MIT. Free to use for commercial and non-commercial projects.

Download files

Source Distribution

justhtml-3.0.0.tar.gz (903.3 kB)

Uploaded Source

Built Distribution

justhtml-3.0.0-py3-none-any.whl (149.0 kB)

Uploaded Python 3

File details

Details for the file justhtml-3.0.0.tar.gz.

File metadata

  • Download URL: justhtml-3.0.0.tar.gz
  • Size: 903.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for justhtml-3.0.0.tar.gz
Algorithm Hash digest
SHA256 422a0c234d65079d816328a6a2b5e40473a250983f5401229d2854a531523c21
MD5 ea5668126ffa9500869640c107475119
BLAKE2b-256 18cc236d3e0699ebee2497a16f31e890b3abe86eab82780ba56e403b8997ed1c

Provenance

The following attestation bundles were made for justhtml-3.0.0.tar.gz:

Publisher: publish.yml on EmilStenstrom/justhtml

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file justhtml-3.0.0-py3-none-any.whl.

File metadata

  • Download URL: justhtml-3.0.0-py3-none-any.whl
  • Size: 149.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for justhtml-3.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ef8e8b6e94cf5aeeaf3c52b9b24f7e81daec1654cb71d70e1974d2024c8dc39b
MD5 51811066f9151c5899d8ce932e3f8e6e
BLAKE2b-256 61f3abe2e33bf8ed4a9297f16f76429fa37e37733a1c6853eed5ea770cfe564a

Provenance

The following attestation bundles were made for justhtml-3.0.0-py3-none-any.whl:

Publisher: publish.yml on EmilStenstrom/justhtml

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
