
WHATWG-compliant HTML5 toolkit: DFA tokenizer, spec-guided tree builder, DOM, configurable serializer, high-level cleaner, pretty-printer, and HTML to Markdown.




WizardHTML


WizardHTML is a WHATWG-compliant HTML5 toolkit for Python. It provides a DFA tokenizer, a spec-guided tree builder, a DOM, a configurable serializer, and helpers for cleaning, pretty-printing, and HTML→Markdown conversion.




Installation

Requires Python 3.9+.

pip install wizardhtml

Quick start

import wizardhtml as wh

# Mode A: text-only extraction
print(wh.clean_html("<div><p>Hello</p><script>x()</script></div>"))  # -> "Hello"

# Pretty print
html = "<body><p>Hi <b>there</b></p><img src=x></body>"
print(wh.beautiful_html(html, indent=2))

# HTML → Markdown
print(wh.html_to_markdown("<h1>T</h1><p>Body</p>"))

# Parser and DOM
doc = wh.parse("<!doctype html><html><body><p>Hi</p></body></html>")

Public API

| Function | Purpose |
| --- | --- |
| parse(html, fragment_context=None, return_errors=False) | Parse into Document or DocumentFragment; optional parse error list |
| clean_html(text, **flags) | HTML cleaning with modes A/B/C |
| beautiful_html(html, **opts) | Non-destructive pretty-printer |
| html_to_markdown(html) | HTML → Markdown (best-effort) |
| serialize(node, **opts) | Serialize DOM → HTML |
| to_text(html, separator="\n", strip=True, collapse_ws=True) | Extract readable text (Mode A + whitespace normalization) |

Parse

Parse HTML as a full document or as a fragment. Spec-like parse errors are collected when requested.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| html | str | required | Input HTML. |
| fragment_context | str or None | None | Context element name for fragment parsing (e.g., "div", "template", "tbody", "svg", "math"). |
| return_errors | bool | False | If True, return (node, errors:list[str]). |

  • Full document when fragment_context is None → returns Document.
  • Fragment parsing with context name (e.g. "div", "template", "tbody", "svg", "math") → returns DocumentFragment.
  • return_errors=True returns (node, list[str]).

import wizardhtml as wh

doc = wh.parse("<!doctype html><html><body><p>Hi</p></body></html>")
frag = wh.parse("<li>item</li>", fragment_context="ul")
node, errors = wh.parse("<p><b>x</p>", return_errors=True)
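Because return_errors changes the return shape, callers that want to reject malformed input may prefer a small wrapper. The sketch below is only an illustration: parse_strict, HTMLParseError, and stub_parse are hypothetical names that are not part of WizardHTML, and the stub's error message is made up. The only thing assumed about the real parse is the documented (node, errors) tuple.

```python
from typing import Any, Callable


class HTMLParseError(ValueError):
    """Hypothetical exception raised when the parser reports errors."""


def parse_strict(parse: Callable[..., Any], html: str, **kwargs: Any) -> Any:
    """Call a parse()-like function with return_errors=True and raise on errors."""
    node, errors = parse(html, return_errors=True, **kwargs)
    if errors:
        raise HTMLParseError("; ".join(errors))
    return node


def stub_parse(html, fragment_context=None, return_errors=False):
    """Stand-in for a real parser so the sketch runs on its own (not WizardHTML)."""
    errors = ["unclosed <b>"] if html.count("<b>") != html.count("</b>") else []
    node = f"node:{html}"
    return (node, errors) if return_errors else node


print(parse_strict(stub_parse, "<p>ok</p>"))
try:
    parse_strict(stub_parse, "<p><b>x</p>")
except HTMLParseError as exc:
    print("rejected:", exc)
```

With the library installed, the same wrapper would presumably be called as parse_strict(wh.parse, html, fragment_context="div").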

HTML cleaning

Clean HTML with granular flags. The flags select one of three modes: A) no flags provided (all None) → text only; B) at least one flag True → sanitized HTML; C) one or more flags provided and every provided flag False → text with the selected markup preserved.

Behavior

All three modes return a str, but its content differs:

| Mode | How to trigger | Output | Description |
| --- | --- | --- | --- |
| A – text-only | No parameters provided (all None) | str (plain text) | Extracts text, skips script-supporting tags, inserts safe spaces. |
| B – structural clean | At least one flag is True | str (serialized HTML) | Removes/unwraps per flags. Supports wildcard tag/attribute removal, content stripping, empty-tag pruning. |
| C – text with preservation | Parameters present and all False | str (text + preserved markup) | Extracts text but preserves groups explicitly set to False (and comments/doctype if set False). |
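The mode rules in the table can be read as a small decision function. This is a sketch of that reading, not the library's code: select_mode is a hypothetical name, and treating a non-empty list (as passed to remove_specific_tags) as "True" is an assumption based on the wildcard example returning HTML.

```python
from typing import Any


def select_mode(**flags: Any) -> str:
    """Return "A", "B", or "C" following the mode table's trigger rules."""
    # Flags left at None count as "not provided".
    provided = [value for value in flags.values() if value is not None]
    if not provided:
        return "A"  # nothing provided: text-only
    if any(bool(value) for value in provided):
        return "B"  # at least one truthy flag: structural clean, HTML out
    return "C"      # flags provided, all False: text with preserved markup


print(select_mode())                                          # A
print(select_mode(remove_script=True, remove_comments=False)) # B
print(select_mode(remove_comments=False))                     # C
```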

Parameters

  • text: str HTML input.
  • remove_script: Remove executable tags (<script>, <template>).
  • remove_metadata_tags: Remove metadata (<link>, <meta>, <base>, <noscript>, <style>, <title>).
  • remove_flow_tags: Remove flow content (<address>, <div>, <input>, …).
  • remove_sectioning_tags: Remove sectioning content (<article>, <aside>, <nav>, …).
  • remove_heading_tags: Remove heading tags (<h1> to <h6>).
  • remove_phrasing_tags: Remove phrasing content (<audio>, <code>, <textarea>, …).
  • remove_embedded_tags: Remove embedded content (<iframe>, <embed>, <img>).
  • remove_interactive_tags: Remove interactive content (<button>, <input>, <select>).
  • remove_palpable: Remove palpable elements (<address>, <math>, <table>, …).
  • remove_doctype: Remove <!DOCTYPE html>.
  • remove_comments: Remove HTML comments.
  • remove_specific_attributes: Remove specific attributes (supports wildcards).
  • remove_specific_tags: Remove specific tags (supports wildcards).
  • remove_empty_tags: Drop empty tags.
  • remove_content_tags: Remove content of given tags.
  • remove_tags_and_contents: Remove tags and their contents.

Examples

A) Text-only (no params)

import wizardhtml as wh
txt = wh.clean_html("<div><p>Hello</p><script>x()</script></div>")
print(txt)  # -> "Hello"
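For intuition about what text-only extraction involves, here is a minimal standard-library sketch. It does not use WizardHTML and is not its implementation: text_only and the SKIP set are assumptions made for illustration, and including style in the skipped tags is a choice made here, not something the README states.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    # Tags whose content is dropped entirely (assumed set for this sketch).
    SKIP = {"script", "template", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def text_only(html: str) -> str:
    """Collect text nodes, skip script-like content, join with single spaces."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


print(text_only("<div><p>Hello</p><script>x()</script></div>"))  # Hello
```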

B) Structural clean (HTML out)

import wizardhtml as wh

html = """
<html><head><title>x</title><script>evil()</script></head>
<body>
  <article><h1>Title</h1><img src="a.png"><p id="k" onclick="x()">hello</p></article><!-- comment -->
</body></html>
"""
out = wh.clean_html(
    html,
    remove_script=True,
    remove_metadata_tags=True,
    remove_embedded_tags=True,
    remove_specific_attributes=["id", "on*"],
    remove_empty_tags=True,
    remove_comments=True,
    remove_doctype=True,
)
print(out)

Output

<html>
<body>
  <article><h1>Title</h1><p>hello</p></article>

</body></html>

C) Text with preservation (False flags)

import wizardhtml as wh

html = "<html><body><article><h1>T</h1><p>Body</p><!-- c --></article></body></html>"
txt = wh.clean_html(
    html,
    remove_sectioning_tags=False,   # keep <article> in output
    remove_heading_tags=False,      # keep <h1> in output
    remove_comments=False,          # keep comments
)
print(txt)

Output

<article><h1>T</h1>Body<!-- c --></article>

Wildcard selectors

import wizardhtml as wh
html = '<div id="hero" data-track="x" onclick="h()"><img src="a.png"></div>'
out = wh.clean_html(
    html,
    remove_specific_attributes=["id", "data-*", "on*"],
    remove_specific_tags=["im_"],
)
print(out) 

Output

<html><head></head><body><div></div></body></html>
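One plausible way to interpret patterns such as "data-*" and "on*" is shell-style globbing. The sketch below uses the standard fnmatch module to show the idea on a plain attribute dict; whether WizardHTML matches patterns this way is an assumption, and strip_attributes is a hypothetical helper, not a library function.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable


def matches_any(name: str, patterns: Iterable[str]) -> bool:
    """True if name matches at least one glob-style pattern (case-sensitive)."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def strip_attributes(attrs: Dict[str, str], patterns: Iterable[str]) -> Dict[str, str]:
    """Return a copy of attrs without the attributes matched by patterns."""
    patterns = list(patterns)
    return {k: v for k, v in attrs.items() if not matches_any(k, patterns)}


attrs = {"id": "hero", "data-track": "x", "onclick": "h()", "class": "big"}
print(strip_attributes(attrs, ["id", "data-*", "on*"]))  # {'class': 'big'}
```

Note that fnmatch also gives special meaning to ? and [seq], which may or may not match the library's wildcard rules.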

to_text

Extract readable text using Mode A internally, then normalize whitespace and separators.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| html | str | required | Source HTML. |
| separator | str | "\n" | Line separator used in final string. |
| strip | bool | True | Trim leading/trailing whitespace. |
| collapse_ws | bool | True | Collapse runs of spaces and blank lines. |

import wizardhtml as wh
txt = wh.to_text("<div> A <b> B </b>\n\n <i>C</i></div>", separator=" ")
print(txt)  # "A B C"
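The normalization step can be pictured as a pure string transformation applied after text extraction. The following is a rough sketch under assumptions: normalize_text is a hypothetical name, and the exact rules (collapsing spaces and tabs per line, dropping blank lines) are one guess at what "collapse runs of spaces and blank lines" means, not the library's behavior.

```python
import re


def normalize_text(text: str, separator: str = "\n",
                   strip: bool = True, collapse_ws: bool = True) -> str:
    """Normalize whitespace line by line, then join lines with separator."""
    lines = text.splitlines()
    if collapse_ws:
        # Collapse runs of spaces/tabs inside each line and drop blank lines.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
        lines = [line for line in lines if line]
    result = separator.join(lines)
    return result.strip() if strip else result


print(normalize_text(" A  B \n\n  C ", separator=" "))  # A B C
```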

Beautiful HTML

Pretty-print HTML without changing semantics. Controls indentation, quoting, attribute ordering, whitespace, DOCTYPE.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| html | str | required | Raw HTML input. |
| indent | int | 2 | Spaces per level. |
| quote_attr_values | "always", "spec", or "legacy" | "spec" | Attribute quoting policy. |
| quote_char | '"' or "'" | '"' | Preferred quote char. |
| use_best_quote_char | bool | True | Auto-pick quote char to minimize escapes. |
| minimize_boolean_attributes | bool | False | Render boolean attributes in compact form (disabled instead of disabled="disabled"). |
| use_trailing_solidus | bool | False | Add / on void elements. |
| space_before_trailing_solidus | bool | True | Space before / if used. |
| escape_lt_in_attrs | bool | False | Escape < > in attributes. |
| escape_rcdata | bool | False | Escape inside RCData. |
| resolve_entities | bool | True | Prefer named entities. |
| alphabetical_attributes | bool | True | Sort attributes alphabetically. |
| strip_whitespace | bool | False | Trim/collapse text-node whitespace. |
| include_doctype | bool | True | Insert <!DOCTYPE html> if missing. |
| expand_mixed_content | bool | True | Put mixed-content children on own lines. |
| expand_empty_elements | bool | True | Render empty non-void on two lines. |

import wizardhtml as wh

html = """
<body>
  <button id='btn1' class="primary" disabled="disabled">
    Click   <b>me</b>
  </button>
  <img alt="Logo" src="/static/logo.png">
</body>
"""
pretty = wh.beautiful_html(
    html=html,
    indent=4,
    alphabetical_attributes=True,
    minimize_boolean_attributes=True,
    quote_attr_values="always",
    strip_whitespace=True,
    include_doctype=True,
    expand_mixed_content=True,
    expand_empty_elements=True,
)
print(pretty)
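The interaction between quote_char and use_best_quote_char can be illustrated with a small function. This is a sketch of the general technique (pick the quote that needs fewer escapes), not WizardHTML's serializer: quote_attr_value is a hypothetical helper, and the specific escapes chosen here (&amp;, &quot;, &#39;) are assumptions.

```python
def quote_attr_value(value: str, quote_char: str = '"',
                     use_best_quote_char: bool = True) -> str:
    """Wrap value in quotes, switching quote style when that avoids escapes."""
    other = "'" if quote_char == '"' else '"'
    chosen = quote_char
    if use_best_quote_char and value.count(quote_char) > value.count(other):
        chosen = other  # the alternative quote appears less often in the value
    escaped = value.replace("&", "&amp;")
    escaped = escaped.replace(chosen, "&quot;" if chosen == '"' else "&#39;")
    return f"{chosen}{escaped}{chosen}"


print(quote_attr_value('say "hi"'))                           # 'say "hi"'
print(quote_attr_value('a"b', use_best_quote_char=False))     # "a&quot;b"
```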

Serialization

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| node | Node | required | Document, DocumentFragment, Element, Text, Comment. |
| quote_attr_values | "spec", "legacy", or "always" | "spec" | Quoting policy. |
| quote_char | '"' or "'" | '"' | Preferred quote char. |
| use_best_quote_char | bool | True | Minimize escapes. |
| minimize_boolean_attributes | bool | False | Compact booleans. |
| resolve_entities | bool | True | Prefer named entities. |
| alphabetical_attributes | bool | False | Sort attributes. |
| strip_whitespace | bool | False | Trim/collapse text-node whitespace. |
| include_doctype | bool | True | Applies only when node is Document. |

Example

import wizardhtml as wh

doc = wh.parse("<!doctype html><html><body><p id='x'>Hi</p></body></html>")
print(wh.serialize(doc, alphabetical_attributes=True))        # with DOCTYPE
p = wh.parse("<p id='x'>Hi</p>", fragment_context="div")
print(wh.serialize(p, include_doctype=False))                 # no DOCTYPE for fragments
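To make alphabetical_attributes and minimize_boolean_attributes concrete, the sketch below renders a plain attribute dict the way such options are commonly understood. It is not the library's serializer: render_attrs and the small BOOLEAN_ATTRS set are assumptions for illustration only.

```python
from html import escape
from typing import Dict

# A few HTML boolean attributes; an illustrative subset, not a complete list.
BOOLEAN_ATTRS = {"checked", "disabled", "hidden", "readonly", "required", "selected"}


def render_attrs(attrs: Dict[str, str],
                 alphabetical_attributes: bool = False,
                 minimize_boolean_attributes: bool = False) -> str:
    """Render attributes as a string, optionally sorted and with compact booleans."""
    items = sorted(attrs.items()) if alphabetical_attributes else list(attrs.items())
    parts = []
    for name, value in items:
        is_boolean = name in BOOLEAN_ATTRS and value.lower() in ("", name)
        if minimize_boolean_attributes and is_boolean:
            parts.append(name)  # disabled="disabled" becomes disabled
        else:
            parts.append(f'{name}="{escape(value, quote=True)}"')
    return " ".join(parts)


attrs = {"id": "btn1", "class": "primary", "disabled": "disabled"}
print(render_attrs(attrs, alphabetical_attributes=True,
                   minimize_boolean_attributes=True))
# class="primary" disabled id="btn1"
```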

HTML → Markdown

Best-effort conversion of common HTML structures to Markdown; falls back to original HTML if conversion is unsafe.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| html | str | required | Raw HTML input. |

Example

import wizardhtml as wh

md = wh.html_to_markdown("<h1>Hello</h1><p>World</p>")
print(md)

Output

# Hello

World
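The mapping in this example (headings to # prefixes, paragraphs to blank-line-separated blocks) can be sketched with the standard library alone. This toy converter handles only h1 to h6 and p and ignores everything else; TinyMarkdown and tiny_markdown are hypothetical names, and nothing here reflects how html_to_markdown is actually implemented.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TinyMarkdown(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self.prefix = None  # None means "not inside a heading or paragraph"
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.prefix, self.buf = "#" * int(tag[1]) + " ", []
        elif tag == "p":
            self.prefix, self.buf = "", []

    def handle_data(self, data):
        if self.prefix is not None:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if self.prefix is not None and (tag in HEADINGS or tag == "p"):
            text = " ".join("".join(self.buf).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None


def tiny_markdown(html: str) -> str:
    """Convert headings and paragraphs to Markdown blocks separated by blank lines."""
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


print(tiny_markdown("<h1>Hello</h1><p>World</p>"))
```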

License

AGPL-3.0-or-later.



Contact & Author

Author: Mattia Rubino
Email: texwhizard.dev@gmail.com
