WizardHTML

WizardHTML is a WHATWG-compliant HTML5 toolkit for Python. It provides a DFA tokenizer, a spec-guided tree builder, a DOM, a configurable serializer, and helpers for cleaning, pretty-printing, and HTML→Markdown conversion.


Installation

Requires Python 3.9+.

pip install wizardhtml

Quick start

import wizardhtml as wh

# Mode A: text-only extraction
print(wh.clean_html("<div><p>Hello</p><script>x()</script></div>"))  # -> "Hello"

# Pretty print
html = "<body><p>Hi <b>there</b></p><img src=x></body>"
print(wh.beautiful_html(html, indent=2))

# HTML → Markdown
print(wh.html_to_markdown("<h1>T</h1><p>Body</p>"))

# Parser and DOM
doc = wh.parse("<!doctype html><html><body><p>Hi</p></body></html>")

Public API

| Function | Purpose |
| --- | --- |
| parse(html, fragment_context=None, return_errors=False) | Parse into Document or DocumentFragment; optional parse error list |
| clean_html(text, **flags) | HTML cleaning with modes A/B/C |
| beautiful_html(html, **opts) | Non-destructive pretty-printer |
| html_to_markdown(html) | HTML → Markdown (best-effort) |
| serialize(node, **opts) | Serialize DOM → HTML |
| to_text(html, separator="\n", strip=True, collapse_ws=True) | Extract readable text (Mode A + whitespace normalization) |

Parse

Parse HTML as full document or fragment. Collect spec-like parse errors when requested.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| html | str | required | Input HTML. |
| fragment_context | str \| None | None | Context element name for fragment parsing (e.g., "div", "template", "tbody", "svg", "math"). |
| return_errors | bool | False | If True, return (node, errors:list[str]). |

  • Full document when fragment_context is None → returns Document.
  • Fragment parsing with context name (e.g. "div", "template", "tbody", "svg", "math") → returns DocumentFragment.
  • return_errors=True returns (node, list[str]).

import wizardhtml as wh

doc = wh.parse("<!doctype html><html><body><p>Hi</p></body></html>")
frag = wh.parse("<li>item</li>", fragment_context="ul")
node, errors = wh.parse("<p><b>x</p>", return_errors=True)

HTML cleaning

Clean HTML with granular flags. Three modes: A) all None → text-only, B) any True → HTML sanitized, C) all provided and False → text with selected markup preserved.

Behavior

All three modes return a str; what the string contains depends on the mode:

| Mode | How to trigger | Output | Description |
| --- | --- | --- | --- |
| A – text-only | No parameters provided (all None) | str (plain text) | Extracts text, skips script-supporting tags, inserts safe spaces. |
| B – structural clean | At least one flag is True | str (serialized HTML) | Removes/unwraps per flags. Supports wildcard tag/attribute removal, content stripping, empty-tag pruning. |
| C – text with preservation | Parameters present and all False | str (text + preserved markup) | Extracts text but preserves groups explicitly set to False (and comments/doctype if set False). |
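The trigger rules in the table can be sketched as a small decision function. This is only an illustration of the documented behavior, not WizardHTML's code: select_mode is a hypothetical helper, and treating list-valued options such as remove_specific_tags as "set" is an assumption inferred from the wildcard example, whose output is HTML.

```python
def select_mode(**flags):
    """Return "A", "B", or "C" for a set of clean_html-style keyword flags.

    Hypothetical helper: it mirrors the mode table, not the library internals.
    """
    # Flags left at None count as "not provided".
    provided = {name: value for name, value in flags.items() if value is not None}
    if not provided:
        return "A"  # nothing provided: text-only extraction
    # Assumption: any truthy value (True or a non-empty list) selects Mode B.
    if any(bool(value) for value in provided.values()):
        return "B"  # structural clean, serialized HTML out
    return "C"      # everything provided is False: text with preserved markup


print(select_mode())                                              # A
print(select_mode(remove_script=True, remove_comments=False))     # B
print(select_mode(remove_comments=False))                         # C
```

A mix of True and False flags lands in Mode B, because a single True is enough to trigger it.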

Parameters

  • text: str HTML input.
  • remove_script: Remove executable tags (<script>, <template>).
  • remove_metadata_tags: Remove metadata (<link>, <meta>, <base>, <noscript>, <style>, <title>).
  • remove_flow_tags: Remove flow content (<address>, <div>, <input>, …).
  • remove_sectioning_tags: Remove sectioning content (<article>, <aside>, <nav>, …).
  • remove_heading_tags: Remove heading tags (<h1> through <h6>).
  • remove_phrasing_tags: Remove phrasing content (<audio>, <code>, <textarea>, …).
  • remove_embedded_tags: Remove embedded content (<iframe>, <embed>, <img>).
  • remove_interactive_tags: Remove interactive content (<button>, <input>, <select>).
  • remove_palpable: Remove palpable elements (<address>, <math>, <table>, …).
  • remove_doctype: Remove <!DOCTYPE html>.
  • remove_comments: Remove HTML comments.
  • remove_specific_attributes: Remove specific attributes (supports wildcards).
  • remove_specific_tags: Remove specific tags (supports wildcards).
  • remove_empty_tags: Drop empty tags.
  • remove_content_tags: Remove content of given tags.
  • remove_tags_and_contents: Remove tags and their contents.
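To make the wildcard support of remove_specific_attributes and remove_specific_tags more concrete, here is a minimal sketch of pattern-based attribute filtering. It assumes shell-style globbing as implemented by the standard library's fnmatch module; the exact pattern syntax WizardHTML accepts is not specified here, and matches_any and filter_attrs are hypothetical names.

```python
from fnmatch import fnmatchcase


def matches_any(name, patterns):
    """True if name matches at least one glob pattern (case-insensitive)."""
    return any(fnmatchcase(name.lower(), pattern.lower()) for pattern in patterns)


def filter_attrs(attrs, patterns):
    """Return a copy of attrs without the names matched by patterns."""
    return {name: value for name, value in attrs.items()
            if not matches_any(name, patterns)}


attrs = {"id": "hero", "data-track": "x", "onclick": "h()", "class": "big"}
print(filter_attrs(attrs, ["id", "data-*", "on*"]))  # {'class': 'big'}
```

A pattern such as "on*" is a convenient way to drop every inline event handler without listing each one.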

Examples

A) Text-only (no params)

import wizardhtml as wh
txt = wh.clean_html("<div><p>Hello</p><script>x()</script></div>")
print(txt)  # -> "Hello"
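For intuition about what text-only extraction involves, the sketch below does something similar with the standard library's html.parser. It is not how WizardHTML works internally (the library uses its own tokenizer and tree builder); TextOnly and text_only are hypothetical names, and only <script> and <template> are skipped here.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, ignoring anything inside script-supporting tags."""

    SKIP = {"script", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.parts.append(text)


def text_only(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Join with single spaces so adjacent blocks do not run together.
    return " ".join(parser.parts)


print(text_only("<div><p>Hello</p><script>x()</script></div>"))  # Hello
```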

B) Structural clean (HTML out)

import wizardhtml as wh

html = """
<html><head><title>x</title><script>evil()</script></head>
<body>
  <article><h1>Title</h1><img src="a.png"><p id="k" onclick="x()">hello</p></article><!-- comment -->
</body></html>
"""
out = wh.clean_html(
    html,
    remove_script=True,
    remove_metadata_tags=True,
    remove_embedded_tags=True,
    remove_specific_attributes=["id", "on*"],
    remove_empty_tags=True,
    remove_comments=True,
    remove_doctype=True,
)
print(out)

Output

<html>
<body>
  <article><h1>Title</h1><p>hello</p></article>

</body></html>

C) Text with preservation (False flags)

import wizardhtml as wh

html = "<html><body><article><h1>T</h1><p>Body</p><!-- c --></article></body></html>"
txt = wh.clean_html(
    html,
    remove_sectioning_tags=False,   # keep <article> in output
    remove_heading_tags=False,      # keep <h1> in output
    remove_comments=False,          # keep comments
)
print(txt)

Output

<article><h1>T</h1>Body<!-- c --></article>

Wildcard selectors

import wizardhtml as wh
html = '<div id="hero" data-track="x" onclick="h()"><img src="a.png"></div>'
out = wh.clean_html(
    html,
    remove_specific_attributes=["id", "data-*", "on*"],
    remove_specific_tags=["im_"],
)
print(out) 

Output

<html><head></head><body><div></div></body></html>

to_text

Extract readable text using Mode A internally, then normalize whitespace and separators.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| html | str | required | Source HTML. |
| separator | str | "\n" | Line separator used in final string. |
| strip | bool | True | Trim leading/trailing whitespace. |
| collapse_ws | bool | True | Collapse runs of spaces and blank lines. |

import wizardhtml as wh
txt = wh.to_text("<div> A <b> B </b>\n\n <i>C</i></div>", separator=" ")
print(txt)  # "A B C"
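The whitespace options can be pictured with a small standalone function. This is a rough approximation of the behavior described in the table, written for illustration only; normalize_text is a hypothetical name and the library's actual rules may differ in edge cases.

```python
import re


def normalize_text(text, separator="\n", strip=True, collapse_ws=True):
    """Approximate the separator / strip / collapse_ws options on plain text."""
    lines = text.splitlines()
    if collapse_ws:
        # Collapse runs of spaces and tabs inside each line.
        lines = [re.sub(r"[ \t]+", " ", line) for line in lines]
    if strip:
        lines = [line.strip() for line in lines]
    if collapse_ws:
        # Drop lines that contain only whitespace.
        lines = [line for line in lines if line.strip()]
    result = separator.join(lines)
    return result.strip() if strip else result


print(normalize_text(" A   B \n\n C ", separator=" "))  # A B C
```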

Beautiful HTML

Pretty-print HTML without changing semantics. Controls indentation, quoting, attribute ordering, whitespace, DOCTYPE.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| html | str | required | Raw HTML input. |
| indent | int | 2 | Spaces per level. |
| quote_attr_values | "always" \| "spec" \| "legacy" | "spec" | Attribute quoting policy. |
| quote_char | "\"" or "'" | " | Preferred quote char. |
| use_best_quote_char | bool | True | Auto-pick quote char to minimize escapes. |
| minimize_boolean_attributes | bool | False | Render compact booleans (e.g. disabled instead of disabled="disabled"). |
| use_trailing_solidus | bool | False | Add / on void elements. |
| space_before_trailing_solidus | bool | True | Space before / if used. |
| escape_lt_in_attrs | bool | False | Escape < > in attributes. |
| escape_rcdata | bool | False | Escape inside RCData. |
| resolve_entities | bool | True | Prefer named entities. |
| alphabetical_attributes | bool | True | Sort attributes alphabetically. |
| strip_whitespace | bool | False | Trim/collapse text-node whitespace. |
| include_doctype | bool | True | Insert <!DOCTYPE html> if missing. |
| expand_mixed_content | bool | True | Put mixed-content children on own lines. |
| expand_empty_elements | bool | True | Render empty non-void on two lines. |

import wizardhtml as wh

html = """
<body>
  <button id='btn1' class="primary" disabled="disabled">
    Click   <b>me</b>
  </button>
  <img alt="Logo" src="/static/logo.png">
</body>
"""
pretty = wh.beautiful_html(
    html=html,
    indent=4,
    alphabetical_attributes=True,
    minimize_boolean_attributes=True,
    quote_attr_values="always",
    strip_whitespace=True,
    include_doctype=True,
    expand_mixed_content=True,
    expand_empty_elements=True,
)
print(pretty)
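To show what alphabetical_attributes and minimize_boolean_attributes mean for a single tag, here is a small standalone sketch applied to the button attributes from the example above. render_attrs is a hypothetical helper, and treating an attribute as boolean when its value is empty or equal to its own name is an assumption, not a documented WizardHTML rule.

```python
from html import escape


def render_attrs(attrs, alphabetical=True, minimize_boolean=False):
    """Render an attribute dict as it might appear inside a start tag."""
    items = sorted(attrs.items()) if alphabetical else list(attrs.items())
    parts = []
    for name, value in items:
        # Assumption: value "" or value == name marks a boolean attribute.
        if minimize_boolean and value in ("", name):
            parts.append(name)
        else:
            parts.append(f'{name}="{escape(value, quote=True)}"')
    return " ".join(parts)


attrs = {"id": "btn1", "class": "primary", "disabled": "disabled"}
print(render_attrs(attrs, minimize_boolean=True))
# class="primary" disabled id="btn1"
```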

Serialization

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| node | Node | required | Document, DocumentFragment, Element, Text, Comment. |
| quote_attr_values | "spec" \| "legacy" \| "always" | "spec" | Quoting policy. |
| quote_char | "\"" or "'" | " | Preferred quote char. |
| use_best_quote_char | bool | True | Minimize escapes. |
| minimize_boolean_attributes | bool | False | Compact booleans. |
| resolve_entities | bool | True | Prefer named entities. |
| alphabetical_attributes | bool | False | Sort attributes. |
| strip_whitespace | bool | False | Trim/collapse text-node whitespace. |
| include_doctype | bool | True | Applies only when node is Document. |

Example

import wizardhtml as wh

doc = wh.parse("<!doctype html><html><body><p id='x'>Hi</p></body></html>")
print(wh.serialize(doc, alphabetical_attributes=True))        # with DOCTYPE
frag = wh.parse("<p id='x'>Hi</p>", fragment_context="div")   # returns a DocumentFragment
print(wh.serialize(frag, include_doctype=False))              # no DOCTYPE for fragments
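The idea behind use_best_quote_char can be sketched in a few lines: pick whichever quote character needs fewer escapes for a given attribute value. This is a generic illustration under that assumption, not the library's serializer; best_quote_char and quote_attr are hypothetical names.

```python
def best_quote_char(value, preferred='"'):
    """Pick the quote char that needs fewer escapes; ties go to preferred."""
    other = "'" if preferred == '"' else '"'
    return other if value.count(preferred) > value.count(other) else preferred


def quote_attr(value, preferred='"'):
    """Quote an attribute value, escaping only & and the chosen quote char."""
    quote = best_quote_char(value, preferred)
    entity = "&quot;" if quote == '"' else "&#39;"
    escaped = value.replace("&", "&amp;").replace(quote, entity)
    return f"{quote}{escaped}{quote}"


print(quote_attr('say "hi"'))  # 'say "hi"'
print(quote_attr("it's"))      # "it's"
```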

HTML → Markdown

Best-effort conversion of common HTML structures to Markdown; falls back to original HTML if conversion is unsafe.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| html | str | required | Raw HTML input. |

Example

import wizardhtml as wh

md = wh.html_to_markdown("<h1>Hello</h1><p>World</p>")
print(md)

Output

# Hello

World
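As a rough picture of what a best-effort conversion does for the simplest structures, the sketch below maps headings and paragraphs to Markdown with the standard library's html.parser. It is deliberately tiny and is not WizardHTML's converter; MiniMarkdown and mini_markdown are hypothetical names, and everything other than <h1> to <h6> and <p> is reduced to its text.

```python
from html.parser import HTMLParser

HEADINGS = {f"h{level}": "#" * level + " " for level in range(1, 7)}


class MiniMarkdown(HTMLParser):
    """Turn headings and paragraphs into Markdown blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished Markdown blocks
        self.prefix = ""   # "# ", "## ", ... for the current block
        self.buffer = []   # text collected for the current block

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self.flush()
            self.prefix = HEADINGS.get(tag, "")

    def handle_endtag(self, tag):
        if tag in HEADINGS or tag == "p":
            self.flush()

    def handle_data(self, data):
        self.buffer.append(data)

    def flush(self):
        # Collapse internal whitespace and emit the block if it has text.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.buffer = []
        self.prefix = ""


def mini_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text outside a block element
    return "\n\n".join(parser.blocks)


print(mini_markdown("<h1>Hello</h1><p>World</p>"))
```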

License

AGPL-3.0-or-later.

Contact & Author

Author: Mattia Rubino
Email: texwhizard.dev@gmail.com
