WHATWG-compliant HTML5 toolkit: DFA tokenizer, spec-guided tree builder, DOM, configurable serializer, high-level cleaner, pretty-printer, and HTML to Markdown.

WizardHTML

WizardHTML is a Python library that provides a WHATWG-compliant HTML5 toolkit: a DFA tokenizer, a spec-guided tree builder, a DOM, a configurable serializer, and helpers for cleaning, pretty-printing, and HTML→Markdown conversion.

Installation
Quick start
Public API
Parsing
HTML cleaning
Text helper
Beautiful HTML
Serialization
HTML → Markdown
License
Resources
Contact & Author

Installation

Requires Python 3.9+.

pip install wizardhtml

Quick start

import wizardhtml as wh

# Mode A: text-only extraction
print(wh.clean_html("<div><p>Hello</p><script>x()</script></div>"))  # -> "Hello"

# Pretty print
html = "<body><p>Hi <b>there</b></p><img src=x></body>"
print(wh.beautiful_html(html, indent=2))

# HTML → Markdown
print(wh.html_to_markdown("<h1>T</h1><p>Body</p>"))

# Parser and DOM
doc = wh.parse("<!doctype html><html><body><p>Hi</p></body></html>")

Public API

| Function | Purpose |
| --- | --- |
| `parse(html, fragment_context=None, return_errors=False)` | Parse into `Document` or `DocumentFragment`; optional parse error list |
| `clean_html(text, **flags)` | HTML cleaning with modes A/B/C |
| `beautiful_html(html, **opts)` | Non-destructive pretty-printer |
| `html_to_markdown(html)` | HTML → Markdown (best-effort) |
| `serialize(node, **opts)` | Serialize DOM → HTML |
| `to_text(html, separator="\n", strip=True, collapse_ws=True)` | Extract readable text (Mode A + whitespace normalization) |

`parse`

Parse HTML as a full document or as a fragment, and optionally collect spec-style parse errors.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `html` | `str` | required | Input HTML. |
| `fragment_context` | `str \| None` | `None` | Context element name for fragment parsing (e.g., `"div"`, `"template"`, `"tbody"`, `"svg"`, `"math"`). |
| `return_errors` | `bool` | `False` | If `True`, return `(node, errors:list[str])`. |

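When `fragment_context` is `None`, the input is parsed as a full document and a `Document` is returned.
When a context element name is given (e.g. `"div"`, `"template"`, `"tbody"`, `"svg"`, `"math"`), the input is parsed as a fragment and a `DocumentFragment` is returned.
With `return_errors=True`, the result is a tuple `(node, errors)`, where `errors` is a `list[str]`.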

import wizardhtml as wh

doc = wh.parse("<!doctype html><html><body><p>Hi</p></body></html>")
frag = wh.parse("<li>item</li>", fragment_context="ul")
node, errors = wh.parse("<p><b>x</p>", return_errors=True)
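
Where parse errors should count as failures (for example in a test suite), the `(node, errors)` tuple can be wrapped in a small helper. The following is only a sketch: `parse_strict` is a hypothetical name, not part of WizardHTML, and it receives the parse function as an argument so that it depends on nothing beyond the documented `return_errors=True` behavior.

```python
from typing import Any, Callable, Optional


def parse_strict(
    parse: Callable[..., Any],
    html: str,
    fragment_context: Optional[str] = None,
) -> Any:
    """Parse `html` and raise if the parser reports any errors.

    `parse` is assumed to behave like the documented `wh.parse`: called with
    `return_errors=True`, it returns `(node, errors)`, where `errors` is a
    list of strings.
    """
    node, errors = parse(html, fragment_context=fragment_context, return_errors=True)
    if errors:
        # Surface every reported error in one exception message.
        raise ValueError(f"{len(errors)} parse error(s): " + "; ".join(errors))
    return node
```

With the library installed, `parse_strict(wh.parse, html)` would return the parsed node when the error list is empty and raise `ValueError` otherwise.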

HTML cleaning

Clean HTML with granular flags. The flag values select one of three modes: A) all flags left as `None` → text only; B) at least one flag `True` → sanitized HTML; C) flags provided and all `False` → text with the selected markup preserved.

Behavior

There are three modes. All of them return a `str`, but the content differs:

| Mode | How to trigger | Output | Description |
| --- | --- | --- | --- |
| A – text-only | No parameters provided (all `None`) | `str` (plain text) | Extracts text, skips script-supporting tags, inserts safe spaces. |
| B – structural clean | At least one flag is `True` | `str` (serialized HTML) | Removes/unwraps per flags. Supports wildcard tag/attribute removal, content stripping, empty-tag pruning. |
| C – text with preservation | Parameters present and all `False` | `str` (text + preserved markup) | Extracts text but preserves groups explicitly set to `False` (and comments/doctype if set `False`). |
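
The selection rule in the table above can be written down as a few lines of Python. This is a sketch of the documented rule only, not WizardHTML's implementation: `select_mode` is a hypothetical helper, and it considers boolean flags only, since how list-valued options such as `remove_specific_tags` influence the mode is not stated here.

```python
from typing import Optional


def select_mode(**flags: Optional[bool]) -> str:
    """Return "A", "B", or "C" following the rules in the table above."""
    # Flags left as None count as "not provided".
    provided = [value for value in flags.values() if value is not None]
    if not provided:
        return "A"  # nothing set: text-only extraction
    if any(value is True for value in provided):
        return "B"  # at least one True: structural clean, HTML out
    return "C"      # flags given, all False: text with preserved markup


print(select_mode())                                          # A
print(select_mode(remove_script=True, remove_comments=False)) # B
print(select_mode(remove_heading_tags=False))                 # C
```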

Parameters

text: HTML input (`str`).
remove_script: Remove executable tags (<script>, <template>).
remove_metadata_tags: Remove metadata (<link>, <meta>, <base>, <noscript>, <style>, <title>).
remove_flow_tags: Remove flow content (<address>, <div>, <input>, …).
remove_sectioning_tags: Remove sectioning content (<article>, <aside>, <nav>, …).
remove_heading_tags: Remove heading tags (<h1>–<h6>).
remove_phrasing_tags: Remove phrasing content (<audio>, <code>, <textarea>, …).
remove_embedded_tags: Remove embedded content (<iframe>, <embed>, <img>).
remove_interactive_tags: Remove interactive content (<button>, <input>, <select>).
remove_palpable: Remove palpable elements (<address>, <math>, <table>, …).
remove_doctype: Remove <!DOCTYPE html>.
remove_comments: Remove HTML comments.
remove_specific_attributes: Remove specific attributes (supports wildcards).
remove_specific_tags: Remove specific tags (supports wildcards).
remove_empty_tags: Drop empty tags.
remove_content_tags: Remove content of given tags.
remove_tags_and_contents: Remove tags and their contents.

Examples

A) Text-only (no params)

import wizardhtml as wh
txt = wh.clean_html("<div><p>Hello</p><script>x()</script></div>")
print(txt)  # -> "Hello"

B) Structural clean (HTML out)

import wizardhtml as wh

html = """
<html><head><title>x</title><script>evil()</script></head>
<body>
  <article><h1>Title</h1><img src="a.png"><p id="k" onclick="x()">hello</p></article><!-- comment -->
</body></html>
"""
out = wh.clean_html(
    html,
    remove_script=True,
    remove_metadata_tags=True,
    remove_embedded_tags=True,
    remove_specific_attributes=["id", "on*"],
    remove_empty_tags=True,
    remove_comments=True,
    remove_doctype=True,
)
print(out)

Output

<html>
<body>
  <article><h1>Title</h1><p>hello</p></article>

</body></html>

C) Text with preservation (False flags)

import wizardhtml as wh

html = "<html><body><article><h1>T</h1><p>Body</p><!-- c --></article></body></html>"
txt = wh.clean_html(
    html,
    remove_sectioning_tags=False,   # keep <article> in output
    remove_heading_tags=False,      # keep <h1> in output
    remove_comments=False,          # keep comments
)
print(txt)

Output

<article><h1>T</h1>Body<!-- c --></article>

Wildcard selectors

import wizardhtml as wh
html = '<div id="hero" data-track="x" onclick="h()"><img src="a.png"></div>'
out = wh.clean_html(
    html,
    remove_specific_attributes=["id", "data-*", "on*"],
    remove_specific_tags=["im_"],
)
print(out)

Output

<html><head></head><body><div></div></body></html>
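
To make the effect of patterns such as `"data-*"` and `"on*"` concrete, the sketch below filters an attribute dictionary with the standard library's `fnmatch`. It assumes that `*` matches any run of characters, as in shell globs; WizardHTML's actual matching rules may differ (the `"im_"` tag pattern above is not modeled), and `drop_matching_attributes` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable


def drop_matching_attributes(attrs: Dict[str, str], patterns: Iterable[str]) -> Dict[str, str]:
    """Return a copy of `attrs` without the names matched by any pattern."""
    lowered = [pattern.lower() for pattern in patterns]
    return {
        name: value
        for name, value in attrs.items()
        # HTML attribute names are case-insensitive, so compare in lowercase.
        if not any(fnmatchcase(name.lower(), pattern) for pattern in lowered)
    }


attrs = {"id": "hero", "data-track": "x", "onclick": "h()", "class": "c"}
print(drop_matching_attributes(attrs, ["id", "data-*", "on*"]))  # {'class': 'c'}
```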

to_text

Extract readable text using Mode A internally, then normalize whitespace and separators.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `html` | `str` | required | Source HTML. |
| `separator` | `str` | `"\n"` | Line separator used in final string. |
| `strip` | `bool` | `True` | Trim leading/trailing whitespace. |
| `collapse_ws` | `bool` | `True` | Collapse runs of spaces and blank lines. |

import wizardhtml as wh
txt = wh.to_text("<div> A <b> B </b>\n\n <i>C</i></div>", separator=" ")
print(txt)  # "A B C"
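
For readers who want to see the kind of processing involved, here is a rough standard-library approximation of the steps described above: collect text while skipping script-like elements, then normalize whitespace. It is a sketch under assumptions, not the library's code: `sketch_to_text` and `_TextCollector` are hypothetical names, the skipped tag set is a guess, and unlike Mode A it does not insert spaces between adjacent elements.

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of script-like elements."""

    SKIPPED = {"script", "style", "template"}  # assumed set, for illustration

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def sketch_to_text(html: str, separator: str = "\n",
                   strip: bool = True, collapse_ws: bool = True) -> str:
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    lines = "".join(collector.chunks).split("\n")
    if collapse_ws:
        # Collapse runs of whitespace within each line, then drop blank lines.
        lines = [re.sub(r"\s+", " ", line).strip() for line in lines]
        lines = [line for line in lines if line]
    text = separator.join(lines)
    return text.strip() if strip else text


print(sketch_to_text("<div> A <b> B </b>\n\n <i>C</i></div>", separator=" "))  # A B C
```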

Beautiful HTML

Pretty-print HTML without changing its semantics. Options control indentation, attribute quoting and ordering, whitespace handling, and DOCTYPE insertion.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `html` | `str` | required | Raw HTML input. |
| `indent` | `int` | `2` | Spaces per level. |
| `quote_attr_values` | `"always" \| "spec" \| "legacy"` | `"spec"` | Attribute quoting policy. |
| `quote_char` | `"\""` or `"'"` | `"` | Preferred quote char. |
| `use_best_quote_char` | `bool` | `True` | Auto-pick quote char to minimize escapes. |
| `minimize_boolean_attributes` | `bool` | `False` | Render compact booleans (`disabled`). |
| `use_trailing_solidus` | `bool` | `False` | Add `/` on void elements. |
| `space_before_trailing_solidus` | `bool` | `True` | Space before `/` if used. |
| `escape_lt_in_attrs` | `bool` | `False` | Escape `<` `>` in attributes. |
| `escape_rcdata` | `bool` | `False` | Escape text inside RCDATA elements. |
| `resolve_entities` | `bool` | `True` | Prefer named entities. |
| `alphabetical_attributes` | `bool` | `True` | Sort attributes alphabetically. |
| `strip_whitespace` | `bool` | `False` | Trim/collapse text-node whitespace. |
| `include_doctype` | `bool` | `True` | Insert `<!DOCTYPE html>` if missing. |
| `expand_mixed_content` | `bool` | `True` | Put mixed-content children on their own lines. |
| `expand_empty_elements` | `bool` | `True` | Render empty non-void elements on two lines. |

import wizardhtml as wh

html = """
<body>
  <button id='btn1' class="primary" disabled="disabled">
    Click   <b>me</b>
  </button>
  <img alt="Logo" src="/static/logo.png">
</body>
"""
pretty = wh.beautiful_html(
    html=html,
    indent=4,
    alphabetical_attributes=True,
    minimize_boolean_attributes=True,
    quote_attr_values="always",
    strip_whitespace=True,
    include_doctype=True,
    expand_mixed_content=True,
    expand_empty_elements=True,
)
print(pretty)
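
The idea behind `use_best_quote_char` ("auto-pick quote char to minimize escapes") can be illustrated in a few lines. This is a sketch of one plausible strategy, not the library's actual logic: `pick_quote_char` and `quote_attribute` are hypothetical helpers, and the tie-breaking rule (keep the preferred character) is an assumption.

```python
def pick_quote_char(value: str, preferred: str = '"') -> str:
    """Pick the quote character that needs fewer escapes inside `value`."""
    other = "'" if preferred == '"' else '"'
    # Switch only when the alternative strictly reduces the number of escapes.
    return other if value.count(other) < value.count(preferred) else preferred


def quote_attribute(name: str, value: str, preferred: str = '"') -> str:
    """Render name=value, escaping '&' and the chosen quote character."""
    quote = pick_quote_char(value, preferred)
    entity = "&quot;" if quote == '"' else "&#39;"
    escaped = value.replace("&", "&amp;").replace(quote, entity)
    return f"{name}={quote}{escaped}{quote}"


print(quote_attribute("title", 'say "hi"'))  # title='say "hi"'
print(quote_attribute("alt", "it's here"))   # alt="it's here"
```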

Serialization

Serialize a DOM node to an HTML string.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `node` | `Node` | required | `Document`, `DocumentFragment`, `Element`, `Text`, `Comment`. |
| `quote_attr_values` | `"spec" \| "legacy" \| "always"` | `"spec"` | Quoting policy. |
| `quote_char` | `"\""` or `"'"` | `"` | Preferred quote char. |
| `use_best_quote_char` | `bool` | `True` | Minimize escapes. |
| `minimize_boolean_attributes` | `bool` | `False` | Compact booleans. |
| `resolve_entities` | `bool` | `True` | Prefer named entities. |
| `alphabetical_attributes` | `bool` | `False` | Sort attributes. |
| `strip_whitespace` | `bool` | `False` | Trim/collapse text-node whitespace. |
| `include_doctype` | `bool` | `True` | Applies only when `node` is `Document`. |

Example

import wizardhtml as wh

doc = wh.parse("<!doctype html><html><body><p id='x'>Hi</p></body></html>")
print(wh.serialize(doc, alphabetical_attributes=True))        # with DOCTYPE
frag = wh.parse("<p id='x'>Hi</p>", fragment_context="div")  # a DocumentFragment, not an element
print(wh.serialize(frag, include_doctype=False))              # no DOCTYPE for fragments
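
The difference between the `"spec"` and `"always"` quoting policies can be sketched with the HTML syntax rule for unquoted attribute values: such a value must be non-empty and must not contain ASCII whitespace, `"`, `'`, `=`, `<`, `>`, or a backtick. The code below assumes that `"spec"` means "quote only when that rule requires it"; this reading, and the helper names `needs_quotes` and `render_attribute`, are assumptions. `"legacy"` is left out because its behavior is not described here.

```python
import re

# Characters that may not appear in an unquoted attribute value.
_UNSAFE_UNQUOTED = re.compile(r"""[\t\n\f\r "'=<>`]""")


def needs_quotes(value: str) -> bool:
    """True if `value` cannot be written as an unquoted attribute value."""
    return value == "" or _UNSAFE_UNQUOTED.search(value) is not None


def render_attribute(name: str, value: str, policy: str = "spec") -> str:
    if policy not in ("spec", "always"):
        # "legacy" is intentionally not modeled in this sketch.
        raise ValueError(f"unsupported policy in this sketch: {policy!r}")
    escaped = value.replace("&", "&amp;")
    if policy == "always" or needs_quotes(value):
        return f'{name}="{escaped.replace(chr(34), "&quot;")}"'
    return f"{name}={escaped}"


print(render_attribute("id", "x"))                   # id=x
print(render_attribute("id", "x", policy="always"))  # id="x"
print(render_attribute("class", "a b"))              # class="a b"
```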

HTML → Markdown

Best-effort conversion of common HTML structures to Markdown; falls back to the original HTML where conversion would be unsafe.

Parameters

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `html` | `str` | required | Raw HTML input. |

Example

import wizardhtml as wh

md = wh.html_to_markdown("<h1>Hello</h1><p>World</p>")
print(md)

Output

# Hello

World
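
To show what a best-effort conversion involves for the simplest structures, the sketch below maps headings and paragraphs to Markdown using only the standard library. It is not WizardHTML's converter: `sketch_html_to_markdown` is a hypothetical function, it handles only `<h1>`–`<h6>` and `<p>`, and it flattens inline markup to plain text.

```python
from html.parser import HTMLParser
from typing import List, Optional


def _is_heading(tag: str) -> bool:
    return len(tag) == 2 and tag[0] == "h" and tag[1] in "123456"


class _MarkdownSketch(HTMLParser):
    """Turn headings and paragraphs into Markdown blocks."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks: List[str] = []
        self._prefix: Optional[str] = None  # None means "outside any block"
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if _is_heading(tag):
            self._start_block("#" * int(tag[1]) + " ")
        elif tag == "p":
            self._start_block("")

    def handle_endtag(self, tag):
        if tag == "p" or _is_heading(tag):
            self.end_block()

    def handle_data(self, data):
        if self._prefix is not None:
            self._buffer.append(data)

    def _start_block(self, prefix: str) -> None:
        self.end_block()  # an unclosed previous block ends here
        self._prefix = prefix

    def end_block(self) -> None:
        if self._prefix is not None:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if text:
                self.blocks.append(self._prefix + text)
        self._prefix = None
        self._buffer = []


def sketch_html_to_markdown(html: str) -> str:
    parser = _MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser.end_block()  # flush a block left open at end of input
    return "\n\n".join(parser.blocks)


print(sketch_html_to_markdown("<h1>Hello</h1><p>World</p>"))
```

For the input shown, this prints `# Hello`, a blank line, and `World`, matching the output above.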

License

AGPL-3.0-or-later.

Resources

Contact & Author

Author: Mattia Rubino
Email: texwhizard.dev@gmail.com

Release history

1.0.1 (Aug 29, 2025)
1.0.0 (Aug 29, 2025), this version

Download files

Source Distribution

wizardhtml-1.0.0.tar.gz (136.5 kB)

Uploaded Aug 29, 2025 Source

Built Distribution

wizardhtml-1.0.0-cp311-cp311-win_amd64.whl (177.4 kB)

Uploaded Aug 29, 2025 CPython 3.11, Windows x86-64

File details

Details for the file wizardhtml-1.0.0.tar.gz.

File metadata

Download URL: wizardhtml-1.0.0.tar.gz
Upload date: Aug 29, 2025
Size: 136.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.6

File hashes

Hashes for wizardhtml-1.0.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `777821c46d4741e598956d445b7ac493f11144889131c070e2650057a0d78625` |
| MD5 | `c27dccc146a16013fac0114353370471` |
| BLAKE2b-256 | `bdb3eac1c8f06a4075d13f7b3ba40c1b6990df563e4bb9ddc0f48fd906ec7ef7` |

File details

Details for the file wizardhtml-1.0.0-cp311-cp311-win_amd64.whl.

File metadata

Download URL: wizardhtml-1.0.0-cp311-cp311-win_amd64.whl
Upload date: Aug 29, 2025
Size: 177.4 kB
Tags: CPython 3.11, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.6

File hashes

Hashes for wizardhtml-1.0.0-cp311-cp311-win_amd64.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `be3f4443c170dff4116bf0ad012c7ffd60fad41791a959ebef938e1af77e3979` |
| MD5 | `5ceed0bf8b7449878ffeb9ab8210bcc1` |
| BLAKE2b-256 | `9e65b3f25f2c7f38c4a5db7de05ee5cab76144610e52b46c487bf5304e9c0600` |
