A package that reduces DOM data (HTML or JS) without losing information, so that it fits into an LLM's context window.

These details have not been verified by PyPI

Project description

domreducer

domreducer is a Python library for programmatically stripping down and sanitizing HTML documents into a minimal, text-focused form. It is suited to preparing web pages for LLM ingestion, text analysis, or any context where only the essential structural content is needed.

Features

Parse a full HTML document into a DOM tree.
Strip out non-structural nodes (scripts, style blocks, comments).
Strip out non-visual nodes (invisible elements, metadata).
Simplify attributes (remove inline styles, classes, IDs where possible).
Collapse deeply nested single-child containers.
Prune repetitive or boilerplate navigation/footer elements.
Reduce large inline SVGs or images to lightweight placeholders.
Preserve tables, definition lists, lists, figures, and CSS-styled tables as Markdown.
Detect single-page app “JS shells” and abort reduction so you can fall back to a JS-enabled fetch.
Minify remaining whitespace.

All steps are tracked in a ReduceOperation result, which includes before/after sizes, token counts, and any reasons for aborting.
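To make the first and last features concrete, here is a minimal sketch of stripping scripts, style blocks, and comments, then minifying whitespace. It uses only the standard library and is not domreducer's implementation; the names NonStructuralStripper, strip_non_structural, and minify_whitespace (as a free function) are hypothetical and chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Tags whose whole content is dropped in this sketch (an assumption,
# not necessarily the list domreducer uses).
SKIPPED_TAGS = {"script", "style"}


class NonStructuralStripper(HTMLParser):
    """Re-emit HTML while dropping scripts, style blocks, and comments."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, with no end tag.
        if tag not in SKIPPED_TAGS and self._skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self._skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_depth == 0:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        # Preserve declarations such as <!DOCTYPE html>.
        self.out.append(f"<!{decl}>")

    # handle_comment is deliberately not overridden: the default
    # implementation does nothing, so comments are dropped.


def strip_non_structural(html: str) -> str:
    parser = NonStructuralStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def minify_whitespace(html: str) -> str:
    # Naive: collapses every whitespace run, including inside <pre>.
    return re.sub(r"\s+", " ", html).strip()
```

The whitespace step here is intentionally crude; a production version would likely need to leave preformatted blocks untouched.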

Installation

pip install domreducer

Quickstart

from domreducer import HtmlReducer

raw_html = "<html>…your full page…</html>"

# Run the full reduction pipeline (aborts if a JS shell is detected)
op = HtmlReducer(raw_html).reduce()

if not op.success:
    print("Reduction aborted:", op.error or op.js_method_needed)
else:
    print("Original size:", op.total_char, "chars")
    print("Reduced size:", op.reduced_char, "chars")
    print("Steps details:", op.reduction_details)
    clean_html = op.reduced_data
    # …use clean_html…

Custom Pipeline

Choose only the steps you want or disable JS-shell abort:

op = HtmlReducer(raw_html).reduce(
    order=[
        "parse_the_full_dom_into_a_dom_tree",
        "strip_out_non_structural_nodes",
        "simplify_attributes",
        "minify_whitespace",
    ],
    abort_on_js_shell=False,
)
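
If you would rather start from the full pipeline and drop a few steps, a small helper can build the order list. This is a sketch: FULL_PIPELINE is copied from the "Available steps" list in the API Reference, while pipeline_without is a hypothetical helper that is not part of domreducer.

```python
# Step names as listed under "Available steps (in pipeline order)".
FULL_PIPELINE = [
    "parse_the_full_dom_into_a_dom_tree",
    "strip_out_non_structural_nodes",
    "strip_out_non_visual_nodes",
    "simplify_attributes",
    "collapse_deeply_nested_container_with_one_child",
    "prune_repetitive_and_boilerplate_navigation_items",
    "reduce_large_inline_SVGs_or_images_to_lightweight_placeholders",
    "preserve_tables_as_markdown",
    "preserve_deflists_as_markdown",
    "preserve_lists_as_markdown",
    "preserve_figures_as_markdown",
    "preserve_css_tables_as_markdown",
    "strip_tailwind_utility_classes",
    "drop_row_ids_inside_large_tables",
    "minify_whitespace",
]


def pipeline_without(*skipped):
    """Return the full pipeline, in order, minus the named steps."""
    unknown = set(skipped) - set(FULL_PIPELINE)
    if unknown:
        # Fail early on typos rather than silently running every step.
        raise ValueError(f"unknown step(s): {sorted(unknown)}")
    return [step for step in FULL_PIPELINE if step not in skipped]


order = pipeline_without(
    "prune_repetitive_and_boilerplate_navigation_items",
    "strip_tailwind_utility_classes",
)
# order can then be passed as HtmlReducer(raw_html).reduce(order=order)
```

Whether later steps still work when the parse step is omitted is not documented, so this sketch assumes you keep it.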

API Reference

`HtmlReducer(html: str)`

Constructor takes your raw HTML string.

`.reduce(order: List[str] = None, abort_on_js_shell: bool = True) → ReduceOperation`

order: list of step names (in the order to apply). Defaults to the full pipeline.
abort_on_js_shell: if True, calls .is_probably_js_shell() after parsing and, if a JS shell is detected, returns an aborted ReduceOperation.

Available steps (in pipeline order):

parse_the_full_dom_into_a_dom_tree
strip_out_non_structural_nodes
strip_out_non_visual_nodes
simplify_attributes
collapse_deeply_nested_container_with_one_child
prune_repetitive_and_boilerplate_navigation_items
reduce_large_inline_SVGs_or_images_to_lightweight_placeholders
preserve_tables_as_markdown
preserve_deflists_as_markdown
preserve_lists_as_markdown
preserve_figures_as_markdown
preserve_css_tables_as_markdown
strip_tailwind_utility_classes
drop_row_ids_inside_large_tables
minify_whitespace
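
The heuristic behind .is_probably_js_shell() is not documented here. As a rough illustration of what such a check could look like, the sketch below flags a page that loads scripts but carries almost no visible text. The name looks_like_js_shell, the set of hidden tags, and the 200-character threshold are all assumptions made for this example, not the library's logic.

```python
from html.parser import HTMLParser

# Tags whose text is not counted as visible content (an assumption).
HIDDEN_TAGS = {"script", "style", "noscript", "title"}


class _TextAndScriptCounter(HTMLParser):
    """Count visible text characters and <script> tags in a page."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self.visible_chars += len(data.strip())


def looks_like_js_shell(html: str, min_visible_chars: int = 200) -> bool:
    """Guess whether a page is an empty shell that needs JS to render."""
    counter = _TextAndScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars


shell = ('<html><head><title>App</title></head><body>'
         '<div id="root"></div><script src="/bundle.js"></script></body></html>')
article = "<html><body><p>" + "word " * 100 + "</p></body></html>"
print(looks_like_js_shell(shell), looks_like_js_shell(article))
```

A fixed threshold will misclassify very short static pages, which is one reason to keep abort_on_js_shell configurable.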

`ReduceOperation`

The object returned by .reduce(), with attributes:

success: bool — True if the reduction ran to completion; False if it was aborted (e.g. JS shell) or an error occurred.
error: Optional[str] — any error message.
js_method_needed: bool — True if aborted due to JS-shell detection.
total_char: int — character length before reduction.
total_token: int — approximate token count before reduction.
reduced_char: int — character length after reduction.
reduced_token: int — approximate token count after reduction.
raw_data: str — original HTML.
reduced_data: str — the cleaned, reduced HTML.
reduction_details: dict — per-step Δchars/Δtokens, and any flags like "aborted": "js_shell_detected".
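
The size attributes above are enough to report how much a reduction saved. A minimal sketch, assuming only the documented total_char, reduced_char, total_token, and reduced_token attributes; the helper savings_report is hypothetical, and a SimpleNamespace stands in for a real ReduceOperation so that the example runs on its own.

```python
from types import SimpleNamespace


def savings_report(op) -> dict:
    """Return the percentage of characters and tokens removed."""

    def pct_saved(before: int, after: int) -> float:
        # Guard against empty input to avoid division by zero.
        if before == 0:
            return 0.0
        return round(100.0 * (before - after) / before, 1)

    return {
        "chars_saved_pct": pct_saved(op.total_char, op.reduced_char),
        "tokens_saved_pct": pct_saved(op.total_token, op.reduced_token),
    }


# Stand-in for a ReduceOperation, with made-up numbers for illustration.
fake_op = SimpleNamespace(total_char=2000, reduced_char=500,
                          total_token=480, reduced_token=120)
print(savings_report(fake_op))
```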

Contributing

Fork the repo
Create your feature branch (git checkout -b my-feature)
Commit your changes (git commit -am 'Add feature')
Push to the branch (git push origin my-feature)
Open a Pull Request

Project details

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Release history

0.0.4 (Jun 30, 2025)
0.0.3 (Jun 11, 2025), this version
0.0.1 (Jun 8, 2025)

Download files

Source Distribution

domreducer-0.0.3.tar.gz (10.3 kB)

Uploaded Jun 11, 2025 Source

Built Distribution

domreducer-0.0.3-py3-none-any.whl (9.8 kB)

Uploaded Jun 11, 2025 Python 3

File details

Details for the file domreducer-0.0.3.tar.gz.

File metadata

Download URL: domreducer-0.0.3.tar.gz
Upload date: Jun 11, 2025
Size: 10.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for domreducer-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`477bf108a3cb5ed0a7d9d76d262bf202a5a1606c1ab15214c784cf44cab043bc`
MD5	`48a23ff8f9dce162b7c2f7396a7ef9ba`
BLAKE2b-256	`22586801fc84cfd7a1d299b168f2cbf45508ad337f7d92a566e1136c2b1cd4b3`

File details

Details for the file domreducer-0.0.3-py3-none-any.whl.

File metadata

Download URL: domreducer-0.0.3-py3-none-any.whl
Upload date: Jun 11, 2025
Size: 9.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for domreducer-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7ec65e64c8b01ef9377676d0fdcae4f9e44e9807f4e3cdb5b18887ee0754d922`
MD5	`d69498fe238e0510a2eb4fff9771de81`
BLAKE2b-256	`ab623cbb9f4094a8afca24b05376bbed579a7f74d0356dadf51d6a27f1611a9e`
