A package to reduce DOM data (HTML or JS) without losing information, so it can fit into LLM context windows.
Project description
domreducer
domreducer is a Python library for programmatically stripping down and sanitizing HTML documents into a minimal, text-focused form. It’s perfect for preparing web pages for LLM ingestion, text analysis, or any context where you only need the essential structural content.
Features
- Parse a full HTML document into a DOM tree.
- Strip out non-structural nodes (scripts, style blocks, comments).
- Strip out non-visual nodes (invisible elements, metadata).
- Simplify attributes (remove inline styles, classes, IDs where possible).
- Collapse deeply nested single-child containers.
- Prune repetitive or boilerplate navigation/footer elements.
- Reduce large inline SVGs or images to lightweight placeholders.
- Preserve tables, definition lists, lists, figures, and CSS-styled tables as Markdown.
- Detect single-page app “JS shells” and abort reduction so you can fall back to a JS-enabled fetch.
- Minify remaining whitespace.
All steps are tracked in a ReduceOperation result, which includes before/after sizes, token counts, and any reasons for aborting.
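As a rough illustration of two of the steps above (stripping non-structural nodes and minifying whitespace), here is a minimal standard-library sketch. It is not domreducer's implementation: the class `NonStructuralStripper` and the function `strip_and_minify` are hypothetical names, and the returned keys merely borrow the `total_char` / `reduced_char` / `reduced_data` names from the result object described here.

```python
import re
from html import escape
from html.parser import HTMLParser


class NonStructuralStripper(HTMLParser):
    """Re-emit HTML while dropping <script>/<style> blocks and comments."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")  # keep <!DOCTYPE ...>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag not in self.SKIPPED and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))  # re-escape text

    # handle_comment is intentionally not overridden, so comments are dropped.


def strip_and_minify(raw_html: str) -> dict:
    parser = NonStructuralStripper()
    parser.feed(raw_html)
    parser.close()
    stripped = "".join(parser.out)
    # Crude whitespace minification; it would also squash <pre> content.
    reduced = re.sub(r"\s+", " ", stripped).strip()
    return {
        "total_char": len(raw_html),
        "reduced_char": len(reduced),
        "reduced_data": reduced,
    }


raw = (
    "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
    "<body><!-- note --><p>Hello   world</p></body></html>"
)
result = strip_and_minify(raw)
print(result["reduced_data"])
```

A real reducer would need to treat whitespace-sensitive elements such as `<pre>` separately; the sketch ignores that for brevity.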
Installation
pip install domreducer
Quickstart
from domreducer import HtmlReducer
raw_html = "<html>…your full page…</html>"
# Run the full reduction pipeline (aborts if a JS shell is detected)
op = HtmlReducer(raw_html).reduce()
if not op.success:
    print("Reduction aborted:", op.error or op.js_method_needed)
else:
    print("Original size:", op.total_char, "chars")
    print("Reduced size:", op.reduced_char, "chars")
    print("Steps details:", op.reduction_details)
    clean_html = op.reduced_data
    # …use clean_html…
Custom Pipeline
Choose only the steps you want or disable JS-shell abort:
op = HtmlReducer(raw_html).reduce(
    order=[
        "parse_the_full_dom_into_a_dom_tree",
        "strip_out_non_structural_nodes",
        "simplify_attributes",
        "minify_whitespace",
    ],
    abort_on_js_shell=False,
)
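To make the idea of an ordered list of step names concrete, the following sketch shows one way such a pipeline could be dispatched. It is an assumption about the general pattern, not the library's code: `STEPS`, `run_pipeline`, and the two regex-based step functions are hypothetical stand-ins that only reuse two step names from this page.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple


def strip_comments(html: str) -> str:
    # Stand-in for a "non-structural" step: drops HTML comments only.
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


def minify_whitespace(html: str) -> str:
    # Collapse every whitespace run to a single space.
    return re.sub(r"\s+", " ", html).strip()


# Hypothetical registry: step name -> function. Insertion order is the default order.
STEPS: Dict[str, Callable[[str], str]] = {
    "strip_out_non_structural_nodes": strip_comments,
    "minify_whitespace": minify_whitespace,
}


def run_pipeline(html: str, order: Optional[List[str]] = None) -> Tuple[str, Dict[str, int]]:
    """Apply the named steps in order and record characters removed per step."""
    names = list(STEPS) if order is None else order
    unknown = [name for name in names if name not in STEPS]
    if unknown:
        raise ValueError(f"unknown step(s): {unknown}")
    details: Dict[str, int] = {}
    for name in names:
        before = len(html)
        html = STEPS[name](html)
        details[name] = before - len(html)
    return html, details


page = "<p>a</p>  <!-- x -->  <p>b</p>"
full_out, full_details = run_pipeline(page)
partial_out, partial_details = run_pipeline(page, order=["minify_whitespace"])
print(full_out, full_details)
print(partial_out, partial_details)
```

Rejecting unknown names up front keeps a typo in `order` from silently skipping a step.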
API Reference
HtmlReducer(html: str)
Constructor takes your raw HTML string.
.reduce(order: List[str] = None, abort_on_js_shell: bool = True) → ReduceOperation
- order: list of step names (in the order to apply). Defaults to the full pipeline.
- abort_on_js_shell: if True, calls .is_probably_js_shell() after parsing and, when a JS shell is detected, returns an aborted ReduceOperation.
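This page does not say how .is_probably_js_shell() decides. Purely as an illustration, one plausible heuristic is "the page loads scripts but carries almost no visible text". The sketch below implements that guess with the standard library; `looks_like_js_shell` and the 200-character threshold are assumptions, not the library's logic.

```python
from html.parser import HTMLParser

HIDDEN_TAGS = {"script", "style", "noscript"}


class TextAndScriptCounter(HTMLParser):
    """Count <script> tags and characters of text outside hidden elements."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self.hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth == 0:
            self.visible_chars += len(data.strip())


def looks_like_js_shell(html: str, min_visible_chars: int = 200) -> bool:
    """Hypothetical heuristic: scripts present but very little visible text."""
    counter = TextAndScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars


shell_page = (
    '<html><head><title>App</title></head><body><div id="root"></div>'
    '<script src="/bundle.js"></script></body></html>'
)
article_page = (
    "<html><body><p>" + "word " * 100 + "</p>"
    "<script>track()</script></body></html>"
)
print(looks_like_js_shell(shell_page), looks_like_js_shell(article_page))
```

When a check like this fires, the caller would typically re-fetch the page with a JS-enabled client and run the reduction on the rendered HTML instead.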
Available steps (in pipeline order):
- parse_the_full_dom_into_a_dom_tree
- strip_out_non_structural_nodes
- strip_out_non_visual_nodes
- simplify_attributes
- collapse_deeply_nested_container_with_one_child
- prune_repetitive_and_boilerplate_navigation_items
- reduce_large_inline_SVGs_or_images_to_lightweight_placeholders
- preserve_tables_as_markdown
- preserve_deflists_as_markdown
- preserve_lists_as_markdown
- preserve_figures_as_markdown
- preserve_css_tables_as_markdown
- strip_tailwind_utility_classes
- drop_row_ids_inside_large_tables
- minify_whitespace
ReduceOperation
The object returned by .reduce(), with attributes:
- success: bool. True if reduction ran through; False if aborted (e.g. JS shell) or if an error occurred.
- error: Optional[str]. Any error message.
- js_method_needed: bool. True if aborted due to JS-shell detection.
- total_char: int. Character length before reduction.
- total_token: int. Approximate token count before reduction.
- reduced_char: int. Character length after reduction.
- reduced_token: int. Approximate token count after reduction.
- raw_data: str. Original HTML.
- reduced_data: str. The cleaned, reduced HTML.
- reduction_details: dict. Per-step Δchars/Δtokens, and any flags like "aborted": "js_shell_detected".
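The attributes above can be pictured as a plain data container. The sketch below mirrors the documented field names in a hypothetical dataclass, `ReduceOperationSketch`, and adds an illustrative `summarize` helper; neither is part of domreducer, and the default values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReduceOperationSketch:
    """Hypothetical stand-in with the same attribute names as ReduceOperation."""

    success: bool
    total_char: int
    total_token: int
    reduced_char: int = 0
    reduced_token: int = 0
    raw_data: str = ""
    reduced_data: str = ""
    error: Optional[str] = None
    js_method_needed: bool = False
    reduction_details: dict = field(default_factory=dict)


def summarize(op) -> str:
    """One-line summary for any object exposing the attributes listed above."""
    if not op.success:
        reason = "JS shell detected" if op.js_method_needed else (op.error or "unknown error")
        return f"aborted: {reason}"
    saved = op.total_char - op.reduced_char
    pct = 100.0 * saved / op.total_char if op.total_char else 0.0
    return f"{op.total_char} -> {op.reduced_char} chars ({pct:.1f}% smaller)"


ok = ReduceOperationSketch(
    success=True, total_char=1000, total_token=250,
    reduced_char=250, reduced_token=60,
)
aborted = ReduceOperationSketch(
    success=False, total_char=500, total_token=120,
    js_method_needed=True,
    reduction_details={"aborted": "js_shell_detected"},
)
print(summarize(ok))
print(summarize(aborted))
```

Because `summarize` relies only on attribute names, it should also accept the object returned by .reduce(), provided that object exposes the attributes as documented.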
Contributing
- Fork the repo
- Create your feature branch (git checkout -b my-feature)
- Commit your changes (git commit -am 'Add feature')
- Push to the branch (git push origin my-feature)
- Open a Pull Request
Download files
File details
Details for the file domreducer-0.0.4.tar.gz.
File metadata
- Download URL: domreducer-0.0.4.tar.gz
- Upload date:
- Size: 10.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a566c7dd47c0d02fa6fd17e1eb589df2ac6747cf25c4160b554c18d478e9baf |
| MD5 | 0a7e70111363911d04253641a7a62bff |
| BLAKE2b-256 | 471bc352ff1da6ee278d15a040b1adfd3e68f8e280edf3cdd12d8ba5e88fc357 |
File details
Details for the file domreducer-0.0.4-py3-none-any.whl.
File metadata
- Download URL: domreducer-0.0.4-py3-none-any.whl
- Upload date:
- Size: 10.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f1ebc65f0e568766e7012a43f90b7aa0c75a49ec05d424bd9a45e42ff153cb17 |
| MD5 | 68e995605c4b8b37f9b74c81f471f353 |
| BLAKE2b-256 | c9cc5c3d88df9711b4aeaf54919c5390a5650ff64eff9d7241d708cc3b76a574 |