Simple Python helper library that can significantly reduce LLM input tokens by removing unnecessary page code
html-shrinker
Simple Python helper library that can significantly reduce LLM input tokens by removing unnecessary page code (configurable).
AI scraping usually means sending the full page source to an LLM together with instructions and an output format. The information needed is almost always somewhere in the body tag, so the whole head tag, which holds styles, scripts, and metadata the model does not need, can usually be removed safely. This alone reduces token counts and costs significantly. Further optimizations are possible, such as removing specific HTML tags, attributes, or even the inner text.
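To make the head-removal idea concrete, here is a minimal standard-library sketch. It is not html-shrinker's implementation: the `drop_head` helper and the sample page are made up for illustration, and character count is used only as a rough stand-in for token count.

```python
import re

# Matches a <head>...</head> block; \b keeps it from matching <header>.
HEAD_RE = re.compile(r"<head\b.*?</head\s*>", re.IGNORECASE | re.DOTALL)


def drop_head(html: str) -> str:
    """Return the page source with the whole <head> element removed."""
    return HEAD_RE.sub("", html)


# Hypothetical page: a style-heavy head and a tiny body.
page = (
    "<html><head><style>"
    + "body{margin:0}" * 200
    + "</style></head><body><p>Price: 42</p></body></html>"
)
smaller = drop_head(page)
saved = 1 - len(smaller) / len(page)
print(f"{len(page)} -> {len(smaller)} characters ({saved:.0%} smaller)")
```

A regular expression is enough for a demonstration like this, but it is fragile on real pages (for example when `</head>` appears inside a script string), which is one reason to use a proper parser in practice.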
What it does
- Removes noisy tags/attributes or keeps only a whitelist
- Strips inner text
- Removes comments
- Flattens repeated single-child div > div wrappers, even if they are nested many levels deep
- Collapses whitespace between tags
Quick install
pip install html-shrinker
Quick start
from html_shrinker import HTMLShrinker
from html_shrinker.defaults import tags
raw_html = """
<html>
<head><script>ignore me</script></head>
<body>
<div><div><p>Hello world</p></div></div>
<script>alert("x")</script>
</body>
</html>
"""
shrinker = HTMLShrinker(
    tags=list(tags),
)
result = shrinker.shrink(raw_html)
print(result)
API
from html_shrinker import HTMLShrinker
shrinker = HTMLShrinker(
    tag_mode="remove",
    tags=["script", "style", "head"],
    attribute_mode="remove",
    attributes=["class", "id", "style"],
    strip_innertext=False,
    remove_comments=True,
    flatten_single_child_divs=True,
    collapse_between_tags=True,
)
output = shrinker.shrink("<html>...</html>")
Default presets are available from:
from html_shrinker.defaults import tags, arguments
Invalid HTML input raises InvalidHTMLInputError:
from html_shrinker import HTMLShrinker, InvalidHTMLInputError
try:
    HTMLShrinker().shrink("<div>fragment</div>")
except InvalidHTMLInputError as exc:
    print(exc)
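The example above suggests that bare fragments are rejected, while the other examples on this page pass complete `<html>` documents. Assuming that wrapping a fragment in `<html><body>` is enough to make it acceptable (this README does not confirm that), a hypothetical helper such as `wrap_fragment` could be applied before shrinking:

```python
import re


def wrap_fragment(markup: str) -> str:
    """Wrap markup in <html><body> unless it already has an <html> tag.

    Hypothetical helper, not part of html-shrinker.
    """
    if re.search(r"<html[\s>]", markup, re.IGNORECASE):
        return markup
    return f"<html><body>{markup}</body></html>"


print(wrap_fragment("<div>fragment</div>"))
# <html><body><div>fragment</div></body></html>
```

The result can then be passed to `shrink()` inside the same `try`/`except` block, so inputs that are still invalid are reported rather than silently skipped.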
Configuration
HTMLShrinker(...) constructor parameters:
- tag_mode: "remove" or "keep" (default: "remove")
- tags: list[str]
  - If tag_mode="remove": these tags are removed.
  - If tag_mode="keep": only these tags are kept.
- attribute_mode: "remove" or "keep" (default: "remove")
- attributes: list[str]
  - If attribute_mode="remove": these attributes are removed.
  - If attribute_mode="keep": only these attributes are kept.
- strip_innertext: bool (default: False)
  - If True: removes text nodes.
  - Example: <p>secret</p> becomes <p></p>.
- remove_comments: bool (default: True)
  - If True: removes HTML comments such as <!-- comment -->.
- flatten_single_child_divs: bool (default: True)
  - If True: flattens nested div > div wrappers when a div contains only one child div.
  - This is applied repeatedly, so a deep chain like <div><div><div><p>...</p></div></div></div> becomes <div><p>...</p></div>.
  - These long div chains appear very often when shrinking aggressively.
- collapse_between_tags: bool (default: True)
  - If True: removes whitespace between tags, so "> <" becomes "><".
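The flattening and whitespace-collapsing options can be illustrated with a small standard-library sketch. This is an approximation under stated assumptions, not the library's code: it uses `xml.etree.ElementTree`, so it only works on well-formed markup, the helper names are invented, and which `div` of a chain survives (here the innermost) may differ from what html-shrinker does.

```python
import re
import xml.etree.ElementTree as ET


def _is_wrapper(element):
    """True for a div whose only content is a single child div."""
    if element.tag != "div" or len(element) != 1 or element[0].tag != "div":
        return False
    has_text = (element.text or "").strip() or (element[0].tail or "").strip()
    return not has_text


def _flatten_once(root):
    """Replace the first wrapper div found with its child; report success."""
    for parent in root.iter():
        for index, child in enumerate(parent):
            if _is_wrapper(child):
                inner = child[0]
                inner.tail = child.tail
                parent[index] = inner
                return True  # stop right after mutating the tree
    return False


def flatten_and_collapse(markup):
    root = ET.fromstring(markup)
    while _flatten_once(root):  # repeat until no wrapper is left
        pass
    text = ET.tostring(root, encoding="unicode")
    return re.sub(r">\s+<", "><", text)  # drop whitespace between tags


sample = "<body>\n  <div><div><div><p>Hello</p></div></div></div>\n</body>"
print(flatten_and_collapse(sample))
# <body><div><p>Hello</p></div></body>
```

Restarting the scan after every replacement keeps the loop simple and avoids iterating over a tree while it is being modified, at the cost of extra passes on large documents.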
Notes:
- Default tags is empty ([]), so no tags are removed by default.
- Default attributes is empty ([]), so no attributes are removed by default.
- If tag_mode="keep", tags must be non-empty.
- If attribute_mode="keep", attributes must be non-empty.
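The mode rules in these notes boil down to one small decision function. The sketch below is only an illustration of that logic: `is_kept` is a hypothetical name, and raising `ValueError` for an empty keep list is an assumption, since this README does not say which exception the library raises.

```python
def is_kept(name, mode, names):
    """Decide whether a tag or attribute survives under the given mode."""
    if mode not in ("remove", "keep"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "keep" and not names:
        # Assumed behavior: an empty whitelist would keep nothing at all.
        raise ValueError("keep mode requires a non-empty list")
    listed = name in names
    return listed if mode == "keep" else not listed


print(is_kept("script", "remove", ["script", "style"]))  # False
print(is_kept("p", "remove", []))                        # True
print(is_kept("href", "keep", ["href"]))                 # True
print(is_kept("class", "keep", ["href"]))                # False
```

The same predicate applies to both tags and attributes, which matches the symmetric tag_mode/tags and attribute_mode/attributes parameters.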
Usage patterns
1) Remove mode (default)
from html_shrinker import HTMLShrinker
shrinker = HTMLShrinker(
    tag_mode="remove",
    tags=["script", "style", "head"],
    attribute_mode="remove",
    attributes=["class", "id", "style"],
)
clean = shrinker.shrink(raw_html)
2) Keep mode
from html_shrinker import HTMLShrinker
shrinker = HTMLShrinker(
    tag_mode="keep",
    tags=["main", "article", "h1", "h2", "p", "ul", "ol", "li", "a"],
    attribute_mode="keep",
    attributes=["href"],
)
clean = shrinker.shrink(raw_html)
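To show what keeping only `href` means at the attribute level, here is a rough sketch built on the standard library's `html.parser`. It is not how html-shrinker is implemented; the `AttributeFilter` class and `filter_attributes` function are invented for this example, and the sketch drops comments and the doctype because it does not handle them.

```python
from html import escape
from html.parser import HTMLParser


class AttributeFilter(HTMLParser):
    """Re-emit markup while filtering attributes (illustrative only)."""

    def __init__(self, mode, attributes):
        super().__init__(convert_charrefs=False)
        self.mode = mode
        self.attributes = set(attributes)
        self.parts = []

    def _emit_start(self, tag, attrs, closing=">"):
        pieces = [tag]
        for name, value in attrs:
            listed = name in self.attributes
            if listed if self.mode == "keep" else not listed:
                # The parser unescapes values, so escape them again on output.
                pieces.append(name if value is None else f'{name}="{escape(value)}"')
        self.parts.append("<" + " ".join(pieces) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, "/>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def filter_attributes(markup, mode, attributes):
    parser = AttributeFilter(mode, attributes)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(filter_attributes('<a href="/x" class="btn" id="k">go</a>', "keep", ["href"]))
# <a href="/x">go</a>
```

For scraping, a keep list of `href` (and perhaps `src`) often preserves the links the model needs while discarding styling and tracking attributes.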
3) Strip inner text
from html_shrinker import HTMLShrinker
shrinker = HTMLShrinker(strip_innertext=True)
clean = shrinker.shrink("<html><body><p>secret text</p></body></html>")
# <html><body><p></p></body></html>
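The text-stripping behavior shown above can be approximated with `html.parser` from the standard library. This is a sketch of the idea only, with invented names (`TextStripper`, `strip_text`); it also discards comments and the doctype, and it drops script and style contents because the parser reports them as text.

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Re-emit tags verbatim and silently drop every text node."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    # handle_data is intentionally not overridden: the default ignores text.


def strip_text(markup):
    parser = TextStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(strip_text("<html><body><p>secret text</p></body></html>"))
# <html><body><p></p></body></html>
```

Stripping text leaves only the page skeleton, which is useful when the LLM has to reason about structure (for example, to propose selectors) rather than about content.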
License
MIT
File details
Details for the file html_shrinker-0.1.0.tar.gz.
File metadata
- Download URL: html_shrinker-0.1.0.tar.gz
- Upload date:
- Size: 5.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 98d3e96700e7c9c5f68082c5b0804524e7a560eb988423b8668dbfbc4b71e939 |
| MD5 | 328176c7cb90cceeda5b8ff00be7e729 |
| BLAKE2b-256 | ca96a60b54edf40774d0ef7fdcbc0e0638b83fc0abc9154fee0df89f4ef23876 |
File details
Details for the file html_shrinker-0.1.0-py3-none-any.whl.
File metadata
- Download URL: html_shrinker-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 28c9b2e3a1dffc75feeefc4d9bd5027baab6e4a71f2a61030351784af753b7e8 |
| MD5 | a073aebc8dc12c3bbc598101eb80cce2 |
| BLAKE2b-256 | 60ee7ac93e9ed1395795e7dd3cf38274ec42062a9e1bbd222921600968be03c5 |