html-shrinker

Simple Python helper library that can significantly reduce LLM input tokens by removing unnecessary page code (configurable).

AI scraping usually involves sending a page's full HTML to an LLM together with instructions and an output format. The information needed is almost always somewhere in the body tag, so the head tag, which mostly holds styles, scripts, and metadata, can usually be removed safely. This alone reduces tokens and cost significantly. Further optimizations are possible, such as removing specific HTML tags, attributes, or even the inner text.
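To get a feel for how much the head tag can weigh, here is a minimal standard-library sketch. It is not how html-shrinker works internally; `drop_head` and the sample page are hypothetical, and character count is used only as a rough proxy for token count.

```python
import re


def drop_head(html: str) -> str:
    """Remove the first <head>...</head> element (regex illustration only)."""
    # \b keeps the pattern from matching tags such as <header>.
    pattern = r"<head\b[^>]*>.*?</head\s*>"
    return re.sub(pattern, "", html, count=1, flags=re.DOTALL | re.IGNORECASE)


# Hypothetical sample page: most of its characters live in the head.
page = (
    "<html><head><title>Shop</title>"
    "<style>body { margin: 0; font-family: sans-serif; }</style>"
    "<script>window.analytics = [];</script></head>"
    "<body><h1>Blue mug</h1><p>Price: 12 EUR</p></body></html>"
)

trimmed = drop_head(page)
saved = 1 - len(trimmed) / len(page)

print(trimmed)
print(f"characters saved: {saved:.0%}")
```

On real pages the ratio depends on how much inline CSS and JavaScript the head carries, so measure on your own inputs before relying on a specific saving.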

What it does

  • Removes noisy tags/attributes or keeps only a whitelist
  • Strips inner text
  • Removes comments
  • Flattens repeated single-child div > div wrappers, even if they are nested many levels deep
  • Collapses whitespace between tags
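Two of these steps, comment removal and whitespace collapsing, can be approximated with plain regular expressions. The sketch below only illustrates the idea with hypothetical helper names; the library's own implementation may differ, and a regex approach like this would also touch whitespace inside elements such as pre.

```python
import re


def remove_comments(html: str) -> str:
    """Drop <!-- ... --> blocks, including multi-line ones."""
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


def collapse_between_tags(html: str) -> str:
    """Remove whitespace that sits between a closing '>' and the next '<'."""
    return re.sub(r">\s+<", "><", html)


sample = "<ul>\n  <!-- nav -->\n  <li>One</li>\n  <li>Two</li>\n</ul>"
compact = collapse_between_tags(remove_comments(sample))
print(compact)  # <ul><li>One</li><li>Two</li></ul>
```

Text inside elements is left alone because the pattern only matches whitespace directly between two tags.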

Quick install

pip install html-shrinker

Quick start

from html_shrinker import HTMLShrinker
from html_shrinker.defaults import tags

raw_html = """
<html>
  <head><script>ignore me</script></head>
  <body>
    <div><div><p>Hello world</p></div></div>
    <script>alert("x")</script>
  </body>
</html>
"""

shrinker = HTMLShrinker(
    tags=list(tags),
)
result = shrinker.shrink(raw_html)
print(result)

API

from html_shrinker import HTMLShrinker

shrinker = HTMLShrinker(
    tag_mode="remove",
    tags=["script", "style", "head"],
    attribute_mode="remove",
    attributes=["class", "id", "style"],
    strip_innertext=False,
    remove_comments=True,
    flatten_single_child_divs=True,
    collapse_between_tags=True,
)

output = shrinker.shrink("<html>...</html>")

Default presets are available from:

from html_shrinker.defaults import tags, arguments

Invalid HTML input raises InvalidHTMLInputError:

from html_shrinker import HTMLShrinker, InvalidHTMLInputError

try:
    HTMLShrinker().shrink("<div>fragment</div>")
except InvalidHTMLInputError as exc:
    print(exc)

Configuration

HTMLShrinker(...) constructor parameters:

  • tag_mode: "remove" or "keep" (default: "remove")
  • tags: list[str]
    • If tag_mode="remove": these tags are removed.
    • If tag_mode="keep": only these tags are kept.
  • attribute_mode: "remove" or "keep" (default: "remove")
  • attributes: list[str]
    • If attribute_mode="remove": these attributes are removed.
    • If attribute_mode="keep": only these attributes are kept.
  • strip_innertext: bool (default: False)
    • If True: removes text nodes.
    • Example: <p>secret</p> becomes <p></p>.
  • remove_comments: bool (default: True)
    • If True: removes HTML comments such as <!-- comment -->.
  • flatten_single_child_divs: bool (default: True)
    • If True: flattens nested div > div wrappers when a div contains only one child div.
    • This is applied repeatedly, so a deep chain like <div><div><div><p>...</p></div></div></div> becomes <div><p>...</p></div>.
    • These long div chains appear very often when shrinking aggressively.
  • collapse_between_tags: bool (default: True)
    • If True: removes whitespace between tags, so > < becomes ><.
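The repeated flattening described for flatten_single_child_divs can be sketched with the standard library's ElementTree. This is an illustration under assumptions, not the library's code: it requires well-formed markup, the helper names are hypothetical, and it keeps the outermost div's attributes while discarding those of the inner wrappers.

```python
import xml.etree.ElementTree as ET


def _is_wrapper(el: ET.Element) -> bool:
    """True when el is a div whose only content is a single child div."""
    if el.tag != "div" or len(el) != 1 or el[0].tag != "div":
        return False
    # Any real text next to the child means the div is not a pure wrapper.
    return not (el.text or "").strip() and not (el[0].tail or "").strip()


def flatten(el: ET.Element) -> None:
    """Collapse div > div chains in place, then recurse into the children."""
    while _is_wrapper(el):
        child = el[0]
        el.remove(child)
        el.text = child.text    # adopt the inner div's text
        el.extend(list(child))  # adopt the inner div's children
    for sub in el:
        flatten(sub)


root = ET.fromstring(
    "<body><div><div><div><p>Hello</p></div></div></div></body>"
)
flatten(root)
print(ET.tostring(root, encoding="unicode"))
# <body><div><p>Hello</p></div></body>
```

The while loop is what makes the flattening repeat: after one wrapper level is absorbed, the same node is checked again until its single child is no longer a div.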

Notes:

  • tags defaults to an empty list ([]), so no tags are removed by default.
  • attributes defaults to an empty list ([]), so no attributes are removed by default.
  • If tag_mode="keep", tags must be non-empty.
  • If attribute_mode="keep", attributes must be non-empty.
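The remove/keep semantics and the non-empty rule for keep mode can be expressed as one small function. This is a hypothetical standalone sketch: `filter_names` is not part of html-shrinker, and raising ValueError for an empty keep list is an assumption about how such a check might be reported.

```python
def filter_names(names, mode, listed):
    """Apply remove/keep semantics to a list of tag or attribute names."""
    if mode not in ("remove", "keep"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "keep" and not listed:
        # Assumption: an empty whitelist is rejected, mirroring the notes above.
        raise ValueError("keep mode requires a non-empty list")
    wanted = set(listed)
    if mode == "remove":
        return [n for n in names if n not in wanted]
    return [n for n in names if n in wanted]


attrs = ["class", "href", "id", "style"]
print(filter_names(attrs, "remove", ["class", "id", "style"]))  # ['href']
print(filter_names(attrs, "keep", ["href"]))                    # ['href']
print(filter_names(attrs, "remove", []))  # ['class', 'href', 'id', 'style']
```

An empty list is harmless in remove mode because nothing matches, whereas in keep mode it would discard everything, which is presumably why it is disallowed.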

Usage patterns

1) Remove mode (default)

from html_shrinker import HTMLShrinker

shrinker = HTMLShrinker(
    tag_mode="remove",
    tags=["script", "style", "head"],
    attribute_mode="remove",
    attributes=["class", "id", "style"],
)
clean = shrinker.shrink(raw_html)

2) Keep mode

from html_shrinker import HTMLShrinker

shrinker = HTMLShrinker(
    tag_mode="keep",
    tags=["main", "article", "h1", "h2", "p", "ul", "ol", "li", "a"],
    attribute_mode="keep",
    attributes=["href"],
)
clean = shrinker.shrink(raw_html)

3) Strip inner text

from html_shrinker import HTMLShrinker

shrinker = HTMLShrinker(strip_innertext=True)
clean = shrinker.shrink("<html><body><p>secret text</p></body></html>")
# <html><body><p></p></body></html>
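For intuition about what stripping inner text means, the following sketch re-emits tags with the standard library's html.parser and simply ignores text nodes. It is an illustration only, with hypothetical names (`TextStripper`, `strip_text`), and it does not claim to match how html-shrinker implements strip_innertext.

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Re-emit tags as written and drop every text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared in the input.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as written.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    # handle_data is not overridden, so text is silently discarded.
    # In this sketch, comments and doctype declarations are discarded as well.


def strip_text(html: str) -> str:
    parser = TextStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_text('<html><body><p class="x">secret text</p></body></html>'))
# <html><body><p class="x"></p></body></html>
```

The page structure and attributes survive, which is useful when the LLM only needs to reason about layout or selectors rather than content.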

License

MIT
