LLM-optimized HTML cleaning: hydration extraction, token budgets, multiple output formats
llm-html
CMDOP Skill — install and use via CMDOP agent:
cmdop-skill install llm-html
Install
pip install llm-html
Quick Start
from llm_html import HTMLCleaner, CleanerConfig, OutputFormat
# Basic cleaning
cleaner = HTMLCleaner()
result = cleaner.clean(html)
print(f"Reduction: {result.stats.reduction_percent}%")
# Hydration-first (extracts SSR data from Next.js, Nuxt, etc.)
if result.hydration_data:
    data = result.hydration_data
else:
    cleaned = result.html
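To make the hydration-first idea concrete, here is a minimal standard-library sketch of pulling the JSON payload that Next.js embeds in a `<script id="__NEXT_DATA__">` tag. It is not how llm-html does it internally; `NextDataParser` and `extract_next_data` are hypothetical names used only for illustration, and other frameworks (Nuxt, etc.) embed their data differently.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">, if present."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only for the Next.js hydration script.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._capturing = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self._parts.append(data)

    @property
    def payload(self):
        return "".join(self._parts)


def extract_next_data(html):
    """Return the parsed payload, or None if it is absent or not valid JSON."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.payload.strip():
        return None
    try:
        return json.loads(parser.payload)
    except json.JSONDecodeError:
        return None


sample_page = (
    '<html><body><div id="__next">Hello</div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Hello"}}}'
    "</script></body></html>"
)
print(extract_next_data(sample_page))
```

When the payload is present, structured data like this is usually far cheaper to pass to an LLM than the rendered markup, which is the motivation for checking `hydration_data` first.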
Convenience Functions
from llm_html import clean, clean_to_json, clean_html, clean_for_llm
# Quick clean
result = clean(html)
# Get JSON if SSR data available, otherwise cleaned HTML
data = clean_to_json(html)
# Pipeline with full control
result = clean_html(html, max_tokens=5000)
result = clean_for_llm(html, output_format="markdown")
Output Formats
from llm_html import to_markdown, to_aom_yaml, to_xtree
md = to_markdown(html)
aom = to_aom_yaml(html)
xtree = to_xtree(html)
Downsampling
Token-budget targeting with the D2Snap algorithm:
from llm_html import downsample_html, estimate_tokens
tokens = estimate_tokens(html)
if tokens > 10000:
    html = downsample_html(html, target_tokens=8000)
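For intuition about what a token budget check involves, the sketch below uses the common rule of thumb of roughly four characters per token. This is an assumption for illustration only: it is not the library's `estimate_tokens` implementation, it is not D2Snap, and `rough_token_estimate` and `over_budget` are hypothetical helpers.

```python
import math


def rough_token_estimate(text, chars_per_token=4.0):
    """Approximate a token count from character length (heuristic only)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def over_budget(text, budget):
    """True when the rough estimate exceeds the given token budget."""
    return rough_token_estimate(text) > budget


page = "<div>" + "lorem ipsum " * 100 + "</div>"
print(rough_token_estimate(page), over_budget(page, budget=200))
```

Real tokenizers vary by model and by content, and markup-heavy HTML tends to tokenize less efficiently than prose, so a heuristic like this is only useful for deciding whether downsampling is worth attempting.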
Semantic Chunking
Split large pages into LLM-sized chunks:
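from bs4 import BeautifulSoup
from llm_html import SemanticChunker, ChunkConfig
config = ChunkConfig(max_tokens=8000, max_items=20)
chunker = SemanticChunker(config)
# Assumption: `soup` is a parsed BeautifulSoup tree; the original snippet leaves it undefined
soup = BeautifulSoup(html, "html.parser")
result = chunker.chunk(soup)
for chunk in result.chunks:
    process(chunk.html)  # process() stands in for your own handler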
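The two limits in `ChunkConfig` suggest a packing strategy bounded by both token count and item count. A minimal greedy sketch of that idea, assuming pre-split HTML blocks and a rough 4-characters-per-token estimate, might look like the following. `pack_blocks` is a hypothetical helper, not part of llm-html, and the library's own chunker may split on semantic boundaries quite differently.

```python
def pack_blocks(blocks, max_tokens, max_items,
                estimate=lambda s: (len(s) + 3) // 4):
    """Greedily group blocks so each chunk respects both limits.

    A single block larger than max_tokens still gets its own chunk.
    """
    chunks, current, used = [], [], 0
    for block in blocks:
        cost = estimate(block)
        # Flush the current chunk if adding this block would break a limit.
        if current and (used + cost > max_tokens or len(current) >= max_items):
            chunks.append(current)
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        chunks.append(current)
    return chunks


items = ["<li>aaaa</li>"] * 5  # each item is about 4 tokens under this estimate
for group in pack_blocks(items, max_tokens=8, max_items=20):
    print(len(group), "items")
```

Greedy packing keeps neighbouring items together, which matters for lists and tables where each chunk should remain readable on its own.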
Shadow DOM
Flatten Web Components for LLM visibility:
from llm_html import flatten_shadow_dom
flat = flatten_shadow_dom(html)
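As a rough illustration of what flattening means, the sketch below inlines declarative shadow roots (`<template shadowrootmode="...">`) so their content sits directly inside the host element. It is a regex-based simplification with a hypothetical function name, it does not handle nested templates or slots, and it makes no claim about how `flatten_shadow_dom` works.

```python
import re

# Matches a declarative shadow root template and captures its inner markup.
_SHADOW_TEMPLATE = re.compile(
    r'<template\s+shadowrootmode\s*=\s*["\'](?:open|closed)["\'][^>]*>'
    r"(.*?)</template\s*>",
    re.DOTALL | re.IGNORECASE,
)


def inline_declarative_shadow_roots(html):
    """Replace each shadow root template with its inner content."""
    return _SHADOW_TEMPLATE.sub(lambda m: m.group(1), html)


component = (
    '<my-card><template shadowrootmode="open">'
    "<p>Hi</p></template></my-card>"
)
print(inline_declarative_shadow_roots(component))
```

Shadow roots attached at runtime with JavaScript never appear in static HTML at all, so they would need to be serialized in the browser before any flattening step could see them.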
Helpers
from llm_html import html_to_text, extract_links, extract_images, json_to_toon
text = html_to_text(html)
links = extract_links(html, base_url="https://example.com")
images = extract_images(html)
toon = json_to_toon({"key": "value"})
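The `base_url` argument to `extract_links` implies that relative links are resolved to absolute URLs. A standard-library sketch of that behaviour, under the assumption that only `<a href>` values are collected, is shown below. `LinkCollector` and `collect_links` are hypothetical names, and the library's return format may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gathers href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin leaves absolute URLs untouched and resolves relative ones.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


snippet = '<a href="/docs">Docs</a><a href="https://other.test/x">X</a><a>none</a>'
print(collect_links(snippet, "https://example.com"))
```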
License
MIT
File details
Details for the file llm_html-0.1.6.tar.gz.
File metadata
- Download URL: llm_html-0.1.6.tar.gz
- Upload date:
- Size: 53.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d1b25ea3be6164fe699793a33f6d6c9f9dc938402830ad0bb6f07b63ef9c8712 |
| MD5 | 551ab347e8d7c0131850dbd92a008025 |
| BLAKE2b-256 | 565a91196d01efa12c31bbcbf8058e59e04f62d58d682ede753f23b425962e8c |
File details
Details for the file llm_html-0.1.6-py3-none-any.whl.
File metadata
- Download URL: llm_html-0.1.6-py3-none-any.whl
- Upload date:
- Size: 70.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4ca337409aaf39d0fd48e08208ee82c8a8401a8128dc71e6eb83cac2f597daad |
| MD5 | f5ca71eff304aaa7adcb0ee1b184cf3d |
| BLAKE2b-256 | 701280d080bd791b0a22e7eae5c440d937e9246571603bd575858eb1bf1cec44 |