Parse DOM trees into clean, LLM-friendly context
domcontext
Converts messy HTML/CDP snapshots into structured markdown for LLM context windows. Designed for web automation agents that need to provide clean DOM context to LLMs.
⚠️ Development Status: This package is in active development (v0.1.x). APIs may change between minor versions. Not recommended for production use yet.
Why "domcontext"? It's a double pun! 🎯
- DOM (Document Object Model) + context (LLM context windows)
- Provides DOM context for your LLM agents
Quick Start
from domcontext import DomContext
# Parse HTML string
html = """
<html>
<head><title>Example</title></head>
<body>
<nav><a href="/home">Home</a></nav>
<main>
<button type="submit">Search</button>
</main>
</body>
</html>
"""
# Create DOM context
context = DomContext.from_html(html)
# Get markdown representation
print(context.markdown)
print(f"Tokens: {context.tokens}")
# Iterate through interactive elements
for element in context.elements():
    print(f"{element.id}: {element.tag} - {element.text}")
Output:
- body-1
  - nav-1
    - a-1 (href="/home")
      - "Home"
  - main-1
    - button-1 (type="submit")
      - "Search"
Tokens: 42
Installation
# Install from source
pip install -e .
# Install with dev dependencies
pip install -e ".[dev]"
# Install with Playwright support (for live browser CDP capture)
pip install -e ".[playwright]"
# Install with Jupyter notebooks support (to run examples)
pip install -e ".[examples,playwright]"
# Install with all optional dependencies
pip install -e ".[dev,playwright,examples]"
After installing with Playwright support, install browser binaries:
playwright install chromium
Features
- 🧹 Semantic filtering - Removes scripts, styles, hidden elements automatically
- 📉 Token reduction - 60% average reduction in token count
- 🎯 Structure preservation - Maintains DOM hierarchy in clean format
- 🔍 Element lookup - Access original DOM elements by their generated IDs
- 📊 Token counting - Built-in token counting with tiktoken
- 🎛️ Configurable filtering - Fine-tune visibility and semantic filters
- 📦 Multiple input formats - Support for HTML strings and CDP snapshots
- 🧩 Smart chunking - Split large DOMs into context-sized chunks with configurable overlap
API
Parse HTML
from domcontext import DomContext
# Basic parsing
context = DomContext.from_html(html_string)
# With custom filter options
context = DomContext.from_html(
    html_string,
    filter_non_visible=True,      # Remove script, style tags
    filter_css_hidden=True,       # Remove display:none, visibility:hidden
    filter_zero_dimensions=True,  # Remove zero-width/height elements
    filter_empty_elements=True,   # Remove empty wrapper divs
    filter_attributes=True,       # Keep only semantic attributes
    collapse_wrappers=True        # Collapse single-child wrappers
)
Parse CDP Snapshot
# From Chrome DevTools Protocol snapshot
cdp_snapshot = {
    'documents': [...],
    'strings': [...]
}
context = DomContext.from_cdp(cdp_snapshot)
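The `documents` and `strings` keys match the layout returned by CDP's `DOMSnapshot.captureSnapshot`, where node fields are integer indices into a shared string table. The README does not say which CDP method `from_cdp` expects, so treat that as an assumption. The sketch below uses a tiny hand-written snapshot (not real browser output) and two hypothetical helpers, `node_names` and `depth`, to show how such indices resolve:

```python
# Illustrative only: a hand-written snapshot in the DOMSnapshot.captureSnapshot
# layout. Assumption: this is the format DomContext.from_cdp consumes.
snapshot = {
    "strings": ["#document", "HTML", "BODY", "BUTTON"],
    "documents": [
        {
            "nodes": {
                # parentIndex[i] is the index of node i's parent (-1 for the root).
                "parentIndex": [-1, 0, 1, 2],
                # nodeName[i] is an index into snapshot["strings"].
                "nodeName": [0, 1, 2, 3],
            }
        }
    ],
}


def node_names(snap: dict) -> list[str]:
    """Resolve every node name in the first document via the string table."""
    strings = snap["strings"]
    nodes = snap["documents"][0]["nodes"]
    return [strings[i] for i in nodes["nodeName"]]


def depth(snap: dict, index: int) -> int:
    """Count a node's ancestors by walking parentIndex up to the root."""
    parents = snap["documents"][0]["nodes"]["parentIndex"]
    steps = 0
    while parents[index] != -1:
        index = parents[index]
        steps += 1
    return steps


print(node_names(snapshot))
print(depth(snapshot, 3))
```

Storing names as indices keeps snapshots compact, which is why a parser has to dereference the string table before it can build a tree.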
Access Context
# Markdown representation
markdown = context.markdown
# Token count
token_count = context.tokens
# Get all interactive elements
for element in context.elements():
    print(f"ID: {element.id}")
    print(f"Tag: {element.tag}")
    print(f"Text: {element.text}")
    print(f"Attributes: {element.attributes}")
# Get element by ID
element = context.get_element("button-1")
print(element.attributes) # {'type': 'submit'}
# Get as dictionary
data = context.to_dict()
Chunking
# Split large DOMs into chunks
for chunk in context.chunks(max_tokens=1000, overlap=100):
    print(f"Chunk tokens: {chunk.tokens}")
    print(chunk.markdown)
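To make the `max_tokens` and `overlap` parameters concrete, here is a minimal standalone sketch of overlap chunking over lines of text. It is not domcontext's implementation: `chunk_lines` and the whitespace-based `count_tokens` are hypothetical stand-ins, and the real library may split on different boundaries.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def chunk_lines(lines: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Greedily pack lines into chunks of at most max_tokens, repeating up to
    `overlap` tokens of trailing lines at the start of the next chunk.
    A single line larger than max_tokens still becomes its own oversized chunk."""
    chunks: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for line in lines:
        n = count_tokens(line)
        if current and current_tokens + n > max_tokens:
            chunks.append(current)
            # Carry over trailing lines that fit within the overlap budget.
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                t = count_tokens(prev)
                if carried_tokens + t > overlap:
                    break
                carried.insert(0, prev)
                carried_tokens += t
            # Drop the carry if it would push the new chunk over the limit.
            if carried_tokens + n > max_tokens:
                carried, carried_tokens = [], 0
            current, current_tokens = carried, carried_tokens
        current.append(line)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks


for chunk in chunk_lines(["a b c", "d e f", "g h i", "j k l"], max_tokens=6, overlap=3):
    print(chunk)
```

With these inputs each chunk repeats the last line of the previous one, so a model reading chunk N still sees the tail of chunk N-1.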
Custom Tokenizer
from domcontext import DomContext, Tokenizer
class CustomTokenizer(Tokenizer):
    def count_tokens(self, text: str) -> int:
        # Your custom token counting logic
        return len(text.split())
context = DomContext.from_html(html, tokenizer=CustomTokenizer())
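As another illustration of the same interface, the sketch below estimates tokens from character count, a common rough heuristic when no real tokenizer is available. The `Tokenizer` class here is a local stand-in so the snippet runs without the package installed; `CharRatioTokenizer` and its `chars_per_token` parameter are hypothetical names, and the default of 4 characters per token is only a rule of thumb.

```python
import math
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Local stand-in for domcontext's Tokenizer so this snippet runs alone."""

    @abstractmethod
    def count_tokens(self, text: str) -> int:
        ...


class CharRatioTokenizer(Tokenizer):
    """Estimate tokens as ceil(len(text) / chars_per_token)."""

    def __init__(self, chars_per_token: float = 4.0) -> None:
        if chars_per_token <= 0:
            raise ValueError("chars_per_token must be positive")
        self.chars_per_token = chars_per_token

    def count_tokens(self, text: str) -> int:
        return math.ceil(len(text) / self.chars_per_token)


tokenizer = CharRatioTokenizer()
print(tokenizer.count_tokens('button-1 (type="submit")'))
```

In real use the subclass would inherit from `domcontext.Tokenizer` instead of the stand-in and be passed via `tokenizer=` as shown above.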
Playwright Utilities (Optional)
Capture CDP snapshots directly from live browser sessions:
import asyncio
from playwright.async_api import async_playwright
from domcontext import DomContext
from domcontext.utils import capture_snapshot
async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')
        # Capture CDP snapshot from live page
        snapshot = await capture_snapshot(page)
        # Parse into DomContext
        context = DomContext.from_cdp(snapshot)
        print(context.markdown)
        await browser.close()
# Run with: python script.py
if __name__ == "__main__":
    asyncio.run(main())
Note: Requires installation with pip install domcontext[playwright]
Architecture
The library uses a multi-stage filtering pipeline:
- Parse - HTML/CDP → DomIR (complete DOM tree with all data)
- Visibility Filter - Remove non-visible elements (optional flags)
  - Non-visible tags (script, style, head)
  - CSS hidden elements (display:none, visibility:hidden)
  - Zero-dimension elements
- Semantic Filter - Extract semantic information (optional flags)
  - Convert to SemanticIR
  - Filter to semantic attributes only
  - Remove empty nodes
  - Collapse wrapper divs
  - Generate readable IDs
- Output - SemanticIR → Markdown/JSON
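The stages above can be sketched with the standard library alone. This is a simplified illustration, not domcontext's code: `Node`, `TreeBuilder`, `visible`, `render`, and `to_outline` are hypothetical names, and the sketch skips zero-dimension checks, attribute filtering, and wrapper collapsing.

```python
from html.parser import HTMLParser

NON_VISIBLE = {"script", "style", "head"}
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class Node:
    def __init__(self, tag, attrs=None, text=None):
        self.tag, self.attrs, self.text = tag, dict(attrs or {}), text
        self.children = []


class TreeBuilder(HTMLParser):
    """Stage 1 (parse): build a complete tree, keeping everything."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data.strip()))


def is_hidden(node):
    style = (node.attrs.get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style


def visible(node):
    """Stage 2 (visibility filter): drop non-visible tags and CSS-hidden nodes."""
    node.children = [visible(c) for c in node.children
                     if c.tag not in NON_VISIBLE and not is_hidden(c)]
    return node


def render(node, counts, depth=0):
    """Stages 3-4: assign readable IDs (tag-N) and emit an indented outline."""
    lines = []
    for child in node.children:
        pad = "  " * depth
        if child.tag == "#text":
            lines.append(f'{pad}- "{child.text}"')
            continue
        counts[child.tag] = counts.get(child.tag, 0) + 1
        lines.append(f"{pad}- {child.tag}-{counts[child.tag]}")
        lines.extend(render(child, counts, depth + 1))
    return lines


def to_outline(html):
    builder = TreeBuilder()
    builder.feed(html)
    return render(visible(builder.root), {})


SAMPLE = """<html><head><title>T</title><style>p{}</style></head>
<body><nav><a href="/home">Home</a></nav>
<div style="display: none">secret</div>
<main><button type="submit">Search</button></main></body></html>"""
print("\n".join(to_outline(SAMPLE)))
```

Keeping the parse stage lossless and filtering afterwards means each filter can be toggled independently, which mirrors the optional flags described above.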
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=domcontext --cov-report=html
# Run specific test suite
pytest tests/unit/parsers/
pytest tests/unit/filters/
pytest tests/unit/ir/
Test Coverage:
- 173 tests passing
- HTML Parser (13 tests)
- CDP Parser (12 tests)
- DomIR Layer (27 tests)
- SemanticIR Layer (34 tests)
- Visibility Filters (43 tests)
- Semantic Filters (28 tests)
- Tokenizers (13 tests)
Use Cases
- Web automation agents - Provide clean DOM context to LLMs for element selection
- Web scraping - Extract structured content from complex pages
- Testing - Generate clean snapshots of DOM state
- Accessibility - Extract semantic structure from pages
License
MIT
Examples
Check out the interactive Jupyter notebooks in examples/:
- simple_demo.ipynb - Quick start guide with Google search example
  - Element lookup by ID
  - Chunking demonstration
  - Perfect for beginners
- advanced_demo.ipynb - Advanced features showcase
  - Custom filters and tokenizers
  - Element iteration and statistics
  - LLM prompt generation
  - Production patterns
Run with:
jupyter notebook examples/simple_demo.ipynb
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black src/ tests/
# Lint
ruff check src/ tests/
TODO
Chunking Improvements
- Handle very long text nodes - Improve chunk behavior when a single text element exceeds the max token limit. Currently, a text node larger than max_tokens is placed in its own chunk, which can exceed the limit. Future improvement: split long text nodes across multiple chunks while preserving context.
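One possible shape for that improvement, offered only as a sketch: split the text on word boundaries and label each piece with its element ID and position so the pieces can be reassembled. `split_text_node` is a hypothetical helper, words stand in for tokens, and the label itself is not counted against the limit.

```python
def split_text_node(element_id: str, text: str, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens words, labelling each
    piece with its element ID and position (labels are not counted)."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    pieces = [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
    total = len(pieces)
    return [
        f"{element_id} [{n}/{total}]: {piece}"
        for n, piece in enumerate(pieces, start=1)
    ]


for piece in split_text_node("p-1", "a b c d e", max_tokens=2):
    print(piece)
```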
Collapsing Improvements
- Collapse text-wrapping elements - Improve wrapper collapsing to also collapse elements that only wrap text (not just elements that wrap other elements). Currently, <a><span>text</span></a> keeps the span, but it should be collapsed to <a>text</a> if the span has no attributes. Exception: Don't collapse interactive elements (button, input, a, select, textarea, etc.) even when they only wrap text, as these are semantically meaningful.
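The proposed rule can be sketched on a plain dict tree. This is an illustration of the rule as described, not library code: the `{"tag", "attrs", "children"}` node shape, `collapse_text_wrappers`, and the exact `INTERACTIVE` set are assumptions.

```python
INTERACTIVE = {"a", "button", "input", "select", "textarea"}


def collapse_text_wrappers(node: dict) -> dict:
    """Return a copy of node in which attribute-less, non-interactive child
    elements that only wrap text are replaced by their text (bottom-up)."""
    new_children = []
    for child in node.get("children", []):
        if isinstance(child, str):  # text nodes are plain strings
            new_children.append(child)
            continue
        child = collapse_text_wrappers(child)
        kids = child["children"]
        only_text = bool(kids) and all(isinstance(c, str) for c in kids)
        if only_text and not child.get("attrs") and child["tag"] not in INTERACTIVE:
            new_children.extend(kids)  # hoist the text into the parent
        else:
            new_children.append(child)
    return {**node, "children": new_children}


link = {"tag": "a", "attrs": {}, "children": [
    {"tag": "span", "attrs": {}, "children": ["text"]},
]}
print(collapse_text_wrappers(link))
```

Here the span is hoisted away while the enclosing `a` survives, matching the example in the item above; a `button` that only wraps text would likewise be kept.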
Contributing
Contributions welcome! Please ensure tests pass and add new tests for new features.
# Run full test suite
pytest -v
# Check coverage
pytest --cov=domcontext