
domcontext

Parse DOM trees into clean, LLM-friendly context.

Converts messy HTML/CDP snapshots into structured markdown for LLM context windows. Designed for web automation agents that need to provide clean DOM context to LLMs.

⚠️ Development Status: This package is in active development (v0.1.x). APIs may change between minor versions. Not recommended for production use yet.

Why "domcontext"? It's a double pun! 🎯

  • DOM (Document Object Model) + context (LLM context windows)
  • Provides DOM context for your LLM agents



Quick Start

from domcontext import DomContext

# Parse HTML string
html = """
<html>
<head><title>Example</title></head>
<body>
    <nav><a href="/home">Home</a></nav>
    <main>
        <button type="submit">Search</button>
    </main>
</body>
</html>
"""

# Create DOM context
context = DomContext.from_html(html)

# Get markdown representation
print(context.markdown)
print(f"Tokens: {context.tokens}")

# Iterate through interactive elements
for element in context.elements():
    print(f"{element.id}: {element.tag} - {element.text}")

Output:

- body-1
  - nav-1
    - a-1 (href="/home")
      - "Home"
  - main-1
    - button-1 (type="submit")
      - "Search"

Tokens: 42

Installation

# Install from PyPI
pip install domcontext

# Install from source
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"

# Install with Playwright support (for live browser CDP capture)
pip install -e ".[playwright]"

# Install with Jupyter notebook support (to run the examples)
pip install -e ".[examples,playwright]"

# Install with all optional dependencies
pip install -e ".[dev,playwright,examples]"

After installing with Playwright support, install browser binaries:

playwright install chromium

Features

  • 🧹 Semantic filtering - Removes scripts, styles, hidden elements automatically
  • 📉 Token reduction - 60% average reduction in token count
  • 🎯 Structure preservation - Maintains DOM hierarchy in clean format
  • 🔍 Element lookup - Access original DOM elements by their generated IDs
  • 📊 Token counting - Built-in token counting with tiktoken
  • 🎛️ Configurable filtering - Fine-tune visibility and semantic filters
  • 📦 Multiple input formats - Support for HTML strings and CDP snapshots
  • 🧩 Smart chunking - Split large DOMs with continuation markers (...) and parent context for seamless chunk boundaries

API

Parse HTML

from domcontext import DomContext

# Basic parsing
context = DomContext.from_html(html_string)

# With custom filter options
context = DomContext.from_html(
    html_string,
    filter_non_visible=True,      # Remove script, style tags
    filter_css_hidden=True,        # Remove display:none, visibility:hidden
    filter_zero_dimensions=True,   # Remove zero-width/height elements
    filter_empty_elements=True,    # Remove empty wrapper divs
    filter_attributes=True,        # Keep only semantic attributes
    collapse_wrappers=True         # Collapse single-child wrappers
)

Parse CDP Snapshot

# From Chrome DevTools Protocol snapshot
cdp_snapshot = {
    'documents': [...],
    'strings': [...]
}

context = DomContext.from_cdp(cdp_snapshot)
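
A snapshot of this shape is what Chrome's DOMSnapshot.captureSnapshot command returns. Below is a minimal sketch of capturing one yourself through Playwright's raw CDP session; the capture_snapshot helper shown later wraps this for you, and the computedStyles list here is only an assumption about what the visibility filters need:

import asyncio
from playwright.async_api import async_playwright
from domcontext import DomContext

async def snapshot_url(url: str) -> DomContext:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)

        # DOMSnapshot.captureSnapshot returns {'documents': [...], 'strings': [...]}
        cdp = await page.context.new_cdp_session(page)
        cdp_snapshot = await cdp.send(
            "DOMSnapshot.captureSnapshot",
            {"computedStyles": ["display", "visibility", "width", "height"]},
        )
        await browser.close()
        return DomContext.from_cdp(cdp_snapshot)

context = asyncio.run(snapshot_url("https://example.com"))
print(context.markdown)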

Access Context

# Markdown representation
markdown = context.markdown

# Token count
token_count = context.tokens

# Get all interactive elements
for element in context.elements():
    print(f"ID: {element.id}")
    print(f"Tag: {element.tag}")
    print(f"Text: {element.text}")
    print(f"Attributes: {element.attributes}")

# Get element by ID
element = context.get_element("button-1")
print(element.attributes)  # {'type': 'submit'}

# Get as dictionary
data = context.to_dict()
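
If you want to persist or log the parsed context, a minimal sketch, assuming to_dict() returns plain JSON-serializable data (its exact structure is not documented here):

import json

# Write the parsed context to disk for caching or later inspection
with open("dom_context.json", "w") as f:
    json.dump(context.to_dict(), f, indent=2)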

Chunking

# Split large DOMs into chunks with smart continuation markers
for chunk in context.chunks(max_tokens=1000, overlap=100):
    print(f"Chunk tokens: {chunk.tokens}")
    print(chunk.markdown)

# Chunks automatically include:
# - Parent path context (e.g., "- body-1\n  - div-1")
# - Continuation markers (...) when elements span chunks
# Example: "- button-1 (type="submit" ...)" → continues in next chunk

# Disable parent path if needed
for chunk in context.chunks(max_tokens=1000, overlap=100, include_parent_path=False):
    print(chunk.markdown)  # No parent context
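
One common pattern is a map-style pass, querying an LLM chunk by chunk and merging the answers. A hedged sketch; build_chunk_prompt and call_llm are hypothetical stand-ins for your own prompt template and client, only chunk.tokens and chunk.markdown come from the API above:

def build_chunk_prompt(chunk) -> str:
    # Hypothetical prompt template around the documented chunk attributes
    return (
        f"Partial DOM context ({chunk.tokens} tokens):\n"
        f"{chunk.markdown}\n\n"
        "Reply with the ids of any login-related elements, or 'none'."
    )

candidate_ids = []
for chunk in context.chunks(max_tokens=1000, overlap=100):
    prompt = build_chunk_prompt(chunk)
    # candidate_ids.extend(call_llm(prompt))  # hypothetical LLM call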

Custom Tokenizer

from domcontext import DomContext, Tokenizer

class CustomTokenizer(Tokenizer):
    def count_tokens(self, text: str) -> int:
        # Your custom token counting logic
        return len(text.split())

context = DomContext.from_html(html, tokenizer=CustomTokenizer())
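
For example, a drop-in tokenizer pinned to a specific tiktoken encoding (assuming, as in the example above, that Tokenizer subclasses only need to implement count_tokens):

import tiktoken
from domcontext import DomContext, Tokenizer

_ENCODING = tiktoken.get_encoding("cl100k_base")

class Cl100kTokenizer(Tokenizer):
    """Counts tokens with tiktoken's cl100k_base encoding."""

    def count_tokens(self, text: str) -> int:
        return len(_ENCODING.encode(text))

context = DomContext.from_html(html, tokenizer=Cl100kTokenizer())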

Playwright Utilities (Optional)

Capture CDP snapshots directly from live browser sessions:

import asyncio

from playwright.async_api import async_playwright
from domcontext import DomContext
from domcontext.utils import capture_snapshot

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')

        # Capture CDP snapshot from live page
        snapshot = await capture_snapshot(page)

        # Parse into DomContext
        context = DomContext.from_cdp(snapshot)
        print(context.markdown)

        await browser.close()

asyncio.run(main())

Note: Requires installation with pip install domcontext[playwright]


Architecture

The library uses a multi-stage filtering pipeline:

  1. Parse - HTML/CDP → DomIR (complete DOM tree with all data)
  2. Visibility Filter - Remove non-visible elements (optional flags)
    • Non-visible tags (script, style, head)
    • CSS hidden elements (display:none, visibility:hidden)
    • Zero-dimension elements
  3. Semantic Filter - Extract semantic information (optional flags)
    • Convert to SemanticIR
    • Filter to semantic attributes only
    • Remove empty nodes
    • Collapse wrapper divs
    • Generate readable IDs
  4. Output - SemanticIR → Markdown/JSON
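
Each stage is toggled through the documented from_html flags. A sketch mapping the flags onto the stages (DomIR and SemanticIR above are internal representations, not part of the public API):

context = DomContext.from_html(
    html,
    # Stage 2: visibility filter
    filter_non_visible=True,       # drop script, style, head
    filter_css_hidden=True,        # drop display:none / visibility:hidden
    filter_zero_dimensions=True,   # drop zero-width/height elements
    # Stage 3: semantic filter
    filter_attributes=True,        # keep only semantic attributes
    filter_empty_elements=True,    # remove empty wrapper nodes
    collapse_wrappers=True,        # collapse single-child wrappers
)

# Stage 4: output
markdown = context.markdown
data = context.to_dict()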

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=domcontext --cov-report=html

# Run specific test suite
pytest tests/unit/parsers/
pytest tests/unit/filters/
pytest tests/unit/ir/

Test Coverage:

  • 188 tests passing
  • HTML Parser (13 tests)
  • CDP Parser (12 tests)
  • DomIR Layer (27 tests)
  • SemanticIR Layer (34 tests)
  • Visibility Filters (43 tests)
  • Semantic Filters (28 tests)
  • Chunker (15 tests)
  • Tokenizers (13 tests)
  • Smoke tests (3 tests)

Use Cases

  • Web automation agents - Provide clean DOM context to LLMs for element selection
  • Web scraping - Extract structured content from complex pages
  • Testing - Generate clean snapshots of DOM state
  • Accessibility - Extract semantic structure from pages
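
For the web-automation case above, here is a hedged sketch of the element-selection loop an agent might run. The prompt wording, the sample HTML, and call_llm are assumptions; only the domcontext calls come from the API shown earlier:

from domcontext import DomContext

page_html = "<html><body><button type='submit'>Search</button></body></html>"  # whatever HTML your agent captured
context = DomContext.from_html(page_html)

prompt = (
    "You are controlling a web page. Its DOM context is:\n\n"
    f"{context.markdown}\n\n"
    "Reply with only the id of the element to click, e.g. 'button-1'."
)

# element_id = call_llm(prompt)   # hypothetical LLM call
element_id = "button-1"           # placeholder for illustration
element = context.get_element(element_id)
print(element.tag, element.attributes)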

License

MIT


Examples

Check out the interactive Jupyter notebooks in examples/:

  • simple_demo.ipynb - Quick start guide with Google search example
    • Element lookup by ID
    • Chunking demonstration
    • Perfect for beginners
  • advanced_demo.ipynb - Advanced features showcase
    • Custom filters and tokenizers
    • Element iteration and statistics
    • LLM prompt generation
    • Production patterns

Run with:

jupyter notebook examples/simple_demo.ipynb

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black src/ tests/

# Lint
ruff check src/ tests/

TODO

Collapsing Improvements

  1. Collapse text-wrapping elements - Improve wrapper collapsing to also collapse elements that only wrap text (not just elements that wrap other elements). Currently, <a><span>text</span></a> keeps the span, but it should be collapsed to <a>text</a> if the span has no attributes. Exception: Don't collapse interactive elements (button, input, a, select, textarea, etc.) even when they only wrap text, as these are semantically meaningful.

Evaluation & Benchmarking

  1. Mind2Web dataset evaluation - Conduct comprehensive testing on the Mind2Web dataset to evaluate DOM context quality, token reduction rates, and element selection accuracy across diverse real-world websites. Report will include performance metrics, edge cases discovered, and comparison with baseline HTML parsing.

Recently Completed

✅ Chunking Improvements (v0.1.3)

  • Atomic-level chunking - Implemented word-by-word text splitting and attribute-by-attribute element splitting with continuation markers (...)
  • Smart chunk boundaries - Text and attributes now split across chunks seamlessly with proper context preservation
  • Parent path context - Each chunk includes parent hierarchy for better LLM understanding
  • Better token utilization - No more wasted chunk capacity from oversized single-line elements

Contributing

Contributions welcome! Please ensure tests pass and add new tests for new features.

# Run full test suite
pytest -v

# Check coverage
pytest --cov=domcontext
