
domcontext

Parse DOM trees into clean, LLM-friendly context.

Converts messy HTML/CDP snapshots into structured markdown for LLM context windows. Designed for web automation agents that need to provide clean DOM context to LLMs.

⚠️ Development Status: This package is in active development (v0.1.x). APIs may change between minor versions. Not recommended for production use yet.

Why "domcontext"? It's a double pun! 🎯

  • DOM (Document Object Model) + context (LLM context windows)
  • Provides DOM context for your LLM agents



Quick Start

from domcontext import DomContext

# Parse HTML string
html = """
<html>
<head><title>Example</title></head>
<body>
    <nav><a href="/home">Home</a></nav>
    <main>
        <button type="submit">Search</button>
    </main>
</body>
</html>
"""

# Create DOM context
context = DomContext.from_html(html)

# Get markdown representation
print(context.markdown)
print(f"Tokens: {context.tokens}")

# Iterate through interactive elements
for element in context.elements():
    print(f"{element.id}: {element.tag} - {element.text}")

Output:

- body-1
  - nav-1
    - a-1 (href="/home")
      - "Home"
  - main-1
    - button-1 (type="submit")
      - "Search"

Tokens: 42

Installation

# Install from PyPI
pip install domcontext

# Install from source
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"

# Install with Playwright support (for live browser CDP capture)
pip install -e ".[playwright]"

# Install with Jupyter notebook support (to run the examples)
pip install -e ".[examples,playwright]"

# Install with all optional dependencies
pip install -e ".[dev,playwright,examples]"

After installing with Playwright support, install browser binaries:

playwright install chromium

Features

  • 🧹 Semantic filtering - Removes scripts, styles, hidden elements automatically
  • 📉 Token reduction - 60% average reduction in token count
  • 🎯 Structure preservation - Maintains DOM hierarchy in clean format
  • 🔍 Element lookup - Access original DOM elements by their generated IDs
  • 📊 Token counting - Built-in token counting with tiktoken
  • 🎛️ Configurable filtering - Fine-tune visibility and semantic filters
  • 📦 Multiple input formats - Support for HTML strings and CDP snapshots
  • 🧩 Smart chunking - Split large DOMs with continuation markers (...) and parent context for seamless chunk boundaries

API

Parse HTML

from domcontext import DomContext

# Basic parsing
context = DomContext.from_html(html_string)

# With custom filter options
context = DomContext.from_html(
    html_string,
    filter_non_visible=True,      # Remove script, style tags
    filter_css_hidden=True,        # Remove display:none, visibility:hidden
    filter_zero_dimensions=True,   # Remove zero-width/height elements
    filter_empty_elements=True,    # Remove empty wrapper divs
    filter_attributes=True,        # Keep only semantic attributes
    collapse_wrappers=True         # Collapse single-child wrappers
)

Parse CDP Snapshot

# From Chrome DevTools Protocol snapshot
cdp_snapshot = {
    'documents': [...],
    'strings': [...]
}

context = DomContext.from_cdp(cdp_snapshot)
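
A snapshot of this shape is what Chrome's DOMSnapshot.captureSnapshot command returns. Below is a minimal sketch of capturing one yourself through Playwright's raw CDP session; the capture_snapshot helper shown later wraps this for you, and the computedStyles list here is only an assumption about what the visibility filters need:

import asyncio
from playwright.async_api import async_playwright
from domcontext import DomContext

async def snapshot_url(url: str) -> DomContext:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)

        # DOMSnapshot.captureSnapshot returns {'documents': [...], 'strings': [...]}
        cdp = await page.context.new_cdp_session(page)
        cdp_snapshot = await cdp.send(
            "DOMSnapshot.captureSnapshot",
            {"computedStyles": ["display", "visibility", "width", "height"]},
        )
        await browser.close()
        return DomContext.from_cdp(cdp_snapshot)

context = asyncio.run(snapshot_url("https://example.com"))
print(context.markdown)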

Access Context

# Markdown representation
markdown = context.markdown

# Token count
token_count = context.tokens

# Get all interactive elements
for element in context.elements():
    print(f"ID: {element.id}")
    print(f"Tag: {element.tag}")
    print(f"Text: {element.text}")
    print(f"Attributes: {element.attributes}")

# Get element by ID
element = context.get_element("button-1")
print(element.attributes)  # {'type': 'submit'}

# Get as dictionary
data = context.to_dict()
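
If you want to persist or log the parsed context, a minimal sketch, assuming to_dict() returns plain JSON-serializable data (its exact structure is not documented here):

import json

# Write the parsed context to disk for caching or later inspection
with open("dom_context.json", "w") as f:
    json.dump(context.to_dict(), f, indent=2)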

Chunking

# Split large DOMs into chunks with smart continuation markers
for chunk in context.chunks(max_tokens=1000, overlap=100):
    print(f"Chunk tokens: {chunk.tokens}")
    print(chunk.markdown)

# Chunks automatically include:
# - Parent path context (e.g., "- body-1\n  - div-1")
# - Continuation markers (...) when elements span chunks
# Example: "- button-1 (type="submit" ...)" → continues in next chunk

# Disable parent path if needed
for chunk in context.chunks(max_tokens=1000, overlap=100, include_parent_path=False):
    print(chunk.markdown)  # No parent context
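
One common pattern is a map-style pass, querying an LLM chunk by chunk and merging the answers. A hedged sketch; build_chunk_prompt and call_llm are hypothetical stand-ins for your own prompt template and client, only chunk.tokens and chunk.markdown come from the API above:

def build_chunk_prompt(chunk) -> str:
    # Hypothetical prompt template around the documented chunk attributes
    return (
        f"Partial DOM context ({chunk.tokens} tokens):\n"
        f"{chunk.markdown}\n\n"
        "Reply with the ids of any login-related elements, or 'none'."
    )

candidate_ids = []
for chunk in context.chunks(max_tokens=1000, overlap=100):
    prompt = build_chunk_prompt(chunk)
    # candidate_ids.extend(call_llm(prompt))  # hypothetical LLM call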

Custom Tokenizer

from domcontext import DomContext, Tokenizer

class CustomTokenizer(Tokenizer):
    def count_tokens(self, text: str) -> int:
        # Your custom token counting logic
        return len(text.split())

context = DomContext.from_html(html, tokenizer=CustomTokenizer())
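
For example, a drop-in tokenizer pinned to a specific tiktoken encoding (assuming, as in the example above, that Tokenizer subclasses only need to implement count_tokens):

import tiktoken
from domcontext import DomContext, Tokenizer

_ENCODING = tiktoken.get_encoding("cl100k_base")

class Cl100kTokenizer(Tokenizer):
    """Counts tokens with tiktoken's cl100k_base encoding."""

    def count_tokens(self, text: str) -> int:
        return len(_ENCODING.encode(text))

context = DomContext.from_html(html, tokenizer=Cl100kTokenizer())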

Playwright Utilities (Optional)

Capture CDP snapshots directly from live browser sessions:

import asyncio

from playwright.async_api import async_playwright
from domcontext import DomContext
from domcontext.utils import capture_snapshot

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com')

        # Capture CDP snapshot from live page
        snapshot = await capture_snapshot(page)

        # Parse into DomContext
        context = DomContext.from_cdp(snapshot)
        print(context.markdown)

        await browser.close()

asyncio.run(main())

Note: Requires installation with pip install domcontext[playwright]


Architecture

The library uses a multi-stage filtering pipeline:

  1. Parse - HTML/CDP → DomIR (complete DOM tree with all data)
  2. Visibility Filter - Remove non-visible elements (optional flags)
    • Non-visible tags (script, style, head)
    • CSS hidden elements (display:none, visibility:hidden)
    • Zero-dimension elements
  3. Semantic Filter - Extract semantic information (optional flags)
    • Convert to SemanticIR
    • Filter to semantic attributes only
    • Remove empty nodes
    • Collapse wrapper divs
    • Generate readable IDs
  4. Output - SemanticIR → Markdown/JSON
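
Each stage is toggled through the documented from_html flags. A sketch mapping the flags onto the stages (DomIR and SemanticIR above are internal representations, not part of the public API):

context = DomContext.from_html(
    html,
    # Stage 2: visibility filter
    filter_non_visible=True,       # drop script, style, head
    filter_css_hidden=True,        # drop display:none / visibility:hidden
    filter_zero_dimensions=True,   # drop zero-width/height elements
    # Stage 3: semantic filter
    filter_attributes=True,        # keep only semantic attributes
    filter_empty_elements=True,    # remove empty wrapper nodes
    collapse_wrappers=True,        # collapse single-child wrappers
)

# Stage 4: output
markdown = context.markdown
data = context.to_dict()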

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=domcontext --cov-report=html

# Run specific test suite
pytest tests/unit/parsers/
pytest tests/unit/filters/
pytest tests/unit/ir/

Test Coverage:

  • 188 tests passing
  • HTML Parser (13 tests)
  • CDP Parser (12 tests)
  • DomIR Layer (27 tests)
  • SemanticIR Layer (34 tests)
  • Visibility Filters (43 tests)
  • Semantic Filters (28 tests)
  • Chunker (15 tests)
  • Tokenizers (13 tests)
  • Smoke tests (3 tests)

Use Cases

  • Web automation agents - Provide clean DOM context to LLMs for element selection
  • Web scraping - Extract structured content from complex pages
  • Testing - Generate clean snapshots of DOM state
  • Accessibility - Extract semantic structure from pages
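
For the web-automation case above, here is a hedged sketch of the element-selection loop an agent might run. The prompt wording, the sample HTML, and call_llm are assumptions; only the domcontext calls come from the API shown earlier:

from domcontext import DomContext

page_html = "<html><body><button type='submit'>Search</button></body></html>"  # whatever HTML your agent captured
context = DomContext.from_html(page_html)

prompt = (
    "You are controlling a web page. Its DOM context is:\n\n"
    f"{context.markdown}\n\n"
    "Reply with only the id of the element to click, e.g. 'button-1'."
)

# element_id = call_llm(prompt)   # hypothetical LLM call
element_id = "button-1"           # placeholder for illustration
element = context.get_element(element_id)
print(element.tag, element.attributes)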

License

MIT


Examples

Check out the interactive Jupyter notebooks in examples/:

  • simple_demo.ipynb - Quick start guide with Google search example
    • Element lookup by ID
    • Chunking demonstration
    • Perfect for beginners
  • advanced_demo.ipynb - Advanced features showcase
    • Custom filters and tokenizers
    • Element iteration and statistics
    • LLM prompt generation
    • Production patterns

Run with:

jupyter notebook examples/simple_demo.ipynb

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black src/ tests/

# Lint
ruff check src/ tests/

TODO

Collapsing Improvements

  1. Collapse text-wrapping elements - Improve wrapper collapsing to also collapse elements that only wrap text (not just elements that wrap other elements). Currently, <a><span>text</span></a> keeps the span, but it should be collapsed to <a>text</a> if the span has no attributes. Exception: Don't collapse interactive elements (button, input, a, select, textarea, etc.) even when they only wrap text, as these are semantically meaningful.

Evaluation & Benchmarking

  1. Mind2Web dataset evaluation - Conduct comprehensive testing on the Mind2Web dataset to evaluate DOM context quality, token reduction rates, and element selection accuracy across diverse real-world websites. Report will include performance metrics, edge cases discovered, and comparison with baseline HTML parsing.

Recently Completed

✅ Chunking Improvements (v0.1.3)

  • Atomic-level chunking - Implemented word-by-word text splitting and attribute-by-attribute element splitting with continuation markers (...)
  • Smart chunk boundaries - Text and attributes now split across chunks seamlessly with proper context preservation
  • Parent path context - Each chunk includes parent hierarchy for better LLM understanding
  • Better token utilization - No more wasted chunk capacity from oversized single-line elements

Contributing

Contributions welcome! Please ensure tests pass and add new tests for new features.

# Run full test suite
pytest -v

# Check coverage
pytest --cov=domcontext
