domnode

DOM nodes with browser rendering data for web automation.

domnode is a Python library that provides DOM node types enriched with browser rendering information (computed styles, bounding boxes, CDP metadata). It includes parsers for HTML and Chrome DevTools Protocol (CDP) snapshots, plus filtering utilities that remove hidden elements and non-semantic attributes so that only visible, semantic content remains.

Features

  • 🌳 Rich DOM nodes: Includes computed styles, bounding boxes, and CDP backend node IDs
  • 📦 Dual parsers: Parse from HTML strings or CDP snapshots
  • 🎯 Smart filtering: Remove hidden elements, non-semantic attributes, and wrapper divs
  • 🔍 Visibility detection: Handle display:none, visibility:hidden, opacity:0, zero-size elements
  • 🏷️ Semantic extraction: Keep only meaningful attributes (role, aria-*, type, href, etc.)
  • 🧹 Tree optimization: Collapse unnecessary wrapper elements
  • ✅ Well-tested: 86 unit tests with comprehensive coverage

Installation

pip install domnode

Quick Start

from domnode import parse_html, filter_visible

# Parse HTML
html = """
<div>
    <script>console.log('hidden')</script>
    <div style="display: none">Hidden content</div>
    <button role="button" class="btn">Click me</button>
</div>
"""

root = parse_html(html)

# Filter to only visible elements
visible = filter_visible(root)

# Result: Only the button remains
for child in visible:
    print(child.tag, child.attrib)
# Output: button {'role': 'button', 'class': 'btn'}

Usage

Parsing HTML

from domnode.parsers import parse_html

html = '<div class="container"><button>Click</button></div>'
root = parse_html(html)

print(root.tag)          # 'div'
print(root.attrib)       # {'class': 'container'}
print(root.children[0])  # Node(tag='button', ...)

Parsing CDP Snapshots

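from domnode.parsers import parse_cdp

# With Playwright (async API); `page` is an existing Page object
cdp = await page.context.new_cdp_session(page)
snapshot = await cdp.send('DOMSnapshot.captureSnapshot', {
    # Computed style properties to capture; an empty list returns no styles
    'computedStyles': ['display', 'visibility', 'opacity', 'position'],
    'includeDOMRects': True
})

root = parse_cdp(snapshot)
print(root.bounds)  # BoundingBox(x=0, y=0, width=1920, height=1080)
print(root.styles)  # {'display': 'block', 'position': 'static', ...}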
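
For readers unfamiliar with the snapshot format: DOMSnapshot.captureSnapshot returns flat, index-based arrays plus a shared string table, which a parser has to reassemble into a tree. The sketch below decodes a tiny hand-written snapshot into nested dicts using only the standard library. It illustrates the CDP data layout and is not domnode's parse_cdp implementation; decode_snapshot and the sample data are made up for this example, and real snapshots contain many more fields (node types, text nodes, styles).

```python
from typing import Any, Optional


def decode_snapshot(snapshot: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Rebuild a nested tree from the first document of a DOMSnapshot result."""
    strings = snapshot["strings"]
    doc = snapshot["documents"][0]
    nodes = doc["nodes"]
    layout = doc["layout"]

    # layout.nodeIndex maps each layout entry back to an index into `nodes`.
    bounds_by_node = {
        node_idx: layout["bounds"][i]
        for i, node_idx in enumerate(layout["nodeIndex"])
    }

    built: list[dict[str, Any]] = []
    for i, name_idx in enumerate(nodes["nodeName"]):
        # Attributes are a flat [name, value, name, value, ...] list of string indices.
        flat = nodes["attributes"][i]
        attrib = {strings[flat[j]]: strings[flat[j + 1]] for j in range(0, len(flat), 2)}
        built.append({
            "tag": strings[name_idx].lower(),
            "attrib": attrib,
            "bounds": bounds_by_node.get(i),  # None when the node has no layout box
            "children": [],
        })

    # parentIndex links each node to its parent; the root has a negative index.
    root = None
    for i, parent in enumerate(nodes["parentIndex"]):
        if parent < 0:
            root = built[i]
        else:
            built[parent]["children"].append(built[i])
    return root


# Hand-written sample: a <div> containing <button role="button">.
sample = {
    "strings": ["DIV", "BUTTON", "role", "button"],
    "documents": [{
        "nodes": {
            "parentIndex": [-1, 0],
            "nodeName": [0, 1],
            "attributes": [[], [2, 3]],
        },
        "layout": {
            "nodeIndex": [0, 1],
            "bounds": [[0, 0, 800, 600], [10, 20, 100, 50]],
        },
    }],
}

tree = decode_snapshot(sample)
print(tree["tag"], [child["tag"] for child in tree["children"]])
```

Keeping every string in one table is what makes the raw snapshot compact, and it is also why the raw form is awkward to work with directly.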

Filtering - Visibility

Remove hidden and non-visible elements:

from domnode import parse_html, filter_visible

html = """
<div>
    <script>alert('hidden')</script>
    <style>.hide { display: none; }</style>
    <div style="display: none">Hidden</div>
    <div style="opacity: 0">Invisible</div>
    <button>Visible</button>
</div>
"""

root = parse_html(html)
visible = filter_visible(root)

# Only button remains
assert len(visible.children) == 1
assert visible.children[0].tag == 'button'

Filtering - Semantic

Keep only semantic attributes and clean structure:

from domnode import parse_html, filter_semantic

html = """
<div class="wrapper" id="container">
    <div class="inner">
        <button class="btn" role="button" aria-label="Submit">Click</button>
    </div>
</div>
"""

root = parse_html(html)
semantic = filter_semantic(root)

# Wrappers collapsed, only semantic attributes remain
assert semantic.tag == 'button'
assert semantic.attrib == {'role': 'button', 'aria-label': 'Submit'}

Filtering - All (Visibility + Semantic)

from domnode import parse_html, filter_all

html = """
<html>
    <head>
        <script src="app.js"></script>
    </head>
    <body class="page">
        <div class="wrapper">
            <button class="btn" role="button">Click</button>
        </div>
    </body>
</html>
"""

root = parse_html(html)
clean = filter_all(root)

# Head removed, wrappers collapsed, only semantic attributes
assert clean.tag == 'button'
assert clean.attrib == {'role': 'button'}

Granular Filtering

Use individual filters for fine-grained control:

from domnode.parsers import parse_html
from domnode.filters.visibility import filter_css_hidden, filter_zero_dimensions
from domnode.filters.semantic import filter_attributes, collapse_wrappers

html = '<div class="wrapper"><button class="btn" role="button">Click</button></div>'
root = parse_html(html)

# Apply specific filters
root = filter_css_hidden(root)
root = filter_attributes(root)
root = collapse_wrappers(root)
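
The preset filters are documented as returning Node | None, so a chain of reassignments like the one above could hand None to the next filter if an earlier step removes the root. Whether the individual filters can also return None is an assumption here. The helper below, chain_filters, is hypothetical and not part of domnode; it is a small sketch of applying filters in order and stopping once the tree is gone.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def chain_filters(
    node: Optional[T],
    filters: Iterable[Callable[[T], Optional[T]]],
) -> Optional[T]:
    """Apply filters left to right, stopping early once a step returns None."""
    for apply_filter in filters:
        if node is None:
            return None
        node = apply_filter(node)
    return node


# Stand-in filters on plain strings, just to show the control flow.
def strip_spaces(text: str) -> Optional[str]:
    return text.strip()


def drop_if_empty(text: str) -> Optional[str]:
    return text or None


print(chain_filters("  hello  ", [strip_spaces, drop_if_empty]))
print(chain_filters("     ", [strip_spaces, drop_if_empty, str.upper]))
```

With domnode installed, the same helper could be called as chain_filters(root, [filter_css_hidden, filter_attributes, collapse_wrappers]).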

Working with Nodes

from domnode import Node, Text, BoundingBox

# Create nodes
div = Node(tag='div', attrib={'class': 'container'})
button = Node(
    tag='button',
    attrib={'role': 'button'},
    styles={'display': 'block'},
    bounds=BoundingBox(x=10, y=20, width=100, height=50)
)

# Build tree
div.append(Text('Click here: '))
div.append(button)
button.append(Text('Submit'))

# Navigate
for child in div:
    if isinstance(child, Node):
        print(f"Element: {child.tag}")
    elif isinstance(child, Text):
        print(f"Text: {child.content}")

# Get all text
print(div.get_text())  # "Click here: Submit"

# Check visibility
print(button.is_visible())      # True
print(button.has_zero_size())   # False

API Reference

Types

  • Node: DOM element with tag, attributes, styles, bounds, metadata, and children
  • Text: Text node with content
  • BoundingBox: Element bounding box (x, y, width, height)

Parsers

  • parse_html(html: str) -> Node: Parse HTML string to Node tree
  • parse_cdp(snapshot: dict) -> Node: Parse CDP snapshot to Node tree

Filters

Presets (convenience)

  • filter_visible(node) -> Node | None: Remove all hidden elements
  • filter_semantic(node) -> Node | None: Keep only semantic content
  • filter_all(node) -> Node | None: Apply all filters

Visibility Filters

  • filter_non_visible_tags(node): Remove script, style, head, meta, etc.
  • filter_css_hidden(node): Remove display:none, visibility:hidden, opacity:0
  • filter_zero_dimensions(node): Remove zero-width/height elements
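
The checks these filters describe can be sketched with plain dictionaries and tuples. This is an illustration of the rules listed above, not domnode's code: is_css_hidden and has_zero_size are hypothetical helpers, and the choice to treat missing bounds or an unparseable opacity as visible is an assumption.

```python
from typing import Mapping, Optional, Tuple


def is_css_hidden(styles: Mapping[str, str]) -> bool:
    """True if computed styles hide the element via display, visibility, or opacity."""
    if styles.get("display", "").strip() == "none":
        return True
    if styles.get("visibility", "").strip() == "hidden":
        return True
    try:
        return float(styles.get("opacity", "1")) == 0.0
    except ValueError:
        return False  # unparseable opacity: treat the element as visible


def has_zero_size(bounds: Optional[Tuple[float, float, float, float]]) -> bool:
    """True if a known (x, y, width, height) box has no width or no height."""
    if bounds is None:
        return False  # no layout data: assume the element has a size
    _, _, width, height = bounds
    return width <= 0 or height <= 0


print(is_css_hidden({"display": "none"}))
print(is_css_hidden({"display": "block", "opacity": "0.5"}))
print(has_zero_size((10, 20, 0, 50)))
```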

Semantic Filters

  • filter_attributes(node, keep=SEMANTIC_ATTRIBUTES): Keep only semantic attributes
  • filter_empty(node): Remove empty nodes (no attributes, no children)
  • collapse_wrappers(node): Collapse single-child wrapper elements
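
To make the wrapper-collapsing idea concrete, here is a minimal sketch on a stand-in tree type. El is a hypothetical dataclass, not domnode's Node, and the rule used (an element with no attributes and exactly one element child is replaced by that child) is an assumption about what "single-child wrapper" means.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class El:
    """Hypothetical stand-in for an element node; not domnode's Node class."""
    tag: str
    attrib: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # El instances or str text


def collapse(node: Union[El, str]) -> Union[El, str]:
    """Replace attribute-less elements wrapping exactly one element with that element."""
    if isinstance(node, str):
        return node
    # Collapse bottom-up so nested wrappers disappear in one pass (mutates in place).
    node.children = [collapse(child) for child in node.children]
    only_child = node.children[0] if len(node.children) == 1 else None
    if not node.attrib and isinstance(only_child, El):
        return only_child
    return node


tree = El("div", children=[
    El("div", children=[
        El("button", {"role": "button"}, ["Click"]),
    ]),
])
result = collapse(tree)
print(result.tag, result.attrib, result.children)
```

This also suggests why attribute filtering would run before collapsing: a wrapper that still carries class or id would not count as empty under this rule.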

Node Methods

  • node.append(child): Add a child node or text
  • node.remove(child): Remove a child
  • node.is_visible(): Check if element is visible (based on styles)
  • node.has_zero_size(): Check if element has zero dimensions
  • node.get_text(separator=''): Get all text content recursively

Architecture

domnode is designed as a foundational library for web automation:

┌─────────────────────┐
│   natural-selector  │  (RAG-based element selection)
│   (embeddings, LLM) │
└──────────┬──────────┘
           │ uses
┌──────────▼──────────┐
│     domcontext      │  (LLM context formatting)
│  (markdown, tokens) │
└──────────┬──────────┘
           │ uses
┌──────────▼──────────┐
│      domnode        │  (Core DOM + filtering)
│  (this package)     │
└─────────────────────┘

Semantic Attributes

By default, filter_attributes keeps these semantic attributes:

SEMANTIC_ATTRIBUTES = {
    "role", "aria-label", "aria-labelledby", "aria-describedby",
    "aria-checked", "aria-selected", "aria-expanded", "aria-hidden",
    "aria-disabled", "type", "name", "placeholder", "value",
    "alt", "title", "href", "disabled", "checked", "selected"
}

You can customize:

from domnode.filters.semantic import filter_attributes

custom_attrs = {"role", "href", "data-test-id"}
filtered = filter_attributes(node, keep=custom_attrs)
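
Conceptually, filtering with a keep set amounts to a membership test on each attribute name. The sketch below assumes exact name matching (no wildcards such as aria-*), which may differ from what domnode actually does; keep_attributes is a hypothetical helper operating on a plain dict.

```python
def keep_attributes(attrib: dict[str, str], keep: set[str]) -> dict[str, str]:
    """Return a copy of attrib containing only the names listed in keep."""
    return {name: value for name, value in attrib.items() if name in keep}


custom_attrs = {"role", "href", "data-test-id"}
attrib = {"class": "btn", "role": "button", "data-test-id": "submit"}
print(keep_attributes(attrib, custom_attrs))
```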

Use Cases

Web Scraping

Extract only visible, meaningful content from web pages.

Browser Automation

Filter DOM to only interactive elements for AI agents.

LLM Context

Reduce HTML to essential semantic structure for language models.

Accessibility Testing

Analyze semantic attributes and ARIA labels.

Testing

Build and manipulate DOM trees programmatically.

Development

# Clone repository
git clone https://github.com/yourusername/domnode.git
cd domnode

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=domnode --cov-report=html

Testing

The package includes 86 comprehensive unit tests covering:

  • Core node types and operations
  • HTML and CDP parsing
  • All visibility filters
  • All semantic filters
  • Preset filter combinations
  • Edge cases and error handling

Run the suite in verbose mode:

pytest -v

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Related Projects

Changelog

0.1.0 (2025-01-XX)

  • Initial release
  • Core Node, Text, BoundingBox types
  • HTML and CDP parsers
  • Visibility and semantic filters
  • 86 unit tests
