DOM nodes with browser rendering data for web automation

domnode

A Python library for parsing and filtering DOM trees with browser rendering information. Supports HTML and Chrome DevTools Protocol snapshots.

Installation

pip install domnode

Quick Start

from domnode import parse_html, filter_visible

html = """
<div>
    <script>console.log('hidden')</script>
    <div style="display: none">Hidden content</div>
    <button role="button" class="btn">Click me</button>
</div>
"""

root = parse_html(html)
visible = filter_visible(root)

for child in visible:
    print(child.tag, child.attrib)
# Output: button {'role': 'button', 'class': 'btn'}

Features

  • Parse HTML strings and CDP snapshots into rich DOM trees
  • Filter by visibility (display:none, visibility:hidden, opacity:0, zero size)
  • Filter by semantics (keep only meaningful attributes, collapse wrappers)
  • Access computed styles and bounding boxes
  • Covered by 86 unit tests

Usage

Parsing HTML

from domnode.parsers import parse_html

html = '<div class="container"><button>Click</button></div>'
root = parse_html(html)

print(root.tag)          # 'div'
print(root.attrib)       # {'class': 'container'}
print(root.children[0])  # Node(tag='button', ...)

Parsing CDP Snapshots

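from domnode.parsers import parse_cdp

# With Playwright's async API (Chromium only): open a CDP session for the
# page, then capture a snapshot. Run this inside an async function.
cdp = await page.context.new_cdp_session(page)
snapshot = await cdp.send('DOMSnapshot.captureSnapshot', {
    # CDP returns only the computed styles named here; an empty list
    # yields no style data. Adjust the list to the styles you need.
    'computedStyles': ['display', 'visibility', 'opacity', 'position'],
    'includeDOMRects': True
})

root = parse_cdp(snapshot)
print(root.bounds)  # e.g. BoundingBox(x=0, y=0, width=1920, height=1080)
print(root.styles)  # e.g. {'display': 'block', 'position': 'static', ...}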
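
For orientation, a DOMSnapshot.captureSnapshot result keeps all strings in one shared table and refers to them by index, with layout data stored as parallel arrays. The sketch below decodes tag names, bounds, and computed styles from a small hand-written snapshot using only the standard library. It illustrates the CDP data layout and is not domnode's parser; `decode_layout`, `REQUESTED_STYLES`, and the sample data are hypothetical.

```python
# Hand-written, heavily trimmed stand-in for a captureSnapshot result.
REQUESTED_STYLES = ["display", "opacity"]  # what was passed as 'computedStyles'

snapshot = {
    "strings": ["DIV", "BUTTON", "block", "1", "inline-block"],
    "documents": [{
        "nodes": {"nodeName": [0, 1]},        # indexes into "strings"
        "layout": {
            "nodeIndex": [0, 1],              # indexes into "nodes"
            "bounds": [[0, 0, 1920, 1080], [10, 20, 100, 50]],
            "styles": [[2, 3], [4, 3]],       # one string index per requested style
        },
    }],
}


def decode_layout(snapshot, requested_styles):
    """Return one dict per laid-out node with its tag, bounds, and styles."""
    strings = snapshot["strings"]
    decoded = []
    for document in snapshot["documents"]:
        node_names = document["nodes"]["nodeName"]
        layout = document["layout"]
        for node_index, bounds, style_indexes in zip(
            layout["nodeIndex"], layout["bounds"], layout["styles"]
        ):
            x, y, width, height = bounds
            decoded.append({
                "tag": strings[node_names[node_index]].lower(),
                "bounds": {"x": x, "y": y, "width": width, "height": height},
                "styles": {
                    name: strings[i]
                    for name, i in zip(requested_styles, style_indexes)
                },
            })
    return decoded


for entry in decode_layout(snapshot, REQUESTED_STYLES):
    print(entry["tag"], entry["bounds"], entry["styles"])
```

The style values come back in the same order as the names requested in 'computedStyles', which is why the request list has to be kept alongside the snapshot when decoding.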

Filtering Visible Elements

from domnode import parse_html, filter_visible

html = """
<div>
    <script>alert('hidden')</script>
    <style>.hide { display: none; }</style>
    <div style="display: none">Hidden</div>
    <div style="opacity: 0">Invisible</div>
    <button>Visible</button>
</div>
"""

root = parse_html(html)
visible = filter_visible(root)

# Only button remains
assert len(visible.children) == 1
assert visible.children[0].tag == 'button'
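
The rules behind this example can be pictured as a single predicate over a node's tag, computed styles, and size. The following is a minimal sketch of that idea, assuming styles arrive as a plain dict of strings and size as a (width, height) pair, and assuming an element counts as zero-size when either dimension is zero. `looks_hidden` and `NON_VISIBLE_TAGS` are hypothetical names; this is not the library's implementation.

```python
NON_VISIBLE_TAGS = {"script", "style", "head", "meta"}  # assumed subset


def looks_hidden(tag, styles=None, size=None):
    """Return True if any of the visibility rules marks the node as hidden."""
    styles = styles or {}
    if tag.lower() in NON_VISIBLE_TAGS:
        return True
    if styles.get("display", "").strip() == "none":
        return True
    if styles.get("visibility", "").strip() == "hidden":
        return True
    try:
        if float(styles.get("opacity", "1")) == 0:
            return True
    except ValueError:
        pass  # unparsable opacity: treat the node as visible
    if size is not None:
        width, height = size
        # Assumption: zero in either dimension counts as zero-size.
        if width == 0 or height == 0:
            return True
    return False


print(looks_hidden("script"))                                   # True
print(looks_hidden("div", {"display": "none"}))                 # True
print(looks_hidden("div", {"opacity": "0"}))                    # True
print(looks_hidden("button", {"display": "block"}, (100, 50)))  # False
print(looks_hidden("div", {}, (0, 40)))                         # True
```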

Filtering Semantic Content

from domnode import parse_html, filter_semantic

html = """
<div class="wrapper" id="container">
    <div class="inner">
        <button class="btn" role="button" aria-label="Submit">Click</button>
    </div>
</div>
"""

root = parse_html(html)
semantic = filter_semantic(root)

# Wrappers collapsed, only semantic attributes remain
assert semantic.tag == 'button'
assert semantic.attrib == {'role': 'button', 'aria-label': 'Submit'}
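
To make the two steps in this example concrete, here is a rough standard-library sketch of attribute stripping followed by wrapper collapsing. It uses a stand-in `El` dataclass rather than domnode's Node, and it assumes a wrapper is any element left with no attributes and exactly one element child; the library's actual rules may differ. `El`, `strip_attributes`, `collapse`, and `KEEP` are hypothetical names.

```python
from dataclasses import dataclass, field

KEEP = {"role", "aria-label"}  # small subset of semantic attributes for the demo


@dataclass
class El:
    """Stand-in element; not domnode's Node class."""
    tag: str
    attrib: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # El instances or str text


def strip_attributes(el, keep=KEEP):
    """Return a copy of the tree with only the attributes named in `keep`."""
    return El(
        tag=el.tag,
        attrib={k: v for k, v in el.attrib.items() if k in keep},
        children=[
            strip_attributes(c, keep) if isinstance(c, El) else c
            for c in el.children
        ],
    )


def collapse(el):
    """Replace attribute-less elements wrapping exactly one element with that child."""
    children = [collapse(c) if isinstance(c, El) else c for c in el.children]
    if not el.attrib and len(children) == 1 and isinstance(children[0], El):
        return children[0]
    return El(el.tag, el.attrib, children)


tree = El("div", {"class": "wrapper", "id": "container"}, [
    El("div", {"class": "inner"}, [
        El("button", {"class": "btn", "role": "button", "aria-label": "Submit"}, ["Click"]),
    ]),
])

result = collapse(strip_attributes(tree))
print(result.tag, result.attrib)  # button {'role': 'button', 'aria-label': 'Submit'}
```

Stripping attributes first matters in this sketch: the wrappers only become collapsible once their class and id attributes are gone.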

Combining Filters

from domnode import parse_html, filter_all

html = """
<html>
    <head>
        <script src="app.js"></script>
    </head>
    <body class="page">
        <div class="wrapper">
            <button class="btn" role="button">Click</button>
        </div>
    </body>
</html>
"""

root = parse_html(html)
clean = filter_all(root)

# Head removed, wrappers collapsed, only semantic attributes
assert clean.tag == 'button'
assert clean.attrib == {'role': 'button'}

Granular Filtering

from domnode.parsers import parse_html
from domnode.filters.visibility import filter_css_hidden, filter_zero_dimensions
from domnode.filters.semantic import filter_attributes, collapse_wrappers

html = '<div class="wrapper"><button class="btn" role="button">Click</button></div>'
root = parse_html(html)

# Apply specific filters, one at a time
# (filter_zero_dimensions is applied the same way)
root = filter_css_hidden(root)
root = filter_attributes(root)
root = collapse_wrappers(root)
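
The preset filters are documented as returning Node | None, so a chain like the one above could hand None to the next filter if an earlier step removes the whole tree. Whether the granular filters can also return None is an assumption here. A small helper that stops early is one way to guard against that; the sketch below uses plain dicts and two made-up filters so it runs without domnode, and `apply_filters` is a hypothetical name, not part of the library.

```python
def apply_filters(node, filters):
    """Apply each filter in order; stop as soon as one returns None."""
    for current_filter in filters:
        if node is None:
            return None
        node = current_filter(node)
    return node


# Stand-in filters operating on dicts instead of domnode nodes.
def drop_if_hidden(node):
    return None if node.get("hidden") else node


def strip_class(node):
    return {key: value for key, value in node.items() if key != "class"}


visible = {"tag": "button", "class": "btn"}
hidden = {"tag": "div", "class": "x", "hidden": True}

print(apply_filters(visible, [drop_if_hidden, strip_class]))  # {'tag': 'button'}
print(apply_filters(hidden, [drop_if_hidden, strip_class]))   # None
```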

Working with Nodes

from domnode import Node, Text, BoundingBox

# Create nodes
div = Node(tag='div', attrib={'class': 'container'})
button = Node(
    tag='button',
    attrib={'role': 'button'},
    styles={'display': 'block'},
    bounds=BoundingBox(x=10, y=20, width=100, height=50)
)

# Build tree
div.append(Text('Click here: '))
div.append(button)
button.append(Text('Submit'))

# Navigate
for child in div:
    if isinstance(child, Node):
        print(f"Element: {child.tag}")
    elif isinstance(child, Text):
        print(f"Text: {child.content}")

# Get all text
print(div.get_text())  # "Click here: Submit"

# Check visibility
print(button.is_visible())      # True
print(button.has_zero_size())   # False
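
The loop above visits direct children only. Walking the whole tree takes a traversal; the sketch below shows an iterative depth-first walk in document order over a stand-in `El` class with `tag` and `children`. With domnode you would presumably test `isinstance(child, Node)` instead, which is an assumption based on the example above. `El` and `iter_elements` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class El:
    """Stand-in element; children are El instances or plain strings."""
    tag: str
    children: list = field(default_factory=list)


def iter_elements(root):
    """Yield root and all descendant elements depth-first, skipping text."""
    stack = [root]
    while stack:
        el = stack.pop()
        yield el
        # Push children in reverse so the first child is visited first.
        stack.extend(c for c in reversed(el.children) if isinstance(c, El))


page = El("div", [
    "Click here: ",
    El("form", [El("input"), El("button", ["Submit"])]),
    El("button", ["Cancel"]),
])

tags = [el.tag for el in iter_elements(page)]
print(tags)  # ['div', 'form', 'input', 'button', 'button']
```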

API Reference

Types

Node
DOM element with tag, attributes, styles, bounds, metadata, and children.

Text
Text node with content.

BoundingBox
Element bounding box with x, y, width, height.

Parsers

parse_html(html: str) -> Node
Parse HTML string to Node tree.

parse_cdp(snapshot: dict) -> Node
Parse CDP snapshot to Node tree.

Preset Filters

filter_visible(node) -> Node | None
Remove all hidden elements.

filter_semantic(node) -> Node | None
Keep only semantic content.

filter_all(node) -> Node | None
Apply all filters.

Visibility Filters

filter_non_visible_tags(node)
Remove script, style, head, meta, etc.

filter_css_hidden(node)
Remove display:none, visibility:hidden, opacity:0.

filter_zero_dimensions(node)
Remove zero-width/height elements.

Semantic Filters

filter_attributes(node, keep=SEMANTIC_ATTRIBUTES)
Keep only semantic attributes.

filter_empty(node)
Remove empty nodes.

collapse_wrappers(node)
Collapse single-child wrapper elements.

Node Methods

node.append(child)
Add a child node or text.

node.remove(child)
Remove a child.

node.is_visible()
Check if element is visible.

node.has_zero_size()
Check if element has zero dimensions.

node.get_text(separator='')
Get all text content recursively.

Semantic Attributes

By default, filter_attributes keeps these attributes:

SEMANTIC_ATTRIBUTES = {
    "role", "aria-label", "aria-labelledby", "aria-describedby",
    "aria-checked", "aria-selected", "aria-expanded", "aria-hidden",
    "aria-disabled", "type", "name", "placeholder", "value",
    "alt", "title", "href", "disabled", "checked", "selected"
}

You can customize:

from domnode.filters.semantic import filter_attributes

custom_attrs = {"role", "href", "data-test-id"}
filtered = filter_attributes(node, keep=custom_attrs)
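
Judging by the `keep=SEMANTIC_ATTRIBUTES` default, a custom set most likely replaces the defaults rather than adding to them, so the three-name set above would drop attributes such as aria-label. If the goal is to extend the defaults, a set union is one way to do it. The sketch below copies the default set from this README and uses a hypothetical `keep_attributes` helper on a plain dict so that it runs without domnode; it is not the library's implementation.

```python
# Default set as listed in this README.
SEMANTIC_ATTRIBUTES = {
    "role", "aria-label", "aria-labelledby", "aria-describedby",
    "aria-checked", "aria-selected", "aria-expanded", "aria-hidden",
    "aria-disabled", "type", "name", "placeholder", "value",
    "alt", "title", "href", "disabled", "checked", "selected",
}

# Extend the defaults instead of replacing them.
custom_attrs = SEMANTIC_ATTRIBUTES | {"data-test-id"}


def keep_attributes(attrib, keep):
    """Return only the attributes whose names are in `keep` (hypothetical helper)."""
    return {name: value for name, value in attrib.items() if name in keep}


attrib = {
    "class": "btn",
    "role": "button",
    "data-test-id": "submit",
    "style": "color: red",
}
print(keep_attributes(attrib, custom_attrs))
# {'role': 'button', 'data-test-id': 'submit'}
```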

Use Cases

Web Scraping
Extract only visible, meaningful content from web pages.

Browser Automation
Filter DOM to only interactive elements for AI agents.

LLM Context
Reduce HTML to essential semantic structure for language models.

Accessibility Testing
Analyze semantic attributes and ARIA labels.

Testing
Build and manipulate DOM trees programmatically.

Development

# Clone repository
git clone https://github.com/steve-z-wang/domnode.git
cd domnode

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=domnode --cov-report=html

License

MIT

Contributing

Contributions are welcome. Please submit a Pull Request.

Related Projects

domcontext - DOM to LLM context with markdown serialization

natural-selector - Natural language element selection with RAG
