Machine-readable web semantics for AI agents. O(1) lookup, deterministic navigation, token-efficient serialization.
semantic-dom-ssg
Machine-readable web semantics for AI agents.
It provides O(1) element lookup, deterministic navigation, and token-efficient serialization optimized for LLM consumption.
Features
- O(1) Lookup: Hash-indexed nodes via dict for constant-time element access
- Semantic State Graph: Explicit FSM for UI states and transitions
- Agent Summary: ~100 tokens vs ~800 for JSON (87% reduction)
- Security Hardened: Input validation, URL sanitization, size limits
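As a rough illustration of what an explicit state graph can look like, the sketch below stores transitions in a dict keyed by (state, action). The names `StateGraphSketch`, `add_transition`, and `step`, and the action strings, are hypothetical and are not taken from this package's API.

```python
from dataclasses import dataclass, field


@dataclass
class StateGraphSketch:
    """Hypothetical explicit FSM: (state, action) -> next state."""

    initial: str = "initial"
    transitions: dict[tuple[str, str], str] = field(default_factory=dict)

    def add_transition(self, source: str, action: str, target: str) -> None:
        key = (source, action)
        # Reject a second, different target for the same (state, action)
        # pair so the graph stays deterministic.
        if key in self.transitions and self.transitions[key] != target:
            raise ValueError(f"non-deterministic transition for {key}")
        self.transitions[key] = target

    def step(self, state: str, action: str) -> str:
        # One dict lookup per step; an unknown pair raises KeyError.
        return self.transitions[(state, action)]


graph = StateGraphSketch()
graph.add_transition("initial", "click:Home", "Home")
graph.add_transition("initial", "click:Submit", "submitted")
print(graph.step("initial", "click:Home"))
```

Keying on the (state, action) pair means each step costs one hash lookup, and a conflicting transition is caught when the graph is built rather than when an agent navigates it.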
Installation
pip install semantic-dom-ssg
Quick Start
from semantic_dom_ssg import SemanticDOM, Config
html = """
<html>
<body>
<nav><a href="/">Home</a></nav>
<main><button>Submit</button></main>
</body>
</html>
"""
# Parse HTML
sdom = SemanticDOM.parse(html)
# Iterate over the index (a dict, so sdom.index[node_id] is an O(1) lookup)
for node_id, node in sdom.index.items():
    print(f"{node_id}: {node.role.value} - {node.label}")
# Token-efficient summary (~100 tokens)
print(sdom.to_agent_summary())
# One-line summary (~20 tokens)
print(sdom.to_one_liner())
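The index used in the Quick Start is a plain dict keyed by node ID. The following stdlib-only sketch shows the general idea of building such an index while parsing. It is not the package's implementation: the class `IndexSketch`, the tracked tag set, and the per-tag counter ID scheme are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class IndexSketch(HTMLParser):
    """Hypothetical indexer: gives each tracked tag an ID and stores it in a dict."""

    TRACKED = {"nav", "main", "a", "button"}

    def __init__(self) -> None:
        super().__init__()
        self.index: dict[str, dict] = {}
        self._counts: Counter = Counter()
        self._open: list[str] = []  # IDs of tracked elements currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._counts[tag] += 1
            node_id = f"sdom_{tag}_{self._counts[tag]}"  # assumed ID scheme
            self.index[node_id] = {"tag": tag, "label": ""}
            self._open.append(node_id)

    def handle_endtag(self, tag):
        if self._open and self.index[self._open[-1]]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Attach text to the innermost open tracked element.
        if self._open and data.strip():
            self.index[self._open[-1]]["label"] += data.strip()


parser = IndexSketch()
parser.feed("<nav><a href='/'>Home</a></nav><main><button>Submit</button></main>")
print(parser.index["sdom_button_1"])  # single dict lookup by ID
```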
CLI Tool
# Parse HTML to JSON
semantic-dom parse input.html --format json
# Token-efficient summary
semantic-dom parse input.html --format summary
# One-line summary (~20 tokens)
semantic-dom parse input.html --format oneline
# Validate for agent compatibility
semantic-dom validate input.html --level aa --ci
# Compare token usage
semantic-dom tokens input.html
Output Formats
JSON (Full)
{
"title": "My Page",
"landmarks": ["sdom_nav_1", "sdom_main_2"],
"interactables": ["sdom_a_1", "sdom_button_1"],
"nodes": { ... }
}
Agent Summary (~100 tokens)
PAGE: My Page
LANDMARKS: nav(nav), main(main)
ACTIONS: [nav]Home, [act]Submit
STATE: initial -> Home
STATS: 2L 2A 0H
One-liner (~20 tokens)
My Page | 2L 2A | nav,main | lnk:Home,btn:Submit
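To make the one-liner layout concrete, here is a minimal sketch that assembles the same string from structured data. The `page` dict and the function `one_liner_sketch` are hypothetical, and the field order is inferred from the example line above rather than from the package's source.

```python
import json

# Hypothetical page data, shaped loosely like the JSON output above.
page = {
    "title": "My Page",
    "landmarks": ["nav", "main"],
    "actions": [("lnk", "Home"), ("btn", "Submit")],
}


def one_liner_sketch(page: dict) -> str:
    """Join title, counts, landmark names, and actions with ' | '."""
    counts = f"{len(page['landmarks'])}L {len(page['actions'])}A"
    landmarks = ",".join(page["landmarks"])
    actions = ",".join(f"{kind}:{label}" for kind, label in page["actions"])
    return " | ".join([page["title"], counts, landmarks, actions])


line = one_liner_sketch(page)
print(line)
# Character counts are only a rough proxy for token counts.
print(len(line), "chars vs", len(json.dumps(page)), "chars of JSON")
```

The saving comes from dropping keys, quotes, and brackets; actual token counts depend on the tokenizer in use.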
Security
This package implements security hardening per ISO/IEC-SDOM-SSG-DRAFT-2024:
- Input Size Limits: 10MB default maximum
- URL Validation: Only https, http, and file protocols allowed
- Protocol Blocking: javascript:, data:, vbscript:, and blob: blocked
- No Script Execution: HTML parsing only, no JS evaluation
from semantic_dom_ssg import validate_url
from semantic_dom_ssg.security import InvalidUrlProtocolError
# Safe URLs
assert validate_url("https://example.com") == "https://example.com"
assert validate_url("/relative/path") == "/relative/path"
# Blocked URLs
try:
    validate_url("javascript:alert(1)")
except InvalidUrlProtocolError as e:
    print(f"Blocked: {e.protocol}")
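For readers curious how a protocol allowlist like this can work, the sketch below uses only `urllib.parse` from the standard library. It is an assumption-laden illustration, not the package's code: `validate_url_sketch` and `BlockedProtocolError` are hypothetical names, and the real `validate_url` may handle more edge cases.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https", "http", "file"}  # taken from the list above


class BlockedProtocolError(ValueError):
    """Hypothetical error type for this sketch."""

    def __init__(self, protocol: str) -> None:
        super().__init__(f"protocol not allowed: {protocol}")
        self.protocol = protocol


def validate_url_sketch(url: str) -> str:
    """Return the URL unchanged if its scheme is allowed or absent."""
    scheme = urlsplit(url.strip()).scheme.lower()
    # An empty scheme means a relative URL such as "/relative/path".
    if scheme and scheme not in ALLOWED_SCHEMES:
        raise BlockedProtocolError(scheme)
    return url


print(validate_url_sketch("/relative/path"))
try:
    validate_url_sketch("javascript:alert(1)")
except BlockedProtocolError as e:
    print(f"Blocked: {e.protocol}")
```

An allowlist rejects every scheme that is not explicitly permitted, so javascript:, data:, vbscript:, and blob: are refused without being listed individually.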
Agent Certification
Validate HTML documents for AI agent compatibility:
from semantic_dom_ssg import SemanticDOM, AgentCertification
sdom = SemanticDOM.parse(html)
cert = AgentCertification.certify(sdom)
print(f"{cert.level.badge} Level: {cert.level.name_str} (Score: {cert.score})")
print(f"Passed: {cert.stats.passed_checks}/{cert.stats.total_checks} checks")
Certification Levels
| Level | Badge | Requirements |
|---|---|---|
| AAA | 🥇 | Score 90+ (full compliance) |
| AA | 🥈 | Score 70-89 (deterministic FSM) |
| A | 🥉 | Score 50-69 (basic compliance) |
| None | ❌ | Score < 50 |
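The thresholds in the table can be read as a simple mapping from score to level. A minimal sketch, assuming scores range from 0 to 100 and using a hypothetical helper name `level_for_score` (the package exposes levels through `cert.level` instead):

```python
def level_for_score(score: float) -> str:
    """Map a 0-100 score to the level names in the table above."""
    if score >= 90:
        return "AAA"
    if score >= 70:
        return "AA"
    if score >= 50:
        return "A"
    return "None"


for s in (95, 70, 50, 49):
    print(s, level_for_score(s))
```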
API Reference
SemanticDOM
class SemanticDOM:
    # Attributes
    index: dict[str, SemanticNode]   # O(1) lookup
    landmarks: list[str]             # Landmark IDs
    interactables: list[str]         # Interactive element IDs
    headings: list[str]              # Heading IDs
    state_graph: StateGraph          # UI state machine
    title: Optional[str]             # Document title
    lang: Optional[str]              # Document language

    # Methods
    @classmethod
    def parse(cls, html: str, config: Optional[Config] = None) -> "SemanticDOM"
    def get(self, node_id: str) -> Optional[SemanticNode]
    def get_landmarks(self) -> list[SemanticNode]
    def get_interactables(self) -> list[SemanticNode]
    def to_json(self, indent: int = 2) -> str
    def to_dict(self) -> dict
    def to_agent_summary(self) -> str
    def to_one_liner(self) -> str
Config
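from dataclasses import dataclass, field

@dataclass
class Config:
    max_input_size: int = 10 * 1024 * 1024  # 10MB
    id_prefix: str = "sdom"
    max_depth: int = 50
    # A mutable default needs default_factory in a dataclass
    exclude_tags: list[str] = field(
        default_factory=lambda: ["script", "style", "noscript", "template"]
    )
    include_state_graph: bool = True
    validate: bool = True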
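As an illustration of how a size limit such as `max_input_size` might be enforced before parsing, here is a small sketch. It assumes the limit is measured in UTF-8 bytes, which the documentation above does not state; `LimitsSketch` and `check_input_size` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LimitsSketch:
    max_input_size: int = 10 * 1024 * 1024  # bytes (assumed unit)


def check_input_size(html: str, limits: LimitsSketch) -> None:
    """Raise ValueError if the encoded input exceeds the configured limit."""
    size = len(html.encode("utf-8"))
    if size > limits.max_input_size:
        raise ValueError(f"input is {size} bytes; limit is {limits.max_input_size}")


check_input_size("<p>ok</p>", LimitsSketch())  # within the default limit
```

Checking the size before any parsing work starts keeps oversized input from consuming memory or CPU.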
Standards
Implements ISO/IEC-SDOM-SSG-DRAFT-2024 specification for:
- Semantic element classification
- State graph construction
- Agent-ready certification
- Token-efficient serialization
Related
- semantic-dom-ssg (npm) - TypeScript implementation
- semantic-dom-ssg (crates.io) - Rust implementation
License
MIT License - see LICENSE for details.
Author
George Alexander info@gorgalxandr.com