Parse and query SOM (Semantic Object Model) - the structured web format for AI agents
som-parser
Parse and query SOM (Semantic Object Model) output in Python. SOM is a structured JSON format that represents web pages as semantic regions and elements, designed for AI agents, browser automation, and web scraping. This library provides Pydantic v2 models for type-safe parsing and validation, plus a set of query utilities for extracting links, forms, headings, text, and interactive elements.
Install
pip install som-parser
Quick Start
Parse Plasmate output
import subprocess
from som_parser import parse_som, from_plasmate
# Parse a SOM JSON string or dict
som = parse_som('{"som_version": "0.1", ...}')
# Or parse raw Plasmate CLI output directly
result = subprocess.run(["plasmate", "https://example.com"], capture_output=True, text=True)
som = from_plasmate(result.stdout)
print(som.title)        # "Example Domain"
print(som.url)          # "https://example.com/"
print(som.som_version)  # "0.1"
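The subprocess call above assumes the `plasmate` binary is on the PATH and exits cleanly. Below is a minimal sketch of a more defensive wrapper that uses only the standard library. `run_cli` is a hypothetical helper name, not part of som-parser, and the 30-second timeout is an arbitrary choice.

```python
import subprocess


def run_cli(cmd: list[str], timeout: float = 30.0) -> str:
    """Run a command and return its stdout, raising RuntimeError on failure.

    Hypothetical helper for illustration; not part of som-parser.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    except FileNotFoundError as exc:
        raise RuntimeError(f"command not found: {cmd[0]}") from exc
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"{cmd[0]} timed out after {timeout}s") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"{cmd[0]} exited with {exc.returncode}: {exc.stderr.strip()}"
        ) from exc
    return result.stdout
```

With this in place, the returned string could be handed to `from_plasmate` in the same way as `result.stdout` above, for example `from_plasmate(run_cli(["plasmate", "https://example.com"]))`.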
Find links
from som_parser import parse_som, get_links, find_by_role
som = parse_som(data)
# Get all links as simple dicts
for link in get_links(som):
    print(f"{link['text']} -> {link['href']}")
# Or find by role for full SomElement objects
for el in find_by_role(som, "link"):
    print(el.id, el.text, el.attrs.href)
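Whether `href` values arrive absolute or relative is not stated here. Assuming `get_links` returns dicts with `text`, `href`, and `id` keys as listed in the API reference, resolving them against the page URL with the standard library might look like the sketch below. `absolutize_links` is a hypothetical helper, not part of som-parser, and the sample list is hand-written stand-in data.

```python
from urllib.parse import urljoin


def absolutize_links(links: list[dict], base_url: str) -> list[dict]:
    """Return copies of the link dicts with href resolved against base_url."""
    resolved = []
    for link in links:
        href = link.get("href")
        if not href:
            continue  # skip links without a target
        resolved.append({**link, "href": urljoin(base_url, href)})
    return resolved


# Hand-written sample data standing in for get_links(som) output.
sample = [
    {"text": "Docs", "href": "/docs", "id": "e1"},
    {"text": "More", "href": "https://www.iana.org/domains/example", "id": "e2"},
]
for link in absolutize_links(sample, "https://example.com/"):
    print(f"{link['text']} -> {link['href']}")
```

In real use, `som.url` would be the natural choice for `base_url`.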
Get interactive elements
from som_parser import parse_som, get_interactive_elements
som = parse_som(data)
for el in get_interactive_elements(som):
    print(f"{el.id}: {el.role.value} - actions: {[a.value for a in el.actions]}")
Convert to markdown
from som_parser import parse_som, to_markdown
som = parse_som(data)
print(to_markdown(som))
Use Pydantic models directly
from som_parser import Som, SomElement, ElementRole
# Validate and construct from a dict
som = Som.model_validate(my_dict)
# Access typed fields
for region in som.regions:
    for element in region.elements:
        if element.role == ElementRole.LINK:
            print(element.attrs.href)
# Serialize back to JSON
print(som.model_dump_json(indent=2))
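The nested loop above is the pattern that helpers such as `get_all_elements` and `filter_elements` presumably wrap. Here is a rough sketch of that idea on plain dicts. It is not the library's actual implementation: `iter_elements` and `filter_plain` are hypothetical names, the sample document is hand-written, and the `"button"` role is an assumed value.

```python
from typing import Callable, Iterator


def iter_elements(doc: dict) -> Iterator[dict]:
    """Yield every element from every region, in document order."""
    for region in doc.get("regions", []):
        yield from region.get("elements", [])


def filter_plain(doc: dict, predicate: Callable[[dict], bool]) -> list[dict]:
    """Keep only the elements for which predicate returns True."""
    return [el for el in iter_elements(doc) if predicate(el)]


# Hand-written sample; a real SOM document carries more fields than this.
doc = {
    "regions": [
        {"elements": [{"id": "e1", "role": "link", "text": "Home"}]},
        {"elements": [
            {"id": "e2", "role": "button", "text": "Submit"},
            {"id": "e3", "role": "link", "text": "About"},
        ]},
    ]
}
links = filter_plain(doc, lambda el: el["role"] == "link")
print([el["id"] for el in links])  # ['e1', 'e3']
```

Keeping the flattening step as a generator means any predicate-based query can reuse it without first building an intermediate list.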
API Reference
Parser
| Function | Description |
|---|---|
| `parse_som(input: str \| dict) -> Som` | Parse JSON string or dict into a validated Som object |
| `is_valid_som(input) -> bool` | Check if input conforms to the SOM schema |
| `from_plasmate(json_output: str) -> Som` | Parse raw Plasmate CLI JSON output |
Query Utilities
| Function | Description |
|---|---|
| `get_all_elements(som) -> list[SomElement]` | Flatten all elements from all regions |
| `find_by_role(som, role) -> list[SomElement]` | Find elements by role (enum or string) |
| `find_by_id(som, id) -> SomElement \| None` | Find a single element by its SOM id |
| `find_by_text(som, text, exact=False) -> list[SomElement]` | Search elements by text content |
| `get_interactive_elements(som) -> list[SomElement]` | Get elements that have actions |
| `get_links(som) -> list[dict]` | Extract all links as {text, href, id} dicts |
| `get_forms(som) -> list[SomRegion]` | Get all form regions |
| `get_inputs(som) -> list[SomElement]` | Get all input elements |
| `get_headings(som) -> list[dict]` | Extract heading hierarchy as {level, text, id} |
| `get_text(som) -> str` | Extract all visible text content |
| `get_text_by_region(som) -> list[dict]` | Extract text grouped by region |
| `get_compression_ratio(som) -> float` | Return html_bytes / som_bytes |
| `to_markdown(som) -> str` | Convert SOM to readable markdown |
| `filter_elements(som, predicate) -> list[SomElement]` | Generic filter with a callable |
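As an illustration of working with the `{level, text, id}` dicts that `get_headings` is described as returning, the sketch below renders them as an indented outline. It assumes `level` is an integer starting at 1, which this README does not confirm. `render_outline` is a hypothetical helper and the sample data is hand-written.

```python
def render_outline(headings: list[dict], indent: str = "  ") -> str:
    """Render heading dicts as an indented outline, one line per heading."""
    lines = []
    for h in headings:
        depth = max(h["level"] - 1, 0)  # level 1 gets no indentation
        lines.append(f"{indent * depth}- {h['text']}")
    return "\n".join(lines)


# Hand-written sample standing in for get_headings(som) output.
sample = [
    {"level": 1, "text": "som-parser", "id": "e1"},
    {"level": 2, "text": "Install", "id": "e2"},
    {"level": 2, "text": "Quick Start", "id": "e3"},
    {"level": 3, "text": "Find links", "id": "e4"},
]
print(render_outline(sample))
```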
Types
All Pydantic v2 models are exported from the top level:
- Som, SomRegion, SomElement, SomElementAttrs, SomMeta
- StructuredData, LinkElement, SelectOption, ListItem
- RegionRole, ElementRole, ElementAction, SemanticHint (enums)
License
Apache-2.0