Logical segmentation of web pages using visual and structural DOM heuristics.
Logical Segmentation Algorithm
Goal
Cover all user-visible content on a page with non-overlapping logical segments — header, nav, sidebar, main content, footer, card grid, breadcrumb, etc. — each identified by a unique CSS selector and bounding box. Every visible region belongs to exactly one segment.
Core Principle
The page already has logical segments. The algorithm is a structural parsing problem, not a content-finding problem. It reads developer intent from the DOM rather than inferring it from content density.
A node is either a segment leaf (owns its subtree) or a layout container (transparent — descend into its children). A node never appears in both roles.
Dynamic Threshold Configuration
The segmenter adapts its pruning and structural heuristics based on an optionally provided page_type string (e.g., product_list, doc_page, homepage). Page types are mapped to logical "families" which adjust the baseline configuration:
- Baseline/Default: MIN_WIDTH=80, MIN_HEIGHT=20, MIN_SUBTREE_DEPTH=3, MIN_SUBTREE_NODES=10, COMPONENT_SCORE_THRESHOLD=4
- Commerce (Product Lists): Relaxes node/depth checks (MIN_SUBTREE_DEPTH=2, MIN_SUBTREE_NODES=5) to ensure small product cards aren't skipped.
- Content/Docs (Blogs, FAQ): Lowers COMPONENT_SCORE_THRESHOLD to 3, as text-heavy pages often lack hard visual boundaries (borders/shadows) but still have distinct sections. Relaxes node/depth checks similarly.
- Marketing (Homepages): Increases MIN_SUBTREE_NODES to 15 to prevent cohesive, visually rich hero sections from splintering into tiny fragments.
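The family overrides above can be sketched as a small lookup. This is an illustrative sketch, not the package's actual configuration code: the names Thresholds, PAGE_TYPE_FAMILY, FAMILY_OVERRIDES and thresholds_for are hypothetical, the page_type-to-family mapping only covers the three examples named in the text, and the Content/Docs depth and node values are assumed to match Commerce because the text only says they are relaxed "similarly".

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    # Baseline/default values as listed in the text.
    min_width: int = 80
    min_height: int = 20
    min_subtree_depth: int = 3
    min_subtree_nodes: int = 10
    component_score_threshold: int = 4


BASELINE = Thresholds()

# Hypothetical mapping; only these three page types appear in the text.
PAGE_TYPE_FAMILY = {
    "product_list": "commerce",
    "doc_page": "content",
    "homepage": "marketing",
}

FAMILY_OVERRIDES = {
    "commerce": {"min_subtree_depth": 2, "min_subtree_nodes": 5},
    # Assumption: "relaxes similarly" means the same values as commerce.
    "content": {
        "component_score_threshold": 3,
        "min_subtree_depth": 2,
        "min_subtree_nodes": 5,
    },
    "marketing": {"min_subtree_nodes": 15},
}


def thresholds_for(page_type=None):
    """Return baseline thresholds adjusted for the page type's family."""
    family = PAGE_TYPE_FAMILY.get(page_type or "")
    return replace(BASELINE, **FAMILY_OVERRIDES.get(family, {}))
```

An unknown or missing page_type falls back to the baseline, which matches the description of page_type as optional.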
Phase 1 — Pruning
Remove nodes that are never user-visible before any analysis:
- Non-visible tags: script, style, noscript, meta, svg, iframe, template
- Hidden elements: display:none, visibility:hidden, opacity:0, zero bounding box
- Elements smaller than MIN_WIDTH × MIN_HEIGHT (true UI atoms such as tiny badges and invisible spacers)
- Transient noise overlays: modal, popup, overlay, toast, tooltip, dropdown, cookie, consent
What is NOT pruned (changed from earlier design):
- nav, header, footer at any depth: these are user-visible page zones
- Elements with ARIA roles navigation, banner, contentinfo, complementary, main, search: these are all user-visible landmark zones. ARIA landmark roles are treated the same as semantic HTML5 tags.
- <nav> tag at top-level or any depth: previously blanket-pruned, now treated as SEMANTIC_SEGMENT_TAGS (always a leaf segment)
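A minimal sketch of the prune predicate described in this phase. The Node class and the should_prune and is_hidden helpers are hypothetical stand-ins for whatever DOM snapshot the segmenter really uses. Three points are assumptions rather than documented behaviour: a "zero bounding box" is read as zero width or zero height, "smaller than MIN_WIDTH × MIN_HEIGHT" is read as either dimension falling below its minimum, and landmark protection is assumed to override the size and keyword checks but not the hidden check.

```python
from dataclasses import dataclass, field

PRUNE_TAGS = {"script", "style", "noscript", "meta", "svg", "iframe", "template"}
NOISE_KEYWORDS = {"modal", "popup", "overlay", "toast", "tooltip",
                  "dropdown", "cookie", "consent"}
PROTECTED_TAGS = {"nav", "header", "footer"}
PROTECTED_ROLES = {"navigation", "banner", "contentinfo",
                   "complementary", "main", "search"}


@dataclass
class Node:
    tag: str
    width: float = 0.0
    height: float = 0.0
    role: str = ""
    class_id: str = ""  # class and id attribute text, joined
    style: dict = field(default_factory=dict)  # computed style subset


def is_hidden(node):
    s = node.style
    return (
        s.get("display") == "none"
        or s.get("visibility") == "hidden"
        or float(s.get("opacity", 1)) == 0
        or node.width == 0
        or node.height == 0
    )


def should_prune(node, min_width=80, min_height=20):
    """True if the node should be removed before segmentation."""
    if node.tag in PRUNE_TAGS or is_hidden(node):
        return True
    # Landmarks are user-visible page zones and are kept (assumed to
    # take precedence over the size and noise-keyword checks).
    if node.tag in PROTECTED_TAGS or node.role in PROTECTED_ROLES:
        return False
    if node.width < min_width or node.height < min_height:
        return True
    label = node.class_id.lower()
    return any(keyword in label for keyword in NOISE_KEYWORDS)
```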
Phase 2 — Decision Logic (per node, recursive)
At each surviving node, run these checks in order. Stop at the first match.
2a. Leaf tags
pre, code, table, ul, ol, figure, blockquote, video, audio, canvas, picture — always a segment leaf, never descend. These are atomic content units.
2b. ARIA landmark roles
If the element has an ARIA landmark role (navigation, banner, contentinfo, complementary, main, search, form, region), it is always declared a segment leaf. ARIA landmarks are developer-encoded zone boundaries equivalent to semantic HTML5 tags.
2c. Semantic segment tags
section, article, aside, form, nav, header, footer — always a segment leaf. The developer used a semantic tag to explicitly mark a component boundary — trust it unconditionally.
main — trust only when not a full-width layout wrapper (width < 95% viewport) OR when a hard visual signal is present (border, shadow, radius, background isolation).
2d. Parent identity check
Score the node on visual signals. Requires at least one hard signal (background-isolation, border, box-shadow, border-radius) plus total score ≥ 4 to fire. If visible siblings with content exist, fall through instead — the parent owns this node and its siblings together.
Scoring:
| Signal | Points |
|---|---|
| Background differs from parent (non-transparent) | +1 |
| Has border | +2 |
| Has box-shadow | +2 |
| Has border-radius > 0 | +1 |
| Padding ≥ 16px on any side | +1 |
| Spatial gap ≥ 16px above (from previous sibling) | +1 |
| Compositional completeness (≥2 of: text, media, interactive) | +2 |
Guards that override scoring:
- Full-width (≥95% viewport) with no hard isolation → score zeroed
- Height < MIN_HEIGHT → score capped at 2 (true UI atoms only; nav bars are typically 35-60px and are not capped)
- Child diversity > 3 (more than 3 independently meaningful or semantic children) → background-isolation is demoted and removed from score. A canvas wrapper has many diverse children; a real component does not.
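The scoring table and its guards can be read as the following sketch. The function name identity_score and its parameters are hypothetical, and the signal names follow the identitySignals strings used in the output format. Two readings are assumptions: "no hard isolation" is taken to mean that no hard signal remains after the child-diversity demotion, and the sibling fall-through rule from 2d is left out for brevity.

```python
HARD_SIGNALS = {"background-isolation", "border", "box-shadow", "border-radius"}

POINTS = {
    "background-isolation": 1,
    "border": 2,
    "box-shadow": 2,
    "border-radius": 1,
    "padding": 1,
    "spatial-gap": 1,
    "compositional-completeness": 2,
}


def identity_score(signals, *, width_ratio, height, child_diversity,
                   min_height=20, threshold=4):
    """Return (score, fires) for a node's detected visual signals.

    width_ratio is node width divided by viewport width.
    """
    active = set(signals)
    # Guard: many diverse children means background is a canvas reset.
    if child_diversity > 3:
        active.discard("background-isolation")
    hard = active & HARD_SIGNALS
    # Guard: full-width node with no hard signal scores zero.
    if width_ratio >= 0.95 and not hard:
        return 0, False
    score = sum(POINTS[name] for name in active)
    # Guard: true UI atoms are capped at 2.
    if height < min_height:
        score = min(score, 2)
    return score, bool(hard) and score >= threshold
```

For example, a bordered card containing both text and an image scores 2 + 2 = 4 and fires, while a plain wrapper with padding, a spatial gap and mixed content also reaches 4 but does not fire because it has no hard signal.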
2e. Raw text node check
If the node has direct non-whitespace text node children, it is a content-mixed node that cannot be safely split. Declare as leaf immediately.
2f. Structural similarity check
If ≥75% of direct children share the same deep fingerprint (tag + children tags + grandchildren tags), and there are ≥3 children, this is a repeating pattern (card grid, product list). The parent owns the repetition — declare as leaf.
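One way to picture the deep fingerprint is as a nested string of tags two levels below each child. The sketch below is an illustration under assumptions: the El class, fingerprint and is_repeating_pattern are hypothetical names, and the exact fingerprint encoding used by the segmenter is not specified in the text.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class El:
    tag: str
    children: list = field(default_factory=list)


def fingerprint(el, levels=2):
    """Tag plus children tags plus grandchildren tags, as a string."""
    if levels == 0 or not el.children:
        return el.tag
    inner = ",".join(fingerprint(child, levels - 1) for child in el.children)
    return f"{el.tag}[{inner}]"


def is_repeating_pattern(parent, min_children=3, min_share=0.75):
    """True if enough direct children share the most common fingerprint."""
    kids = parent.children
    if len(kids) < min_children:
        return False
    most_common_count = Counter(fingerprint(k) for k in kids).most_common(1)[0][1]
    return most_common_count / len(kids) >= min_share


def make_card():
    return El("div", [El("img"), El("div", [El("h3"), El("p")])])


# Three identical cards plus one odd sibling: 3/4 = 75%, so this is a grid.
grid = El("div", [make_card(), make_card(), make_card(), El("aside")])
print(fingerprint(make_card()))    # div[img,div[h3,p]]
print(is_repeating_pattern(grid))  # True
```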
2g. Meaningful children
A child is meaningful if:
- It is a semantic tag (SEMANTIC_SEGMENT_TAGS) OR
- It is a leaf tag (LEAF_TAGS) OR
- It has subtree depth ≥ MIN_SUBTREE_DEPTH AND node count ≥ MIN_SUBTREE_NODES
If no meaningful children exist, the node is a composed unit (heading + shallow elements) — declare as leaf.
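The meaningful-child test can be sketched as below. The El class and helper names are hypothetical, and the tag sets are copied from sections 2a and 2c. How depth is counted is an assumption: here a node with no children has depth 1, and the node count includes the node itself.

```python
from dataclasses import dataclass, field

SEMANTIC_SEGMENT_TAGS = {"section", "article", "aside", "form",
                         "nav", "header", "footer"}
LEAF_TAGS = {"pre", "code", "table", "ul", "ol", "figure", "blockquote",
             "video", "audio", "canvas", "picture"}


@dataclass
class El:
    tag: str
    children: list = field(default_factory=list)


def subtree_depth(el):
    # Assumption: a childless node has depth 1.
    return 1 + max((subtree_depth(c) for c in el.children), default=0)


def subtree_nodes(el):
    # Counts the node itself plus all descendants.
    return 1 + sum(subtree_nodes(c) for c in el.children)


def is_meaningful(el, min_depth=3, min_nodes=10):
    if el.tag in SEMANTIC_SEGMENT_TAGS or el.tag in LEAF_TAGS:
        return True
    return subtree_depth(el) >= min_depth and subtree_nodes(el) >= min_nodes


def has_meaningful_children(el, min_depth=3, min_nodes=10):
    """False means the node is a composed unit and becomes a leaf."""
    return any(is_meaningful(c, min_depth, min_nodes) for c in el.children)
```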
2h. Coupled-sibling check
If exactly 2–3 meaningful children exist and none individually score ≥ 4 on identity, they are functionally coupled (image column + text column, slider + thumbnails). The parent is the component — declare as leaf. Single-child nodes always descend.
2i. Orphan check
Before descending, verify no non-meaningful sibling would be left stranded. A non-meaningful sibling is skipped (not treated as an orphan) when any of the following are true:
- It is hidden or < 10×10px
- Its tag is in PRUNE_TAGS (script, style, etc.)
- Its tag is in LEAF_TAGS (ul, ol, table, etc.): independently processable as a leaf segment
- Its tag is in SEMANTIC_SEGMENT_TAGS: will be processed independently
- Its height is < 15% of the largest meaningful child's height: it is a section title, label, or heading that naturally belongs to the surrounding content and will be silently absorbed
A non-meaningful sibling triggers an orphan stop only when:
- It is visible (≥10×10px)
- Its height is ≥ 15% of the tallest meaningful child (substantial enough to be genuinely stranded)
- It contains visible content (text nodes, img[src], visible headings/links)
If an orphan is detected, the parent declares itself the leaf — no descent.
Coupling ratio rationale: The 15% threshold distinguishes two cases:
- A 76px heading next to a 640px product card (ratio = 12%) → heading, absorb silently
- A 200px promo block next to a 640px card (ratio = 31%) → potential orphan, check content
Why LEAF_TAGS are excluded from orphan detection: A <ul class="breadcrumb"> is always an independently processable leaf segment. Treating it as an orphan would cause the entire page wrapper to become one giant segment.
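The skip rules and the 15% ratio can be combined into one predicate, sketched below. Sibling and is_orphan are hypothetical names, and "< 10×10px" is assumed to mean that either dimension is under 10px. The two worked examples from the rationale are used as checks.

```python
from dataclasses import dataclass

PRUNE_TAGS = {"script", "style", "noscript", "meta", "svg", "iframe", "template"}
LEAF_TAGS = {"pre", "code", "table", "ul", "ol", "figure", "blockquote",
             "video", "audio", "canvas", "picture"}
SEMANTIC_SEGMENT_TAGS = {"section", "article", "aside", "form",
                         "nav", "header", "footer"}


@dataclass
class Sibling:
    tag: str
    width: float
    height: float
    hidden: bool = False
    has_visible_content: bool = True  # text, img[src], headings/links


def is_orphan(sib, tallest_meaningful_height, ratio=0.15):
    """True if a non-meaningful sibling would be stranded by descent."""
    if sib.hidden or sib.width < 10 or sib.height < 10:
        return False
    # Pruned, leaf and semantic tags are handled elsewhere in the pipeline.
    if sib.tag in PRUNE_TAGS | LEAF_TAGS | SEMANTIC_SEGMENT_TAGS:
        return False
    # Small siblings are headings or labels and are absorbed silently.
    if sib.height < ratio * tallest_meaningful_height:
        return False
    return sib.has_visible_content


print(is_orphan(Sibling("div", 600, 76), 640))   # False: 76/640 is about 12%
print(is_orphan(Sibling("div", 600, 200), 640))  # True: 200/640 is about 31%
```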
2j. Container descent
Recurse into each meaningful child, each semantic-tag child, and each leaf-tag child. Non-meaningful, non-semantic, non-leaf siblings are silently absorbed into the parent's segment. If descent yields no results, fall back to declaring the current node a leaf.
Role Inference
Role is determined in this priority order:
- ARIA role: role="navigation" → nav, role="banner" → header, role="contentinfo" → footer, role="complementary" / role="search" → sidebar, role="main" → main
- Semantic HTML tag: <nav> → nav, <header> → header, <footer> → footer, <aside> → sidebar, <main> → main, <article> → article, <form> → form
- Class/id vocabulary scan: keywords like hero, sidebar, card, grid, cta, pricing, faq matched against class + id names
- Fallback: section
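The priority order can be sketched as a chain of lookups. The function infer_role and its parameters are hypothetical. The keyword tuple only contains the examples listed above, and both the substring matching and the order in which keywords are tried are assumptions, since the text does not say how ties between keywords are resolved.

```python
ARIA_ROLE_MAP = {
    "navigation": "nav",
    "banner": "header",
    "contentinfo": "footer",
    "complementary": "sidebar",
    "search": "sidebar",
    "main": "main",
}

TAG_ROLE_MAP = {
    "nav": "nav",
    "header": "header",
    "footer": "footer",
    "aside": "sidebar",
    "main": "main",
    "article": "article",
    "form": "form",
}

# Only the example keywords from the text; the real vocabulary may be larger.
VOCAB = ("hero", "sidebar", "card", "grid", "cta", "pricing", "faq")


def infer_role(tag, aria_role="", class_names=(), element_id=""):
    """Resolve a segment role: ARIA role, then tag, then class/id keywords."""
    if aria_role in ARIA_ROLE_MAP:
        return ARIA_ROLE_MAP[aria_role]
    if tag in TAG_ROLE_MAP:
        return TAG_ROLE_MAP[tag]
    haystack = " ".join([*class_names, element_id]).lower()
    for keyword in VOCAB:
        if keyword in haystack:  # assumption: substring match
            return keyword
    return "section"
```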
Selector Generation
Each segment gets a unique CSS selector built by walking up the DOM to the nearest #id anchor, then constructing a path downward from that anchor using the > child combinator. Each step prefers #id → tag.class1.class2 (verified unique within parent scope) → tag:nth-of-type(n). The final selector is verified unique against the full document. This grounds selectors to meaningful anchors rather than document-relative positional paths.
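The per-step preference can be illustrated on a toy tree. This is a sketch under assumptions, not the package's selector builder: El, step and build_selector are hypothetical names, the final document-wide uniqueness check is omitted because it needs a real selector engine, and ids or class names that would require CSS escaping are not handled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent
class El:
    tag: str
    id: str = ""
    classes: tuple = ()
    parent: Optional["El"] = None
    children: list = field(default_factory=list)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def step(el):
    """One path step: #id, else unique tag.classes, else nth-of-type."""
    if el.id:
        return f"#{el.id}"
    siblings = el.parent.children if el.parent else [el]
    if el.classes:
        # Siblings that the tag.class1.class2 selector would also match.
        matches = [s for s in siblings
                   if s.tag == el.tag and set(el.classes) <= set(s.classes)]
        if len(matches) == 1:
            return el.tag + "".join(f".{c}" for c in el.classes)
    same_tag = [s for s in siblings if s.tag == el.tag]
    index = next(i for i, s in enumerate(same_tag, 1) if s is el)
    return f"{el.tag}:nth-of-type({index})"


def build_selector(el):
    """Walk up to the nearest ancestor with an id, then join with ' > '."""
    parts = []
    node = el
    while node is not None:
        parts.append(step(node))
        if node.id:
            break
        node = node.parent
    return " > ".join(reversed(parts))


body = El("body")
detail = body.add(El("section", id="product_detail"))
container = detail.add(El("div", classes=("container",)))
grid = container.add(El("div", classes=("grid",)))
print(build_selector(grid))  # #product_detail > div.container > div.grid
```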
Output
Each segment is a flat dict:
{
"selector": "#product_detail > div.container > div.grid",
"role": "grid",
"depth": 4,
"boundingBox": { "x": 50, "y": 300, "width": 1300, "height": 2599 },
"identityScore": 4,
"identitySignals": ["border", "compositional-completeness"],
"children": []
}
children is always empty in the current flat output model. The tree structure is implicit in depth and selector ancestry.
Page Load Strategy
For SSR and SPA frameworks (Next.js, React, Vue, etc.) that never reach a true network-idle state (due to keep-alive polling, websockets, or continuous background fetches), a two-phase wait is used:
- Wait for domcontentloaded (reliable, fires after initial HTML parse)
- Optionally wait up to 5 seconds for networkidle; if it times out, proceed with existing DOM
This prevents the 30-second timeout that wait_until="networkidle" causes on React/Next.js apps.
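A minimal sketch of the two-phase wait, assuming the page is driven through Playwright's synchronous Python API (the wait_until="networkidle" spelling suggests this, but the text does not name the browser driver). The function load_page is a hypothetical name, and the import fallback exists only so the sketch can be read and exercised without Playwright installed.

```python
try:
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
except ImportError:
    # Fallback so the sketch can be imported without Playwright.
    PlaywrightTimeoutError = TimeoutError


def load_page(page, url, idle_timeout_ms=5000):
    """Navigate, then wait briefly for network idle.

    Returns True if the network went idle within the timeout, and False
    if the wait timed out and the caller should use the DOM as it stands.
    """
    # Phase 1: fires once the initial HTML has been parsed.
    page.goto(url, wait_until="domcontentloaded")
    # Phase 2: best-effort wait; apps with polling or websockets may
    # never become idle, so a timeout is not treated as an error.
    try:
        page.wait_for_load_state("networkidle", timeout=idle_timeout_ms)
        return True
    except PlaywrightTimeoutError:
        return False
```

With a real browser, page would come from something like browser.new_page() inside a sync_playwright() context; the return value lets the caller log whether segmentation ran on a possibly still-loading DOM.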
Key Design Decisions
Structural parsing, not content scoring. The algorithm reads developer-encoded boundaries (semantic tags, visual containment, structural repetition) rather than scoring text density or link ratios. This recovers intentional component structure rather than finding "the most content-rich zone."
Coverage is the primary goal. Every user-visible region should belong to exactly one segment. Navigation bars, sidebars, breadcrumbs, and footers are all user-visible and are included. Only true technical noise (hidden elements, script/style tags, transient overlays) is pruned.
ARIA landmarks and semantic tags are authoritative. nav, section, article, aside, header, footer, form, and any element with an ARIA landmark role (navigation, banner, contentinfo, etc.) are declared segments unconditionally. Developer intent encoded in HTML/ARIA is more reliable than any heuristic.
Hard signal requirement for identity check. Soft signals (padding, spatial gap, mixed content types) fire on layout wrappers as readily as on real components. The identity check only fires when at least one hard visual signal (border, shadow, radius, background) is present. This prevents plain wrapper divs from being declared components.
Child diversity cancels background isolation. A node with more than 3 independently meaningful children is a layout container. Background isolation on such nodes is a canvas reset, not component identity — the signal is demoted.
LEAF_TAGS are never orphans. <ul>, <ol>, <table>, <figure>, etc. are always independently processable as leaf segments. They are excluded from orphan detection and included in the descent loop. Previously, a shallow <ul class="breadcrumb"> next to a large content div would trigger an orphan stop, collapsing the entire page into one segment.
Orphan check uses a small-sibling exclusion, not a large-sibling exception. A non-meaningful sibling with content only blocks descent when its height is ≥ 15% of the tallest meaningful child. Siblings smaller than 15% are section titles, headings, or label blocks that will be silently absorbed — they are not stranded. Previously the guard was inverted (only skipping siblings > 120%), causing small heading divs to trigger orphan stops and collapse entire product-listing pages into one segment.
MIN_HEIGHT is 30px, not 60px. Navigation bars, breadcrumbs, and tab bars are commonly 30–55px tall and are user-visible. The original 60px threshold was incorrectly filtering these out at the prune step. The identity-score atom guard (cap at 2) is also lowered to 30px so only true UI atoms are capped.