Logical segmentation of web pages using visual and structural DOM heuristics.

Logical Segmentation Algorithm

Goal

Cover all user-visible content on a page with non-overlapping logical segments — header, nav, sidebar, main content, footer, card grid, breadcrumb, etc. — each identified by a unique CSS selector and bounding box. Every visible region belongs to exactly one segment.


Core Principle

The page already has logical segments. Segmentation is therefore a structural parsing problem, not a content-finding problem: the algorithm reads developer intent from the DOM rather than inferring it from content density.

A node is either a segment leaf (owns its subtree) or a layout container (transparent — descend into its children). A node never appears in both roles.


Dynamic Threshold Configuration

The segmenter adapts its pruning and structural heuristics based on an optionally provided page_type string (e.g., product_list, doc_page, homepage). Page types are mapped to logical "families" which adjust the baseline configuration:

  • Baseline/Default: MIN_WIDTH=80, MIN_HEIGHT=20, MIN_SUBTREE_DEPTH=3, MIN_SUBTREE_NODES=10, COMPONENT_SCORE_THRESHOLD=4
  • Commerce (Product Lists): Relaxes the node and depth checks (MIN_SUBTREE_DEPTH=2, MIN_SUBTREE_NODES=5) so that small product cards aren't skipped.
  • Content/Docs (Blogs, FAQ): Lowers COMPONENT_SCORE_THRESHOLD to 3, because text-heavy pages often lack hard visual boundaries (borders/shadows) but still have distinct sections. Also relaxes the node and depth checks.
  • Marketing (Homepages): Raises MIN_SUBTREE_NODES to 15 so that cohesive, visually rich hero sections don't splinter into tiny fragments.
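
The family overrides above could be modelled as a small immutable config object. This is only a sketch: the names Thresholds, PAGE_TYPE_FAMILY, FAMILY_OVERRIDES and thresholds_for are hypothetical, the page_type-to-family mapping covers just the three example types mentioned above, and the Content/Docs node and depth values are assumed to match Commerce because the text does not state them.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    # Baseline values as listed in the text.
    min_width: int = 80
    min_height: int = 20
    min_subtree_depth: int = 3
    min_subtree_nodes: int = 10
    component_score_threshold: int = 4


BASELINE = Thresholds()

# Hypothetical mapping; only the example page types from the text are covered.
PAGE_TYPE_FAMILY = {
    "product_list": "commerce",
    "doc_page": "content",
    "homepage": "marketing",
}

FAMILY_OVERRIDES = {
    "commerce": {"min_subtree_depth": 2, "min_subtree_nodes": 5},
    # Assumption: the relaxed node/depth values are the same as for commerce.
    "content": {
        "component_score_threshold": 3,
        "min_subtree_depth": 2,
        "min_subtree_nodes": 5,
    },
    "marketing": {"min_subtree_nodes": 15},
}


def thresholds_for(page_type=None):
    """Return the baseline, adjusted for the family of page_type (if known)."""
    family = PAGE_TYPE_FAMILY.get(page_type)
    return replace(BASELINE, **FAMILY_OVERRIDES.get(family, {}))
```

An unknown or missing page_type falls back to the baseline, which keeps the page_type argument optional as described.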

Phase 1 — Pruning

Remove nodes that are never user-visible before any analysis:

  • Non-visible tags: script, style, noscript, meta, svg, iframe, template
  • Hidden elements: display:none, visibility:hidden, opacity:0, zero bounding box
  • Elements smaller than MIN_WIDTH×MIN_HEIGHT (true UI atoms — tiny badges, invisible spacers)
  • Transient noise overlays: modal, popup, overlay, toast, tooltip, dropdown, cookie, consent

What is NOT pruned (changed from earlier design):

  • nav, header, footer at any depth — these are user-visible page zones
  • Elements with ARIA roles navigation, banner, contentinfo, complementary, main, search — these are all user-visible landmark zones. ARIA landmark roles are treated the same as semantic HTML5 tags.
  • <nav> at any depth: previously blanket-pruned, it is now a member of SEMANTIC_SEGMENT_TAGS and is always a leaf segment
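
A minimal sketch of the pruning predicate, under several assumptions: nodes are plain dicts with hypothetical keys (tag, style, width, height, role, class, id), "smaller than MIN_WIDTH×MIN_HEIGHT" is read as either dimension falling below its minimum, noise overlays are detected by substring match on class and id, and hidden landmarks are still pruned while visible ones are exempt from the size and noise rules.

```python
PRUNE_TAGS = {"script", "style", "noscript", "meta", "svg", "iframe", "template"}
NOISE_KEYWORDS = ("modal", "popup", "overlay", "toast", "tooltip",
                  "dropdown", "cookie", "consent")
KEEP_TAGS = {"nav", "header", "footer"}
LANDMARK_ROLES = {"navigation", "banner", "contentinfo",
                  "complementary", "main", "search"}


def should_prune(node, min_width=80, min_height=20):
    """Return True if the node should be removed before analysis."""
    tag = node["tag"]
    if tag in PRUNE_TAGS:
        return True

    style = node.get("style", {})
    if (style.get("display") == "none"
            or style.get("visibility") == "hidden"
            or float(style.get("opacity", 1)) == 0):
        return True

    width, height = node.get("width", 0), node.get("height", 0)
    if width == 0 or height == 0:  # zero bounding box
        return True

    # Visible landmark zones are kept regardless of size or class names.
    if tag in KEEP_TAGS or node.get("role") in LANDMARK_ROLES:
        return False

    if width < min_width or height < min_height:  # UI atoms
        return True

    names = (node.get("class", "") + " " + node.get("id", "")).lower()
    return any(keyword in names for keyword in NOISE_KEYWORDS)
```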

Phase 2 — Decision Logic (per node, recursive)

At each surviving node, run these checks in order. Stop at the first match.

2a. Leaf tags

pre, code, table, ul, ol, figure, blockquote, video, audio, canvas, picture — always a segment leaf, never descend. These are atomic content units.

2b. ARIA landmark roles

If the element has an ARIA landmark role (navigation, banner, contentinfo, complementary, main, search, form, region), it is always declared a segment leaf. ARIA landmarks are developer-encoded zone boundaries equivalent to semantic HTML5 tags.

2c. Semantic segment tags

section, article, aside, form, nav, header, footer — always a segment leaf. The developer used a semantic tag to explicitly mark a component boundary — trust it unconditionally.

main — trust only when not a full-width layout wrapper (width < 95% viewport) OR when a hard visual signal is present (border, shadow, radius, background isolation).

2d. Parent identity check

Score the node on visual signals. The check fires only when at least one hard signal (background-isolation, border, box-shadow, border-radius) is present and the total score is ≥ COMPONENT_SCORE_THRESHOLD (4 by default). If visible siblings with content exist, fall through instead: the parent owns this node and its siblings together.

Scoring:

| Signal | Points |
|---|---|
| Background differs from parent (non-transparent) | +1 |
| Has border | +2 |
| Has box-shadow | +2 |
| Has border-radius > 0 | +1 |
| Padding ≥ 16px on any side | +1 |
| Spatial gap ≥ 16px above (from previous sibling) | +1 |
| Compositional completeness (≥2 of: text, media, interactive) | +2 |

Guards that override scoring:

  • Full-width (≥95% viewport) with no hard isolation → score zeroed
  • Height < MIN_HEIGHT → score capped at 2 (true UI atoms only — nav bars are typically 35-60px and are not capped)
  • Child diversity > 3 (more than 3 independently meaningful or semantic children) → background-isolation is demoted and removed from score. A canvas wrapper has many diverse children; a real component does not.
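
The scoring table and guards can be sketched as follows. The signal names "border" and "compositional-completeness" appear in the output example later in this document; the other names, the function names, and the argument layout are assumptions made for illustration, and "no hard isolation" is read as "no hard signal present".

```python
HARD_SIGNALS = {"background-isolation", "border", "box-shadow", "border-radius"}
POINTS = {
    "background-isolation": 1,
    "border": 2,
    "box-shadow": 2,
    "border-radius": 1,
    "padding": 1,
    "spatial-gap": 1,
    "compositional-completeness": 2,
}


def identity_score(signals, width, height, viewport_width,
                   child_diversity=0, min_height=20):
    """Return (score, has_hard_signal) after applying the three guards."""
    active = set(signals)
    # Guard: many diverse children means the background is a canvas reset.
    if child_diversity > 3:
        active.discard("background-isolation")
    has_hard = bool(active & HARD_SIGNALS)
    # Guard: full-width node without any hard signal scores zero.
    if width >= 0.95 * viewport_width and not has_hard:
        score = 0
    else:
        score = sum(POINTS[name] for name in active)
    # Guard: very short nodes (UI atoms) are capped at 2.
    if height < min_height:
        score = min(score, 2)
    return score, has_hard


def passes_identity_check(signals, width, height, viewport_width,
                          child_diversity=0, min_height=20, threshold=4):
    score, has_hard = identity_score(signals, width, height, viewport_width,
                                     child_diversity, min_height)
    return has_hard and score >= threshold
```

For example, a 1300px-wide node in a 1440px viewport with a border and mixed content scores 2 + 2 = 4 and passes, while a node with only padding, a spatial gap, and mixed content also reaches 4 but fails for lack of a hard signal.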

2e. Raw text node check

If the node has direct non-whitespace text node children, it is a content-mixed node that cannot be safely split. Declare as leaf immediately.

2f. Structural similarity check

If ≥75% of direct children share the same deep fingerprint (tag + children tags + grandchildren tags), and there are ≥3 children, this is a repeating pattern (card grid, product list). The parent owns the repetition — declare as leaf.
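
One way to compute the deep fingerprint and the 75% share is sketched below. Nodes are assumed to be dicts with tag and children keys, and the helper names are hypothetical; the real fingerprint may include more detail than tag names.

```python
from collections import Counter


def fingerprint(node):
    """Tag + children tags + grandchildren tags, as a hashable tuple."""
    children = node.get("children", [])
    child_tags = tuple(child["tag"] for child in children)
    grandchild_tags = tuple(
        tuple(g["tag"] for g in child.get("children", []))
        for child in children
    )
    return (node["tag"], child_tags, grandchild_tags)


def is_repeating_pattern(node, min_children=3, min_share=0.75):
    """True if enough direct children share one fingerprint."""
    children = node.get("children", [])
    if len(children) < min_children:
        return False
    counts = Counter(fingerprint(child) for child in children)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(children) >= min_share


def el(tag, *children):
    """Tiny helper for building test trees."""
    return {"tag": tag, "children": list(children)}
```

With four children of which three are identical cards, the share is exactly 75% and the parent is treated as a repeating pattern.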

2g. Meaningful children

A child is meaningful if:

  • It is a semantic tag (SEMANTIC_SEGMENT_TAGS) OR
  • It is a leaf tag (LEAF_TAGS) OR
  • It has subtree depth ≥ MIN_SUBTREE_DEPTH AND node count ≥ MIN_SUBTREE_NODES

If no meaningful children exist, the node is a composed unit (heading + shallow elements) — declare as leaf.
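
The meaningful-child test can be sketched like this, assuming dict nodes and assuming that subtree depth counts the node itself as level 1 (the text does not define the depth convention). The defaults are the baseline thresholds.

```python
SEMANTIC_SEGMENT_TAGS = {"section", "article", "aside", "form",
                         "nav", "header", "footer"}
LEAF_TAGS = {"pre", "code", "table", "ul", "ol", "figure", "blockquote",
             "video", "audio", "canvas", "picture"}


def subtree_depth(node):
    """Number of levels in the subtree, counting the node itself as 1."""
    children = node.get("children", [])
    return 1 + max((subtree_depth(child) for child in children), default=0)


def subtree_nodes(node):
    """Total number of nodes in the subtree, including the node itself."""
    return 1 + sum(subtree_nodes(child) for child in node.get("children", []))


def is_meaningful(node, min_depth=3, min_nodes=10):
    if node["tag"] in SEMANTIC_SEGMENT_TAGS or node["tag"] in LEAF_TAGS:
        return True
    return subtree_depth(node) >= min_depth and subtree_nodes(node) >= min_nodes


def has_meaningful_children(node, min_depth=3, min_nodes=10):
    return any(is_meaningful(child, min_depth, min_nodes)
               for child in node.get("children", []))
```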

2h. Coupled-sibling check

If there are 2 or 3 meaningful children and none individually scores ≥ 4 on the identity check, they are functionally coupled (image column + text column, slider + thumbnails). The parent is the component, so declare it a leaf. Single-child nodes always descend.
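
As a small illustration, the coupled-sibling rule reduces to a check over the identity scores of the meaningful children. The function name and the idea of passing precomputed scores are assumptions.

```python
def is_coupled_group(child_scores, threshold=4):
    """child_scores: identity scores of the meaningful children only.

    True when there are 2 or 3 such children and none stands on its own.
    """
    return len(child_scores) in (2, 3) and all(
        score < threshold for score in child_scores
    )
```

An image column scoring 2 beside a text column scoring 3 is coupled, so the parent becomes the leaf; if either column scored 4 or more, or if there were only one meaningful child, the check would not fire.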

2i. Orphan check

Before descending, verify no non-meaningful sibling would be left stranded. A non-meaningful sibling is skipped (not treated as an orphan) when any of the following are true:

  • It is hidden or < 10×10px
  • Its tag is in PRUNE_TAGS (script, style, etc.)
  • Its tag is in LEAF_TAGS (ul, ol, table, etc.) — independently processable as a leaf segment
  • Its tag is in SEMANTIC_SEGMENT_TAGS — will be processed independently
  • Its height is < 15% of the largest meaningful child's height — it is a section title, label, or heading that naturally belongs to the surrounding content and will be silently absorbed

A non-meaningful sibling triggers an orphan stop only when:

  • It is visible (≥10×10px)
  • Its height is ≥ 15% of the tallest meaningful child (substantial enough to be genuinely stranded)
  • It contains visible content (text nodes, img[src], visible headings/links)

If an orphan is detected, the parent declares itself the leaf — no descent.

Coupling ratio rationale: The 15% threshold distinguishes two cases:

  • A 76px heading next to a 640px product card (ratio = 12%) → heading, absorb silently
  • A 200px promo block next to a 640px card (ratio = 31%) → potential orphan, check content

Why LEAF_TAGS are excluded from orphan detection: A <ul class="breadcrumb"> is always an independently processable leaf segment. Treating it as an orphan would cause the entire page wrapper to become one giant segment.
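
The skip rules and the 15% ratio can be sketched as a single predicate over one non-meaningful sibling. The dict keys (hidden, has_visible_content, and so on) are hypothetical, and "< 10×10px" is read here as either dimension being under 10px.

```python
PRUNE_TAGS = {"script", "style", "noscript", "meta", "svg", "iframe", "template"}
LEAF_TAGS = {"pre", "code", "table", "ul", "ol", "figure", "blockquote",
             "video", "audio", "canvas", "picture"}
SEMANTIC_SEGMENT_TAGS = {"section", "article", "aside", "form",
                         "nav", "header", "footer"}


def is_orphan(sibling, tallest_meaningful_height, ratio=0.15):
    """True if descending would strand this non-meaningful sibling."""
    if sibling.get("hidden") or sibling["width"] < 10 or sibling["height"] < 10:
        return False
    if sibling["tag"] in PRUNE_TAGS | LEAF_TAGS | SEMANTIC_SEGMENT_TAGS:
        return False  # pruned, or processed independently
    if sibling["height"] < ratio * tallest_meaningful_height:
        return False  # small heading or label: absorbed silently
    return bool(sibling.get("has_visible_content"))
```

With a 640px product card as the tallest meaningful child, the cut-off is 96px: a 76px heading is absorbed, while a 200px promo block with visible content stops the descent.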

2j. Container descent

Recurse into each meaningful child, each semantic-tag child, and each leaf-tag child. Non-meaningful, non-semantic, non-leaf siblings are silently absorbed into the parent's segment. If descent yields no results, fall back to declaring the current node a leaf.
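
The descent and its fallback can be sketched as a short recursive driver. Here is_leaf stands in for checks 2a through 2i combined, and descends_into stands in for "meaningful, semantic-tag, or leaf-tag child"; both are hypothetical callables supplied by the caller, and nodes are assumed to be dicts.

```python
def segment(node, is_leaf, descends_into):
    """Return the list of leaf-segment nodes found under node."""
    if is_leaf(node):
        return [node]
    results = []
    for child in node.get("children", []):
        # Children that do not qualify are absorbed by the parent.
        if descends_into(child):
            results.extend(segment(child, is_leaf, descends_into))
    # Fallback: if descent produced nothing, the node itself is the leaf.
    return results or [node]
```

A usage example with deliberately simple predicates:

```python
page = {"tag": "body", "children": [
    {"tag": "header"},
    {"tag": "div", "children": [{"tag": "ul"}, {"tag": "span"}]},
    {"tag": "footer"},
]}
leaves = segment(
    page,
    is_leaf=lambda n: n["tag"] in {"header", "footer", "ul"},
    descends_into=lambda n: n["tag"] != "span",
)
print([n["tag"] for n in leaves])
```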


Role Inference

Role is determined in this priority order:

  1. ARIA role: role="navigation" → nav, role="banner" → header, role="contentinfo" → footer, role="complementary" / role="search" → sidebar, role="main" → main
  2. Semantic HTML tag: <nav> → nav, <header> → header, <footer> → footer, <aside> → sidebar, <main> → main, <article> → article, <form> → form
  3. Class/id vocabulary scan: keywords like hero, sidebar, card, grid, cta, pricing, faq matched against class + id names
  4. Fallback: section
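
The priority order above maps naturally onto a chain of lookups. In this sketch the function name and arguments are hypothetical, the vocabulary is limited to the keywords listed above, the matched keyword is assumed to be used directly as the role, and substring matching in list order is an assumption about how the scan works.

```python
ARIA_ROLE_MAP = {
    "navigation": "nav",
    "banner": "header",
    "contentinfo": "footer",
    "complementary": "sidebar",
    "search": "sidebar",
    "main": "main",
}
TAG_ROLE_MAP = {
    "nav": "nav",
    "header": "header",
    "footer": "footer",
    "aside": "sidebar",
    "main": "main",
    "article": "article",
    "form": "form",
}
VOCABULARY = ("hero", "sidebar", "card", "grid", "cta", "pricing", "faq")


def infer_role(tag, aria_role=None, class_and_id=""):
    """Resolve a segment role: ARIA role, then tag, then class/id keywords."""
    if aria_role in ARIA_ROLE_MAP:
        return ARIA_ROLE_MAP[aria_role]
    if tag in TAG_ROLE_MAP:
        return TAG_ROLE_MAP[tag]
    names = class_and_id.lower()
    for keyword in VOCABULARY:
        if keyword in names:
            return keyword
    return "section"
```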

Selector Generation

Each segment gets a unique CSS selector built by walking up the DOM to the nearest #id anchor, then constructing a child-combinator (>) path downward from it. Each step prefers #id, then tag.class1.class2 (verified unique within parent scope), then tag:nth-of-type(n). The final selector is verified unique against the full document. This grounds selectors in meaningful anchors rather than document-relative positional paths.
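
A minimal sketch of the path construction, using a hypothetical in-memory El class in place of a real DOM. It covers the per-step preference and the walk up to the nearest id; the final uniqueness check against the full document is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class El:
    tag: str
    id: str = ""
    classes: tuple = ()
    parent: Optional["El"] = None
    children: list = field(default_factory=list)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def step(el):
    """One path step: #id, else unique tag.classes, else tag:nth-of-type(n)."""
    if el.id:
        return "#" + el.id
    siblings = el.parent.children if el.parent else [el]
    if el.classes:
        # Siblings that the tag.class selector would also match.
        matches = [s for s in siblings
                   if s.tag == el.tag and set(el.classes) <= set(s.classes)]
        if len(matches) == 1:
            return el.tag + "".join("." + c for c in el.classes)
    same_tag = [s for s in siblings if s.tag == el.tag]
    index = next(i for i, s in enumerate(same_tag, 1) if s is el)
    return f"{el.tag}:nth-of-type({index})"


def build_selector(el):
    """Walk up to the nearest ancestor with an id, then join steps with ' > '."""
    parts = []
    node = el
    while node is not None:
        parts.append(step(node))
        if node.id:
            break
        node = node.parent
    return " > ".join(reversed(parts))
```

If no ancestor carries an id, this sketch simply runs the path up to the root element.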


Output

Each segment is a flat dict:

{
  "selector": "#product_detail > div.container > div.grid",
  "role": "grid",
  "depth": 4,
  "boundingBox": { "x": 50, "y": 300, "width": 1300, "height": 2599 },
  "identityScore": 4,
  "identitySignals": ["border", "compositional-completeness"],
  "children": []
}

children is always empty in the current flat output model. The tree structure is implicit in depth and selector ancestry.


Page Load Strategy

Pages built with SSR and SPA frameworks (Next.js, React, Vue, etc.) may never reach a true network-idle state (due to keep-alive polling, websockets, or continuous background fetches), so a two-phase wait is used:

  1. Wait for domcontentloaded (reliable, fires after initial HTML parse)
  2. Optionally wait up to 5 seconds for networkidle — if it times out, proceed with existing DOM

This prevents the 30-second timeout that wait_until="networkidle" causes on React/Next.js apps.
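
The two-phase wait might look like the following with Playwright's synchronous Python API. That the package uses Playwright is an assumption based on the wait_until="networkidle" wording; load_page is a hypothetical helper, and page is expected to be a playwright.sync_api.Page.

```python
try:
    from playwright.sync_api import TimeoutError as PlaywrightTimeout
except ImportError:
    # Fallback so the sketch can be read and exercised without Playwright.
    PlaywrightTimeout = TimeoutError


def load_page(page, url, idle_timeout_ms=5000):
    """Navigate, then wait briefly for network idle.

    Returns True if the network went idle within the budget, False if the
    wait timed out and the caller should proceed with the current DOM.
    """
    # Phase 1: return as soon as the initial HTML has been parsed.
    page.goto(url, wait_until="domcontentloaded")
    # Phase 2: bounded, best-effort wait for network idle.
    try:
        page.wait_for_load_state("networkidle", timeout=idle_timeout_ms)
        return True
    except PlaywrightTimeout:
        return False
```

Returning a flag rather than raising lets the caller log slow pages while still segmenting whatever DOM is available.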


Key Design Decisions

Structural parsing, not content scoring. The algorithm reads developer-encoded boundaries (semantic tags, visual containment, structural repetition) rather than scoring text density or link ratios. This recovers intentional component structure rather than finding "the most content-rich zone."

Coverage is the primary goal. Every user-visible region should belong to exactly one segment. Navigation bars, sidebars, breadcrumbs, and footers are all user-visible and are included. Only true technical noise (hidden elements, script/style tags, transient overlays) is pruned.

ARIA landmarks and semantic tags are authoritative. nav, section, article, aside, header, footer, form, and any element with an ARIA landmark role (navigation, banner, contentinfo, etc.) are declared segments unconditionally. Developer intent encoded in HTML/ARIA is more reliable than any heuristic.

Hard signal requirement for identity check. Soft signals (padding, spatial gap, mixed content types) fire on layout wrappers as readily as on real components. The identity check only fires when at least one hard visual signal (border, shadow, radius, background) is present. This prevents plain wrapper divs from being declared components.

Child diversity cancels background isolation. A node with more than 3 independently meaningful children is a layout container. Background isolation on such nodes is a canvas reset, not component identity — the signal is demoted.

LEAF_TAGS are never orphans. <ul>, <ol>, <table>, <figure>, etc. are always independently processable as leaf segments. They are excluded from orphan detection and included in the descent loop. Previously, a shallow <ul class="breadcrumb"> next to a large content div would trigger an orphan stop, collapsing the entire page into one segment.

Orphan check uses a small-sibling exclusion, not a large-sibling exception. A non-meaningful sibling with content only blocks descent when its height is ≥ 15% of the tallest meaningful child. Siblings smaller than 15% are section titles, headings, or label blocks that will be silently absorbed — they are not stranded. Previously the guard was inverted (only skipping siblings > 120%), causing small heading divs to trigger orphan stops and collapse entire product-listing pages into one segment.

MIN_HEIGHT is 30px, not 60px. Navigation bars, breadcrumbs, and tab bars are commonly 30–55px tall and are user-visible. The original 60px threshold was incorrectly filtering these out at the prune step. The identity-score atom guard (cap at 2) is also lowered to 30px so only true UI atoms are capped.
