Logical segmentation of web pages using visual and structural DOM heuristics.

These details have not been verified by PyPI

Project description

Logical Segmentation Algorithm

Goal

Cover all user-visible content on a page with non-overlapping logical segments — header, nav, sidebar, main content, footer, card grid, breadcrumb, etc. — each identified by a unique CSS selector and bounding box. Every visible region belongs to exactly one segment.

Core Principle

The page already has logical segments. Segmentation is therefore treated as a structural parsing problem, not a content-finding problem: the algorithm reads developer intent from the DOM rather than inferring it from content density.

A node is either a segment leaf (owns its subtree) or a layout container (transparent — descend into its children). A node never appears in both roles.

Dynamic Threshold Configuration

The segmenter adapts its pruning and structural heuristics based on an optionally provided page_type string (e.g., product_list, doc_page, homepage). Page types are mapped to logical "families" which adjust the baseline configuration:

Baseline/Default: MIN_WIDTH=80, MIN_HEIGHT=20, MIN_SUBTREE_DEPTH=3, MIN_SUBTREE_NODES=10, COMPONENT_SCORE_THRESHOLD=4
Commerce (Product Lists): Relaxes node/depth checks (MIN_SUBTREE_DEPTH=2, MIN_SUBTREE_NODES=5) to ensure small product cards aren't skipped.
Content/Docs (Blogs, FAQ): Lowers COMPONENT_SCORE_THRESHOLD to 3, as text-heavy pages often lack hard visual boundaries (borders/shadows) but still have distinct sections. Relaxes node/depth checks similarly.
Marketing (Homepages): Raises MIN_SUBTREE_NODES to 15 to prevent cohesive, visually rich hero sections from splintering into small fragments.
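
The family overrides above can be sketched as a small lookup layered on the baseline. This is an illustration only: the names Thresholds, PAGE_TYPE_FAMILY, FAMILY_OVERRIDES, and thresholds_for are hypothetical, the page_type-to-family mapping lists only the three example page types, and the Content/Docs depth and node values are assumed to match Commerce because the text says only that they are relaxed "similarly".

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    # Baseline/default values as listed above.
    MIN_WIDTH: int = 80
    MIN_HEIGHT: int = 20
    MIN_SUBTREE_DEPTH: int = 3
    MIN_SUBTREE_NODES: int = 10
    COMPONENT_SCORE_THRESHOLD: int = 4


# Hypothetical mapping from page_type strings to families.
PAGE_TYPE_FAMILY = {
    "product_list": "commerce",
    "doc_page": "content",
    "homepage": "marketing",
}

# Per-family overrides. The "content" depth/node values are an assumption.
FAMILY_OVERRIDES = {
    "commerce": {"MIN_SUBTREE_DEPTH": 2, "MIN_SUBTREE_NODES": 5},
    "content": {
        "COMPONENT_SCORE_THRESHOLD": 3,
        "MIN_SUBTREE_DEPTH": 2,
        "MIN_SUBTREE_NODES": 5,
    },
    "marketing": {"MIN_SUBTREE_NODES": 15},
}


def thresholds_for(page_type: Optional[str] = None) -> Thresholds:
    """Return the baseline, adjusted for the page type's family if known."""
    family = PAGE_TYPE_FAMILY.get(page_type)
    return replace(Thresholds(), **FAMILY_OVERRIDES.get(family, {}))
```

An unknown or missing page_type falls back to the baseline unchanged.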

Phase 1 — Pruning

Remove nodes that are never user-visible before any analysis:

Non-visible tags: script, style, noscript, meta, svg, iframe, template
Hidden elements: display:none, visibility:hidden, opacity:0, zero bounding box
Elements smaller than MIN_WIDTH×MIN_HEIGHT (true UI atoms — tiny badges, invisible spacers)
Transient noise overlays: modal, popup, overlay, toast, tooltip, dropdown, cookie, consent

What is NOT pruned (changed from earlier design):

nav, header, footer at any depth: these are user-visible page zones. <nav> was previously blanket-pruned; it is now a member of SEMANTIC_SEGMENT_TAGS and is always a leaf segment.
Elements with ARIA roles navigation, banner, contentinfo, complementary, main, search: these are all user-visible landmark zones. ARIA landmark roles are treated the same as semantic HTML5 tags.
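
A minimal sketch of the prune predicate, under stated assumptions. The Node class is a hypothetical stand-in for whatever the segmenter extracts from the browser. The sketch assumes that "smaller than MIN_WIDTH×MIN_HEIGHT" means either dimension is below its minimum, that noise keywords are matched as substrings of the class and id text, and that protected tags and roles are exempt from the size and keyword rules but not from the hidden-element rules. None of these details are confirmed by the description above.

```python
from dataclasses import dataclass, field

PRUNE_TAGS = {"script", "style", "noscript", "meta", "svg", "iframe", "template"}
NOISE_KEYWORDS = ("modal", "popup", "overlay", "toast", "tooltip",
                  "dropdown", "cookie", "consent")
PROTECTED_TAGS = {"nav", "header", "footer"}
PROTECTED_ROLES = {"navigation", "banner", "contentinfo",
                   "complementary", "main", "search"}
MIN_WIDTH, MIN_HEIGHT = 80, 20  # baseline values


@dataclass
class Node:  # hypothetical stand-in for an extracted DOM element
    tag: str
    width: float
    height: float
    style: dict = field(default_factory=dict)  # computed style values
    role: str = ""
    class_id: str = ""  # class and id attributes joined into one string


def should_prune(node: Node) -> bool:
    if node.tag in PRUNE_TAGS:
        return True
    style = node.style
    if (style.get("display") == "none"
            or style.get("visibility") == "hidden"
            or float(style.get("opacity", 1)) == 0):
        return True
    if node.width <= 0 or node.height <= 0:  # zero bounding box
        return True
    # Assumption: landmark zones skip the size and keyword rules only.
    if node.tag in PROTECTED_TAGS or node.role in PROTECTED_ROLES:
        return False
    if node.width < MIN_WIDTH or node.height < MIN_HEIGHT:
        return True
    label = node.class_id.lower()
    return any(keyword in label for keyword in NOISE_KEYWORDS)
```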

Phase 2 — Decision Logic (per node, recursive)

At each surviving node, run these checks in order. Stop at the first match.

2a. Leaf tags

pre, code, table, ul, ol, figure, blockquote, video, audio, canvas, picture — always a segment leaf, never descend. These are atomic content units.

2b. ARIA landmark roles

If the element has an ARIA landmark role (navigation, banner, contentinfo, complementary, main, search, form, region), it is always declared a segment leaf. ARIA landmarks are developer-encoded zone boundaries equivalent to semantic HTML5 tags.

2c. Semantic segment tags

section, article, aside, form, nav, header, footer — always a segment leaf. The developer used a semantic tag to explicitly mark a component boundary — trust it unconditionally.

main — trust only when not a full-width layout wrapper (width < 95% viewport) OR when a hard visual signal is present (border, shadow, radius, background isolation).

2d. Parent identity check

Score the node on visual signals. The check fires only when at least one hard signal (background isolation, border, box-shadow, border-radius) is present and the total score is ≥ COMPONENT_SCORE_THRESHOLD (4 by default). If visible siblings with content exist, fall through instead: the parent owns this node and its siblings together.

Scoring:

Signal	Points
Background differs from parent (non-transparent)	+1
Has border	+2
Has box-shadow	+2
Has border-radius > 0	+1
Padding ≥ 16px on any side	+1
Spatial gap ≥ 16px above (from previous sibling)	+1
Compositional completeness (≥2 of: text, media, interactive)	+2

Guards that override scoring:

Full-width (≥95% viewport) with no hard isolation → score zeroed
Height < MIN_HEIGHT → score capped at 2 (true UI atoms only — nav bars are typically 35-60px and are not capped)
Child diversity > 3 (more than 3 independently meaningful or semantic children) → background-isolation is demoted and removed from score. A canvas wrapper has many diverse children; a real component does not.
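
The scoring table and guards can be sketched as a pure function over a set of signal names. The signal names other than "border" and "compositional-completeness" (which appear in the sample output) are hypothetical labels, and "no hard isolation" is read here as "no hard signal present". The sibling fall-through rule is not modelled.

```python
HARD_SIGNALS = {"background-isolation", "border", "box-shadow", "border-radius"}
POINTS = {
    "background-isolation": 1,
    "border": 2,
    "box-shadow": 2,
    "border-radius": 1,
    "padding": 1,
    "spatial-gap": 1,
    "compositional-completeness": 2,
}


def identity_score(signals, *, width_ratio, height, child_diversity,
                   min_height=20):
    """Return (score, surviving signal names) after applying the guards."""
    signals = set(signals)
    # Guard: many diverse children means the background is a canvas reset.
    if child_diversity > 3:
        signals.discard("background-isolation")
    # Guard: full-width node without any hard signal scores zero.
    if width_ratio >= 0.95 and not (signals & HARD_SIGNALS):
        return 0, sorted(signals)
    score = sum(POINTS[name] for name in signals)
    # Guard: true UI atoms are capped at 2.
    if height < min_height:
        score = min(score, 2)
    return score, sorted(signals)


def is_component(score, signals, threshold=4):
    """The check fires only with a hard signal and a sufficient score."""
    return score >= threshold and bool(set(signals) & HARD_SIGNALS)
```

With the signals from the sample output, border (+2) plus compositional completeness (+2) gives 4, which is exactly enough to fire at the default threshold. Three soft signals also sum to 4 but do not fire, because no hard signal is present.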

2e. Raw text node check

If the node has direct non-whitespace text node children, it is a content-mixed node that cannot be safely split. Declare as leaf immediately.

2f. Structural similarity check

If ≥75% of direct children share the same deep fingerprint (tag + children tags + grandchildren tags), and there are ≥3 children, this is a repeating pattern (card grid, product list). The parent owns the repetition — declare as leaf.
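
One way to read the deep fingerprint is as a tuple of the node's tag, its children's tags, and each child's children's tags. The sketch below uses a hypothetical Node class and that reading; the actual fingerprint format used by the package is not specified here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:  # hypothetical stand-in for a DOM element
    tag: str
    children: list = field(default_factory=list)


def fingerprint(node: Node) -> tuple:
    """Tag + children tags + grandchildren tags, in document order."""
    child_tags = tuple(child.tag for child in node.children)
    grandchild_tags = tuple(
        tuple(grandchild.tag for grandchild in child.children)
        for child in node.children
    )
    return (node.tag, child_tags, grandchild_tags)


def is_repeating_pattern(node: Node, min_children=3, min_share=0.75) -> bool:
    """True when enough direct children share the most common fingerprint."""
    if len(node.children) < min_children:
        return False
    counts = Counter(fingerprint(child) for child in node.children)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(node.children) >= min_share
```

For example, three identical cards plus one stray paragraph gives a share of 3/4 = 75%, which still qualifies as a repeating pattern.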

2g. Meaningful children

A child is meaningful if:

It is a semantic tag (SEMANTIC_SEGMENT_TAGS) OR
It is a leaf tag (LEAF_TAGS) OR
It has subtree depth ≥ MIN_SUBTREE_DEPTH AND node count ≥ MIN_SUBTREE_NODES

If no meaningful children exist, the node is a composed unit (heading + shallow elements) — declare as leaf.
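
The meaningful-child test can be sketched as follows. The Node class is hypothetical, and subtree depth is assumed to count the node itself as level 1; the package may count depth differently.

```python
from dataclasses import dataclass, field

SEMANTIC_SEGMENT_TAGS = {"section", "article", "aside", "form",
                         "nav", "header", "footer"}
LEAF_TAGS = {"pre", "code", "table", "ul", "ol", "figure", "blockquote",
             "video", "audio", "canvas", "picture"}


@dataclass
class Node:  # hypothetical stand-in for a DOM element
    tag: str
    children: list = field(default_factory=list)


def subtree_depth(node: Node) -> int:
    # Assumption: a node with no children has depth 1.
    return 1 + max((subtree_depth(c) for c in node.children), default=0)


def subtree_nodes(node: Node) -> int:
    return 1 + sum(subtree_nodes(c) for c in node.children)


def is_meaningful(node: Node, min_depth=3, min_nodes=10) -> bool:
    if node.tag in SEMANTIC_SEGMENT_TAGS or node.tag in LEAF_TAGS:
        return True
    return (subtree_depth(node) >= min_depth
            and subtree_nodes(node) >= min_nodes)


def is_composed_unit(node: Node, **thresholds) -> bool:
    """No meaningful children: the node itself becomes the leaf."""
    return not any(is_meaningful(c, **thresholds) for c in node.children)
```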

2h. Coupled-sibling check

If there are 2 or 3 meaningful children and none individually scores ≥ 4 on the identity check, they are functionally coupled (image column + text column, slider + thumbnails). The parent is the component: declare it a leaf. Single-child nodes always descend.

2i. Orphan check

Before descending, verify no non-meaningful sibling would be left stranded. A non-meaningful sibling is skipped (not treated as an orphan) when any of the following are true:

It is hidden or < 10×10px
Its tag is in PRUNE_TAGS (script, style, etc.)
Its tag is in LEAF_TAGS (ul, ol, table, etc.) — independently processable as a leaf segment
Its tag is in SEMANTIC_SEGMENT_TAGS — will be processed independently
Its height is < 15% of the largest meaningful child's height — it is a section title, label, or heading that naturally belongs to the surrounding content and will be silently absorbed

A non-meaningful sibling triggers an orphan stop only when:

It is visible (≥10×10px)
Its height is ≥ 15% of the tallest meaningful child (substantial enough to be genuinely stranded)
It contains visible content (text nodes, img[src], visible headings/links)

If an orphan is detected, the parent declares itself the leaf — no descent.

Coupling ratio rationale: The 15% threshold distinguishes two cases:

A 76px heading next to a 640px product card (ratio = 12%) → heading, absorb silently
A 200px promo block next to a 640px card (ratio = 31%) → potential orphan, check content

Why LEAF_TAGS are excluded from orphan detection: A <ul class="breadcrumb"> is always an independently processable leaf segment. Treating it as an orphan would cause the entire page wrapper to become one giant segment.
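
The skip and trigger conditions above reduce to a short predicate. The Sibling class and its has_visible_content flag are hypothetical, and "< 10×10px" is assumed to mean either dimension is below 10px.

```python
from dataclasses import dataclass

PRUNE_TAGS = {"script", "style", "noscript", "meta", "svg", "iframe", "template"}
LEAF_TAGS = {"pre", "code", "table", "ul", "ol", "figure", "blockquote",
             "video", "audio", "canvas", "picture"}
SEMANTIC_SEGMENT_TAGS = {"section", "article", "aside", "form",
                         "nav", "header", "footer"}


@dataclass
class Sibling:  # hypothetical stand-in for a non-meaningful sibling
    tag: str
    width: float
    height: float
    hidden: bool = False
    has_visible_content: bool = True  # text nodes, img[src], headings/links


def is_orphan(sibling: Sibling, tallest_meaningful_height: float,
              ratio: float = 0.15) -> bool:
    """True when descending would strand this sibling."""
    if sibling.hidden or sibling.width < 10 or sibling.height < 10:
        return False
    if sibling.tag in PRUNE_TAGS | LEAF_TAGS | SEMANTIC_SEGMENT_TAGS:
        return False  # pruned elsewhere or processed independently
    if sibling.height < ratio * tallest_meaningful_height:
        return False  # heading or label, silently absorbed
    return sibling.has_visible_content
```

With the figures from the rationale, 15% of a 640px card is 96px, so the 76px heading is absorbed and the 200px promo block is checked for content.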

2j. Container descent

Recurse into each meaningful child, each semantic-tag child, and each leaf-tag child. Non-meaningful, non-semantic, non-leaf siblings are silently absorbed into the parent's segment. If descent yields no results, fall back to declaring the current node a leaf.

Role Inference

Role is determined in this priority order:

ARIA role: role="navigation" → nav, role="banner" → header, role="contentinfo" → footer, role="complementary" / role="search" → sidebar, role="main" → main
Semantic HTML tag: <nav> → nav, <header> → header, <footer> → footer, <aside> → sidebar, <main> → main, <article> → article, <form> → form
Class/id vocabulary scan: keywords like hero, sidebar, card, grid, cta, pricing, faq matched against class + id names
Fallback: section
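
The priority order can be sketched as a chain of lookups. The function name infer_role is hypothetical, and two details are assumptions: a matched keyword is used directly as the role name (consistent with div.grid yielding "grid" in the sample output), and keywords are tried in the order listed.

```python
ARIA_ROLE_MAP = {
    "navigation": "nav",
    "banner": "header",
    "contentinfo": "footer",
    "complementary": "sidebar",
    "search": "sidebar",
    "main": "main",
}
TAG_ROLE_MAP = {
    "nav": "nav",
    "header": "header",
    "footer": "footer",
    "aside": "sidebar",
    "main": "main",
    "article": "article",
    "form": "form",
}
# Only the keywords named in the text; the real vocabulary may be larger.
VOCABULARY = ("hero", "sidebar", "card", "grid", "cta", "pricing", "faq")


def infer_role(tag: str, aria_role: str = "", class_id: str = "") -> str:
    """Resolve a role: ARIA role, then tag, then class/id keywords."""
    if aria_role in ARIA_ROLE_MAP:
        return ARIA_ROLE_MAP[aria_role]
    if tag in TAG_ROLE_MAP:
        return TAG_ROLE_MAP[tag]
    label = class_id.lower()
    for keyword in VOCABULARY:
        if keyword in label:
            return keyword
    return "section"
```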

Selector Generation

Each segment gets a unique CSS selector built by walking up the DOM to the nearest ancestor with an id, then constructing a child-combinator (>) path downward from that anchor. Each step prefers #id, then tag.class1.class2 (verified unique within the parent scope), then tag:nth-of-type(n). The final selector is verified unique against the full document. This grounds selectors in meaningful anchors rather than document-relative positional paths.
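
A rough sketch of the walk-up-then-build-down idea over a hypothetical in-memory Node tree. It omits the final document-wide uniqueness check and assumes ids and class names need no CSS escaping, so it should be read as an outline of the step preference rather than the package's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison, so list.index finds this node
class Node:  # hypothetical stand-in for a DOM element
    tag: str
    id: str = ""
    classes: tuple = ()
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def selector_step(node: Node) -> str:
    """Prefer #id, then tag.classes if unique among siblings, then nth-of-type."""
    if node.id:
        return f"#{node.id}"
    siblings = node.parent.children if node.parent else [node]
    if node.classes:
        matching = [s for s in siblings
                    if s.tag == node.tag and set(node.classes) <= set(s.classes)]
        if len(matching) == 1:
            return node.tag + "".join(f".{name}" for name in node.classes)
    same_tag = [s for s in siblings if s.tag == node.tag]
    return f"{node.tag}:nth-of-type({same_tag.index(node) + 1})"


def build_selector(node: Node) -> str:
    """Walk up to the nearest id anchor, then join the steps top-down."""
    steps = []
    current = node
    while current is not None:
        steps.append(selector_step(current))
        if current.id:
            break
        current = current.parent
    return " > ".join(reversed(steps))
```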

Output

Each segment is a flat dict:

{
  "selector": "#product_detail > div.container > div.grid",
  "role": "grid",
  "depth": 4,
  "boundingBox": { "x": 50, "y": 300, "width": 1300, "height": 2599 },
  "identityScore": 4,
  "identitySignals": ["border", "compositional-completeness"],
  "children": []
}

children is always empty in the current flat output model. The tree structure is implicit in depth and selector ancestry.

Page Load Strategy

For SSR and SPA frameworks (Next.js, React, Vue, etc.) that never reach a true network-idle state (due to keep-alive polling, websockets, or continuous background fetches), a two-phase wait is used:

Wait for domcontentloaded (reliable, fires after initial HTML parse)
Optionally wait up to 5 seconds for networkidle — if it times out, proceed with existing DOM

This prevents the 30-second timeout that wait_until="networkidle" causes on React/Next.js apps.
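
The two-phase wait might look like this with Playwright's synchronous Python API. That the package uses Playwright is an assumption based on the wait_until="networkidle" wording; load_page and idle_timeout_ms are hypothetical names. The import fallback exists only so the sketch can be imported without Playwright installed.

```python
try:
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
except ImportError:  # sketch remains importable without Playwright
    PlaywrightTimeoutError = TimeoutError


def load_page(page, url: str, idle_timeout_ms: int = 5000) -> bool:
    """Navigate, then give the network a bounded chance to go idle.

    Returns True if network-idle was reached, False if the wait timed out
    and the caller should proceed with the DOM as it currently stands.
    """
    # Phase 1: reliable, fires after the initial HTML parse.
    page.goto(url, wait_until="domcontentloaded")
    # Phase 2: optional, bounded wait for network-idle.
    try:
        page.wait_for_load_state("networkidle", timeout=idle_timeout_ms)
        return True
    except PlaywrightTimeoutError:
        return False
```

The return value lets a caller log whether the snapshot was taken from an idle page or a still-active one, without treating the timeout as an error.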

Key Design Decisions

Structural parsing, not content scoring. The algorithm reads developer-encoded boundaries (semantic tags, visual containment, structural repetition) rather than scoring text density or link ratios. This recovers intentional component structure rather than finding "the most content-rich zone."

Coverage is the primary goal. Every user-visible region should belong to exactly one segment. Navigation bars, sidebars, breadcrumbs, and footers are all user-visible and are included. Only true technical noise (hidden elements, script/style tags, transient overlays) is pruned.

ARIA landmarks and semantic tags are authoritative. nav, section, article, aside, header, footer, form, and any element with an ARIA landmark role (navigation, banner, contentinfo, etc.) are declared segments unconditionally. Developer intent encoded in HTML/ARIA is more reliable than any heuristic.

Hard signal requirement for identity check. Soft signals (padding, spatial gap, mixed content types) fire on layout wrappers as readily as on real components. The identity check only fires when at least one hard visual signal (border, shadow, radius, background) is present. This prevents plain wrapper divs from being declared components.

Child diversity cancels background isolation. A node with more than 3 independently meaningful children is a layout container. Background isolation on such nodes is a canvas reset, not component identity — the signal is demoted.

LEAF_TAGS are never orphans. <ul>, <ol>, <table>, <figure>, etc. are always independently processable as leaf segments. They are excluded from orphan detection and included in the descent loop. Previously, a shallow <ul class="breadcrumb"> next to a large content div would trigger an orphan stop, collapsing the entire page into one segment.

Orphan check uses a small-sibling exclusion, not a large-sibling exception. A non-meaningful sibling with content only blocks descent when its height is ≥ 15% of the tallest meaningful child. Siblings smaller than 15% are section titles, headings, or label blocks that will be silently absorbed — they are not stranded. Previously the guard was inverted (only skipping siblings > 120%), causing small heading divs to trigger orphan stops and collapse entire product-listing pages into one segment.

MIN_HEIGHT is 30px, not 60px. Navigation bars, breadcrumbs, and tab bars are commonly 30–55px tall and are user-visible. The original 60px threshold was incorrectly filtering these out at the prune step. The identity-score atom guard (cap at 2) is also lowered to 30px so that only true UI atoms are capped.

Release history

0.1.1 (May 2, 2026)
0.1.0 (May 2, 2026), this version

Download files

Source Distribution

page_segmenter-0.1.0.tar.gz (22.7 kB), uploaded May 2, 2026, Source

Built Distribution

page_segmenter-0.1.0-py3-none-any.whl (18.7 kB), uploaded May 2, 2026, Python 3

File details

Details for the file page_segmenter-0.1.0.tar.gz.

File metadata

Download URL: page_segmenter-0.1.0.tar.gz
Upload date: May 2, 2026
Size: 22.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.2

File hashes

Hashes for page_segmenter-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`dd9f731f2584e0c9285d78381a900cd30a280247622d357827ca8de526ecab4c`
MD5	`e1519db1234aad31b31b6622dd79d8f6`
BLAKE2b-256	`5e2b7de663406b4c59df338ea379632a0f32fef00a2521a8047672a14766990b`


File details

Details for the file page_segmenter-0.1.0-py3-none-any.whl.

File metadata

Download URL: page_segmenter-0.1.0-py3-none-any.whl
Upload date: May 2, 2026
Size: 18.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.2

File hashes

Hashes for page_segmenter-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4203f2c108c0a0000a4d8be61a4bfbc5086e64c008a9a41d4d2baad6da4343c7`
MD5	`da97d940a51eebaf57b10537d16fcce5`
BLAKE2b-256	`785a1877d37781c93721577e99a30de3f3a9e8c7033514ad939b104cab7231c4`


page-segmenter 0.1.0
