Logical segmentation of web pages using visual and structural DOM heuristics.

Page Segmenter

A structure-aware tool that parses HTML pages into non-overlapping logical segments (e.g., header, navigation, sidebar, main content, footer, cards). It relies on visual and structural DOM heuristics rather than content-density metrics.

Features

  • Logical Structure Parsing: Determines segments using developer intent via DOM structure, ARIA landmarks, and semantic HTML.
  • Dynamic Threshold Configuration: Adjusts structural thresholds based on the page_type family (e.g., commerce, content, marketing) so that components are neither splintered nor collapsed across varied website architectures.
  • Playwright-Backed Evaluation: Executes visual heuristics directly in the browser context to account for exact layout constraints (widths, heights, visibilities).
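
As a rough illustration of the first feature, the sketch below maps ARIA landmark roles and semantic HTML tags to segment labels using only the standard library. The label names, the LANDMARK_ROLES and SEMANTIC_TAGS tables, and the LandmarkCollector class are hypothetical and not part of page_segmenter, which evaluates its heuristics in a Playwright browser context rather than with a static parser.

```python
from html.parser import HTMLParser

# Hypothetical mappings, for illustration only (not the package's tables).
LANDMARK_ROLES = {
    "banner": "header",
    "navigation": "navigation",
    "main": "main content",
    "complementary": "sidebar",
    "contentinfo": "footer",
}
SEMANTIC_TAGS = {
    "header": "header",
    "nav": "navigation",
    "main": "main content",
    "aside": "sidebar",
    "footer": "footer",
}


class LandmarkCollector(HTMLParser):
    """Collects (tag, label) pairs for elements that declare a landmark."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        role = dict(attrs).get("role")
        # An explicit ARIA role takes precedence over the tag name.
        label = LANDMARK_ROLES.get(role) or SEMANTIC_TAGS.get(tag)
        if label:
            self.found.append((tag, label))


def collect_landmarks(html):
    """Return landmark elements in document order."""
    parser = LandmarkCollector()
    parser.feed(html)
    return parser.found
```

A real implementation would also need the layout signals (width, height, visibility) that only a rendered page provides, which is why this static version is a sketch only.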

Installation

You can install the package directly from PyPI:

pip install page-segmenter

For development, clone the repository and install it using make:

git clone https://github.com/innerkorehq/page_segmenter.git
cd page_segmenter
make install
# or for dev dependencies
make install-dev

Usage

Command-Line Interface (CLI)

You can segment a live URL directly from the terminal. Use the optional --type argument to apply type-specific heuristics.

python main.py "https://example.com" --type "marketing"

You can also segment a local HTML file:

python main.py --html ./path/to/page.html --type "doc_page"

Visual Tester

To visually debug and inspect the detected segments inside a browser window:

python visual_tester.py "https://example.com" --type "product_list"

Python API

You can use the segmenter programmatically in your asynchronous Python applications:

import asyncio
import json
from page_segmenter import find_segments, find_segments_from_html

async def main():
    # Segment a live URL
    url = "https://docs.python.org/3/"
    segments = await find_segments(url, page_type="doc_page")
    print(json.dumps(segments, indent=2))

    # Segment from raw HTML
    html_content = "<html>...</html>"
    segments = await find_segments_from_html(html_content, base_url="https://example.com", page_type="commerce")

if __name__ == "__main__":
    asyncio.run(main())
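
When segmenting many pages, it may help to cap the number of calls in flight, since each call presumably drives a browser. The sketch below shows one possible pattern. The segment_many helper and its segment_fn parameter are hypothetical; segment_fn is assumed to be an async callable with the same shape as find_segments(url, page_type=...) shown above.

```python
import asyncio


async def segment_many(urls, segment_fn, page_type, max_concurrency=3):
    """Run segment_fn over urls with at most max_concurrency calls in flight.

    Returns a dict mapping each URL to its result, or to the exception
    raised for that URL, so one failing page does not abort the batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(url):
        # The semaphore bounds how many segment_fn calls run at once.
        async with semaphore:
            return await segment_fn(url, page_type=page_type)

    results = await asyncio.gather(
        *(run_one(url) for url in urls), return_exceptions=True
    )
    return dict(zip(urls, results))
```

Inside an async function this could be called as await segment_many(urls, find_segments, page_type="doc_page"). Passing the segmenter in as an argument keeps the helper testable without launching a browser.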

How It Works

The segmenter processes the DOM in a series of logical phases:

  1. Pruning: Discards invisible nodes, tracking and overlay noise (such as script or modal elements), and microscopic elements.
  2. Decision Logic: Recursively traverses the DOM, evaluating ARIA landmarks, semantic tags, parent identity scores (padding, borders, shadows), raw text density, structural similarity (card grids), and orphaned child checks.
  3. Adaptive Thresholds: Adjusts internal thresholds (such as MIN_SUBTREE_NODES or MIN_HEIGHT) based on the page_type family passed in (commerce, content, marketing, etc.).
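
The pruning and adaptive-threshold phases can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: every numeric value is an invented placeholder, the mapping from page_type values to families is a guess, the node dict shape is hypothetical, and reusing min_height as the pruning cutoff is a simplification made for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_subtree_nodes: int
    min_height: int  # pixels


# Invented placeholder values; the real numbers live in the package.
DEFAULTS = Thresholds(min_subtree_nodes=5, min_height=40)
FAMILY_THRESHOLDS = {
    "commerce": Thresholds(min_subtree_nodes=3, min_height=30),
    "content": Thresholds(min_subtree_nodes=8, min_height=60),
    "marketing": Thresholds(min_subtree_nodes=5, min_height=80),
}
# Assumed mapping from specific page types to families.
PAGE_TYPE_FAMILY = {"product_list": "commerce", "doc_page": "content"}


def thresholds_for(page_type):
    """Resolve a page_type to its family's thresholds, else the defaults."""
    family = PAGE_TYPE_FAMILY.get(page_type, page_type)
    return FAMILY_THRESHOLDS.get(family, DEFAULTS)


def is_prunable(node, thresholds):
    """Decide whether a node (dict with 'tag', 'visible', 'height') is noise."""
    if not node["visible"]:
        return True
    if node["tag"] in {"script", "style", "noscript"}:
        return True
    # Elements shorter than the cutoff are treated as microscopic.
    return node["height"] < thresholds.min_height
```

Keeping the thresholds in an immutable per-family record means the traversal code can stay identical across page types while only the numbers change.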

Read the full algorithm details in algo.md.
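
The structural-similarity signal used to spot card grids can be approximated by comparing shallow signatures of sibling subtrees. The sketch below is an assumption about how such a check might look, not the algorithm described in algo.md; the signature and looks_like_card_grid functions, the node dict shape, and the min_cards and min_ratio defaults are all hypothetical.

```python
from collections import Counter


def signature(node):
    """Shallow structural signature: the node's tag plus its children's tags."""
    child_tags = tuple(child["tag"] for child in node.get("children", []))
    return (node["tag"], child_tags)


def looks_like_card_grid(parent, min_cards=3, min_ratio=0.8):
    """Return True when most children of parent share one signature."""
    children = parent.get("children", [])
    if len(children) < min_cards:
        return False
    # Count how often the most common child signature occurs.
    _, count = Counter(signature(c) for c in children).most_common(1)[0]
    return count >= min_cards and count / len(children) >= min_ratio
```

A parent that passes this check would plausibly be kept as one segment containing repeated cards instead of being split into unrelated pieces.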

Development

A Makefile is included to streamline development tasks:

  • make install: Install the project.
  • make install-dev: Install with development dependencies.
  • make build: Build the distribution packages (sdist and wheel).
  • make publish: Build and publish the package to PyPI using twine.
  • make clean: Clean up build artifacts and cache directories.
  • make lint: Run basic syntax checks.
  • make docs: Build Sphinx documentation.
  • make test-run: Run a quick smoke test.
