Logical segmentation of web pages using visual and structural DOM heuristics.
Page Segmenter
A structure-aware tool that parses HTML pages into non-overlapping logical segments (e.g., header, navigation, sidebar, main content, footer, cards). It uses dynamic visual and structural DOM heuristics rather than content-density metrics.
Features
- Logical Structure Parsing: Infers segments from developer intent, as expressed through DOM structure, ARIA landmarks, and semantic HTML.
- Dynamic Threshold Configuration: Adjusts structural thresholds based on page_type (e.g., commerce, content, marketing) to handle varied website architectures without splintering or collapsing components.
- Playwright-Backed Evaluation: Executes visual heuristics directly in the browser context to account for exact layout constraints (widths, heights, visibilities).
Installation
Install the package from PyPI:
pip install page-segmenter
For development, clone the repository and install it using make:
git clone https://github.com/innerkorehq/page_segmenter.git
cd page_segmenter
make install
# or for dev dependencies
make install-dev
Usage
Command-Line Interface (CLI)
You can segment a live URL directly from the terminal. Use the optional --type argument to apply type-specific heuristics.
python main.py "https://example.com" --type "marketing"
You can also segment a local HTML file:
python main.py --html ./path/to/page.html --type "doc_page"
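The two invocations above take either a positional URL or an --html path, plus an optional --type. As a rough illustration only (this is not the contents of main.py, and the names build_parser, parse_cli, and page_type are hypothetical), a parser with that call shape could be sketched with argparse like this:

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags shown in the README examples."""
    parser = argparse.ArgumentParser(description="Segment a web page.")
    # The URL is optional because --html is an alternative input source.
    parser.add_argument("url", nargs="?", help="live URL to segment")
    parser.add_argument("--html", help="path to a local HTML file")
    parser.add_argument(
        "--type",
        dest="page_type",
        default=None,
        help="page type hint, e.g. marketing or doc_page",
    )
    return parser


def parse_cli(argv: List[str]) -> argparse.Namespace:
    """Parse argv and require exactly one input source (URL or --html)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if bool(args.url) == bool(args.html):
        # parser.error prints a usage message and exits with status 2.
        parser.error("provide exactly one of a URL or --html")
    return args
```

Requiring exactly one input source is a design choice of this sketch; the actual CLI may handle a missing or doubled input differently.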
Visual Tester
To visually debug and inspect the detected segments inside a browser window:
python visual_tester.py "https://example.com" --type "product_list"
Python API
You can use the segmenter programmatically in your asynchronous Python applications:
import asyncio
import json

from page_segmenter import find_segments, find_segments_from_html


async def main():
    # Segment a live URL
    url = "https://docs.python.org/3/"
    segments = await find_segments(url, page_type="doc_page")
    print(json.dumps(segments, indent=2))

    # Segment from raw HTML
    html_content = "<html>...</html>"
    segments = await find_segments_from_html(html_content, base_url="https://example.com", page_type="commerce")


if __name__ == "__main__":
    asyncio.run(main())
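Because the API is asynchronous, several pages can be segmented concurrently. The helper below is a minimal sketch, not part of the package: segment_many and max_concurrency are hypothetical names, and the segmenter is passed in as a parameter so that any coroutine with the call shape find_segments(url, page_type=...) can be used.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

# Any coroutine function callable as segmenter(url, page_type=...).
Segmenter = Callable[..., Awaitable[Any]]


async def segment_many(
    urls: List[str],
    segmenter: Segmenter,
    page_type: str,
    max_concurrency: int = 3,
) -> Dict[str, Any]:
    """Segment several URLs concurrently, capping how many run at once.

    Returns a dict mapping each URL to its result, or to the exception
    raised for that URL, so one failure does not abort the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(url: str) -> Any:
        # The semaphore limits how many segmenter calls are in flight.
        async with semaphore:
            return await segmenter(url, page_type=page_type)

    results = await asyncio.gather(
        *(run_one(url) for url in urls), return_exceptions=True
    )
    return dict(zip(urls, results))
```

With the package installed, this could be called as asyncio.run(segment_many(urls, find_segments, page_type="doc_page")). The concurrency cap is there because each call presumably drives a browser page, so unbounded parallelism may be expensive.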
How It Works
The segmenter processes the DOM in a series of logical phases:
- Pruning: Discards invisible nodes, tracking noise (like script or modal), and microscopic elements.
- Decision Logic: Recursively traverses the DOM evaluating ARIA landmarks, semantic tags, parent identity scores (padding, borders, shadows), raw text density, structural similarity (card grids), and orphaned child checks.
- Adaptive Thresholds: Changes internal variables (like MIN_SUBTREE_NODES or MIN_HEIGHT) dynamically based on the passed page_type family (commerce, content, marketing, etc.).
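To make the pruning and structural-similarity ideas concrete, here is a simplified sketch over an in-memory tree. It is an illustration under stated assumptions, not the package's implementation: the Node class, NOISE_TAGS, MIN_SIZE_PX, and the one-level signature are all hypothetical, and the real segmenter evaluates its heuristics in the browser via Playwright.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical values: the real noise list and size cutoff are not given here.
NOISE_TAGS = {"script", "style", "noscript"}
MIN_SIZE_PX = 4


@dataclass
class Node:
    tag: str
    width: float = 0.0
    height: float = 0.0
    visible: bool = True
    children: List["Node"] = field(default_factory=list)


def prune(node: Node) -> Optional[Node]:
    """Return a copy of the subtree without invisible, noise, or tiny nodes."""
    if not node.visible or node.tag in NOISE_TAGS:
        return None
    if node.width < MIN_SIZE_PX or node.height < MIN_SIZE_PX:
        return None
    kept = [p for p in (prune(c) for c in node.children) if p is not None]
    return Node(node.tag, node.width, node.height, node.visible, kept)


def signature(node: Node) -> tuple:
    """Shallow structural fingerprint: the node's tag plus its child tags."""
    return (node.tag, tuple(c.tag for c in node.children))


def looks_like_card_grid(parent: Node, min_cards: int = 3) -> bool:
    """True when the parent has enough children and all share one signature."""
    sigs = [signature(c) for c in parent.children]
    return len(sigs) >= min_cards and len(set(sigs)) == 1
```

Pruning first matters for the similarity check: stray script tags or 1px spacers among otherwise identical cards would break a naive "all children look alike" test.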
Read the full algorithm details in algo.md.
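The adaptive-threshold phase can be pictured as a lookup from page type to a set of tuning values. The sketch below is an assumption-laden illustration: the numbers are placeholders, the Thresholds class and thresholds_for function are hypothetical, and the mapping of product_list to commerce and doc_page to content is a guess at how concrete page types might group into families.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    min_subtree_nodes: int  # smallest subtree treated as its own segment
    min_height: int  # minimum rendered height in pixels


# Placeholder numbers chosen for illustration, not the package's real values.
DEFAULTS = Thresholds(min_subtree_nodes=5, min_height=40)
BY_FAMILY = {
    "commerce": Thresholds(min_subtree_nodes=3, min_height=30),
    "content": Thresholds(min_subtree_nodes=8, min_height=60),
    "marketing": Thresholds(min_subtree_nodes=5, min_height=80),
}

# Assumed grouping of concrete page types into families.
FAMILY_OF = {
    "product_list": "commerce",
    "doc_page": "content",
}


def thresholds_for(page_type: Optional[str]) -> Thresholds:
    """Resolve a page type (or family name) to thresholds, with a fallback."""
    if page_type is None:
        return DEFAULTS
    family = FAMILY_OF.get(page_type, page_type)
    return BY_FAMILY.get(family, DEFAULTS)
```

Falling back to defaults for unknown types keeps the lookup total, so an unrecognized --type value degrades to generic behavior instead of failing.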
Development
A Makefile is included to streamline development tasks:
- make install: Install the project.
- make install-dev: Install with development dependencies.
- make build: Build the distribution packages (sdist and wheel).
- make publish: Build and publish the package to PyPI using twine.
- make clean: Clean up build artifacts and cache directories.
- make lint: Run basic syntax checks.
- make docs: Build Sphinx documentation.
- make test-run: Run a quick smoke test.
File details
Details for the file page_segmenter-0.1.1.tar.gz.
File metadata
- Download URL: page_segmenter-0.1.1.tar.gz
- Upload date:
- Size: 16.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f6caeae42801f28fecd8e4367dac4ba3c169dc2e482563e05308dbfa4133ce0 |
| MD5 | 57a722c7c7664a26863044995d6e1087 |
| BLAKE2b-256 | 6f6c54880e0ff2fa1a1bb808ce678061bd9cb15bfe674aebcba466efea5b7e58 |
File details
Details for the file page_segmenter-0.1.1-py3-none-any.whl.
File metadata
- Download URL: page_segmenter-0.1.1-py3-none-any.whl
- Upload date:
- Size: 15.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 82c3799f0048d44ca02375a36c29e4ab4b3ef56c1ac8d3d5f53d4953d3a05422 |
| MD5 | 074cc408c1c4775594d16eb7405b58bf |
| BLAKE2b-256 | c612542b2eb22f6146f31ec96902422f61fea8c09614acc6dfe957aa90085beb |