Async segment classifier library

Segment Classifier

An asynchronous Python library that classifies HTML segments extracted by a page-segmenter into structured component types.

Overview

segment_classifier implements a four-stage classification pipeline with progressive fallback: a segment moves on to the next stage only when the current stage cannot classify it, which keeps LLM cost and latency down:

  1. Rule-based heuristics — Zero LLM cost. Uses DOM structure, text density, siblings, and attributes.
  2. L1 exact fingerprint cache — Zero LLM cost. Exact matching on structural DOM fingerprint hashes.
  3. L2 fuzzy cluster cache — Zero LLM cost. TF-IDF and cosine similarity on fingerprint tokens.
  4. LLM batch classification — Batched fallback via LiteLLM with feature-based model routing based on segment complexity.
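
The fallback order above can be pictured as a simple cascade. The following is a minimal sketch of that idea, not the library's implementation: the stage functions, the `Outcome` type, and the component labels are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A stage receives a segment's HTML and returns a label,
# or None when it cannot decide. (Hypothetical interface.)
Stage = Callable[[str], Optional[str]]


@dataclass
class Outcome:
    label: str
    stage: str  # name of the stage that produced the label


def classify(html: str, stages: list[tuple[str, Stage]], default: str = "unknown") -> Outcome:
    """Try each stage in order and stop at the first one that returns a label."""
    for name, stage in stages:
        label = stage(html)
        if label is not None:
            return Outcome(label, name)
    return Outcome(default, "unresolved")


# Stand-in stages, cheapest first. Real stages would inspect the DOM,
# consult the caches, or call a model.
def rule_stage(html: str) -> Optional[str]:
    return "navigation" if html.lstrip().startswith("<nav") else None


L1_CACHE = {"<div class='product-card'>...</div>": "product_card"}


def l1_stage(html: str) -> Optional[str]:
    return L1_CACHE.get(html)


def l2_stage(html: str) -> Optional[str]:
    return None  # pretend no cluster is close enough


def llm_stage(html: str) -> Optional[str]:
    return "generic_block"  # placeholder for a real model call


STAGES = [("rules", rule_stage), ("l1_cache", l1_stage),
          ("l2_cache", l2_stage), ("llm", llm_stage)]

print(classify("<nav><a href='/'>Home</a></nav>", STAGES))
print(classify("<p>Unfamiliar block</p>", STAGES))
```

Because the loop returns at the first non-None label, the model stand-in is only reached for segments that every cheaper stage declined.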

Installation

From a checkout of the source repository, install the package and its dependencies with Poetry:

poetry install

Or install from PyPI with pip:

pip install segment-classifier

Setup

The library uses pydantic-settings to manage configuration via a .env file or environment variables.

Required environment variables:

CLASSIFIER_LITELLM_API_KEY="your-api-key"
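
If you would rather fail fast before building the pipeline, a small pre-flight check can help. This is a sketch using only the standard library; `missing_settings` and `REQUIRED_VARS` are hypothetical names and not part of segment_classifier. It inspects only the mapping it is given (by default the process environment), so a key supplied solely through a .env file will be reported as missing here even though the library may still pick it up.

```python
import os
from typing import Mapping

# The one variable this README lists as required.
REQUIRED_VARS = ("CLASSIFIER_LITELLM_API_KEY",)


def missing_settings(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not environ.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing configuration:", ", ".join(missing))
```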

Usage

import asyncio
from segment_classifier import ClassifierPipeline
from segment_classifier.config import ClassifierSettings
from segment_classifier.models import InputSegment, SegmentPosition

async def main():
    # Settings are read from environment variables or a .env file.
    settings = ClassifierSettings()
    pipeline = ClassifierPipeline(settings)
    await pipeline.initialize()

    # Segments as produced by the page segmenter.
    segments = [
        InputSegment(
            segment_id="seg_001",
            page_url="https://example.com/products",
            page_slug="products",
            raw_html="<div class='product-card'>...</div>",
            text_content="Product Item",
            position_hint=SegmentPosition.MIDDLE,
            sibling_count=3,
        )
    ]

    try:
        result = await pipeline.run(segments)
    finally:
        # Always shut the pipeline down, even if run() raises.
        await pipeline.shutdown()

    for seg in result.classified:
        print(seg.component_type)

asyncio.run(main())

Caching

By default, the L1 cache is stored in .cache/l1_fingerprints.json and the L2 cache in .cache/l2_clusters.json and .cache/l2_embeddings.npy.
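
To make the difference between the two caches concrete, here is a small sketch of exact versus fuzzy matching on structural tokens. It assumes a fingerprint built from tag names and sorted class names, which is only one plausible scheme; the library's actual fingerprinting, thresholds, and file contents are not described here. All function names are hypothetical, and the sketch uses pure Python rather than NumPy.

```python
import hashlib
import math
from collections import Counter
from html.parser import HTMLParser


class _TokenCollector(HTMLParser):
    """Collects one structural token per opening tag and ignores text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        classes = sorted((dict(attrs).get("class") or "").split())
        self.tokens.append(tag + "".join("." + c for c in classes))


def fingerprint_tokens(html):
    parser = _TokenCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


def exact_key(tokens):
    """Hash of the token sequence: identical structure gives an identical key."""
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def tfidf_vectors(docs):
    """Smoothed TF-IDF weights for each token list in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


card = fingerprint_tokens("<div class='card product'><h2>Shoes</h2><p>Red</p></div>")
same = fingerprint_tokens("<div class='product card'><h2>Hats</h2><p>Blue</p></div>")
near = fingerprint_tokens("<div class='card product'><h2>Bags</h2><p>x</p><span>y</span></div>")
menu = fingerprint_tokens("<nav><ul><li>Home</li></ul></nav>")

print(exact_key(card) == exact_key(same))  # same structure, different text
v_card, v_near, v_menu = tfidf_vectors([card, near, menu])
print(cosine(v_card, v_near) > cosine(v_card, v_menu))
```

An exact key only matches segments with identical structure, while the cosine score still ranks the near-duplicate card above the unrelated navigation block, which is the kind of case a fuzzy cache is meant to catch.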

Stages Breakdown

Every returned ClassifiedSegment is marked with a classification_stage indicating which of the four stages classified it.
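
One practical use of classification_stage is to see how much work reached the LLM. The helper below is a hedged sketch: `stage_breakdown` is a hypothetical name, and the stand-in objects and stage values ("rule", "llm") are invented for illustration, since the real values are not listed here. With the library you would pass `result.classified` instead.

```python
from collections import Counter
from types import SimpleNamespace


def stage_breakdown(classified):
    """Count how many segments each stage resolved."""
    return Counter(seg.classification_stage for seg in classified)


# Stand-in objects exposing only the attribute the helper reads.
sample = [
    SimpleNamespace(classification_stage="rule"),
    SimpleNamespace(classification_stage="rule"),
    SimpleNamespace(classification_stage="llm"),
]

for stage, count in stage_breakdown(sample).most_common():
    print(stage, count)
```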
