Async segment classifier library

Segment Classifier

An asynchronous Python library that classifies HTML segments extracted by a page-segmenter into structured component types.

Overview

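The segment_classifier library implements a 4-stage classification pipeline with progressive fallback. Stages are ordered from cheapest to most expensive, and a segment moves on to the next stage only when the current one cannot resolve it, which keeps LLM cost and latency down: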

  1. Rule-based heuristics — Zero LLM cost. Uses DOM structure, text density, siblings, and attributes.
  2. L1 exact fingerprint cache — Zero LLM cost. Exact matching on structural DOM fingerprint hashes.
  3. L2 fuzzy cluster cache — Zero LLM cost. TF-IDF and cosine similarity on fingerprint tokens.
  4. LLM batch classification — Batched fallback via LiteLLM with feature-based model routing based on segment complexity.
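
The fallback order above can be sketched as a simple cascade. This is a minimal illustration, not the library's implementation: the stage functions, their signatures, and the labels they return are hypothetical stand-ins, and only three of the four stages are mocked.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical stage signature: return a label, or None to defer to the next stage.
Stage = Callable[[str], Awaitable[Optional[str]]]


async def rules(html: str) -> Optional[str]:
    # Toy heuristic: treat a <nav> root element as a navigation component.
    return "navigation" if html.lstrip().startswith("<nav") else None


async def exact_cache(html: str) -> Optional[str]:
    # Stand-in for an exact fingerprint lookup (a plain dict here).
    return {"<footer></footer>": "footer"}.get(html)


async def llm_fallback(html: str) -> Optional[str]:
    # Stand-in for the LLM call; as the last resort it always answers.
    return "unknown"


STAGES: list[tuple[str, Stage]] = [
    ("rules", rules),
    ("l1_cache", exact_cache),
    ("llm", llm_fallback),
]


async def classify(html: str) -> tuple[str, str]:
    """Return (label, stage_name) from the first stage that resolves the segment."""
    for name, stage in STAGES:
        label = await stage(html)
        if label is not None:
            return label, name
    raise LookupError("no stage resolved the segment")


def classify_sync(html: str) -> tuple[str, str]:
    # Convenience wrapper for callers that are not already inside an event loop.
    return asyncio.run(classify(html))
```

Because each stage returns None to defer, cheaper stages can be added or reordered without touching the loop, and the stage name returned alongside the label records which one resolved the segment.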

Installation

From a checkout of the source repository, install the package and its dependencies with Poetry:

poetry install

Or install the published package from PyPI with pip:

pip install segment-classifier

Setup

The library uses pydantic-settings to manage configuration via a .env file or environment variables.

Required environment variables:

CLASSIFIER_LITELLM_API_KEY="your-api-key"
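
A small pre-flight check can catch a missing key before the pipeline starts. The sketch below uses only the standard library; the helper name missing_settings is hypothetical, and it inspects process environment variables only, so a key supplied solely through a .env file would not be seen by it.

```python
import os
from typing import Mapping, Optional

# The one variable the README lists as required.
REQUIRED_VARS = ("CLASSIFIER_LITELLM_API_KEY",)


def missing_settings(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```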

Usage

import asyncio
from segment_classifier import ClassifierPipeline
from segment_classifier.config import ClassifierSettings
from segment_classifier.models import InputSegment, SegmentPosition

async def main():
    # Settings are read from environment variables or a .env file.
    settings = ClassifierSettings()
    pipeline = ClassifierPipeline(settings)
    await pipeline.initialize()

    # One segment as produced by the page-segmenter.
    segments = [
        InputSegment(
            segment_id="seg_001",
            page_url="https://example.com/products",
            page_slug="products",
            raw_html="<div class='product-card'>...</div>",
            text_content="Product Item",
            position_hint=SegmentPosition.MIDDLE,
            sibling_count=3,
        )
    ]

    try:
        result = await pipeline.run(segments)
    finally:
        # Always shut the pipeline down, even if classification fails.
        await pipeline.shutdown()

    for seg in result.classified:
        print(seg.component_type)

asyncio.run(main())

Caching

By default, the L1 fingerprint cache is stored in .cache/l1_fingerprints.json, and the L2 cluster cache in .cache/l2_clusters.json and .cache/l2_embeddings.npy.
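
To give a feel for what the two caches key on, here is a standard-library sketch of a structural fingerprint (exact match, as in L1) and a TF-IDF cosine similarity over the same tokens (fuzzy match, as in L2). The token scheme, the smoothed IDF formula, and all function names are assumptions made for illustration; the library's actual fingerprinting and file formats are not documented here.

```python
import hashlib
import math
from collections import Counter
from html.parser import HTMLParser


class _TokenCollector(HTMLParser):
    """Collects structural tokens (tag names and class names), ignoring text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.tokens.extend("." + cls for cls in value.split())


def fingerprint_tokens(html):
    parser = _TokenCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


def fingerprint_hash(html):
    # Exact-match key: identical structure gives an identical hash.
    joined = " ".join(fingerprint_tokens(html))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def tfidf_vectors(docs):
    # docs is a list of token lists; returns one sparse {token: weight} dict per doc.
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            tok: (count / len(doc)) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, count in tf.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0
```

Two product cards that differ only in their text hash to the same fingerprint, while a card with one extra element misses the exact cache but still scores a high cosine similarity against the original.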

Stages Breakdown

Every returned ClassifiedSegment carries a classification_stage field that records which of the four stages resolved it.
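
One use for this field is measuring how often the pipeline falls through to the LLM. The sketch below tallies stages with a stand-in dataclass; ClassifiedSegmentStub and the stage values "rules", "l1_cache", and "llm" are hypothetical, since the real field may be an enum with different names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClassifiedSegmentStub:
    # Hypothetical stand-in mirroring the attributes named in this README.
    segment_id: str
    component_type: str
    classification_stage: str


def stage_breakdown(classified):
    """Count how many segments each stage resolved."""
    return Counter(seg.classification_stage for seg in classified)


def llm_share(classified, llm_stage="llm"):
    """Fraction of segments that fell through to the LLM stage."""
    total = len(classified)
    return stage_breakdown(classified)[llm_stage] / total if total else 0.0
```

With the real library, the same functions could be applied to result.classified, passing whatever value the library uses for its LLM stage.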
