Async segment classifier library

Segment Classifier

An asynchronous Python library that classifies HTML segments extracted by a page-segmenter into structured component types.

Overview

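segment_classifier implements a four-stage classification pipeline with progressive fallback: a segment is passed to the next, more expensive stage only when the current stage cannot classify it, which keeps cost and latency low. The stages are: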

  1. Rule-based heuristics: zero LLM cost. Uses DOM structure, text density, siblings, and attributes.
  2. L1 exact fingerprint cache: zero LLM cost. Exact matching on structural DOM fingerprint hashes.
  3. L2 fuzzy cluster cache: zero LLM cost. TF-IDF and cosine similarity on fingerprint tokens.
  4. LLM batch classification: batched fallback via LiteLLM, with the model for each segment chosen from features that reflect its complexity.
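
The fallback order described above can be sketched as a loop over stages, where each stage either returns a label or passes the segment on. This is a minimal illustration, not the library's implementation: the stage functions, the label strings, and the `classify` helper are all hypothetical, and the real LLM stage batches segments rather than handling them one at a time.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical stage signature: return a label, or None to fall through.
Stage = Callable[[str], Awaitable[Optional[str]]]


async def rule_stage(html: str) -> Optional[str]:
    # Toy heuristic standing in for the rule-based stage.
    return "navigation" if "<nav" in html else None


async def cache_stage(html: str) -> Optional[str]:
    # Toy exact-match lookup standing in for a cache stage.
    return {"<footer>c</footer>": "footer"}.get(html)


async def llm_stage(html: str) -> Optional[str]:
    # Placeholder for the LLM fallback; no network call is made here.
    return "unknown"


# Cheapest stage first, most expensive last.
STAGES: list[tuple[str, Stage]] = [
    ("rules", rule_stage),
    ("cache", cache_stage),
    ("llm", llm_stage),
]


async def classify(html: str, stages: list[tuple[str, Stage]]) -> tuple[str, str]:
    """Try each stage in order; return (label, name of the resolving stage)."""
    for name, stage in stages:
        label = await stage(html)
        if label is not None:
            return label, name
    raise LookupError("no stage produced a label")


if __name__ == "__main__":
    for html in ("<nav>a</nav>", "<footer>c</footer>", "<div>x</div>"):
        print(html, asyncio.run(classify(html, STAGES)))
```

Ordering the stages from cheapest to most expensive means the LLM is reached only by segments that the heuristics and caches cannot resolve.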

Installation

To install from a checkout of the repository, use Poetry:

poetry install

Or install the published package from PyPI with pip:

pip install segment-classifier

Setup

The library uses pydantic-settings to manage configuration via a .env file or environment variables.

Required environment variables:

CLASSIFIER_LITELLM_API_KEY="your-api-key"
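
A small pre-flight check can catch a missing key before any pipeline is constructed. The sketch below is an assumption-laden helper, not part of the library: `missing_vars` and `REQUIRED_VARS` are hypothetical names, and the check inspects only the mapping it is given, so a key that lives solely in a .env file would need to be loaded first.

```python
from typing import Mapping, Sequence

# The one variable the README lists as required.
REQUIRED_VARS = ("CLASSIFIER_LITELLM_API_KEY",)


def missing_vars(
    environ: Mapping[str, str],
    required: Sequence[str] = REQUIRED_VARS,
) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in required if not environ.get(name, "").strip()]


print(missing_vars({"CLASSIFIER_LITELLM_API_KEY": "your-api-key"}))  # []
print(missing_vars({}))  # ['CLASSIFIER_LITELLM_API_KEY']
```

In an application, passing `os.environ` as the mapping checks the current process environment.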

Usage

import asyncio

from segment_classifier import ClassifierPipeline
from segment_classifier.config import ClassifierSettings
from segment_classifier.models import InputSegment, SegmentPosition


async def main():
    # Settings are read from the .env file or environment variables.
    settings = ClassifierSettings()
    pipeline = ClassifierPipeline(settings)
    await pipeline.initialize()

    # One segment as produced by the page-segmenter.
    segments = [
        InputSegment(
            segment_id="seg_001",
            page_url="https://example.com/products",
            page_slug="products",
            raw_html="<div class='product-card'>...</div>",
            text_content="Product Item",
            position_hint=SegmentPosition.MIDDLE,
            sibling_count=3,
        )
    ]

    try:
        result = await pipeline.run(segments)
    finally:
        # Shut the pipeline down even if classification raises.
        await pipeline.shutdown()

    for seg in result.classified:
        print(seg.component_type)


asyncio.run(main())

Caching

By default, the L1 cache is stored in .cache/l1_fingerprints.json, and the L2 cache in .cache/l2_clusters.json and .cache/l2_embeddings.npy.
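
To make the two cache levels concrete, here is a rough sketch of how an exact fingerprint hash and a TF-IDF cosine lookup could work on the same tokens. Everything in it is an assumption for illustration: the token scheme (tag name plus sorted class names), the SHA-256 hash, the smoothed IDF formula, the 0.75 threshold, and the function names are not taken from the library.

```python
import hashlib
import math
from collections import Counter
from html.parser import HTMLParser
from typing import Optional


class _TagCollector(HTMLParser):
    """Collect one token per opening tag, e.g. 'div.product-card'."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.tokens.append(".".join([tag, *sorted(classes.split())]))


def fingerprint_tokens(html: str) -> list[str]:
    """Structural tokens only; text content is ignored."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


def fingerprint_hash(tokens: list[str]) -> str:
    """L1-style key: identical structure gives an identical hash."""
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def tfidf(tokens: list[str], doc_freq: Counter, n_docs: int) -> dict[str, float]:
    """Term count times a smoothed inverse document frequency."""
    counts = Counter(tokens)
    return {
        tok: n * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
        for tok, n in counts.items()
    }


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def fuzzy_lookup(
    tokens: list[str],
    clusters: dict[str, list[str]],
    threshold: float = 0.75,  # arbitrary value chosen for this sketch
) -> Optional[str]:
    """L2-style lookup: label of the most similar cluster, or None."""
    docs = [*clusters.values(), tokens]
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    query = tfidf(tokens, doc_freq, len(docs))
    best_label, best_score = None, 0.0
    for label, representative in clusters.items():
        score = cosine(query, tfidf(representative, doc_freq, len(docs)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None
```

Because the tokens ignore text, two product cards that differ only in their copy share an L1 key, while a card with one extra element misses L1 but can still land in the same L2 cluster.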

Stages Breakdown

Every returned ClassifiedSegment carries a classification_stage field that indicates which of the four stages classified it.
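
One use for this field is measuring how much traffic each stage absorbs. The sketch below uses a stand-in dataclass instead of the library's ClassifiedSegment, and the stage values ("rules", "l1_cache", "llm") are made up for illustration; the real field may use different values or an enum.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeClassifiedSegment:
    """Stand-in for ClassifiedSegment, with only the two fields used here."""

    component_type: str
    classification_stage: str  # hypothetical stage values


def stage_breakdown(segments) -> dict[str, float]:
    """Fraction of segments resolved by each stage."""
    counts = Counter(seg.classification_stage for seg in segments)
    total = sum(counts.values())
    return {stage: n / total for stage, n in counts.items()} if total else {}


sample = [
    FakeClassifiedSegment("navigation", "rules"),
    FakeClassifiedSegment("footer", "rules"),
    FakeClassifiedSegment("product_card", "l1_cache"),
    FakeClassifiedSegment("hero", "llm"),
]
print(stage_breakdown(sample))
```

A rising share for the LLM stage would suggest the caches are cold or the heuristics are missing a common layout.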
