Async segment classifier library
Segment Classifier
An asynchronous Python library that classifies HTML segments extracted by a page-segmenter into structured component types.
Overview
segment_classifier implements a four-stage classification pipeline. A segment falls through to the next stage only when the current stage cannot resolve it, so the cheapest and fastest stages are tried first:
- Rule-based heuristics: zero LLM cost. Uses DOM structure, text density, sibling count, and attributes.
- L1 exact fingerprint cache: zero LLM cost. Exact matching on hashes of the structural DOM fingerprint.
- L2 fuzzy cluster cache: zero LLM cost. TF-IDF and cosine similarity over fingerprint tokens.
- LLM batch classification: batched fallback via LiteLLM, with the model selected from segment complexity features.
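The fallback order above can be sketched as a simple cascade. This is a minimal illustration, not the library's implementation: the stage functions, the `L1_CACHE` dictionary, and the labels are all hypothetical stand-ins, the toy cache is keyed on raw HTML rather than a fingerprint hash, and the sketch classifies one segment at a time rather than batching the LLM stage.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical stage signature: return a label, or None to fall through.
Stage = Callable[[str], Awaitable[Optional[str]]]

# Toy exact-match cache, keyed on raw HTML for simplicity.
L1_CACHE = {"<ul><li>a</li></ul>": "list"}


async def rules(html: str) -> Optional[str]:
    # Stand-in for rule-based heuristics.
    return "navigation" if html.startswith("<nav") else None


async def l1_cache(html: str) -> Optional[str]:
    # Stand-in for the exact fingerprint cache.
    return L1_CACHE.get(html)


async def l2_cache(html: str) -> Optional[str]:
    # Stand-in for the fuzzy cluster cache; always misses in this sketch.
    return None


async def llm(html: str) -> Optional[str]:
    # Stand-in for the LLM fallback; no network call is made here.
    return "unknown"


STAGES: list[tuple[str, Stage]] = [
    ("rules", rules),
    ("l1_cache", l1_cache),
    ("l2_cache", l2_cache),
    ("llm", llm),
]


async def classify(html: str) -> tuple[str, str]:
    """Return (label, stage_name) from the first stage that resolves."""
    for name, stage in STAGES:
        label = await stage(html)
        if label is not None:
            return label, name
    raise LookupError("no stage produced a label")


if __name__ == "__main__":
    print(asyncio.run(classify("<nav><a>Home</a></nav>")))
```

Returning the stage name alongside the label mirrors the idea of tagging each result with the stage that resolved it.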
Installation
You can install the package using poetry:
poetry install
Or via pip (once published):
pip install segment-classifier
Setup
The library uses pydantic-settings to manage configuration via a .env file or environment variables.
Required environment variables:
CLASSIFIER_LITELLM_API_KEY="your-api-key"
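A small preflight check can catch a missing key before the pipeline is constructed. The helper below is a hypothetical, standard-library-only sketch and is not part of the library. It inspects only the process environment (or a mapping you pass in), so a value defined solely in a .env file would not be seen by it.

```python
import os
from typing import Mapping, Optional

# The only variable the README lists as required.
REQUIRED = ("CLASSIFIER_LITELLM_API_KEY",)


def missing_settings(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
```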
Usage
import asyncio

from segment_classifier import ClassifierPipeline
from segment_classifier.config import ClassifierSettings
from segment_classifier.models import InputSegment, SegmentPosition


async def main():
    settings = ClassifierSettings()
    pipeline = ClassifierPipeline(settings)
    await pipeline.initialize()

    segments = [
        InputSegment(
            segment_id="seg_001",
            page_url="https://example.com/products",
            page_slug="products",
            raw_html="<div class='product-card'>...</div>",
            text_content="Product Item",
            position_hint=SegmentPosition.MIDDLE,
            sibling_count=3,
        )
    ]

    result = await pipeline.run(segments)
    await pipeline.shutdown()

    for seg in result.classified:
        print(seg.component_type)


asyncio.run(main())
Caching
By default, the L1 cache is stored in .cache/l1_fingerprints.json and the L2 cache in .cache/l2_clusters.json and .cache/l2_embeddings.npy.
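To make the difference between the two caches concrete, here is a rough sketch of an exact fingerprint key next to a fuzzy similarity score. It rests on several assumptions: the tokenization (tag names plus class names), the SHA-256 key, and the helper names are all hypothetical, and the similarity uses raw token counts rather than the TF-IDF weighting the library describes.

```python
import hashlib
import math
from collections import Counter
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collect tag names and class names; text content is ignored."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.tokens.extend(f".{cls}" for cls in value.split())


def fingerprint_tokens(html: str) -> list[str]:
    parser = _TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


def exact_key(tokens: list[str]) -> str:
    # Exact-match key: identical structure yields an identical hash.
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def cosine(a: list[str], b: list[str]) -> float:
    # Cosine similarity over raw token counts (no IDF weighting).
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two product cards that differ only in text share an exact key, while a card with one extra element misses the exact lookup but still scores close to 1.0 on cosine similarity, which is the case a fuzzy cache is meant to catch.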
Stages Breakdown
Every returned ClassifiedSegment carries a classification_stage value that records which of the four stages resolved it.
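One use for this value is checking how much traffic reaches the LLM stage. The sketch below tallies results by stage. `FakeSegment` is a hypothetical stand-in for ClassifiedSegment, and the stage values shown are illustrative, since the actual values are not listed here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeSegment:
    # Stand-in exposing the two attributes named in this README.
    component_type: str
    classification_stage: str


def stage_breakdown(segments: Iterable[FakeSegment]) -> Counter:
    """Count how many segments each stage resolved."""
    return Counter(seg.classification_stage for seg in segments)


if __name__ == "__main__":
    sample = [
        FakeSegment("navigation", "rules"),
        FakeSegment("product_card", "l1_cache"),
        FakeSegment("footer", "rules"),
    ]
    for stage, count in stage_breakdown(sample).most_common():
        print(stage, count)
```

With the real library, the same function could be applied to result.classified, as long as each item exposes classification_stage.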