Self-building hybrid classifier: FAISS embeddings + LLM fallback with feedback loop

Maintainers

2lines

adaptive-simple-text-classifier

Self-building hybrid text classifier. FAISS embedding search with LLM fallback and automatic feedback loop.

Classifies messy, abbreviated text into structured taxonomies. The index grows as LLM results feed back, so accuracy improves and LLM costs decrease over time.

Install

pip install adaptive-simple-text-classifier

# With LLM provider:
pip install "adaptive-simple-text-classifier[anthropic]"   # Direct Anthropic API
pip install "adaptive-simple-text-classifier[vertex]"      # Google Cloud Vertex AI
pip install "adaptive-simple-text-classifier[bedrock]"     # AWS Bedrock
pip install "adaptive-simple-text-classifier[all]"         # All providers

How It Works

Input Text ──> Normalize ──> FAISS Search ──┬──> Confident? ──> Return result
                                            │
                                            └──> Uncertain? ──> LLM Classify ──> Return result
                                                                     │
                                                                     └──> Feed back into FAISS index

1. First run: most items go to the LLM (cold start, with only taxonomy labels in the index).
2. LLM results are embedded and stored back in the FAISS index.
3. Subsequent runs: FAISS handles most items; the LLM handles only novel patterns.
4. Over time: the embedding hit rate climbs and LLM costs fall as fewer items need the fallback.
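
The routing and feedback loop described above can be illustrated with a small toy model. This is a minimal sketch, not the library's implementation: it replaces sentence embeddings and FAISS with character-trigram counts and a brute-force cosine search, and `ToyHybridClassifier`, `embed`, `cosine`, and `fake_llm` are hypothetical names used only for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: character-trigram counts (stand-in for a real sentence embedding)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyHybridClassifier:
    def __init__(self, labels, llm, threshold=0.65):
        # Cold start: the index holds only the taxonomy labels themselves.
        self.index = [(embed(label), label) for label in labels]
        self.llm = llm
        self.threshold = threshold

    def classify(self, text):
        vec = embed(text)
        # Brute-force nearest neighbour (stand-in for the vector search).
        score, label = max(((cosine(vec, v), lab) for v, lab in self.index), key=lambda p: p[0])
        if score >= self.threshold:
            return label, "embedding"
        label = self.llm(text)           # uncertain: fall back to the LLM
        self.index.append((vec, label))  # feed the LLM answer back into the index
        return label, "llm"


def fake_llm(text):
    """Stub standing in for a real LLM call."""
    return "Cheeseburger" if "chz" in text else "Coffee"


clf = ToyHybridClassifier(["Cheeseburger", "Coffee"], llm=fake_llm)
print(clf.classify("chz brgr"))  # first call is routed to the stub LLM
print(clf.classify("chz brgr"))  # second call is answered from the index
```

In this toy, the second call is answered from the index only because the exact text was fed back after the first call; a real embedding model would also generalize to similar but unseen inputs.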

Quick Start

from adaptive_classifier import AdaptiveClassifier, create_normalizer

# Define your taxonomy (nested dict, flat list, or YAML/JSON file)
taxonomy = {
    "Food": {
        "Burgers": ["Hamburger", "Cheeseburger", "Veggie Burger"],
        "Pizza": ["Pepperoni", "Margherita", "Hawaiian"],
        "Drinks": ["Coffee", "Juice", "Soda"],
    },
    "Retail": {
        "Electronics": ["Phone", "Laptop", "Tablet"],
        "Furniture": ["Chair", "Table", "Bookshelf"],
    },
}

classifier = AdaptiveClassifier(
    taxonomy=taxonomy,
    provider="anthropic",              # or "vertex", "bedrock", callable
    index_path="./my_classifier",      # persists to disk
    confidence_threshold=0.65,         # below this -> LLM fallback
    normalizer=create_normalizer(),    # expands abbreviations, strips noise
)

results = classifier.classify([
    "chz brgr",
    "lg pep pizza",
    "bkshf oak - $249.99",
    "iced coffee lg",
])

for r in results:
    print(f"{r.input_text:30s} -> {r.category_path:40s} ({r.confidence:.2f}, {r.source.value})")

# Check stats
print(results.stats.to_dict())
# {'total': 4, 'embedding_hits': 1, 'llm_calls': 1, 'llm_items': 3, 'fed_back': 6, ...}

# Run again - more hits from the index, fewer LLM calls
results2 = classifier.classify(["double chz burger", "pepperoni pza sm"])
print(f"Hit rate: {results2.stats.embedding_hits}/{results2.stats.total}")

Benchmark Results

Tested against the Financial Transaction Categorization Dataset (4.5M records, 10 categories) with 50 training examples and 100 test records. See example/ for the full benchmark.

Accuracy across 3 runs

Run          Embed-only       Hybrid+LLM    Post-feedback
---------------------------------------------------------
#1               42.0%           90.0%           72.0%
#2                               90.0%           82.0%
#3                               90.0%           84.0%

LLM usage decreasing as the index learns

Run           LLM items        LLM calls       Index size
---------------------------------------------------------
#1                   98                2              168
#2                   48                1              216
#3                   15                1              231

Run 3 performance

Metric	Hybrid+LLM	Post-feedback (no LLM)
Accuracy	90.0%	84.0%
Macro F1	0.8972	0.8366
Throughput	49.9 items/s	471.7 items/s
Embedding hits	85/100	100/100
LLM fallback items	15	0

By run 3, LLM usage had dropped by about 85% (98 -> 15 items), and post-feedback (embedding-only) throughput was roughly 9.5x that of hybrid mode (471.7 vs. 49.9 items/s). The index grew from 70 vectors to 231 as LLM results fed back.
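
The summary figures can be re-derived from the tables above. The short calculation below uses only the numbers reported there; the observation that the index grows by one vector per LLM-classified item is inferred from those numbers, not a documented guarantee.

```python
# Figures copied from the benchmark tables above.
llm_items = {1: 98, 2: 48, 3: 15}
index_size = {1: 168, 2: 216, 3: 231}
throughput = {"hybrid": 49.9, "post_feedback": 471.7}  # items/s, run 3

reduction = 1 - llm_items[3] / llm_items[1]
speedup = throughput["post_feedback"] / throughput["hybrid"]
initial_index = index_size[1] - llm_items[1]

print(f"LLM item reduction, run 1 -> run 3: {reduction:.1%}")  # 84.7%
print(f"Throughput ratio: {speedup:.2f}x")                      # 9.45x
print(f"Index size before run 1: {initial_index}")              # 70

# In these tables the index grows by exactly the number of LLM items per run.
growth_matches = all(
    index_size[run] - index_size[run - 1] == llm_items[run] for run in (2, 3)
)
print(growth_matches)  # True
```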

Use Case Examples

Banking Transaction Classification

from adaptive_classifier import AdaptiveClassifier, create_normalizer

envelopes = {
    "Housing": ["Rent", "Mortgage", "Property Tax", "Home Insurance", "Maintenance"],
    "Transportation": ["Gas", "Car Payment", "Insurance", "Parking", "Transit"],
    "Food": ["Groceries", "Restaurants", "Coffee Shops", "Fast Food"],
    "Utilities": ["Electric", "Gas Utility", "Water", "Internet", "Phone"],
    "Health": ["Doctor", "Dentist", "Pharmacy", "Gym"],
    "Entertainment": ["Streaming", "Movies", "Games", "Books"],
    "Savings": ["Emergency Fund", "Retirement", "Investment"],
}

classifier = AdaptiveClassifier(
    taxonomy=envelopes,
    provider="anthropic",
    index_path="./budget_classifier",
    normalizer=create_normalizer(
        abbreviations={"wal-mart": "walmart grocery", "amzn": "amazon"},
        strip_codes=True,
    ),
)

transactions = [
    "WALMART SUPERCENTER #4532",
    "SHELL OIL 57442",
    "NETFLIX.COM",
    "CITY OF CALGARY UTILITIES",
    "TIM HORTONS #0891",
    "PHARMACHOICE #112",
]

results = classifier.classify(transactions)
for r in results:
    print(f"{r.input_text:35s} -> {r.leaf_label}")
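
To make the normalization step more concrete, here is a rough sketch of what abbreviation expansion and code stripping could look like. It is an illustration under stated assumptions and not the behaviour of `create_normalizer`: `make_toy_normalizer` is a hypothetical helper, and the rule "drop runs of three or more digits, with an optional leading #" is an assumed definition of a code.

```python
import re


def make_toy_normalizer(abbreviations=None, strip_codes=False):
    """Build a normalize(text) function (hypothetical; not the library's normalizer)."""
    table = {k.lower(): v for k, v in (abbreviations or {}).items()}
    # Longest abbreviations first, so "sw pot" is tried before any shorter overlapping key.
    patterns = [
        (re.compile(rf"(?<!\w){re.escape(abbr)}(?!\w)"), table[abbr])
        for abbr in sorted(table, key=len, reverse=True)
    ]

    def normalize(text):
        text = text.lower()
        if strip_codes:
            # Assumed rule: remove store or terminal numbers such as "#4532" or "57442".
            text = re.sub(r"#?\b\d{3,}\b", " ", text)
        for pattern, expansion in patterns:
            # A function replacement avoids treating backslashes in the expansion specially.
            text = pattern.sub(lambda _m, e=expansion: e, text)
        return " ".join(text.split())  # collapse repeated whitespace

    return normalize


normalize = make_toy_normalizer(
    abbreviations={"chz": "cheese", "brgr": "burger", "dbl": "double", "sw pot": "sweet potato"},
    strip_codes=True,
)
print(normalize("dbl chz brgr"))               # double cheese burger
print(normalize("WALMART SUPERCENTER #4532"))  # walmart supercenter
```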

Property Valuation CRN Lookup

from adaptive_classifier import AdaptiveClassifier, Taxonomy

crn_taxonomy = Taxonomy.from_flat([
    "Furniture > Seating > Office Chair",
    "Furniture > Seating > Dining Chair",
    "Furniture > Storage > Bookshelf",
    "Furniture > Storage > Filing Cabinet",
    "Furniture > Tables > Desk",
    "Furniture > Tables > Dining Table",
    "Electronics > Computing > Desktop Computer",
    "Electronics > Computing > Laptop",
    "Electronics > Audio Visual > Television",
    "Electronics > Audio Visual > Projector",
    "Appliances > Kitchen > Refrigerator",
    "Appliances > Kitchen > Dishwasher",
    "Appliances > Laundry > Washing Machine",
])

classifier = AdaptiveClassifier(
    taxonomy=crn_taxonomy,
    provider="vertex",
    index_path="./crn_classifier",
    provider_kwargs={"project_id": "my-gcp-project", "region": "us-east5"},
)

items = [
    "oak bkshf 5-shelf",
    "Herman Miller Aeron",
    "Samsung 65in QLED",
    "ikea kallax",
    "dell latitude 5540",
]

results = classifier.classify(items)

POS Product Hierarchy

from adaptive_classifier import AdaptiveClassifier, create_normalizer

menu = {
    "Burgers": {
        "Beef": ["Hamburger", "Cheeseburger", "Bacon Burger", "Double Burger"],
        "Chicken": ["Chicken Burger", "Spicy Chicken", "Grilled Chicken"],
        "Plant": ["Veggie Burger", "Beyond Burger"],
    },
    "Sides": {
        "Fries": ["Regular Fries", "Sweet Potato Fries", "Poutine"],
        "Salads": ["Garden Salad", "Caesar Salad", "Coleslaw"],
    },
    "Drinks": {
        "Hot": ["Coffee", "Tea", "Hot Chocolate"],
        "Cold": ["Soda", "Iced Tea", "Milkshake", "Water"],
    },
}

classifier = AdaptiveClassifier(
    taxonomy=menu,
    provider="bedrock",
    index_path="./pos_classifier",
    normalizer=create_normalizer(
        abbreviations={
            "chz": "cheese", "brgr": "burger", "dbl": "double",
            "reg": "regular", "sw pot": "sweet potato",
        }
    ),
)

pos_entries = [
    "dbl chz brgr",
    "reg fry",
    "lg coff blk",
    "spcy chkn sndwch",
    "grdn salad",
    "sw pot fry",
]

results = classifier.classify(pos_entries)

Pluggable LLM Backend

# Any callable works
def my_custom_llm(items, system_prompt, user_prompt):
    # Call OpenAI, a local model, or any other backend.
    # `my_api` is a placeholder for your own client object.
    response = my_api.complete(system=system_prompt, user=user_prompt)
    return response.text  # Must return a JSON string

classifier = AdaptiveClassifier(
    taxonomy=my_taxonomy,
    provider=my_custom_llm,
)

# Or implement the LLMProvider protocol directly
from adaptive_classifier import LLMProvider

class MyProvider:
    def classify_batch(self, items, taxonomy_prompt, batch_size=50):
        # Your implementation
        return [{"input": item, "category": "..."} for item in items]
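
For offline tests it can help to have a provider that never calls an API. The sketch below follows the `classify_batch` signature and return shape shown above; `KeywordProvider`, its keyword rules, and the "Unknown" default are hypothetical, and how the library handles a category outside the taxonomy is not covered here.

```python
class KeywordProvider:
    """Offline stand-in for an LLM provider: substring rules instead of a model call."""

    def __init__(self, keyword_to_category, default="Unknown"):
        self.rules = {k.lower(): v for k, v in keyword_to_category.items()}
        self.default = default

    def classify_batch(self, items, taxonomy_prompt, batch_size=50):
        # taxonomy_prompt is accepted for signature compatibility but unused here.
        results = []
        for start in range(0, len(items), batch_size):
            for item in items[start:start + batch_size]:
                lowered = item.lower()
                # The first rule whose keyword appears in the item wins.
                category = next(
                    (cat for kw, cat in self.rules.items() if kw in lowered),
                    self.default,
                )
                results.append({"input": item, "category": category})
        return results


provider = KeywordProvider({"netflix": "Entertainment > Streaming"})
print(provider.classify_batch(["NETFLIX.COM", "???"], taxonomy_prompt=""))
```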

Pluggable Vector Store

FAISS is the default, but you can swap in any vector backend by implementing the VectorStore protocol:

from adaptive_classifier import AdaptiveClassifier, VectorStore
import numpy as np
from pathlib import Path

class MyVectorStore:
    """Drop-in replacement - e.g. Pinecone, Qdrant, Annoy, etc."""

    @property
    def size(self) -> int: ...
    def add(self, vectors: np.ndarray) -> None: ...
    def search(self, queries: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]: ...
    def reset(self) -> None: ...
    def save(self, path: Path) -> None: ...
    def load(self, path: Path) -> None: ...

classifier = AdaptiveClassifier(
    taxonomy=my_taxonomy,
    vector_store=MyVectorStore(),
)
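
As a concrete instance of the method list above, here is a brute-force in-memory store built on NumPy. It is a sketch under assumptions: the protocol stubs do not say what `search` returns, so this follows the FAISS convention of a `(scores, indices)` pair with `-1` marking missing neighbours, and it scores by inner product, which equals cosine similarity only for unit-length vectors. `InMemoryVectorStore` is a hypothetical name.

```python
from pathlib import Path

import numpy as np


class InMemoryVectorStore:
    """Brute-force vector store exposing the same method names as the stubs above."""

    def __init__(self):
        self._vectors = None  # becomes an (n, dim) float32 array after the first add()

    @property
    def size(self) -> int:
        return 0 if self._vectors is None else int(self._vectors.shape[0])

    def add(self, vectors: np.ndarray) -> None:
        vectors = np.atleast_2d(np.asarray(vectors, dtype=np.float32))
        self._vectors = vectors if self._vectors is None else np.vstack([self._vectors, vectors])

    def search(self, queries: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
        queries = np.atleast_2d(np.asarray(queries, dtype=np.float32))
        n = queries.shape[0]
        scores = np.full((n, k), -np.inf, dtype=np.float32)
        indices = np.full((n, k), -1, dtype=np.int64)  # -1 marks "no neighbour"
        if self.size == 0:
            return scores, indices
        sims = queries @ self._vectors.T  # inner-product similarity, shape (n, size)
        top = min(k, self.size)
        order = np.argsort(-sims, axis=1, kind="stable")[:, :top]
        scores[:, :top] = np.take_along_axis(sims, order, axis=1)
        indices[:, :top] = order
        return scores, indices

    def reset(self) -> None:
        self._vectors = None

    def save(self, path: Path) -> None:
        data = self._vectors if self._vectors is not None else np.zeros((0, 0), dtype=np.float32)
        with open(path, "wb") as f:  # a file handle keeps NumPy from appending ".npy"
            np.save(f, data)

    def load(self, path: Path) -> None:
        with open(path, "rb") as f:
            data = np.load(f)
        self._vectors = data if data.shape[0] else None
```

A brute-force scan like this is only practical for small indexes; a hosted or approximate-nearest-neighbour backend would implement the same methods with its own client calls.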

Pre-seeding with Known Mappings

# If you already have labeled data, seed the index directly
classifier.add_examples({
    "WALMART SUPERCENTER": "Food > Groceries",
    "COSTCO WHOLESALE": "Food > Groceries",
    "NETFLIX.COM": "Entertainment > Streaming",
    "SPOTIFY": "Entertainment > Streaming",
})
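
If the labeled data lives in a CSV file, it can be turned into the text-to-path dict shown above with the standard library. This is a small sketch; `load_examples` and the column names `description` and `category` are hypothetical and should be adapted to your file.

```python
import csv
import io


def load_examples(csv_file, text_column="description", label_column="category"):
    """Read rows from an open CSV file into a {text: category_path} dict."""
    examples = {}
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").strip()
        label = (row.get(label_column) or "").strip()
        if text and label:  # skip rows with a missing text or label
            examples[text] = label  # a later duplicate overwrites an earlier one
    return examples


# In-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO(
    "description,category\n"
    "WALMART SUPERCENTER,Food > Groceries\n"
    ",Food > Groceries\n"
    "SPOTIFY,Entertainment > Streaming\n"
)
examples = load_examples(sample)
print(examples)
```

The resulting dict has the same shape as the literal passed to `classifier.add_examples(...)` above.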

Configuration

Parameter	Default	Description
`confidence_threshold`	`0.65`	Below this -> LLM fallback
`k_neighbors`	`5`	Neighbors for majority voting
`llm_batch_size`	`50`	Items per LLM API call
`auto_feedback`	`True`	Feed LLM results back to index
`auto_save`	`True`	Save index after each classify()
`embedding_model`	`all-MiniLM-L6-v2`	Sentence transformer model
`vector_store`	`None` (FAISS)	Custom `VectorStore` backend
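
To illustrate how `k_neighbors` and `confidence_threshold` could interact, here is one possible voting scheme. The confidence formula (summed similarity of the winning label divided by k) is an assumption made for this sketch; the source does not document the library's actual scoring, and `vote` is a hypothetical function.

```python
from collections import defaultdict


def vote(neighbors, confidence_threshold=0.65):
    """neighbors: list of (label, similarity) pairs for the k nearest index entries.

    Returns (label, confidence, needs_llm).
    """
    if not neighbors:
        return None, 0.0, True
    totals = defaultdict(float)
    for label, similarity in neighbors:
        totals[label] += max(similarity, 0.0)  # ignore negative similarities
    best = max(totals, key=totals.get)
    # Assumed confidence: the winning label's similarity mass, averaged over all k neighbours.
    confidence = totals[best] / len(neighbors)
    return best, confidence, confidence < confidence_threshold


# Four close neighbours agree: confidence is about 0.72, so no LLM call is needed.
print(vote([("Coffee", 0.9)] * 4 + [("Tea", 0.8)]))
# A 3-2 split at moderate similarity: confidence is about 0.42, so the item goes to the LLM.
print(vote([("Coffee", 0.7)] * 3 + [("Tea", 0.7)] * 2))
```

Under this scheme, raising `confidence_threshold` sends more items to the LLM, while a larger `k_neighbors` requires broader agreement among neighbours before an item is accepted from the index.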

Taxonomy Formats

# Nested dict
taxonomy = {"Category": {"Subcategory": ["Leaf1", "Leaf2"]}}

# Flat path list
taxonomy = ["Category > Subcategory > Leaf1", "Category > Subcategory > Leaf2"]

# From file
classifier = AdaptiveClassifier(taxonomy="./taxonomy.json")
classifier = AdaptiveClassifier(taxonomy="./taxonomy.yaml")
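
The nested-dict and flat-path formats carry the same information. The helper below is a sketch showing the correspondence by flattening a nested taxonomy into " > "-separated paths; `flatten_taxonomy` is a hypothetical function and not part of the package.

```python
def flatten_taxonomy(node, prefix=()):
    """Convert a nested dict/list taxonomy into a list of 'A > B > Leaf' paths."""
    paths = []
    if isinstance(node, dict):
        for name, child in node.items():
            paths.extend(flatten_taxonomy(child, prefix + (name,)))
    else:
        # A list of leaf labels under the current prefix.
        for leaf in node:
            paths.append(" > ".join(prefix + (leaf,)))
    return paths


nested = {"Category": {"Subcategory": ["Leaf1", "Leaf2"]}}
print(flatten_taxonomy(nested))
# ['Category > Subcategory > Leaf1', 'Category > Subcategory > Leaf2']
```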

Architecture

adaptive_classifier/
├── classifier.py      # AdaptiveClassifier orchestrator
├── taxonomy.py        # Taxonomy tree management
├── index.py           # Vector index + persistence + feedback
├── vector_stores.py   # Pluggable vector store backends (FAISS default)
├── embeddings.py      # Embedding provider abstraction
├── providers.py       # Pluggable LLM backends
├── normalizer.py      # Text normalization / abbreviation expansion
└── types.py           # Classification, BatchStats, etc.

Development

Setup

# Clone the repo
git clone https://github.com/johncarpenter/adaptive-simple-text-classifier.git
cd adaptive-simple-text-classifier

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e ".[dev]"

Testing

# Run unit tests
uv run pytest -v

# Run the benchmark (requires ANTHROPIC_API_KEY for hybrid mode)
uv run python example/benchmark.py --embedding-only  # no API key needed
uv run python example/benchmark.py                   # full hybrid benchmark
uv run python example/benchmark.py --runs 3           # watch the learning curve

See example/README.md for benchmark details and dataset setup.

Linting

uv run ruff check .
uv run ruff format .

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Releasing

Releases are published to PyPI automatically when a GitHub release is created.

To create a new release:

1. Update the version in pyproject.toml and adaptive_classifier/__init__.py
2. Commit: git commit -am "Bump version to X.Y.Z"
3. Tag: git tag vX.Y.Z
4. Push: git push origin main --tags
5. Create a GitHub release from the tag

The publish workflow will build and upload to PyPI using trusted publishing.

License

MIT - see LICENSE for details.

Project details

Release history

0.2.0 - Mar 19, 2026
0.1.2 - Mar 8, 2026
0.1.1 - Mar 8, 2026 (this version)
0.1.0 - Mar 7, 2026

Download files

Source Distribution

adaptive_simple_text_classifier-0.1.1.tar.gz (33.4 kB)

Uploaded Mar 8, 2026 Source

Built Distribution

adaptive_simple_text_classifier-0.1.1-py3-none-any.whl (29.7 kB)

Uploaded Mar 8, 2026 Python 3

File details

Details for the file adaptive_simple_text_classifier-0.1.1.tar.gz.

File metadata

Download URL: adaptive_simple_text_classifier-0.1.1.tar.gz
Upload date: Mar 8, 2026
Size: 33.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for adaptive_simple_text_classifier-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`144eca10b4972e2f446ae23a85f685a62149aab1c955b7f45aab4a2c06109525`
MD5	`85d10995e2b6e18ad821e1cc68fa5a1c`
BLAKE2b-256	`6720c045db29a028976132b6ea766df0bb1424995a9406c8d26ccc3d993bd1bc`

Provenance

The following attestation bundles were made for adaptive_simple_text_classifier-0.1.1.tar.gz:

Publisher: publish.yml on johncarpenter/adaptive-simple-text-classifier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: adaptive_simple_text_classifier-0.1.1.tar.gz
- Subject digest: 144eca10b4972e2f446ae23a85f685a62149aab1c955b7f45aab4a2c06109525
- Sigstore transparency entry: 1061027707
- Sigstore integration time: Mar 8, 2026
Source repository:
- Permalink: johncarpenter/adaptive-simple-text-classifier@f292e4ff487494466db448186a34b76e359924f4
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/johncarpenter
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f292e4ff487494466db448186a34b76e359924f4
- Trigger Event: release

File details

Details for the file adaptive_simple_text_classifier-0.1.1-py3-none-any.whl.

File metadata

Download URL: adaptive_simple_text_classifier-0.1.1-py3-none-any.whl
Upload date: Mar 8, 2026
Size: 29.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for adaptive_simple_text_classifier-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ab995e24a872cddb0a5f5daeeb001b9ff9d64d37f630248a0ea1b50f312fb2af`
MD5	`858903b190049c8e11667427b73ba4f0`
BLAKE2b-256	`4c58c60716526c9c8a6adb6f04fa18ce0343627d93a0a38b5928cf08fe0ad671`

Provenance

The following attestation bundles were made for adaptive_simple_text_classifier-0.1.1-py3-none-any.whl:

Publisher: publish.yml on johncarpenter/adaptive-simple-text-classifier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: adaptive_simple_text_classifier-0.1.1-py3-none-any.whl
- Subject digest: ab995e24a872cddb0a5f5daeeb001b9ff9d64d37f630248a0ea1b50f312fb2af
- Sigstore transparency entry: 1061027757
- Sigstore integration time: Mar 8, 2026
Source repository:
- Permalink: johncarpenter/adaptive-simple-text-classifier@f292e4ff487494466db448186a34b76e359924f4
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/johncarpenter
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f292e4ff487494466db448186a34b76e359924f4
- Trigger Event: release
