Classivore
Open-source, taxonomy-agnostic text classification pipeline. Give it any hierarchical taxonomy CSV and it builds a production multi-label classifier — from data collection through training to inference.
What It Does
Classivore automates the full pipeline for building a text classifier on a custom taxonomy:
- Enrich your taxonomy with LLM-generated descriptions, boundaries, aliases, and difficulty ratings
- Collect training data from the web using search APIs (Brave, Serper) and Common Crawl
- Label collected pages using hierarchical LLM classification via the Anthropic Batch API
- Train a DeBERTa-v3 classifier with focal loss, per-category thresholds, and quality reporting
- Publish the trained model to HuggingFace Hub for serving
The entire pipeline is driven from the command line. Each stage is resumable — interrupt and restart without losing progress.
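As a rough illustration of what "resumable" can mean for a stage, the sketch below records finished item IDs in a JSON checkpoint and skips them on restart. The function name `run_resumable` and the checkpoint file layout are assumptions made for this example, not Classivore's actual state format.

```python
import json
import tempfile
from pathlib import Path


def run_resumable(items, process, checkpoint_path):
    """Process each item once, recording finished IDs so a restart skips them."""
    path = Path(checkpoint_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    for item in items:
        if item in done:
            continue  # already finished in an earlier run
        process(item)
        done.add(item)
        # Persist after every item: after an interrupt, only the item
        # that was in flight has to be redone.
        path.write_text(json.dumps(sorted(done)))
    return done


# Hypothetical demo: the second run finds nothing left to do.
checkpoint = Path(tempfile.mkdtemp()) / "stage_state.json"
seen = []
run_resumable(["a", "b", "c"], seen.append, checkpoint)
run_resumable(["a", "b", "c"], seen.append, checkpoint)
```

Writing the checkpoint after every item trades some I/O for the smallest possible amount of repeated work after a crash.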
Quick Start
# Install
git clone https://github.com/NotYoCheese/classivore.git
cd classivore
python -m venv venv && source venv/bin/activate
pip install -e .
# Initialize a new taxonomy
classivore init --csv your_taxonomy.csv --name "My Taxonomy" --version "1.0" --slug my-tax
# Or use an existing taxonomy and run the pipeline step by step
classivore enrich --taxonomy my-tax
classivore collect --taxonomy my-tax
classivore label --taxonomy my-tax
classivore train --taxonomy my-tax
# Run inference
classivore classify --text "Article about machine learning trends..."
classivore classify --file articles.json --output predictions.json
classivore classify --interactive
Pipeline Stages
classivore init
Onboard a new taxonomy from a CSV file. Validates the CSV structure, generates a config.yaml with sensible defaults, optionally runs LLM enrichment and domain hint generation, and prints an onboarding report with collection cost estimates.
classivore enrich
Generate descriptions, boundaries, aliases, and difficulty ratings for each taxonomy category using the Anthropic Batch API. These fields improve search query quality and help the labeling stage make better decisions.
classivore collect
Discover and scrape web pages for training data. Uses search APIs (Brave, Serper, and optionally Exa) with automatic fallback, the Common Crawl CDX index for historical pages, and content quality filters. Collection targets are tiered by category difficulty: hard categories get more pages to compensate for scarce editorial content.
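The tiering idea can be sketched in a few lines. The difficulty names and page counts below are made-up placeholders; the real targets come from each taxonomy's config.yaml.

```python
# Hypothetical per-difficulty page targets (placeholders, not Classivore defaults).
TARGETS_BY_DIFFICULTY = {"easy": 30, "medium": 50, "hard": 80}


def pages_needed(categories, collected):
    """Return how many more pages each category needs to reach its target."""
    needed = {}
    for name, difficulty in categories.items():
        gap = TARGETS_BY_DIFFICULTY[difficulty] - collected.get(name, 0)
        if gap > 0:
            needed[name] = gap
    return needed


# Illustrative data: one easy category at target, one hard category short.
categories = {"Sports": "easy", "Tax Law": "hard"}
collected = {"Sports": 30, "Tax Law": 25}
remaining = pages_needed(categories, collected)
```

Here `remaining` contains only the hard category, which still needs 55 pages.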
classivore label
Classify collected pages using a two-stage hierarchical LLM approach:
- Stage 1: Tier-1 triage identifies which top-level categories apply (cheap, broad pass)
- Stage 2: Subtree classification within selected tier-1s (detailed, with chain-of-thought)
Uses the Anthropic Batch API for 50% cost reduction. Crash-recoverable — pages at each stage are checkpointed.
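The two-stage flow can be outlined as follows. The keyword matchers stand in for the LLM calls so the sketch runs offline; `label_page`, `keyword_triage`, and `keyword_subtree` are hypothetical names, and the real pipeline submits these requests through the Batch API rather than calling them inline.

```python
def label_page(text, taxonomy, triage, classify_subtree):
    """Stage 1 picks tier-1 categories; stage 2 classifies inside each one."""
    labels = []
    for tier1 in triage(text, list(taxonomy)):
        labels.extend(classify_subtree(text, tier1, taxonomy[tier1]))
    return labels


def keyword_triage(text, tier1_names):
    # Stand-in for the cheap, broad LLM pass (assumption for this sketch).
    return [name for name in tier1_names if name.lower() in text.lower()]


def keyword_subtree(text, tier1, children):
    # Stand-in for the detailed LLM pass over one tier-1 subtree.
    return [f"{tier1} > {child}" for child in children if child.lower() in text.lower()]


# Illustrative two-level taxonomy.
taxonomy = {
    "Sports": ["Soccer", "Tennis"],
    "Technology": ["Artificial Intelligence", "Robotics"],
}
labels = label_page(
    "Technology news: robotics startups raise funding",
    taxonomy,
    keyword_triage,
    keyword_subtree,
)
```

Because stage 2 only runs for subtrees selected in stage 1, the detailed prompt never has to include the whole taxonomy.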
classivore train
Fine-tune DeBERTa-v3-large for multi-label classification. Features:
- Weighted focal loss for extreme class imbalance
- Confidence-weighted training (legacy labels discounted)
- Per-category threshold optimization (+5% macro F1 compared with a single global threshold)
- Comprehensive quality report with per-category metrics, confusion pairs, and overfitting detection
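To make two of these features concrete, here is a dependency-free sketch. It assumes the standard alpha-balanced binary focal loss and a simple grid search for the F1-maximizing threshold of one category; the exact weighting scheme and search procedure Classivore uses are not documented here, so treat both functions as illustrations.

```python
import math


def focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one label: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-12)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def f1_at(probs, targets, threshold):
    """F1 for one category when predicting positive at prob >= threshold."""
    tp = sum(1 for p, t in zip(probs, targets) if p >= threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, targets) if p >= threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, targets) if p < threshold and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(probs, targets,
                   candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the candidate threshold with the highest F1 on validation data."""
    return max(candidates, key=lambda th: f1_at(probs, targets, th))


# Illustrative validation scores for a single category.
probs = [0.9, 0.35, 0.25, 0.1]
targets = [1, 1, 0, 0]
best = best_threshold(probs, targets)
```

With the default 0.5 cutoff the second positive (score 0.35) would be missed; the search settles on 0.3 for this category. The focal term `(1 - p_t)**gamma` shrinks the loss for confident, correct predictions, so rare categories contribute relatively more to the gradient.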
classivore classify
Run inference using a trained model. Supports single text, batch JSON/NDJSON, and interactive mode. Long documents are automatically chunked with a sliding window. Auto-discovers the most recent trained model.
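A sliding window over tokens might look like the sketch below. The window and stride sizes, and the choice to merge chunk scores by taking the per-category maximum, are assumptions for illustration; the README does not say how Classivore aggregates chunks.

```python
def sliding_window(tokens, window=512, stride=256):
    """Split tokens into overlapping chunks of at most `window` tokens."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


def merge_chunk_scores(chunk_scores):
    """Combine per-chunk scores by keeping each category's maximum (assumed rule)."""
    merged = {}
    for scores in chunk_scores:
        for category, score in scores.items():
            merged[category] = max(score, merged.get(category, 0.0))
    return merged


# Small illustrative values so the overlap is easy to see.
chunks = sliding_window(list(range(10)), window=4, stride=2)
merged = merge_chunk_scores([{"A": 0.2, "B": 0.7}, {"A": 0.9}])
```

The overlap between consecutive chunks keeps a sentence that straddles a boundary fully visible in at least one chunk.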
classivore agent
Automated collect-label-evaluate loop. Analyzes coverage gaps, collects pages for the weakest categories, labels them, and repeats until targets are met or budget is exhausted.
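The loop can be pictured roughly like this. Measuring the budget in pages, the `run_agent` name, and the callback signature are all assumptions made for the sketch; the real agent may count queries or API spend instead.

```python
def run_agent(coverage, target, budget, collect_and_label, batch_size=10):
    """Keep topping up the weakest category until targets or budget are hit."""
    spent = 0
    while spent + batch_size <= budget:
        gaps = {name: target - count for name, count in coverage.items() if count < target}
        if not gaps:
            break  # every category has reached its target
        weakest = max(gaps, key=gaps.get)
        # collect_and_label returns how many labeled pages it actually added.
        coverage[weakest] += collect_and_label(weakest, batch_size)
        spent += batch_size
    return coverage, spent


# Illustrative run: the stub always delivers a full batch.
coverage, spent = run_agent(
    coverage={"A": 20, "B": 5},
    target=20,
    budget=40,
    collect_and_label=lambda name, n: n,
)
```

In this run the agent spends two batches on category B and then stops because no gaps remain, leaving half of the budget unused.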
classivore publish
Push a trained model to a private HuggingFace Hub repo with version tagging. The published artifact is self-contained — includes model weights, tokenizer, thresholds, label mappings, and taxonomy metadata (paths, IDs). No taxonomy CSV needed at serve time.
Other Commands
| Command | Description |
|---|---|
| classivore taxonomy | Show taxonomy stats, coverage gaps, and exclusions |
| classivore validate | Run data quality checks via label-lens |
| classivore hints | Generate domain hints for tier-1 categories |
| classivore hf init | Create a private HuggingFace repo |
Taxonomy Configuration
Each taxonomy lives in taxonomies/<slug>/ with a config.yaml that controls everything: collection targets by difficulty, query budgets, LLM models, filter relaxations, domain hints, and category exclusions. See taxonomies/ for examples.
Self-Contained Inference
The Classifier class at classivore.inference.Classifier is designed for production use. It imports no other classivore modules and depends only on torch, transformers, numpy, and the standard-library json module. Load a model directory and get predictions:
from classivore.inference import Classifier
classifier = Classifier("models/my-tax/20260408_162922")
results = classifier.predict("Article text here...")
# [{"name": "Category", "id": "42", "path": ["Parent", "Category"], "confidence": 0.93}]
The companion classivore-api repo uses this for serving.
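Downstream code usually wants a compact view of the prediction list. The helper below works on the result shape shown in the comment above; `top_labels` is a hypothetical name, the second sample row is invented for the example, and the extra confidence cutoff is optional since the model already applies its own per-category thresholds.

```python
def top_labels(results, min_confidence=0.5):
    """Return (path string, confidence) pairs, highest confidence first."""
    kept = [r for r in results if r["confidence"] >= min_confidence]
    kept.sort(key=lambda r: r["confidence"], reverse=True)
    return [(" > ".join(r["path"]), r["confidence"]) for r in kept]


# Sample data in the documented shape; the second entry is made up.
results = [
    {"name": "Other", "id": "7", "path": ["Parent", "Other"], "confidence": 0.41},
    {"name": "Category", "id": "42", "path": ["Parent", "Category"], "confidence": 0.93},
]
summary = top_labels(results)
```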
Project Structure
src/classivore/
cli/ Command-line interface
config/ Settings loader and defaults
taxonomy/ CSV loader, enricher, onboarding
collection/ Search, scraping, filters, state
labeling/ Two-stage LLM labeling pipeline
training/ DeBERTa trainer, focal loss, thresholds, evaluation
inference/ Self-contained Classifier for production inference
publishing/ HuggingFace Hub publishing
agent/ Automated collect-label-evaluate loop
validation/ Data quality checks
API Keys & External Services
Create a .env file in the project root and add the keys you need:
ANTHROPIC_API_KEY=...
BRAVE_API_KEY=...
SERPER_API_KEY=...
EXA_API_KEY=...
HUGGINGFACE_TOKEN=...
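A small preflight check can catch a missing key before a long run starts. The mapping below is read off the "Used by" notes in the service sections that follow; `missing_keys` is a hypothetical helper, not part of the Classivore CLI, and it ignores the `--token` flag that publish also accepts.

```python
import os

# Each entry lists groups of interchangeable variable names.
KEYS_BY_COMMAND = {
    "enrich": [("ANTHROPIC_API_KEY", "CLASSIVORE_API_KEY")],
    "label": [("ANTHROPIC_API_KEY", "CLASSIVORE_API_KEY")],
    "hints": [("ANTHROPIC_API_KEY", "CLASSIVORE_API_KEY")],
    "publish": [("HUGGINGFACE_TOKEN",)],
}


def missing_keys(command, env=os.environ):
    """Return a description of each required key group that is unset."""
    missing = []
    for alternatives in KEYS_BY_COMMAND.get(command, []):
        if not any(env.get(name) for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing


# Example with explicit dictionaries instead of the real environment.
ok = missing_keys("label", {"CLASSIVORE_API_KEY": "example"})
absent = missing_keys("publish", {})
```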
Anthropic API — Required for enrichment and labeling
Used by: classivore enrich, classivore label, classivore hints, classivore collect (LLM query generation)
Get a key at console.anthropic.com. Set ANTHROPIC_API_KEY (or CLASSIVORE_API_KEY if you want to use a separate key).
Enrichment and labeling are the most API-intensive stages. Both use the Batch API for a 50% cost reduction. A rough estimate for the IAB 2.2 taxonomy (~700 categories, ~30K pages): enrichment ~$1–2, labeling ~$15–25 depending on model choice.
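For a first budget guess on a different corpus size, the labeling figure can be scaled by page count. This assumes cost grows linearly with the number of pages, which is only an approximation (taxonomy size and model choice also matter); the function name is hypothetical.

```python
def scale_labeling_cost(pages, ref_pages=30_000, ref_cost=(15.0, 25.0)):
    """Scale the README's ~$15-25 per ~30K pages estimate to another page count."""
    low, high = ref_cost
    return (low * pages / ref_pages, high * pages / ref_pages)


# Rough range for a 10K-page collection under the linear assumption.
low, high = scale_labeling_cost(10_000)
```

For 10K pages this gives roughly $5 to $8.33.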
Brave Search — Optional, recommended for collection
Used by: classivore collect
Get a key at api.search.brave.com. Set BRAVE_API_KEY. The free plan includes 2,000 queries/month.
Brave is the first provider tried for keyword search. Without at least one search provider, collection cannot discover new URLs.
Serper — Optional, Brave fallback
Used by: classivore collect
Get a key at serper.dev. Set SERPER_API_KEY. Returns Google results. Used automatically when Brave's quota is exhausted.
Exa AI — Optional, semantic search and scrape fallback
Used by: classivore collect
Get a key at dashboard.exa.ai. Set EXA_API_KEY.
Exa serves two roles in the collection pipeline:
- Neural search fallback: when Brave and Serper are both exhausted, Exa's semantic search finds relevant pages for hard categories where keyword queries underperform.
- Scrape fallback: when live scraping fails (WAF blocks, 403s), Exa's /contents endpoint retrieves the page through its own infrastructure. Pages fetched this way bypass site-level blocks entirely.
Results from Exa include full page text, so pages retrieved via Exa skip the scraping step.
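The provider order described in these sections (Brave, then Serper, then Exa) amounts to a fallback chain. The sketch below uses stub functions instead of real API clients; `QuotaExhausted` and `search_with_fallback` are hypothetical names, and real code would also need to handle network errors and rate limits.

```python
class QuotaExhausted(Exception):
    """Raised by a provider stub when its query quota is used up."""


def search_with_fallback(query, providers):
    """Try each (name, search_fn) pair in order until one returns results."""
    for name, search in providers:
        try:
            return name, search(query)
        except QuotaExhausted:
            continue  # move on to the next provider
    raise RuntimeError("all search providers are exhausted")


def brave_stub(query):
    raise QuotaExhausted("monthly quota reached")


def serper_stub(query):
    return ["https://example.com/page"]


used, urls = search_with_fallback(
    "tax law explained",
    [("brave", brave_stub), ("serper", serper_stub)],
)
```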
HuggingFace Hub — Required for publishing
Used by: classivore publish, classivore hf init
Get a write-access token at huggingface.co/settings/tokens. Set HUGGINGFACE_TOKEN, or pass --token directly to the publish command.
Common Crawl — No key required
Used by: classivore collect
Classivore queries the Common Crawl CDX index for historical page snapshots before attempting live scrapes. No API key is needed. The crawl ID is configured per taxonomy in config.yaml (commoncrawl_crawl_id). Set it to null to disable Common Crawl lookups.
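For orientation, a lookup against the public Common Crawl index server is a plain HTTP GET. The sketch below only builds the query URL; the helper name is hypothetical, `CC-MAIN-2024-10` is just an example crawl ID, and how Classivore itself issues and parses these requests is not shown here.

```python
from urllib.parse import urlencode


def cdx_query_url(crawl_id, page_url, limit=5):
    """Build a CDX index query URL, or None when Common Crawl is disabled."""
    if crawl_id is None:
        return None  # mirrors commoncrawl_crawl_id: null
    params = urlencode({"url": page_url, "output": "json", "limit": limit})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


url = cdx_query_url("CC-MAIN-2024-10", "example.com/article")
```

Each line of the JSON response describes one capture, including the WARC file, offset, and length needed to fetch the archived page.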
Requirements
- Python >= 3.11
- A GPU is recommended for training (on an RTX 4090, about 45 minutes for 30K pages)
License
MIT