Taxonomy-agnostic text classification pipeline

Classivore

Open-source, taxonomy-agnostic text classification pipeline. Give it any hierarchical taxonomy CSV and it builds a production multi-label classifier — from data collection through training to inference.

What It Does

Classivore automates the full pipeline for building a text classifier on a custom taxonomy:

1. Enrich your taxonomy with LLM-generated descriptions, boundaries, aliases, and difficulty ratings
2. Collect training data from the web using search APIs (Brave, Serper) and Common Crawl
3. Label collected pages using hierarchical LLM classification via the Anthropic Batch API
4. Train a DeBERTa-v3 classifier with focal loss, per-category thresholds, and quality reporting
5. Publish the trained model to HuggingFace Hub for serving

The entire pipeline is driven from the command line. Each stage is resumable — interrupt and restart without losing progress.
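One common way to make a stage resumable is to record each finished item in an append-only checkpoint file and skip recorded items on restart. The sketch below illustrates that idea only. The `run_stage` helper, the checkpoint file name, and the JSON-lines layout are assumptions made for illustration, not Classivore's actual implementation.

```python
import json
import tempfile
from pathlib import Path


def run_stage(items, process, checkpoint_path):
    """Process items once each, skipping any already recorded in the checkpoint."""
    path = Path(checkpoint_path)
    done = set()
    if path.exists():
        # Each line is one JSON record for an item finished in an earlier run.
        with path.open() as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}

    processed = []
    with path.open("a") as f:
        for item in items:
            if item in done:
                continue  # finished before the interruption
            result = process(item)
            # Append and flush right away so a crash loses at most the item in flight.
            f.write(json.dumps({"id": item, "result": result}) + "\n")
            f.flush()
            processed.append(item)
    return processed


# Hypothetical demo: the second run only handles the item the first run never saw.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "stage.jsonl"
    first_run = run_stage(["a", "b"], str.upper, ckpt)
    second_run = run_stage(["a", "b", "c"], str.upper, ckpt)
```

Appending one record per item keeps the checkpoint cheap to write and easy to inspect by hand.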

Quick Start

# Install
git clone https://github.com/NotYoCheese/classivore.git
cd classivore
python -m venv venv && source venv/bin/activate
pip install -e .

# Initialize a new taxonomy
classivore init --csv your_taxonomy.csv --name "My Taxonomy" --version "1.0" --slug my-tax

# Or use an existing taxonomy and run the pipeline step by step
classivore enrich --taxonomy my-tax
classivore collect --taxonomy my-tax
classivore label --taxonomy my-tax
classivore train --taxonomy my-tax

# Run inference
classivore classify --text "Article about machine learning trends..."
classivore classify --file articles.json --output predictions.json
classivore classify --interactive

Pipeline Stages

`classivore init`

Onboard a new taxonomy from a CSV file. Validates the CSV structure, generates a config.yaml with sensible defaults, optionally runs LLM enrichment and domain hint generation, and prints an onboarding report with collection cost estimates.
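As a rough illustration of what validating a hierarchical taxonomy CSV can involve, the sketch below checks for required columns, duplicate IDs, and parents that do not exist. The column names `id`, `parent_id`, and `name` and the `validate_taxonomy` helper are hypothetical; the CSV layout Classivore actually expects may differ.

```python
import csv
import io

REQUIRED_COLUMNS = {"id", "parent_id", "name"}  # hypothetical column names


def validate_taxonomy(csv_text):
    """Return a list of human-readable problems; an empty list means the CSV passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    rows = list(reader)
    problems = []
    seen = set()
    for row in rows:
        if row["id"] in seen:
            problems.append(f"duplicate id: {row['id']}")
        seen.add(row["id"])
    for row in rows:
        parent = row["parent_id"]
        # An empty parent_id marks a tier-1 (top-level) category.
        if parent and parent not in seen:
            problems.append(f"unknown parent {parent} for id {row['id']}")
    return problems


good = "id,parent_id,name\n1,,Technology\n2,1,Machine Learning\n"
bad = "id,parent_id,name\n1,,Technology\n2,9,Orphan\n"
```

Here `validate_taxonomy(good)` returns an empty list, while `validate_taxonomy(bad)` reports the orphaned row.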

`classivore enrich`

Generate descriptions, boundaries, aliases, and difficulty ratings for each taxonomy category using the Anthropic Batch API. These fields improve search query quality and help the labeling stage make better decisions.

`classivore collect`

Discover and scrape web pages for training data. Uses search APIs (Brave first, then Serper, with Exa as an optional last fallback), Common Crawl CDX for historical pages, and content quality filters. Collection targets are tiered by category difficulty: hard categories get more pages to compensate for scarce editorial content.
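Tiered targets can be pictured as a lookup from difficulty to a page quota, with the remaining gap computed per category. The sketch below assumes three difficulty tiers and made-up target numbers; the real values would come from each taxonomy's config.yaml, and `pages_still_needed` is a hypothetical helper.

```python
# Hypothetical pages-per-category targets; real values would come from config.yaml.
TARGETS_BY_DIFFICULTY = {"easy": 30, "medium": 50, "hard": 80}


def pages_still_needed(categories, collected_counts):
    """Map each category name to how many more pages it needs to reach its tier target."""
    needed = {}
    for name, difficulty in categories.items():
        target = TARGETS_BY_DIFFICULTY[difficulty]
        shortfall = target - collected_counts.get(name, 0)
        if shortfall > 0:
            needed[name] = shortfall
    return needed


categories = {"Sports": "easy", "Actuarial Science": "hard"}
collected = {"Sports": 30, "Actuarial Science": 12}
gaps = pages_still_needed(categories, collected)
```

In this example only the hard category still has a gap, since the easy one has already met its smaller target.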

`classivore label`

Classify collected pages using a two-stage hierarchical LLM approach:

Stage 1: Tier-1 triage identifies which top-level categories apply (cheap, broad pass)
Stage 2: Subtree classification within selected tier-1s (detailed, with chain-of-thought)

Uses the Anthropic Batch API for 50% cost reduction. Crash-recoverable — pages at each stage are checkpointed.
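The two-stage flow can be sketched with the LLM calls replaced by plain functions. In the example below, `label_page`, `triage`, and `classify_subtree` are hypothetical names, and simple keyword matching stands in for the model so the sketch runs offline. It shows the control flow only, not Classivore's prompts or batching.

```python
def label_page(text, taxonomy, triage, classify_subtree):
    """Two-stage labeling: pick tier-1 categories first, then classify within each."""
    labels = []
    # Stage 1: a cheap, broad pass over the top-level categories only.
    for tier1 in triage(text, list(taxonomy)):
        # Stage 2: a detailed pass restricted to the chosen tier-1's subtree.
        for leaf in classify_subtree(text, taxonomy[tier1]):
            labels.append([tier1, leaf])
    return labels


# Keyword matching stands in for the LLM calls so the sketch runs offline.
def keyword_match(text, candidates):
    return [c for c in candidates if c.lower() in text.lower()]


taxonomy = {
    "Technology": ["Machine Learning", "Databases"],
    "Sports": ["Tennis", "Cycling"],
}
paths = label_page(
    "Technology news: machine learning speeds up databases.",
    taxonomy,
    triage=keyword_match,
    classify_subtree=keyword_match,
)
```

Because stage 2 only sees the subtrees selected in stage 1, the detailed pass never has to consider the whole taxonomy at once.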

`classivore train`

Fine-tune DeBERTa-v3-large for multi-label classification. Features:

Weighted focal loss for extreme class imbalance
Confidence-weighted training (legacy labels discounted)
Per-category threshold optimization (+5% macro F1 over a single global threshold)
Comprehensive quality report with per-category metrics, confusion pairs, and overfitting detection
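To make two of these features concrete, the sketch below shows a binary focal loss for a single label and a grid search that picks a per-category threshold by F1. It uses plain Python rather than torch so that it runs anywhere. The `gamma` and `alpha` defaults and the threshold grid are common choices assumed here for illustration, not values taken from Classivore.

```python
import math


def focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one label: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def f1_at(threshold, probs, targets):
    """F1 score for one category when predicting positive at prob >= threshold."""
    tp = sum(1 for p, t in zip(probs, targets) if p >= threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, targets) if p >= threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, targets) if p < threshold and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(probs, targets, grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the grid threshold with the highest F1 (the lowest one wins ties)."""
    return max(grid, key=lambda t: f1_at(t, probs, targets))


# A confident correct prediction contributes far less loss than a badly wrong one.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.05, 1)

# Made-up validation scores for a rare category whose positives score below 0.5.
probs = [0.45, 0.35, 0.15, 0.05, 0.12]
targets = [1, 1, 0, 0, 0]
threshold = best_threshold(probs, targets)
```

In this toy data a global threshold of 0.5 would miss both positives, while the per-category search settles on a lower cutoff that separates them cleanly.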

`classivore classify`

Run inference using a trained model. Supports single text, batch JSON/NDJSON, and interactive mode. Long documents are automatically chunked with a sliding window. Auto-discovers the most recent trained model.
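Sliding-window chunking splits a long token sequence into overlapping windows, scores each window, and then combines the scores. The sketch below assumes a window of 512 tokens, a stride of 256, and max-pooling across chunks. These are illustrative choices and hypothetical helper names, since the text does not state which sizes or aggregation Classivore uses.

```python
def sliding_windows(tokens, window=512, stride=256):
    """Split a token list into overlapping windows that together cover every token."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reached the end of the document
        start += stride
    return chunks


def merge_scores(per_chunk_scores):
    """Combine per-chunk label scores by keeping each label's maximum."""
    merged = {}
    for scores in per_chunk_scores:
        for label, score in scores.items():
            merged[label] = max(score, merged.get(label, 0.0))
    return merged


chunks = sliding_windows(list(range(10)), window=4, stride=2)
merged = merge_scores([{"Tech": 0.2, "Sports": 0.7}, {"Tech": 0.9}])
```

Taking the maximum means a label is kept if any part of the document supports it strongly; averaging would be a reasonable alternative when that proves too permissive.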

`classivore agent`

Automated collect-label-evaluate loop. Analyzes coverage gaps, collects pages for the weakest categories, labels them, and repeats until targets are met or budget is exhausted.
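The loop described here can be sketched as repeatedly topping up the weakest category until every category reaches its target or the budget runs out. Everything in the example below is hypothetical: `run_agent`, the flat per-round cost, and the stubbed `collect_and_label` callable are stand-ins, and the real agent's gap analysis is likely richer than a simple page count.

```python
def run_agent(coverage, target, budget, collect_and_label, batch_cost=1.0):
    """Repeat collect-and-label on the weakest category until targets or budget run out."""
    spent = 0.0
    while spent + batch_cost <= budget:
        gaps = {c: n for c, n in coverage.items() if n < target}
        if not gaps:
            break  # every category has reached its target
        weakest = min(gaps, key=gaps.get)
        coverage[weakest] += collect_and_label(weakest)
        spent += batch_cost
    return coverage, spent


# Stub round: pretend each round yields 5 newly labeled pages.
final, spent = run_agent(
    coverage={"Tennis": 8, "Cycling": 2},
    target=10,
    budget=10.0,
    collect_and_label=lambda category: 5,
)
```

With these inputs the loop stops after three rounds because both categories pass the target, well before the budget is used up.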

`classivore publish`

Push a trained model to a private HuggingFace Hub repo with version tagging. The published artifact is self-contained — includes model weights, tokenizer, thresholds, label mappings, and taxonomy metadata (paths, IDs). No taxonomy CSV needed at serve time.

Other Commands

Command	Description
`classivore taxonomy`	Show taxonomy stats, coverage gaps, and exclusions
`classivore validate`	Run data quality checks via label-lens
`classivore hints`	Generate domain hints for tier-1 categories
`classivore hf init`	Create a private HuggingFace repo

Taxonomy Configuration

Each taxonomy lives in taxonomies/<slug>/ with a config.yaml that controls everything: collection targets by difficulty, query budgets, LLM models, filter relaxations, domain hints, and category exclusions. See taxonomies/ for examples.

Self-Contained Inference

The Classifier class at classivore.inference.Classifier is designed for production use. It imports nothing else from classivore; its only dependencies are torch, transformers, numpy, and the standard-library json module. Load a model directory and get predictions:

from classivore.inference import Classifier

classifier = Classifier("models/my-tax/20260408_162922")
results = classifier.predict("Article text here...")
# [{"name": "Category", "id": "42", "path": ["Parent", "Category"], "confidence": 0.93}]

The companion classivore-api repo uses this for serving.
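Downstream code usually wants only the strongest predictions in a readable form. The sketch below post-processes a list in the result shape shown in the comment above; `top_predictions`, its default cutoff, and the sample values are made up for illustration and are not part of the Classifier API.

```python
def top_predictions(results, min_confidence=0.5, limit=3):
    """Keep the most confident predictions and render each path as a readable string."""
    kept = [r for r in results if r["confidence"] >= min_confidence]
    kept.sort(key=lambda r: r["confidence"], reverse=True)
    return [(" > ".join(r["path"]), r["confidence"]) for r in kept[:limit]]


# Sample data in the documented result shape; the values are made up.
results = [
    {"name": "Databases", "id": "7", "path": ["Technology", "Databases"], "confidence": 0.41},
    {"name": "Category", "id": "42", "path": ["Parent", "Category"], "confidence": 0.93},
]
summary = top_predictions(results)
```

In real use, `results` would be the return value of `classifier.predict(...)` rather than a hand-written list.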

Project Structure

src/classivore/
  cli/            Command-line interface
  config/         Settings loader and defaults
  taxonomy/       CSV loader, enricher, onboarding
  collection/     Search, scraping, filters, state
  labeling/       Two-stage LLM labeling pipeline
  training/       DeBERTa trainer, focal loss, thresholds, evaluation
  inference/      Self-contained Classifier for production inference
  publishing/     HuggingFace Hub publishing
  agent/          Automated collect-label-evaluate loop
  validation/     Data quality checks

API Keys & External Services

Create a .env file in the project root and add the keys you need:

ANTHROPIC_API_KEY=...
BRAVE_API_KEY=...
SERPER_API_KEY=...
EXA_API_KEY=...
HUGGINGFACE_TOKEN=...
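A quick pre-flight check can catch a missing key before a long run starts. The sketch below encodes only the required keys named in the subsections that follow and assumes the variables are already exported into the environment; it does not parse the .env file itself. The `missing_keys` helper is hypothetical, and it does not handle the CLASSIVORE_API_KEY alternative.

```python
import os

# Required keys per command, taken from the subsections below.
REQUIRED_KEYS = {
    "enrich": ["ANTHROPIC_API_KEY"],
    "label": ["ANTHROPIC_API_KEY"],
    "hints": ["ANTHROPIC_API_KEY"],
    "publish": ["HUGGINGFACE_TOKEN"],
}


def missing_keys(command, env=None):
    """Return the required keys for a command that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS.get(command, []) if not env.get(key)]


# Passing a dict instead of os.environ makes the check easy to try out.
problems = missing_keys("label", env={"HUGGINGFACE_TOKEN": "hf_example"})
```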

Anthropic API — Required for enrichment and labeling

Used by: classivore enrich, classivore label, classivore hints, classivore collect (LLM query generation)

Get a key at console.anthropic.com. Set ANTHROPIC_API_KEY (or CLASSIVORE_API_KEY if you want to use a separate key).

Enrichment and labeling are the most API-intensive stages. Both use the Batch API for a 50% cost reduction. A rough estimate for the IAB 2.2 taxonomy (~700 categories, ~30K pages): enrichment ~$1–2, labeling ~$15–25 depending on model choice.
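The quoted figures can be turned into a rough per-page rate for sizing other taxonomies. The arithmetic below assumes that the labeling estimate already reflects Batch API pricing and that cost scales linearly with page count; neither assumption is stated in the text, so treat the output as a back-of-the-envelope figure.

```python
BATCH_DISCOUNT = 0.5  # the 50% Batch API reduction quoted above


def per_page_cost(total_low, total_high, pages):
    """Convert a total cost range into a per-page range."""
    return total_low / pages, total_high / pages


def without_batch(cost):
    """Estimate what a batch-priced cost would be at the undiscounted rate."""
    return cost / (1 - BATCH_DISCOUNT)


# Labeling estimate from the text: $15 to $25 for about 30K pages.
low, high = per_page_cost(15.0, 25.0, 30_000)
# Projected labeling cost range in dollars for a 100K-page collection.
scaled = (round(low * 100_000, 2), round(high * 100_000, 2))
```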

Brave Search — Optional, recommended for collection

Used by: classivore collect

Get a key at api.search.brave.com. Set BRAVE_API_KEY. The free plan includes 2,000 queries/month.

Brave is the first provider tried for keyword search. Without at least one search provider, collection cannot discover new URLs.

Serper — Optional, Brave fallback

Used by: classivore collect

Get a key at serper.dev. Set SERPER_API_KEY. Returns Google results. Used automatically when Brave's quota is exhausted.

Exa AI — Optional, semantic search and scrape fallback

Used by: classivore collect

Get a key at dashboard.exa.ai. Set EXA_API_KEY.

Exa serves two roles in the collection pipeline:

Neural search fallback — when Brave and Serper are both exhausted, Exa's semantic search finds relevant pages for hard categories where keyword queries underperform.
Scrape fallback — when live scraping fails (WAF blocks, 403s), Exa's /contents endpoint retrieves the page through their own infrastructure. Pages fetched this way bypass site-level blocks entirely.

Results from Exa include full page text, so pages retrieved via Exa skip the scraping step.
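The provider order described in these sections (Brave, then Serper, then Exa) amounts to a fallback chain. The sketch below shows one way to express it, with stub functions standing in for the real clients. `QuotaExhausted` and `search_with_fallback` are hypothetical names, and the real collector may detect exhaustion differently.

```python
class QuotaExhausted(Exception):
    """Raised by a provider stub when its query quota is used up."""


def search_with_fallback(query, providers):
    """Try providers in order and return (provider_name, urls) from the first that answers."""
    for name, search in providers:
        try:
            return name, search(query)
        except QuotaExhausted:
            continue  # fall through to the next provider
    raise RuntimeError("all search providers are exhausted")


# Stubs stand in for the real Brave and Serper clients.
def brave(query):
    raise QuotaExhausted("monthly quota used")


def serper(query):
    return [f"https://example.com/search?q={query}"]


used, urls = search_with_fallback("tennis", [("brave", brave), ("serper", serper)])
```

Keeping the chain as an ordered list makes it easy to append a further provider, such as an Exa client, as the last resort.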

HuggingFace Hub — Required for publishing

Used by: classivore publish, classivore hf init

Get a write-access token at huggingface.co/settings/tokens. Set HUGGINGFACE_TOKEN, or pass --token directly to the publish command.

Common Crawl — No key required

Used by: classivore collect

Classivore queries the Common Crawl CDX index for historical page snapshots before attempting live scrapes. No API key is needed. The crawl ID is configured per taxonomy in config.yaml (commoncrawl_crawl_id); set it to null to disable Common Crawl lookups.
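For orientation, a CDX lookup is an HTTP GET against the public index for a given crawl. The sketch below only builds the query URL and makes no network request. The `cdx_query_url` helper and the example crawl ID are assumptions for illustration, and the exact parameters Classivore sends are not documented here.

```python
from urllib.parse import urlencode


def cdx_query_url(crawl_id, page_url, limit=5):
    """Build a Common Crawl CDX index query URL; returns None when lookups are disabled."""
    if crawl_id is None:
        return None  # mirrors setting commoncrawl_crawl_id to null
    params = urlencode({"url": page_url, "output": "json", "limit": limit})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


# Example crawl ID chosen for illustration only.
url = cdx_query_url("CC-MAIN-2024-10", "example.com/article")
```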

Requirements

Python >= 3.11
GPU recommended for training (RTX 4090: ~45 min for 30K pages)

License

MIT

Project details

Version 1.4.0, released May 18, 2026

