BTZSC: A Benchmark for Zero-Shot Text Classification across Cross-Encoders, Embedding Models, Rerankers, and LLMs

BTZSC

A unified benchmark for zero-shot text classification across embedding models, cross-encoders, rerankers, and LLMs.


Overview

BTZSC is a benchmark package for evaluating zero-shot text classification models under a unified interface. It helps you compare very different model families using the same datasets, task groupings, and metrics.

It is also the evaluation harness behind the BTZSC Hugging Face leaderboard: you can run the benchmark locally, export a leaderboard-ready JSON artifact, and submit new entries to keep the public results up to date.

The package includes:

  • Dataset loaders for BTZSC benchmark tasks.
  • A shared benchmark runner across model adapters.
  • Built-in adapters for embedding, NLI, reranker, and LLM-style models.
  • Baseline comparison utilities and a CLI for reproducible evaluation.

For Users

Installation

Install with pip:

pip install btzsc

Install with uv in an existing project:

uv add btzsc

Run as a standalone CLI tool with uvx (no project install needed):

uvx btzsc list-datasets

Quick Start (Python API)

Use this as a recommended first workflow:

  1. Start with one or two task groups to validate your setup.
  2. Inspect summary and per-dataset outputs.
  3. Compare against bundled baselines.
  4. Export a leaderboard-ready JSON artifact.

API notes:

  • BTZSCBenchmark(tasks=...) accepts either task groups ("sentiment", "topic", "intent", "emotion") or explicit dataset names. Leave empty to run all datasets.
  • evaluate(model=..., model_type=...) returns a BTZSCResults object.
  • model_type is required when model is a string model ID (if you pass a BaseModel instance, you can omit it). Choose from embedding, nli, reranker, llm.
  • Use max_samples for quick smoke tests; increase batch_size for throughput if your hardware allows it.

from btzsc import BTZSCBenchmark

benchmark = BTZSCBenchmark(tasks=["sentiment", "topic"])
results = benchmark.evaluate(
	model="intfloat/e5-base-v2",
	model_type="embedding",
	batch_size=64,
)

print(results.summary())
print(results.per_dataset())

# Compare against bundled baselines
print(results.compare_baselines(metric="f1"))

# Export leaderboard-ready JSON
results.to_json("results/embedding/e5-base-v2.json")

Quick Start (CLI)

Equivalent end-to-end CLI flow:

Note: when --model is a model ID string, you must also provide --type.

# 1) Explore benchmark metadata
btzsc list-datasets
btzsc list-model-types

# 2) Run an initial benchmark
btzsc evaluate --model intfloat/e5-base-v2 --type embedding --tasks sentiment,topic

# 3) Compare with packaged baselines
btzsc baselines --metric f1 --top 10

# 4) Export JSON for leaderboard submission
btzsc evaluate \
	--model intfloat/e5-base-v2 \
	--type embedding \
	--output-json results/embedding/e5-base-v2.json

# 5) Validate the JSON locally
btzsc validate-result results/embedding/e5-base-v2.json

Tip: run a small pilot first, then repeat with your full task scope for final reporting.

Supported Model Types

BTZSC currently supports these adapter families:

  • embedding
  • nli
  • reranker
  • llm

Pass the model type explicitly (model_type in Python or --type in CLI).

Extending with Custom Models

To make a custom model compatible with BTZSC, implement an adapter that subclasses BaseModel.

Contract requirements:

  • predict_scores(texts, labels, batch_size) must return a score matrix with shape (len(texts), len(labels)) where higher means more likely.
  • predict(texts, labels, batch_size) must return predicted label indices with shape (len(texts),).
  • Set model_type on your class. Use embedding, nli, reranker, or llm when applicable.

import numpy as np

from btzsc.models.base import BaseModel


class MyCustomAdapter(BaseModel):
	model_type = "embedding"

	def __init__(self, model_name: str = "my-org/my-model"):
		self.model_name = model_name

	def predict_scores(
		self,
		texts: list[str],
		labels: list[str],
		batch_size: int = 32,
	) -> np.ndarray:
		# Replace this with your real scoring implementation.
		return np.zeros((len(texts), len(labels)), dtype=float)

	def predict(
		self,
		texts: list[str],
		labels: list[str],
		batch_size: int = 32,
	) -> np.ndarray:
		scores = self.predict_scores(texts, labels, batch_size=batch_size)
		return scores.argmax(axis=1)

Run it in the benchmark:

from btzsc import BTZSCBenchmark

benchmark = BTZSCBenchmark(tasks=["sentiment", "topic"])
custom_model = MyCustomAdapter("my-org/my-model")

results = benchmark.evaluate(
	model=custom_model,
	batch_size=32,
	max_samples=200,
)

print(results.summary())
results.to_json("results/custom/my-model.json")

When you pass a BaseModel instance to evaluate(), you do not need model_type=... in the call.

Submitting to the Leaderboard

After exporting your JSON (results.to_json(...) or --output-json), first validate it:

btzsc validate-result results/<model_type>/<model-name>.json

Then publish it to the results dataset repo:

https://huggingface.co/datasets/btzsc/btzsc-results

Required destination path format:

results/<model_type>/<model-name>.json

Example:

results/embedding/e5-base-v2.json
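Before uploading, you can sanity-check that a local file path matches this layout. The regex below is purely illustrative and not part of the package (the official check is btzsc validate-result); it only encodes the results/<model_type>/<model-name>.json convention and the four adapter families listed above.

```python
import re

# results/<model_type>/<model-name>.json, with the four supported adapter families
PATTERN = re.compile(r"^results/(embedding|nli|reranker|llm)/[^/]+\.json$")

print(bool(PATTERN.match("results/embedding/e5-base-v2.json")))  # True
print(bool(PATTERN.match("results/e5-base-v2.json")))            # False: missing model_type folder
```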

You can submit using any of these workflows:

  1. Web UI (no clone required)

  2. Git workflow (clone/fork + push)

    • Clone (or fork) btzsc/btzsc-results, add your JSON at the required path, then push.
    • If you pushed to a fork, open a PR to btzsc/btzsc-results.
git lfs install
git clone https://huggingface.co/datasets/btzsc/btzsc-results
cd btzsc-results

# Copy your exported JSON into the correct folder
mkdir -p results/reranker
cp /path/to/my_result.json results/reranker/my-model.json

git add results/reranker/my-model.json
git commit -m "Add BTZSC results for my-model"
git push

  3. API workflow (huggingface_hub, PR-based)

    • Authenticate first (huggingface-cli login or HF_TOKEN).
    • create_pr=True creates a PR branch instead of pushing directly to main.

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="results/reranker/my-model.json",
    path_in_repo="results/reranker/my-model.json",
    repo_id="btzsc/btzsc-results",
    repo_type="dataset",
    commit_message="Add BTZSC results for my-model",
    create_pr=True,
)

The leaderboard Space reads from this results dataset and updates as new valid entries are added.

For full submission requirements, see hf/results_repo/SUBMISSION.md.

Benchmark Protocol

BTZSC follows a strict zero-shot protocol:

  • 22 English single-label datasets
  • 4 task families: sentiment, topic, intent, emotion
  • No BTZSC-label training or tuning on evaluation datasets
  • Primary leaderboard metric: macro-F1
  • Secondary metrics: accuracy, macro-precision, macro-recall
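As a reminder of what the primary metric computes, macro-F1 is the unweighted mean of per-class F1 scores, so minority classes count as much as majority ones. The sketch below is a minimal pure-Python illustration of that definition, not the package's own implementation (which lives in src/btzsc/metrics.py).

```python
def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))  # ~0.7333: mean of per-class F1s 2/3 and 4/5
```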

The leaderboard is continuously updated as new submissions are added.

Dataset

BTZSC benchmark data is available on Hugging Face:

https://huggingface.co/datasets/btzsc/btzsc

To load the raw paired-format rows with datasets:

from datasets import get_dataset_config_names, load_dataset

repo_id = "btzsc/btzsc"

# Each dataset is a config name (e.g. "agnews", "imdb", ...)
print(get_dataset_config_names(repo_id)[:5])

# Load one dataset's test split
ds = load_dataset(repo_id, "agnews", split="test")
print(ds.column_names)
print(ds[0])

The dataset stores rows in a paired format, (text, hypothesis, labels), where labels is a binary entailment label. The package reconstructs grouped multiclass samples internally for evaluation.
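The grouping step can be illustrated in plain Python: rows that share the same text form one multiclass sample, and the entailed hypothesis marks the gold label. This is a simplified sketch over toy rows, not the package's loader (see src/btzsc/data.py for the real logic).

```python
from collections import defaultdict

# Toy rows in the paired (text, hypothesis, labels) format described above;
# labels == 1 marks the entailed (gold) hypothesis for that text.
rows = [
    {"text": "great movie", "hypothesis": "This text is positive.", "labels": 1},
    {"text": "great movie", "hypothesis": "This text is negative.", "labels": 0},
    {"text": "awful plot", "hypothesis": "This text is positive.", "labels": 0},
    {"text": "awful plot", "hypothesis": "This text is negative.", "labels": 1},
]

# Group paired rows by their shared text.
grouped = defaultdict(list)
for row in rows:
    grouped[row["text"]].append(row)

# Rebuild one multiclass sample per text: candidate labels + gold index.
samples = []
for text, group in grouped.items():
    labels = [r["hypothesis"] for r in group]
    gold = next(i for i, r in enumerate(group) if r["labels"] == 1)
    samples.append({"text": text, "labels": labels, "gold": gold})

print(samples[0]["gold"])  # 0: the entailed hypothesis for "great movie"
```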

Citing

@inproceedings{aarab2026btzsc,
	title     = {BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, and Rerankers},
	author    = {Aarab, Ilias},
	booktitle = {International Conference on Learning Representations (ICLR) 2026},
	year      = {2026},
	note      = {OpenReview PDF: https://openreview.net/pdf?id=IxMryAz2p3},
	url       = {https://openreview.net/forum?id=IxMryAz2p3}
}

License

Released under the MIT license.


For Developers

Developer Setup

git clone https://github.com/IliasAarab/btzsc.git
cd btzsc
uv sync --dev

Project Structure

High-level layout:

  • src/btzsc/benchmark.py: benchmark orchestration and result objects.
  • src/btzsc/data.py: dataset loading and task grouping.
  • src/btzsc/metrics.py: metric computation and summaries.
  • src/btzsc/baselines.py: baseline loading and comparison table creation.
  • src/btzsc/models/: model adapters (embedding, nli, reranker, llm).
  • src/btzsc/cli.py: command-line interface.

Quality Checks

Run formatting, linting, and typing checks before opening a PR:

uv run ruff format
uv run ruff check
uv run pyright

Packaging and Release

Build locally:

uv build

Release process:

  1. Bump version in pyproject.toml.
  2. Commit and push to main.
  3. Create and push a version tag, for example:

git tag v0.1.1
git push origin v0.1.1

GitHub Actions builds and publishes tagged releases to PyPI via trusted publishing.
