BTZSC: A Benchmark for Zero-Shot Text Classification across Cross-Encoders, Embedding Models, Rerankers and LLMs
BTZSC
A unified benchmark for zero-shot text classification across embedding models, cross-encoders, rerankers, and LLMs.
Overview
BTZSC is a benchmark package for evaluating zero-shot text classification models under a unified interface. It helps you compare very different model families using the same datasets, task groupings, and metrics.
It is also the evaluation harness behind the BTZSC Hugging Face leaderboard: you can run the benchmark locally, export a leaderboard-ready JSON artifact, and submit new entries to keep the public results up to date.
The package includes:
- Dataset loaders for BTZSC benchmark tasks.
- A shared benchmark runner across model adapters.
- Built-in adapters for embedding, NLI, reranker, and LLM-style models.
- Baseline comparison utilities and a CLI for reproducible evaluation.
Paper and Resources
- Paper (OpenReview): https://openreview.net/forum?id=IxMryAz2p3
- PDF: https://openreview.net/pdf?id=IxMryAz2p3
- Eval harness (GitHub): https://github.com/IliasAarab/btzsc
- Leaderboard results dataset: https://huggingface.co/datasets/btzsc/btzsc-results
- Leaderboard Space: https://huggingface.co/spaces/btzsc/btzsc-leaderboard
For Users
Installation
Install with pip:
pip install btzsc
Install with uv in an existing project:
uv add btzsc
Run as a standalone CLI tool with uvx (no project install needed):
uvx btzsc list-datasets
Quick Start (Python API)
Use this as a recommended first workflow:
- Start with one or two task groups to validate your setup.
- Inspect summary and per-dataset outputs.
- Compare against bundled baselines.
- Export a leaderboard-ready JSON artifact.
API notes:
- `BTZSCBenchmark(tasks=...)` accepts either task groups (`"sentiment"`, `"topic"`, `"intent"`, `"emotion"`) or explicit dataset names. Leave it empty to run all datasets.
- `evaluate(model=..., model_type=...)` returns a `BTZSCResults` object. `model_type` is required when `model` is a string model ID (if you pass a `BaseModel` instance, you can omit it). Choose from `embedding`, `nli`, `reranker`, `llm`.
- Use `max_samples` for quick smoke tests; increase `batch_size` for throughput if your hardware allows it.
from btzsc import BTZSCBenchmark
benchmark = BTZSCBenchmark(tasks=["sentiment", "topic"])
results = benchmark.evaluate(
model="intfloat/e5-base-v2",
model_type="embedding",
batch_size=64,
)
print(results.summary())
print(results.per_dataset())
# Compare against bundled baselines
print(results.compare_baselines(metric="f1"))
# Export leaderboard-ready JSON
results.to_json("results/embedding/e5-base-v2.json")
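For a quick smoke test before a full run, cap the number of samples per dataset with `max_samples` (the cap of 100 below is an arbitrary example value):
# Same call as above, but limited to a small sample per dataset.
smoke = benchmark.evaluate(
    model="intfloat/e5-base-v2",
    model_type="embedding",
    batch_size=64,
    max_samples=100,
)
print(smoke.summary())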
Quick Start (CLI)
Equivalent end-to-end CLI flow:
Note: when --model is a model ID string, you must also provide --type.
# 1) Explore benchmark metadata
btzsc list-datasets
btzsc list-model-types
# 2) Run an initial benchmark
btzsc evaluate --model intfloat/e5-base-v2 --type embedding --tasks sentiment,topic
# 3) Compare with packaged baselines
btzsc baselines --metric f1 --top 10
# 4) Export JSON for leaderboard submission
btzsc evaluate \
--model intfloat/e5-base-v2 \
--type embedding \
--output-json results/embedding/e5-base-v2.json
# 5) Validate the JSON locally
btzsc validate-result results/embedding/e5-base-v2.json
Tip: run a small pilot first, then repeat with your full task scope for final reporting.
Supported Model Types
BTZSC currently supports these adapter families:
- `embedding`
- `nli`
- `reranker`
- `llm`

Pass the model type explicitly (`model_type` in Python or `--type` in the CLI).
Extending with Custom Models
To make a custom model compatible with BTZSC, implement an adapter that subclasses BaseModel.
Contract requirements:
- `predict_scores(texts, labels, batch_size)` must return a score matrix with shape `(len(texts), len(labels))`, where higher means more likely.
- `predict(texts, labels, batch_size)` must return predicted label indices with shape `(len(texts),)`.
- Set `model_type` on your class. Use `embedding`, `nli`, `reranker`, or `llm` when applicable.
import numpy as np
from btzsc.models.base import BaseModel
class MyCustomAdapter(BaseModel):
model_type = "embedding"
def __init__(self, model_name: str = "my-org/my-model"):
self.model_name = model_name
def predict_scores(
self,
texts: list[str],
labels: list[str],
batch_size: int = 32,
) -> np.ndarray:
# Replace this with your real scoring implementation.
return np.zeros((len(texts), len(labels)), dtype=float)
def predict(
self,
texts: list[str],
labels: list[str],
batch_size: int = 32,
) -> np.ndarray:
scores = self.predict_scores(texts, labels, batch_size=batch_size)
return scores.argmax(axis=1)
Run it in the benchmark:
from btzsc import BTZSCBenchmark
benchmark = BTZSCBenchmark(tasks=["sentiment", "topic"])
custom_model = MyCustomAdapter("my-org/my-model")
results = benchmark.evaluate(
model=custom_model,
batch_size=32,
max_samples=200,
)
print(results.summary())
results.to_json("results/custom/my-model.json")
When you pass a BaseModel instance to evaluate(), you do not need model_type=... in the call.
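As a more concrete illustration, here is a minimal sketch of an embedding adapter built on sentence-transformers. This is illustration only: sentence-transformers is an assumed external dependency, the model name is just an example, and this is not BTZSC's built-in embedding adapter. It scores each text against each candidate label by cosine similarity:
import numpy as np
from sentence_transformers import SentenceTransformer

from btzsc.models.base import BaseModel

class SentenceTransformerAdapter(BaseModel):
    model_type = "embedding"

    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.encoder = SentenceTransformer(model_name)

    def predict_scores(
        self,
        texts: list[str],
        labels: list[str],
        batch_size: int = 32,
    ) -> np.ndarray:
        # With normalized embeddings, the dot product equals cosine similarity.
        text_emb = self.encoder.encode(texts, batch_size=batch_size, normalize_embeddings=True)
        label_emb = self.encoder.encode(labels, normalize_embeddings=True)
        return np.asarray(text_emb @ label_emb.T)

    def predict(
        self,
        texts: list[str],
        labels: list[str],
        batch_size: int = 32,
    ) -> np.ndarray:
        return self.predict_scores(texts, labels, batch_size=batch_size).argmax(axis=1)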
Submitting to the Leaderboard
After exporting your JSON (results.to_json(...) or --output-json), first validate it:
btzsc validate-result results/<model_type>/<model-name>.json
Then publish it to the results dataset repo:
https://huggingface.co/datasets/btzsc/btzsc-results
Required destination path format:
results/<model_type>/<model-name>.json
Example:
results/embedding/e5-base-v2.json
You can submit using any of these workflows:
- Web UI (no clone required)
  - Open the results repo page: https://huggingface.co/datasets/btzsc/btzsc-results
  - Go to Files and versions and upload your JSON at the required path.
  - If you do not have write access, fork the repo and open a PR.
- Git workflow (clone/fork + push)
  - Clone (or fork) `btzsc/btzsc-results`, add your JSON at the required path, then push.
  - If you pushed to a fork, open a PR to `btzsc/btzsc-results`.
git lfs install
git clone https://huggingface.co/datasets/btzsc/btzsc-results
cd btzsc-results
# Copy your exported JSON into the correct folder
mkdir -p results/reranker
cp /path/to/my_result.json results/reranker/my-model.json
git add results/reranker/my-model.json
git commit -m "Add BTZSC results for my-model"
git push
- API workflow (`huggingface_hub`, PR-based)
  - Authenticate first (`huggingface-cli login` or the `HF_TOKEN` environment variable).
  - `create_pr=True` creates a PR branch instead of pushing directly to `main`.
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
path_or_fileobj="results/reranker/my-model.json",
path_in_repo="results/reranker/my-model.json",
repo_id="btzsc/btzsc-results",
repo_type="dataset",
commit_message="Add BTZSC results for my-model",
create_pr=True,
)
The leaderboard Space reads from this results dataset and updates as new valid entries are added.
For full submission requirements, see hf/results_repo/SUBMISSION.md.
Benchmark Protocol
BTZSC follows a strict zero-shot protocol:
- 22 English single-label datasets
- 4 task families: sentiment, topic, intent, emotion
- No training or tuning on the evaluation datasets' labels
- Primary leaderboard metric: macro-F1 (see the sketch after this list)
- Secondary metrics: accuracy, macro-precision, macro-recall
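Macro-F1 is the unweighted mean of per-class F1 scores, so small classes weigh as much as large ones. A minimal illustration with scikit-learn (shown only to pin down the metric; BTZSC computes its metrics internally in src/btzsc/metrics.py):
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1]  # gold label indices
y_pred = [0, 2, 2, 2, 0]  # predicted label indices

# average="macro": compute F1 per class, then take the unweighted mean.
print(f1_score(y_true, y_pred, average="macro"))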
The leaderboard is continuously updated as new submissions are added.
Dataset
BTZSC benchmark data is available on Hugging Face:
https://huggingface.co/datasets/btzsc/btzsc
To load the raw paired-format rows with datasets:
from datasets import get_dataset_config_names, load_dataset
repo_id = "btzsc/btzsc"
# Each dataset is a config name (e.g. "agnews", "imdb", ...)
print(get_dataset_config_names(repo_id)[:5])
# Load one dataset's test split
ds = load_dataset(repo_id, "agnews", split="test")
print(ds.column_names)
print(ds[0])
The dataset stores rows in a paired format: each row is a (text, hypothesis, labels) triple, where labels is a binary entailment indicator.
The package reconstructs grouped multiclass samples internally for evaluation.
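To see how the paired rows relate to multiclass samples, here is a rough sketch of the regrouping idea using the `ds` split loaded above. This is hypothetical illustration code, not the package's internal implementation, and it assumes `labels == 1` marks the entailed hypothesis:
from collections import defaultdict

# Collect every (hypothesis, label) pair that shares the same input text.
grouped = defaultdict(list)
for row in ds:
    grouped[row["text"]].append((row["hypothesis"], row["labels"]))

# For one text: its candidate label set and the entailed (gold) hypothesis.
text, pairs = next(iter(grouped.items()))
candidates = [hyp for hyp, _ in pairs]
gold = [hyp for hyp, lab in pairs if lab == 1]
print(text, candidates, gold)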
Citing
@inproceedings{aarab2026btzsc,
title = {BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, and Rerankers},
author = {Aarab, Ilias},
booktitle = {International Conference on Learning Representations (ICLR) 2026},
year = {2026},
note = {OpenReview PDF: https://openreview.net/pdf?id=IxMryAz2p3},
url = {https://openreview.net/forum?id=IxMryAz2p3}
}
License
Released under the MIT license.
For Developers
Developer Setup
git clone https://github.com/IliasAarab/btzsc.git
cd btzsc
uv sync --dev
Project Structure
High-level layout:
- `src/btzsc/benchmark.py`: benchmark orchestration and result objects.
- `src/btzsc/data.py`: dataset loading and task grouping.
- `src/btzsc/metrics.py`: metric computation and summaries.
- `src/btzsc/baselines.py`: baseline loading and comparison table creation.
- `src/btzsc/models/`: model adapters (`embedding`, `nli`, `reranker`, `llm`).
- `src/btzsc/cli.py`: command-line interface.
Quality Checks
Run formatting, linting, and typing checks before opening a PR:
uv run ruff format
uv run ruff check
uv run pyright
Packaging and Release
Build locally:
uv build
Release process:
- Bump `version` in `pyproject.toml`.
- Commit and push to `main`.
- Create and push a version tag, for example:
git tag v0.1.1
git push origin v0.1.1
GitHub Actions builds and publishes tagged releases to PyPI via trusted publishing.