Synthetic benchmarks for evaluating Concept Bottleneck Models.

Concept Benchmark

Concept Benchmark is a Python package for benchmarking concept bottleneck models (CBMs). It provides synthetic datasets with ground-truth concept labels, allowing users to vary concept granularity, annotation quality, and the labeling rule, and measure how each factor affects model performance and the value of interventions. The package includes two benchmarks -- robot classification (decision support) and Sudoku validation (automation) -- across image, text, and tabular modalities.

Installation
Quick Start
Benchmarks
Citation

Installation

The package requires the cairo graphics library. Install it first:

# macOS
brew install cairo pkg-config

# Ubuntu / Debian
sudo apt-get install libcairo2-dev pkg-config python3-dev

# Fedora / RHEL
sudo dnf install cairo-devel pkg-config python3-devel

Then install the package:

pip install concept-benchmark

Or install from source:

git clone https://github.com/ustunb/concept-benchmark.git
cd concept-benchmark
uv sync

Verify the installation:

python3 -c "import concept_benchmark; print('OK')"

Quick Start

A CBM predicts concepts from inputs (e.g., "has pointy feet"), then predicts the label from those concepts. At test time, a user can correct mispredicted concepts -- this is called an intervention. The package lets you measure whether correcting k concepts improves the label prediction, and how that depends on concept quality and annotation noise.
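As a rough illustration of an intervention (not the package's implementation), the sketch below hard-codes concept probabilities and a linear label predictor with NumPy. The function names, weights, and probabilities are all made up for the example.

```python
import numpy as np

def predict_label(concept_probs, weights, intercept):
    """Label predictor: linear score over concept probabilities, thresholded at 0."""
    return int(concept_probs @ weights + intercept > 0)

def intervene(concept_probs, true_concepts, k):
    """Replace the k least confident concept predictions with their true values."""
    corrected = concept_probs.copy()
    # Confidence is the distance from 0.5; the smallest distance is the most uncertain.
    order = np.argsort(np.abs(concept_probs - 0.5), kind="stable")
    idx = order[:k]
    corrected[idx] = true_concepts[idx]
    return corrected

# Hypothetical sample: three concepts, all truly present.
true_concepts = np.array([1.0, 1.0, 1.0])
concept_probs = np.array([0.9, 0.4, 0.8])  # the second concept is mispredicted
weights = np.array([1.0, 1.0, 1.0])
intercept = -2.5  # the score is positive only when all three concepts are confidently on

before = predict_label(concept_probs, weights, intercept)
after = predict_label(intervene(concept_probs, true_concepts, k=1), weights, intercept)
print(before, after)
```

Here a single correction (k=1) targets the most uncertain concept, which happens to be the mispredicted one, and the label prediction changes as a result.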

Each benchmark has a pipeline script in scripts/ that runs the full experiment end-to-end:

# Robot classification (image, default 7 concepts)
python scripts/robot_pipeline.py --seed 1014

# Robot classification (subconcept variant, 12 concepts)
python scripts/robot_pipeline.py --seed 1014 --subconcept

# Sudoku validation
python scripts/sudoku_pipeline.py --seed 171

# Robot text classification
python scripts/robot_text_pipeline.py --seed 1337

Each script supports --help for the full list of flags. Use --stages to run a subset of the pipeline (e.g., --stages cbm dnn intervene to retrain models on existing data).

The underlying package modules can also be imported for programmatic use:

from concept_benchmark.config import RobotBenchmarkConfig
from concept_benchmark.models import ConceptBasedModel, ConceptDetector
from concept_benchmark.utils import create_skewed_splits_full, set_deterministic_seed
from concept_benchmark.synthetic.robot import create_synthetic_dataset

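# Build the robot benchmark configuration and seed the random number generators
cfg = RobotBenchmarkConfig(seed=1014)
set_deterministic_seed(cfg.seed)
# Generate the synthetic robot dataset from the configuration
data = create_synthetic_dataset(**cfg.to_dict())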

Benchmarks

The package includes two benchmarks. Robot classification is a decision-support task where a human corrects the model's concept predictions to improve accuracy. Sudoku validation is an automation task where the system handles routine cases and defers uncertain ones to a human.

Robot Classification

This benchmark targets decision-support settings where a human uses the model's concept predictions to improve their own decisions. The task is to predict the species of a fictional robot -- Glorp or Drent -- from its body features. Each robot has 9 binary features (mouth type, foot shape, knee presence, etc.). The default labeling rule labels a robot Glorp if its mouth is closed, its foot is pointy, and it has knees (all three conditions), and Drent otherwise. Both the features that determine the label and the concepts that are excluded (via drop_concepts) are configurable, mimicking real-world settings where the true relationship between features and labels is unknown. The benchmark is available in image and text modalities.

[Figure: Robot with annotated concepts]

The following example uses the subconcept variant (12 concepts instead of the default 7) with intervention regimes:

# Run the full pipeline with subconcepts and expert interventions
python scripts/robot_pipeline.py --seed 1014 --subconcept --regimes baseline expert

# Run specific stages only (e.g., retrain and re-evaluate on existing data)
python scripts/robot_pipeline.py --seed 1014 --subconcept --stages cbm dnn intervene collect

# Test concept missingness (MCAR, 20% of labels masked)
python scripts/robot_pipeline.py --seed 1014 --subconcept --concept-missing 0.2

Expected results (subconcept, seed=1014, threshold=0.2):

CBM (k=0): 0.7812
 budget  accuracy
      0    0.7812
      1    0.9212
      3    0.9439
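The `--concept-missing 0.2` run above masks concept labels completely at random (MCAR). A minimal sketch of what such masking could look like is given below. It is an assumption about the general technique, not the package's code: the function name, the use of NaN as the missing marker, and the random concept matrix are illustrative.

```python
import numpy as np

def mask_concepts_mcar(concepts, missing_rate, seed):
    """Mask each concept label independently with probability `missing_rate`."""
    rng = np.random.default_rng(seed)
    masked = np.asarray(concepts, dtype=float).copy()
    # MCAR: the mask depends on neither the inputs nor the concept values.
    mask = rng.random(masked.shape) < missing_rate
    masked[mask] = np.nan  # NaN marks a missing label (illustrative choice)
    return masked, mask

# Hypothetical concept matrix: 1000 samples, 12 binary concepts.
concepts = np.random.default_rng(0).integers(0, 2, size=(1000, 12))
masked, mask = mask_concepts_mcar(concepts, missing_rate=0.2, seed=1014)
print(f"fraction masked: {mask.mean():.3f}")
```

Under MNAR, by contrast, the probability of masking would depend on the concept value itself, so the observed labels would no longer be a representative sample.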

The most important parameters are listed below. For the full list, see RobotBenchmarkConfig in concept_benchmark/config.py or run python scripts/robot_pipeline.py --help.

Parameter	Default	Description
`drop_concepts`	`IDEAL_DROP`	Which concepts to exclude. Two presets are provided: `IDEAL_DROP` for 7 coarse concepts (binary foot_shape), `SUBCONCEPT_DROP` for 12 concepts (5 fine-grained foot subtypes).
`subconcept`	`False`	Shortcut that switches `drop_concepts` to `SUBCONCEPT_DROP`.
`model_features`	`{"mouth_type": "closed", "foot_shape": "pointy", "has_knees": "true"}`	Which feature values count toward the label score.
`model_weights`	`{"mouth_type": 5.0, "foot_shape": 8.0, "has_knees": -5.0}`	Concept weights for the labeling function. Score = `Σ w_i · 1[f_i = v_i] + intercept`.
`concept_missing`	`0.0`	Fraction of concept labels masked during training.
`regimes`	`["baseline"]`	How interventions are performed: `baseline` (oracle), `expert` (noisy human), `subjective` (noisy concept labels + noisy human), `machine`/`llm`/`clip` (concepts discovered via Label-Free CBM).
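To make the score formula concrete, the sketch below evaluates it for one robot using the default `model_features` and `model_weights` from the table. The intercept value and the sigmoid link used for the stochastic label are assumptions made for illustration; the package's actual values and sampling rule may differ.

```python
import math
import random

# Defaults copied from the parameter table.
MODEL_FEATURES = {"mouth_type": "closed", "foot_shape": "pointy", "has_knees": "true"}
MODEL_WEIGHTS = {"mouth_type": 5.0, "foot_shape": 8.0, "has_knees": -5.0}

def label_score(robot, features=MODEL_FEATURES, weights=MODEL_WEIGHTS, intercept=0.0):
    """Score = sum_i w_i * 1[f_i = v_i] + intercept."""
    matched = sum(weights[name] for name, value in features.items() if robot.get(name) == value)
    return matched + intercept

def sample_label(score, rng):
    """Assumed stochastic rule: P(Glorp) = sigmoid(score)."""
    p_glorp = 1.0 / (1.0 + math.exp(-score))
    return "Glorp" if rng.random() < p_glorp else "Drent"

robot = {"mouth_type": "closed", "foot_shape": "pointy", "has_knees": "true"}
score = label_score(robot, intercept=-6.0)  # the intercept is a made-up value
print(score)  # 5.0 + 8.0 - 5.0 - 6.0 = 2.0
print(sample_label(score, random.Random(0)))
```

A deterministic variant would threshold the score instead of sampling from it.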

Remaining parameters

Parameter	Default	Description
`seed`	`1014` / `1337`	Random seed (image / text)
`size`	`"medium"`	Image resolution: `"small"` (8px), `"medium"` (32px), `"large"` (600px). Image only.
`model_type`	`"stochastic"`	Labeling function: `"deterministic"` or `"stochastic"`
`concept_missing_mech`	`"none"`	Missingness mechanism: `"none"`, `"mcar"`, or `"mnar"`
`intervention_budgets`	`[1, 3]`	Number of concepts to correct per sample
`intervention_thresholds`	`[0.2, 0.4]`	Concepts whose predicted probability is within this distance of 0.5 are candidates for intervention
`intervention_strategy`	`"kflip"`	`"kflip"` (up to k concepts) or `"exact_k"` (exactly k)
`alignment_constraints`	`{}`	Sign constraints on concept weights (e.g., `{"has_knees": 1}`). Retrains the label predictor and re-evaluates interventions.
`difficulty`	`"hard"`	Corpus difficulty (text only)
`generic_rate`	`0.7`	Fraction of test set using concept-ambiguous text (text only)
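The interaction between `intervention_budgets`, `intervention_thresholds`, and `intervention_strategy` can be sketched as follows. This is one plausible reading of the descriptions in the table, not the package's code: the function name is hypothetical, and the interpretation of `"exact_k"` as "always correct the k most uncertain concepts" is an assumption.

```python
import numpy as np

def select_concepts(probs, budget, threshold, strategy="kflip"):
    """Return indices of concepts to correct for one sample, most uncertain first."""
    distance = np.abs(np.asarray(probs, dtype=float) - 0.5)
    order = np.argsort(distance, kind="stable")
    if strategy == "exact_k":
        # Assumed reading: always correct exactly `budget` concepts.
        return order[:budget].tolist()
    # "kflip": correct at most `budget` concepts, and only those within
    # `threshold` of 0.5.
    candidates = [int(i) for i in order if distance[i] <= threshold]
    return candidates[:budget]

probs = [0.95, 0.55, 0.10, 0.35]
print(select_concepts(probs, budget=3, threshold=0.2))                      # [1, 3]
print(select_concepts(probs, budget=3, threshold=0.2, strategy="exact_k"))  # [1, 3, 2]
```

With a budget of 3 and a threshold of 0.2, only two concepts are uncertain enough to qualify under `"kflip"`, so part of the budget goes unused.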

Note: The llm and clip regimes call the Gemini API at intervention time. Set your key before running:
export GEMINI_API_KEY=your_key_here

Sudoku Validation

This benchmark targets automation settings where the system handles routine cases and defers uncertain ones to a human. The task is to determine whether a 9x9 Sudoku board is valid, i.e., contains the digits 1-9 exactly once in each row, column, and block. The 27 concepts correspond to the validity of each row, column, and 3x3 block. A board is valid if and only if all 27 concepts are true (AND structure), so a single violated concept is enough to invalidate the board. When the model abstains, a human can verify specific concepts (e.g., "is row 5 valid?") to resolve the uncertainty.
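The 27 concepts and the AND rule can be written directly in plain Python. The sketch below is illustrative only: the function names and the ordering of concepts (rows, then columns, then blocks) are assumptions, and the board is a list of lists of ground-truth digits rather than an image.

```python
def sudoku_concepts(board):
    """Return 27 booleans: validity of 9 rows, then 9 columns, then 9 blocks."""
    digits = set(range(1, 10))
    rows = [set(board[r]) == digits for r in range(9)]
    cols = [{board[r][c] for r in range(9)} == digits for c in range(9)]
    blocks = [
        {board[br + dr][bc + dc] for dr in range(3) for dc in range(3)} == digits
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return rows + cols + blocks

def board_is_valid(board):
    """A board is valid if and only if all 27 concepts hold."""
    return all(sudoku_concepts(board))

# A valid board built from a standard shifting pattern.
valid = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Corrupt one cell by copying its right-hand neighbour into it.
corrupt = [row[:] for row in valid]
corrupt[4][4] = corrupt[4][5]

print(board_is_valid(valid), board_is_valid(corrupt))
print([i for i, ok in enumerate(sudoku_concepts(corrupt)) if not ok])
```

Corrupting a single cell violates the row, column, and block that contain it, which is why one bad digit is enough to invalidate the board.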

[Figure: Sudoku board with handwritten digits and concept annotations]

The concept-supervised (CS) model -- the Sudoku equivalent of a CBM -- predicts 27 binary concepts, then a label predictor determines board validity. The selective classification stage finds a confidence threshold that achieves at least 95% accuracy on kept predictions.

# Run the full pipeline (generates boards, trains OCR + models, evaluates)
python scripts/sudoku_pipeline.py --seed 171

# Skip data regeneration (reuse existing boards), only retrain models
python scripts/sudoku_pipeline.py --seed 171 --stages cs dnn selective intervene align collect

Expected results (seed=171, target_accuracy=0.95):

model  selective_acc  selective_cov
  dnn          0.875           0.04
   cs          0.915           1.00
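The selective_acc and selective_cov columns report accuracy and coverage on the predictions the model keeps. One simple way to pick a confidence threshold for a target accuracy is sketched below. It is a generic selective-classification procedure under assumed inputs, not necessarily what the pipeline does (for example, the pipeline may fit the threshold on a separate validation split); the function name and data are hypothetical.

```python
import numpy as np

def select_threshold(confidence, correct, target_accuracy):
    """Return (threshold, accuracy, coverage) with the largest coverage meeting the target.

    Returns None if no threshold reaches the target accuracy.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(-confidence, kind="stable")  # most confident first
    sorted_conf = confidence[order]
    running_acc = np.cumsum(correct[order]) / np.arange(1, len(order) + 1)
    best = None
    for i in range(len(order)):
        # Only cut between distinct confidence values, so ties are kept together.
        is_boundary = i == len(order) - 1 or sorted_conf[i + 1] < sorted_conf[i]
        if is_boundary and running_acc[i] >= target_accuracy:
            best = (float(sorted_conf[i]), float(running_acc[i]), (i + 1) / len(order))
    return best

confidence = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60]
correct = [1, 1, 1, 0, 1, 0]
print(select_threshold(confidence, correct, target_accuracy=0.95))  # (0.9, 1.0, 0.5)
```

Predictions with confidence below the returned threshold are deferred to a human. When the function returns None, the target cannot be met at any coverage.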

The most important parameters are listed below. For the full list, see SudokuBenchmarkConfig in concept_benchmark/config.py or run python scripts/sudoku_pipeline.py --help.

Parameter	Default	Description
`max_corrupt`	`9`	Number of cells corrupted in invalid boards (lower values produce subtler errors)
`data_type`	`"image"`	`"image"` evaluates on OCR-inferred digits (adds OCR stage); `"tabular"` evaluates on ground-truth digit values (no OCR). Training always uses ground-truth values.
`handwriting`	`True`	Render digits in handwritten style (only applies when `data_type="image"`)
`target_accuracy`	`0.9`	Minimum accuracy required on kept predictions

Remaining parameters

Parameter	Default	Description
`seed`	`171`	Random seed
`n_samples`	`1000`	Number of boards to generate
`valid_ratio`	`0.5`	Fraction of valid boards
`intervention_thresholds`	`[0.2, 0.4, 0.6, 0.8]`	Concept confidence thresholds that determine which concepts are candidates for verification

Citation

If you use this package in your research, please cite:

@article{skirzynski2026concept,
  title={Measuring What Matters: Synthetic Benchmarks for Concept Bottleneck Models},
  author={Skirzy\'{n}ski, Julian and Cheon, Harry and Kadekodi, Shreyas and Stewart, Meredith and Ustun, Berk},
  year={2026},
}

Project details

Release history

Version	Release date
0.2.0 (this version)	Mar 9, 2026
0.1.5	Feb 27, 2026
0.1.4	Feb 26, 2026
0.1.2	Feb 26, 2026
0.1.1	Feb 25, 2026
0.1.0	Feb 25, 2026

Download files

Source Distribution

concept_benchmark-0.2.0.tar.gz (2.2 MB), uploaded Mar 9, 2026, Source

Built Distribution

concept_benchmark-0.2.0-py3-none-any.whl (161.8 kB), uploaded Mar 9, 2026, Python 3

File details

Details for the file concept_benchmark-0.2.0.tar.gz.

File metadata

Download URL: concept_benchmark-0.2.0.tar.gz
Upload date: Mar 9, 2026
Size: 2.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for concept_benchmark-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`f9d1e04a3405c5d372f8dad71de3a0418bdd1e74e14d9e98e6411b053ca6e8a0`
MD5	`c005d34cb50c980dfe04c04f6e844edd`
BLAKE2b-256	`98f59358f11c6480e9a53395fb924bf4e1909bd989d3a56cf825e2d3c6f0ec1d`

File details

Details for the file concept_benchmark-0.2.0-py3-none-any.whl.

File metadata

Download URL: concept_benchmark-0.2.0-py3-none-any.whl
Upload date: Mar 9, 2026
Size: 161.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for concept_benchmark-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`20ea18b06a6ef105d3fa6f95a9014a191dba403099717c2145e963fc6e2ab46e`
MD5	`71223b94d39f084f0bffdfc4f3505210`
BLAKE2b-256	`8dbeb43aeaaf5d8f7d31a73ac147e6f4bbcb7041f6ed107d028b31792f50e157`
