Python client + typed contract for the BenchHub benchmarking platform

BenchHub

Benchmark model predictions against curated datasets — with a typed contract end-to-end. Pick (or import) a dataset, define metrics and visualizations in Python, submit predictions with the client, and see how your model ranks sample-by-sample. Live at https://runbenchhub.com.

[Figure: BenchHub pipeline]

What makes it different

Everything hangs off one idea — a typed contract. Every field (a dataset column, a leaderboard input, a prediction) declares a kind (image, mask, depth, audio, label, label_list, bboxes, scalar, text, json, sequence, coco_detections, plus any user-registered kind). The dataset, the benchhub-client, and the metric engine all agree on kinds, so data is decoded, scored, and rendered without guessing. Kinds are defined in benchhub/types.py and listed live at /supported_types.
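To make the idea concrete, here is a minimal sketch of how a kind-based contract check could look. It is not BenchHub's implementation: `Field`, `check_contract`, and `check_prediction` are hypothetical names invented for illustration, and only the list of built-in kinds comes from the text above.

```python
from dataclasses import dataclass

# Built-in kinds as listed above; user-registered kinds would extend this set.
BUILTIN_KINDS = {
    "image", "mask", "depth", "audio", "label", "label_list", "bboxes",
    "scalar", "text", "json", "sequence", "coco_detections",
}


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one contract entry: a field name plus its kind."""
    name: str
    kind: str


def check_contract(fields, known_kinds=BUILTIN_KINDS):
    """Return the names of fields whose kind is not in the known namespace."""
    return [f.name for f in fields if f.kind not in known_kinds]


def check_prediction(contract, prediction_kinds):
    """Compare a submission's {field name: kind} mapping against the contract.

    Returns a list of human-readable problems; an empty list means the
    prediction matches the contract.
    """
    problems = []
    expected = {f.name: f.kind for f in contract}
    for name, kind in expected.items():
        if name not in prediction_kinds:
            problems.append(f"missing field: {name}")
        elif prediction_kinds[name] != kind:
            problems.append(f"{name}: expected {kind}, got {prediction_kinds[name]}")
    for name in prediction_kinds:
        if name not in expected:
            problems.append(f"unexpected field: {name}")
    return problems


contract = [Field("label_pred", "label")]
print(check_contract(contract))
print(check_prediction(contract, {"label_pred": "label"}))
print(check_prediction(contract, {"label_pred": "mask"}))
```

Because every side of the pipeline shares the same kind names, a mismatch like the last call can be reported before any decoding or scoring happens.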

Features

Passwordless auth — GitHub, Google, or a one-time email code.
Self-service HuggingFace import: one button. The tabular importer (Croissant / parquet) falls back automatically to a file-tree mapper (paired files, packed .npz/.h5/archives, video clips) with a "describe the structure" role wizard, loaders for file/token/npz/json/csv/parquet/hdf5/zip/tar/gz/sequence, a decode preview, variant fan-out, draft autosave, and a split/subfolder picker for huge repos. Datasets are shown before import via their HF card.
Two-tier storage — datasets cache as a cheap preview tier; each leaderboard materializes a chosen sample subset at full resolution.
Metrics & visualizations in Python — typed signatures, per-sample + aggregated, pooling (mean/median/percentile/min/max), dependency chaining.
Hardened sandbox — all user-supplied code (metrics, visualizations, and a registered type's visualize) runs in a short-lived --network=none --read-only, memory/CPU-capped container — never in-process on the server.
User-registered data types — declare a new kind (its storage + a sandboxed visualize(blob, params)) from the web (/supported_types) or the client; it joins the global kind namespace and renders in the views.
benchhub-client + dev kit — iter_samples → predict → submit; programmatic dataset creation; create_metric / create_visualization / create_datatype; and benchhub.author.test_metric / test_visualization to iterate locally before uploading.
Per-row visibility (public / unlisted / private) with dependency guards — once another user depends on your dataset/LB (binds it / submits to it), it can't be made private or deleted.
Split-bucket quotas — 50 GB public + 10 GB private per user (live usage on Account settings + the Storage-usage page).
Async processing with Celery (Redis broker); API tokens; public landing, catalog (/leaderboards, /datasets), and profiles (/u/<id>).
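The pooling options named in the metrics feature (mean/median/percentile/min/max) reduce per-sample scores to one aggregated number. The sketch below shows one way to do that with the standard library. The `pool` and `percentile` helpers are hypothetical, and the linear-interpolation percentile is an assumed convention; BenchHub's own may differ.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100] (an assumed convention)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no scores to pool")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def pool(scores, how="mean", q=None):
    """Aggregate per-sample scores into a single value."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores to pool")
    if how == "mean":
        return statistics.fmean(scores)
    if how == "median":
        return statistics.median(scores)
    if how == "min":
        return min(scores)
    if how == "max":
        return max(scores)
    if how == "percentile":
        if q is None:
            raise ValueError("percentile pooling needs q")
        return percentile(scores, q)
    raise ValueError(f"unknown pooling: {how}")


per_sample = [1.0, 2.0, 3.0, 4.0]
for how in ("mean", "median", "min", "max"):
    print(how, pool(per_sample, how))
print("p25", pool(per_sample, "percentile", q=25))
```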

Quickstart (submitting to a leaderboard)

pip install -U benchhub-client      # PyPI: benchhub-client

import benchhub as bh

client = bh.Client(token="bh_...")            # token from /settings/api_tokens
sub = client.submission(LB_ID, name="my-model-v1")   # LB_ID: id of the target leaderboard

for sample_name, inputs in client.iter_samples(LB_ID):
    image = inputs["image"].array             # decoded bh.Image -> (H,W,3) uint8
    pred = my_model(image)                    # my_model: your own predictor
    sub.predict(sample_name, label_pred=bh.Label(int(pred)))

print(sub.submit(description="ResNet-50"))

Authoring metrics / visualizations / data types

import benchhub as bh

def my_iou(gt: bh.Mask, pred: bh.Mask):       # input_kinds auto-derive from annotations
    g, p = gt.array, pred.array
    inter = ((g == 1) & (p == 1)).sum()       # pixels positive in both masks
    union = ((g == 1) | (p == 1)).sum()       # pixels positive in either mask
    return float(inter / union) if union else 1.0   # two empty masks count as a match

bh.author.test_metric(my_iou, gt=gt_mask, pred=pred_mask)   # iterate locally
client.create_metric("my_iou", my_iou)                      # then upload (sandboxed server-side)

client.create_visualization(...) (whose function returns a PIL.Image) and client.create_datatype(...) (a new kind plus a sandboxed visualize) work the same way.
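For a quick sanity check of the IoU logic in my_iou, the arithmetic can be worked through on tiny masks without BenchHub installed. This is a sketch only: plain nested lists stand in for the arrays, and `iou` is a hypothetical pure-Python rewrite of the same formula, not part of the client.

```python
def iou(gt, pred, positive=1):
    """IoU of the `positive` class over two equally shaped 2-D masks (nested lists)."""
    if len(gt) != len(pred) or any(len(g) != len(p) for g, p in zip(gt, pred)):
        raise ValueError("masks must have the same shape")
    inter = union = 0
    for g_row, p_row in zip(gt, pred):
        for g, p in zip(g_row, p_row):
            g_on, p_on = g == positive, p == positive
            inter += g_on and p_on   # positive in both masks
            union += g_on or p_on    # positive in at least one mask
    # Same convention as my_iou: two empty masks count as a perfect match.
    return inter / union if union else 1.0


gt = [[1, 1],
      [0, 0]]
pred = [[1, 0],
        [1, 0]]

# One pixel overlaps; three pixels are positive in at least one mask.
print(iou(gt, pred))
# No positives anywhere, so the union is 0 and the fallback applies.
print(iou([[0, 0]], [[0, 0]]))
```

The first call gives 1/3 (intersection 1, union 3). The second exercises the empty-union branch, which is the case most likely to surprise when a metric is first run on real data.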

Documentation

In-app docs at /docs: overview, core concepts, importing data, data types, leaderboards, writing metrics & visualizations, submitting predictions, API/client reference, tutorials.
Architecture: docs/ARCHITECTURE.md (editable drawio source under docs/diagrams/).
Dev notes / history: CLAUDE.md (durable architecture + gotchas), the granular subsystem notes under skills/, and the dated notes under docs/.
Feature requests + bugs: GitHub issues.

Run it locally

python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

redis-server                                   # 1. broker (port 6379)
celery -A app.celery worker --loglevel=info    # 2. worker
python app.py                                  # 3. web app  -> http://localhost:6060

The database + uploads live outside the repo in a data directory — set its location with BENCHHUB_DATA_DIR=/some/path. To run user code in the container sandbox locally, install Docker, build the runner image (docker build -f runner/Dockerfile -t benchhub-runner .), and set BENCHHUB_SANDBOX_METRICS=1.
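As an illustration of what a locked-down container invocation involves, the sketch below builds (but does not run) a `docker run` argument list with the restrictions described for the sandbox. The `benchhub-runner` image name and the BENCHHUB_SANDBOX_METRICS variable come from the text above. The helper names, the memory and CPU values, and the exact flag set are assumptions, not the project's actual runner code.

```python
import os


def sandbox_enabled(env=os.environ):
    """True when BENCHHUB_SANDBOX_METRICS is set to "1"."""
    return env.get("BENCHHUB_SANDBOX_METRICS") == "1"


def sandbox_command(image="benchhub-runner", memory="512m", cpus="1.0", args=()):
    """Build a locked-down `docker run` argv list without executing it."""
    return [
        "docker", "run",
        "--rm",                 # remove the container when it exits (short-lived)
        "--network=none",       # no network access
        "--read-only",          # read-only root filesystem
        f"--memory={memory}",   # memory cap (value is an assumption)
        f"--cpus={cpus}",       # CPU cap (value is an assumption)
        image,
        *args,
    ]


cmd = sandbox_command(args=["python", "-c", "print(1)"])
print(" ".join(cmd))
print(sandbox_enabled({"BENCHHUB_SANDBOX_METRICS": "1"}))
```

Passing the list to `subprocess.run(cmd, check=True)` would actually start the container, which requires Docker and the built runner image.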

Tests

pytest tests/         # ~1080 tests; use tests/ (not bare pytest)

Deployment

Self-hosted on an Ubuntu box at runbenchhub.com (gunicorn + celery + redis under systemd, nginx + certbot, Cloudflare DNS-only). The operational runbook — push flow, .env keys, logs, rollback — is docs/SELFHOST_RUNBOOK.md.

License

MIT — see LICENSE.

Release history

0.1.10: Jun 5, 2026
0.1.9: Jun 5, 2026 (this version)
0.1.8: Jun 5, 2026
0.1.5: May 31, 2026
0.1.4: May 31, 2026
0.1.3: May 29, 2026
0.1.2: May 28, 2026
0.1.1: May 28, 2026
0.1.0: May 28, 2026

Download files

Source Distribution

benchhub_client-0.1.9.tar.gz (241.0 kB view details)

Uploaded Jun 5, 2026 Source

Built Distribution

benchhub_client-0.1.9-py3-none-any.whl (75.9 kB view details)

Uploaded Jun 5, 2026 Python 3

File details

Details for the file benchhub_client-0.1.9.tar.gz.

File metadata

Download URL: benchhub_client-0.1.9.tar.gz
Upload date: Jun 5, 2026
Size: 241.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for benchhub_client-0.1.9.tar.gz
Algorithm	Hash digest
SHA256	`53fcd229573084019d4bdf47411000ca343fb4e08da9573d691d07380d78330f`
MD5	`b8a14ac2a500410031661af7ab4693c0`
BLAKE2b-256	`1d566f40bb7d4e8e12f8c39aeabcd08cc420f6bd061752211119e399fe0e1e83`

File details

Details for the file benchhub_client-0.1.9-py3-none-any.whl.

File metadata

Download URL: benchhub_client-0.1.9-py3-none-any.whl
Upload date: Jun 5, 2026
Size: 75.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for benchhub_client-0.1.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`11db8ede28d2c53caba5dbce42db1350f52b0031787d8f980a0ce9fb549ef6e9`
MD5	`f21c40b457a87e14fea3cf7ff249d1db`
BLAKE2b-256	`b3228aa60ed01744b05a6dab8a02fc74175a9d5e13d0df7c712eb7b55df6b64a`
