Python client + typed contract for the BenchHub benchmarking platform
BenchHub
Benchmark model predictions against curated datasets — with a typed contract end-to-end. Pick (or import) a dataset, define metrics and visualizations in Python, submit predictions with the client, and see how your model ranks sample-by-sample. Live at https://runbenchhub.com.
What makes it different
Everything hangs off one idea — a typed contract. Every field (a dataset
column, a leaderboard input, a prediction) declares a kind
(image, mask, depth, audio, label, label_list, bboxes,
scalar, text, json, sequence, coco_detections, plus any
user-registered kind). The dataset, the benchhub-client, and the metric
engine all agree on kinds, so data is decoded, scored, and rendered without
guessing. Kinds are defined in benchhub/types.py and listed live at
/supported_types.
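As a rough illustration of what "agreeing on kinds" can mean in practice, the sketch below checks a set of provided fields against the fields a leaderboard expects. It is a minimal stand-in, not BenchHub's implementation: `Field`, `check_contract`, and `BUILTIN_KINDS` are hypothetical names, and only the kind strings come from the list above.

```python
from dataclasses import dataclass

# Built-in kinds as listed above; user-registered kinds would extend this set.
BUILTIN_KINDS = {
    "image", "mask", "depth", "audio", "label", "label_list", "bboxes",
    "scalar", "text", "json", "sequence", "coco_detections",
}


@dataclass(frozen=True)
class Field:
    """Hypothetical field declaration: a name plus the kind it carries."""
    name: str
    kind: str


def check_contract(expected, provided, known_kinds=BUILTIN_KINDS):
    """Return a list of human-readable mismatches between two field lists."""
    problems = []
    provided_by_name = {f.name: f for f in provided}
    for field in expected:
        if field.kind not in known_kinds:
            problems.append(f"{field.name}: unknown kind {field.kind!r}")
            continue
        got = provided_by_name.get(field.name)
        if got is None:
            problems.append(f"{field.name}: missing")
        elif got.kind != field.kind:
            problems.append(f"{field.name}: expected {field.kind}, got {got.kind}")
    return problems


leaderboard_inputs = [Field("image", "image"), Field("label_pred", "label")]
good = [Field("image", "image"), Field("label_pred", "label")]
bad = [Field("image", "image"), Field("label_pred", "text")]

print(check_contract(leaderboard_inputs, good))  # []
print(check_contract(leaderboard_inputs, bad))   # ['label_pred: expected label, got text']
```

Because every side compares the same kind strings, a mismatch surfaces as an explicit error before any decoding or scoring is attempted.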
Features
- Passwordless auth: GitHub, Google, or a one-time email code.
- Self-service HuggingFace import: one button. The tabular importer (Croissant / parquet) falls back automatically to a file-tree mapper (paired files, packed .npz/.h5/archives, video clips) with a "describe the structure" role wizard, loaders for file/token/npz/json/csv/parquet/hdf5/zip/tar/gz/sequence, a decode preview, variant fan-out, draft autosave, and a split/subfolder picker for huge repos. Datasets are shown before import via their HF card.
- Two-tier storage: datasets cache as a cheap preview tier; each leaderboard materializes a chosen sample subset at full resolution.
- Metrics & visualizations in Python: typed signatures, per-sample + aggregated, pooling (mean/median/percentile/min/max), dependency chaining.
- Hardened sandbox: all user-supplied code (metrics, visualizations, and a registered type's visualize) runs in a short-lived --network=none --read-only, memory/CPU-capped container, never in-process on the server.
- User-registered data types: declare a new kind (its storage + a sandboxed visualize(blob, params)) from the web (/supported_types) or the client; it joins the global kind namespace and renders in the views.
- benchhub-client + dev kit: iter_samples → predict → submit; programmatic dataset creation; create_metric / create_visualization / create_datatype; and benchhub.author.test_metric / test_visualization to iterate locally before uploading.
- Per-row visibility (public / unlisted / private) with dependency guards: once another user depends on your dataset/LB (binds it / submits to it), it can't be made private or deleted.
- Split-bucket quotas: 50 GB public + 10 GB private per user (live usage on Account settings + the Storage-usage page).
- Async processing with Celery (Redis broker); API tokens; public landing, catalog (/leaderboards, /datasets), and profiles (/u/<id>).
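The pooling modes named in the feature list (mean/median/percentile/min/max) reduce per-sample scores to one leaderboard number. The sketch below shows one plausible reading of those modes using only the standard library. The `pool` function and its `how`/`q` parameters are assumptions for illustration, and the percentile rule (linear interpolation between nearest ranks) is one common convention, not necessarily the one BenchHub uses.

```python
import statistics


def pool(values, how="mean", q=None):
    """Aggregate per-sample scores into one number (hypothetical helper)."""
    if not values:
        raise ValueError("cannot pool an empty list of scores")
    if how == "mean":
        return statistics.fmean(values)
    if how == "median":
        return statistics.median(values)
    if how == "min":
        return min(values)
    if how == "max":
        return max(values)
    if how == "percentile":
        if q is None or not 0 <= q <= 100:
            raise ValueError("percentile pooling needs q in [0, 100]")
        ordered = sorted(values)
        # Linear interpolation between the two nearest ranks.
        pos = (len(ordered) - 1) * q / 100
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    raise ValueError(f"unknown pooling mode: {how!r}")


per_sample_iou = [0.5, 0.75, 1.0, 0.25]
print(pool(per_sample_iou, "mean"))               # 0.625
print(pool(per_sample_iou, "min"))                # 0.25
print(pool(per_sample_iou, "percentile", q=100))  # 1.0
```

Keeping per-sample scores and pooling them afterwards is what allows the same submission to be ranked both in aggregate and sample by sample.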
Quickstart (submitting to a leaderboard)
pip install -U benchhub-client  # PyPI: benchhub-client

import benchhub as bh

client = bh.Client(token="bh_...")  # token from /settings/api_tokens
sub = client.submission(LB_ID, name="my-model-v1")
for sample_name, inputs in client.iter_samples(LB_ID):
    image = inputs["image"].array  # decoded bh.Image -> (H,W,3) uint8
    pred = my_model(image)
    sub.predict(sample_name, label_pred=bh.Label(int(pred)))
print(sub.submit(description="ResNet-50"))
Authoring metrics / visualizations / data types
import benchhub as bh

def my_iou(gt: bh.Mask, pred: bh.Mask):  # input_kinds auto-derive from annotations
    g, p = gt.array, pred.array
    inter = ((g == 1) & (p == 1)).sum()
    union = ((g == 1) | (p == 1)).sum()
    return float(inter / union) if union else 1.0

bh.author.test_metric(my_iou, gt=gt_mask, pred=pred_mask)  # iterate locally
client.create_metric("my_iou", my_iou)  # then upload (sandboxed server-side)
client.create_visualization(...) (returns a PIL.Image) and
client.create_datatype(...) (a new kind + a sandboxed visualize) work the
same way.
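The comment "input_kinds auto-derive from annotations" can be made concrete with a small sketch of how parameter annotations might be turned into kinds. This is an assumption about the mechanism, not the client's actual code: `Mask` and `Label` here are local stand-ins for bh.Mask and bh.Label, and `derive_input_kinds` and the `kind` class attribute are hypothetical.

```python
import inspect


class Mask:
    kind = "mask"   # stand-in for bh.Mask (the attribute name is an assumption)


class Label:
    kind = "label"  # stand-in for bh.Label (the attribute name is an assumption)


def derive_input_kinds(fn):
    """Map each parameter name to the kind named by its type annotation."""
    kinds = {}
    for name, param in inspect.signature(fn).parameters.items():
        annotation = param.annotation
        if annotation is inspect.Parameter.empty:
            raise TypeError(f"parameter {name!r} has no type annotation")
        kind = getattr(annotation, "kind", None)
        if kind is None:
            raise TypeError(f"parameter {name!r} is not annotated with a typed kind")
        kinds[name] = kind
    return kinds


def my_iou(gt: Mask, pred: Mask):
    ...


def accuracy(gt: Label, pred: Label):
    ...


print(derive_input_kinds(my_iou))    # {'gt': 'mask', 'pred': 'mask'}
print(derive_input_kinds(accuracy))  # {'gt': 'label', 'pred': 'label'}
```

Under this reading, the function signature is the single place where a metric states what it consumes, so no separate input declaration has to be kept in sync with the code.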
Documentation
- In-app docs at /docs: overview, core concepts, importing data, data types, leaderboards, writing metrics & visualizations, submitting predictions, API/client reference, tutorials.
- Architecture: docs/ARCHITECTURE.md (editable drawio source under docs/diagrams/).
- Dev notes / history: CLAUDE.md (durable architecture + gotchas), the granular subsystem notes under skills/, and the dated notes under docs/.
- Feature requests + bugs: GitHub issues.
Run it locally
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
redis-server # 1. broker (port 6379)
celery -A app.celery worker --loglevel=info # 2. worker
python app.py # 3. web app -> http://localhost:6060
The database + uploads live outside the repo in a data directory — set its
location with BENCHHUB_DATA_DIR=/some/path. To run user code in the
container sandbox locally, install Docker, build the runner image
(docker build -f runner/Dockerfile -t benchhub-runner .), and set
BENCHHUB_SANDBOX_METRICS=1.
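To give a feel for what a locked-down container invocation looks like, here is a sketch that builds a docker run command line with the restrictions the feature list mentions (no network, read-only filesystem, memory/CPU caps). How BenchHub actually launches its runner is not shown in this README, so the `sandbox_argv` helper, the default limits, and the command passed to the image are all assumptions. Only the image name benchhub-runner and the flags --network=none and --read-only come from the text.

```python
import shlex


def sandbox_argv(image, command, memory="512m", cpus="1.0"):
    """Build a `docker run` argv for a locked-down, throwaway container."""
    return [
        "docker", "run",
        "--rm",                 # remove the container when it exits
        "--network=none",       # no network access
        "--read-only",          # read-only root filesystem
        f"--memory={memory}",   # memory cap (default value is an assumption)
        f"--cpus={cpus}",       # CPU cap (default value is an assumption)
        image,
        *command,
    ]


argv = sandbox_argv("benchhub-runner", ["python", "-c", "print('ok')"])
print(shlex.join(argv))  # shell-quoted form, handy for logging
```

The list form can be handed to subprocess.run(argv, timeout=...) without going through a shell, which avoids quoting problems when the command contains user-supplied text.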
Tests
pytest tests/ # ~1080 tests; use tests/ (not bare pytest)
Deployment
Self-hosted on an Ubuntu box at runbenchhub.com (gunicorn + celery + redis
under systemd, nginx + certbot, Cloudflare DNS-only). The operational
runbook — push flow, .env keys, logs, rollback — is
docs/SELFHOST_RUNBOOK.md.
License
MIT — see LICENSE.
File details
Details for the file benchhub_client-0.1.10.tar.gz.
File metadata
- Download URL: benchhub_client-0.1.10.tar.gz
- Upload date:
- Size: 243.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9993c0fbe3592baa8d84e6e7432c6d9d3ab2bd30034cae2d0ed9e8c24acf315f |
| MD5 | ea3d3258eb900b5021e65e97580410ce |
| BLAKE2b-256 | 7848f45c8f77ca5a0188d08a6a7d558797554be7015ebbd87234a235daff5454 |
File details
Details for the file benchhub_client-0.1.10-py3-none-any.whl.
File metadata
- Download URL: benchhub_client-0.1.10-py3-none-any.whl
- Upload date:
- Size: 77.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f9d37363c058d513b4c5eb03efb0ba6addade4c62d74939c0ed0c5b4e8d14373 |
| MD5 | ab6748e832cd662c78edcceacdb9361d |
| BLAKE2b-256 | 8b1de2e759a60830e90174e3e5aa68768f43bc695f3fbe27994d4e5877ac91ba |