Python client + typed contract for the BenchHub benchmarking platform

BenchHub

Benchmark model predictions against curated datasets — with a typed contract end-to-end. Pick (or import) a dataset, define metrics and visualizations in Python, submit predictions with the client, and see how your model ranks sample-by-sample. Live at https://runbenchhub.com.

[Figure: BenchHub pipeline]


What makes it different

Everything hangs off one idea — a typed contract. Every field (a dataset column, a leaderboard input, a prediction) declares a kind (image, mask, depth, audio, label, label_list, bboxes, scalar, text, json, sequence, coco_detections, plus any user-registered kind). The dataset, the benchhub-client, and the metric engine all agree on kinds, so data is decoded, scored, and rendered without guessing. Kinds are defined in benchhub/types.py and listed live at /supported_types.
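To make the idea concrete, here is a minimal, dependency-free sketch of what "agreeing on kinds" can look like. It is not BenchHub's implementation: the `Field` class, the `check_contract` helper, and the example field names are hypothetical, and only the kind names are taken from the list above.

```python
from dataclasses import dataclass

# Built-in kind names copied from the list above; user-registered kinds
# would extend this set.
BUILTIN_KINDS = {
    "image", "mask", "depth", "audio", "label", "label_list", "bboxes",
    "scalar", "text", "json", "sequence", "coco_detections",
}


@dataclass(frozen=True)
class Field:
    """Hypothetical description of one typed field: a name plus its kind."""
    name: str
    kind: str


def check_contract(expected, provided, known_kinds=BUILTIN_KINDS):
    """Return a list of human-readable mismatches between two field lists."""
    problems = []
    provided_by_name = {f.name: f for f in provided}
    for field in expected:
        if field.kind not in known_kinds:
            problems.append(f"{field.name}: unknown kind {field.kind!r}")
            continue
        got = provided_by_name.get(field.name)
        if got is None:
            problems.append(f"{field.name}: missing")
        elif got.kind != field.kind:
            problems.append(
                f"{field.name}: expected {field.kind}, got {got.kind}"
            )
    return problems


# A leaderboard that expects a mask prediction, and a submission that
# sends a label instead.
expected = [Field("image", "image"), Field("mask_pred", "mask")]
provided = [Field("image", "image"), Field("mask_pred", "label")]
mismatches = check_contract(expected, provided)
print(mismatches)
```

Checking kinds up front, before any decoding or scoring, is what lets a mismatch surface as a clear message rather than as a failure deep inside a metric.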

Features

  • Passwordless auth — GitHub, Google, or a one-time email code.
  • Self-service HuggingFace import, one button: the tabular importer (Croissant / parquet) falls back automatically to a file-tree mapper (paired files, packed .npz/.h5/archives, video clips) with a "describe the structure" role wizard, loaders for file/token/npz/json/csv/parquet/hdf5/zip/tar/gz/sequence, a decode preview, variant fan-out, draft autosave, and a split/subfolder picker for huge repos. Datasets are shown before import via their HF card.
  • Two-tier storage — datasets cache as a cheap preview tier; each leaderboard materializes a chosen sample subset at full resolution.
  • Metrics & visualizations in Python — typed signatures, per-sample + aggregated, pooling (mean/median/percentile/min/max), dependency chaining.
  • Hardened sandbox: all user-supplied code (metrics, visualizations, and a registered type's visualize) runs in a short-lived, memory/CPU-capped container started with --network=none --read-only, never in-process on the server.
  • User-registered data types — declare a new kind (its storage + a sandboxed visualize(blob, params)) from the web (/supported_types) or the client; it joins the global kind namespace and renders in the views.
  • benchhub-client + dev kit: iter_samples → predict → submit; programmatic dataset creation; create_metric / create_visualization / create_datatype; and benchhub.author.test_metric / test_visualization to iterate locally before uploading.
  • Per-row visibility (public / unlisted / private) with dependency guards — once another user depends on your dataset/LB (binds it / submits to it), it can't be made private or deleted.
  • Split-bucket quotas — 50 GB public + 10 GB private per user (live usage on Account settings + the Storage-usage page).
  • Async processing with Celery (Redis broker); API tokens; public landing, catalog (/leaderboards, /datasets), and profiles (/u/<id>).
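The pooling options named in the metrics bullet (mean/median/percentile/min/max) can be illustrated with a small standard-library sketch. This is an assumption about what pooling means here, namely collapsing per-sample scores into one aggregate value; the `pool` and `percentile` helpers are hypothetical, and the linear-interpolation percentile is one common convention, not necessarily the one BenchHub uses.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no scores to pool")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


POOLERS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "min": min,
    "max": max,
}


def pool(scores, how="mean", q=None):
    """Collapse per-sample scores into one aggregate value."""
    if how == "percentile":
        if q is None:
            raise ValueError("percentile pooling needs q")
        return percentile(scores, q)
    return POOLERS[how](scores)


# Four per-sample scores, pooled every supported way.
per_sample_iou = [0.5, 0.75, 1.0, 0.25]
summary = {how: pool(per_sample_iou, how) for how in POOLERS}
summary["p90"] = pool(per_sample_iou, "percentile", q=90)
print(summary)
```

Keeping the per-sample scores and pooling them afterwards is what makes both views possible: the aggregate ranks models, while the individual scores support sample-by-sample comparison.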

Quickstart (submitting to a leaderboard)

pip install -U benchhub-client      # PyPI: benchhub-client

import benchhub as bh

client = bh.Client(token="bh_...")            # token from /settings/api_tokens
sub = client.submission(LB_ID, name="my-model-v1")

for sample_name, inputs in client.iter_samples(LB_ID):
    image = inputs["image"].array             # decoded bh.Image -> (H,W,3) uint8
    pred  = my_model(image)
    sub.predict(sample_name, label_pred=bh.Label(int(pred)))

print(sub.submit(description="ResNet-50"))

Authoring metrics / visualizations / data types

import benchhub as bh

def my_iou(gt: bh.Mask, pred: bh.Mask):       # input_kinds auto-derive from annotations
    g, p = gt.array, pred.array
    inter = ((g == 1) & (p == 1)).sum(); union = ((g == 1) | (p == 1)).sum()
    return float(inter / union) if union else 1.0

bh.author.test_metric(my_iou, gt=gt_mask, pred=pred_mask)   # iterate locally
client.create_metric("my_iou", my_iou)                      # then upload (sandboxed server-side)

client.create_visualization(...) (returns a PIL.Image) and client.create_datatype(...) (a new kind + a sandboxed visualize) work the same way.
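As a worked check of the `my_iou` logic above, the following sketch restates it on nested lists so it runs without NumPy or the client installed. The `fake_mask` stand-in is hypothetical: it only mimics the `.array` attribute that the metric reads from `bh.Mask`, and it is not part of the BenchHub API.

```python
from types import SimpleNamespace


def iou_lists(gt, pred):
    """Same logic as my_iou, on nested lists of 0/1 instead of arrays."""
    pairs = [
        (g, p)
        for g_row, p_row in zip(gt.array, pred.array)
        for g, p in zip(g_row, p_row)
    ]
    inter = sum(1 for g, p in pairs if g == 1 and p == 1)
    union = sum(1 for g, p in pairs if g == 1 or p == 1)
    # Both masks empty: treated as a perfect match, as in my_iou.
    return inter / union if union else 1.0


def fake_mask(rows):
    """Hypothetical stand-in for bh.Mask: anything exposing .array."""
    return SimpleNamespace(array=rows)


gt = fake_mask([[1, 1, 0],
                [0, 1, 0]])
pred = fake_mask([[1, 0, 0],
                  [0, 1, 1]])
empty = fake_mask([[0, 0, 0],
                   [0, 0, 0]])

print(iou_lists(gt, pred))      # 2 overlapping pixels / 4 pixels in the union
print(iou_lists(empty, empty))  # empty union falls back to 1.0
print(iou_lists(gt, empty))     # no overlap, non-empty union
```

The empty-union branch matters for datasets where some samples contain no foreground: without it the metric would divide by zero on exactly the samples where the model correctly predicts nothing.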

Documentation

  • In-app docs at /docs: overview, core concepts, importing data, data types, leaderboards, writing metrics & visualizations, submitting predictions, API/client reference, tutorials.
  • Architecture: docs/ARCHITECTURE.md (editable drawio source under docs/diagrams/).
  • Dev notes / history: CLAUDE.md (durable architecture + gotchas), the granular subsystem notes under skills/, and the dated notes under docs/.
  • Feature requests + bugs: GitHub issues.

Run it locally

python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

redis-server                                   # 1. broker (port 6379)
celery -A app.celery worker --loglevel=info    # 2. worker
python app.py                                  # 3. web app  -> http://localhost:6060

The database + uploads live outside the repo in a data directory — set its location with BENCHHUB_DATA_DIR=/some/path. To run user code in the container sandbox locally, install Docker, build the runner image (docker build -f runner/Dockerfile -t benchhub-runner .), and set BENCHHUB_SANDBOX_METRICS=1.
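The isolation flags mentioned under Features can be pictured as a `docker run` command line. The sketch below only builds and prints such a command; it is not how BenchHub launches its runner. The `sandbox_command` and `sandbox_enabled` helpers are hypothetical, the memory and CPU values are placeholders, and reading `BENCHHUB_SANDBOX_METRICS` as an on/off switch equal to "1" is an assumption based on the sentence above.

```python
import os
import shlex


def sandbox_enabled(env=os.environ):
    """Assumption: the sandbox is on when BENCHHUB_SANDBOX_METRICS is "1"."""
    return env.get("BENCHHUB_SANDBOX_METRICS") == "1"


def sandbox_command(image="benchhub-runner", memory="512m", cpus="1"):
    """Build a `docker run` argv with the isolation flags named in this README.

    The memory and CPU values are illustrative, not BenchHub's real limits.
    """
    return [
        "docker", "run", "--rm",
        "--network=none",      # no network access inside the container
        "--read-only",         # root filesystem mounted read-only
        f"--memory={memory}",  # memory cap
        f"--cpus={cpus}",      # CPU cap
        image,
    ]


cmd = sandbox_command()
print(shlex.join(cmd))
print("sandbox enabled:", sandbox_enabled({"BENCHHUB_SANDBOX_METRICS": "1"}))
```

Printing the command instead of executing it keeps the sketch runnable on machines without Docker; passing the list to `subprocess.run` would be the natural next step once the runner image is built.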

Tests

pytest tests/         # ~1080 tests; use tests/ (not bare pytest)

Deployment

Self-hosted on an Ubuntu box at runbenchhub.com (gunicorn + celery + redis under systemd, nginx + certbot, Cloudflare DNS-only). The operational runbook — push flow, .env keys, logs, rollback — is docs/SELFHOST_RUNBOOK.md.

License

MIT — see LICENSE.
