Public eval + leaderboard for the Morpheus AI inference network. Drives MRC 76 (agent benchmarking) with reproducible probe sets.

hypnex-bench

Public eval + leaderboard for the Morpheus AI inference network. The off-chain implementation of MRC 76 (Agent Performance Benchmarking).

pip install hypnex-bench

What it does

Runs a small, reproducible probe set against every LLM on the Morpheus network, collects per-model pass rates, latencies (p50/p95), and token counts, and renders a markdown leaderboard. Designed to run nightly so that per-model results accumulate over time.
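As a rough illustration of the aggregation described above, the sketch below computes a pass rate, nearest-rank p50/p95 latencies, and a token total per model. The `ProbeResult` record and its field names are assumptions made for this example, not hypnex-bench's actual data model, and the package may use a different percentile method.

```python
import math
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical per-probe record; field names are illustrative only."""
    model: str
    passed: bool
    latency_ms: float
    tokens: int


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results):
    """Group results by model and compute pass rate, p50/p95 latency, and total tokens."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)

    summary = {}
    for model, rows in by_model.items():
        latencies = [r.latency_ms for r in rows]
        summary[model] = {
            "pass_rate": sum(r.passed for r in rows) / len(rows),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "tokens": sum(r.tokens for r in rows),
        }
    return summary
```

With only a handful of probes per suite, p95 under the nearest-rank method is usually just the slowest probe, so it is best read as a worst-case indicator rather than a stable statistic.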

Default suites (19 probes total; a full run across all live LLMs typically costs under $0.20 in MOR):

Suite    Probes   What it tests
coding   6        HumanEval-style: model writes a Python function, we exec it + assert
math     8        GSM8K-style word problems with deterministic numeric answers
json     5        Strict JSON adherence: does the model produce parseable, schema-matching JSON?
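To make the math and json rows more concrete, here is a minimal sketch of what deterministic evaluators of this kind could look like. The function names, the "last number in the reply" rule, and the key-to-type schema format are all assumptions for illustration; the package's real evaluators may differ.

```python
import json
import re

# Integers or decimals, optionally negative.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def check_math(reply: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass if the last number in the reply matches the expected answer."""
    # Drop thousands separators so "1,200" is read as 1200.
    matches = _NUMBER.findall(reply.replace(",", ""))
    return bool(matches) and abs(float(matches[-1]) - expected) <= tol


def check_json(reply: str, required: dict) -> bool:
    """Pass if the reply is a JSON object whose required keys have the given types."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False  # any prose around the JSON makes the reply unparseable
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected_type)
        for key, expected_type in required.items()
    )
```

Parsing the whole reply with `json.loads` is what makes the check strict: a reply such as `Sure! {"name": "a"}` fails even though it contains valid JSON.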

Quickstart

# 1. List available LLMs (no key needed, public registry)
hypnex-bench models

# 2. Run all suites against the default LLM set (key required, costs MOR)
HYPNEX_API_KEY=mor_xxx hypnex-bench run

# 3. Render the leaderboard from data/latest.json
hypnex-bench leaderboard

Programmatic

from hypnex_bench import BenchRunner, all_suites, to_markdown

# Run every default suite against two models, then print a markdown leaderboard.
runner = BenchRunner(api_key="mor_...")
results = runner.run(["mistral-31-24b", "glm-5"], all_suites())
print(to_markdown(results))

CLI reference

hypnex-bench models                              # list active LLMs

hypnex-bench run [options]
    --models a,b,c           # comma-list (default: all live LLMs)
    --limit N                # only first N models (when --models omitted)
    --suite SUITE            # all | coding | math | json | a,b
    --output DIR             # output dir (default: ./data)
    --api-key KEY            # override HYPNEX_API_KEY
    --base-url URL           # override https://api.mor.org/api/v1

hypnex-bench leaderboard [options]
    --input DIR              # dir containing latest.json (default: ./data)
    --output FILE            # write to file (default: stdout)
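For scheduled runs it can be convenient to assemble the `hypnex-bench run` invocation in code. The helper below is a hypothetical wrapper, not part of the package; it uses only the flags listed above and mirrors the rule that `--limit` applies only when `--models` is omitted.

```python
import shlex


def build_run_command(models=None, suite="all", output="./data", limit=None):
    """Build an argv list for `hypnex-bench run` from the documented flags."""
    cmd = ["hypnex-bench", "run", "--suite", suite, "--output", output]
    if models:
        # An explicit model list takes precedence; --limit is not added.
        cmd += ["--models", ",".join(models)]
    elif limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


if __name__ == "__main__":
    # Print the command as a shell-safe string for inspection.
    print(shlex.join(build_run_command(suite="math", limit=3)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` with `HYPNEX_API_KEY` set in the environment.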

Output

data/
  run-20260507T031502Z.jsonl      # one full run, append-only
  run-20260508T031455Z.jsonl
  ...
  latest.json                     # snapshot of the most recent run

latest.json is what the leaderboard renderer (and any future static-site generator) consumes.
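If you want to work with the per-run files directly, the filename pattern shown above encodes a UTC timestamp. The sketch below parses it and picks the newest run; the helper names are made up for this example, and since the record schema inside each `.jsonl` file is not documented here, `load_run` simply returns the parsed JSON lines as they are.

```python
import json
import re
from datetime import datetime, timezone

# Matches names such as run-20260507T031502Z.jsonl
_RUN_NAME = re.compile(r"^run-(\d{8}T\d{6}Z)\.jsonl$")


def parse_run_time(filename: str):
    """Return the UTC timestamp encoded in a run filename, or None if it does not match."""
    m = _RUN_NAME.match(filename)
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%Y%m%dT%H%M%SZ")
    return stamp.replace(tzinfo=timezone.utc)


def newest_run(names):
    """Pick the most recent run file from an iterable of filenames."""
    timed = [(parse_run_time(n), n) for n in names]
    timed = [(t, n) for t, n in timed if t is not None]
    return max(timed)[1] if timed else None


def load_run(path):
    """Read one JSONL run file into a list of parsed JSON values, one per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```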

Why not just use HumanEval / GSM8K / MMLU directly?

Those benchmarks have leaked into model training data, so scores on them can reflect memorization. The probes here are small sets of slightly rephrased variations, chosen to be cheap (a full run costs cents, not dollars), language-canonical (Python only for coding; ASCII text and ASCII digits for math), and verifiable without an LLM grader (deterministic evaluators that exec the output or match it with a regex). For canonical leaderboard claims, swap these probe sets for the official suites; the runner architecture stays the same.
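An exec-based evaluator of the kind mentioned above can be sketched in a few lines. This is an illustrative assumption about the approach, not the package's actual code: `check_code` is a hypothetical name, and a real harness would normally run model output in a separate process with a timeout, since `exec` offers no isolation and cannot stop an infinite loop.

```python
def check_code(source: str, tests: str) -> bool:
    """Exec the model's code, then exec assert statements against it.

    Any exception (syntax error, runtime error, failed assertion) counts as a fail.
    """
    namespace = {}
    try:
        exec(source, namespace)  # defines the candidate function(s)
        exec(tests, namespace)   # runs the assertions in the same namespace
    except Exception:
        return False
    return True


if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b\n"
    print(check_code(candidate, "assert add(2, 3) == 5"))
```

Because the verdict depends only on whether the assertions pass, the same reply always gets the same grade, which is what makes the probe results reproducible.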

Tests

pip install -e ".[dev]"
pytest                  # 17 pure-Python evaluator tests, no API key needed

Status & affiliation

This is a Hypnex Labs draft of MRC 76. It is not affiliated with the Morpheus AI Foundation. Suite definitions are MIT-licensed; submit PRs to add probes.

License

MIT
