# hypnex-bench

Public eval + leaderboard for the Morpheus AI inference network. The off-chain implementation of MRC 76 (Agent Performance Benchmarking).
```shell
pip install hypnex-bench
```
## What it does
Runs a small, reproducible probe set against every LLM on the Morpheus network, collects per-model pass-rates, latencies (p50/p95), and token counts, and renders a markdown leaderboard. Designed to run nightly so the data flywheel compounds.
Default suites (~19 probes total; a full run across all live LLMs typically costs under $0.20 of MOR):

| Suite | Probes | What it tests |
|---|---|---|
| coding | 6 | HumanEval-style: the model writes a Python function, which we exec and assert on |
| math | 8 | GSM8K-style word problems with deterministic numeric answers |
| json | 5 | Strict JSON adherence: does the model produce parseable, schema-matching JSON? |
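A GSM8K-style probe can be graded deterministically because the expected answer is a single number. A minimal sketch of such a grader (illustrative only — the function name, tolerance, and "last number wins" convention are assumptions, not hypnex-bench internals):

```python
import re

def grade_numeric(completion: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass if the last number in the model's completion matches the expected answer."""
    # Strip thousands separators, then find every integer or decimal in the text.
    # Models often restate intermediate figures, so only the final number is
    # treated as the answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return False
    return abs(float(numbers[-1]) - expected) <= tol

print(grade_numeric("16 - 3 - 4 = 9 eggs, so 9 * 2 = 18 dollars. Answer: 18", 18))  # True
```

This is what "deterministic numeric answers" buys you: no LLM judge, no rubric, just a regex and a comparison.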
## Quickstart

```shell
# 1. List available LLMs (no key needed, public registry)
hypnex-bench models

# 2. Run all suites against the default LLM set (key required, costs MOR)
HYPNEX_API_KEY=mor_xxx hypnex-bench run

# 3. Render the leaderboard from data/latest.json
hypnex-bench leaderboard
```
## Programmatic

```python
from hypnex_bench import BenchRunner, all_suites, to_markdown

runner = BenchRunner(api_key="mor_...")
results = runner.run(["mistral-31-24b", "glm-5"], all_suites())
print(to_markdown(results))
```
## CLI reference

```text
hypnex-bench models              # list active LLMs

hypnex-bench run [options]
  --models a,b,c                 # comma-separated list (default: all live LLMs)
  --limit N                      # only the first N models (when --models is omitted)
  --suite SUITE                  # all | coding | math | json | a,b
  --output DIR                   # output dir (default: ./data)
  --api-key KEY                  # override HYPNEX_API_KEY
  --base-url URL                 # override https://api.mor.org/api/v1

hypnex-bench leaderboard [options]
  --input DIR                    # dir containing latest.json (default: ./data)
  --output FILE                  # write to a file (default: stdout)
```
## Output

```text
data/
  run-20260507T031502Z.jsonl   # one full run, append-only
  run-20260508T031455Z.jsonl
  ...
  latest.json                  # snapshot of the most recent run
```

latest.json is what the leaderboard renderer (and any future static-site generator) consumes.
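The exact schema of latest.json isn't documented here, so any consumer sketch has to assume one. The snippet below assumes a minimal shape — a list of per-model records with `model`, `passed`, and `total` fields — purely for illustration; check the real file before relying on these names:

```python
def leaderboard_rows(results: list[dict]) -> list[tuple[str, float]]:
    """Sort models by overall pass-rate, descending."""
    rows = [(r["model"], r["passed"] / r["total"]) for r in results if r["total"]]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# In real use: results = json.load(open("data/latest.json"))
# Hypothetical records in the assumed shape:
sample = [
    {"model": "mistral-31-24b", "passed": 14, "total": 19},
    {"model": "glm-5", "passed": 16, "total": 19},
]
for model, rate in leaderboard_rows(sample):
    print(f"{model}: {rate:.0%}")
# glm-5: 84%
# mistral-31-24b: 74%
```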
## Why not just use HumanEval / GSM8K / MMLU directly?

Those benchmarks have leaked into model training data. The probes here are small-set, slightly rephrased variations chosen to be:

- **cheap**: a full run costs cents, not dollars;
- **language-canonical**: Python only for coding; ASCII text and ASCII numerals for math;
- **verifiable without an LLM grader**: deterministic evaluators that exec or regex-match.

For canonical leaderboard claims, swap these probe sets for the official suites; the runner architecture stays the same.
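For the coding suite, "exec and assert" grading can be as simple as running the model-written function in a scratch namespace and checking known input/output pairs. A minimal sketch — names and signature are illustrative, not hypnex-bench internals, and a real runner should sandbox the `exec` call since it runs untrusted model output:

```python
def grade_coding(candidate_src: str, entry_point: str,
                 checks: list[tuple[tuple, object]]) -> bool:
    """Exec the model-written source, then assert it against known cases."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # WARNING: untrusted code; sandbox in production
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in checks)
    except Exception:
        # Syntax errors, a missing function, or a wrong answer all count as a fail.
        return False

src = "def add(a, b):\n    return a + b\n"
print(grade_coding(src, "add", [((1, 2), 3), ((-1, 1), 0)]))  # True
```

Because the verdict comes from `exec` plus equality checks, it is fully deterministic — no second model in the loop.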
## Tests

```shell
pip install -e ".[dev]"
pytest   # 17 pure-Python evaluator tests, no API key needed
```
## Status & affiliation

Hypnex Labs draft of MRC 76. Not affiliated with the Morpheus AI Foundation. Suite definitions are MIT-licensed; PRs adding probes are welcome.
## License

MIT
## File details

### hypnex_bench-0.1.0.tar.gz
- Download URL: hypnex_bench-0.1.0.tar.gz
- Upload date:
- Size: 51.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | f6caf0c386a5d2cefe465032c5a8e87e49078dceaa8015211e0927fe232ed323 |
| MD5 | e799e0ec7c4c58049951746b7c90148a |
| BLAKE2b-256 | e55917814ee1fafe4cddf31ef0b0e2b23402bcd27e0f094e58b54241c049d550 |
### hypnex_bench-0.1.0-py3-none-any.whl
- Download URL: hypnex_bench-0.1.0-py3-none-any.whl
- Upload date:
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9afac1913fc8f63d135bcb434473477ff4bad2454b3f3bb5cff372abb777b529 |
| MD5 | 4e745802cdc9ddbebb68ff3919e20e73 |
| BLAKE2b-256 | dc763b1d7f3a2106e5a9db036a408a5c31946b9f2a03b3377d7022d86adb535b |