agi-evals

Plug any model into any major AGI eval and actually run it.

Open source. Six categories. 45 evals catalogued, deeply implemented over time — not a directory pretending to be a platform. The runner code is Apache-2.0; each eval's dataset keeps its original upstream license (documented per entry).

The harness runs on your machine and needs no agi-eval.studio account or API key. Bring your own model credentials (or none, for local models via Ollama/vLLM/MLX). A free account at agi-eval.studio adds the hosted layer: score-over-time dashboards, variant-vs-base comparison cards, a public leaderboard, and challenges.

→ Catalog, per-eval docs, and scoreboard: agi-eval.studio


The idea

Two protocols decouple what we run from what we run it on:

  • PatientAdapter — a model endpoint. Takes a prompt plus any eval-specific scenario, returns a response. Adapters ship for OpenAI, Anthropic, Grok, Ollama, vLLM, Hugging Face Transformers, MLX (Apple Silicon), and a custom-callable shim.
  • EvalRunner — an eval. Takes a patient and a case, returns a scored result with a typed failure tag.

Get these two right and adding eval #2 through #50 is incremental. Every eval and every model hangs off them — an eval never imports an adapter, an adapter never imports an eval.

catalog/evals.yaml ──┬──► website (agi-eval.studio)
                     └──► registry ──► EvalRunner ──┐
                                                    ├──► harness ──► EvalReport ──► push to scoreboard
                              PatientAdapter ───────┘
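
The decoupling described above can be sketched with two structural types. This is only an illustration, assuming `typing.Protocol` classes: `PatientRequest`, `EvalResult`, `respond`, `run_case`, `EchoPatient`, and `ExactMatchRunner` are hypothetical names, not the package's actual signatures. Only the protocol names and the `req.prompt` attribute come from this README.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class PatientRequest:                 # hypothetical request shape
    prompt: str
    scenario: Optional[dict] = None   # eval-specific extras, if any


@dataclass
class EvalResult:                     # hypothetical result shape
    case_id: str
    passed: bool
    failure_tag: Optional[str] = None


class PatientAdapter(Protocol):
    """A model endpoint: request in, response text out."""
    def respond(self, req: PatientRequest) -> str: ...


class EvalRunner(Protocol):
    """An eval: takes a patient and a case, returns a scored result."""
    def run_case(self, patient: PatientAdapter, case: dict) -> EvalResult: ...


class EchoPatient:
    """Toy patient that returns the prompt unchanged."""
    def respond(self, req: PatientRequest) -> str:
        return req.prompt


class ExactMatchRunner:
    """Toy eval: pass when the response equals the expected answer."""
    def run_case(self, patient: PatientAdapter, case: dict) -> EvalResult:
        answer = patient.respond(PatientRequest(prompt=case["prompt"]))
        passed = answer.strip() == case["expected"]
        return EvalResult(case["id"], passed, None if passed else "WRONG_ANSWER")


case = {"id": "c1", "prompt": "42", "expected": "42"}
result = ExactMatchRunner().run_case(EchoPatient(), case)
print(result)
```

Neither toy class imports or subclasses the other; each depends only on the protocol it consumes, which is the property that keeps new evals and new adapters independent.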

Install

pip install agi-evals                 # core + custom/ollama/vllm/grok/openai-compat
pip install 'agi-evals[openai]'       # + OpenAI SDK
pip install 'agi-evals[anthropic]'    # + Anthropic SDK
pip install 'agi-evals[hf]'           # + Transformers/torch
pip install 'agi-evals[mlx]'          # + MLX (Apple Silicon)

Quickstart — CLI

agi-evals list --status live                 # browse the catalog
agi-evals info gpqa-diamond                  # inspect one eval
agi-evals run gpqa-diamond --model echo      # offline smoke test, no keys
agi-evals download --all                     # fetch + cache full datasets
agi-evals run gpqa-diamond --model openai:gpt-4o-mini --limit 50
agi-evals run humaneval-plus --model ollama:llama3.1:8b --concurrency 4
agi-evals run math --model anthropic:claude-opus-4-8 --push   # submit to scoreboard

Every shipped eval bundles a small real-schema sample so it runs offline out of the box. agi-evals download <eval> fetches the full upstream dataset (HF datasets-server or GitHub, no heavy deps) into ~/.cache/agi-evals/, and runs use it automatically. GPQA is gated upstream: set HF_TOKEN after accepting its terms, or the runner falls back to the GPQA repo's published-password zip. An explicit data_path= always wins.
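
The lookup order in that paragraph (explicit path, then cached download, then bundled sample) might be expressed roughly as below. This is a sketch under assumptions: `resolve_dataset`, its parameters, and the per-eval cache directory layout are hypothetical; only the `~/.cache/agi-evals/` location and the `data_path` name come from this README.

```python
from pathlib import Path
from typing import Optional

CACHE_ROOT = Path.home() / ".cache" / "agi-evals"   # cache location named in the README


def resolve_dataset(eval_id: str, bundled_sample: Path,
                    data_path: Optional[Path] = None,
                    cache_root: Path = CACHE_ROOT) -> Path:
    """Pick the dataset to run: explicit path, then cached download, then sample."""
    if data_path is not None:
        return Path(data_path)        # an explicit data_path always wins
    cached = cache_root / eval_id     # hypothetical per-eval cache directory
    if cached.exists():
        return cached                 # full dataset fetched by `agi-evals download`
    return bundled_sample             # offline fallback bundled with the eval
```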

Quickstart — SDK

from agi_evals import load_runner, run_eval
from agi_evals.adapters import OpenAIAdapter, CustomAdapter

# Any of the built-in adapters...
patient = OpenAIAdapter("gpt-4o-mini")

# ...or wrap your own endpoint as a callable:
patient = CustomAdapter(lambda req: my_model(req.prompt), name="my-model")

report = run_eval(load_runner("gpqa-diamond"), patient, limit=100, concurrency=8)
print(report.score, report.pass_rate, report.failure_counts)

# Save it to your scoreboard at agi-eval.studio
from agi_evals.client import push_report
push_report(report, model="my-model")        # needs AGI_EVALS_API_KEY

Track your scores at agi-eval.studio

Local runs print a report and exit; no results are sent to agi-eval.studio. To keep a history, add --push:

  1. Sign in at agi-eval.studio (GitHub OAuth).
  2. Mint a key under Settings → API keys (shown once, stored hashed).
  3. export AGI_EVALS_API_KEY=ae_...
  4. Add --push to any run or compare.

Your dashboard charts every eval over time, groups variant-vs-base comparisons into vs-cards, and lets you submit a run to a challenge or the public leaderboard — attach your GitHub repo or an endpoint so others can see what the score belongs to.

Live evals (runnable today)

| Eval | Category | Grading | Full dataset |
|---|---|---|---|
| GPQA Diamond | reasoning | single-letter MCQ | 198 |
| MMLU-Pro | reasoning | 10-choice MCQ | ~12k |
| MATH | reasoning | \boxed{} answer, math-aware match | 500 (MATH-500) |
| AIME 2024 | reasoning | integer exact-match | 30 |
| HumanEval+ | code | sandboxed test execution | 164 |
| BIG-Bench Hard | reasoning | normalized exact-match, 27 tasks | ~6.5k |
| MuSR | reasoning | narrative MCQ | 756 |
| BFCL (simple) | agent | function-call AST match | 400 |
| ZebraLogic | reasoning | full-grid JSON, puzzle-level | gated (HF_TOKEN) |
| JailbreakBench | safety | refusal rate, LLM-judged | 100 |
| LiveCodeBench | code | contest tests, pass@k, contamination-free | recent releases (~340) |
| HarmBench | safety | behavior classifier, score = 1 − ASR | 300 |
| τ-bench | agent | episode reward: DB-state × outputs | 165 (retail+airline) |
| ALFWorld | embodied | task success in the real TextWorld engine* | 134 unseen games |
| ScienceWorld | embodied | engine score 0–100, partial credit* | 30 tasks, test variations |
| AILuminate | safety | judged safe-response rate (practice set) | 1,200 |

τ-bench is a faithful port of the Sierra Research benchmark: the original tools, databases, policy wikis, tasks, and reward function, vendored 1:1 (MIT). The simulated user is any PatientAdapter (TauBenchRunner(user=OpenAIAdapter("gpt-4o"))). Port verified by a gold-replay oracle scoring 165/165 on the real test sets.

* ALFWorld and ScienceWorld drive their original engines and need extras: pip install 'agi-evals[alfworld]' / 'agi-evals[scienceworld]' (Java required for ScienceWorld). Every other live eval runs with zero optional dependencies.

The other 29 catalogued evals across agent/tool-use, code, robotics, and safety carry status building or roadmap — browse them all, with per-eval docs (how it works, scoring, troubleshooting), at agi-eval.studio/evals.

pass@k for code evals

from agi_evals.evals import HumanEvalPlusRunner, LiveCodeBenchRunner

runner = LiveCodeBenchRunner(n_samples=10, k=5)   # 10 samples, report pass@5

Sampling uses the unbiased Chen et al. (2021) estimator; the default n_samples=1, k=1 is plain greedy pass@1.
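
For reference, the Chen et al. (2021) estimator computes pass@k = 1 − C(n − c, k) / C(n, k) for n samples of which c pass. The function below is a standalone sketch of that formula, not the package's own code; `pass_at_k` is a hypothetical name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if n - c < k:
        return 1.0   # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct, report pass@5: 1 - 21/252, about 0.917
print(pass_at_k(n=10, c=3, k=5))
```

With n = 1 and k = 1 the formula reduces to 1.0 for a passing sample and 0.0 for a failing one, which is why the default behaves as plain pass@1.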

Compare a variant against its base

agi-evals compare gpqa-diamond --model openai:my-finetune \
    --baseline openai:gpt-4o-mini --push

Paired per-case comparison on identical cases: improvements (cases the variant newly solves), regressions (cases it newly fails, listed by id), score delta, and McNemar's exact test on the discordant pairs. Infra errors on either side are excluded from pairing so endpoint flakes never read as regressions. --push lands both runs on your dashboard as a vs-card.
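
The pairing and the test can be illustrated in a few lines. This is a sketch, assuming each run is reduced to a mapping from case id to `"pass"`, `"fail"`, or `"infra"`; the `compare` and `mcnemar_exact` functions and that outcome encoding are hypothetical, not the package's API. The p-value is the standard two-sided exact binomial form of McNemar's test on the discordant counts.

```python
from math import comb

INFRA = "infra"   # hypothetical marker for adapter/harness errors


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0   # no discordant pairs, no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare(base: dict, variant: dict) -> dict:
    """Paired comparison; each dict maps case id -> 'pass' | 'fail' | 'infra'."""
    # Pair only cases both runs scored; infra errors on either side are dropped.
    paired = [cid for cid in base
              if cid in variant and INFRA not in (base[cid], variant[cid])]
    improvements = [cid for cid in paired
                    if (base[cid], variant[cid]) == ("fail", "pass")]
    regressions = [cid for cid in paired
                   if (base[cid], variant[cid]) == ("pass", "fail")]
    n = len(paired)
    base_score = sum(base[cid] == "pass" for cid in paired) / n if n else 0.0
    variant_score = sum(variant[cid] == "pass" for cid in paired) / n if n else 0.0
    return {
        "improvements": improvements,
        "regressions": regressions,
        "score_delta": variant_score - base_score,
        "p_value": mcnemar_exact(len(improvements), len(regressions)),
    }


base = {"a": "pass", "b": "fail", "c": "fail", "d": "pass", "e": "fail"}
variant = {"a": "pass", "b": "pass", "c": "pass", "d": "infra", "e": "fail"}
print(compare(base, variant))
```

Case `d` errored on the variant side, so it is left out of the pairing instead of being counted as a regression.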

Typed failure taxonomy

Every result carries at most one FailureTag: WRONG_ANSWER, NO_ANSWER, REFUSED, MALFORMED_OUTPUT, TOOL_ERROR, TIMEOUT, CONTEXT_OVERFLOW, ADAPTER_ERROR, HARNESS_ERROR. Infrastructure errors (ADAPTER_ERROR and HARNESS_ERROR) are excluded from the aggregate score, so a flaky endpoint never silently penalizes a model, but they stay visible in failure_counts.
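
One way to picture the exclusion rule is the sketch below. The tag names are the ones listed above; the enum values, the `aggregate` function, and the convention that a passing case carries `None` are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class FailureTag(Enum):
    WRONG_ANSWER = "wrong_answer"
    NO_ANSWER = "no_answer"
    REFUSED = "refused"
    MALFORMED_OUTPUT = "malformed_output"
    TOOL_ERROR = "tool_error"
    TIMEOUT = "timeout"
    CONTEXT_OVERFLOW = "context_overflow"
    ADAPTER_ERROR = "adapter_error"
    HARNESS_ERROR = "harness_error"


INFRA_TAGS = {FailureTag.ADAPTER_ERROR, FailureTag.HARNESS_ERROR}


def aggregate(tags: list) -> tuple:
    """tags holds one entry per case: None for a pass, else its FailureTag."""
    # Every failure is counted, including infrastructure errors.
    failure_counts = Counter(t.name for t in tags if t is not None)
    # Only non-infrastructure cases enter the score's denominator.
    scored = [t for t in tags if t not in INFRA_TAGS]
    pass_rate: Optional[float] = (
        sum(t is None for t in scored) / len(scored) if scored else None
    )
    return pass_rate, failure_counts


tags = [None, None, FailureTag.WRONG_ANSWER, FailureTag.ADAPTER_ERROR]
pass_rate, failure_counts = aggregate(tags)
print(pass_rate, dict(failure_counts))
```

Here the adapter error shrinks the denominator from four cases to three, so the pass rate is 2/3, while the error still shows up in the counts.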

Safety note

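HumanEval+ executes model-generated code locally in a subprocess with a timeout. The timeout only bounds runtime: the code still runs with your user account's file and network access. Run only models and datasets you trust, or wrap the run in an OS-level sandbox.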
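
To make the risk concrete, here is a generic sketch of the "subprocess with a timeout" pattern using only the standard library. It is not the package's runner, and `run_candidate` is a hypothetical name. Nothing in it restricts what the child process can read, write, or connect to.

```python
import os
import subprocess
import sys
import tempfile


def run_candidate(code: str, test: str, timeout_s: float = 10.0) -> str:
    """Run generated code plus its test in a child interpreter.

    Returns 'pass', 'fail', or 'timeout'. This limits runtime only; it is
    not a security boundary.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(code + "\n\n" + test + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", path],   # -I: isolated mode, ignores user site and env vars
                capture_output=True,
                timeout=timeout_s,              # child is killed when this expires
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    return "pass" if proc.returncode == 0 else "fail"


print(run_candidate("def add(a, b):\n    return a + b", "assert add(1, 2) == 3"))
```

A failing assertion makes the child exit with a nonzero status and is reported as `fail`; an infinite loop is cut off after `timeout_s` seconds.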

Contributing

The SDK source repository opens to contributions on July 1, 2026. Until then: bug reports, eval requests, and questions → agi-eval.studio. Adding an eval is deliberately small — implement an EvalRunner, bundle a sample, add a catalog entry — and the installed package is the reference: every live eval ships its source in agi_evals/evals/.

License

Runner code: Apache-2.0 (see LICENSE). Eval datasets retain their upstream licenses, documented per entry in the catalog.
