agi-evals

Plug any model into any major AGI eval and actually run it.

Open source, six categories, 45 evals catalogued: 19 are runnable today, and the rest are being implemented in depth over time. This is not a directory pretending to be a platform. The runner code is Apache-2.0, and each eval's dataset keeps its original upstream license, documented per entry.

Everything runs on your machine. No account or API key is required: bring your own model credentials, or none at all for local models served through Ollama, vLLM, or MLX. A free account at agi-eval.studio adds the hosted layer: score-over-time dashboards, variant-vs-base comparison cards, a public leaderboard, and challenges.

Catalog, per-eval docs, and the scoreboard live at agi-eval.studio.

The idea

Two protocols decouple what we run from what we run it on:

PatientAdapter is a model endpoint. It takes a prompt plus any eval-specific scenario and returns a response. Adapters ship for OpenAI, Anthropic, Grok, Ollama, vLLM, Hugging Face Transformers, MLX (Apple Silicon), and a custom-callable shim.
EvalRunner is an eval. It takes a patient and a case and returns a scored result with a typed failure tag.

Get these two right and adding eval #2 through #50 is incremental. Every eval and every model hangs off them. An eval never imports an adapter, and an adapter never imports an eval.
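
A minimal sketch of how the two protocols could be shaped, assuming Python typing.Protocol classes. The names PatientRequest, EvalResult, respond, run_case, and every field below are hypothetical illustrations, not taken from the agi_evals source; the installed package is the reference for the actual interfaces.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


@dataclass
class PatientRequest:
    """Hypothetical request: a prompt plus an optional eval-specific scenario."""
    prompt: str
    scenario: Optional[dict[str, Any]] = None


@dataclass
class EvalResult:
    """Hypothetical scored result with an optional failure tag."""
    case_id: str
    score: float
    failure_tag: Optional[str] = None


class PatientAdapter(Protocol):
    """A model endpoint: request in, response text out."""
    def respond(self, request: PatientRequest) -> str: ...


class EvalRunner(Protocol):
    """An eval: scores one case against any patient."""
    def run_case(self, patient: PatientAdapter, case: dict[str, Any]) -> EvalResult: ...


class EchoPatient:
    """Toy patient (illustrative only) that returns the prompt unchanged."""
    def respond(self, request: PatientRequest) -> str:
        return request.prompt


class ExactMatchRunner:
    """Toy eval: full credit when the response equals the expected answer."""
    def run_case(self, patient: PatientAdapter, case: dict[str, Any]) -> EvalResult:
        answer = patient.respond(PatientRequest(prompt=case["prompt"]))
        correct = answer.strip() == case["expected"]
        return EvalResult(
            case_id=case["id"],
            score=1.0 if correct else 0.0,
            failure_tag=None if correct else "WRONG_ANSWER",
        )


# The runner only sees the PatientAdapter protocol, never a concrete adapter.
result = ExactMatchRunner().run_case(
    EchoPatient(), {"id": "c1", "prompt": "42", "expected": "42"}
)
print(result)
```

Because both sides depend only on the protocol, the toy runner works with any object that has a matching respond method, which is the decoupling the paragraph above describes.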

catalog/evals.yaml ──┬──► website (agi-eval.studio)
                     └──► registry ──► EvalRunner ──┐
                                                    ├──► harness ──► EvalReport ──► push to scoreboard
                              PatientAdapter ───────┘

Install

pip install agi-eval                 # core + custom/ollama/vllm/grok/openai-compat
pip install 'agi-eval[openai]'       # + OpenAI SDK
pip install 'agi-eval[anthropic]'    # + Anthropic SDK
pip install 'agi-eval[hf]'           # + Transformers/torch
pip install 'agi-eval[mlx]'          # + MLX (Apple Silicon)

Quickstart: CLI

agi-evals list --status live                 # browse the catalog
agi-evals info gpqa-diamond                  # inspect one eval
agi-evals run gpqa-diamond --model echo      # offline smoke test, no keys
agi-evals download --all                     # fetch + cache full datasets
agi-evals run gpqa-diamond --model openai:gpt-4o-mini --limit 50
agi-evals run humaneval-plus --model ollama:llama3.1:8b --concurrency 4
agi-evals run math --model anthropic:claude-opus-4-8 --push   # submit to scoreboard

Every shipped eval bundles a small real-schema sample, so it runs offline out of the box. agi-evals download <eval> fetches the full upstream dataset (from the HF datasets-server or GitHub, with no heavy dependencies) into ~/.cache/agi-evals/, and runs pick it up automatically. GPQA is gated upstream: set HF_TOKEN after accepting its terms, or the runner falls back to the GPQA repo's published-password zip. An explicit data_path= always wins.

Quickstart: SDK

from agi_evals import load_runner, run_eval
from agi_evals.adapters import OpenAIAdapter, CustomAdapter

# Any of the built-in adapters...
patient = OpenAIAdapter("gpt-4o-mini")

# ...or wrap your own endpoint as a callable
# (my_model stands in for your own function; it is not defined by the package):
patient = CustomAdapter(lambda req: my_model(req.prompt), name="my-model")

report = run_eval(load_runner("gpqa-diamond"), patient, limit=100, concurrency=8)
print(report.score, report.pass_rate, report.failure_counts)

# Save it to your scoreboard at agi-eval.studio
from agi_evals.client import push_report
push_report(report, model="my-model")        # needs AGI_EVALS_API_KEY

Track your scores at agi-eval.studio

Local runs print a report and exit. Nothing leaves your machine. To keep a history, add --push:

1. Sign in at agi-eval.studio (GitHub OAuth).
2. Mint a key under Settings → API keys (shown once, stored hashed).
3. export AGI_EVALS_API_KEY=ae_...
4. Add --push to any run or compare.

Your dashboard charts every eval over time and groups variant-vs-base comparisons into vs-cards. From there you can submit a run to a challenge or the public leaderboard, and attach your GitHub repo or an endpoint so others can see what the score belongs to.

Live evals (runnable today)

Eval	Category	Grading	Full dataset
GPQA Diamond	reasoning	single-letter MCQ	198
MMLU-Pro	reasoning	10-choice MCQ	~12k
MATH	reasoning	`\boxed{}` answer, math-aware match	500 (MATH-500)
AIME 2024	reasoning	integer exact-match	30
HumanEval+	code	sandboxed test execution	164
BIG-Bench Hard	reasoning	normalized exact-match, 27 tasks	~6.5k
MuSR	reasoning	narrative MCQ	756
BFCL (simple)	agent	function-call AST match	400
ZebraLogic	reasoning	full-grid JSON, puzzle-level	gated (HF_TOKEN)
JailbreakBench	safety	refusal rate, LLM-judged	100
LiveCodeBench	code	contest tests, pass@k, contamination-free	recent releases (~340)
HarmBench	safety	behavior classifier, score = 1 − ASR	300
τ-bench	agent	episode reward: DB-state × outputs	165 (retail+airline)
ALFWorld	embodied	task success in the real TextWorld engine*	134 unseen games
ScienceWorld	embodied	engine score 0–100, partial credit*	30 tasks, test variations
AILuminate	safety	judged safe-response rate (practice set)	1,200
GAIA	agent	official exact-match scorer, FINAL ANSWER template	165 (validation, gated)
WebShop	agent	engine's attribute/option/price reward, partial credit*	500 test goals
LIBERO	robotics	success rate via PolicyAdapter (MuJoCo)*	4 suites × 10 tasks

τ-bench is a faithful port of the Sierra Research benchmark: the original tools, databases, policy wikis, tasks, and reward function, vendored 1:1 (MIT). The simulated user can be any PatientAdapter, for example TauBenchRunner(user=OpenAIAdapter("gpt-4o")). The port is verified by a gold-replay oracle that scores 165/165 on the real test sets.

* Engine-backed evals drive the original benchmark environments and need their engines: ALFWorld and ScienceWorld install as extras (pip install 'agi-eval[alfworld]' / 'agi-eval[scienceworld]', Java for the latter), while WebShop and LIBERO install from their upstream repos (each eval's docs page has the recipe). LIBERO evaluates robot policies, not text models: serve one over HTTP and pass --model policy:http://host:port. Every other live eval runs with zero optional dependencies.

The other 26 catalogued evals, spread across agent/tool-use, code, robotics, and safety, have the status building or roadmap. Browse them all at agi-eval.studio/evals, where per-eval docs cover how each eval works, how it scores, and how to troubleshoot it.

pass@k for code evals

from agi_evals.evals import HumanEvalPlusRunner, LiveCodeBenchRunner

runner = LiveCodeBenchRunner(n_samples=10, k=5)   # 10 samples, report pass@5

Sampling uses the unbiased Chen et al. (2021) estimator. The default n_samples=1, k=1 is plain greedy pass@1.
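
For reference, the Chen et al. (2021) estimator computes pass@k = 1 - C(n - c, k) / C(n, k) for n samples of which c pass. The sketch below is a standalone implementation of that formula, not the code shipped in agi_evals; the function name pass_at_k is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for n samples of which c are correct."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a correct sample.
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct, report pass@5
print(round(pass_at_k(10, 3, 5), 4))  # → 0.9167
```

With n_samples=1 and k=1 the formula reduces to 1.0 when the single sample passes and 0.0 when it fails, which matches plain pass@1.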

Compare a variant against its base

agi-evals compare gpqa-diamond --model openai:my-finetune \
    --baseline openai:gpt-4o-mini --push

This runs a paired per-case comparison on identical cases: improvements (cases the variant newly solves), regressions (cases it newly fails, listed by id), the score delta, and McNemar's exact test on the discordant pairs. Infra errors on either side are excluded from pairing, so endpoint flakes never read as regressions. --push lands both runs on your dashboard as a vs-card.
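
To make the statistics concrete, here is a minimal sketch of a paired comparison followed by a two-sided exact McNemar test. It assumes per-case outcomes are stored as True (pass), False (fail), or None (infra error); that representation and the names pair_cases and mcnemar_exact are hypothetical, not the package's API.

```python
from math import comb
from typing import Optional


def pair_cases(
    baseline: dict[str, Optional[bool]], variant: dict[str, Optional[bool]]
) -> tuple[list[str], list[str]]:
    """Return (improvements, regressions) by case id, skipping infra errors."""
    improvements, regressions = [], []
    for case_id in sorted(baseline.keys() & variant.keys()):
        base, var = baseline[case_id], variant[case_id]
        if base is None or var is None:
            continue  # infra error on either side: excluded from pairing
        if var and not base:
            improvements.append(case_id)
        elif base and not var:
            regressions.append(case_id)
    return improvements, regressions


def mcnemar_exact(n_improved: int, n_regressed: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = n_improved + n_regressed
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    # Under the null, each discordant pair is a fair coin flip, so the
    # smaller count follows Binomial(n, 0.5); double the lower tail.
    tail = sum(comb(n, i) for i in range(min(n_improved, n_regressed) + 1))
    return min(1.0, 2.0 * tail / 2**n)


baseline = {"a": True, "b": False, "c": False, "d": None, "e": True}
variant = {"a": True, "b": True, "c": True, "d": True, "e": False}
improved, regressed = pair_cases(baseline, variant)
print(improved, regressed, mcnemar_exact(len(improved), len(regressed)))
```

Case "d" is dropped from the pairing because the baseline side hit an infra error, so it counts as neither an improvement nor a regression.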

Typed failure taxonomy

Every result carries at most one FailureTag: WRONG_ANSWER, NO_ANSWER, REFUSED, MALFORMED_OUTPUT, TOOL_ERROR, TIMEOUT, CONTEXT_OVERFLOW, ADAPTER_ERROR, HARNESS_ERROR. Infrastructure errors (adapter or harness) are excluded from the aggregate score, so a flaky endpoint never silently penalizes a model. They stay visible in failure_counts.
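
The exclusion rule can be sketched as follows. The tag names come from the list above; the enum values, the (score, tag) tuple shape, the use of a plain mean as the aggregate, and the function name aggregate are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class FailureTag(Enum):
    WRONG_ANSWER = "wrong_answer"
    NO_ANSWER = "no_answer"
    REFUSED = "refused"
    MALFORMED_OUTPUT = "malformed_output"
    TOOL_ERROR = "tool_error"
    TIMEOUT = "timeout"
    CONTEXT_OVERFLOW = "context_overflow"
    ADAPTER_ERROR = "adapter_error"
    HARNESS_ERROR = "harness_error"


# Infrastructure failures say nothing about the model's ability.
INFRA_TAGS = {FailureTag.ADAPTER_ERROR, FailureTag.HARNESS_ERROR}


def aggregate(
    results: list[tuple[float, Optional[FailureTag]]]
) -> tuple[float, Counter]:
    """Mean score over non-infra results, plus counts of every failure tag."""
    scored = [score for score, tag in results if tag not in INFRA_TAGS]
    failure_counts = Counter(tag.name for _, tag in results if tag is not None)
    mean = sum(scored) / len(scored) if scored else 0.0
    return mean, failure_counts


results = [
    (1.0, None),
    (0.0, FailureTag.WRONG_ANSWER),
    (0.0, FailureTag.ADAPTER_ERROR),  # flaky endpoint: not scored, still counted
    (1.0, None),
]
score, failure_counts = aggregate(results)
print(round(score, 4), dict(failure_counts))
```

Here the adapter error is left out of the denominator, so the score is 2/3 rather than 2/4, while the error still shows up in the failure counts.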

Safety note

HumanEval+ executes model-generated code locally in a subprocess with a timeout. Run only models and datasets you trust, or wrap it in an OS-level sandbox.
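
As an illustration of what "a subprocess with a timeout" means in practice, here is a minimal sketch built on the standard library's subprocess.run. The function name run_untrusted and its return shape are hypothetical, and this is not the runner's actual implementation. A timeout only bounds wall-clock time; it does not restrict file or network access, which is why an OS-level sandbox is still advised.

```python
import subprocess
import sys


def run_untrusted(code: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run code in a fresh Python subprocess; return (passed, detail)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,  # child is killed when the timeout expires
        )
    except subprocess.TimeoutExpired:
        return False, "TIMEOUT"
    # A non-zero exit code (e.g. a failed assert) counts as a failure.
    return proc.returncode == 0, proc.stderr.strip()


print(run_untrusted("assert 1 + 1 == 2"))
print(run_untrusted("while True: pass", timeout_s=0.5))
```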

Contributing

Bug reports, eval requests, and questions: agi-eval.studio. Adding an eval is deliberately small. Implement an EvalRunner, bundle a sample, and add a catalog entry. The installed package is the reference: every live eval ships its source in agi_evals/evals/.

License

Runner code: Apache-2.0 (see LICENSE). Eval datasets retain their upstream licenses, documented per entry in the catalog.

This version: 0.1.1, released Jun 7, 2026
