ai01-eval

Benchmark your AI agent or RAG pipeline against the AI01 leaderboard.

Install

pip install ai01-eval

Quick start

from ai01_eval import AI01Eval

client = AI01Eval(api_key="your-api-key")

# 1. Browse datasets
datasets = client.datasets.list()

# 2. Download a dataset
dataset = client.datasets.get("general-single-topic-v1")

# 3. Run your pipeline (ground-truth references are looked up server-side)
results = []
for item in dataset:
    answer = your_agent.run(item["query"], item.get("context"))
    results.append({
        "id":     item["id"],
        "query":  item["query"],
        "answer": answer,
    })

# 4. Submit — metrics are computed server-side
run = client.submit(
    dataset="general-single-topic-v1",
    results=results,
    agent_name="My RAG Agent v1",
)
print(run.scores)
# {'exact_match': 0.71, 'f1': 0.84, 'faithfulness': 0.79}
print(run.report_url)
# https://ai01.dev/benchmark?run=a3f2b1c9
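Step 3 of the quick start can be factored into a small helper so the loop can be tried without network access. This is a minimal sketch, assuming your agent is any callable that takes `(query, context)`; `build_results` and `agent_fn` are hypothetical names and not part of `ai01-eval`.

```python
from typing import Callable, Iterable, Optional


def build_results(
    items: Iterable[dict],
    agent_fn: Callable[[str, Optional[str]], str],
) -> list[dict]:
    """Run agent_fn over dataset items and return dicts in the submit format."""
    results = []
    for item in items:
        # "context" is only present on RAG datasets, so read it with .get().
        answer = agent_fn(item["query"], item.get("context"))
        results.append({"id": item["id"], "query": item["query"], "answer": str(answer)})
    return results


# Dry run with hand-written items shaped like the documented ones.
sample_items = [
    {"id": "q1", "query": "Capital of France?", "context": "Paris is the capital of France."},
    {"id": "q2", "query": "2 + 2?"},
]
sample_results = build_results(sample_items, lambda query, context: "Paris" if context else "4")
print(sample_results)
```

With the real client, the same helper should accept the object returned by `client.datasets.get(...)`, since it only relies on iteration.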

Configuration

API key

# Explicit
client = AI01Eval(api_key="your-api-key")

# Via environment variable
# export AI01_API_KEY=your-api-key
client = AI01Eval()
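If you want a clear error before any request is made, you can resolve the key yourself and pass it explicitly. The sketch below is a pre-flight check under stated assumptions: `resolve_api_key` is a hypothetical helper, and how the library itself prioritises the argument over the environment variable is not documented here.

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Return the explicit key if given, else AI01_API_KEY from the environment."""
    key = (explicit or os.environ.get("AI01_API_KEY", "")).strip()
    if not key:
        raise RuntimeError("No API key: pass api_key=... or export AI01_API_KEY")
    return key


# Demo only: set the variable in-process so the example runs anywhere.
os.environ["AI01_API_KEY"] = "key-from-env"
print(resolve_api_key())                # key-from-env
print(resolve_api_key("key-explicit"))  # key-explicit
```

The result can then be passed as `AI01Eval(api_key=resolve_api_key())`.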

API Reference

`AI01Eval`

Main entry point for the package.

client = AI01Eval(api_key="...", base_url="https://api.ai01.dev")

Parameter	Type	Default	Description
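`api_key`	`str`	env `AI01_API_KEY`	Your AI01 API key. Get one at ai01.dev.
`base_url`	`str`	(unspecified)	Base URL of the AI01 API. The example above uses `https://api.ai01.dev`.

The client exposes two sub-clients and one shortcut method: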

Attribute	Type	Description
`client.datasets`	`DatasetClient`	List and download datasets.
`client.runs`	`RunsClient`	Retrieve past submission reports.
`client.submit(...)`	method	Shortcut to submit results (see below).

`client.datasets`

`client.datasets.list()`

Returns metadata for all available datasets.

datasets = client.datasets.list()
# [
#   {"id": "general-single-topic-v1", "name": "...", "num_queries": 120, "metrics": [...]},
#   ...
# ]

Returns: list[dict]
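Because the return value is a plain list of dicts, ordinary list comprehensions are enough to pick datasets. A minimal sketch, assuming only the `id` and `metrics` keys shown above; `datasets_with_metric` is a hypothetical helper, and the sample metadata reuses the values from the "Available datasets" table.

```python
def datasets_with_metric(datasets: list[dict], metric: str) -> list[str]:
    """Return IDs of datasets whose 'metrics' list contains the given metric."""
    return [d["id"] for d in datasets if metric in d.get("metrics", [])]


# Sample metadata in the documented shape.
sample = [
    {"id": "general-single-topic-v1", "name": "...", "num_queries": 120,
     "metrics": ["exact_match", "f1", "faithfulness"]},
    {"id": "general-knowledge-v1", "name": "...", "num_queries": 10,
     "metrics": ["exact_match", "f1", "bleu"]},
]
print(datasets_with_metric(sample, "faithfulness"))  # ['general-single-topic-v1']
```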

`client.datasets.get(dataset_id)`

Downloads a dataset by ID and returns a Dataset object you can iterate over.

dataset = client.datasets.get("general-single-topic-v1")
print(dataset.id)          # "general-single-topic-v1"
print(dataset.name)        # human-readable name
print(dataset.num_queries) # total number of items
print(dataset.metrics)     # ["exact_match", "f1", "faithfulness"]
print(len(dataset))        # number of items downloaded

for item in dataset:
    print(item["id"])       # unique item identifier
    print(item["query"])    # the question to answer
    print(item.get("context"))  # grounding document; None when the item has no context

Note: Ground-truth references are not included in downloaded items — they are looked up server-side when you submit results.

Returns: Dataset

`Dataset`

An iterable container for dataset items.

Property / Method	Type	Description
`.id`	`str`	Dataset ID.
`.name`	`str`	Human-readable dataset name.
`.num_queries`	`int`	Total number of queries in the dataset.
`.metrics`	`list[str]`	Metrics this dataset is evaluated on.
`len(dataset)`	`int`	Number of items downloaded.
`for item in dataset`	`dict`	Iterate over items. Each item has `id`, `query`, and optionally `context`.
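For offline smoke tests it can help to have a stand-in that exposes the same properties as the table above. The class below is a sketch, not the library's `Dataset`: `FakeDataset` and its constructor are assumptions made for illustration, and the items are invented.

```python
from typing import Iterator


class FakeDataset:
    """Offline stand-in exposing the documented Dataset properties."""

    def __init__(self, id: str, name: str, metrics: list[str], items: list[dict]):
        self.id = id
        self.name = name
        self.metrics = metrics
        self._items = items
        self.num_queries = len(items)

    def __len__(self) -> int:
        return len(self._items)

    def __iter__(self) -> Iterator[dict]:
        return iter(self._items)


fake = FakeDataset(
    id="local-smoke-test",
    name="Local smoke test",
    metrics=["exact_match", "f1"],
    items=[
        {"id": "q1", "query": "2 + 2?"},
        {"id": "q2", "query": "Sky colour?", "context": "The sky is blue."},
    ],
)
# Items without a "context" key are treated as non-RAG.
rag_items = [item for item in fake if "context" in item]
print(len(fake), len(rag_items))  # 2 1
```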

`client.submit(...)`

Submits your results to the AI01 server. Metrics are computed server-side.

run = client.submit(
    dataset="general-single-topic-v1",
    results=results,
    agent_name="My RAG Agent v1",
    submitter="your-username",       # optional, defaults to "anonymous"
    experiment_name="RAG baseline",  # optional
    description="First run",         # optional
    duration_seconds=42.7,           # optional, wall-clock seconds you measured
)

Parameters:

Parameter	Type	Default	Description
`dataset`	`str`	required	Dataset ID you ran against.
`results`	`list[dict]`	required	List of result dicts (see format below).
`agent_name`	`str`	required	Display name shown on the leaderboard.
`submitter`	`str`	`"anonymous"`	Your username or team name.
`experiment_name`	`str`	`None`	Label for this experiment run.
`description`	`str`	`None`	Free-text notes about this run.
`duration_seconds`	`float`	`None`	Pipeline wall-clock time, in seconds.
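The library does not appear to measure `duration_seconds` for you, so the value has to come from your own timing. One way to get it, sketched with the standard library only; `timed` is a hypothetical helper, and the lambda stands in for your pipeline loop.

```python
import time
from typing import Callable


def timed(fn: Callable[[], list]) -> tuple[list, float]:
    """Call fn once and return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    output = fn()
    return output, time.perf_counter() - start


# Stand-in pipeline: replace the lambda with the loop that builds your results.
results, duration_seconds = timed(lambda: [{"id": "q1", "query": "2 + 2?", "answer": "4"}])
print(f"{len(results)} results in {duration_seconds:.4f}s")
```

`time.perf_counter()` is used rather than `time.time()` because it is monotonic and intended for measuring intervals.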

Each dict in results must contain:

Key	Type	Description
`id`	`str`	Item ID from the dataset.
`query`	`str`	The original query string.
`answer`	`str`	Your agent's answer.
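A local check of the result format can catch mistakes before a submission is made. This is a conservative sketch based only on the table above; `validate_results` is a hypothetical helper, and treating duplicate IDs as a problem is an assumption, since the server's behaviour for duplicates is not documented here.

```python
REQUIRED_KEYS = ("id", "query", "answer")


def validate_results(results: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the format looks fine."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(results):
        for key in REQUIRED_KEYS:
            if not isinstance(row.get(key), str):
                problems.append(f"results[{index}]: '{key}' is missing or not a str")
        item_id = row.get("id")
        if isinstance(item_id, str):
            if item_id in seen_ids:
                problems.append(f"results[{index}]: duplicate id '{item_id}'")
            seen_ids.add(item_id)
    return problems


good = [{"id": "q1", "query": "2 + 2?", "answer": "4"}]
bad = [{"id": "q1", "query": "2 + 2?"}, {"id": "q1", "query": "2 + 2?", "answer": 4}]
print(validate_results(good))  # []
for problem in validate_results(bad):
    print(problem)
```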

Returns: RunReport

`RunReport`

Returned by client.submit(...) and client.runs.get(run_id).

Property	Type	Description
`.id`	`str`	Unique run ID.
`.scores`	`dict[str, float]`	Metric scores, e.g. `{"f1": 0.84, "exact_match": 0.71}`.
`.report_url`	`str`	URL to the full report on the AI01 leaderboard.
`.duration_seconds`	`float \| None`	Pipeline duration if provided at submit time.
`.submitted_at`	`str`	ISO-8601 timestamp of the submission.

print(run.id)          # "a3f2b1c9"
print(run.scores)      # {'exact_match': 0.71, 'f1': 0.84, 'faithfulness': 0.79}
print(run.report_url)  # "https://ai01.dev/benchmark?run=a3f2b1c9"
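Since `.scores` is a plain `dict[str, float]`, two runs can be compared without any extra API. A minimal sketch; `score_deltas` is a hypothetical helper, and the candidate scores are invented for illustration.

```python
def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every metric present in both dicts."""
    shared = sorted(set(baseline) & set(candidate))
    return {metric: round(candidate[metric] - baseline[metric], 4) for metric in shared}


baseline_scores = {"exact_match": 0.71, "f1": 0.84, "faithfulness": 0.79}
candidate_scores = {"exact_match": 0.75, "f1": 0.82}  # made-up numbers
print(score_deltas(baseline_scores, candidate_scores))
# {'exact_match': 0.04, 'f1': -0.02}
```

Metrics that appear in only one of the two runs are skipped, which keeps the comparison meaningful when the runs used datasets with different metric sets.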

`client.runs`

`client.runs.get(run_id)`

Retrieves a past submission report by run ID.

run = client.runs.get("a3f2b1c9")
print(run.scores)

Returns: RunReport

Error handling

All errors raised by this library inherit from AI01Error:

from ai01_eval import AI01Error, AI01AuthError, AI01NotFoundError, AI01RateLimitError

try:
    dataset = client.datasets.get("unknown-dataset")
except AI01AuthError:
    print("Check your API key.")
except AI01NotFoundError:
    print("Dataset not found.")
except AI01RateLimitError:
    print("Slow down — rate limit hit.")
except AI01Error as exc:
    print(f"Unexpected error: {exc}")

Exception	HTTP status	Cause
`AI01AuthError`	401	Invalid or missing API key
`AI01NotFoundError`	404	Dataset or run ID not found
`AI01RateLimitError`	429	Too many requests
`AI01ServerError`	5xx / other 4xx	Server error or any other unhandled HTTP error status
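Rate-limit errors are usually transient, so a retry with exponential backoff is a common way to handle them. The sketch below is generic and takes the exception class as a parameter, so it runs without the package installed; `call_with_retry` and `FakeRateLimit` are hypothetical names, and the delays are an assumption because the server's actual rate-limit window is not documented here.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Type[Exception],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on retry_on with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Demo with a stand-in exception. With the real client you would pass
# retry_on=AI01RateLimitError and fn=lambda: client.datasets.get(...).
class FakeRateLimit(Exception):
    pass


calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit()
    return "ok"


delays: list[float] = []  # record delays instead of actually sleeping
print(call_with_retry(flaky, FakeRateLimit, sleep=delays.append), delays)
# ok [1.0, 2.0]
```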

Available datasets

ID	Description	Queries	Metrics
`general-single-topic-v1`	RAG QA over a single shared corpus	120	`exact_match`, `f1`, `faithfulness`
`general-knowledge-v1`	Factual QA, no context	10	`exact_match`, `f1`, `bleu`

Metrics

All metrics are computed server-side to ensure fairness:

Metric	Description
`exact_match`	Normalised string equality (lowercased, punctuation stripped).
`f1`	Token-level F1 overlap between answer and reference.
`bleu`	Unigram BLEU with brevity penalty.
`faithfulness`	Whether the answer is grounded in the provided context (LLM judge).
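For a rough local sanity check on your own labelled examples, `exact_match` and `f1` can be approximated from the descriptions above. This is a sketch under assumptions, not the server's implementation: the exact normalisation rules and tokeniser are not specified, so whitespace tokens and `string.punctuation` are used here, and official scores may differ.

```python
import string
from collections import Counter


def normalise(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match(answer: str, reference: str) -> float:
    """1.0 if the normalised strings are equal, else 0.0."""
    return float(normalise(answer) == normalise(reference))


def token_f1(answer: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over whitespace tokens."""
    answer_tokens = normalise(answer).split()
    reference_tokens = normalise(reference).split()
    # Counter intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(answer_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris.", "paris"))             # 1.0
print(token_f1("the capital is Paris", "Paris"))  # 0.4
```

Because ground-truth references are not included in downloaded items, these functions are only useful against references you supply yourself.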

License

MIT

Release history

Version	Released
0.2.1 (this version)	Mar 25, 2026
0.2.0	Mar 25, 2026
0.1.0	Mar 9, 2026

