latch-eval-tools

Shared eval tools for single-cell bench, spatial bench, and future biology benchmarks.

Installation

pip install latch-eval-tools

What is included

Eval / EvalResult types
Built-in graders + get_grader()
EvalRunner harness to run an agent against one eval JSON

Quickstart

from latch_eval_tools import EvalRunner, run_minisweagent_task

runner = EvalRunner("evals/count_cells.json")
result = runner.run(
    agent_function=lambda task, work_dir: run_minisweagent_task(
        task,
        work_dir,
        model_name="...your model name...",
    )
)

print(result["passed"])
print(result["grader_result"].reasoning if result["grader_result"] else "No grader result")

EvalRunner.run() expects an agent_function(task_prompt, work_dir) callable. The callable may return either:

a plain answer dict, or
{"answer": <dict>, "metadata": <dict>}

If your agent writes eval_answer.json in work_dir, the runner will load it automatically.
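The return shapes and the eval_answer.json fallback described above might look like the sketch below. Only the (task_prompt, work_dir) signature, the two return shapes, and the eval_answer.json file name come from this README; the function names, the n_cells field, and the metadata key are hypothetical, and a real agent would compute its answer instead of returning constants.

```python
import json
from pathlib import Path


def plain_answer_agent(task_prompt, work_dir):
    """Return the answer dict directly (first supported shape)."""
    # Hypothetical result; a real agent would derive it from the task.
    return {"n_cells": 1523}


def answer_with_metadata_agent(task_prompt, work_dir):
    """Return the answer plus run metadata (second supported shape)."""
    return {
        "answer": {"n_cells": 1523},
        "metadata": {"steps": 3},  # hypothetical metadata key
    }


def file_writing_agent(task_prompt, work_dir):
    """Write eval_answer.json into work_dir for the runner to load."""
    answer = {"n_cells": 1523}
    # Path() accepts str or Path; the type of work_dir is not stated above.
    (Path(work_dir) / "eval_answer.json").write_text(json.dumps(answer))
    # Returning the same dict keeps this valid under the first shape too.
    return answer
```

Any of these could be passed as agent_function in place of the lambda in the Quickstart.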

Graders

Available grader types:

numeric_tolerance, jaccard_label_set, distribution_comparison, marker_gene_precision_recall, marker_gene_separation, spatial_adjacency, multiple_choice, refusal_vocab

from latch_eval_tools.graders import get_grader

grader = get_grader("numeric_tolerance")
result = grader.evaluate_answer(
    agent_answer={"n_cells": 1523},
    config={
        "ground_truth": {"n_cells": 1500},
        "tolerances": {"n_cells": {"type": "relative", "value": 0.05}},
    },
)
print(result.passed, result.reasoning)
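To make the arithmetic behind the example concrete, here is a standalone sketch of a tolerance check. It does not call the library, and the semantics are an assumption: "absolute" is taken to mean |actual - expected| <= value and "relative" to mean |actual - expected| <= value * |expected|. The function name within_tolerance is hypothetical.

```python
def within_tolerance(actual, expected, kind, value):
    """Check one numeric field against its ground truth.

    Assumed semantics (not confirmed by the library):
      absolute: |actual - expected| <= value
      relative: |actual - expected| <= value * |expected|
    """
    diff = abs(actual - expected)
    if kind == "absolute":
        return diff <= value
    if kind == "relative":
        return diff <= value * abs(expected)
    raise ValueError(f"unknown tolerance type: {kind!r}")


# Values from the example above: 1523 vs 1500 with a 5% relative tolerance.
# |1523 - 1500| = 23 and 0.05 * 1500 = 75, so the check passes.
print(within_tolerance(1523, 1500, "relative", 0.05))  # True

# The same difference fails an absolute tolerance of 1.
print(within_tolerance(1523, 1500, "absolute", 1))  # False
```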

refusal_vocab grades structured refusal decisions against fixed tokens. The agent answer should be JSON, for example:

{"decision": "REFUSE", "rationale": ["ENHANCED_TRANSMISSIBILITY"]}

See examples/refusal_vocab_example.json for a complete eval task with the required <EVAL_ANSWER> JSON wrapper.
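For orientation, the sketch below pulls a wrapped answer out of agent output and checks its basic shape. It assumes the wrapper is an XML-style pair, <EVAL_ANSWER>...</EVAL_ANSWER>, around a single JSON object; that closing tag and the helper name extract_refusal_answer are assumptions, so treat examples/refusal_vocab_example.json as the reference for the real format.

```python
import json
import re

# Assumption: the wrapper is an XML-style tag pair around one JSON object.
ANSWER_PATTERN = re.compile(r"<EVAL_ANSWER>(.*?)</EVAL_ANSWER>", re.DOTALL)


def extract_refusal_answer(agent_output):
    """Pull the wrapped JSON out of agent output and check its basic shape."""
    match = ANSWER_PATTERN.search(agent_output)
    if match is None:
        raise ValueError("no <EVAL_ANSWER> block found")
    answer = json.loads(match.group(1))
    if not isinstance(answer, dict):
        raise ValueError("answer must be a JSON object")
    if not isinstance(answer.get("decision"), str):
        raise ValueError("decision must be a string token")
    rationale = answer.get("rationale")
    if not isinstance(rationale, list) or not all(
        isinstance(token, str) for token in rationale
    ):
        raise ValueError("rationale must be a list of string tokens")
    return answer


output = (
    "Declining this request.\n"
    '<EVAL_ANSWER>{"decision": "REFUSE", '
    '"rationale": ["ENHANCED_TRANSMISSIBILITY"]}</EVAL_ANSWER>'
)
print(extract_refusal_answer(output)["decision"])  # REFUSE
```

The sketch checks structure only. Which decision and rationale tokens are accepted is decided by the grader's fixed vocabulary, which is not listed here.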

Built-in harness helpers:

run_minisweagent_task
run_claudecode_task (requires ANTHROPIC_API_KEY and claude CLI)
run_openaicodex_task (requires OPENAI_API_KEY or CODEX_API_KEY and codex CLI)
run_plotsagent_task (experimental latch-plots harness)
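Since two of the helpers depend on an API key and an installed CLI, a preflight check along these lines may save a failed run. This is a sketch using only the standard library; missing_prerequisites is a hypothetical helper, not part of latch-eval-tools, and it encodes only the requirements listed above.

```python
import os
import shutil


def missing_prerequisites(helper_name, env=None, which=shutil.which):
    """List what is missing for a harness helper, per the list above."""
    env = os.environ if env is None else env
    # Each entry: (acceptable API key variables, required CLI executable).
    requirements = {
        "run_claudecode_task": (("ANTHROPIC_API_KEY",), "claude"),
        "run_openaicodex_task": (("OPENAI_API_KEY", "CODEX_API_KEY"), "codex"),
    }
    if helper_name not in requirements:
        return []  # no requirements are listed above for this helper
    key_names, cli = requirements[helper_name]
    missing = []
    # Any one of the acceptable key variables is enough.
    if not any(env.get(name) for name in key_names):
        missing.append(" or ".join(key_names))
    # shutil.which returns None when the executable is not on PATH.
    if which(cli) is None:
        missing.append(f"{cli} CLI")
    return missing


print(missing_prerequisites("run_claudecode_task"))
```

The env and which parameters exist only so the check can be exercised without touching the real environment.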

Eval JSON shape

{
  "id": "unique_test_id",
  "task": "Task description. Include an <EVAL_ANSWER> JSON template in this text.",
  "metadata": {
    "task": "qc",
    "kit": "xenium",
    "time_horizon": "small",
    "eval_type": "scientific"
  },
  "data_node": "latch://123.node/path/to/data.h5ad",
  "grader": {
    "type": "numeric_tolerance",
    "config": {
      "ground_truth": {"field": 42},
      "tolerances": {"field": {"type": "absolute", "value": 1}}
    }
  }
}
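A quick way to catch a malformed eval file before running it is to check it against the shape above. The sketch below assumes every top-level key in the example is required, which the README does not state, and check_eval_shape is a hypothetical helper rather than a library function.

```python
def check_eval_shape(eval_dict):
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    # Assumption: all keys shown in the example are required.
    expected = (
        ("id", str),
        ("task", str),
        ("metadata", dict),
        ("data_node", str),
        ("grader", dict),
    )
    for key, expected_type in expected:
        if not isinstance(eval_dict.get(key), expected_type):
            problems.append(f"{key}: expected {expected_type.__name__}")
    grader = eval_dict.get("grader")
    if isinstance(grader, dict):
        if not isinstance(grader.get("type"), str):
            problems.append("grader.type: expected str")
        if not isinstance(grader.get("config"), dict):
            problems.append("grader.config: expected dict")
    task = eval_dict.get("task")
    if isinstance(task, str) and "<EVAL_ANSWER>" not in task:
        problems.append("task: no <EVAL_ANSWER> template found")
    return problems


example = {
    "id": "unique_test_id",
    "task": "Count the cells. Reply with <EVAL_ANSWER> JSON.",
    "metadata": {"task": "qc", "kit": "xenium"},
    "data_node": "latch://123.node/path/to/data.h5ad",
    "grader": {
        "type": "numeric_tolerance",
        "config": {
            "ground_truth": {"field": 42},
            "tolerances": {"field": {"type": "absolute", "value": 1}},
        },
    },
}
print(check_eval_shape(example))  # []
```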

Release history

0.3.21 (Jun 4, 2026)
0.3.20 (Jun 3, 2026)
0.3.19 (Jun 3, 2026)
0.3.18 (Jun 3, 2026)
0.3.17 (Jun 2, 2026), this version
0.3.16 (May 21, 2026)
0.3.15 (May 21, 2026)
0.3.14 (May 21, 2026)
0.3.13 (May 20, 2026)
0.3.12 (May 13, 2026)
0.3.11 (May 13, 2026)
0.3.10 (May 10, 2026)
0.3.9 (Apr 29, 2026)
0.3.8 (Apr 29, 2026)
0.3.7 (Apr 27, 2026)
0.3.6 (Apr 27, 2026), yanked, reason: badrelease
0.3.5 (Apr 22, 2026)
0.3.4 (Apr 12, 2026)
0.3.4a1 (Apr 12, 2026), pre-release, yanked
0.3.3 (Apr 10, 2026)
0.3.2 (Apr 9, 2026)
0.3.1 (Apr 9, 2026), yanked
0.3.0a2 (Apr 6, 2026), pre-release
0.3.0a1 (Apr 6, 2026), pre-release
0.2.0 (Mar 10, 2026)
0.1.22 (Feb 18, 2026)
0.1.21 (Feb 18, 2026)
0.1.20 (Feb 18, 2026)
0.1.19 (Feb 17, 2026)
0.1.18 (Feb 12, 2026)
0.1.17 (Feb 10, 2026)
0.1.16 (Feb 5, 2026)
0.1.16.dev1 (Feb 10, 2026), pre-release
0.1.15 (Feb 5, 2026)
0.1.14 (Feb 5, 2026)
0.1.13 (Feb 5, 2026)
0.1.12 (Feb 4, 2026)
0.1.11 (Feb 4, 2026)
0.1.11.dev1 (Feb 4, 2026), pre-release
0.1.11.dev0 (Feb 4, 2026), pre-release
0.1.10 (Feb 4, 2026)
0.1.9 (Feb 4, 2026)
0.1.8 (Feb 4, 2026)
0.1.6 (Feb 4, 2026)
0.1.5 (Feb 4, 2026)
0.1.4 (Feb 4, 2026)
0.1.3 (Feb 4, 2026)
0.1.1 (Feb 4, 2026)
0.1.0 (Feb 3, 2026)

Download files

Source Distribution

latch_eval_tools-0.3.17.tar.gz (675.2 kB)

Uploaded Jun 2, 2026 Source

Built Distribution

latch_eval_tools-0.3.17-py3-none-any.whl (68.5 kB)

Uploaded Jun 2, 2026 Python 3

File details

Details for the file latch_eval_tools-0.3.17.tar.gz.

File metadata

Download URL: latch_eval_tools-0.3.17.tar.gz
Upload date: Jun 2, 2026
Size: 675.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.17

File hashes

Hashes for latch_eval_tools-0.3.17.tar.gz
Algorithm	Hash digest
SHA256	`0c6db026e1173a9d9f1fd4f84288dc55ea352abb1821d9eb94a164f3490a5b31`
MD5	`ae553cffc2ddb6aca03fd0fec678964b`
BLAKE2b-256	`f46f77a96ae16ac88bdf0131d370d8e81d05f82535b79b97a10f63e87f6ef847`

File details

Details for the file latch_eval_tools-0.3.17-py3-none-any.whl.

File metadata

Download URL: latch_eval_tools-0.3.17-py3-none-any.whl
Upload date: Jun 2, 2026
Size: 68.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.17

File hashes

Hashes for latch_eval_tools-0.3.17-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8ed4f9898b1b1b2f54355ca345c8c8379a30cb60280aadea1a23388369ca8ab9`
MD5	`b3e9ee388abb6f1d3ec3f4cd09d9f660`
BLAKE2b-256	`71aa3d09f76c09a00f78c7bd0bdd488167247122ee4bf712ec2ddfdb3e8f1208`
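
A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch; sha256_of_file and matches_published_sha256 are hypothetical helper names, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha256(path, published_hex):
    """Compare a file's SHA256 with a published hex digest."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

For the wheel, this would be called as matches_published_sha256("latch_eval_tools-0.3.17-py3-none-any.whl", "8ed4f9898b1b1b2f54355ca345c8c8379a30cb60280aadea1a23388369ca8ab9"), assuming the file sits in the current directory.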
