Universal evaluation harness for AI agents (local + embeddable).
sentient-evals
sentient-evals is a plug-and-play evaluation harness for AI agents (any framework) that can run locally/offline and can also be embedded by the Sentient platform.
Goals
- Run tasks concurrently with multiple trials to reduce variance.
- Capture trajectories (ATIF) and outcomes (final environment state).
- Support code-based graders (tests, static checks, tool-call verification), model-based graders (LLM-as-judge, multi-judge voting + calibration), and human review hooks for calibration.
- Produce a Harbor-style jobs directory for debuggability.
- Keep the core harness framework-agnostic; integrate frameworks via adapters.
This design is aligned with Anthropic’s definitions of task, trial, transcript, outcome, grader, harness, and suites (Demystifying evals for AI agents).
Install
pip install sentient-evals
For model-based graders:
pip install "sentient-evals[llm]"
CLI (local mode)
sentient-evals --help
sentient-evals run --help
Installed CLI agents (Harbor-style)
Built-in installed adapters (run via --adapter <name>) mirror Harbor's CLI agents:
- claude-code
- codex
- opencode
- cursor-cli
- cline-cli
- gemini-cli
- goose
- qwen-coder
- openhands
- swe-agent
- mini-swe-agent
- aider
These adapters install the CLI inside the trial environment at runtime. For sandboxed runs, use Docker, Daytona, or E2B environments so the agent can be installed in an isolated container.
Plug-and-play graders via config
For real evaluations, pass an adapter and a grader config (TOML or JSON). For task bundles, the CLI can also default to the verifier_script grader when tests/test.sh exists.
Plug-and-play custom agents via --agent-file
If you have a local Python agent (LangChain, CrewAI, custom loop, etc.), you can point the CLI at a file without packaging your repo:
sentient-evals run --tasks-dir ./tasks --env local_python --agent-file ./path/to/agent.py:my_agent --config eval.toml
The :my_agent attribute can be:
- an adapter object with an async run(task, instruction, seed, env, artifacts) method
- a factory function that returns such an adapter
- a plain function (sync/async) that returns dict/str (it will be wrapped automatically)
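As a rough illustration of the first two forms, a file such as ./path/to/agent.py might look like the sketch below. Only the run(task, instruction, seed, env, artifacts) signature comes from the list above; the class name EchoAdapter, the keys of the returned dict, and the assumption that instruction is a string are hypothetical.

```python
import asyncio


class EchoAdapter:
    """Hypothetical adapter that echoes the instruction back as its answer."""

    async def run(self, task, instruction, seed, env, artifacts):
        # A real agent would call its framework here (LangChain, CrewAI, ...).
        # The shape of the return value is an assumption, not a documented schema.
        await asyncio.sleep(0)  # stand-in for real asynchronous work
        return {"output": f"echo: {instruction}", "seed": seed}


def my_agent():
    """Factory form: returns an adapter object when called."""
    return EchoAdapter()


if __name__ == "__main__":
    # Local smoke test that does not involve the harness at all.
    adapter = my_agent()
    result = asyncio.run(
        adapter.run(task=None, instruction="say hi", seed=0, env=None, artifacts=None)
    )
    print(result)
```

Running the file directly exercises the adapter without the harness, which can help separate agent bugs from harness configuration problems.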
Example (TOML):
[adapter]
type = "import"
import_path = "my_project.my_adapter:build_adapter"
[[graders]]
type = "verifier_script"
[[graders]]
type = "static_analysis"
config = { checks = [{ name = "ruff", cmd = "ruff check ." }] }
Run:
sentient-evals run --tasks-dir path/to/tasks --env docker_cli --config eval.toml --concurrency 4
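The import_path value in the config above follows the common module:attribute convention. The harness's own loader is not shown in this README, so the following is only a sketch of how such a string is typically resolved; resolve_import_path is a hypothetical helper, not part of sentient-evals.

```python
import importlib


def resolve_import_path(import_path: str):
    """Resolve a 'package.module:attribute' string to the named object."""
    module_name, sep, attr_name = import_path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


if __name__ == "__main__":
    # Demonstrated with a standard-library target so the sketch runs anywhere;
    # in the config above the target would be my_project.my_adapter:build_adapter.
    dumps = resolve_import_path("json:dumps")
    print(dumps({"ok": True}))
```

In practice this means my_project must be importable (installed or on PYTHONPATH) in the environment where the CLI runs.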
Task bundles (directory format)
For production agent evals, prefer task bundles (directories) over plain JSON tasks.
Bundle layout:
- task.toml (task id, inputs, env config)
- instruction.md (agent-facing instruction)
- files/ (payload copied into the trial workspace)
- environment/ (Dockerfile/build context for container backends)
- tests/ (optional verifier scripts and fixtures)
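A quick way to catch malformed bundles before a run is to check each directory against the layout above. The sketch below does that with pathlib. Treating task.toml and instruction.md as required and the three directories as optional is an assumption (the README only marks tests/ as optional), and check_bundle is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# Assumption: these two files are mandatory; the README does not say so explicitly.
REQUIRED_FILES = ("task.toml", "instruction.md")
# Assumption: these directories may be absent in a valid bundle.
OPTIONAL_DIRS = ("files", "environment", "tests")


def check_bundle(bundle_dir):
    """Report which expected files are missing and which directories exist."""
    root = Path(bundle_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    present = [name for name in OPTIONAL_DIRS if (root / name).is_dir()]
    return {"missing_files": missing, "present_dirs": present}


if __name__ == "__main__":
    # Build a deliberately incomplete bundle in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        bundle = Path(tmp) / "task1"
        (bundle / "tests").mkdir(parents=True)
        (bundle / "task.toml").write_text("", encoding="utf-8")
        print(check_bundle(bundle))
```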
Run a directory of task bundles:
sentient-evals run --tasks-dir path/to/tasks --env docker_cli
Supported --env values:
- local_python
- docker_cli
- docker_sdk (requires sentient-evals[docker])
- podman_cli (requires podman installed)
- daytona (requires sentient-evals[daytona] and Daytona configured)
- e2b (requires sentient-evals[e2b] and E2B_API_KEY)
- modal (requires sentient-evals[modal], MODAL_TOKEN_ID, and MODAL_TOKEN_SECRET)
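Since some backends depend on environment variables, a small preflight check can fail fast before any trials start. The sketch below covers only the variables named in the list above; missing_env_vars is a hypothetical helper, and the harness may perform its own validation.

```python
import os

# Environment variables named in the list above. Backends not listed here
# have no variables documented in this README.
REQUIRED_ENV_VARS = {
    "e2b": ("E2B_API_KEY",),
    "modal": ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"),
}


def missing_env_vars(env_name, environ=None):
    """Return the required variables that are unset or empty for a backend."""
    environ = os.environ if environ is None else environ
    required = REQUIRED_ENV_VARS.get(env_name, ())
    return [var for var in required if not environ.get(var)]


if __name__ == "__main__":
    for backend in ("local_python", "e2b", "modal"):
        print(backend, missing_env_vars(backend))
```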
Notes:
- Docker-based runs require Docker (or Podman) installed and running.
- For parallel sandboxed evals, prefer docker_cli locally or Daytona/E2B in the cloud.
- Modal is a strong cloud alternative when Daytona/E2B networking or runtime installation constraints block runs.
- If Daytona access is blocked (for example, client-side IP restrictions), use --env e2b as a cloud fallback.
- Cloud backends currently assume single-container tasks (multi-container orchestration is not yet supported).
- You can throttle cloud provider concurrency per task bundle via environment.provider_concurrency in task.toml.
- E2B does not build per-task Dockerfiles at runtime. For container tasks on E2B, set [environment].image to a valid E2B template id (for example base).
- Datasets that rely on Docker image parity (for example many SWE-bench style tasks with FROM swebench/...) should run on docker_cli or daytona unless you provide mapped E2B templates.
Install E2B support:
pip install "sentient-evals[e2b]"
export E2B_API_KEY=your_api_key
Example:
sentient-evals run --tasks-dir path/to/tasks --env e2b --adapter cursor-cli --config eval.toml
Install Modal support:
pip install "sentient-evals[modal]"
export MODAL_TOKEN_ID=your_token_id
export MODAL_TOKEN_SECRET=your_token_secret
Example:
sentient-evals run --tasks-dir path/to/tasks --env modal --adapter cursor-cli --config eval.toml
Output layout (local runs)
By default, results are written under jobs/<run_id>/:
- jobs/<run_id>/run_config.json
- jobs/<run_id>/run_result.json
- jobs/<run_id>/trials/<trial_id>/trial_config.json
- jobs/<run_id>/trials/<trial_id>/trajectory.json
- jobs/<run_id>/trials/<trial_id>/outcome.json
- jobs/<run_id>/trials/<trial_id>/trial_result.json
- jobs/<run_id>/trials/<trial_id>/judge/ (optional judge artifacts)
- jobs/<run_id>/trials/<trial_id>/verifier/ (optional verifier artifacts)
Artifact anatomy
Example directory structure for a run with one trial using an LLM-as-judge grader:
jobs/my-run-2025-01-20/
├── run_config.json              # Run-level configuration
├── run_result.json              # Run-level aggregated results
└── trials/
    └── task1__0/
        ├── trial_config.json    # Trial configuration (seed, adapter, etc.)
        ├── trajectory.json      # ATIF trajectory (full run)
        ├── outcome.json         # Final environment state snapshot
        ├── trial_result.json    # Trial-level grader results
        ├── judge/               # LLM judge artifacts (when using LLM graders)
        │   ├── prompt.txt       # Judge prompt sent to LLM
        │   ├── response.json    # Raw LLM response (full API response)
        │   ├── response.txt     # Extracted text content
        │   └── verdict.json     # Parsed verdict (passed, score, model)
        └── verifier/            # Code-based grader outputs (optional)
            └── test_output.txt  # Example: test stdout/stderr
Example: jobs/my-run-2025-01-20/run_config.json
{
  "schema_version": "v1",
  "run_id": "my-run-2025-01-20",
  "suite": {
    "schema_version": "v1",
    "id": "default",
    "trials_per_task": 1,
    "concurrency": 1,
    "seeds": [123]
  },
  "adapter": "json_echo",
  "started_at": "2025-01-20T10:00:00Z",
  "harness_version": "0.0.1",
  "model": null
}
Example: jobs/my-run-2025-01-20/run_result.json
{
  "schema_version": "v1",
  "run_id": "my-run-2025-01-20",
  "suite_id": "default",
  "started_at": "2025-01-20T10:00:00Z",
  "finished_at": "2025-01-20T10:00:05Z",
  "task_count": 1,
  "trial_count": 1,
  "passed_trials": 1,
  "failed_trials": 0,
  "avg_score": 1.0
}
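Because run_result.json is plain JSON, a run can be summarized with a few lines of standard-library Python. The sketch below derives a pass rate from the counters in the example above. It assumes the field names shown in that example are stable across versions, and summarize_run is a hypothetical helper rather than part of the package.

```python
import json
import tempfile
from pathlib import Path


def summarize_run(run_dir):
    """Read run_result.json and derive a pass rate from its counters."""
    path = Path(run_dir) / "run_result.json"
    result = json.loads(path.read_text(encoding="utf-8"))
    trial_count = result["trial_count"]
    # Guard against an empty run to avoid dividing by zero.
    pass_rate = result["passed_trials"] / trial_count if trial_count else 0.0
    return {
        "run_id": result["run_id"],
        "pass_rate": pass_rate,
        "avg_score": result["avg_score"],
    }


if __name__ == "__main__":
    # Recreate a trimmed version of the example file in a temporary directory.
    example = {
        "run_id": "my-run-2025-01-20",
        "trial_count": 1,
        "passed_trials": 1,
        "failed_trials": 0,
        "avg_score": 1.0,
    }
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "run_result.json").write_text(json.dumps(example), encoding="utf-8")
        print(summarize_run(tmp))
```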
Example: jobs/my-run-2025-01-20/trials/task1__0/judge/verdict.json
{
  "passed": true,
  "score": 1.0,
  "judge_model": "gpt-4",
  "raw_head": "PASS"
}
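When triaging a run, it is often useful to list the trials the judge rejected. The sketch below walks the trials/<trial_id>/judge/verdict.json files described above and collects the trial ids whose verdict did not pass. failed_judge_trials is a hypothetical helper, and treating a missing passed field as a failure is an assumption.

```python
import json
import tempfile
from pathlib import Path


def failed_judge_trials(run_dir):
    """Return sorted trial ids whose judge/verdict.json did not pass."""
    failed = []
    for verdict_path in Path(run_dir).glob("trials/*/judge/verdict.json"):
        verdict = json.loads(verdict_path.read_text(encoding="utf-8"))
        # Assumption: a verdict without a "passed" field counts as failed.
        if not verdict.get("passed", False):
            # verdict.json lives in <trial_id>/judge/, so go up two levels.
            failed.append(verdict_path.parent.parent.name)
    return sorted(failed)


if __name__ == "__main__":
    # Build a tiny fake run with one passing and one failing trial.
    with tempfile.TemporaryDirectory() as tmp:
        for trial_id, passed in (("task1__0", True), ("task1__1", False)):
            judge_dir = Path(tmp) / "trials" / trial_id / "judge"
            judge_dir.mkdir(parents=True)
            verdict = {"passed": passed, "score": float(passed)}
            (judge_dir / "verdict.json").write_text(json.dumps(verdict), encoding="utf-8")
        print(failed_judge_trials(tmp))
```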
License
Apache-2.0. See LICENSE.