Agent Flow Test (AgentFT) - A simple evaluation harness for AI agents.
AgentFT
Agent Flow Test - Pytest for AI agents
AgentFT (Agent Flow Test) is a pytest-style evaluation framework for AI agents. It provides composable primitives for tasks, scenarios, agents, and judges, with production-ready infrastructure for concurrency, retries, observability, and regression control.
What's New
- Secure declarative config mode: --config-json --strict-config
- Distributed local orchestration: aft orchestrate for sharded multi-process runs + merge
- Retry policy by error type with richer error taxonomy
- Task-level vs judge-level aggregation with configurable task pass rules
- Ranking support: aft rank (Elo)
- Artifact migration tooling: aft migrate
- Columnar export API (DuckDB/Parquet)
- Environment/episode scenario toolkit and additional preset benchmark suites
- Structured external event sinks for run lifecycle telemetry
- Interactive report drilldown with filtering
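To make "retry policy by error type" concrete, here is a minimal sketch of the general technique in plain asyncio. It does not use AgentFT's API: the error classes, the RETRY_POLICY table, and call_with_retries are hypothetical names chosen for illustration, and AgentFT's actual error taxonomy and configuration may look different.

```python
import asyncio

# Hypothetical error taxonomy for illustration; not AgentFT's real classes.
class RateLimitError(Exception): ...
class TransientNetworkError(Exception): ...
class InvalidResponseError(Exception): ...

# Per-error-type policy: (max_attempts, base_delay_seconds).
# Error types that are absent from the table are never retried.
RETRY_POLICY = {
    RateLimitError: (5, 0.02),
    TransientNetworkError: (3, 0.01),
}

async def call_with_retries(fn, policy=RETRY_POLICY):
    """Await fn(), retrying according to the type of error it raises."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return await fn()
        except Exception as exc:
            # Find the first policy entry matching this error type, if any.
            rule = next(
                (r for err_type, r in policy.items() if isinstance(exc, err_type)),
                None,
            )
            if rule is None or attempt >= rule[0]:
                raise  # non-retryable error, or attempts exhausted
            _, base_delay = rule
            # Exponential backoff: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

With this shape, a rate-limit error can be given more attempts and longer waits than a network blip, while a malformed response fails immediately.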
Features
- Pytest-like simplicity: Minimal boilerplate, clear abstractions
- Production-ready execution: Async runner, retries, rate limiting, timeouts, fail-fast modes
- Flexible parallelism: Independent controls for agents, tasks, and judges
- Deterministic reproducibility: Seeded shuffle + deterministic sharding
- Strong artifact model: JSONL or SQLite backends with schema versioning and checkpoints
- Regression workflows: Compare, gate, rejudge, analyze, rank, merge
- Extensibility: Plugin registry for agents, scenarios, and judges
- Scenario breadth: List/CSV/JSONL/HuggingFace + rollout-based episode scenarios
- Governance support: API stability policy and governance/compliance guidance
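The "seeded shuffle + deterministic sharding" item can be illustrated with a short standalone sketch. This is one common way to get reproducible shards, not necessarily how AgentFT implements it; shard_tasks and its parameters are hypothetical names that only mirror the --shard-count and --shard-index flags.

```python
import random

def shard_tasks(task_ids, seed, shard_count, shard_index):
    """Return the task ids assigned to one shard, reproducibly."""
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    # Sort first so the result does not depend on the input order.
    ordered = sorted(task_ids)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(ordered)
    # Strided slicing gives disjoint shards that together cover every task.
    return ordered[shard_index::shard_count]
```

Because every process derives the same shuffled order from the same seed, each shard can be computed independently and the shards never overlap.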
Quick Start
Install from PyPI:
pip install agentft
Run an example:
git clone https://github.com/Geddydukes/agentflowtest
cd agentflowtest
aft run --config examples/config_example.py
Or use in code:
from agentft import RunConfig, run
from agentft.presets import build_math_basic_scenario, ExactMatchJudge

class MyAgent:
    name = "my_agent"
    version = "0.1.0"
    provider_key = None

    async def setup(self):
        pass

    async def reset(self):
        pass

    async def teardown(self):
        pass

    async def run_task(self, task, context=None):
        return {"response": "42"}

config = RunConfig(
    name="quick_test",
    agents=[MyAgent()],
    scenarios=[build_math_basic_scenario()],
    judges=[ExactMatchJudge()],
)
results = run(config)
print(f"Passed: {sum(r.passed for r in results)}/{len(results)}")
Core CLI Commands
aft run
aft run --config path/to/config.py
Strict non-executable mode:
aft run --config-json path/to/config.json --strict-config
Useful overrides:
aft run --config examples/config_example.py \
  --max-tasks-parallel 8 \
  --max-agents-parallel 2 \
  --max-judges-parallel 4 \
  --artifact-backend sqlite \
  --shard-count 4 --shard-index 1
aft summary
aft summary --run-dir runs/my_run/
Task-level rollup mode:
aft summary --run-dir runs/my_run/ --aggregation-level task --task-pass-rule majority
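As a rough illustration of task-level rollup, the sketch below collapses several judge verdicts per task into one pass/fail value. Only the majority rule name appears in the command above; the "all" and "any" rules, the function name, and the (task_id, passed) input shape are assumptions made for this example, and treating a tie as a failure is also a choice made here.

```python
from collections import defaultdict

def task_level_results(judge_results, rule="majority"):
    """Collapse (task_id, passed) judge verdicts into one verdict per task."""
    rules = {
        "majority": lambda v: sum(v) * 2 > len(v),  # strict majority; ties fail
        "all": all,                                 # hypothetical extra rule
        "any": any,                                 # hypothetical extra rule
    }
    decide = rules[rule]
    by_task = defaultdict(list)
    for task_id, passed in judge_results:
        by_task[task_id].append(bool(passed))
    return {task_id: decide(verdicts) for task_id, verdicts in by_task.items()}
```

Under this sketch, a task judged pass, pass, fail passes under majority, while a task judged pass, fail does not.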
aft compare
aft compare --run-a runs/base/ --run-b runs/candidate/
aft gate
aft gate --run-a runs/base/ --run-b runs/candidate/ --max-regressions 0
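The gating idea can be sketched in a few lines: count tasks that passed in the baseline but fail in the candidate, and fail the gate if that count exceeds the threshold. The helper names and the dict-based input are hypothetical, and ignoring tasks that are missing from the candidate run is an assumption; aft gate may define a regression differently.

```python
def find_regressions(base, candidate):
    """Task ids that passed in base and explicitly failed in candidate.

    Both arguments map task id -> bool. Tasks missing from the candidate
    are ignored here, which is an assumption of this sketch.
    """
    return sorted(
        task_id
        for task_id, passed in base.items()
        if passed and candidate.get(task_id) is False
    )

def gate(base, candidate, max_regressions=0):
    """Return True when the candidate stays within the regression budget."""
    return len(find_regressions(base, candidate)) <= max_regressions
```

A CI job could exit non-zero whenever gate returns False, which matches the intent of --max-regressions 0.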
aft analyze
aft analyze --run-dir runs/my_run/ --output-json runs/my_run/analytics.json
aft rejudge
aft rejudge --run-dir runs/my_run/ --config examples/config_example.py
aft merge
aft merge --run-dirs runs/shard0/ runs/shard1/ --output-dir runs/merged/
aft orchestrate
aft orchestrate --config examples/config_example.py --shards 4 --output-dir runs/merged_orchestrated/
aft rank
aft rank --run-dir runs/my_run/
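For readers unfamiliar with Elo, here is a minimal sketch of the standard rating update applied to pairwise agent comparisons. How aft rank derives its matches from a run is not described here, so the (agent_a, agent_b, score_a) match format, the K-factor of 32, and the initial rating of 1000 are all assumptions for illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_rank(matches, k=32.0, initial=1000.0):
    """Rank agents from (agent_a, agent_b, score_a) matches.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a draw.
    """
    ratings = {}
    for agent_a, agent_b, score_a in matches:
        ra = ratings.setdefault(agent_a, initial)
        rb = ratings.setdefault(agent_b, initial)
        ea = expected_score(ra, rb)
        # Zero-sum update: whatever A gains, B loses.
        ratings[agent_a] = ra + k * (score_a - ea)
        ratings[agent_b] = rb - k * (score_a - ea)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)
```

One plausible mapping, again an assumption, is to treat each task that two agents both attempted as a match, scoring 1.0 for the agent that passed when the other failed and 0.5 when their outcomes agree.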
aft migrate
aft migrate --run-dir runs/legacy_run/ --target-schema 1.1.0
Run Artifacts
Each run creates runs/<run_id>/ containing:
- results.jsonl or artifacts.db
- traces.jsonl
- run_metadata.json
- run_checkpoint.json
- agent_outputs.jsonl (JSONL backend cache, optional)
- cached_outputs table in artifacts.db (SQLite backend cache, optional)
- report.html
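If you want to inspect a JSONL run outside the CLI, a sketch like the following may help. The artifact schema is not documented in this section, so the assumption that each line of results.jsonl is a JSON object with a boolean "passed" field is a guess based on the r.passed attribute used in the Quick Start; check a real file before relying on it.

```python
import json
from pathlib import Path

def pass_rate(results_path):
    """Compute the fraction of records with a truthy "passed" field.

    Assumes one JSON object per line; the "passed" key is an assumption
    about the schema, not a documented guarantee.
    """
    passed = total = 0
    with Path(results_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            passed += bool(record.get("passed"))
    return passed / total if total else 0.0
```

Usage would look like pass_rate("runs/my_run/results.jsonl"), where the path follows the runs/<run_id>/ layout listed above.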
Smoke Test Visualization (2026-02-11)
Recent 5-run local validation matrix:
| Run | Mode | Path | Result | Notes |
|---|---|---|---|---|
| R1 | Baseline | runs/smoke_r1_baseline-5a5f7d3c | 4/4 (100%) | Control run |
| R2 | Parallel + Shuffle | runs/smoke_r2_parallel-0e22f470 | 4/4 (100%) | max_agents_parallel=2, max_tasks_parallel=4, seed=123 |
| R3 | SQLite + Cache | runs/smoke_r3_sqlite_cache-5890a078 | 4/4 (100%) | SQLite artifacts and cached outputs |
| R4 | Resume/Checkpoint | runs/smoke_r4_resume-e3bb2645 | 1/1 -> 4/4 (100%) | Early stop then resumed to completion |
| R5 | Orchestrated Shards | runs/smoke_r5_orchestrated | 4/4 (100%) | 2 shards + merged output |
Validation checks:
- summary on R1: pass rate 100%
- compare R1 vs R2: delta 0.00%, regressions 0
- gate R1 vs R2: passed
- analyze on R3: pass rate 100%, macro scenario rate 100%
- rejudge on cached outputs: 4/4 passed
Pass-rate bars:
- R1: ██████████ 100%
- R2: ██████████ 100%
- R3: ██████████ 100%
- R4: ██████████ 100% (after resume)
- R5: ██████████ 100%
Extended Functionality Guide
A comprehensive guide for all updated functionality is available at:
docs/UPDATED_FUNCTIONALITY.md
Testing
AgentFT includes 90+ automated tests across runner behavior, storage backends, reporting, CLI, presets, plugins, and resilience checks.
pip install -e ".[dev]"
PYTHONPATH=src pytest -q
Optional columnar export dependency:
pip install -e ".[columnar]"
Project Status
AgentFT is in active development.
Current version: 0.1.0
Links
- PyPI: https://pypi.org/project/agentft/
- GitHub: https://github.com/Geddydukes/agentflowtest
- API Stability: docs/API_STABILITY.md
- Governance: docs/GOVERNANCE.md
- Docs Map: docs/DOCS_UNIFICATION.md