Agent Flow Test (AgentFT) - A simple evaluation harness for AI agents.

AgentFT

Agent Flow Test - Pytest for AI agents

Available on PyPI. Requires Python 3.11+. License: MIT.

AgentFT (Agent Flow Test) is a pytest-style evaluation framework for AI agents. It provides composable primitives for tasks, scenarios, agents, and judges, with production-ready infrastructure for concurrency, retries, observability, and regression control.

What's New

  • Secure declarative config mode: --config-json --strict-config
  • Distributed local orchestration: aft orchestrate for sharded multi-process runs + merge
  • Retry policy by error type with richer error taxonomy
  • Task-level vs judge-level aggregation with configurable task pass rules
  • Ranking support: aft rank (Elo)
  • Artifact migration tooling: aft migrate
  • Columnar export API (DuckDB/Parquet)
  • Environment/episode scenario toolkit and additional preset benchmark suites
  • Structured external event sinks for run lifecycle telemetry
  • Interactive report drilldown with filtering
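
To make "retry policy by error type" concrete, here is a minimal sketch of the idea in plain asyncio. It is not AgentFT's implementation: the error classes, the `RETRY_POLICY` table, and `call_with_retries` are hypothetical names chosen for illustration.

```python
import asyncio


class RateLimitError(Exception):
    """Hypothetical error type: the provider asked us to slow down."""


class TransientError(Exception):
    """Hypothetical error type: a network hiccup that may succeed on retry."""


# Hypothetical policy table: error type -> (max retries, base delay in seconds).
RETRY_POLICY = {
    RateLimitError: (3, 0.01),
    TransientError: (2, 0.0),
}


async def call_with_retries(fn, policy=RETRY_POLICY):
    """Await fn(), retrying only the error types listed in the policy."""
    used = {}  # retries consumed so far, per error type
    while True:
        try:
            return await fn()
        except Exception as exc:
            rule = policy.get(type(exc))
            count = used.get(type(exc), 0)
            if rule is None or count >= rule[0]:
                raise  # unlisted error type, or retry budget exhausted
            used[type(exc)] = count + 1
            await asyncio.sleep(rule[1] * 2 ** count)  # exponential backoff
```

Errors that are absent from the table propagate immediately, so a programming mistake such as a `ValueError` is not retried.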

Features

  • Pytest-like simplicity: Minimal boilerplate, clear abstractions
  • Production-ready execution: Async runner, retries, rate limiting, timeouts, fail-fast modes
  • Flexible parallelism: Independent controls for agents, tasks, and judges
  • Deterministic reproducibility: Seeded shuffle + deterministic sharding
  • Strong artifact model: JSONL or SQLite backends with schema versioning and checkpoints
  • Regression workflows: Compare, gate, rejudge, analyze, rank, merge
  • Extensibility: Plugin registry for agents, scenarios, and judges
  • Scenario breadth: List/CSV/JSONL/HuggingFace + rollout-based episode scenarios
  • Governance support: API stability policy and governance/compliance guidance

Quick Start

Install from PyPI:

pip install agentft

Run an example:

git clone https://github.com/Geddydukes/agentflowtest
cd agentflowtest
aft run --config examples/config_example.py

Or use in code:

from agentft import RunConfig, run
from agentft.presets import build_math_basic_scenario, ExactMatchJudge

# Minimal agent: it answers "42" to every task.
class MyAgent:
    name = "my_agent"
    version = "0.1.0"
    provider_key = None

    # Lifecycle hooks; this minimal agent needs no setup or cleanup.
    async def setup(self):
        pass

    async def reset(self):
        pass

    async def teardown(self):
        pass

    # Produce the agent's answer for a single task.
    async def run_task(self, task, context=None):
        return {"response": "42"}

# A run combines agents, scenarios, and judges.
config = RunConfig(
    name="quick_test",
    agents=[MyAgent()],
    scenarios=[build_math_basic_scenario()],
    judges=[ExactMatchJudge()],
)

results = run(config)
print(f"Passed: {sum(r.passed for r in results)}/{len(results)}")

Core CLI Commands

aft run

aft run --config path/to/config.py

Strict mode, which reads a declarative JSON config instead of executing a Python config file:

aft run --config-json path/to/config.json --strict-config

Useful overrides:

aft run --config examples/config_example.py \
  --max-tasks-parallel 8 \
  --max-agents-parallel 2 \
  --max-judges-parallel 4 \
  --artifact-backend sqlite \
  --shard-count 4 --shard-index 1
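
The `--shard-count` and `--shard-index` flags pair naturally with a seeded shuffle. The following is a sketch of one common way to get reproducible ordering and stable shard assignment. It is an assumption about the general technique, not a description of how AgentFT assigns tasks, and the function names are hypothetical.

```python
import hashlib
import random


def shuffled(task_ids, seed):
    """Return a copy of task_ids shuffled reproducibly with the given seed."""
    rng = random.Random(seed)
    ordered = sorted(task_ids)  # sort first so the input order does not matter
    rng.shuffle(ordered)
    return ordered


def shard_of(task_id, shard_count):
    """Map a task id to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def select_shard(task_ids, shard_count, shard_index):
    """Keep only the tasks that belong to shard_index."""
    return [t for t in task_ids if shard_of(t, shard_count) == shard_index]
```

Hashing the task id rather than its list position means that adding or removing a task does not move the other tasks to different shards.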

aft summary

aft summary --run-dir runs/my_run/

Task-level rollup mode:

aft summary --run-dir runs/my_run/ --aggregation-level task --task-pass-rule majority
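
As a rough illustration of task-level rollup, the sketch below collapses several judge verdicts per task into one task verdict. Only the `majority` rule name comes from the command above. The `all` and `any` rules, the `task_id` and `passed` field names, and both function names are assumptions.

```python
from collections import defaultdict


def task_pass(verdicts, rule="majority"):
    """Collapse a list of judge verdicts (booleans) into one task verdict."""
    if rule == "majority":
        return sum(verdicts) * 2 > len(verdicts)  # strict majority; a tie fails
    if rule == "all":  # hypothetical rule name
        return all(verdicts)
    if rule == "any":  # hypothetical rule name
        return any(verdicts)
    raise ValueError(f"unknown rule: {rule}")


def rollup(rows, rule="majority"):
    """Group judge-level rows by task and apply the pass rule to each group."""
    by_task = defaultdict(list)
    for row in rows:
        by_task[row["task_id"]].append(bool(row["passed"]))
    return {task: task_pass(v, rule) for task, v in by_task.items()}
```

With judge-level aggregation each row counts on its own, whereas here a task judged pass, pass, fail counts once as a pass.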

aft compare

aft compare --run-a runs/base/ --run-b runs/candidate/

aft gate

aft gate --run-a runs/base/ --run-b runs/candidate/ --max-regressions 0
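
Conceptually, a gate with `--max-regressions 0` fails when anything that passed in the baseline fails in the candidate. Here is a minimal sketch of that check, assuming each run has been reduced to a mapping from task key to pass/fail. The function names and the input shape are hypothetical.

```python
def count_regressions(base, candidate):
    """Count keys that passed in base and explicitly failed in candidate."""
    return sum(1 for key, ok in base.items() if ok and candidate.get(key) is False)


def gate(base, candidate, max_regressions=0):
    """Return True when the candidate stays within the regression budget."""
    return count_regressions(base, candidate) <= max_regressions
```

In this sketch, tasks missing from the candidate are not counted as regressions. A stricter gate could treat them as failures.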

aft analyze

aft analyze --run-dir runs/my_run/ --output-json runs/my_run/analytics.json

aft rejudge

aft rejudge --run-dir runs/my_run/ --config examples/config_example.py

aft merge

aft merge --run-dirs runs/shard0/ runs/shard1/ --output-dir runs/merged/

aft orchestrate

aft orchestrate --config examples/config_example.py --shards 4 --output-dir runs/merged_orchestrated/

aft rank

aft rank --run-dir runs/my_run/
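
For readers unfamiliar with Elo, the standard rating update is sketched below. How `aft rank` turns run results into pairwise matches is not described here, so treat the K-factor of 32 and the function names as assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, score_a, k=32.0):
    """Return new (A, B) ratings; score_a is 1 for a win, 0.5 draw, 0 loss."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

For example, two agents rated 1000 start with an expected score of 0.5 each, so a win moves the winner to 1016 and the loser to 984.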

aft migrate

aft migrate --run-dir runs/legacy_run/ --target-schema 1.1.0

Run Artifacts

Each run creates runs/<run_id>/ containing:

  • results.jsonl or artifacts.db
  • traces.jsonl
  • run_metadata.json
  • run_checkpoint.json
  • agent_outputs.jsonl (JSONL backend cache, optional)
  • cached_outputs table in artifacts.db (SQLite backend cache, optional)
  • report.html
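
Because the JSONL backend writes one record per line, results can be inspected with the standard library alone. This sketch assumes each record carries a boolean `passed` field, mirroring the `r.passed` attribute used in Quick Start. The actual schema may differ, so check a real `results.jsonl` first.

```python
import json
from pathlib import Path


def pass_rate(results_path):
    """Compute the fraction of records whose (assumed) 'passed' field is true."""
    passed = total = 0
    with Path(results_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            passed += bool(record.get("passed"))
    return passed / total if total else 0.0
```

A typical call would be `pass_rate("runs/<run_id>/results.jsonl")`, with the run id filled in.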

Smoke Test Visualization (2026-02-11)

Recent 5-run local validation matrix:

| Run | Mode                | Path                                | Result                | Notes                                                 |
|-----|---------------------|-------------------------------------|-----------------------|-------------------------------------------------------|
| R1  | Baseline            | runs/smoke_r1_baseline-5a5f7d3c     | 4/4 (100%)            | Control run                                           |
| R2  | Parallel + Shuffle  | runs/smoke_r2_parallel-0e22f470     | 4/4 (100%)            | max_agents_parallel=2, max_tasks_parallel=4, seed=123 |
| R3  | SQLite + Cache      | runs/smoke_r3_sqlite_cache-5890a078 | 4/4 (100%)            | SQLite artifacts and cached outputs                   |
| R4  | Resume/Checkpoint   | runs/smoke_r4_resume-e3bb2645       | 1/1 -> 4/4 (100%)     | Early stop then resumed to completion                 |
| R5  | Orchestrated Shards | runs/smoke_r5_orchestrated          | 4/4 (100%)            | 2 shards + merged output                              |

Validation checks:

  • summary on R1: pass rate 100%
  • compare R1 vs R2: delta 0.00%, regressions 0
  • gate R1 vs R2: passed
  • analyze on R3: pass rate 100%, macro scenario rate 100%
  • rejudge on cached outputs: 4/4 passed

Pass-rate bars:

  • R1: ██████████ 100%
  • R2: ██████████ 100%
  • R3: ██████████ 100%
  • R4: ██████████ 100% (after resume)
  • R5: ██████████ 100%

Extended Functionality Guide

A comprehensive guide for all updated functionality is available at:

  • docs/UPDATED_FUNCTIONALITY.md

Testing

AgentFT includes 90+ automated tests across runner behavior, storage backends, reporting, CLI, presets, plugins, and resilience checks.

pip install -e ".[dev]"
PYTHONPATH=src pytest -q

Optional columnar export dependency:

pip install -e ".[columnar]"

Project Status

AgentFT is in active development.

Current version: 0.2.0
