Agent Flow Test (AgentFT) - A simple evaluation harness for AI agents.

AgentFT

Agent Flow Test - Pytest for AI agents

AgentFT (Agent Flow Test) is a pytest-style evaluation framework for AI agents. It provides composable primitives for tasks, scenarios, agents, and judges, with production-ready infrastructure for concurrency, retries, observability, and regression control.

What's New

Secure declarative config mode: --config-json --strict-config
Distributed local orchestration: aft orchestrate for sharded multi-process runs + merge
Retry policy by error type with richer error taxonomy
Task-level vs judge-level aggregation with configurable task pass rules
Ranking support: aft rank (Elo)
Artifact migration tooling: aft migrate
Columnar export API (DuckDB/Parquet)
Environment/episode scenario toolkit and additional preset benchmark suites
Structured external event sinks for run lifecycle telemetry
Interactive report drilldown with filtering
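
To make "retry policy by error type" concrete, here is a minimal sketch of the general idea in plain Python. It is not AgentFT's implementation: the policy table, the function names, and the attempt budgets are all hypothetical and chosen only for illustration.

```python
import time

# Hypothetical policy table (not AgentFT's): exception type -> max attempts.
RETRY_POLICY = {
    TimeoutError: 3,      # transient: worth a few retries
    ConnectionError: 5,   # transient: retry more aggressively
    ValueError: 1,        # treated as permanent: a single attempt only
}


def max_attempts_for(exc, policy=RETRY_POLICY, default=1):
    """Return the attempt budget of the first policy entry matching exc."""
    for exc_type, attempts in policy.items():
        if isinstance(exc, exc_type):
            return attempts
    return default  # unknown errors are not retried


def call_with_retries(fn, policy=RETRY_POLICY, base_delay=0.0):
    """Call fn() until it succeeds or the budget for the raised error is spent."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return fn()
        except Exception as exc:
            if attempt >= max_attempts_for(exc, policy):
                raise
            # Exponential backoff between attempts (no delay when base_delay is 0).
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Keying the budget on the exception type lets transient failures such as timeouts be retried while errors that will never succeed fail on the first attempt.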

Features

Pytest-like simplicity: Minimal boilerplate, clear abstractions
Production-ready execution: Async runner, retries, rate limiting, timeouts, fail-fast modes
Flexible parallelism: Independent controls for agents, tasks, and judges
Deterministic reproducibility: Seeded shuffle + deterministic sharding
Strong artifact model: JSONL or SQLite backends with schema versioning and checkpoints
Regression workflows: Compare, gate, rejudge, analyze, rank, merge
Extensibility: Plugin registry for agents, scenarios, and judges
Scenario breadth: List/CSV/JSONL/HuggingFace + rollout-based episode scenarios
Governance support: API stability policy and governance/compliance guidance
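
The "seeded shuffle + deterministic sharding" idea can be sketched in a few lines. This is an illustration of the technique only, assuming tasks are identified by sortable IDs; the function name and the stride-based split are assumptions, not AgentFT internals.

```python
import random


def shard_tasks(task_ids, shard_count, shard_index, seed=None):
    """Return the slice of task_ids that belongs to one shard."""
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    # Sort first so the result does not depend on the caller's input order.
    ordered = sorted(task_ids)
    if seed is not None:
        # A private Random instance keeps the shuffle reproducible
        # and leaves the global random state untouched.
        random.Random(seed).shuffle(ordered)
    # Stride slicing assigns every task to exactly one shard.
    return ordered[shard_index::shard_count]


tasks = [f"t{i}" for i in range(10)]
shards = [shard_tasks(tasks, 4, i, seed=123) for i in range(4)]
```

Because every worker derives the same order from the same seed, the shards are disjoint and together cover all tasks without any coordination between processes.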

Quick Start

Install from PyPI:

pip install agentft

Run an example:

git clone https://github.com/Geddydukes/agentflowtest
cd agentflowtest
aft run --config examples/config_example.py

Or use in code:

from agentft import RunConfig, run
from agentft.presets import build_math_basic_scenario, ExactMatchJudge

class MyAgent:
    name = "my_agent"
    version = "0.1.0"
    provider_key = None

    async def setup(self):
        pass

    async def reset(self):
        pass

    async def teardown(self):
        pass

    async def run_task(self, task, context=None):
        # Placeholder: always answers "42"; replace with a real model or tool call.
        return {"response": "42"}

config = RunConfig(
    name="quick_test",
    agents=[MyAgent()],
    scenarios=[build_math_basic_scenario()],
    judges=[ExactMatchJudge()],
)

results = run(config)
print(f"Passed: {sum(r.passed for r in results)}/{len(results)}")
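
Before wiring an agent into a RunConfig, it can help to exercise it on its own. The sketch below does not import agentft. It assumes the lifecycle order setup, reset, run_task, teardown, and it uses a hypothetical dict as the task; the real task type and call order may differ.

```python
import asyncio


class MyAgent:
    name = "my_agent"
    version = "0.1.0"
    provider_key = None

    async def setup(self):
        pass

    async def reset(self):
        pass

    async def teardown(self):
        pass

    async def run_task(self, task, context=None):
        return {"response": "42"}


async def smoke_test(agent, task):
    """Drive one task through the agent, always running teardown."""
    await agent.setup()
    try:
        await agent.reset()
        return await agent.run_task(task)
    finally:
        await agent.teardown()


# Hypothetical task payload, for illustration only.
output = asyncio.run(smoke_test(MyAgent(), {"prompt": "What is 6 * 7?"}))
print(output)
```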

Core CLI Commands

`aft run`

aft run --config path/to/config.py

Strict non-executable mode:

aft run --config-json path/to/config.json --strict-config

Useful overrides:

aft run --config examples/config_example.py \
  --max-tasks-parallel 8 \
  --max-agents-parallel 2 \
  --max-judges-parallel 4 \
  --artifact-backend sqlite \
  --shard-count 4 --shard-index 1

`aft summary`

aft summary --run-dir runs/my_run/

Task-level rollup mode:

aft summary --run-dir runs/my_run/ --aggregation-level task --task-pass-rule majority
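
As a rough picture of what a task-level rollup does, the sketch below collapses several judge verdicts per task into one pass/fail value. Only the `majority` rule name comes from the command above; the `all` and `any` rules, the input shape, and the tie handling are assumptions made for illustration.

```python
from collections import defaultdict


def task_level_pass(judge_results, rule="majority"):
    """Collapse (task_id, passed) judge verdicts into one verdict per task."""
    by_task = defaultdict(list)
    for task_id, passed in judge_results:
        by_task[task_id].append(bool(passed))

    rules = {
        # Strict majority: a tie counts as a failure in this sketch.
        "majority": lambda verdicts: sum(verdicts) * 2 > len(verdicts),
        "all": all,  # hypothetical rule
        "any": any,  # hypothetical rule
    }
    decide = rules[rule]
    return {task_id: decide(verdicts) for task_id, verdicts in by_task.items()}


verdicts = [
    ("t1", True), ("t1", True), ("t1", False),  # 2 of 3 judges pass
    ("t2", True), ("t2", False),                # 1 of 2 judges pass (tie)
]
print(task_level_pass(verdicts))
```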

`aft compare`

aft compare --run-a runs/base/ --run-b runs/candidate/

`aft gate`

aft gate --run-a runs/base/ --run-b runs/candidate/ --max-regressions 0
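
Conceptually, compare reports how a candidate run differs from a base run, and gate turns that difference into a pass/fail decision. The sketch below assumes a regression means "passed in the base run, failed in the candidate run" and represents each run as a plain dict; both are assumptions, and none of these helpers are part of AgentFT.

```python
def pass_rate(results):
    """Fraction of tasks that passed; results maps task_id -> bool."""
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(base, candidate):
    """Tasks that passed in base but failed in candidate."""
    return sorted(
        task_id
        for task_id, passed in base.items()
        if passed and task_id in candidate and not candidate[task_id]
    )


def gate(base, candidate, max_regressions=0):
    """True when the candidate stays within the regression budget."""
    return len(find_regressions(base, candidate)) <= max_regressions


base = {"t1": True, "t2": True, "t3": False}
candidate = {"t1": True, "t2": False, "t3": True}
delta = pass_rate(candidate) - pass_rate(base)
```

In this example the pass-rate delta is zero even though one task regressed, which is why gating on a regression count can be stricter than gating on the aggregate rate alone.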

`aft analyze`

aft analyze --run-dir runs/my_run/ --output-json runs/my_run/analytics.json

`aft rejudge`

aft rejudge --run-dir runs/my_run/ --config examples/config_example.py

`aft merge`

aft merge --run-dirs runs/shard0/ runs/shard1/ --output-dir runs/merged/

`aft orchestrate`

aft orchestrate --config examples/config_example.py --shards 4 --output-dir runs/merged_orchestrated/

`aft rank`

aft rank --run-dir runs/my_run/
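
For readers unfamiliar with Elo, the standard update rule is sketched below. How `aft rank` pairs agents into matches is not described here, and the K-factor of 32 and the 400-point scale are conventional defaults rather than confirmed AgentFT settings.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated agents; A wins the match.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

One plausible mapping is to treat each task as a match between two agents, with the agent that passed counted as the winner and identical outcomes counted as a draw.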

`aft migrate`

aft migrate --run-dir runs/legacy_run/ --target-schema 1.1.0

Run Artifacts

Each run creates runs/<run_id>/ containing:

results.jsonl or artifacts.db
traces.jsonl
run_metadata.json
run_checkpoint.json
agent_outputs.jsonl (JSONL backend cache, optional)
cached_outputs table in artifacts.db (SQLite backend cache, optional)
report.html
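
With the JSONL backend, the result file can be inspected using the standard library alone. The sketch below writes a throwaway file so that it runs anywhere. The `task_id` and `passed` field names are assumptions for illustration; check a real results.jsonl for the actual schema.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo data with hypothetical fields, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "results.jsonl"
    demo.write_text(
        '{"task_id": "t1", "passed": true}\n'
        '{"task_id": "t2", "passed": false}\n',
        encoding="utf-8",
    )
    records = load_jsonl(demo)

passed = sum(1 for record in records if record.get("passed"))
print(f"Passed: {passed}/{len(records)}")
```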

Smoke Test Visualization (2026-02-11)

Recent 5-run local validation matrix:

Run	Mode	Path	Result	Notes
R1	Baseline	`runs/smoke_r1_baseline-5a5f7d3c`	4/4 (100%)	Control run
R2	Parallel + Shuffle	`runs/smoke_r2_parallel-0e22f470`	4/4 (100%)	`max_agents_parallel=2`, `max_tasks_parallel=4`, seed=123
R3	SQLite + Cache	`runs/smoke_r3_sqlite_cache-5890a078`	4/4 (100%)	SQLite artifacts and cached outputs
R4	Resume/Checkpoint	`runs/smoke_r4_resume-e3bb2645`	1/1 -> 4/4 (100%)	Early stop then resumed to completion
R5	Orchestrated Shards	`runs/smoke_r5_orchestrated`	4/4 (100%)	2 shards + merged output

Validation checks:

summary on R1: pass rate 100%
compare R1 vs R2: delta 0.00%, regressions 0
gate R1 vs R2: passed
analyze on R3: pass rate 100%, macro scenario rate 100%
rejudge on cached outputs: 4/4 passed
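
The analyze check reports both a pass rate and a macro scenario rate. Assuming "macro" carries its usual meaning (average the per-scenario pass rates so that each scenario counts equally), the two numbers can differ when scenarios have different sizes. The sketch below uses made-up data and is not AgentFT code.

```python
from collections import defaultdict


def micro_pass_rate(results):
    """Overall pass rate: every result counts equally."""
    return sum(passed for _, passed in results) / len(results)


def macro_scenario_rate(results):
    """Mean of per-scenario pass rates: every scenario counts equally."""
    by_scenario = defaultdict(list)
    for scenario, passed in results:
        by_scenario[scenario].append(passed)
    rates = [sum(v) / len(v) for v in by_scenario.values()]
    return sum(rates) / len(rates)


# Illustrative data: (scenario, passed) pairs.
results = [
    ("math", True), ("math", True), ("math", True), ("math", False),
    ("code", True),
]
micro = micro_pass_rate(results)      # 4 of 5 results pass
macro = macro_scenario_rate(results)  # mean of 0.75 and 1.0
```

When every result passes, as in the smoke runs above, both rates are 100%.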

Pass-rate bars:

R1: ██████████ 100%
R2: ██████████ 100%
R3: ██████████ 100%
R4: ██████████ 100% (after resume)
R5: ██████████ 100%

Extended Functionality Guide

A comprehensive guide for all updated functionality is available at:

docs/UPDATED_FUNCTIONALITY.md

Testing

AgentFT includes 90+ automated tests across runner behavior, storage backends, reporting, CLI, presets, plugins, and resilience checks.

pip install -e ".[dev]"
PYTHONPATH=src pytest -q

Optional columnar export dependency:

pip install -e ".[columnar]"

Project Status

AgentFT is in active development.

Current version: 0.2.0

Release history

0.2.0 (Feb 12, 2026), this version
0.1.0 (Jan 7, 2026)
0.0.1 (Jan 7, 2026)

Download files

Source Distribution

agentft-0.2.0.tar.gz (67.0 kB)

Uploaded Feb 12, 2026 Source

Built Distribution

agentft-0.2.0-py3-none-any.whl (51.9 kB)

Uploaded Feb 12, 2026 Python 3

File details

Details for the file agentft-0.2.0.tar.gz.

File metadata

Download URL: agentft-0.2.0.tar.gz
Upload date: Feb 12, 2026
Size: 67.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for agentft-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7b46a26291ee12757f5a3f4662e0aa253ae85a05c348eb753165a02b029c5c97`
MD5	`85a046bb089fabc79fb6bbadee3d6251`
BLAKE2b-256	`252d71ce68f07aa29af9422a5869cb0cfe710c9e89547b0289c8e1e863e42491`
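
A downloaded archive can be checked against the SHA256 digest listed above using Python's standard hashlib module. This is a minimal sketch; the helper names are illustrative, and the local file path is whatever location you saved the download to.

```python
import hashlib

# SHA256 digest listed above for agentft-0.2.0.tar.gz.
EXPECTED = "7b46a26291ee12757f5a3f4662e0aa253ae85a05c348eb753165a02b029c5c97"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED):
    """True when the file's SHA256 digest matches the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("agentft-0.2.0.tar.gz")` should return True for an intact copy of the source distribution.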

File details

Details for the file agentft-0.2.0-py3-none-any.whl.

File metadata

Download URL: agentft-0.2.0-py3-none-any.whl
Upload date: Feb 12, 2026
Size: 51.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for agentft-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`72fe0c5191afbe7b84fd7fc239892f762b5d8d70d95fd1f916a0e2c43227335a`
MD5	`590825f049d366767190925218473a65`
BLAKE2b-256	`06c8e04bc75c3e2cccf9a11c3da69547ef000a7e1a3b4d13f816f0ab05aeda45`
