Local Codex plugin for iterative Agent tuning with guided Skills, reusable runner templates, versioned results, and static validation.
Agent Tune Kit
Agent Tune Kit is a local Codex plugin for evaluating and tuning your own local Agent.
If you already have a working Agent but do not know where it fails, why it fails, or what to change next, Agent Tune Kit helps you run the full loop: batch test the Agent, find failure cases, generate a report, let Codex tune the Agent, and verify the next run.
Who It Is For
Use it if you have, or want Codex to help you fill in:
- A local Agent, chatbot, tool-using Agent, or RAG Agent.
- A small evaluation dataset, preferably CSV; 5 to 20 rows are enough to start.
- Inputs, expected answers, or human-checkable results.
- A desire to let Codex help locate weak spots and tune prompts, code, parameters, or tool configuration.
Install
One-command install:
uvx --from agent-tune-kit atk install
To keep the atk command available:
uv tool install agent-tune-kit
atk install
Or use pipx:
pipx install agent-tune-kit
atk install
After installation, open the plugin list in Codex:
/plugins
Select and enable Agent Tune Kit. If $atk-status and other completions do not appear immediately after enabling, restart Codex or reopen the current project session.
Minimal Tuning Loop
Run these commands in your Agent project, not in this repository.
Ideally, you already have a local Agent project that Codex can inspect and edit, plus an evaluation dataset. CSV is recommended, but column names do not need to follow a strict schema; Codex will infer inputs, expected results, and evaluation shape from the data. If either piece is missing, start with step 0. If both already exist, go straight to step 1.
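As a rough illustration of a starter dataset, the sketch below writes a three-row CSV. The column names `question` and `expected`, the sample rows, and the helper `write_eval_csv` are hypothetical; since no strict schema is required, any names that make the input and the expected result easy to recognize should work.

```python
import csv
from pathlib import Path

# Hypothetical columns and rows, for illustration only.
ROWS = [
    {"question": "What is your refund window?", "expected": "30 days"},
    {"question": "Do you ship to Canada?", "expected": "Yes"},
    {"question": "asdf??", "expected": "Ask the user to clarify"},
]


def write_eval_csv(path: Path, rows: list[dict[str, str]]) -> None:
    """Write rows to a CSV file, using the keys of the first row as the header."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_eval_csv(Path("data/eval.csv"), ROWS)` would produce a file you could then point `$atk-init` at.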
0. Optional: Fill In the Dataset or Agent
If you only have a business description, examples, or acceptance rules, run:
$atk-build-dataset <your business description, examples, or rules>
Codex asks 1-3 questions when information is insufficient, prioritizing input fields, expected output or acceptance criteria, and key business scenarios. The result is written directly to .atk/datasets/dataset.csv with an atk_id for each row; if that file already exists, Codex asks before overwriting it. The dataset covers the main flow, boundary inputs, missing or ambiguous information, refusal and uncertainty cases, output format constraints, and any business risks you describe.
If you already have an evaluation dataset but do not have an Agent project yet, generate a small runnable Python Agent that uses an OpenAI-compatible API:
$atk-new-agent dataset is data/eval.csv
Codex inspects the dataset, clarifies your intent, generates a minimal Agent project, and writes the interview and design notes to .atk/specs/agent_spec.md. This step does not write .atk/datasets/dataset.csv; $atk-init still owns dataset validation, normalization, and runner generation when connecting the Agent.
1. Initialize
Tell Codex where your Agent starts and where the evaluation data lives:
$atk-init My Agent entrypoint is scripts/agent.py and the evaluation dataset is data/eval.csv
Codex generates:
.atk/runner/eval_runner.py
If the Agent was created by $atk-new-agent, the next command is usually:
$atk-init Agent entrypoint is agent.py run_agent and the evaluation dataset is data/eval.csv
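For orientation, an entrypoint such as `agent.py` with a `run_agent` function might look roughly like the sketch below. This is not the code that `$atk-new-agent` generates. The signature, `SYSTEM_PROMPT`, and the `fake_model` stand-in are all assumptions, and a real Agent would replace `fake_model` with a call to its OpenAI-compatible API.

```python
from typing import Callable

# Hypothetical prompt; a real Agent would carry its own.
SYSTEM_PROMPT = "Answer the customer's question briefly."


def fake_model(system: str, user: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"[stub reply to: {user}]"


def run_agent(question: str, complete: Callable[[str, str], str] = fake_model) -> str:
    """Return the Agent's answer for one dataset row.

    `complete` takes (system prompt, user message) and returns the reply text.
    """
    return complete(SYSTEM_PROMPT, question).strip()
```

Keeping the entrypoint a plain function that takes one input and returns one string makes it straightforward for a runner to call it once per dataset row.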
2. Run Evaluation
$atk-run
Results are written to:
.atk/results/v1/eval_results.csv
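The generated eval_runner.py is specific to your project and is not reproduced here. As a mental model only, a batch run can be pictured as the loop below: read each dataset row, call the Agent, and append its output to the row. The function `run_eval`, the `question` input column, and the error-handling choice are assumptions for illustration.

```python
import csv
from pathlib import Path
from typing import Callable


def run_eval(
    dataset: Path,
    results: Path,
    agent: Callable[[str], str],
    input_column: str = "question",  # hypothetical column name
) -> int:
    """Call the agent on every dataset row and write one result row per input.

    Returns the number of rows processed.
    """
    with dataset.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0

    results.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = list(rows[0].keys()) + ["agent_output"]
    with results.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            try:
                output = agent(row[input_column])
            except Exception as exc:
                # Record the error and keep going so one bad row
                # does not abort the whole batch.
                output = f"ERROR: {exc}"
            writer.writerow({**row, "agent_output": output})
    return len(rows)
```

Carrying the original columns through to the results file means later steps can compare the expected answer and the Agent output side by side.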
3. Find Failures
Let Codex judge which rows failed:
$atk-find-failures
If you already have a clear rule, create the rule script first and then apply it:
$atk-init-failure-rule rule: mark a row as failed when expected differs from agent_output
$atk-find-failures-by-rule
Failure cases are written to:
.atk/results/v1/failure_cases.csv
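The example rule above ("failed when expected differs from agent_output") could be expressed along these lines. This is a sketch, not the failure_rule.py that `$atk-init-failure-rule` writes: the function names are hypothetical, and trimming whitespace and ignoring case before comparing is a design choice added here, not something the rule text specifies.

```python
def is_failure(row: dict[str, str]) -> bool:
    """Return True when the Agent's output does not match the expected answer."""
    # Normalization is an assumption: trim whitespace and ignore case.
    expected = row.get("expected", "").strip().casefold()
    actual = row.get("agent_output", "").strip().casefold()
    return expected != actual


def select_failures(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only the rows that the rule marks as failed."""
    return [row for row in rows if is_failure(row)]
```

Exact-match rules like this suit classification or short-answer tasks; for free-form answers, letting Codex judge with `$atk-find-failures` is likely the better fit.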
4. Generate Report
$atk-report
The report is written to:
.atk/results/v1/report.md
It summarizes results, failure cases, likely causes, and recommended tuning priorities.
5. Optional: Browse Failures
$atk-visualize-failures
This creates a local HTML page:
.atk/results/v1/failure_cases.html
Use it to search, filter, and manually review failure cases.
6. Let Codex Tune the Agent
$atk-tune
Codex edits your Agent based on the report and records the tuning plan:
.atk/results/v1/tuning_plan.md
Verify Improvement
After tuning, run another loop:
$atk-run --only-failures
$atk-find-failures
$atk-report
New results are written to .atk/results/v2/. --only-failures maps the prior failure_cases.csv back to .atk/datasets/dataset.csv by atk_id and reruns only those rows. Starting with the second loop, the report compares against the previous tuning_plan.md and tells you whether the target issues were resolved, partially resolved, unresolved, or impossible to judge.
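The `atk_id` mapping behind `--only-failures` can be pictured as a simple lookup: collect the ids in the previous failure_cases.csv, then keep the dataset rows with those ids. The sketch below assumes both files carry an `atk_id` column, and `rows_to_rerun` is a hypothetical helper, not part of the plugin.

```python
import csv
from pathlib import Path


def rows_to_rerun(dataset: Path, failures: Path) -> list[dict[str, str]]:
    """Return the dataset rows whose atk_id appears in the failure file."""
    with failures.open(newline="", encoding="utf-8") as f:
        failed_ids = {row["atk_id"] for row in csv.DictReader(f)}
    with dataset.open(newline="", encoding="utf-8") as f:
        # Dataset order is preserved so reruns stay comparable across loops.
        return [row for row in csv.DictReader(f) if row["atk_id"] in failed_ids]
```

Selecting from the dataset rather than from the failure file means the rerun uses the current inputs, even if the failure file holds extra or stale columns.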
Output Structure
.atk/
├── datasets/
│   └── dataset.csv          # ATK runnable dataset with atk_id
├── runner/
│   ├── eval_runner.py
│   └── failure_rule.py
└── results/
    ├── v1/
    │   ├── eval_results.csv
    │   ├── failure_cases.csv
    │   ├── failure_cases.html
    │   ├── report.md
    │   └── tuning_plan.md
    └── v2/
        └── ...
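If you script around this layout yourself, for example to open the newest report, the latest version directory can be found by parsing the `vN` names. The helpers below are hypothetical and only assume the `v1`, `v2`, ... naming shown in the tree above.

```python
import re
from pathlib import Path

VERSION_DIR = re.compile(r"v(\d+)")


def latest_version(results_root: Path) -> int:
    """Return the highest N among vN directories, or 0 when none exist."""
    if not results_root.is_dir():
        return 0
    numbers = [
        int(m.group(1))
        for p in results_root.iterdir()
        if p.is_dir() and (m := VERSION_DIR.fullmatch(p.name))
    ]
    return max(numbers, default=0)


def next_version_dir(results_root: Path) -> Path:
    """Return the path the next result version would use."""
    return results_root / f"v{latest_version(results_root) + 1}"
```

Comparing the numbers as integers rather than sorting the directory names keeps `v10` ahead of `v9`.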
Common output files:
- eval_results.csv: actual Agent output for each row.
- failure_cases.csv: rows selected as failures.
- failure_cases.html: optional failure review page.
- report.md: analysis and tuning recommendations.
- tuning_plan.md: what Codex changed and why.
Common Skills
- $atk-status: inspect progress and suggest the next step.
- $atk-build-dataset: build .atk/datasets/dataset.csv from business context, examples, or rules.
- $atk-new-agent: create a lightweight OpenAI-compatible Agent when you only have a dataset.
- $atk-init: generate the test runner.
- $atk-run: run evaluation and create a new result version.
- $atk-find-failures: let Codex identify failure cases.
- $atk-init-failure-rule: create or update the failure rule.
- $atk-find-failures-by-rule: apply the rule to identify failures.
- $atk-report: generate analysis and cross-loop validation.
- $atk-visualize-failures: generate the failure review HTML page.
- $atk-tune: tune the Agent based on the report.