Local Codex plugin for iterative Agent tuning with guided Skills, reusable runner templates, versioned results, and static validation.
Agent Tune Kit
Agent Tune Kit is a local Codex plugin that helps you evaluate and improve the quality of your own local Agent.
If you already have a working Agent but are not sure where it fails, why it fails, or what to tune next, this project lets Codex help you run a complete loop: batch test the Agent, find failure cases, write an analysis report, tune the Agent, and verify the next run.
Its main advantage is a low-friction start. You do not need to design a complex evaluation schema or expose a universal Agent interface first. Bring a local Agent project and a small evaluation dataset; Codex reads the code and data samples, then generates the project-specific runner and tuning workflow.
Who it is for
Use this if you have:
- a local Agent, chatbot, tool-using Agent, or RAG Agent;
- a few test questions, sample inputs, expected answers, or human-judgable results;
- a need to quickly find weak spots and let Codex help tune prompts, code, parameters, or tool configuration;
- a desire to keep each tuning loop traceable with result files and reports.
You do not need a full evaluation platform to start. For the first validation, 5 to 20 CSV rows are enough.
Prerequisites
You only need:
- Codex with local plugin/Skill support.
- Python 3.
- A local Agent project that Codex can inspect and edit.
- A simple evaluation dataset, preferably CSV. Column names do not need to follow a strict schema; Codex infers the input and expected-result columns where possible.
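As a rough illustration of how small such a dataset can be, the sketch below writes a three-row CSV with Python's csv module. The column names question and expected, the sample rows, and the helper name write_dataset are all hypothetical; Agent Tune Kit does not require them.

```python
import csv
from pathlib import Path

# Hypothetical sample rows. The column names are illustrative only.
ROWS = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
    {"question": "Is there a free trial?", "expected": "Yes, 14 days"},
]


def write_dataset(path: Path, rows: list[dict[str, str]]) -> int:
    """Write rows to a CSV file with a header line; return the data row count."""
    fieldnames = list(rows[0].keys())
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling write_dataset(Path("eval_dataset.csv"), ROWS) would produce a file you can point Codex to during $atk-init.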
Create a git checkpoint before tuning if you want an easy rollback path. Agent Tune Kit does not roll back tuning changes made to your Agent; the installer's rollback command only restores the local marketplace/plugin-store install state.
Quickstart: install the plugin
No repository clone is needed for normal use. Run the packaged installer directly with uvx:
uvx --from agent-tune-kit atk install
For a persistent command, install the tool first, then run atk:
uv tool install agent-tune-kit
atk install
If you prefer pipx:
pipx install agent-tune-kit
atk install
The installer validates the packaged plugin manifest, adds the plugin to the Personal marketplace, writes or updates ~/.agents/plugins/marketplace.json, copies the packaged payload into ~/plugins/agent-tune-kit, and runs local smoke/status checks by default. It proves local files and marketplace state only; it does not bypass or modify hidden Codex UI enablement state.
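If you want to confirm by hand that the two installer-managed paths named above exist, a minimal sketch could look like the following. The function name local_install_state is hypothetical, the check only tests for presence, and it makes no assumption about the contents of marketplace.json; atk status remains the supported way to read install state.

```python
from pathlib import Path


def local_install_state(home: Path) -> dict[str, bool]:
    """Report whether the two paths described in this README exist under `home`."""
    return {
        "marketplace_json": (home / ".agents" / "plugins" / "marketplace.json").is_file(),
        "plugin_payload": (home / "plugins" / "agent-tune-kit").is_dir(),
    }


if __name__ == "__main__":
    for name, present in local_install_state(Path.home()).items():
        print(f"{name}: {'present' if present else 'missing'}")
```

Like the installer's own checks, this says nothing about whether the plugin is enabled in the Codex UI.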
Useful helper commands:
atk preview --smoke # preview only; no writes
atk status # read local install status and next steps
atk rollback --backup <backup-id> # restore installer-managed local install state only
When an existing marketplace/plugin-store conflict is found, interactive terminals prompt before replacement. Noninteractive replacement requires --yes --force; destructive replacement creates a backup first and prints a rollback command. The installer supports explicit subcommands only and does not keep old entry points; use preview to see the planned changes without writing anything.
Contributor checkout path, for editing this repository itself:
git clone git@github.com:hustyichi/agent-tune-kit.git
cd agent-tune-kit
uv sync
uv run atk install
# or: python3 scripts/install_plugin.py install
After install, Agent Tune Kit should be visible/available in /plugins.
You still need to enable it in Codex:
/plugins
Select Agent Tune Kit in the plugin list and follow the UI prompt to install/enable it. After you enable it in the UI, $atk-status and the other Skill commands should appear in autocomplete.
If the plugin is enabled in /plugins but $atk-status still does not appear in the current session, that is expected: Codex usually loads plugin Skills when a session starts, so newly enabled plugins may not be hot-loaded into an already running session. Restart Codex, or close the current Codex session and reopen this project, then type $atk-status again to verify.
If your environment cannot use local plugins, do not copy individual skills/* directories piecemeal; this repository now treats the local Codex plugin install path as the only recommended entry point.
Maintainer release to PyPI
The release scripts follow the two-step release gate/publish shape used by agent-tune-cli: default mode is a dry run, and uploads only happen with an explicit --publish.
Run the full local release gate first. It checks version alignment, static validation, tests, uv build --no-sources, archive contents, and packaged atk smoke installs outside the repository:
UV_NO_CONFIG=1 uv run python scripts/check-release.py
Prepare clean dist/ artifacts without uploading:
UV_NO_CONFIG=1 uv run python scripts/publish-release.py
Publish to TestPyPI first:
export UV_PUBLISH_TOKEN='pypi-your-testpypi-token'
UV_NO_CONFIG=1 uv run python scripts/publish-release.py --repository testpypi --publish
After TestPyPI install validation, publish to PyPI:
export UV_PUBLISH_TOKEN='pypi-your-pypi-token'
UV_NO_CONFIG=1 uv run python scripts/publish-release.py --repository pypi --publish
The publish script checks whether the current project.name + project.version already exists before uploading. If it exists, bump the version in pyproject.toml, .codex-plugin/plugin.json, and src/agent_tune_kit/__init__.py first. Never commit or paste PyPI tokens.
For the fixed production PyPI path, you can run the zero-argument wrapper:
scripts/publish-pypi.sh
It is equivalent to UV_NO_CONFIG=1 uv run python scripts/publish-release.py --repository pypi --publish, but checks that UV_PUBLISH_TOKEN is set first.
Minimal tuning loop
Run these steps in your Agent repository, not in this Agent Tune Kit repository.
1. Generate a test runner
Run:
$atk-init
Point Codex to your Agent entrypoint and evaluation dataset. Codex generates:
.atk/runner/eval_runner.py
The runner keeps your original dataset columns and adds the Agent's actual output as agent_output. It also adds agent_output_log_path; when trustworthy Python logging capture is configured, this column points to row-specific files such as logs/row_000001.log for serial or same-process concurrent runs.
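The generated runner is specific to your project, so its real code will differ. As a minimal sketch of the output shape described above, assuming a hypothetical call_agent callable and a hypothetical input_column name, the core loop could look like this:

```python
import csv
from pathlib import Path
from typing import Callable


def run_rows(
    dataset: Path,
    output: Path,
    input_column: str,
    call_agent: Callable[[str], str],
) -> int:
    """Copy every dataset column and append agent_output and agent_output_log_path."""
    with dataset.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + [
            "agent_output",
            "agent_output_log_path",
        ]
        rows = list(reader)

    with output.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            row["agent_output"] = call_agent(row[input_column])
            # Left empty here: this sketch does not configure row logging.
            row["agent_output_log_path"] = ""
            writer.writerow(row)
    return len(rows)
```

Keeping the original columns untouched means later steps can compare expected values and actual output in the same file.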
$atk-init first snapshots the provided dataset into .atk/datasets/, and the generated runner reads that project-local copy. If a same-name snapshot already exists with identical content, it is reused; if the name exists with different content, ATK uses readable incrementing names such as dataset_2.csv and dataset_3.csv.
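The naming behavior described above can be sketched as follows. This is an illustration rather than ATK's actual implementation: the function name snapshot_dataset is hypothetical, and reusing an identical copy found under an incremented name (for example dataset_2.csv) is an assumption of this sketch.

```python
import shutil
from pathlib import Path


def snapshot_dataset(source: Path, datasets_dir: Path) -> Path:
    """Copy `source` into `datasets_dir`, reusing an identical existing snapshot."""
    datasets_dir.mkdir(parents=True, exist_ok=True)
    content = source.read_bytes()
    candidate = datasets_dir / source.name
    suffix = 2
    while candidate.exists():
        if candidate.read_bytes() == content:
            return candidate  # identical content: reuse instead of copying again
        # Same name, different content: try dataset_2.csv, dataset_3.csv, ...
        candidate = datasets_dir / f"{source.stem}_{suffix}{source.suffix}"
        suffix += 1
    shutil.copyfile(source, candidate)
    return candidate
```

Comparing content rather than timestamps keeps earlier result versions tied to the exact data they were run on.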
2. Run the Agent on the dataset
Run:
$atk-run
This writes:
.atk/results/v1/eval_results.csv
If row logging is active, the same version also contains .atk/results/v1/logs/row_*.log. Row logs are generated for configured same-process Python logging capture in serial runs and, when CONCURRENT_ROW_LOGGING_ENABLED remains enabled, with --concurrency > 1. The runner only writes records emitted while an ATK row context is active; stdout/stderr, subprocess, multiprocess, and post-row background logs remain out of scope. If concurrent row logging is disabled, concurrent runs visibly downgrade to app.log/CSV evidence instead of creating row logs.
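One way to picture "only records emitted while a row context is active" is a logging handler that consults a context variable. The sketch below is an assumption about how such capture could work, not the generated runner's code; RowLogHandler, row_context, and the variable name are hypothetical.

```python
import contextvars
import logging
from contextlib import contextmanager
from pathlib import Path

# Holds the active row number, or None when no row is being processed.
_current_row = contextvars.ContextVar("atk_row_sketch", default=None)


class RowLogHandler(logging.Handler):
    """Append records to row_NNNNNN.log, but only while a row context is active."""

    def __init__(self, logs_dir: Path) -> None:
        super().__init__()
        self.logs_dir = logs_dir

    def emit(self, record: logging.LogRecord) -> None:
        row = _current_row.get()
        if row is None:
            return  # emitted outside any row context: not written to a row log
        self.logs_dir.mkdir(parents=True, exist_ok=True)
        path = self.logs_dir / f"row_{row:06d}.log"
        with path.open("a", encoding="utf-8") as handle:
            handle.write(self.format(record) + "\n")


@contextmanager
def row_context(row_number: int):
    """Mark `row_number` as active for the duration of the with-block."""
    token = _current_row.set(row_number)
    try:
        yield
    finally:
        _current_row.reset(token)
```

Because each thread has its own context, same-process threads can log to separate row files, while output from subprocesses or from plain stdout/stderr never reaches the handler, which matches the scope limits listed above.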
3. Find failing cases
For the simplest path, let Codex judge which cases failed:
$atk-find-failures
If you already have a clear rule, first create or update the reusable rule script:
$atk-init-failure-rule rule: mark a row as failed when the expected field differs from agent_output
Codex uses the rule you provide in the command to generate the rule script at:
.atk/runner/failure_rule.py
Then execute that rule script to write the failing cases:
$atk-find-failures-by-rule
If .atk/runner/failure_rule.py is missing, $atk-find-failures-by-rule stops and tells you to run $atk-init-failure-rule first.
The failing cases are written to:
.atk/results/v1/failure_cases.csv
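For the example rule above (a row fails when the expected field differs from agent_output), a rule script could be as small as the sketch below. The column name expected, the function names, and the choice to ignore surrounding whitespace are assumptions; the script Codex generates for your project may look different.

```python
import csv
from pathlib import Path


def is_failure(row: dict[str, str], expected_column: str = "expected") -> bool:
    """Return True when the expected value differs from agent_output."""
    # Surrounding whitespace is ignored here; that is a choice of this sketch.
    expected = (row.get(expected_column) or "").strip()
    actual = (row.get("agent_output") or "").strip()
    return expected != actual


def write_failure_cases(
    results_csv: Path, failures_csv: Path, expected_column: str = "expected"
) -> int:
    """Copy failing rows from results_csv to failures_csv; return how many failed."""
    with results_csv.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        failed = [row for row in reader if is_failure(row, expected_column)]

    with failures_csv.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(failed)
    return len(failed)
```

An exact-match rule like this suits short, canonical answers; free-form answers usually need the Codex-judged path instead.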
4. Generate the analysis report
Run:
$atk-report
Codex writes:
.atk/results/v1/report.md
The report summarizes test results, failure cases, likely causes, and recommended tuning priorities.
5. Optionally review failures in HTML
Run:
$atk-visualize-failures
Codex writes:
.atk/results/v1/failure_cases.html
This optional failure browser can be generated any time failure_cases.csv exists. If a report.md exists for the same version, it is used as best-effort, non-blocking context; missing or unparseable report context does not block the visualization. The Skill uses a fixed, plugin-owned generator script built on the Python standard library, so output is deterministic and dependency-free while still offering expected-vs-actual review, search/filter/pagination, schema-adaptive role switching, and safe relative log links.
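To show why a standard-library generator can stay deterministic and dependency-free, here is a much reduced sketch that renders CSV text as an escaped HTML table. It is not the plugin's generator and omits search, pagination, role switching, and log links; the function name render_failures_html is hypothetical.

```python
import csv
import html
import io


def render_failures_html(csv_text: str) -> str:
    """Render CSV text as a static HTML table, escaping every cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])
    head = "".join(f"<th>{html.escape(name)}</th>" for name in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(row.get(name) or '')}</td>" for name in columns)
        + "</tr>"
        for row in reader
    )
    return (
        "<!doctype html><meta charset='utf-8'><title>Failure cases</title>"
        f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"
    )
```

The same input always yields the same output string, and escaping each cell keeps Agent output that happens to contain markup from being interpreted by the browser.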
6. Let Codex tune the Agent
Run:
$atk-tune
Codex edits the Agent based on the report and records the tuning plan in:
.atk/results/v1/tuning_plan.md
Verify that tuning worked
After tuning, run the test again:
$atk-run
This creates .atk/results/v2/eval_results.csv. Then run:
$atk-find-failures
$atk-report
Starting with the second loop, the report reads the previous tuning_plan.md and tells you whether the target failures were resolved, partially resolved, unresolved, or impossible to judge.
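The judgment in the report is made by Codex from the tuning plan and the new results. If you also want a quick mechanical comparison between two loops, a set difference over row identifiers is one option. The sketch below assumes you have a stable identifier for each row, which is a hypothetical requirement of this example and not something ATK mandates.

```python
def compare_failures(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Classify row identifiers by how their failure status changed between loops."""
    return {
        "resolved": sorted(previous - current),       # failed before, passes now
        "still_failing": sorted(previous & current),  # failed in both loops
        "new_failures": sorted(current - previous),   # passed before, fails now
    }
```

For example, compare_failures({"q1", "q2", "q3"}, {"q2", "q4"}) reports q1 and q3 as resolved, q2 as still failing, and q4 as a new failure, which is a useful regression signal after tuning.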
Expected output
.atk/
├── datasets/
│   └── service_source_codes.csv
├── runner/
│   ├── eval_runner.py
│   └── failure_rule.py
└── results/
    ├── v1/
    │   ├── eval_results.csv
    │   ├── logs/                  # optional row logs
    │   │   └── row_000001.log
    │   ├── failure_cases.csv
    │   ├── failure_cases.html     # optional failure browser
    │   ├── report.md
    │   └── tuning_plan.md
    └── v2/
        └── ...
Most users only need to read eval_results.csv, failure_cases.csv, optional failure_cases.html, report.md, and row logs linked from agent_output_log_path when available. Version directories are managed automatically.
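If you write your own scripts against this layout, you may need to locate the newest version directory. A minimal sketch, assuming directories are named v followed by an integer as in the tree above (the helper name latest_version_dir is hypothetical):

```python
import re
from pathlib import Path
from typing import Optional

_VERSION_DIR = re.compile(r"v(\d+)")


def latest_version_dir(results_dir: Path) -> Optional[Path]:
    """Return the vN directory with the highest N, or None if there is none."""
    if not results_dir.is_dir():
        return None
    best_number, best_path = -1, None
    for child in results_dir.iterdir():
        match = _VERSION_DIR.fullmatch(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_number:
            best_number, best_path = int(match.group(1)), child
    return best_path
```

Comparing the numeric part rather than the directory name avoids sorting v10 before v2.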
Available Skills
- $atk-status: inspect progress and recommend the next step.
- $atk-init: generate a test runner for the current Agent.
- $atk-run: run the test runner and create the current result version.
- $atk-find-failures: let Codex identify failing cases.
- $atk-init-failure-rule: create or update .atk/runner/failure_rule.py.
- $atk-find-failures-by-rule: execute .atk/runner/failure_rule.py to identify failing cases with explicit rules.
- $atk-report: generate analysis and cross-loop validation.
- $atk-visualize-failures: generate optional .atk/results/vN/failure_cases.html from current failure_cases.csv.
- $atk-tune: tune the Agent and record the tuning plan.