Local Codex plugin for iterative Agent tuning with guided Skills, reusable runner templates, versioned results, and static validation.

Agent Tune Kit

Agent Tune Kit is a local Codex plugin that helps you evaluate and improve the quality of your own local Agent.

If you already have a working Agent but are not sure where it fails, why it fails, or what to tune next, this project lets Codex help you run a complete loop: batch test the Agent, find failure cases, write an analysis report, tune the Agent, and verify the next run.

Its main advantage is a low-friction start. You do not need to design a complex evaluation schema or expose a universal Agent interface first. Bring a local Agent project and a small evaluation dataset; Codex reads the code and data samples, then generates the project-specific runner and tuning workflow.

Who it is for

Use this if you have:

a local Agent, chatbot, tool-using Agent, or RAG Agent;
a few test questions, sample inputs, expected answers, or human-judgable results;
a need to quickly find weak spots and let Codex help tune prompts, code, parameters, or tool configuration;
a desire to keep each tuning loop traceable with result files and reports.

You do not need a full evaluation platform to start. For the first validation, 5 to 20 CSV rows are enough.

Prerequisites

You only need:

Codex with local plugin/Skill support.
Python 3.
A local Agent project that Codex can inspect and edit.
A simple evaluation dataset, preferably CSV. Column names do not need to follow a strict schema; Codex infers the input and expected-result columns where possible.
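
As a rough illustration of how small the starting dataset can be, the sketch below builds a three-row CSV in memory. The column names question and expected and the sample rows are hypothetical; ATK does not require these exact headers.

```python
import csv
import io

# Hypothetical sample rows; any column names that Codex can interpret will do.
ROWS = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
    {"question": "Is there a free trial?", "expected": "Yes"},
]


def build_dataset_csv(rows):
    """Serialize a list of dict rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


dataset_text = build_dataset_csv(ROWS)
```

Writing dataset_text to a file such as dataset.csv gives you something to point $atk-init at.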

Create a git checkpoint before tuning if you want an easy rollback path. Agent Tune Kit does not roll back tuning changes made to your Agent; atk rollback only restores the installer-managed local marketplace/plugin-store install state.

Quickstart: install the plugin

No repository clone is needed for normal use. Run the packaged installer directly with uvx:

uvx --from agent-tune-kit atk install

For a persistent command, install the tool first, then run atk:

uv tool install agent-tune-kit
atk install

If you prefer pipx:

pipx install agent-tune-kit
atk install

The installer validates the packaged plugin manifest, adds the plugin to the Personal marketplace, writes or updates ~/.agents/plugins/marketplace.json, copies the packaged payload into ~/plugins/agent-tune-kit, and runs local smoke/status checks by default. It verifies local files and marketplace state only; it does not bypass or modify hidden Codex UI enablement state.
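
If you want to inspect the same local state by hand, a minimal sketch might look like the following. It only checks that the two paths named above exist and that the marketplace file parses as JSON; it assumes nothing about the file's internal schema, and the function name local_install_state is hypothetical. atk status remains the supported way to check.

```python
import json
from pathlib import Path


def local_install_state(home: Path) -> dict:
    """Report whether the installer-managed files exist under the given home directory."""
    marketplace = home / ".agents" / "plugins" / "marketplace.json"
    payload = home / "plugins" / "agent-tune-kit"
    state = {
        "marketplace_exists": marketplace.is_file(),
        "marketplace_is_valid_json": False,
        "payload_exists": payload.is_dir(),
    }
    if state["marketplace_exists"]:
        try:
            json.loads(marketplace.read_text(encoding="utf-8"))
            state["marketplace_is_valid_json"] = True
        except json.JSONDecodeError:
            # Leave the flag False; the file exists but is not parseable JSON.
            pass
    return state
```

Calling local_install_state(Path.home()) performs read-only checks and, like the installer, says nothing about whether the plugin is enabled in the Codex UI.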

Useful helper commands:

atk preview --smoke   # preview only; no writes
atk status            # read local install status and next steps
atk rollback --backup <backup-id>  # restore installer-managed local install state only

When the installer finds a conflict with an existing marketplace or plugin-store entry, interactive terminals prompt before replacement. Noninteractive replacement requires --yes --force; destructive replacement creates a backup first and prints a rollback command. The installer supports explicit subcommands only and keeps no legacy entry points; use atk preview for a no-write preview.

Contributor checkout path, for editing this repository itself:

git clone git@github.com:hustyichi/agent-tune-kit.git
cd agent-tune-kit
uv sync
uv run atk install
# or: python3 scripts/install_plugin.py install

After install, Agent Tune Kit should be listed in /plugins.

You still need to enable it in Codex:

/plugins

Select Agent Tune Kit in the plugin list and follow the UI prompt to install/enable it. After you enable it in the UI, $atk-status and the other Skill commands should appear in autocomplete.

If the plugin is enabled in /plugins but $atk-status still does not appear in the current session, that is expected: Codex usually loads plugin Skills when a session starts, so newly enabled plugins may not be hot-loaded into an already running session. Restart Codex, or close the current Codex session and reopen this project, then type $atk-status again to verify.

If your environment cannot use local plugins, do not copy individual skills/* directories piecemeal; this repository now treats the local Codex plugin install path as the only recommended entry point.

Maintainer release to PyPI

The release scripts follow the two-step release gate/publish shape used by agent-tune-cli: default mode is a dry run, and uploads only happen with an explicit --publish.

Run the full local release gate first. It checks version alignment, static validation, tests, uv build --no-sources, archive contents, and packaged atk smoke installs outside the repository:

UV_NO_CONFIG=1 uv run python scripts/check-release.py

Prepare clean dist/ artifacts without uploading:

UV_NO_CONFIG=1 uv run python scripts/publish-release.py

Publish to TestPyPI first:

export UV_PUBLISH_TOKEN='pypi-your-testpypi-token'
UV_NO_CONFIG=1 uv run python scripts/publish-release.py --repository testpypi --publish

After TestPyPI install validation, publish to PyPI:

export UV_PUBLISH_TOKEN='pypi-your-pypi-token'
UV_NO_CONFIG=1 uv run python scripts/publish-release.py --repository pypi --publish

The publish script checks whether the current project.name + project.version already exists before uploading. If it exists, bump the version in pyproject.toml, .codex-plugin/plugin.json, and src/agent_tune_kit/__init__.py first. Never commit or paste PyPI tokens.

For the fixed production PyPI path, you can run the zero-argument wrapper:

scripts/publish-pypi.sh

It is equivalent to UV_NO_CONFIG=1 uv run python scripts/publish-release.py --repository pypi --publish, but checks that UV_PUBLISH_TOKEN is set first.

Minimal tuning loop

Run these steps in your Agent repository, not in this Agent Tune Kit repository.

1. Generate a test runner

Run:

$atk-init

Point Codex to your Agent entrypoint and evaluation dataset. Codex generates:

.atk/runner/eval_runner.py

The runner keeps your original dataset columns and adds the Agent's actual output as agent_output. It also adds agent_output_log_path; when trustworthy Python logging capture is configured, this column points to row-specific files such as logs/row_000001.log for serial or same-process concurrent runs.
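
The generated runner is project-specific, so the following is only a minimal sketch of the column behavior described above. run_agent is a hypothetical stand-in for your Agent entrypoint, the question column is an assumed input column, and leaving agent_output_log_path empty when no row logging is configured is an assumption.

```python
import csv
import io


def run_agent(row):
    """Hypothetical stand-in for the real Agent entrypoint."""
    return f"echo: {row.get('question', '')}"


def run_dataset(dataset_text, agent=run_agent):
    """Run the agent on every row, keeping original columns and appending ATK columns."""
    reader = csv.DictReader(io.StringIO(dataset_text))
    fieldnames = list(reader.fieldnames or []) + [
        "agent_output",
        "agent_output_log_path",
    ]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row["agent_output"] = agent(row)
        # Assumption: left empty here because this sketch configures no row logging.
        row["agent_output_log_path"] = ""
        writer.writerow(row)
    return out.getvalue()
```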

$atk-init first snapshots the provided dataset into .atk/datasets/, and the generated runner reads that project-local copy. If a same-name snapshot already exists with identical content, it is reused; if the name exists with different content, ATK uses readable incrementing names such as dataset_2.csv and dataset_3.csv.
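
The naming rule can be sketched roughly as follows. The function name snapshot_dataset is hypothetical, comparing content by SHA-256 digest is an assumption about how "identical content" is decided, and reusing an identical copy found under an incremented name is also an assumption.

```python
import hashlib
import shutil
from pathlib import Path


def _digest(path: Path) -> str:
    """Content hash used to decide whether two files are identical."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_dataset(source: Path, snapshot_dir: Path) -> Path:
    """Copy source into snapshot_dir, reusing identical copies and
    choosing name_2, name_3, ... when a name is taken by different content."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    source_digest = _digest(source)
    candidate = snapshot_dir / source.name
    counter = 1
    while candidate.exists():
        if _digest(candidate) == source_digest:
            return candidate  # identical content already snapshotted: reuse it
        counter += 1
        candidate = snapshot_dir / f"{source.stem}_{counter}{source.suffix}"
    shutil.copyfile(source, candidate)
    return candidate
```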

2. Run the Agent on the dataset

Run:

$atk-run

This writes:

.atk/results/v1/eval_results.csv

If row logging is active, the same version also contains .atk/results/v1/logs/row_*.log. Row logs are written for configured same-process Python logging capture in serial runs and, as long as CONCURRENT_ROW_LOGGING_ENABLED stays enabled, in runs with --concurrency > 1. The runner writes only records emitted while an ATK row context is active; stdout/stderr output, subprocess and multiprocess logs, and background logs emitted after a row finishes remain out of scope. If concurrent row logging is disabled, concurrent runs visibly fall back to app.log/CSV evidence instead of creating row logs.

3. Find failing cases

For the simplest path, let Codex judge which cases failed:

$atk-find-failures

If you already have a clear rule, first create or update the reusable rule script:

$atk-init-failure-rule rule: mark a row as failed when the expected field differs from agent_output

Codex uses the rule you provide in the command to generate the rule script at:

.atk/runner/failure_rule.py
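
For the example rule above, the generated script might reduce to something like this sketch. The actual failure_rule.py is written by Codex for your project; the column name expected, the function names, and the whitespace trimming before comparison are all assumptions made for illustration.

```python
import csv
import io


def is_failure(row, expected_column="expected"):
    """A row fails when the expected value differs from agent_output."""
    # Assumption: surrounding whitespace is ignored when comparing.
    expected = (row.get(expected_column) or "").strip()
    actual = (row.get("agent_output") or "").strip()
    return expected != actual


def find_failures(results_text, expected_column="expected"):
    """Return the rows of an eval_results-style CSV that the rule marks as failed."""
    reader = csv.DictReader(io.StringIO(results_text))
    return [row for row in reader if is_failure(row, expected_column)]
```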

Then execute that rule script to write the failing cases:

$atk-find-failures-by-rule

If .atk/runner/failure_rule.py is missing, $atk-find-failures-by-rule stops and tells you to run $atk-init-failure-rule first.

The failing cases are written to:

.atk/results/v1/failure_cases.csv

4. Generate the analysis report

Run:

$atk-report

Codex writes:

.atk/results/v1/report.md

The report summarizes test results, failure cases, likely causes, and recommended tuning priorities.

5. Optionally review failures in HTML

Run:

$atk-visualize-failures

Codex writes:

.atk/results/v1/failure_cases.html

This optional failure browser can be generated any time failure_cases.csv exists. If a report.md exists for the same version, it is used as best-effort, non-blocking context; a missing or unparseable report does not block the visualization. The Skill uses a fixed, plugin-owned, stdlib-only generator script, so output is deterministic and dependency-free while still offering expected-vs-actual review, search/filter/pagination, schema-adaptive role switching, and safe relative log links.
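
To give a feel for what "stdlib-only and deterministic" means here, the sketch below renders a failure CSV as a bare HTML table. It is not the plugin's generator and omits search, pagination, role switching, and log links; render_failures_html is a hypothetical name.

```python
import csv
import html
import io


def render_failures_html(failures_text):
    """Render CSV text as a minimal HTML table, escaping every cell."""
    reader = csv.DictReader(io.StringIO(failures_text))
    columns = reader.fieldnames or []
    head = "".join(f"<th>{html.escape(col)}</th>" for col in columns)
    body_rows = []
    for row in reader:
        cells = "".join(
            f"<td>{html.escape(row.get(col) or '')}</td>" for col in columns
        )
        body_rows.append(f"<tr>{cells}</tr>")
    return (
        "<!doctype html><html><body><table>"
        f"<thead><tr>{head}</tr></thead>"
        f"<tbody>{''.join(body_rows)}</tbody>"
        "</table></body></html>"
    )
```

Escaping each cell with html.escape matters because Agent output can contain markup that should be displayed, not executed.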

6. Let Codex tune the Agent

Run:

$atk-tune

Codex edits the Agent based on the report and records the tuning plan in:

.atk/results/v1/tuning_plan.md

Verify that tuning worked

After tuning, run the test again:

$atk-run

This creates .atk/results/v2/eval_results.csv. Then run:

$atk-find-failures
$atk-report

Starting with the second loop, the report reads the previous tuning_plan.md and tells you whether the target failures were resolved, partially resolved, unresolved, or impossible to judge.
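
The four outcomes can be illustrated with a simple set comparison. In ATK the judgment is made by Codex from tuning_plan.md and the new results, so the logic below is only an assumed approximation; classify_targets and the idea of identifying rows by a stable key are hypothetical.

```python
def classify_targets(target_rows, current_failures):
    """Compare the rows a tuning plan targeted with the rows that still fail."""
    targets = set(target_rows)
    if not targets:
        # Assumption: with no recorded targets there is nothing to compare against.
        return "impossible to judge"
    still_failing = targets & set(current_failures)
    if not still_failing:
        return "resolved"
    if still_failing == targets:
        return "unresolved"
    return "partially resolved"
```

For example, if the v1 plan targeted rows q2 and q5 and only q5 still fails in v2, this sketch would label the outcome partially resolved.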

Expected output

.atk/
├── datasets/
│   └── service_source_codes.csv
├── runner/
│   ├── eval_runner.py
│   └── failure_rule.py
└── results/
    ├── v1/
    │   ├── eval_results.csv
    │   ├── logs/                    # optional row logs
    │   │   └── row_000001.log
    │   ├── failure_cases.csv
    │   ├── failure_cases.html       # optional failure browser
    │   ├── report.md
    │   └── tuning_plan.md
    └── v2/
        └── ...

Most users only need to read eval_results.csv, failure_cases.csv, optional failure_cases.html, report.md, and row logs linked from agent_output_log_path when available. Version directories are managed automatically.
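
You never need to create version directories yourself, but the v1, v2, ... numbering can be pictured with a sketch like this one. It assumes the next version is one more than the highest existing vN directory; next_version_dir is a hypothetical name, not an ATK function.

```python
import re
from pathlib import Path


def next_version_dir(results_dir: Path) -> Path:
    """Return the path of the next vN directory under results_dir."""
    versions = []
    if results_dir.is_dir():
        for child in results_dir.iterdir():
            match = re.fullmatch(r"v(\d+)", child.name)
            if child.is_dir() and match:
                versions.append(int(match.group(1)))
    # Start at v1 when no version directory exists yet.
    return results_dir / f"v{max(versions, default=0) + 1}"
```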

Available Skills

$atk-status: inspect progress and recommend the next step.
$atk-init: generate a test runner for the current Agent.
$atk-run: run the test runner and create the current result version.
$atk-find-failures: let Codex identify failing cases.
$atk-init-failure-rule: create or update .atk/runner/failure_rule.py.
$atk-find-failures-by-rule: execute .atk/runner/failure_rule.py to identify failing cases with explicit rules.
$atk-report: generate analysis and cross-loop validation.
$atk-visualize-failures: generate optional .atk/results/vN/failure_cases.html from current failure_cases.csv.
$atk-tune: tune the Agent and record the tuning plan.

Release history

0.4.2: Jun 3, 2026
0.4.1: Jun 3, 2026
0.4.0: Jun 1, 2026
0.3.9: May 29, 2026
0.3.8: May 27, 2026
0.3.7 (this version): May 26, 2026
0.3.6: May 26, 2026

Download files

Source Distribution: agent_tune_kit-0.3.7.tar.gz

Upload date: May 26, 2026
Size: 93.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6

Hashes for agent_tune_kit-0.3.7.tar.gz
Algorithm	Hash digest
SHA256	`7dac6d48f07fcd2c2b1420742fc3441f9c28e489efdc6951732dae1cb31a80b5`
MD5	`885e04a271254713f09d0ccc37f5d1f5`
BLAKE2b-256	`0c6395d7035d6ebad2290be0563e235d7f248491319e4ad3143f18692bc8cd96`

Built Distribution: agent_tune_kit-0.3.7-py3-none-any.whl

Upload date: May 26, 2026
Size: 101.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.6

Hashes for agent_tune_kit-0.3.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6070086df4b7bb273cf2afee6de633f71a7492d81824087fa387a760eb65673`
MD5	`bda55e4b93a0219893b8a9828d7d9c93`
BLAKE2b-256	`e5e2e36acb26e4756d32871e38c36d075e4d6af4334ff4169175aebf847f475b`
