Local Codex plugin for iterative Agent tuning with guided Skills, reusable runner templates, versioned results, and static validation.

Agent Tune Kit

Agent Tune Kit is a local Codex plugin that helps you evaluate and improve the quality of your own local Agent.

If you already have a working Agent but are not sure where it fails, why it fails, or what to tune next, this project lets Codex help you run a complete loop: batch test the Agent, find failure cases, write an analysis report, tune the Agent, and verify the next run.

Its main advantage is a low-friction start. You do not need to design a complex evaluation schema or expose a universal Agent interface first. Bring a local Agent project and a small evaluation dataset; Codex reads the code and data samples, then generates the project-specific runner and tuning workflow.

Who it is for

Use this if you have:

  • a local Agent, chatbot, tool-using Agent, or RAG Agent;
  • a few test questions, sample inputs, expected answers, or human-judgable results;
  • a need to quickly find weak spots and let Codex help tune prompts, code, parameters, or tool configuration;
  • a desire to keep each tuning loop traceable with result files and reports.

You do not need a full evaluation platform to start. For the first validation, 5 to 20 CSV rows are enough.

Prerequisites

You only need:

  • Codex with local plugin/Skill support.
  • Python 3.
  • A local Agent project that Codex can inspect and edit.
  • A simple evaluation dataset, preferably CSV. Column names do not need to follow a strict schema; Codex will infer the input and expected-result columns where possible.
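
As a rough illustration of how small the starting dataset can be, the sketch below writes a five-row CSV. The column names `question` and `expected`, the sample rows, and the file name are hypothetical; Agent Tune Kit does not require these exact headers.

```python
import csv
from pathlib import Path

# Hypothetical sample rows; replace them with real inputs for your Agent.
ROWS = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
    {"question": "How do I reset my password?", "expected": "Use the reset link"},
    {"question": "Is there a free trial?", "expected": "Yes, 14 days"},
    {"question": "Where is my invoice?", "expected": "Billing page"},
]


def write_dataset(path: Path, rows: list[dict[str, str]]) -> Path:
    """Write rows to a CSV file, using the first row's keys as the header."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Calling `write_dataset(Path("eval_dataset.csv"), ROWS)` produces a file you could then point Codex at.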

Create a git checkpoint before tuning if you want an easy rollback path. Agent Tune Kit does not roll back tuning changes made to your Agent; the installer's rollback command only restores the local marketplace and plugin-store install state.

Quickstart: install the plugin

No repository clone is needed for normal use. Run the packaged installer directly with uvx:

uvx --from agent-tune-kit atk install

For a persistent command, install the tool first, then run atk:

uv tool install agent-tune-kit
atk install

If you prefer pipx:

pipx install agent-tune-kit
atk install

The installer validates the packaged plugin manifest, adds the plugin to the Personal marketplace, writes or updates ~/.agents/plugins/marketplace.json, copies the packaged payload into ~/plugins/agent-tune-kit, and runs local smoke/status checks by default. It proves local files and marketplace state only; it does not bypass or modify hidden Codex UI enablement state.
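
If you want to look at the two locations named above yourself, a read-only check might look like the following sketch. It only tests that the marketplace file parses as JSON and that the payload directory exists. It assumes nothing about the JSON structure, and it is not a substitute for `atk status`. The function name `check_install` is made up for this example.

```python
import json
from pathlib import Path


def check_install(home: Path) -> dict[str, bool]:
    """Report whether the installer-managed paths exist under the given home directory."""
    marketplace = home / ".agents" / "plugins" / "marketplace.json"
    payload = home / "plugins" / "agent-tune-kit"

    marketplace_ok = False
    if marketplace.is_file():
        try:
            json.loads(marketplace.read_text(encoding="utf-8"))
            marketplace_ok = True
        except json.JSONDecodeError:
            marketplace_ok = False

    return {
        "marketplace_json_parses": marketplace_ok,
        "payload_dir_exists": payload.is_dir(),
    }
```

For a real check you would call `check_install(Path.home())`; the function never writes anything.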

Useful helper commands:

atk preview --smoke   # preview only; no writes
atk status            # read local install status and next steps
atk rollback --backup <backup-id>  # restore installer-managed local install state only

When an existing marketplace/plugin-store conflict is found, interactive terminals prompt before replacement. Noninteractive replacement requires --yes --force; destructive replacement creates a backup first and prints a rollback command. The installer supports explicit subcommands only and does not keep old entry points; use atk preview for a no-write preview.

Contributor checkout path, for editing this repository itself:

git clone git@github.com:hustyichi/agent-tune-kit.git
cd agent-tune-kit
uv sync
uv run atk install
# or: python3 scripts/install_plugin.py install

After install, Agent Tune Kit should be listed in /plugins.

You still need to enable it in Codex:

/plugins

Select Agent Tune Kit in the plugin list and follow the UI prompt to install/enable it. After you enable it in the UI, $atk-status and the other Skill commands should appear in autocomplete.

If the plugin is enabled in /plugins but $atk-status still does not appear in the current session, that is expected: Codex usually loads plugin Skills when a session starts, so newly enabled plugins may not be hot-loaded into an already running session. Restart Codex, or close the current Codex session and reopen this project, then type $atk-status again to verify.

If your environment cannot use local plugins, do not copy individual skills/* directories piecemeal; this repository now treats the local Codex plugin install path as the only recommended entry point.

Maintainer release to PyPI

The release scripts follow the two-step release gate/publish shape used by agent-tune-cli: default mode is a dry run, and uploads only happen with an explicit --publish.

Run the full local release gate first. It checks version alignment, static validation, tests, uv build --no-sources, archive contents, and packaged atk smoke installs outside the repository:

UV_NO_CONFIG=1 uv run python scripts/check-release.py

Prepare clean dist/ artifacts without uploading:

UV_NO_CONFIG=1 uv run python scripts/publish-release.py

Publish to TestPyPI first:

export UV_PUBLISH_TOKEN='pypi-your-testpypi-token'
UV_NO_CONFIG=1 uv run python scripts/publish-release.py --repository testpypi --publish

After TestPyPI install validation, publish to PyPI:

export UV_PUBLISH_TOKEN='pypi-your-pypi-token'
UV_NO_CONFIG=1 uv run python scripts/publish-release.py --repository pypi --publish

The publish script checks whether the current project.name + project.version already exists before uploading. If it exists, bump the version in pyproject.toml, .codex-plugin/plugin.json, and src/agent_tune_kit/__init__.py first. Never commit or paste PyPI tokens.

For the fixed production PyPI path, you can run the zero-argument wrapper:

scripts/publish-pypi.sh

It is equivalent to UV_NO_CONFIG=1 uv run python scripts/publish-release.py --repository pypi --publish, but checks that UV_PUBLISH_TOKEN is set first.

Minimal tuning loop

Run these steps in your Agent repository, not in this Agent Tune Kit repository.

1. Generate a test runner

Run:

$atk-init

Point Codex to your Agent entrypoint and evaluation dataset. Codex generates:

.atk/runner/eval_runner.py

The runner keeps your original dataset columns and adds the Agent's actual output as agent_output. It also adds agent_output_log_path; when trustworthy Python logging capture is configured, this column points to row-specific files such as logs/row_000001.log for serial or same-process concurrent runs.
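
The real runner is generated by Codex for your project, so its code will differ. As a minimal sketch of the shape described above, the function below copies every original column and appends `agent_output` and `agent_output_log_path`. The `agent` callable and the error-to-text handling are assumptions made for illustration, and the log path is left empty because no log capture is configured here.

```python
import csv
from pathlib import Path
from typing import Callable


def run_dataset(
    dataset: Path,
    results: Path,
    agent: Callable[[dict[str, str]], str],  # hypothetical adapter around your Agent
) -> int:
    """Run the agent on every row and write the original columns plus the output columns."""
    with dataset.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + ["agent_output", "agent_output_log_path"]
        rows = list(reader)

    results.parent.mkdir(parents=True, exist_ok=True)
    with results.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            try:
                output = agent(row)
            except Exception as exc:  # keep the batch going; record the error as the output
                output = f"ERROR: {exc!r}"
            writer.writerow({**row, "agent_output": output, "agent_output_log_path": ""})
    return len(rows)
```

A trivial adapter such as `lambda row: my_agent.answer(row["question"])` would be enough to try this shape, where `my_agent` and `question` stand in for your own entrypoint and input column.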

$atk-init first snapshots the provided dataset into .atk/datasets/, and the generated runner reads that project-local copy. If a same-name snapshot already exists with identical content, it is reused; if the name exists with different content, ATK uses readable incrementing names such as dataset_2.csv and dataset_3.csv.
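
The reuse-or-increment naming rule can be sketched in a few lines. This illustrates the behavior described above and is not the plugin's actual code; the function name `snapshot_dataset` and the byte-for-byte comparison are assumptions.

```python
import shutil
from pathlib import Path


def snapshot_dataset(source: Path, datasets_dir: Path) -> Path:
    """Copy source into datasets_dir, reusing an identical snapshot or picking name_2, name_3, ..."""
    datasets_dir.mkdir(parents=True, exist_ok=True)
    content = source.read_bytes()
    candidate = datasets_dir / source.name
    counter = 1
    while candidate.exists():
        if candidate.read_bytes() == content:
            return candidate  # identical content already snapshotted: reuse it
        counter += 1
        candidate = datasets_dir / f"{source.stem}_{counter}{source.suffix}"
    shutil.copyfile(source, candidate)
    return candidate
```

With a source named `dataset.csv`, the first call yields `dataset.csv`, a later call with changed content yields `dataset_2.csv`, and repeating either call with unchanged content returns the existing file.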

2. Run the Agent on the dataset

Run:

$atk-run

This writes:

.atk/results/v1/eval_results.csv

If row logging is active, the same version also contains .atk/results/v1/logs/row_*.log. Row logs are generated for configured same-process Python logging capture in serial runs and, when CONCURRENT_ROW_LOGGING_ENABLED remains enabled, with --concurrency > 1. The runner only writes records emitted while an ATK row context is active; stdout/stderr, subprocess, multiprocess, and post-row background logs remain out of scope. If concurrent row logging is disabled, concurrent runs visibly downgrade to app.log/CSV evidence instead of creating row logs.
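
One common way to get "only records emitted while a row context is active" is a `logging.Handler` that consults a `contextvars.ContextVar`. The sketch below shows that general technique, not the generated runner's implementation; `RowLogHandler` and `row_context` are illustrative names. Like the real feature, it sees only same-process Python logging, not stdout/stderr or subprocess output.

```python
import contextvars
import logging
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional

# Holds the log file for the row currently being processed, or None outside any row.
_current_row_log: contextvars.ContextVar[Optional[Path]] = contextvars.ContextVar(
    "current_row_log", default=None
)


class RowLogHandler(logging.Handler):
    """Append each record to the active row's log file; drop records outside a row context."""

    def emit(self, record: logging.LogRecord) -> None:
        path = _current_row_log.get()
        if path is None:
            return
        try:
            with path.open("a", encoding="utf-8") as handle:
                handle.write(self.format(record) + "\n")
        except Exception:
            self.handleError(record)


@contextmanager
def row_context(logs_dir: Path, row_number: int) -> Iterator[Path]:
    """Mark a row as active so that RowLogHandler routes records to row_NNNNNN.log."""
    logs_dir.mkdir(parents=True, exist_ok=True)
    path = logs_dir / f"row_{row_number:06d}.log"
    token = _current_row_log.set(path)
    try:
        yield path
    finally:
        _current_row_log.reset(token)
```

Because the active row is stored in a context variable, worker threads or asyncio tasks that each enter their own `row_context` write to separate files, which is the property concurrent row logging depends on.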

3. Find failing cases

For the simplest path, let Codex judge which cases failed:

$atk-find-failures

If you already have a clear rule, first create or update the reusable rule script:

$atk-init-failure-rule rule: mark a row as failed when the expected field differs from agent_output

Codex uses the rule you provide in the command to generate the rule script at:

.atk/runner/failure_rule.py

Then execute that rule script to write the failing cases:

$atk-find-failures-by-rule

If .atk/runner/failure_rule.py is missing, $atk-find-failures-by-rule stops and tells you to run $atk-init-failure-rule first.

The failing cases are written to:

.atk/results/v1/failure_cases.csv
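
For the example rule above ("failed when the expected field differs from agent_output"), a rule script might look roughly like the sketch below. Codex generates the actual `failure_rule.py`, so treat this only as an illustration. The column name `expected` is hypothetical, and trimming surrounding whitespace before comparing is an added assumption rather than part of the stated rule.

```python
import csv
from pathlib import Path

EXPECTED_COLUMN = "expected"  # hypothetical; use the column your dataset actually has


def is_failure(row: dict[str, str]) -> bool:
    """A row fails when the expected value and the Agent output differ after trimming."""
    expected = (row.get(EXPECTED_COLUMN) or "").strip()
    actual = (row.get("agent_output") or "").strip()
    return expected != actual


def write_failures(results: Path, failures: Path) -> int:
    """Copy failing rows from the results CSV into the failures CSV; return the count."""
    with results.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        failed = [row for row in reader if is_failure(row)]

    failures.parent.mkdir(parents=True, exist_ok=True)
    with failures.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(failed)
    return len(failed)
```

Exact string comparison is the strictest possible rule; for free-text answers you would likely describe a looser rule to `$atk-init-failure-rule` or let Codex judge with `$atk-find-failures`.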

4. Generate the analysis report

Run:

$atk-report

Codex writes:

.atk/results/v1/report.md

The report summarizes test results, failure cases, likely causes, and recommended tuning priorities.

5. Optionally review failures in HTML

Run:

$atk-visualize-failures

Codex writes:

.atk/results/v1/failure_cases.html

This optional failure browser can be generated any time failure_cases.csv exists. If a report.md exists for the same version, it is used as best-effort, non-blocking context; a missing or unparseable report does not block the visualization. The Skill uses a fixed, plugin-owned generator script that depends only on the standard library, so output is deterministic and dependency-free while still offering expected-vs-actual review, search/filter/pagination, schema-adaptive role switching, and safe relative log links.
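
To give a feel for what a dependency-free generator involves, here is a much-reduced sketch that renders failure rows as a static HTML table using only the standard library. It is not the plugin's generator and has none of its search, pagination, role switching, or log links; the function name is invented. The one property it does share is that every cell is escaped with `html.escape`, so Agent output cannot inject markup.

```python
import csv
import html
from pathlib import Path


def render_failures_html(failures_csv: Path) -> str:
    """Return a minimal static HTML page with one table row per failure case."""
    with failures_csv.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        columns = list(reader.fieldnames or [])
        rows = list(reader)

    head = "".join(f"<th>{html.escape(column)}</th>" for column in columns)
    body = "\n".join(
        "<tr>"
        + "".join(f"<td>{html.escape(row.get(column) or '')}</td>" for column in columns)
        + "</tr>"
        for row in rows
    )
    return (
        '<!doctype html>\n<html><head><meta charset="utf-8">'
        "<title>Failure cases</title></head>\n<body><table>\n"
        f"<thead><tr>{head}</tr></thead>\n<tbody>\n{body}\n</tbody>\n"
        "</table></body></html>\n"
    )
```

The same input always yields the same string, which is the sense in which a fixed script gives deterministic output.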

6. Let Codex tune the Agent

Run:

$atk-tune

Codex edits the Agent based on the report and records the tuning plan in:

.atk/results/v1/tuning_plan.md

Verify that tuning worked

After tuning, run the test again:

$atk-run

This creates .atk/results/v2/eval_results.csv. Then run:

$atk-find-failures
$atk-report

Starting with the second loop, the report reads the previous tuning_plan.md and tells you whether the target failures were resolved, partially resolved, unresolved, or impossible to judge.
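
The report's verdicts come from Codex reading the tuning plan, not from a fixed formula. If you want a quick mechanical cross-check between two versions, a simple set comparison of failing rows is one option. This is a deliberately simpler technique than the report's judgment, and it assumes your dataset has a column that uniquely identifies each case (the `key` argument, hypothetical here).

```python
import csv
from pathlib import Path


def failing_keys(failures_csv: Path, key: str) -> set[str]:
    """Read the identifying column from a failure_cases.csv file."""
    with failures_csv.open(newline="", encoding="utf-8") as src:
        return {row[key] for row in csv.DictReader(src)}


def compare_failures(previous: Path, current: Path, key: str) -> dict[str, list[str]]:
    """Split cases into no-longer-failing, still-failing, and newly-failing groups."""
    before = failing_keys(previous, key)
    after = failing_keys(current, key)
    return {
        "no_longer_failing": sorted(before - after),
        "still_failing": sorted(before & after),
        "newly_failing": sorted(after - before),
    }
```

A case that drops out of the failure list is only meaningfully fixed if it was actually rerun in the newer version, so read the result alongside `eval_results.csv`.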

Expected output

.atk/
├── datasets/
│   └── service_source_codes.csv
├── runner/
│   ├── eval_runner.py
│   └── failure_rule.py
└── results/
    ├── v1/
    │   ├── eval_results.csv
    │   ├── logs/                    # optional row logs
    │   │   └── row_000001.log
    │   ├── failure_cases.csv
    │   ├── failure_cases.html       # optional failure browser
    │   ├── report.md
    │   └── tuning_plan.md
    └── v2/
        └── ...

Most users only need to read eval_results.csv, failure_cases.csv, optional failure_cases.html, report.md, and row logs linked from agent_output_log_path when available. Version directories are managed automatically.
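
You never need to create version directories yourself, but the `v1`, `v2`, ... numbering is easy to picture. The sketch below shows one plausible way to pick the next directory name; how the plugin actually decides this is not documented here, and `next_version_dir` is an illustrative name.

```python
import re
from pathlib import Path

_VERSION_NAME = re.compile(r"^v(\d+)$")


def next_version_dir(results_root: Path) -> Path:
    """Return results_root/vN where N is one more than the highest existing version."""
    existing: list[int] = []
    if results_root.is_dir():
        for child in results_root.iterdir():
            match = _VERSION_NAME.match(child.name)
            if child.is_dir() and match:
                existing.append(int(match.group(1)))
    return results_root / f"v{max(existing, default=0) + 1}"
```

The version numbers are compared as integers, so `v10` correctly follows `v9` instead of sorting before `v2`.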

Available Skills

  • $atk-status: inspect progress and recommend the next step.
  • $atk-init: generate a test runner for the current Agent.
  • $atk-run: run the test runner and create the current result version.
  • $atk-find-failures: let Codex identify failing cases.
  • $atk-init-failure-rule: create or update .atk/runner/failure_rule.py.
  • $atk-find-failures-by-rule: execute .atk/runner/failure_rule.py to identify failing cases with explicit rules.
  • $atk-report: generate analysis and cross-loop validation.
  • $atk-visualize-failures: generate optional .atk/results/vN/failure_cases.html from current failure_cases.csv.
  • $atk-tune: tune the Agent and record the tuning plan.
