A TRL-like training library for end-to-end skill optimization of frozen LLM agents (based on Microsoft SkillOpt).

These details have not been verified by PyPI

Project links

Project description

skillrl

A TRL-like training library for end-to-end skill optimization of frozen LLM agents. Implements the core algorithm of Microsoft SkillOpt (project page, repo) as a clean, modular Python package — designed to grow into the TRL of skill / prompt-space optimization.

1. What is this?

Modern LLM agents are usually improved either by fine-tuning weights (expensive, opaque) or by hand-tweaking prompts (cheap, brittle, ad-hoc).

SkillOpt proposes a third path: treat a natural-language skill document (a markdown file of guidelines, heuristics, do/don'ts) as the trainable state. Both the target LLM (the agent that uses the skill) and the optimizer LLM (the model that critiques and rewrites the skill) stay frozen. Gradient descent is replaced by a textual analogue:

SGD on weights	SkillOpt on skill text
Forward pass	Rollout: target agent runs the skill on a batch
Backward pass (∂L/∂θ)	Reflect: optimizer LLM analyses success/failure → candidate edits
Gradient accumulation	Aggregate: hierarchical merge → one coherent patch
Gradient clipping / `learning_rate`	Select: rank edits, keep top-`L` (the edit budget)
`optimizer.step()`	Update: deterministically apply edits to the skill doc
Validation	Evaluate: hold-out gate — accept iff strictly better

skillrl packages this 6-stage pipeline as a TRL-style library, so you can write:

trainer = SkillOptTrainer(config=cfg, env=env, optimizer_client=..., target_client=...)
summary = trainer.train()

…just like you'd write PPOTrainer(...).train() in 🤗 TRL.

2. Why TRL-like?

🤗 TRL	skillrl
`PPOConfig` (dataclass)	`SkillOptConfig` (dataclass)
`PPOTrainer.train()`	`SkillOptTrainer.train()`
Reward model	`SkillEnv` (`rollout_one` returns `hard`/`soft`/`fail_reason`)
Policy model (trainable)	Skill document (markdown, trainable)
Reference / value model	Frozen target_client
Optimizer (Adam)	Frozen optimizer_client + edit budget scheduler
Learning rate	`edit_budget` (max edits per step) + `lr_scheduler` (constant/linear/cosine)
Gradient clipping	LLM-based ranking — keeps top-`L` edits
Validation reward	`selection_split` gate (`hard` / `soft` / `mixed`)

3. Installation

# editable install from this repo
pip install -e .

# or with dev extras (pytest)
pip install -e .[dev]

Requirements: Python ≥ 3.10, openai>=1.40.0. Any OpenAI-compatible endpoint (vLLM / Together / Azure / Moonshot / DeepSeek / …) works out of the box.

4. Quick start

A minimal end-to-end example using the bundled SimpleQAEnv:

from skillrl import SkillOptConfig, SkillOptTrainer
from skillrl.envs.qa import SimpleQAEnv
from skillrl.llm.openai_client import OpenAIChatClient

# 1. Data
train = [
    {"id": "1", "question": "Capital of France?",      "answers": ["Paris"]},
    {"id": "2", "question": "Largest ocean on Earth?", "answers": ["Pacific Ocean", "Pacific"]},
    # ... 30+ items recommended
]
val  = [{"id": "v1", "question": "Capital of Japan?", "answers": ["Tokyo"]}]
test = [{"id": "t1", "question": "Capital of Italy?", "answers": ["Rome"]}]

env = SimpleQAEnv(train_items=train, val_items=val, test_items=test)

# 2. Backends (optimizer_client = strong; target_client = the agent under training)
optimizer = OpenAIChatClient(model="gpt-4o")           # critic / rewriter
target    = OpenAIChatClient(model="gpt-4o-mini")      # the frozen agent

# 3. Config (paper-default protocol)
cfg = SkillOptConfig(
    num_epochs=2,
    batch_size=8,
    minibatch_size=4,
    edit_budget=4,
    lr_scheduler="cosine",
    gate_metric="hard",
    out_root="outputs/qa_demo",
)

# 4. Train
trainer = SkillOptTrainer(
    config=cfg, env=env,
    optimizer_client=optimizer, target_client=target,
    initial_skill="You are a concise QA assistant. Answer in one short phrase.",
)
summary = trainer.train()
print(summary["best_selection_score"], summary["test_hard"])

A runnable version lives at examples/train_qa.py.

After training, the output directory contains everything you need to inspect the run:

outputs/qa_demo/
├── config.json                 # resolved config
├── best_skill.md               # all-time best skill (deploy this)
├── current_skill.md            # last accepted skill
├── history.json                # per-step records
├── runtime_state.json          # for auto-resume
├── summary.json                # final report
├── skills/skill_v0001.md ...   # per-step snapshots
├── steps/step_0000/
│   ├── rollout_results.json
│   ├── raw_patches.json
│   ├── merged_patch.json
│   ├── ranked_patch.json
│   ├── candidate_skill.md
│   ├── edit_apply_report.json
│   ├── selection_eval/        # validation rollouts on this candidate
│   └── step_record.json
├── test_eval_baseline/
└── test_eval_best/

If you re-launch with the same out_root, training auto-resumes from the last completed step.

5. The 6-stage pipeline

   ┌──────────────────────────────────────────────────────────────┐
   │                   one optimization step                      │
   │                                                              │
   │  current_skill.md                                            │
   │         │                                                    │
   │   ① ROLLOUT     env.rollout_one(item, skill, target_client)  │
   │         │  (parallel, n=batch_size)                          │
   │         ▼                                                    │
   │   trajectories  →  hard / soft / fail_reason                 │
   │         │                                                    │
   │   ② REFLECT     analyse failure & success minibatches        │
   │         │  optimizer_client → JSON {reasoning, edits}        │
   │         ▼                                                    │
   │   raw_patches  (failure-tagged + success-tagged)             │
   │         │                                                    │
   │   ③ AGGREGATE   hierarchical merge, failure-first            │
   │         │  optimizer_client → one coherent patch             │
   │         ▼                                                    │
   │   merged_patch                                               │
   │         │                                                    │
   │   ④ SELECT      LLM ranks edits, keep top-L (edit_budget)    │
   │         │  ≈ "gradient clipping" in text space               │
   │         ▼                                                    │
   │   ranked_patch                                               │
   │         │                                                    │
   │   ⑤ UPDATE      apply_patch(skill, ranked_patch)             │
   │         │  deterministic, append/insert/replace/delete       │
   │         ▼                                                    │
   │   candidate_skill.md                                         │
   │         │                                                    │
   │   ⑥ EVALUATE    rollout on selection_split → gate            │
   │         │  accept iff strictly better than current_score     │
   │         ▼                                                    │
   │   if accept:   current_skill := candidate                    │
   │                if also > best_score: best_skill := candidate │
   │   else:        keep current_skill                            │
   │                                                              │
   └──────────────────────────────────────────────────────────────┘

Edit budget = textual learning rate. The cap on edits applied per step is decayed (constant / linear / cosine) over the entire training horizon, exactly as the SkillOpt paper does.

Validation gate is strict: candidates must strictly beat current_score. A separate best_skill is tracked in parallel, so the artifact you ship is always the all-time best.

6. Library structure

skillrl/
├── __init__.py            # public exports
├── config.py              # SkillOptConfig (dataclass)
├── types.py               # Edit / Patch / RawPatch / RolloutResult / GateResult
├── trainer.py             # SkillOptTrainer — the main loop
│
├── core/
│   ├── editor.py          # apply_edit / apply_patch (5-Update)
│   ├── scheduler.py       # constant / linear / cosine edit-budget schedulers
│   ├── gate.py            # validation gate (hard / soft / mixed)
│   └── utils.py           # extract_json, compute_score, skill_hash
│
├── llm/
│   ├── base.py            # BaseLLMClient interface
│   └── openai_client.py   # OpenAI / Azure / OpenAI-compatible
│
├── pipeline/
│   ├── rollout.py         # 1-Rollout (parallel)
│   ├── reflect.py         # 2-Reflect (failure / success minibatches)
│   ├── aggregate.py       # 3-Aggregate (hierarchical merge, failure-first)
│   └── select.py          # 4-Select (LLM rank + top-L clip)
│
├── prompts/               # bundled markdown prompt templates
│   ├── analyst_error.md
│   ├── analyst_success.md
│   ├── merge_failure.md
│   ├── merge_success.md
│   ├── merge_final.md
│   └── ranking.md
│
└── envs/
    ├── base.py            # SkillEnv abstract class
    └── qa.py              # SimpleQAEnv (reference implementation)

7. Writing your own environment

To train a skill on your task, subclass SkillEnv:

from skillrl.envs.base import SkillEnv
from skillrl.types import RolloutResult

class MyEnv(SkillEnv):
    name = "my_env"

    def get_initial_skill(self) -> str:
        return "You are an expert XYZ agent..."

    def get_items(self, split: str) -> list[dict]:
        return self._splits[split]            # train / val / test

    def rollout_one(self, *, item, skill, target_client) -> RolloutResult:
        # 1) build the conversation; the *skill* is typically the system prompt.
        # 2) call target_client.chat(...) one or more times (multi-turn allowed).
        # 3) score the outcome:  hard ∈ {0,1},  soft ∈ [0,1].
        # 4) return RolloutResult(...).
        ...

That's it — drop it into SkillOptTrainer and you're training.

Tip. For multi-turn / tool-using agents, return the full conversation list and a meaningful fail_reason. The Reflect stage uses both to localise why the skill failed and what to change.

8. Customising prompts

All optimizer-LLM prompts live in skillrl/prompts/*.md. Override any of them per-trainer without modifying the package:

trainer = SkillOptTrainer(
    config=cfg, env=env,
    optimizer_client=opt, target_client=tgt,
    prompt_overrides={
        "analyst_error": open("my_prompts/analyst_error.md").read(),
        "ranking":       open("my_prompts/ranking.md").read(),
    },
)

Available keys: analyst_error, analyst_success, merge_failure, merge_success, merge_final, ranking.

9. Reproducibility & observability

Determinism. The same seed, batch_size, minibatch_size, dataset and backends produce the same minibatch shuffles and analyst groupings.
Auto-resume. Re-running with the same out_root skips already-completed steps (rebuilds the selection cache from history.json).
Per-step artifacts. Every stage's input/output is dumped — easy to diff between steps and reproduce any single step locally.
Selection cache. Identical candidate skills (by skill_hash) reuse cached selection-split scores — saves a lot of money on long runs.

10. What's NOT in 1.0 (yet)

skillrl 1.0 ships the core algorithm as faithfully as possible. The following SkillOpt features are intentionally deferred to future minor releases:

slow_update (skill momentum / EMA over accepted skills)
meta_skill (a meta-document guiding how to edit the skill)
Autonomous LR (online edit-budget tuning)
Gradient accumulation across steps
rewrite / full_rewrite_minibatch update modes
Codex / Claude-Code / Qwen / MiniMax execution backends
Ray-based distributed rollouts
WebUI

PRs welcome.

11. License

MIT. See LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.0.0

Jun 5, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

skillrl-1.0.0.tar.gz (41.4 kB view details)

Uploaded Jun 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

skillrl-1.0.0-py3-none-any.whl (46.7 kB view details)

Uploaded Jun 5, 2026 Python 3

File details

Details for the file skillrl-1.0.0.tar.gz.

File metadata

Download URL: skillrl-1.0.0.tar.gz
Upload date: Jun 5, 2026
Size: 41.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for skillrl-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`54b192e98d682658b162792832d4af368fb2da221f91a6bdce4cca79c7b44dac`
MD5	`9574c910f4e6ec1874d798e8f9a4d878`
BLAKE2b-256	`812e9e4650b4dd463dbfdda8f2453ac9919d640da277084099052dc6659c9e57`

See more details on using hashes here.

File details

Details for the file skillrl-1.0.0-py3-none-any.whl.

File metadata

Download URL: skillrl-1.0.0-py3-none-any.whl
Upload date: Jun 5, 2026
Size: 46.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for skillrl-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`be701dfb76448c853f2f80221620d2bd29022f9a84c50040048af0aaba10a5e3`
MD5	`9f475811fe1c282944e77111b7deb410`
BLAKE2b-256	`237fdd149c780a29ac4f4a75aff7022995fb1ef9327975b18c3155087ca56d3f`

See more details on using hashes here.

skillrl 1.0.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

skillrl

1. What is this?

2. Why TRL-like?

3. Installation

4. Quick start

5. The 6-stage pipeline

6. Library structure

7. Writing your own environment

8. Customising prompts

9. Reproducibility & observability

10. What's NOT in 1.0 (yet)

11. License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes