GRASP — self-improvement via a regression-gated skill library learned from an agent's own failure traces
GRASP
GRASP is a self-improvement method that learns a small, regression-gated skill library from an agent's own failure traces. A proposed skill is kept only when it improves performance on a held-out probe set — so the library grows by keeping what demonstrably helps and discarding what doesn't.
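The keep-or-discard rule can be illustrated with a minimal sketch. This is not the GRASP implementation: `gate_skill`, `toy_score`, the `min_gain` parameter, and the representation of a skill as a plain string are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical types: a skill is a string of guidance, and a scorer returns
# accuracy in [0, 1] for a given library on a given probe set.
Skill = str
ScoreFn = Callable[[Sequence[Skill], Sequence[str]], float]


def gate_skill(library: List[Skill], candidate: Skill,
               probes: Sequence[str], score: ScoreFn,
               min_gain: float = 0.0) -> List[Skill]:
    """Return the library with `candidate` added only if probe accuracy improves."""
    baseline = score(library, probes)
    trial = library + [candidate]
    if score(trial, probes) > baseline + min_gain:
        return trial      # keep: the skill helps on the held-out probes
    return library        # discard: no measurable improvement


def toy_score(library, probes):
    # Toy scorer (assumption): a probe counts as solved when some skill
    # in the library mentions it.
    solved = sum(any(p in s for s in library) for p in probes)
    return solved / len(probes)


probes = ["date format", "unit conversion"]
lib = gate_skill([], "always check the date format", probes, toy_score)
lib = gate_skill(lib, "be polite", probes, toy_score)  # rejected: no gain
print(lib)
```

In a real run the scorer would execute the agent on the probe set with the trial library; the toy scorer here only stands in for that step.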
This repository is two things:
- A reusable method + framework (grasp/): apply GRASP to your own agent and tasks, and benchmark your own self-improvement method against GRASP and five baselines through a small plug-in interface.
- The full paper artifact: four benchmark families (benchmarks/) and all released results behind the paper (results/).
Install
pip install -e . # core depends only on PyYAML
Quickstart (no Docker, no server)
Watch GRASP learn skills on a laptop in minutes, on a self-contained slice of MedAgentBench's read-only FHIR lookup tasks served by an in-process mock:
# point the 'local' backend at any OpenAI-compatible endpoint
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"
export GRASP_MODEL="your-model-name"
python -m examples.quickstart.run --agent local
The run writes a validation-accuracy learning curve and the learned skill library under
examples/quickstart/runs/. See examples/quickstart/ for details.
Use GRASP on your own agent
Implement a Task (how to sample, run, and score your environment) and run GRASP
on it:
from grasp import run_grasp
run_grasp(MyTask(), "config.yaml", agent="local")
- Task: samples(), rollout(sample, agent), evaluate(sample, output), plus optional failure_tags / protocol_hook / updater_* hooks.
- Method: GRASP is the reference Method; subclass it to benchmark your own self-improvement method on the same tasks.
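As a rough sketch of the three required Task methods, the toy class below follows the method names listed above. It deliberately does not import or subclass anything from grasp: the sample layout, the return types, the assumption that an agent is a plain callable, and the `accuracy` helper are all hypothetical, so check docs/add_a_task.md for the actual interface.

```python
class EchoTask:
    """Toy task: the agent must repeat a word in upper case."""

    def samples(self):
        # Each sample carries a prompt and the expected answer (assumed layout).
        return [
            {"prompt": "hello", "answer": "HELLO"},
            {"prompt": "grasp", "answer": "GRASP"},
        ]

    def rollout(self, sample, agent):
        # `agent` is assumed here to be any callable from prompt text to output text.
        return agent(sample["prompt"])

    def evaluate(self, sample, output):
        # Assumed to return True when the output is correct.
        return output.strip() == sample["answer"]


def accuracy(task, agent):
    """Hypothetical helper: fraction of samples the agent gets right."""
    samples = task.samples()
    hits = sum(task.evaluate(s, task.rollout(s, agent)) for s in samples)
    return hits / len(samples)


print(accuracy(EchoTask(), str.upper))  # 1.0
print(accuracy(EchoTask(), str.lower))  # 0.0
```

Keeping sampling, execution, and scoring in separate methods is what lets the same task be reused across different methods and agents.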
| Read this | For |
|---|---|
| docs/method.md | how GRASP works: the loop and the regression gate |
| docs/add_a_task.md | plug in your own environment |
| docs/add_a_method.md | benchmark your own method vs. GRASP + 5 baselines |
Benchmarks (the paper artifact)
Each benchmark is self-contained under benchmarks/, with its own README for
environment setup (conda, Docker, data) and a run_all.sh <backend> [run_name]
helper.
| Directory | Benchmark | Role in paper | Setup |
|---|---|---|---|
| benchmarks/MedAgentBench/ | FHIR reads/writes against a live FHIR server | primary (clinical) | Docker |
| benchmarks/MedAgentBench-v2/ | Harder FHIR tasks: multi-step decisions, coordinated writes | primary (clinical) | Docker |
| benchmarks/FHIR-AgentBench/ | Structured clinical QA / tool use on an independent FHIR store | supporting (clinical) | GCP Healthcare API |
| benchmarks/AgentBench/ | Four non-clinical environments: OS, DBBench, WebShop, ALFWorld | supporting (generality) | Docker |
The paper compares GRASP against a no-skills baseline and five self-improvement
methods, all implemented in each benchmark directory: grasp (GRASP, ours),
memory_cycle (Sequential memory), batch_memory_cycle (Batch memory),
expel_cycle (ExpeL), evo_memory_cycle (Evo-MedAgent), skillx_cycle (SkillX).
The executing agent and skill-writer use the same model. Five preset backends
are selectable at run time (gptoss, deepseek, gemini, gpt5, gpt4), along with a
generic local OpenAI-compatible endpoint. No secrets are stored in the
repository; presets read endpoints and keys from environment variables. See
each benchmark's configs/agents/README.md.
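Reading endpoints and keys from the environment might look like the sketch below. The variable names are the ones used in the quickstart; the `load_local_preset` function and the shape of the returned dictionary are assumptions for illustration, not the repository's preset loader.

```python
import os


def load_local_preset(env=None):
    """Build a backend config from environment variables (illustrative only)."""
    env = os.environ if env is None else env
    missing = [k for k in ("OPENAI_BASE_URL", "GRASP_MODEL") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "base_url": env["OPENAI_BASE_URL"].rstrip("/"),
        # Local servers often accept any key; "EMPTY" mirrors the quickstart.
        "api_key": env.get("OPENAI_API_KEY", "EMPTY"),
        "model": env["GRASP_MODEL"],
    }


# Passing a dict instead of os.environ keeps the example self-contained.
cfg = load_local_preset({
    "OPENAI_BASE_URL": "http://localhost:8000/v1",
    "GRASP_MODEL": "your-model-name",
})
print(cfg)
```

Failing early on a missing variable avoids starting a long run that would only error on the first model call.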
Released results
All numbers behind the paper live under results/ — per-seed
validation, test, and OOD accuracies for every cell of Tables 1–5, the learned
skill libraries, the frozen transfer libraries, and the run configurations.
Reproduce the headline tables directly:
python results/reproduce_tables.py # Table 1 (all models) + Table 5
python results/reproduce_tables.py gpt-oss-120b # one model
See results/README.md for the full directory↔cell map.
License
MIT (see LICENSE) for the GRASP core, examples, and docs. Vendored
benchmark code under benchmarks/AgentBench/ and benchmarks/FHIR-AgentBench/
retains its own upstream license.
Citation
See CITATION.cff.