GRASP — self-improvement via a regression-gated skill library learned from an agent's own failure traces

GRASP

GRASP is a self-improvement method that learns a small, regression-gated skill library from an agent's own failure traces. A proposed skill is kept only when it improves performance on a held-out probe set — so the library grows by keeping what demonstrably helps and discarding what doesn't.
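The keep-or-discard rule described above can be sketched in a few lines. This is a minimal illustration, not GRASP's implementation: the `gated_update` and `probe_score` helpers, the representation of a skill as a plain string, and the toy scorer are all assumptions made for the example. See docs/method.md for the actual loop.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Assumption: a skill is represented here as a plain string.
Skill = str
# Assumption: a scorer maps (skill library, probe) to a score in [0, 1].
Scorer = Callable[[Sequence[Skill], str], float]


def probe_score(library: Sequence[Skill], probes: Sequence[str], score: Scorer) -> float:
    """Mean score on the held-out probes when the agent is given `library`."""
    return mean(score(library, probe) for probe in probes)


def gated_update(
    library: Sequence[Skill],
    candidates: Sequence[Skill],
    probes: Sequence[str],
    score: Scorer,
) -> List[Skill]:
    """Keep a candidate skill only if it strictly improves the probe score."""
    kept = list(library)
    best = probe_score(kept, probes, score)
    for skill in candidates:
        trial = kept + [skill]
        trial_score = probe_score(trial, probes, score)
        if trial_score > best:  # the gate: no measured improvement, no admission
            kept, best = trial, trial_score
    return kept


def toy_score(library: Sequence[Skill], probe: str) -> float:
    """Toy scorer: a probe is solved only if a skill with the same name is present."""
    return 1.0 if probe in library else 0.0


probes = ["check_units", "cite_source"]
candidates = ["check_units", "be_verbose", "cite_source"]
library = gated_update([], candidates, probes, toy_score)
print(library)  # ['check_units', 'cite_source']
```

In this toy run, `be_verbose` is discarded because adding it leaves the probe score unchanged, while the other two candidates each raise it.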

This repository is two things:

  1. A reusable method + framework (grasp/) — apply GRASP to your own agent and tasks, and benchmark your own self-improvement method against GRASP and five baselines through a small plug-in interface.
  2. The full paper artifact — four benchmark families (benchmarks/) and all released results behind the paper (results/).

Install

pip install -e .          # core depends only on PyYAML

Quickstart (no Docker, no server)

Watch GRASP learn skills on a laptop in minutes, on a self-contained slice of MedAgentBench's read-only FHIR lookup tasks served by an in-process mock:

# point the 'local' backend at any OpenAI-compatible endpoint
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"
export GRASP_MODEL="your-model-name"

python -m examples.quickstart.run --agent local

It writes a val-accuracy learning curve and the learned skill library under examples/quickstart/runs/. See examples/quickstart/.
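Before launching the quickstart, it can help to confirm that the three variables exported above are actually set. The following is a small preflight sketch; the `missing_vars` helper is hypothetical and not part of the package, and treating all three variables as required is an assumption based on the export lines shown here.

```python
import os
from typing import List, Mapping

# Variable names taken from the export lines above.
REQUIRED_VARS = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "GRASP_MODEL")


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "OPENAI_BASE_URL": "http://localhost:8000/v1",
    "OPENAI_API_KEY": "EMPTY",
}
print(missing_vars(example_env))  # ['GRASP_MODEL']
```

Calling `missing_vars()` with no argument checks the current process environment.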

Use GRASP on your own agent

Implement a Task (how to sample, run, and score your environment) and run GRASP on it:

from grasp import run_grasp

# MyTask is your own Task implementation (see the bullets below)
run_grasp(MyTask(), "config.yaml", agent="local")

  • Task: samples(), rollout(sample, agent), evaluate(sample, output), plus optional failure_tags / protocol_hook / updater_* hooks.
  • Method: GRASP is the reference Method; subclass it to benchmark your own self-improvement method on the same tasks.
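To make the three required Task methods concrete, here is a toy task that follows the method names listed above. It is a sketch under assumptions: this README does not state whether a Task must subclass a base class from grasp, what types the methods exchange, or how an agent is invoked, so the plain class, the dict-shaped samples, the boolean score, and the callable `stub_agent` are all illustrative choices. docs/add_a_task.md is the authority on the real interface.

```python
from typing import Callable, Dict, List

# Assumption: an agent is modelled here as a callable from prompt text to reply text.
Agent = Callable[[str], str]


class ArithmeticTask:
    """Toy task exposing samples(), rollout(sample, agent), evaluate(sample, output)."""

    def samples(self) -> List[Dict[str, object]]:
        # Each sample carries the input and the reference answer.
        return [
            {"question": "2 + 3", "answer": 5},
            {"question": "7 * 6", "answer": 42},
        ]

    def rollout(self, sample: Dict[str, object], agent: Agent) -> str:
        # Run the agent once on the sample and return its raw output.
        return agent(f"Compute: {sample['question']}")

    def evaluate(self, sample: Dict[str, object], output: str) -> bool:
        # Score the output against the reference; unparseable output counts as wrong.
        try:
            return int(output.strip()) == sample["answer"]
        except ValueError:
            return False


def stub_agent(prompt: str) -> str:
    """Stand-in for a model call; it only knows how to add."""
    expr = prompt.split(": ", 1)[1]
    left, op, right = expr.split()
    return str(int(left) + int(right)) if op == "+" else "unknown"


task = ArithmeticTask()
results = [task.evaluate(s, task.rollout(s, stub_agent)) for s in task.samples()]
print(results)  # [True, False]
```

The failed multiplication sample is the kind of trace a self-improvement method would learn from.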
| Read this | For |
|---|---|
| docs/method.md | how GRASP works: the loop and the regression gate |
| docs/add_a_task.md | plug in your own environment |
| docs/add_a_method.md | benchmark your own method vs. GRASP + 5 baselines |

Benchmarks (the paper artifact)

Each benchmark is self-contained under benchmarks/, with its own README for environment setup (conda, Docker, data) and a run_all.sh <backend> [run_name] helper.

| Directory | Benchmark | Role in paper | Setup |
|---|---|---|---|
| benchmarks/MedAgentBench/ | FHIR reads/writes against a live FHIR server | primary (clinical) | Docker |
| benchmarks/MedAgentBench-v2/ | Harder FHIR tasks: multi-step decisions, coordinated writes | primary (clinical) | Docker |
| benchmarks/FHIR-AgentBench/ | Structured clinical QA / tool use on an independent FHIR store | supporting (clinical) | GCP Healthcare API |
| benchmarks/AgentBench/ | Four non-clinical environments: OS, DBBench, WebShop, ALFWorld | supporting (generality) | Docker |

The paper compares GRASP against a no-skills baseline and five self-improvement methods, all implemented in each benchmark directory: grasp (GRASP, ours), memory_cycle (Sequential memory), batch_memory_cycle (Batch memory), expel_cycle (ExpeL), evo_memory_cycle (Evo-MedAgent), skillx_cycle (SkillX).

The executing agent and the skill-writer use the same model. Six backends are selectable at run time: gptoss, deepseek, gemini, gpt5, gpt4, and local (a generic OpenAI-compatible endpoint). No secrets are stored in the repository; presets read endpoints and keys from environment variables. See each benchmark's configs/agents/README.md.

Released results

All numbers behind the paper live under results/ — per-seed validation, test, and OOD accuracies for every cell of Tables 1–5, the learned skill libraries, the frozen transfer libraries, and the run configurations. Reproduce the headline tables directly:

python results/reproduce_tables.py                 # Table 1 (all models) + Table 5
python results/reproduce_tables.py gpt-oss-120b     # one model

See results/README.md for the full directory↔cell map.


License

MIT (see LICENSE) for the GRASP core, examples, and docs. Vendored benchmark code under benchmarks/AgentBench/ and benchmarks/FHIR-AgentBench/ retains its own upstream license.

Citation

See CITATION.cff.
