Training-free defense for multi-turn safety risks in tool-using AI agents
Quickstart | Pre-generated | Generate Your Own | Extend | Benchmark | Citation
ToolShield is a training-free, tool-agnostic defense for AI agents that use MCP tools. It lets the agent explore tool functionality in a sandbox, learn from its own executions, and distill safety guidelines before deployment. It reduces attack success rate by 30% on average, with zero fine-tuning.
Quickstart
pip install toolshield
Use Pre-generated Experiences
We ship safety experiences for 6 models across 5 tools, with plug-and-play support for 5 coding agents. Inject them in one command:
# For Claude Code
toolshield import \
--exp-file experiences/claude-sonnet-4.5/experience_list_claude-sonnet-4.5_postgres.json \
--agent claude_code
# For Codex
toolshield import \
--exp-file experiences/claude-sonnet-4.5/experience_list_claude-sonnet-4.5_postgres.json \
--agent codex
# For OpenClaw
toolshield import \
--exp-file experiences/claude-sonnet-4.5/experience_list_claude-sonnet-4.5_postgres.json \
--agent openclaw
# For Cursor (writes to global user rules via SQLite)
toolshield import \
--exp-file experiences/claude-sonnet-4.5/experience_list_claude-sonnet-4.5_postgres.json \
--agent cursor
# For OpenHands (creates a microagent)
toolshield import \
--exp-file experiences/claude-sonnet-4.5/experience_list_claude-sonnet-4.5_postgres.json \
--agent openhands
This appends safety guidelines to your agent's context file (~/.claude/CLAUDE.md, ~/.codex/AGENTS.md, ~/.openclaw/workspace/AGENTS.md, Cursor's global user rules, or ~/.openhands/microagents/toolshield.md). To remove them:
toolshield unload --agent claude_code
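If you use several agents, the import commands above differ only in the `--agent` value, so they can be generated in a loop. The following is a minimal sketch, not part of ToolShield itself: the helper names `import_commands` and `run_all` are hypothetical, and only the `toolshield import`, `--exp-file`, and `--agent` arguments shown above are used. It defaults to a dry run that prints each command.

```python
import shlex
import subprocess

# Experience file and agent names copied from the commands above.
EXP_FILE = "experiences/claude-sonnet-4.5/experience_list_claude-sonnet-4.5_postgres.json"
AGENTS = ["claude_code", "codex", "openclaw", "cursor", "openhands"]


def import_commands(exp_file, agents):
    """Build one `toolshield import` argument list per agent (hypothetical helper)."""
    return [
        ["toolshield", "import", "--exp-file", exp_file, "--agent", agent]
        for agent in agents
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(import_commands(EXP_FILE, AGENTS))
```

Pass `dry_run=False` to `run_all` once the printed commands look right.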
Available experiences in experiences/:
| Model | Filesystem | PostgreSQL | Terminal | Playwright | Notion |
|---|---|---|---|---|---|
| claude-sonnet-4.5 | ✅ | ✅ | ✅ | ✅ | ✅ |
| gpt-5.2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| deepseek-v3.2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| gemini-3-flash-preview | ✅ | ✅ | ✅ | ✅ | ✅ |
| qwen3-coder-plus | ✅ | ✅ | ✅ | ✅ | ✅ |
| seed-1.6 | ✅ | ✅ | ✅ | ✅ | ✅ |
More plug-and-play experiences for additional tools coming soon (including Toolathlon support)! Have a tool you'd like covered? Open an issue.
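The one experience file named in this README follows the pattern `experiences/<model>/experience_list_<model>_<tool>.json`. The sketch below builds a path from that pattern. It is an assumption that every model and tool follows the same naming: only the `postgres` slug is confirmed here, so the other entries in `TOOLS` are guesses, and `experience_path` is a hypothetical helper. Check the `experiences/` directory for the actual file names.

```python
from pathlib import PurePosixPath

# Model names as listed in the table above.
MODELS = [
    "claude-sonnet-4.5",
    "gpt-5.2",
    "deepseek-v3.2",
    "gemini-3-flash-preview",
    "qwen3-coder-plus",
    "seed-1.6",
]

# Only "postgres" appears in this README; the other slugs are assumptions.
TOOLS = ["filesystem", "postgres", "terminal", "playwright", "notion"]


def experience_path(model, tool, root="experiences"):
    """Return the presumed path of the experience file for a model/tool pair."""
    return str(PurePosixPath(root) / model / f"experience_list_{model}_{tool}.json")


if __name__ == "__main__":
    print(experience_path("claude-sonnet-4.5", "postgres"))
```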
Generate Your Own
Point ToolShield at any running MCP server to generate custom safety experiences:
export TOOLSHIELD_MODEL_NAME="anthropic/claude-sonnet-4.5"
export OPENROUTER_API_KEY="your-key"
# Full pipeline: inspect → generate safety tree → test → distill → inject
toolshield \
--mcp_name postgres \
--mcp_server http://localhost:9091 \
--output_path output/postgres \
--agent codex
Or generate without injecting (useful for review):
toolshield generate \
--mcp_name postgres \
--mcp_server http://localhost:9091 \
--output_path output/postgres
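To review generated output before injecting it, a small script can list the JSON files under the output directory and report their top-level shape. This is a sketch under stated assumptions: this README does not document the layout of `output/postgres` or the schema of the experience files, so the code makes no schema assumptions, and `summarize_json` and `review` are hypothetical helper names.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return (top-level type name, number of entries) for a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    size = len(data) if isinstance(data, (dict, list)) else 1
    return type(data).__name__, size


def review(output_dir):
    """Print a one-line summary for every JSON file under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        print(f"{output_dir}: no such directory")
        return
    for path in sorted(root.rglob("*.json")):
        kind, size = summarize_json(path)
        print(f"{path}: {kind} with {size} entries")


if __name__ == "__main__":
    review("output/postgres")
```

Once the content looks right, inject it with `toolshield import --exp-file <path> --agent <agent>` as shown earlier.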
Extend to New Tools
ToolShield works with any MCP server that has an SSE endpoint:
toolshield generate \
--mcp_name your_custom_tool \
--mcp_server http://localhost:PORT \
--output_path output/your_custom_tool
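Before pointing ToolShield at a new server, it can help to confirm that the URL serves Server-Sent Events, which use the `text/event-stream` content type. The sketch below is a generic HTTP check, not a ToolShield feature. It assumes the SSE stream answers a plain GET at the URL you pass in; the exact endpoint path depends on your MCP server. `is_event_stream` and `probe_sse` are hypothetical helper names.

```python
from urllib.request import Request, urlopen


def is_event_stream(content_type):
    """True if a Content-Type header value denotes an SSE stream."""
    return content_type.split(";")[0].strip().lower() == "text/event-stream"


def probe_sse(url, timeout=5.0):
    """Open the URL and report whether the response is an SSE stream.

    Returns False on connection errors or HTTP error statuses.
    """
    request = Request(url, headers={"Accept": "text/event-stream"})
    try:
        # urlopen returns once headers arrive; the stream is closed on exit.
        with urlopen(request, timeout=timeout) as response:
            return is_event_stream(response.headers.get("Content-Type", ""))
    except OSError:
        return False
```

For example, `probe_sse("http://localhost:9091")` checks the server address used in the commands above; append your server's SSE path if it is not served at the root.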
MT-AgentRisk Benchmark
We also release MT-AgentRisk, a benchmark of 365 harmful tasks across 5 MCP tools, transformed into multi-turn attack sequences. See agentrisk/README.md for full evaluation setup.
Quick evaluation:
# 1. Download benchmark tasks
git clone https://huggingface.co/datasets/CHATS-Lab/MT-AgentRisk
cp -r MT-AgentRisk/workspaces/* workspaces/
# 2. Run a single task (requires OpenHands setup; see agentrisk/README.md)
python agentrisk/run_eval.py \
--task-path workspaces/terminal/multi_turn_tasks/multi-turn_root-remove \
--agent-llm-config agent \
--env-llm-config env \
--outputs-path output/eval \
--server-hostname localhost
Add --use-experience <path> to evaluate with ToolShield defense.
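To compare a task with and without the defense, the two invocations can be built from the same arguments. This is a minimal sketch: `eval_command` is a hypothetical helper, it uses only the flags shown above, and `EXPERIENCE` is a placeholder path that you must replace with a real experience file.

```python
import shlex

TASK = "workspaces/terminal/multi_turn_tasks/multi-turn_root-remove"
EXPERIENCE = "path/to/experience.json"  # placeholder, not a real file


def eval_command(task_path, experience=None, outputs_path="output/eval"):
    """Build the run_eval.py argument list, optionally with --use-experience."""
    cmd = [
        "python", "agentrisk/run_eval.py",
        "--task-path", task_path,
        "--agent-llm-config", "agent",
        "--env-llm-config", "env",
        "--outputs-path", outputs_path,
        "--server-hostname", "localhost",
    ]
    if experience is not None:
        cmd += ["--use-experience", experience]
    return cmd


if __name__ == "__main__":
    # Baseline run, then the same task with the ToolShield defense.
    print(shlex.join(eval_command(TASK)))
    print(shlex.join(eval_command(TASK, experience=EXPERIENCE)))
```

The script only prints the commands; run them yourself or pass each list to `subprocess.run`. Writing the two runs to different `outputs_path` values keeps their results separate.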
Repository Layout
ToolShield/
├── toolshield/    # pip-installable defense package
├── agentrisk/     # evaluation framework (see agentrisk/README.md)
├── experiences/   # pre-generated safety experiences (6 models × 5 tools)
├── workspaces/    # MT-AgentRisk task data (from HuggingFace)
├── docker/        # Dockerfiles and compose
└── scripts/       # experiment reproduction guides
Acknowledgments
We thank the authors of the following projects for their contributions:
Citation
License
MIT
File details
Details for the file toolshield-0.1.0.tar.gz.
File metadata
- Download URL: toolshield-0.1.0.tar.gz
- Upload date:
- Size: 3.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 271946318233bfdedda0eb73faa00c4b7f118696ec0ca56450d7d9ef84fb22ce |
| MD5 | de059bc256c5f944e281d2501ef2e4aa |
| BLAKE2b-256 | db4558b5abb3ae764ccad0aa5076bc30145062209140642ed923323e1357fb94 |
File details
Details for the file toolshield-0.1.0-py3-none-any.whl.
File metadata
- Download URL: toolshield-0.1.0-py3-none-any.whl
- Upload date:
- Size: 57.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 50bf760537ac95ea4609460a6438a2e8c43593516c36bb66a6e43393e06f1798 |
| MD5 | d671b972451baa7ab3182b757b9712fa |
| BLAKE2b-256 | c94543b99fc7ed47fadb1dac53d7b7a19e003917af6fb5e0b25c2eed5e5d89f6 |