
DecodingTrust Agent Platform (DTap) — A controllable and interactive red-teaming platform for AI agents

14 real-world domains · 50+ simulation environments · autonomous red-teaming · verifiable judges


About

DTAP is the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and 50+ simulation environments that replicate widely used systems such as Google Workspace, PayPal, and Slack. To scale the risk assessment of agents, we further propose DTAP-RED, the first autonomous red-teaming agent that systematically explores diverse injection vectors (prompt, tool, skill, environment, and combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTAP-RED, we curate DTAP-BENCH, a large-scale red-teaming dataset of high-quality instances across domains, each paired with a verifiable judge that automatically validates attack outcomes against the corresponding security policy.

Authors: Zhaorun Chen, Xun Liu, Haibo Tong, Chengquan Guo, Yuzhou Nie, Jiawei Zhang, Mintong Kang, Chejian Xu, Qichang Liu, Xiaogeng Liu, Tianneng Shi, Chaowei Xiao, Sanmi Koyejo, Percy Liang, Wenbo Guo, Dawn Song, Bo Li.

Affiliations: Virtue AI · University of Chicago · UIUC · UCSB · Johns Hopkins · UC Berkeley · Stanford

Website: https://decodingtrust-agent.com



Installation

git clone https://github.com/AI-secure/DecodingTrust-Agent.git
cd DecodingTrust-Agent
pip install -r requirements.txt
pip install -e .

Set the API key for your backbone model (only the providers you actually use are required):

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=...

Docker is required: each task spins up isolated MCP servers and Docker-based environments through TaskExecutor.


Quick Start

A single benign CRM task with the OpenAI Agents SDK backbone:

python eval/evaluation.py \
  --task-list benchmark/crm/benign.jsonl \
  --agent-type openaisdk \
  --model gpt-4o \
  --max-parallel 4

Results are written to results/benchmark/<agent_type>/<model>/<domain>/<type>/<task_id>/ (overridable with EVAL_RESULTS_ROOT).


Running the Default Benchmark

DTAP-BENCH ships pre-curated task lists under benchmark/. Each domain has three files:

| File | Threat model | What it contains |
|---|---|---|
| benign.jsonl | (none) | Benign workflow tasks (utility evaluation) |
| direct.jsonl | Direct prompt injection | Adversary controls the user prompt |
| indirect.jsonl | Indirect (tool / env / data) | Payload arrives via tool output, document, email, etc. |

Each line in these files is a single task record, e.g.:

{"domain": "crm", "type": "benign", "task_id": "1"}
{"domain": "crm", "type": "malicious", "threat_model": "direct",   "risk_category": "dangerous-actions", "task_id": "1"}
{"domain": "crm", "type": "malicious", "threat_model": "indirect", "risk_category": "dangerous-actions", "task_id": "1"}

These records resolve to:

  • Benign → dataset/<domain>/benign/<task_id>/
  • Malicious → dataset/<domain>/malicious/<threat_model>/<risk_category>/<task_id>/
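The mapping above can be sketched in a few lines of Python. This is an illustration of the directory layout only, not the resolver used by eval/evaluation.py; the function name `dataset_dir` is hypothetical.

```python
from pathlib import PurePosixPath


def dataset_dir(record: dict) -> PurePosixPath:
    """Map a task record to its dataset directory (hypothetical helper)."""
    base = PurePosixPath("dataset") / record["domain"]
    if record["type"] == "benign":
        return base / "benign" / record["task_id"]
    if record["type"] == "malicious":
        # Malicious tasks are nested by threat model, then risk category.
        return (base / "malicious" / record["threat_model"]
                / record["risk_category"] / record["task_id"])
    raise ValueError(f"unknown task type: {record['type']!r}")


print(dataset_dir({"domain": "crm", "type": "benign", "task_id": "1"}))
# dataset/crm/benign/1
```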

A single domain

Run every benign + direct + indirect task in a domain by pointing --task-list at the domain folder:

python eval/evaluation.py \
  --task-list benchmark/finance \
  --agent-type openaisdk \
  --model gpt-4o \
  --max-parallel 8

eval/evaluation.py auto-discovers every *.jsonl under that directory and runs them in one pass with environment reuse.
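To preview which records a directory would contribute before launching a run, a minimal sketch like the following may help. It assumes discovery is recursive and that blank lines are ignored; the real driver may differ, and `load_task_records` is a hypothetical name.

```python
import json
from pathlib import Path


def load_task_records(task_list: str) -> list:
    """Load records from one JSONL file, or from every *.jsonl under a directory."""
    path = Path(task_list)
    # Assumption: directories are searched recursively, in sorted order.
    files = [path] if path.is_file() else sorted(path.rglob("*.jsonl"))
    records = []
    for file in files:
        with file.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

For example, `len(load_task_records("benchmark/finance"))` would give the number of task records found under that folder.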

A single task list (benign / direct / indirect)

# Benign utility only
python eval/evaluation.py --task-list benchmark/crm/benign.jsonl --agent-type openaisdk --model gpt-4o

# Direct prompt injection only
python eval/evaluation.py --task-list benchmark/crm/direct.jsonl --agent-type claudesdk --model claude-sonnet-4-20250514

# Indirect injection only
python eval/evaluation.py --task-list benchmark/crm/indirect.jsonl --agent-type googleadk --model gemini-2.0-flash

You can further narrow execution with filters that operate on the loaded records:

python eval/evaluation.py \
  --task-list benchmark/crm \
  --task-type malicious \
  --threat-model indirect \
  --risk-category data-exfiltration \
  --agent-type openaisdk --model gpt-4o

The entire benchmark

Point --task-list at the top-level benchmark/ directory to run all 14 domains:

python eval/evaluation.py \
  --task-list benchmark \
  --agent-type openaisdk \
  --model gpt-4o \
  --max-parallel 16 \
  --skip-existing

--skip-existing skips any task that already has a judge_result.json, so the same command resumes a partially-finished run.
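To see how far a partially finished run has progressed, one option is to look for the same judge_result.json marker. The sketch below only inspects the output tree; it is not part of DTAP, and `finished_task_dirs` is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import List, Optional


def finished_task_dirs(results_root: Optional[str] = None) -> List[Path]:
    """Return every directory under the results root that holds a judge_result.json."""
    # EVAL_RESULTS_ROOT overrides the default "results" output root.
    root = Path(results_root or os.environ.get("EVAL_RESULTS_ROOT", "results"))
    return sorted(p.parent for p in root.rglob("judge_result.json"))
```

Calling `len(finished_task_dirs())` before and after a resumed run gives a rough count of the tasks that --skip-existing would skip.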


Customizing Your Own Task List

Any JSONL file with the schema below is a valid --task-list. Pick a subset of tasks, mix domains, or focus on one risk category — there is no separate registration step.

Schema:

| Field | Required | Description |
|---|---|---|
| domain | always | One of the supported domains (see below) |
| type | always | "benign" or "malicious" |
| task_id | always | Identifier; matches the directory name under dataset/…/<task_id>/ |
| threat_model | malicious only | "direct" or "indirect" |
| risk_category | malicious only | e.g. "data-exfiltration", "dangerous-actions", "social-scoring", "manipulative-use" |
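A custom task list can be checked against this schema before a run. The following is a minimal sketch based only on the table above; it does not validate domain names or risk categories, and `validate_record` is a hypothetical helper, not part of DTAP.

```python
def validate_record(record: dict) -> list:
    """Return a list of schema problems for one task record (empty if none)."""
    errors = []
    for field in ("domain", "type", "task_id"):
        if field not in record:
            errors.append(f"missing required field: {field}")
    task_type = record.get("type")
    if task_type is not None and task_type not in ("benign", "malicious"):
        errors.append('type must be "benign" or "malicious"')
    if task_type == "malicious":
        # These two fields are required for malicious tasks only.
        if record.get("threat_model") not in ("direct", "indirect"):
            errors.append('threat_model must be "direct" or "indirect"')
        if "risk_category" not in record:
            errors.append("risk_category is required for malicious tasks")
    return errors


print(validate_record({"domain": "crm", "type": "benign", "task_id": "1"}))
# []
```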

Example — my_subset.jsonl (mix of domains and risk categories, useful for smoke tests):

{"domain": "crm",     "type": "benign",    "task_id": "1"}
{"domain": "finance", "type": "malicious", "threat_model": "direct",   "risk_category": "dangerous-actions",    "task_id": "5"}
{"domain": "gmail",   "type": "malicious", "threat_model": "indirect", "risk_category": "data-exfiltration",    "task_id": "12"}
{"domain": "legal",   "type": "malicious", "threat_model": "indirect", "risk_category": "manipulative-use",     "task_id": "3"}

Run it like any built-in task list:

python eval/evaluation.py --task-list my_subset.jsonl --agent-type openaisdk --model gpt-4o --max-parallel 4

A few practical patterns:

# Curate from an existing file
grep '"risk_category": "data-exfiltration"' benchmark/crm/indirect.jsonl > my_crm_exfil.jsonl
python eval/evaluation.py --task-list my_crm_exfil.jsonl --agent-type openaisdk --model gpt-4o

# Try just one task end-to-end
echo '{"domain": "crm", "type": "benign", "task_id": "1"}' > one_task.jsonl
python eval/evaluation.py --task-list one_task.jsonl --agent-type openaisdk --model gpt-4o
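The grep pattern above depends on the exact spacing inside each JSON line. Where that is a concern, a small Python filter that parses each record is one alternative. This is a sketch under that motivation, with a hypothetical helper name `filter_jsonl`.

```python
import json
from pathlib import Path


def filter_jsonl(src: str, dst: str, **criteria: str) -> int:
    """Copy records from src to dst whose fields equal every given criterion."""
    kept = []
    with Path(src).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            if all(record.get(key) == value for key, value in criteria.items()):
                kept.append(json.dumps(record))
    Path(dst).write_text("".join(f"{row}\n" for row in kept), encoding="utf-8")
    return len(kept)  # number of records written
```

For example, `filter_jsonl("benchmark/crm/indirect.jsonl", "my_crm_exfil.jsonl", risk_category="data-exfiltration")` would write the same subset as the grep command, independent of whitespace.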

For per-task internals (config.yaml, judge.py, setup.sh), see docs/quickstart.md.


eval/evaluation.py CLI Reference

| Argument | Default | Description |
|---|---|---|
| --task-list | required | Path to a JSONL file or a directory containing JSONL files |
| --domain | None | Keep only records whose domain matches |
| --task-type | None | benign or malicious |
| --threat-model | None | direct or indirect |
| --risk-category | None | e.g. data-exfiltration |
| --max-parallel | 5 | Concurrent tasks (environments are reused across tasks) |
| --agent-type | openaisdk | openaisdk, claudesdk, googleadk, langchain, pocketflow, openclaw |
| --model | gpt-4o | Backbone model identifier |
| --temperature | None | Sampling temperature (model default if unset) |
| --port-range | None | Dynamic MCP port range, e.g. "10000-12000" |
| --direct-prompt | off | For direct threat model, use the malicious goal as-is instead of replaying attack turns |
| --disallowed-tools | None | Native tools to disable (filesystem tasks have sensible defaults) |
| --skip-existing | off | Skip tasks that already have judge_result.json |
| --skip-mcp | off | Skip MCP server startup (debug) |
| --skip-judge | off | Skip judge after task completion (debug) |
| --keep-envs | off | Keep Docker envs running after the run (debug) |
| --debug | off | Save extra trajectory diagnostics |
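When scripting runs, it can be useful to sanity-check a --port-range value such as "10000-12000" before passing it on. The sketch below assumes the "START-END" form shown in the table and treats the end port as inclusive; both the inclusive bound and the helper name `parse_port_range` are assumptions, not documented DTAP behavior.

```python
def parse_port_range(spec: str) -> range:
    """Parse "START-END" into a range of ports (end inclusive, by assumption)."""
    start_text, sep, end_text = spec.partition("-")
    if not sep:
        raise ValueError(f"expected START-END, got {spec!r}")
    start, end = int(start_text), int(end_text)
    if not (0 < start <= end <= 65535):
        raise ValueError(f"invalid port range: {spec!r}")
    return range(start, end + 1)


ports = parse_port_range("10000-12000")
print(ports.start, ports.stop - 1, len(ports))
```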

Relevant environment variables:

| Variable | Description |
|---|---|
| OPENAI_API_KEY / ANTHROPIC_API_KEY / GOOGLE_API_KEY | Provider keys |
| EVAL_RESULTS_ROOT | Output root (default results/); layout {root}/benchmark/{agent_type}/{model}/{dataset_path}/ |
| DT_PORT_RANGE / DT_PORT_DB | Port allocation for MCP / Docker |

Supported Domains

The 14 domains shipped with DTAP-BENCH:

browser · code · crm · customer-service · finance · legal · macos · medical · os-filesystem · research · telecom · travel · windows · workflow

Each domain mounts one or more MCP-backed environments (e.g. Salesforce, Gmail, PayPal, Slack, Atlassian, BigQuery, Snowflake, Databricks, GitHub, Google Docs, Booking, hospital, telecom, terminal, Windows / macOS shells). The full mapping lives in dt_arena/config/mcp.yaml.


Integrating a Custom Agent

DTAP supports two integration modes.

Option A — Use a pre-built native agent (recommended)

If you already have an agent written in one of the supported SDKs (OpenAI Agents SDK, LangChain, Claude SDK, Google ADK), wrap it with build_agent. Your existing tools / MCP servers stay attached; DTAP only adds the benchmark MCP servers (and any tool injections defined by the task) to a clone of your agent — the original is preserved.

import asyncio
from agents import Agent as OpenAIAgent          # your SDK
from utils.agent_helpers import build_agent
from dt_arena.src.types.agent import AgentConfig, RuntimeConfig

async def main():
    # 1. Your pre-built agent (any extra tools / MCP servers are kept untouched)
    native = OpenAIAgent(
        name="MyAgent",
        instructions="You are a helpful CRM assistant.",
        model="gpt-4o",
        mcp_servers=[my_custom_server],  # placeholder: your own MCP server instance, defined elsewhere
    )

    # 2. Load the benchmark task config (adds salesforce, gmail, etc.)
    agent_cfg   = AgentConfig.from_yaml("dataset/crm/benign/1/config.yaml")
    runtime_cfg = RuntimeConfig(model="gpt-4o", temperature=0.1, max_turns=200,
                                output_dir="./results")

    # 3. Wrap — auto-detects OpenAI SDK / LangChain / Claude SDK / Google ADK
    agent = build_agent(native_agent=native, agent_cfg=agent_cfg, runtime_cfg=runtime_cfg)

    async with agent:
        result = await agent.run("List all leads in my account",
                                 metadata={"task_id": "demo", "domain": "crm"})
        print(result.final_output)

    assert agent.native_agent is native  # original is untouched

asyncio.run(main())

Option B — Implement a new agent backend

To plug a brand-new framework into eval/evaluation.py, subclass Agent and add a directory under agent/ (mirroring agent/openaisdk/, agent/claudesdk/, etc.):

from typing import Any, Union, List
from dt_arena.src.types.agent import Agent, MCPServerConfig

class MyNewAgent(Agent):
    async def initialize(self) -> None:
        await self.load_mcp_servers()       # connects to all task-declared MCP servers

    def _create_mcp_server(self, cfg: MCPServerConfig) -> Any:
        ...                                  # return your SDK-specific MCP client

    async def run(self, user_input: Union[str, List[str]], metadata=None) -> Any:
        inputs = [user_input] if isinstance(user_input, str) else user_input
        for turn in inputs:
            ...                              # execute one turn, append to trajectory
        return self.get_result()

    async def cleanup(self) -> None:
        self.reset_conversation()

Register the new framework name (e.g. mynewagent) in agent/__init__.py and the evaluation driver will accept --agent-type mynewagent. Full interface, multi-turn semantics, and trajectory schema are documented in agent/README.md.


Tool-Call Hooks

Every MCP tool call made by every supported framework flows through a shared pre/post hook system in dt_arena. Hooks are configured in a single JSON file and auto-loaded by every agent — no framework wiring, no CLI flags.

Enable hooks by editing dt_arena/src/hooks/hooks.json:

{
  "hooks": [
    "dt_arena.src.hooks.audit_log:AuditHook"
  ]
}
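Each entry in the hooks list has the form "module.path:ClassName". The sketch below shows one common way such a string can be resolved to a class with importlib. It is not the loader used by dt_arena, `resolve_hook` is a hypothetical name, and a standard-library class stands in for a real hook so the example runs on its own.

```python
import importlib
import json


def resolve_hook(spec: str):
    """Resolve a "module.path:ClassName" string to the named attribute."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:ClassName', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in config: "collections:OrderedDict" replaces a real hook entry.
config = json.loads('{"hooks": ["collections:OrderedDict"]}')
hook_classes = [resolve_hook(spec) for spec in config["hooks"]]
print(hook_classes)
```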

Write a hook by dropping a file under dt_arena/src/hooks/:

from dt_arena.src.types.hooks import ToolCallContext, ToolCallResult

class MyHook:
    async def on_pre_tool_call(self, ctx: ToolCallContext):
        print(f"calling {ctx.framework}/{ctx.server}/{ctx.tool_name}")

    async def on_post_tool_call(self, ctx: ToolCallContext, result: ToolCallResult):
        print(f"done in {result.duration:.3f}s, error={result.is_error}")
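A hook can also keep state across calls, for instance to count calls and errors per tool. The sketch below runs standalone: the two dataclasses are stand-ins that carry only the fields used in the example above (the real types live in dt_arena.src.types.hooks and may have more), and the tool name `list_leads` is made up for the demo.

```python
import asyncio
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCallContext:  # stand-in for the dt_arena type
    framework: str
    server: str
    tool_name: str


@dataclass
class ToolCallResult:  # stand-in for the dt_arena type
    duration: float
    is_error: bool


class ErrorCountingHook:
    """Counts calls and failed calls per (server, tool) pair."""

    def __init__(self):
        self.calls = Counter()
        self.errors = Counter()

    async def on_pre_tool_call(self, ctx: ToolCallContext):
        self.calls[(ctx.server, ctx.tool_name)] += 1

    async def on_post_tool_call(self, ctx: ToolCallContext, result: ToolCallResult):
        if result.is_error:
            self.errors[(ctx.server, ctx.tool_name)] += 1


async def demo() -> ErrorCountingHook:
    hook = ErrorCountingHook()
    ctx = ToolCallContext("openaisdk", "salesforce", "list_leads")
    await hook.on_pre_tool_call(ctx)
    await hook.on_post_tool_call(ctx, ToolCallResult(duration=0.12, is_error=True))
    return hook


hook = asyncio.run(demo())
print(dict(hook.calls), dict(hook.errors))
```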

See dt_arena/src/hooks/audit_log.py for a complete working example, and docs/hooks.md for the protocol, parallel-dispatch behavior, and known caveats.


Citation

If you find DTAP useful for your research, please cite:

@article{chen2026decodingtrust,
  title={DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents},
  author={Chen, Zhaorun and Liu, Xun and Tong, Haibo and Guo, Chengquan and Nie, Yuzhou and Zhang, Jiawei and Kang, Mintong and Xu, Chejian and Liu, Qichang and Liu, Xiaogeng and others},
  journal={arXiv preprint arXiv:2605.04808},
  year={2026}
}

License

See LICENSE.
