Safety drift prediction for AI agents — the first open-source implementation of the SafetyDrift framework (arXiv:2603.27148)
Project description
SafetyDrift 🛡️
The first open-source implementation of the SafetyDrift framework.
"When an agent reads a confidential file, writes a summary, then emails it externally — no single step is unsafe, but the sequence is a data leak." — SafetyDrift, arXiv:2603.27148 (March 2026)
DriftGuard predicts when individually safe AI agent actions are about to compound into a safety violation, and intervenes before it happens. It models agent safety trajectories as absorbing Markov chains — giving you a P(violation within N steps) score after every tool call.
The problem it solves
Every major AI agent framework (LangChain, AutoGen, CrewAI, Claude Code) trusts the agent. They check output content. They don't check accumulated authority or trajectory risk.
| Situation | Traditional guardrails | DriftGuard |
|---|---|---|
| Agent reads a secret | ✅ Allowed | ✅ Allowed (P=low) |
| Agent reads a secret, then opens customer CSV | ✅ Allowed | ⚠️ Warns (P rising) |
| Agent reads a secret, opens CSV, sends email | ✅ Allowed | 🚫 Blocked (P=87%) |
The SafetyDrift paper found: in communication-capable agents, reaching even a mild risk state gives an 85% probability of a safety violation within 5 steps. DriftGuard makes that prediction in real time, before the violation occurs.
Quick start
pip install driftguard
from driftguard import Session, InterventionAction
session = Session(task_type="default") # or: communication, technical, autonomous
# Call before EVERY tool execution
result = session.gate("read_file", {"path": "/workspace/customer_data.csv"})
if result.action == InterventionAction.BLOCK:
raise RuntimeError(f"DriftGuard blocked: {result.reason}")
elif result.action == InterventionAction.PAUSE:
approved = ask_human_for_approval(result.to_dict())
if not approved:
raise RuntimeError("Human rejected action")
# Safe to proceed
actually_read_file(...)
How it works
DriftGuard tracks three cumulative safety dimensions per session:
| Dimension | Description | Levels |
|---|---|---|
| Data Exposure | Sensitivity of data accessed | None → Public → Internal → Confidential → Sensitive |
| Tool Escalation | Capability level reached | None → Read → Write → Network → External |
| Reversibility | Can actions be undone? | Fully → Mostly → Mixed → Mostly Not → Irreversible |
State is monotonic: it only ever increases. After each tool call, DriftGuard:
- Classifies the call into risk dimensions
- Projects the cumulative state forward
- Runs Markov chain absorption analysis: P(violation within N steps)
- Applies the configured policy (WARN / PAUSE / BLOCK)
Markov chain model
From the SafetyDrift paper, safety violations follow absorbing Markov chain dynamics:
SAFE → LOW → MODERATE → HIGH → CRITICAL → [VIOLATION]
Every agent will eventually reach a violation if left unsupervised — the practical question is when, not if. DriftGuard computes the finite-horizon absorption probability:
P(violation | state, horizon) = [T^horizon][state, violation_state]
where T is a task-type-calibrated transition matrix.
Demo output
Step 1: web_search Risk: MODERATE P(viol): 31.1% ✅ WARN
Step 2: read_file (config) Risk: MODERATE P(viol): 31.5% ✅ WARN
Step 3: read_file (customers) Risk: MODERATE P(viol): 31.9% ✅ WARN
Step 4: write_file (summary) Risk: HIGH P(viol): 54.8% ⏸ PAUSE
Step 5: send_email Risk: CRITICAL P(viol): 86.8% 🚫 BLOCK
Each step looks safe in isolation. The sequence is a data leak. DriftGuard catches it at step 4 (pause) and blocks it at step 5.
Use as an MCP server
DriftGuard ships as a stdio MCP server. Any MCP-compatible agent (Claude Code, Cursor, GitHub Copilot) can call it directly.
Add to your mcp.json:
{
"driftguard": {
"command": "python",
"args": ["-m", "driftguard"]
}
}
Available MCP tools:
| Tool | Description |
|---|---|
dg_gate |
Evaluate a tool call. Returns action: ALLOW / WARN / PAUSE / BLOCK |
dg_session_state |
Get current cumulative risk state |
dg_summary |
Get session stats (blocks, pauses, step count) |
dg_reset |
Reset session for a new task |
Example system prompt addition for Claude Code:
Before executing any tool that reads files, makes network requests, or
writes data, call dg_gate with the tool name and arguments. If the result
action is BLOCK, do not proceed. If PAUSE, describe the action and ask the
user for approval before continuing.
Configuration
from driftguard import Session
from driftguard.policy import PolicyConfig, PolicyThreshold
from driftguard.types import InterventionAction
config = PolicyConfig(
horizon=5, # steps ahead to evaluate
task_type="communication", # higher baseline risk
thresholds=[
PolicyThreshold(0.90, InterventionAction.BLOCK, "Critical risk — blocked"),
PolicyThreshold(0.60, InterventionAction.PAUSE, "High risk — approval needed"),
PolicyThreshold(0.30, InterventionAction.WARN, "Elevated risk — warned"),
PolicyThreshold(0.00, InterventionAction.LOG_ONLY, "Safe — logged"),
],
always_block={
"send_mass_email",
"delete_production_database",
"wipe_all_data",
},
)
session = Session(config=config, task_type="communication")
Custom classifier rules
from driftguard import add_rule, ClassifierRule
from driftguard.types import DataExposure, ToolEscalation, Reversibility
# Add your own tool patterns
add_rule(ClassifierRule(
pattern=r"jira.*create.*ticket",
data_exposure=DataExposure.INTERNAL,
tool_escalation=ToolEscalation.EXTERNAL,
reversibility=Reversibility.MOSTLY,
description="Create Jira ticket — external but recoverable",
))
Task types
| Task type | Typical use | Baseline violation rate |
|---|---|---|
technical |
Code editing, local file ops | Very low (~1–5% per step) |
information |
Research, browsing, summarising | Low (~8–15%) |
default |
General-purpose agents | Medium (~8%) |
autonomous |
Multi-step autonomous tasks | Medium-high (~12%) |
communication |
Email, messaging, posting agents | High (~18%) |
Integration examples
LangChain
from driftguard import Session, InterventionAction
session = Session(task_type="default")
class GuardedTool(BaseTool):
def _run(self, *args, **kwargs):
result = session.gate(self.name, kwargs)
if result.action == InterventionAction.BLOCK:
raise ToolException(f"DriftGuard: {result.reason}")
return self._actual_run(*args, **kwargs)
OpenAI Agents SDK
from agents import function_tool
from driftguard import Session, InterventionAction
session = Session(task_type="autonomous")
def guarded(fn):
def wrapper(**kwargs):
r = session.gate(fn.__name__, kwargs)
if r.action == InterventionAction.BLOCK:
return f"[BLOCKED by DriftGuard: {r.reason}]"
return fn(**kwargs)
return function_tool(wrapper)
Background: the SafetyDrift paper
This library implements the framework from:
SafetyDrift: Predicting When AI Agents Cross the Line Before They Actually Do Aditya Dhodapkar, Farhaan Pishori (March 2026) arXiv:2603.27148
Key findings we implement:
- Safety state modelled as an absorbing Markov chain across 3 dimensions
- Every agent has absorption probability 1.0 — the question is when, not if
- Communication tasks: 85% violation probability within 5 steps from mild-risk state
- Technical tasks: below 5% from any state
- "Points of no return" are sharply task-dependent
Contributing
The highest-value contributions right now:
- Real trace data — If you have agent session traces (with ground truth on whether violations occurred), they can calibrate the transition matrices far better than our heuristic approximation
- Framework adapters — LangGraph, CrewAI, AutoGen, Google Genkit
- CI/CD integration — GitHub Actions workflow that gates agent PRs
See CONTRIBUTING.md for details.
License
MIT — use it in your agent pipelines, commercial or otherwise.
Inspired by and implementing SafetyDrift (arXiv:2603.27148). This project is not affiliated with the paper's authors.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file safetydrift-0.1.0.tar.gz.
File metadata
- Download URL: safetydrift-0.1.0.tar.gz
- Upload date:
- Size: 19.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
28f16df14ddd1a906d1b7acc06164d76144a05794ecfc68bac220996b8d02ac7
|
|
| MD5 |
b7f78ea6105ce476ab22d14023e46ff8
|
|
| BLAKE2b-256 |
b2ca204e5af9e3298ffb2adce24de590d57613827460e4d7a722f7c000cb274c
|
File details
Details for the file safetydrift-0.1.0-py3-none-any.whl.
File metadata
- Download URL: safetydrift-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2e92582e9cdf15aa79a5522269bdf86e5db34821a12b874d87b75c7413ac1fbf
|
|
| MD5 |
262f7a442e21a01098af5b2daa678264
|
|
| BLAKE2b-256 |
39c9261e985a1b83c21242832dbd88b858d0f2daaffc97b1af6081ce4a58bc47
|