Safety drift prediction for AI agents — the first open-source implementation of the SafetyDrift framework (arXiv:2603.27148)

SafetyDrift 🛡️

The first open-source implementation of the SafetyDrift framework.

"When an agent reads a confidential file, writes a summary, then emails it externally — no single step is unsafe, but the sequence is a data leak." — SafetyDrift, arXiv:2603.27148 (March 2026)

DriftGuard predicts when individually safe AI agent actions are about to compound into a safety violation, and intervenes before it happens. It models agent safety trajectories as absorbing Markov chains — giving you a P(violation within N steps) score after every tool call.


The problem it solves

Major AI agent frameworks and tools (LangChain, AutoGen, CrewAI, Claude Code) largely trust the agent. Their guardrails check the content of each output; they don't check the authority the agent has accumulated or the risk of the trajectory as a whole.

| Situation | Traditional guardrails | DriftGuard |
|---|---|---|
| Agent reads a secret | ✅ Allowed | ✅ Allowed (P=low) |
| Agent reads a secret, then opens customer CSV | ✅ Allowed | ⚠️ Warns (P rising) |
| Agent reads a secret, opens CSV, sends email | ✅ Allowed | 🚫 Blocked (P=87%) |

The SafetyDrift paper reports that, for communication-capable agents, reaching even a mild risk state gives an 85% probability of a safety violation within 5 steps. DriftGuard makes that prediction in real time, before the violation occurs.


Quick start

pip install driftguard

from driftguard import Session, InterventionAction

session = Session(task_type="default")  # or: technical, information, autonomous, communication

# Call before EVERY tool execution
result = session.gate("read_file", {"path": "/workspace/customer_data.csv"})

if result.action == InterventionAction.BLOCK:
    raise RuntimeError(f"DriftGuard blocked: {result.reason}")
elif result.action == InterventionAction.PAUSE:
    # ask_human_for_approval is a placeholder for your own approval flow
    approved = ask_human_for_approval(result.to_dict())
    if not approved:
        raise RuntimeError("Human rejected action")

# Safe to proceed; actually_read_file stands in for the real tool call
actually_read_file(...)

How it works

DriftGuard tracks three cumulative safety dimensions per session:

| Dimension | Description | Levels |
|---|---|---|
| Data Exposure | Sensitivity of data accessed | None → Public → Internal → Confidential → Sensitive |
| Tool Escalation | Capability level reached | None → Read → Write → Network → External |
| Reversibility | Can actions be undone? | Fully → Mostly → Mixed → Mostly Not → Irreversible |

State is monotonic: within a session, each dimension can only stay the same or increase. After each tool call, DriftGuard:

  1. Classifies the call into risk dimensions
  2. Projects the cumulative state forward
  3. Runs Markov chain absorption analysis: P(violation within N steps)
  4. Applies the configured policy (WARN / PAUSE / BLOCK)
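
The monotonic projection in step 2 can be pictured as a per-dimension maximum. The following is a minimal sketch under that assumption, not DriftGuard's actual code: the enum classes mirror the levels in the table above, while `CumulativeState` and `merge` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


# Hypothetical enums mirroring the dimension table; the library's real types may differ.
class DataExposure(IntEnum):
    NONE = 0
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    SENSITIVE = 4


class ToolEscalation(IntEnum):
    NONE = 0
    READ = 1
    WRITE = 2
    NETWORK = 3
    EXTERNAL = 4


class Reversibility(IntEnum):
    FULLY = 0
    MOSTLY = 1
    MIXED = 2
    MOSTLY_NOT = 3
    IRREVERSIBLE = 4


@dataclass(frozen=True)
class CumulativeState:
    """Hypothetical session state: the highest level reached in each dimension."""

    data: DataExposure = DataExposure.NONE
    tool: ToolEscalation = ToolEscalation.NONE
    rev: Reversibility = Reversibility.FULLY

    def merge(self, call: "CumulativeState") -> "CumulativeState":
        # Each dimension keeps its maximum so far, so the state never decreases.
        return CumulativeState(
            data=max(self.data, call.data),
            tool=max(self.tool, call.tool),
            rev=max(self.rev, call.rev),
        )


state = CumulativeState()
# Step 1: read a confidential file (read-only, fully reversible).
state = state.merge(
    CumulativeState(DataExposure.CONFIDENTIAL, ToolEscalation.READ, Reversibility.FULLY)
)
# Step 2: fetch a public web page (network access, low data sensitivity).
state = state.merge(
    CumulativeState(DataExposure.PUBLIC, ToolEscalation.NETWORK, Reversibility.FULLY)
)

print(state.data.name, state.tool.name, state.rev.name)
```

The second call touches only public data, yet the session still carries the confidential exposure from the first call together with the new network capability. That combination is what the trajectory analysis scores.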

Markov chain model

From the SafetyDrift paper, safety violations follow absorbing Markov chain dynamics:

SAFE → LOW → MODERATE → HIGH → CRITICAL → [VIOLATION]

Under this model, every unsupervised agent is eventually absorbed into the violation state, so the practical question is when, not if. DriftGuard computes the finite-horizon absorption probability:

P(violation | state, horizon) = [T^horizon][state, violation_state]

where T is a task-type-calibrated transition matrix.
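
To make the formula concrete, here is a small pure-Python sketch that raises a transition matrix to the horizon and reads off the entry for the violation state. The matrix values are invented for illustration and are not the paper's or the library's calibrated numbers; `matmul`, `matrix_power`, and `violation_probability` are hypothetical helper names.

```python
# States in the order used by the chain above; VIOLATION is the absorbing state.
STATES = ["SAFE", "LOW", "MODERATE", "HIGH", "CRITICAL", "VIOLATION"]

# Illustrative transition matrix (ASSUMED values, each row sums to 1).
# It is upper triangular because the risk state never decreases.
T = [
    [0.80, 0.20, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.70, 0.30, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.60, 0.40, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.50, 0.50, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.40, 0.60],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00],  # absorbing: once violated, always violated
]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matrix_power(m, exponent):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = matmul(result, m)
    return result


def violation_probability(state, horizon):
    """P(violation within `horizon` steps | state) = [T^horizon][state][VIOLATION]."""
    start = STATES.index(state)
    absorbing = STATES.index("VIOLATION")
    return matrix_power(T, horizon)[start][absorbing]


for name in STATES[:-1]:
    print(f"{name:<9} P(violation within 5 steps) = {violation_probability(name, 5):.3f}")
```

Because the violation state is absorbing, the entry of T^horizon in the violation column is the probability of having been absorbed at any point within the horizon, not only at the final step. For this toy matrix, one step from CRITICAL gives 0.6 and two steps give 0.6 + 0.4 × 0.6 = 0.84.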


Demo output

Step 1: web_search             Risk: MODERATE  P(viol): 31.1%  ✅ WARN
Step 2: read_file (config)     Risk: MODERATE  P(viol): 31.5%  ✅ WARN
Step 3: read_file (customers)  Risk: MODERATE  P(viol): 31.9%  ✅ WARN
Step 4: write_file (summary)   Risk: HIGH      P(viol): 54.8%  ⏸ PAUSE
Step 5: send_email             Risk: CRITICAL  P(viol): 86.8%  🚫 BLOCK

Each step looks safe in isolation. The sequence is a data leak. DriftGuard catches it at step 4 (pause) and blocks it at step 5.


Use as an MCP server

DriftGuard ships as a stdio MCP server. Any MCP-compatible agent (Claude Code, Cursor, GitHub Copilot) can call it directly.

Add to your mcp.json:

{
  "driftguard": {
    "command": "python",
    "args": ["-m", "driftguard"]
  }
}

Available MCP tools:

| Tool | Description |
|---|---|
| dg_gate | Evaluate a tool call. Returns action: ALLOW / WARN / PAUSE / BLOCK |
| dg_session_state | Get current cumulative risk state |
| dg_summary | Get session stats (blocks, pauses, step count) |
| dg_reset | Reset session for a new task |

Example system prompt addition for Claude Code:

Before executing any tool that reads files, makes network requests, or
writes data, call dg_gate with the tool name and arguments. If the result
action is BLOCK, do not proceed. If PAUSE, describe the action and ask the
user for approval before continuing.

Configuration

from driftguard import Session
from driftguard.policy import PolicyConfig, PolicyThreshold
from driftguard.types import InterventionAction

config = PolicyConfig(
    horizon=5,                  # steps ahead to evaluate
    task_type="communication",  # higher baseline risk
    thresholds=[
        PolicyThreshold(0.90, InterventionAction.BLOCK,  "Critical risk — blocked"),
        PolicyThreshold(0.60, InterventionAction.PAUSE,  "High risk — approval needed"),
        PolicyThreshold(0.30, InterventionAction.WARN,   "Elevated risk — warned"),
        PolicyThreshold(0.00, InterventionAction.LOG_ONLY, "Safe — logged"),
    ],
    always_block={
        "send_mass_email",
        "delete_production_database",
        "wipe_all_data",
    },
)

session = Session(config=config, task_type="communication")
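
One plausible reading of the threshold list is "take the action of the highest threshold the probability reaches, unless the tool is on the always-block list". The sketch below implements that reading with stand-in types; `Threshold` and `select_action` are hypothetical names, and the matching order is an assumption, not documented library behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Stand-in for a policy threshold: a lower bound on P(violation) and an action."""

    min_probability: float
    action: str
    message: str


THRESHOLDS = [
    Threshold(0.90, "BLOCK", "Critical risk, blocked"),
    Threshold(0.60, "PAUSE", "High risk, approval needed"),
    Threshold(0.30, "WARN", "Elevated risk, warned"),
    Threshold(0.00, "LOG_ONLY", "Safe, logged"),
]

ALWAYS_BLOCK = {"send_mass_email", "delete_production_database", "wipe_all_data"}


def select_action(tool_name, probability, thresholds=THRESHOLDS, always_block=ALWAYS_BLOCK):
    """Pick an action for one tool call (ASSUMED semantics, see the lead-in)."""
    # Tools on the always-block list are refused regardless of the score.
    if tool_name in always_block:
        return "BLOCK"
    # Check thresholds from strictest to most lenient and take the first one reached.
    for threshold in sorted(thresholds, key=lambda t: t.min_probability, reverse=True):
        if probability >= threshold.min_probability:
            return threshold.action
    return "LOG_ONLY"


for tool, p in [("read_file", 0.12), ("write_file", 0.55), ("send_email", 0.87)]:
    print(f"{tool:<11} P={p:.2f} -> {select_action(tool, p)}")
```

Under these example thresholds a score of 0.87 pauses rather than blocks, because BLOCK starts at 0.90. Lowering the top threshold is how a stricter policy would be expressed in this reading.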

Custom classifier rules

from driftguard import add_rule, ClassifierRule
from driftguard.types import DataExposure, ToolEscalation, Reversibility

# Add your own tool patterns
add_rule(ClassifierRule(
    pattern=r"jira.*create.*ticket",
    data_exposure=DataExposure.INTERNAL,
    tool_escalation=ToolEscalation.EXTERNAL,
    reversibility=Reversibility.MOSTLY,
    description="Create Jira ticket — external but recoverable",
))
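
For intuition about how a pattern such as `jira.*create.*ticket` could be applied, the sketch below matches tool names against an ordered rule list with Python's `re` module. It assumes case-insensitive `re.search` and first-match-wins ordering; `Rule` and `classify` are hypothetical names, and the library's actual matching rules may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Stand-in for a classifier rule: a regex over tool names plus a description."""

    pattern: str
    description: str


RULES = [
    Rule(r"jira.*create.*ticket", "Create Jira ticket"),
    Rule(r"read_file", "Read a local file"),
]


def classify(tool_name: str, rules=RULES) -> Optional[Rule]:
    """Return the first rule whose pattern occurs in the tool name, or None."""
    for rule in rules:
        # ASSUMPTION: search anywhere in the name, ignoring case.
        if re.search(rule.pattern, tool_name, flags=re.IGNORECASE):
            return rule
    return None


for name in ["jira_create_ticket", "Jira.CreateTicket", "send_email"]:
    match = classify(name)
    print(name, "->", match.description if match else "no rule matched")
```

Under first-match-wins, rule order matters: a broad pattern placed early would shadow a more specific one that follows it.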

Task types

| Task type | Typical use | Baseline violation rate |
|---|---|---|
| technical | Code editing, local file ops | Very low (~1–5% per step) |
| information | Research, browsing, summarising | Low (~8–15%) |
| default | General-purpose agents | Medium (~8%) |
| autonomous | Multi-step autonomous tasks | Medium-high (~12%) |
| communication | Email, messaging, posting agents | High (~18%) |

Integration examples

LangChain

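from langchain_core.tools import BaseTool, ToolException
from driftguard import Session, InterventionAction

session = Session(task_type="default")

class GuardedTool(BaseTool):
    """Base class for tools that consult DriftGuard before running."""

    def _run(self, *args, **kwargs):
        # Gate the call on the tool's name and keyword arguments
        result = session.gate(self.name, kwargs)
        if result.action == InterventionAction.BLOCK:
            raise ToolException(f"DriftGuard: {result.reason}")
        # _actual_run is a placeholder: subclasses put the real tool logic there
        return self._actual_run(*args, **kwargs)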

OpenAI Agents SDK

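import functools
import inspect

from agents import function_tool
from driftguard import Session, InterventionAction

session = Session(task_type="autonomous")

def guarded(fn):
    """Wrap fn so every call passes through session.gate() first."""
    @functools.wraps(fn)  # keep fn's name, docstring, and signature on the wrapper
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments to parameter names for the gate
        call_args = dict(inspect.signature(fn).bind(*args, **kwargs).arguments)
        r = session.gate(fn.__name__, call_args)
        if r.action == InterventionAction.BLOCK:
            # Return the refusal as the tool result so the model can see it
            return f"[BLOCKED by DriftGuard: {r.reason}]"
        return fn(*args, **kwargs)
    return function_tool(wrapper)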

Background: the SafetyDrift paper

This library implements the framework from:

SafetyDrift: Predicting When AI Agents Cross the Line Before They Actually Do
Aditya Dhodapkar, Farhaan Pishori (March 2026)
arXiv:2603.27148

Key findings we implement:

  • Safety state modelled as an absorbing Markov chain across 3 dimensions
  • Every agent has absorption probability 1.0 — the question is when, not if
  • Communication tasks: 85% violation probability within 5 steps from mild-risk state
  • Technical tasks: below 5% from any state
  • "Points of no return" are sharply task-dependent

Contributing

The highest-value contributions right now:

  1. Real trace data — If you have agent session traces (with ground truth on whether violations occurred), they can calibrate the transition matrices far better than our heuristic approximation
  2. Framework adapters — LangGraph, CrewAI, AutoGen, Google Genkit
  3. CI/CD integration — GitHub Actions workflow that gates agent PRs

See CONTRIBUTING.md for details.


License

MIT — use it in your agent pipelines, commercial or otherwise.


Inspired by and implementing SafetyDrift (arXiv:2603.27148). This project is not affiliated with the paper's authors.
