Idiomatic, zero-dependency, provider-agnostic Python safety layer for AI agents: least-privilege permissions, prompt/input/output guards, sandbox + SSRF guards, quotas/rate/deadline/concurrency/risk budgets, loop detection, human-in-the-loop approval, constitutional rules, action previews, honeytokens, audit/tracing, and Claude/OpenAI/Gemini/MCP tool-call integration.
Project description
agent_safety
Least-privilege runtime control for AI agents — governing what they're allowed to do, not just what they say.
Content scanners answer "is this text safe?" Agents need a different question:
"is this agent allowed to do this?" — read that file, call that API, spend
that budget. agent_safety is the action layer: an agent can only invoke the
capabilities you granted, on the budgets you set, and even a successful prompt
injection can't reach a tool it was never given. Think least-privilege / IAM for
agents — expressed with constructs Python developers already know (a with
block and a decorator), not a config DSL.
It's complementary to content guardrails, not a competitor: plug Presidio /
LLM Guard / Lakera in behind the same Guard interface for detection, while this
governs the actions. See POSITIONING.md.
Measured, not asserted: the attack scorecard (CI-gated) shows 13/13 known agent attacks contained, 4/4 legitimate actions allowed. Pure standard library (zero dependencies), provider- and MCP-agnostic, and ~12 µs per tool call (benchmark) — negligible against a model round-trip.
Install
pip install agent-safety # zero runtime dependencies
# from source: pip install -e ".[dev]"
Start here
Two ideas. Mark a function with @tool, then run it inside a safely(...) block
that says, in plain words, what's allowed:
from agent_safety import tool, safely
@tool
def read_file(path):
return open(path).read()
with safely(allow="read_file", calls=10, hide_secrets=True):
text = read_file("notes.txt") # allowed, budget-counted, secrets scrubbed
# anything you didn't allow simply can't run here
Every option is a plain keyword — reach for one when you need it, ignore the rest:
with safely(
allow=["read_file", "search"], # what the code may do (or allow="everything")
deny="delete", # ...except this (deny always wins)
calls=25, # most tool calls
per_second=5, # speed limit
seconds=30, # time budget
at_most=4, # most tool calls running at once
hide_secrets=True, # scrub emails / API keys from results
block_injections=True, # reject "ignore previous instructions" inputs
no_repeats=3, # stop a runaway loop
ask=True, # ask you (y/n) before each action
explain=True, # require a rationale="..." with each call
monitor=True, # dry run: don't block, just log what WOULD block
log=True, # print every decision
):
...
Already have tools? Wrap them in bulk with guard — no edits — and reach for
a ready-made profile instead of assembling settings:
from agent_safety import guard, safely, Profiles
safe_search, safe_fetch = guard(search, fetch) # your existing functions, now guarded
with safely(allow=["search", "fetch"], **Profiles.hardened()): # paranoid defaults
safe_search("agent safety")
That's the whole beginner surface. When you outgrow it, every keyword maps to a real object you can use directly — read on.
New here? Walk through TUTORIAL.md — build a complete, running safe agent in about 10 minutes, no API keys needed.
The full version
Under the hood safely builds the same objects you can wire by hand for full
control: a with safety_context(...), a PermissionSet, guard objects, Quota,
and so on. The same per-provider glue is absorbed by ToolRegistry.
from agent_safety import (
safety_context, guarded_tool, PermissionSet,
MaxLength, RedactPII, PromptInjectionGuard, Quota, ListSink,
)
@guarded_tool("filesystem.read")
def read_file(path: str) -> str:
return open(path).read()
audit = ListSink()
with safety_context(
PermissionSet.of("filesystem.read"), # capabilities the agent gets
prompt_guards=[PromptInjectionGuard(), MaxLength(8000)],
output_guards=[RedactPII()], # scrub secrets from results
quota=Quota(max_calls=25, max_tokens=200_000),# resource budget
audit=[audit], # record every decision
):
text = read_file("notes.txt") # allowed, PII-redacted, budget-charged, audited
# any tool needing shell.exec / network.* here -> PermissionDenied
The core ideas
1. A with block scopes everything — and can only de-escalate
safety_context(...) installs a policy for the duration of a block and restores
the previous one on exit (even on exceptions). Nested contexts intersect
permissions and append guards, quotas, and audit sinks, so a sub-step can
drop privileges but nothing inside can grant itself a capability it wasn't given.
Outside any context the policy is deny-all (fail-safe). Backed by
contextvars, so it's correct under threads and asyncio tasks.
with safety_context(PermissionSet.of("filesystem.*", "network.http")):
is_allowed("network.http") # True
with safety_context(PermissionSet.of("filesystem.read")):
is_allowed("filesystem.write") # False — narrowed
is_allowed("network.http") # False — narrowed
# even safety_context(PermissionSet.allow_all()) here cannot widen it
is_allowed("network.http") # True again — outer scope restored
2. Permission sets: capabilities, deny-wins, default-deny
PermissionSet.of("filesystem.*", deny=["filesystem.delete"]) — glob patterns,
explicit deny overrides allow, anything unmatched is forbidden, and intersect()
can only ever narrow. That one-way ratchet is what the with block relies on.
3. Guards filter prompts, inputs, and outputs
Composable check(value, stage) objects that pass, sanitize, or block:
| Guard | What it does |
|---|---|
MaxLength(n) |
block over-long values (prompt-stuffing / runaway cost) |
DenyPattern(regex) |
block values matching a banned pattern |
PromptInjectionGuard() |
tripwire for "ignore previous instructions"-style attacks |
RedactPII() |
replace emails / cards / SSNs / API keys with [REDACTED:…] |
SecretScanner() |
detect provider credentials (AWS/GitHub/Slack/Google keys, JWTs, PEM keys) and redact or block |
UnicodeSanitizer() |
strip invisible / bidi / tag characters used for hidden prompt injection |
Honeytoken(token) |
trip if a planted canary secret appears in a value (exfiltration tripwire) |
PathBoundary(root) |
confine a filesystem path to root — block ../ traversal and symlink escapes |
NetworkAllowlist(hosts=…) |
confine a URL to approved hosts/schemes; block private-IP / localhost targets (SSRF) |
Compose([...]) |
chain guards, threading the transformed value |
The last two are sandbox guards: they constrain the resource a value points
at rather than its content, so they belong in input_guards=[…] on the tools that
touch the filesystem or the network.
@guarded_tool("filesystem.read", input_guards=[PathBoundary("/srv/data")])
def read_file(path: str) -> str: # "../../etc/passwd" -> GuardViolation
return open(path).read()
@guarded_tool("network.http", input_guards=[NetworkAllowlist(["api.weather.com"])])
def fetch(url: str) -> str: # http://169.254.169.254/ -> GuardViolation
...
4. Budgets: quotas, rate limits, deadlines, loop detection
Ways to bound how much, how fast, and how long an agent acts, all charged on every guarded call alongside the ones already in scope (so an inner limit can be tighter but never looser):
| Construct | Bounds |
|---|---|
Quota(max_calls=…, max_tokens=…) |
the total calls/tokens an agent may spend |
RateLimit(per_second=5) |
a sliding-window burst cap (also per_minute=, or max_calls=/per_seconds=) |
Deadline(seconds=30) |
a wall-clock budget, timed from the first action |
ConcurrencyLimit(4) |
most tool calls running at once — share one across agents to cap them together |
RiskBudget(20) |
spend danger, not calls — weight tools with @tool(..., risk=N) and cap the total |
LoopGuard(max_identical=3) |
a circuit breaker for an agent stuck repeating one tool with the same args |
with safety_context(
PermissionSet.of("*"),
quota=Quota(max_calls=200, max_tokens=500_000),
rate_limit=RateLimit(per_second=5), # 6th call in a second -> RateLimitExceeded
deadline=Deadline(seconds=30), # past 30s of work -> DeadlineExceeded
loop_guard=LoopGuard(max_identical=3), # 4th identical call -> LoopDetected
):
...
Report tokens from whatever your model's usage object gives you via
charge_tokens(...).
Many agents at once. Because the policy lives in a contextvars.ContextVar,
every thread and every asyncio task automatically gets its own rules — so
running several agents concurrently is just several safely(...) blocks, fully
isolated. Give each agent different powers, and share one ConcurrencyLimit to
cap their combined parallelism. See examples/multi_agent.py.
5. Monitor mode (dry run)
Adopting least privilege on a working agent is scary — what if you forgot to allow something? Run in monitor mode first: nothing is blocked, but every would-be denial is recorded. Watch the log, see exactly which capabilities the agent needs, then turn enforcement on.
with safely(allow="read_file", monitor=True, log=True):
delete_everything() # runs — but logs permission: would_deny delete_everything
Monitor mode obeys the same ratchet: a nested block can switch monitor → enforce (tighten), never enforce → monitor.
6. Human-in-the-loop approval
ApprovalGate(require=[…capabilities…], approver=fn) requires an explicit "yes"
before any matching tool runs. The approver is any callable (CLI prompt, Slack
round-trip, policy service) and may be sync or async; it's consulted after
the permission check but before the tool executes, and a denial raises
ApprovalDenied — which safe_dispatch reports back to the model instead of
crashing the loop.
def cli_ok(req) -> bool:
return input(f"Allow {req.tool}{req.args}? [y/N] ").lower() == "y"
with safety_context(
PermissionSet.of("shell.exec", "filesystem.*"),
approval=ApprovalGate(require=["shell.exec", "filesystem.delete"], approver=cli_ok),
):
run_shell("ls") # shell.exec -> prompts for a human yes/no
read_file("notes.txt") # not gated -> runs straight through
7. Explainability: make the agent say why
Least privilege limits what an agent may do; a ReasoningGate makes it justify
why before it does it. For matching capabilities the agent must supply a
rationale="…" with the call; the rationale is validated, recorded to the audit
trail, and handed to any approver — then stripped before the tool runs. A missing
or thin rationale raises ExplanationRequired, reported back to the model so it
retries with an explanation.
with safety_context(
PermissionSet.of("shell.exec"),
reasoning=ReasoningGate(require=["shell.exec"], min_length=20),
):
run_shell("rm build/*", rationale="Clearing stale build artifacts before a clean rebuild")
# run_shell("rm build/*") -> ExplanationRequired
Separately, thought_trace() + record_thought("…") let the agent narrate its
reasoning inside a block; each step is timestamped onto the audit trail and
stamped with the active trace_span, giving a replayable record of stated intent
alongside the decisions it triggered. A validator (ReasoningGate(..., validator=fn))
can even hold the rationale to a quality bar — e.g. an LLM-as-judge.
8. Transactional rollback — undo on failure
Least privilege limits what an agent can do; rollback handles the irreversible
things it did do when a later step fails. rollback() is a with block that
records a compensating action next to each forward action — commit on a clean
exit, unwind LIFO on an exception, then re-raise:
with rollback() as tx:
row = create_record(payload)
tx.on_undo(delete_record, row.id) # how to undo the line above
send_email(row.email)
tx.on_undo(send_retraction, row.email)
charge_card(row) # raises -> retraction, then delete,
# then the error propagates
It's a best-effort, in-process unwind (your compensations, not a DB transaction):
a compensation that itself fails is recorded on tx.compensation_errors and
audited without stopping the rest, and the body's original exception is never
masked. tx.commit() is an explicit point-of-no-return; async_rollback() awaits
coroutine compensations. Every begin/commit/compensation hits the same audit sinks.
9. Audit, tracing & metrics
Audit sinks (ListSink, JsonlSink, or any callable) receive an AuditEvent for
every permission decision, guard action, quota/rate charge, approval, reasoning,
loop trip, and rollback/compensation — a tamper-evident record of what the agent
tried. Wrap work in trace_span("plan") and each event is stamped with the dotted
span path, turning the flat log into a causal tree; drop in a MetricsSink to get
running counts (m.counts["permission/deny"]) instead of storing every event.
10. Constitutional rules, previews & honeytokens
Three checks that go past static permissions:
with safely(
allow="*",
rule="never email a customer without prior consent", # judged by a model
judge=my_model, # judge(action, rule) -> ok?
preview=approve_fn, # see a tool's preview, then y/n
honeytoken="sk-CANARY-9f3x", # planted secret = exfil tripwire
):
...
rule=+judge=— state a safety rule in plain English; a model (any callable, provider-agnostic) judges each matching call and a "no" raisesConstitutionViolation. It's probabilistic, so keep the hardallow=/limits underneath as the real backstop.preview=— a tool with a@tool(..., preview=fn)describes what it would do; the approver sees that preview and approves or rejects before it runs.honeytoken=— plant a fake secret an attacker would grab; if it ever shows up in an outbound call,HoneytokenTrippedfires — strong evidence of a hijack.
MCP. guard_mcp(session) wraps any Model Context Protocol session so every
remote tool call runs through the active safely(...) policy — permission,
budgets, approval, constitutional rules, loop/concurrency, input guards, and audit
— with no MCP SDK dependency:
safe = guard_mcp(session)
with safely(allow=["search", "fs.read"], calls=20, hide_secrets=True):
result = await safe.call_tool("search", {"q": "agent safety"})
@guarded_tool / @guarded_async_tool
Decorate the functions an agent may call. Every invocation is quota-charged, permission-checked against the current policy, input/output-guarded, and audited — sync or async:
@guarded_async_tool("shell.exec", input_guards=[DenyPattern(r"rm\s+-rf")])
async def run(cmd: str) -> str:
...
Works with Claude, OpenAI, and Gemini
The safety primitives have no SDK dependency. ToolRegistry declares each tool
once, emits schemas in any provider's dialect, and dispatches a tool call —
parsed from any provider's response — through the full safety pipeline.
from agent_safety import ToolRegistry, PermissionSet, RedactPII, safety_context
registry = ToolRegistry()
@registry.tool("weather.read", description="Get weather for a city.",
parameters={"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]})
def get_weather(city: str) -> str:
return f"sunny in {city}"
tools = registry.schemas("openai") # or "anthropic" / "gemini"
Or let the signature write the schema
Hand-writing JSON Schema that just restates the signature is duplication. Omit
parameters (and description) and agent_safety infers them from the type
hints and docstring — Annotated/Param carry per-field descriptions and
constraints, Literal/Enum become enums, defaults become optional fields:
from typing import Annotated, Literal
from agent_safety import ToolRegistry, Param
registry = ToolRegistry()
@registry.tool("weather.read")
def get_weather(
city: Annotated[str, "city name, e.g. 'Paris'"],
units: Literal["metric", "imperial"] = "metric",
days: Annotated[int, Param(description="forecast horizon", minimum=1, maximum=14)] = 3,
) -> str:
"""Get the weather forecast for a city.""" # -> the tool description
...
tools = registry.schemas("anthropic")
# {'name': 'get_weather', 'description': 'Get the weather forecast for a city.',
# 'input_schema': {'type': 'object', 'properties': {
# 'city': {'type': 'string', 'description': "city name, e.g. 'Paris'"},
# 'units': {'enum': ['metric', 'imperial'], 'type': 'string', 'default': 'metric'},
# 'days': {'type': 'integer', 'description': 'forecast horizon',
# 'minimum': 1, 'maximum': 14, 'default': 3}},
# 'required': ['city']}}
An explicit parameters= / description= always wins, and tool_schema(fn) is
exported if you want the schema without the registry. Pure standard library — no
Pydantic.
Pass @registry.tool(..., validate=True) to close the loop: each call's arguments
are checked against that schema (types, enum, required, ranges, array items)
before dispatch, so a hallucinated or malformed call is reported back to the
model instead of reaching your function as the wrong type.
parse_tool_calls(provider, response) does the other half — it normalizes the
tool calls out of a raw Anthropic / OpenAI / Gemini response into ToolCalls ready
to feed straight to safe_dispatch, with no SDK dependency.
Then the per-provider loop is ~5 lines; the safety-relevant call is identical:
# Anthropic — tool_use block: name + dict input
result = registry.safe_dispatch("anthropic", block.id, block.name, block.input)
# OpenAI — tool_calls: name + JSON-string arguments (dispatch parses it)
result = registry.safe_dispatch("openai", call.id, call.function.name, call.function.arguments)
# Gemini — function_call part: name + dict args
result = registry.safe_dispatch("gemini", "n/a", part.function_call.name, dict(part.function_call.args))
safe_dispatch returns a provider-native tool-result message in every case —
including when a call is denied or guarded, so a blocked tool is reported
back to the model instead of crashing your loop. Wrap the loop in a
safety_context and the same least-privilege, redaction, quota, and audit apply
no matter which model is driving.
Run it
cd python-agent-safety
pip install -e ".[dev]"
python examples/easy.py # the simplest possible intro (@tool + safely)
python examples/first_agent.py # a complete offline agent loop (see TUTORIAL.md)
python examples/multi_agent.py # two agents, different powers, a shared concurrency cap
python examples/quickstart.py # narrated single-provider walkthrough
python examples/hardening.py # sandbox + limits + approval + reasoning + rollback
python examples/providers.py # one policy across Anthropic/OpenAI/Gemini
python examples/benchmark.py # per-call overhead on your machine
python benchmarks/attack_suite.py # the attack scorecard (what's contained)
python -m pytest # 242 tests (incl. the CI-gated attack suite)
python -m ruff check . && python -m mypy # lint + strict type-check (matches CI)
# Optional live check against the real Gemini API (your key, never hardcoded):
GEMINI_API_KEY=... python -m pytest tests/test_gemini_live.py -v
Layout
src/agent_safety/
easy.py tool / safely / guard / Profiles — the beginner front door
permissions.py PermissionSet — capability allow/deny + intersect (+ to_dict/from_dict)
guards.py Stage, Guard protocol, content + security guards (Secret/Unicode)
sandbox.py PathBoundary, NetworkAllowlist — filesystem/SSRF resource guards
quota.py Quota (call/token budgets) + RiskBudget (per-action risk)
limits.py RateLimit + Deadline + ConcurrencyLimit + LoopGuard
action.py Action — the one object every safety hook (approver/judge/...) receives
approval.py ApprovalGate — human-in-the-loop gating
preview.py PreviewGate — approve a tool's "what would this do?" preview
constitution.py ConstitutionGate — plain-English rules judged by a model
reasoning.py ReasoningGate + thought_trace / record_thought — explainability
transaction.py rollback() / async_rollback() — compensating (saga) transactions
mcp.py guard_mcp — run MCP tool calls through the active policy
tracing.py trace_span() / current_span() — causal span paths on audit events
audit.py AuditEvent, ListSink / JsonlSink / MetricsSink
policy.py Policy — the immutable bundle of all of the above; narrow(), explain()
context.py safety_context(), require(), check_*(), charge_*()
decorators.py guarded_tool / guarded_async_tool
schema.py tool_schema / Param — derive JSON-Schema from a signature
validation.py validate_args — check tool inputs against the declared schema
integrations.py ToolRegistry — schema dialects, neutral dispatch, parse_tool_calls
exceptions.py AgentSafetyError + PermissionDenied / GuardViolation / QuotaExceeded /
RateLimitExceeded / DeadlineExceeded / RiskBudgetExceeded / LoopDetected /
ApprovalDenied / ExplanationRequired / ConstitutionViolation /
HoneytokenTripped / RollbackError
py.typed PEP 561 marker — ships inline type information to consumers
Scope & honesty
The guards are heuristics and a structural foundation, not a complete
security boundary by themselves. The durable guarantee is least privilege: even
a prompt injection that slips past the regex tripwire still cannot invoke a
capability the active PermissionSet never granted, spend past its Quota,
RateLimit, or Deadline, repeat itself past a LoopGuard, run a gated tool
without an ApprovalGate "yes" or a ReasoningGate rationale, or exfiltrate a
secret the RedactPII / SecretScanner output guard scrubbed — and every attempt
is on the audit trail. (The reasoning rationale is an accountability record, not
a correctness check — a model can still rationalize; pair it with a validator and
the gates above.)
The sandbox guards are likewise pre-flight intent checks, not an OS sandbox:
PathBoundary resolves symlinks and NetworkAllowlist rejects private-IP
literals, but neither defeats a TOCTOU race or DNS rebinding on its own. Run them
in front of a real OS/network sandbox and a DNS-aware HTTP client, and layer
real moderation / secret-scanning behind the same Guard interface — not instead
of these, but with them.
rollback() is best-effort, in-process compensation, not a distributed
transaction: your compensators run sequentially and can themselves fail (which is
recorded, not hidden). It bounds the blast radius of a failed multi-step action;
it does not give you atomicity across external systems.
For the full picture — trust boundaries, what's in and out of scope, residual risks, and a mapping to the OWASP LLM Top 10 — see THREAT_MODEL.md. To report a vulnerability, see SECURITY.md. Release history is in CHANGELOG.md.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file agent_safety-0.8.0.tar.gz.
File metadata
- Download URL: agent_safety-0.8.0.tar.gz
- Upload date:
- Size: 89.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
30fa8eda2e5429f8563aeb23b76649abafd146aea01aea2bca9e4a0795e08f58
|
|
| MD5 |
26217d0b7288aad325af228afd9c2fb1
|
|
| BLAKE2b-256 |
4da40bb50571cb9e1da7c83473e0f7d2f6d46d700f406fec040acf1d1eb2a1ba
|
File details
Details for the file agent_safety-0.8.0-py3-none-any.whl.
File metadata
- Download URL: agent_safety-0.8.0-py3-none-any.whl
- Upload date:
- Size: 69.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
701b872df93b96df58f486a38baed3d6183373c962b5893df2b84bd1e77bf5de
|
|
| MD5 |
2b1993b215abab1f3055fb40ed29ae27
|
|
| BLAKE2b-256 |
1d341ce0105de6df148c0148ff4915a44648ee64d871cc76aac6a1d0076c1ee7
|