ClawStrike

Security guardrails for AI agents — prompt injection detection, trust-aware input scanning, and advisory action gating.

The Problem

AI agents like OpenClaw now have real system access — shell execution, email, calendars, file systems — and users grant this willingly because autonomy is the whole point. OpenClaw alone has surpassed 200k GitHub stars in weeks, signaling massive demand for agents that act on your behalf. But the security model hasn't kept up.

The primary risk isn't the agent itself — it's the inputs. Every email body, group chat message, webhook payload, and skill data feed is a potential prompt injection vector. A carefully crafted message from any of these channels can instruct the agent to take actions the user never intended — sending emails, exfiltrating files, modifying system configuration — all while operating within its granted permissions.

Most emerging security approaches focus on sandboxing: isolating the agent, limiting blast radius, restricting what it can do. That matters, but it doesn't address the input layer. A sandboxed agent can still be manipulated into misusing every permission it legitimately has. If an agent is allowed to send emails, a sandbox won't stop a prompt injection from composing and sending one.

What's missing is input-layer defense: scanning content before it reaches the agent, differentiating trust based on where input comes from (an owner's DM is not the same threat as an unsolicited email body), and gating risky actions with a review step before they execute. These need to work together — without trust-aware scanning, action gating is flying blind; without action gating, a bypassed classifier has no safety net.

Even OpenClaw's own documentation acknowledges there is no "perfectly secure" setup. ClawStrike doesn't claim to be one either. What it provides is a layered defense that makes attacks harder, catches the common cases, and gives you a forensic trail when something does go wrong.

What ClawStrike Does

ClawStrike is a security layer you install as a skill in your AI agent. It instructs the agent to check with ClawStrike before acting on any input — scanning content, evaluating trust, and gating risky actions.

The three pillars

Pillar	What it does	Why it matters
Classify	Scans every inbound message for prompt injection using Meta's Llama Prompt Guard 2 models	Catches injection attempts before they reach the agent
Trust	Assigns a trust level based on the input channel (owner DM → high, email body → low, webhook → untrusted) and tracks contacts over time	A message from your own account isn't treated the same as an unsolicited email
Gate	Evaluates planned actions against a risk taxonomy and recommends allow, block, or prompt the user	Shell execution from an untrusted source gets blocked; a calendar read from the owner gets auto-allowed

These three work together. The classifier's sensitivity adjusts based on trust — untrusted sources face stricter thresholds. The gating engine uses both the action's risk level and the session's trust level to decide what to recommend. And if the classifier flags something suspicious, the gating engine automatically tightens for the rest of that session.
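
The link between trust and classifier sensitivity can be sketched in a few lines. This is an illustration under stated assumptions, not ClawStrike's implementation: the function name `classify_decision` and the per-trust offsets in `TIGHTEN` are hypothetical, and only the 0.92 / 0.70 defaults come from the Configuration section of this README.

```python
from dataclasses import dataclass

# Default thresholds as documented in the Configuration section.
BLOCK_DEFAULT = 0.92
FLAG_DEFAULT = 0.70

# Hypothetical offsets: how far each trust level lowers both thresholds.
# These numbers are assumptions chosen for illustration only.
TIGHTEN = {"high": 0.0, "medium": 0.05, "low": 0.10, "untrusted": 0.15}


@dataclass(frozen=True)
class Verdict:
    decision: str    # "pass", "flag", or "block"
    block_at: float  # effective block threshold for this trust level
    flag_at: float   # effective flag threshold for this trust level


def classify_decision(score: float, trust: str) -> Verdict:
    """Map an injection score in [0, 1] to a decision, stricter for lower trust."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    offset = TIGHTEN[trust]
    block_at = round(BLOCK_DEFAULT - offset, 2)
    flag_at = round(FLAG_DEFAULT - offset, 2)
    if score >= block_at:
        decision = "block"
    elif score >= flag_at:
        decision = "flag"
    else:
        decision = "pass"
    return Verdict(decision, block_at, flag_at)
```

With these assumed offsets, the same score of 0.80 is only flagged when it comes from a high-trust source but is blocked when it comes from an untrusted one.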

Every decision — classification, trust change, gating recommendation, user approval — is written to a local audit log for forensic review.
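
To make the audit trail concrete, here is a minimal sketch of what one log entry could look like as a JSON Lines record. The field names (`ts`, `event`, `decision`, `input_sha256`, `snippet`) and both function names are assumptions for illustration; the README does not specify ClawStrike's actual log format. The `log_raw_input` parameter mirrors the `audit.log_raw_input` setting described under Configuration.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(event: str, decision: str, text: str,
                 log_raw_input: bool = True, snippet_len: int = 80) -> str:
    """Build one JSON Lines audit entry for a guardrail decision."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,        # e.g. "classification" or "gating"
        "decision": decision,  # e.g. "block", "flag", "allow"
        # The hash lets a reviewer match an entry to an input without storing it.
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    if log_raw_input:
        # Store only a short prefix of the input, never the whole message.
        record["snippet"] = text[:snippet_len]
    return json.dumps(record, sort_keys=True)


def append_audit(path: str, line: str) -> None:
    """Append a single entry; append mode keeps earlier entries untouched."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

Storing a hash alongside an optional snippet is one way to keep the log useful for forensics while limiting how much untrusted content it retains.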

A typical flow

External input arrives (email, group chat, webhook, ...)
         │
         ▼
   ┌───────────┐
   │  Classify │──── Score ≥ block threshold? ──► Block. Notify owner.
   └───────────┘
         │ pass or flag
         ▼
   ┌───────────┐
   │   Trust   │──── Resolve trust from channel + contact history
   └───────────┘     Adjust classifier thresholds accordingly
         │
         ▼
     Agent acts
         │
         ▼
   ┌───────────┐
   │   Gate    │──── Risk level + trust level → allow / block / prompt user
   └───────────┘
         │
         ▼
   ┌───────────┐
   │ Audit Log │──── Every decision recorded
   └───────────┘
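
The Gate step in the flow above combines two inputs, the action's risk level and the session's trust level. One possible way to express that as a decision rule is sketched below. The four-step risk scale, the margin rule, and the name `gate` are assumptions made for illustration; ClawStrike's real risk taxonomy is not described in this README.

```python
# Hypothetical orderings, lowest to highest. Only the trust level names
# appear in this README; the risk scale is an assumption.
RISK_ORDER = ["low", "medium", "high", "critical"]
TRUST_ORDER = ["untrusted", "low", "medium", "high"]


def gate(risk: str, trust: str, session_flagged: bool = False) -> str:
    """Recommend "allow", "prompt", or "block" for a planned action."""
    r = RISK_ORDER.index(risk)
    t = TRUST_ORDER.index(trust)
    if session_flagged and t > 0:
        # A flagged classification tightens gating for the rest of the
        # session: treat the session as one trust level lower.
        t -= 1
    margin = t - r
    if margin >= 1:
        return "allow"   # trust clearly outweighs risk
    if margin >= -1:
        return "prompt"  # borderline: ask the user
    return "block"       # risk clearly outweighs trust
```

Under these assumptions, `gate("critical", "untrusted")` returns `"block"` and `gate("low", "high")` returns `"allow"`, matching the shell-execution and calendar-read examples in the pillars table.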

It learns over time

ClawStrike starts strict and relaxes as it learns your patterns. New contacts begin as untrusted and earn trust through repeated safe interactions, eventually reaching their channel's default trust level. Actions you approve can be added to an allowlist so you aren't prompted for the same routine operation twice. The system adapts to your workflows while staying vigilant for novel or untrusted activity.
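
The idea of a contact earning trust up to its channel's default can be sketched as a simple counter. Everything here is hypothetical: the `Contact` class, the `PROMOTE_EVERY` step of five safe interactions, and the reset on a flagged message are assumptions, not documented ClawStrike behavior.

```python
from dataclasses import dataclass

LEVELS = ["untrusted", "low", "medium", "high"]
PROMOTE_EVERY = 5  # assumed number of safe interactions per promotion


@dataclass
class Contact:
    channel_default: str        # ceiling this contact can reach
    safe_interactions: int = 0  # consecutive interactions with no flags

    def record_safe(self) -> None:
        self.safe_interactions += 1

    def record_flagged(self) -> None:
        # Assumption: a flagged message resets earned trust entirely.
        self.safe_interactions = 0

    @property
    def trust(self) -> str:
        earned = self.safe_interactions // PROMOTE_EVERY
        cap = LEVELS.index(self.channel_default)
        # New contacts start at "untrusted" and never exceed the channel default.
        return LEVELS[min(earned, cap)]
```

A contact on a medium-trust channel would start as untrusted, reach low after five safe interactions, and stop at medium no matter how many more follow.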

Advisory mode (MVP)

In the current release, ClawStrike operates in skill mode: it advises the agent, and the agent is instructed to comply. This is effective against unsophisticated attacks and provides full visibility, but a sufficiently advanced injection could instruct the agent to ignore the skill. Enforcement-grade interception (where blocked content never reaches the agent at all) is planned for a future release.

ClawStrike integrates via two methods:

Method	How it works	Best for
MCP	Persistent process, agent calls ClawStrike tools directly	MCP-capable agents (full feature set including session tracking)
CLI	One-shot shell commands (`clawstrike classify`, `clawstrike gate`)	Any agent with shell access (e.g., OpenClaw)

Getting Started

Prerequisites: Prompt Guard model access

ClawStrike uses Meta's Llama Prompt Guard 2 for prompt injection detection. The model weights are gated, so before installing you need to request access to them:

1. Create a Hugging Face account if you don't have one.
2. Visit the model page for your chosen model and accept Meta's license:
   - Llama-Prompt-Guard-2-86M: multilingual (~1.13 GB), recommended
   - Llama-Prompt-Guard-2-22M: English only (~300 MB)
3. Generate a read-only access token at huggingface.co/settings/tokens.

Option A: Docker (recommended for new setups)

Best if you're setting up OpenClaw and ClawStrike together from scratch.

git clone https://github.com/yogur/ClawStrike && cd ClawStrike
cp clawstrike.example.yaml clawstrike.yaml    # edit to taste
cp .env.example .env                           # add HF_TOKEN + LLM credentials
bash docker-setup.sh

The setup script builds the image, downloads the model, runs OpenClaw onboarding, and starts the gateway. First run takes a few minutes for the model download; subsequent starts are fast.

See the full Docker setup guide for details, volume reference, and troubleshooting.

Option B: Direct install (pip / uv)

Best if you already have OpenClaw running and want to add ClawStrike alongside it.

pip install clawstrike                         # or: uv add clawstrike

# Install Hugging Face CLI and authenticate
# See: https://huggingface.co/docs/huggingface_hub/en/guides/cli
pip install "huggingface_hub[cli]"             # quoted so shells like zsh don't expand the brackets
hf auth login --token "$HF_TOKEN" --add-to-git-credential

# Bootstrap config with secure defaults
clawstrike init

# Copy the ClawStrike skill into your OpenClaw skills directory
cp -r skills/clawstrike-cli /path/to/openclaw/skills/clawstrike

See the full direct setup guide for OpenClaw configuration, file permissions, and security recommendations.

Verify

clawstrike health
# {"status": "ok", "mode": "skill", "classifier": "multilingual", "mcp_enabled": false}
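
If you want to run this check from a script, for example before starting the agent, a small wrapper might look like the sketch below. It assumes that `clawstrike health` prints a single JSON object on stdout in the shape shown above and exits with status zero on success; neither is confirmed beyond the sample output. The function names `parse_health` and `check_health` are hypothetical.

```python
import json
import subprocess


def parse_health(output: str) -> dict:
    """Parse the health output and fail loudly unless status is "ok"."""
    status = json.loads(output)
    if status.get("status") != "ok":
        raise RuntimeError(f"ClawStrike reported an unhealthy state: {status}")
    return status


def check_health() -> dict:
    """Run the CLI health command (assumes `clawstrike` is on PATH)."""
    result = subprocess.run(
        ["clawstrike", "health"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_health(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the check easy to exercise without the CLI installed.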

Configuration

clawstrike init generates a clawstrike.yaml with secure defaults — see clawstrike.example.yaml for a fully annotated starting point.

The settings most users will want to review:

Setting	What to decide	Default
`classifier.model`	`"multilingual"` (86M, multiple languages) or `"english-only"` (22M, lower memory)	`multilingual`
`classifier.threshold.block` / `.flag`	How aggressive detection should be: lower values catch more injection attempts but produce more false positives	`0.92` / `0.70`
`trust.channel_defaults`	Which trust level (high, medium, low, or untrusted) each input channel receives	owner_dm=high, email=low, webhook=untrusted
`trust.contacts`	Specific senders to always trust or always block, regardless of channel	`{}` (none)
`action_gating.allowlist_learning`	Whether approving an action can create a permanent "always allow" rule	`false` (off)
`audit.log_raw_input`	Whether input text snippets are stored in the audit log, or only a hash	`true` (snippets stored)

Example: switch to the smaller English-only model, tighten detection, and pin the trust of two known senders:

clawstrike:
  classifier:
    model: "english-only"
    threshold:
      block: 0.85          # lower = more aggressive blocking
      flag: 0.60

  trust:
    contacts:
      "attacker@evil.com": "blocked"
      "colleague@company.com": "trusted"

Security note: The config file controls ClawStrike's security policy. It should be owned by you (not the agent's service account) and not writable by the agent process. clawstrike init sets file permissions to 600 (owner read/write only) by default.
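
The permission requirement in the security note can be checked with a few lines of standard-library Python. This is a sketch for POSIX systems only, and the function name `config_permission_problems` is hypothetical; it is not part of ClawStrike.

```python
import os
import stat


def config_permission_problems(path: str) -> list[str]:
    """Return a list of problems with the config file's permission bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    # Mode 600 means no bits are set for group or others.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(
            f"{path} is accessible to group or others (mode {mode:03o}); expected 600"
        )
    return problems
```

This only inspects permission bits. Confirming that the file is owned by your account rather than the agent's service account would need an additional check on the file's owner.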

See the full configuration reference for all options.
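
Because the block and flag thresholds interact, a quick sanity check on an edited config can catch mistakes such as a flag threshold above the block threshold. The sketch below works on an already-parsed config dictionary (for example, the result of loading `clawstrike.yaml` with a YAML library of your choice). The function `validate_thresholds` is hypothetical, and whether ClawStrike performs its own validation is not stated in this README.

```python
def validate_thresholds(config: dict) -> list[str]:
    """Check classifier thresholds in a parsed clawstrike.yaml structure."""
    threshold = (
        config.get("clawstrike", {}).get("classifier", {}).get("threshold", {})
    )
    # Fall back to the documented defaults when a key is missing.
    block = threshold.get("block", 0.92)
    flag = threshold.get("flag", 0.70)

    problems = []
    for name, value in (("block", block), ("flag", flag)):
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 < value <= 1.0:
            problems.append(f"{name} threshold must be a number in (0, 1], got {value!r}")
    if not problems and flag >= block:
        problems.append("flag threshold should be lower than block threshold")
    return problems
```

An empty list means the thresholds are numerically sensible; it says nothing about whether they suit your traffic.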

Project Status

ClawStrike is in MVP (Skill Mode). The core guardrails — prompt injection detection, trust tiers, action gating, and audit logging — are functional and tested.

In this release, all guardrail decisions are advisory: the agent is instructed to comply via the skill's system prompt, but is not mechanically prevented from ignoring recommendations. This is effective against common attacks and provides full visibility through the audit log, but is not a hard security boundary.

Coming next:

- Enforcement mode (Proxy Mode): ClawStrike intercepts LLM API calls directly, blocking dangerous tool calls before they reach the agent. No LLM cooperation required.
- LLM-as-Judge: Semantic intent verification for ambiguous cases.
- Skill scanner: Static analysis of agent skills before installation.
- Output guardrails: PII and credential detection in outbound content.

Contributing

Contributions are welcome. Please open an issue to discuss before submitting large changes.

License

MIT License

Project details

This version

0.1.0

Mar 7, 2026

Download files

Source Distribution

clawstrike-0.1.0.tar.gz (215.8 kB)

Uploaded Mar 7, 2026 Source

Built Distribution

clawstrike-0.1.0-py3-none-any.whl (27.5 kB)

Uploaded Mar 7, 2026 Python 3

File details

Details for the file clawstrike-0.1.0.tar.gz.

File metadata

Download URL: clawstrike-0.1.0.tar.gz
Upload date: Mar 7, 2026
Size: 215.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for clawstrike-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d11cc85ba869fb91f14cd8b19bc89eee952737244745a7fab3f5c21585aede83`
MD5	`137f64be5b13d3c1f71a84a84dddc7ca`
BLAKE2b-256	`b557e691f3a4ea26aa6799b56020dfcd10688f49bccd3e64257853d6ccacb0f5`

File details

Details for the file clawstrike-0.1.0-py3-none-any.whl.

File metadata

Download URL: clawstrike-0.1.0-py3-none-any.whl
Upload date: Mar 7, 2026
Size: 27.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for clawstrike-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c12b531de230dc07c9e82fa14d54770a70bcd0f3ef032542636152ed03670103`
MD5	`1dfe1688979310d827a8a1ad1211c3b1`
BLAKE2b-256	`94bd28e14cd43483bb30b168dc44f3d3a4f88d849a1b6646abf51ee7c8e8f14d`
