ClawStrike

Security guardrails for AI agents — prompt injection detection, trust-aware input scanning, and advisory action gating.


The Problem

AI agents like OpenClaw now have real system access — shell execution, email, calendars, file systems — and users grant this willingly because autonomy is the whole point. OpenClaw alone has surpassed 200k GitHub stars in weeks, signaling massive demand for agents that act on your behalf. But the security model hasn't kept up.

The primary risk isn't the agent itself — it's the inputs. Every email body, group chat message, webhook payload, and skill data feed is a potential prompt injection vector. A carefully crafted message from any of these channels can instruct the agent to take actions the user never intended — sending emails, exfiltrating files, modifying system configuration — all while operating within its granted permissions.

Most emerging security approaches focus on sandboxing: isolating the agent, limiting blast radius, restricting what it can do. That matters, but it doesn't address the input layer. A sandboxed agent can still be manipulated into misusing every permission it legitimately has. If an agent is allowed to send emails, a sandbox won't stop a prompt injection from composing and sending one.

What's missing is input-layer defense: scanning content before it reaches the agent, differentiating trust based on where input comes from (an owner's DM is not the same threat as an unsolicited email body), and gating risky actions with a review step before they execute. These need to work together — without trust-aware scanning, action gating is flying blind; without action gating, a bypassed classifier has no safety net.

Even OpenClaw's own documentation acknowledges there is no "perfectly secure" setup. ClawStrike doesn't claim to be one either. What it provides is a layered defense that makes attacks harder, catches the common cases, and gives you a forensic trail when something does go wrong.

What ClawStrike Does

ClawStrike is a security layer you install as a skill in your AI agent. It instructs the agent to check with ClawStrike before acting on any input — scanning content, evaluating trust, and gating risky actions.

The three pillars

| Pillar | What it does | Why it matters |
|---|---|---|
| Classify | Scans every inbound message for prompt injection using Meta's Llama Prompt Guard 2 models | Catches injection attempts before they reach the agent |
| Trust | Assigns a trust level based on the input channel (owner DM → high, email body → low, webhook → untrusted) and tracks contacts over time | A message from your own account isn't treated the same as an unsolicited email |
| Gate | Evaluates planned actions against a risk taxonomy and recommends allow, block, or prompt the user | Shell execution from an untrusted source gets blocked; a calendar read from the owner gets auto-allowed |

These three work together. The classifier's sensitivity adjusts based on trust — untrusted sources face stricter thresholds. The gating engine uses both the action's risk level and the session's trust level to decide what to recommend. And if the classifier flags something suspicious, the gating engine automatically tightens for the rest of that session.
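
To make the interplay concrete, here is a minimal sketch of how a gating recommendation might combine risk, trust, and a flagged session. It is illustrative only: the `Risk` levels, the matrix entries, the one-step trust demotion, and the `recommend` function are assumptions for this example, not ClawStrike's actual taxonomy or API.

```python
from enum import IntEnum


class Trust(IntEnum):
    UNTRUSTED = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Risk(IntEnum):
    LOW = 0     # e.g. reading a calendar
    MEDIUM = 1  # e.g. sending an email
    HIGH = 2    # e.g. shell execution


def recommend(risk: Risk, trust: Trust, session_flagged: bool = False) -> str:
    """Return 'allow', 'prompt', or 'block' under a hypothetical policy."""
    # Assumption: once the classifier has flagged something, the session
    # is treated as one trust step lower for the rest of its lifetime.
    if session_flagged and trust > Trust.UNTRUSTED:
        trust = Trust(trust - 1)

    if risk == Risk.HIGH:
        return "prompt" if trust == Trust.HIGH else "block"
    if risk == Risk.MEDIUM:
        if trust == Trust.HIGH:
            return "allow"
        return "prompt" if trust >= Trust.LOW else "block"
    # Low-risk actions are allowed unless the source is untrusted.
    return "allow" if trust >= Trust.LOW else "prompt"
```

With this policy, `recommend(Risk.HIGH, Trust.UNTRUSTED)` yields a block and `recommend(Risk.LOW, Trust.HIGH)` an allow, matching the two cases in the table above.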

Every decision — classification, trust change, gating recommendation, user approval — is written to a local audit log for forensic review.
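
As a rough picture of what such a log entry could look like, the sketch below appends one JSON object per line. The field names, the snippet length, and both helper functions are hypothetical; ClawStrike's real log format may differ. The `log_raw_input` flag mirrors the `audit.log_raw_input` setting described under Configuration.

```python
import hashlib
import json
import time
from pathlib import Path


def audit_record(event: str, decision: str, text: str,
                 log_raw_input: bool = True, snippet_len: int = 80) -> dict:
    """Build one audit entry (hypothetical schema)."""
    record = {
        "ts": time.time(),
        "event": event,        # e.g. "classification", "gating"
        "decision": decision,  # e.g. "block", "allow", "prompt"
        # The hash identifies the input without storing its content.
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    if log_raw_input:
        record["input_snippet"] = text[:snippet_len]
    return record


def append_record(path: Path, record: dict) -> None:
    """Append the entry as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

An append-only, line-per-event file is easy to tail and to grep during an investigation, which is the main reason for choosing it in this sketch.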

A typical flow

External input arrives (email, group chat, webhook, ...)
         │
         ▼
   ┌───────────┐
   │  Classify │──── Score ≥ block threshold? ──► Block. Notify owner.
   └───────────┘
         │ pass or flag
         ▼
   ┌───────────┐
   │   Trust   │──── Resolve trust from channel + contact history
   └───────────┘     Adjust classifier thresholds accordingly
         │
         ▼
     Agent acts
         │
         ▼
   ┌───────────┐
   │   Gate    │──── Risk level + trust level → allow / block / prompt user
   └───────────┘
         │
         ▼
   ┌───────────┐
   │ Audit Log │──── Every decision recorded
   └───────────┘
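
The Classify step in this flow can be sketched as a comparison of the classifier score against two thresholds. The base values 0.92 and 0.70 are the documented defaults; the per-trust `TIGHTEN` offsets and the function names are assumptions made for illustration, since the exact adjustment ClawStrike applies is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    block: float = 0.92  # documented default
    flag: float = 0.70   # documented default


# Hypothetical offsets: less trusted sources get lower (stricter) thresholds.
TIGHTEN = {"high": 0.0, "medium": 0.0, "low": 0.05, "untrusted": 0.10}


def effective(base: Thresholds, trust: str) -> Thresholds:
    """Lower both thresholds by the offset assumed for this trust level."""
    delta = TIGHTEN[trust]
    return Thresholds(block=base.block - delta, flag=base.flag - delta)


def verdict(score: float, trust: str, base: Thresholds = Thresholds()) -> str:
    """Map an injection score in [0, 1] to 'block', 'flag', or 'pass'."""
    t = effective(base, trust)
    if score >= t.block:
        return "block"
    if score >= t.flag:
        return "flag"
    return "pass"
```

Under these assumed offsets, a score of 0.85 is only flagged when it comes from a high-trust channel but is blocked when it comes from an untrusted one.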

It learns over time

ClawStrike starts strict and relaxes as it learns your patterns. New contacts begin as untrusted and earn trust through repeated safe interactions, eventually reaching their channel's default trust level. Actions you approve can be added to an allowlist so you aren't prompted for the same routine operation twice. The system adapts to your workflows while staying vigilant for novel or untrusted activity.
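
One way to picture trust being earned is a counter that promotes a contact one level at a time, never past the channel default. The sketch below is an assumption-laden illustration: the `Contact` class, the level names as an ordered list, and the figure of five safe interactions per step are all hypothetical.

```python
from dataclasses import dataclass

LEVELS = ["untrusted", "low", "medium", "high"]
SAFE_INTERACTIONS_PER_STEP = 5  # hypothetical promotion threshold


@dataclass
class Contact:
    channel_default: str       # ceiling for this contact's trust
    level: str = "untrusted"   # new contacts start untrusted
    safe_streak: int = 0

    def record_safe_interaction(self) -> None:
        """Count a safe interaction and promote one level when due."""
        self.safe_streak += 1
        cap = LEVELS.index(self.channel_default)
        current = LEVELS.index(self.level)
        if self.safe_streak >= SAFE_INTERACTIONS_PER_STEP and current < cap:
            self.level = LEVELS[current + 1]
            self.safe_streak = 0
```

A contact on an email channel whose default is low would reach low after five safe interactions and stay there, however many more follow.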

Advisory mode (MVP)

In the current release, ClawStrike operates in skill mode: it advises the agent, and the agent is instructed to comply. This is effective against unsophisticated attacks and provides full visibility, but a sufficiently advanced injection could instruct the agent to ignore the skill. Enforcement-grade interception, where blocked content never reaches the agent at all, is planned for a future release.

ClawStrike integrates via two methods:

| Method | How it works | Best for |
|---|---|---|
| MCP | Persistent process, agent calls ClawStrike tools directly | MCP-capable agents (full feature set including session tracking) |
| CLI | One-shot shell commands (clawstrike classify, clawstrike gate) | Any agent with shell access (e.g., OpenClaw) |

Getting Started

Prerequisites: Prompt Guard model access

ClawStrike uses Meta's Llama Prompt Guard 2 for prompt injection detection. Before installing, you need to request access to the model weights:

  1. Create a Hugging Face account if you don't have one
  2. Visit the Hugging Face model page for your chosen Prompt Guard 2 model and accept Meta's license
  3. Generate a read-only access token at huggingface.co/settings/tokens

Option A: Docker (recommended for new setups)

Best if you're setting up OpenClaw and ClawStrike together from scratch.

git clone https://github.com/yogur/ClawStrike && cd ClawStrike
cp clawstrike.example.yaml clawstrike.yaml    # edit to taste
cp .env.example .env                           # add HF_TOKEN + LLM credentials
bash docker-setup.sh

The setup script builds the image, downloads the model, runs OpenClaw onboarding, and starts the gateway. First run takes a few minutes for the model download; subsequent starts are fast.

See the full Docker setup guide for details, volume reference, and troubleshooting.

Option B: Direct install (pip / uv)

Best if you already have OpenClaw running and want to add ClawStrike alongside it.

pip install clawstrike                         # or: uv add clawstrike

# Install Hugging Face CLI and authenticate
# See: https://huggingface.co/docs/huggingface_hub/en/guides/cli
pip install "huggingface_hub[cli]"             # quoted so shells like zsh don't expand the brackets
hf auth login --token "$HF_TOKEN" --add-to-git-credential

# Bootstrap config with secure defaults
clawstrike init

# Copy the ClawStrike skill into your OpenClaw skills directory
cp -r skills/clawstrike-cli /path/to/openclaw/skills/clawstrike

See the full direct setup guide for OpenClaw configuration, file permissions, and security recommendations.

Verify

clawstrike health
# {"status": "ok", "mode": "skill", "classifier": "multilingual", "mcp_enabled": false}
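
If you want to run this check from a script, the sketch below wraps the same command. It assumes `clawstrike health` prints a single JSON object to stdout, as in the sample output above, and exits with status 0 on success; `parse_health` and `check_health` are names invented for this example.

```python
import json
import subprocess


def parse_health(output: str) -> dict:
    """Parse the JSON printed by `clawstrike health`; raise if not ok."""
    report = json.loads(output)
    if report.get("status") != "ok":
        raise RuntimeError(f"ClawStrike is not healthy: {report}")
    return report


def check_health() -> dict:
    """Run `clawstrike health` (must be on PATH) and validate its output."""
    result = subprocess.run(
        ["clawstrike", "health"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on a non-zero exit code
    )
    return parse_health(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the check easy to test against captured output.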

Configuration

clawstrike init generates a clawstrike.yaml with secure defaults — see clawstrike.example.yaml for a fully annotated starting point.

The settings most users will want to review:

| Setting | What to decide | Default |
|---|---|---|
| classifier.model | "multilingual" (86M, multiple languages) or "english-only" (22M, lower memory) | multilingual |
| classifier.threshold.block / .flag | How aggressive detection should be; lower values catch more but risk false positives | 0.92 / 0.70 |
| trust.channel_defaults | Which input channels you consider high, medium, low, or untrusted trust | owner_dm=high, email=low, webhook=untrusted |
| trust.contacts | Specific senders to always trust or always block, regardless of channel | {} (none) |
| action_gating.allowlist_learning | Whether approving an action can create a permanent "always allow" rule | false (off) |
| audit.log_raw_input | Whether input text snippets are stored in the audit log, or only a hash | true (snippets stored) |

Example — tighten detection and block a known bad sender:

clawstrike:
  classifier:
    model: "english-only"
    threshold:
      block: 0.85          # lower = more aggressive blocking
      flag: 0.60

  trust:
    contacts:
      "attacker@evil.com": "blocked"
      "colleague@company.com": "trusted"

Security note: The config file controls ClawStrike's security policy. It should be owned by you (not the agent's service account) and not writable by the agent process. clawstrike init sets file permissions to 600 (owner read/write only) by default.
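
To confirm the permissions yourself on a POSIX system, a short check along these lines may help. It uses only the Python standard library; the function names are made up for this example, and the path you pass is whatever location your clawstrike.yaml lives at.

```python
import os
import stat
from pathlib import Path


def is_owner_only(path: Path) -> bool:
    """True if no group or other permission bits are set (e.g. mode 600)."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def restrict_to_owner(path: Path) -> None:
    """Set the file to owner read/write only, the equivalent of chmod 600."""
    os.chmod(path, 0o600)
```

Note that this covers only the permission bits. Ownership matters as well: if the agent runs under the same user account that owns the file, mode 600 does not stop it from writing.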

See the full configuration reference for all options.
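
When editing thresholds by hand, a quick sanity check can catch obvious mistakes before the config is deployed. The sketch below works on an already-parsed config dictionary shaped like the YAML example above. The rules it enforces (values within 0 to 1, flag lower than block) are this example's own assumptions, and ClawStrike may validate differently or not at all.

```python
def validate_thresholds(config: dict) -> list[str]:
    """Return a list of problems found in the classifier thresholds."""
    threshold = (
        config.get("clawstrike", {}).get("classifier", {}).get("threshold", {})
    )
    # Fall back to the documented defaults when a value is not set.
    block = threshold.get("block", 0.92)
    flag = threshold.get("flag", 0.70)

    problems = []
    for name, value in (("block", block), ("flag", flag)):
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            problems.append(f"{name} must be a number between 0 and 1, got {value!r}")
    # Assumed rule: flagging should trigger at a lower score than blocking.
    if not problems and flag >= block:
        problems.append(f"flag ({flag}) should be lower than block ({block})")
    return problems
```

An empty list means the thresholds passed these checks; loading the YAML file into a dictionary is left to whichever parser you already use.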

Project Status

ClawStrike is in MVP (Skill Mode). The core guardrails — prompt injection detection, trust tiers, action gating, and audit logging — are functional and tested.

In this release, all guardrail decisions are advisory: the agent is instructed to comply via the skill's system prompt, but is not mechanically prevented from ignoring recommendations. This is effective against common attacks and provides full visibility through the audit log, but is not a hard security boundary.

Coming next:

  • Enforcement mode (Proxy Mode) — ClawStrike intercepts LLM API calls directly, blocking dangerous tool calls before they reach the agent. No LLM cooperation required.
  • LLM-as-Judge — Semantic intent verification for ambiguous cases.
  • Skill scanner — Static analysis of agent skills before installation.
  • Output guardrails — PII and credential detection in outbound content.

Contributing

Contributions are welcome. Please open an issue to discuss before submitting large changes.

License

MIT License
