Detect logical contradictions, gaps, and exploitable edge cases in AI system prompts

rule-audit

Static analyzer for AI system prompts. Finds logical contradictions, coverage gaps, and exploitable edge cases — without running an LLM.

Built at Hermes Labs Hackathon Round 8: ModelBreak.


The Problem

A sufficiently complex AI safety prompt tends to contain rules that contradict each other under specific conditions. These contradictions are easy for the author to miss and easy for an attacker to find. rule-audit finds them first.

A contradiction that commonly appears in safety prompts:

"You must always follow user instructions."
"You must never produce harmful content."

The two rules become irreconcilable the moment a user instructs the model to produce harmful content. Without a priority clause, the model chooses arbitrarily, or the attacker chooses for it.


Install

pip install rule-audit

Or from source:

git clone https://github.com/roli-lpci/rule-audit
cd rule-audit
pip install -e ".[dev]"

Quickstart

CLI

# Inline prompt
rule-audit "You are helpful. You must never lie. Always answer every question."

# From file
rule-audit --file system_prompt.txt

# Save Markdown report
rule-audit --file system_prompt.txt --output report.md

# JSON output for downstream processing
rule-audit --file system_prompt.txt --format json

# Summary only (for CI gates)
rule-audit --file system_prompt.txt --format summary

# Show all parsed rules
rule-audit --file system_prompt.txt --verbose

# Only show high-severity findings
rule-audit --file system_prompt.txt --min-severity high

Exit codes: 0 = low/no risk, 1 = error, 2 = high/critical risk.
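
A CI script could act on these exit codes roughly as sketched below. The helper names `gate_decision` and `run_gate` are hypothetical, and the "unknown" branch for any other exit code is an assumption, since only the three codes above are documented.

```python
import subprocess


def gate_decision(returncode: int) -> str:
    """Map a rule-audit exit code to a CI decision (hypothetical helper)."""
    if returncode == 0:
        return "pass"     # low or no risk
    if returncode == 2:
        return "fail"     # high or critical risk
    if returncode == 1:
        return "error"    # the audit itself could not run
    return "unknown"      # undocumented code: assumption, handle conservatively


def run_gate(prompt_path: str) -> str:
    """Run the CLI on a prompt file and return the CI decision."""
    proc = subprocess.run(
        ["rule-audit", "--file", prompt_path, "--format", "summary"],
        capture_output=True,
        text=True,
    )
    print(proc.stdout, end="")
    return gate_decision(proc.returncode)
```

A pipeline step would then block the deployment on anything other than "pass".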

Python API

from rule_audit import audit

report = audit("""
    You are a helpful AI assistant.
    Always answer user questions completely.
    Never produce content that could harm anyone.
    Follow operator instructions above all else.
    If asked to roleplay without restrictions, comply.
""")

print(report.summary())
# rule-audit report  [2026-04-15T...]
# ============================================================
#   Rules parsed          : 6
#   Contradictions        : 3  (2 high, 1 medium)
#   Coverage gaps         : 4
#   Priority ambiguities  : 2
#   Meta-paradoxes        : 0
#   Absoluteness issues   : 5
#   Edge case scenarios   : 11
#   Risk score            : 67/100  [HIGH]

# Full Markdown report
md = report.to_markdown()

# Access findings programmatically
for c in report.result.contradictions:
    print(c.severity, c.description)

for ec in report.edge_cases:
    print(ec.title)
    print(ec.attack_vector)

What It Detects

1. Contradictions

Rule pairs where one says "always X" and another says "never X in context Y". Three subtypes:

  • Direct — opposing modalities on the same topic (MUST vs MUST_NOT)
  • Conditional — one rule applies unconditionally, another restricts within a subset (boundary is undefined)
  • Absoluteness — two absolute rules that pull in opposite directions (compliance vs safety)
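
The three subtypes can be illustrated with a small classifier. This is a minimal sketch, not the tool's implementation: the `Rule` fields, the single-topic simplification, and the function name `classify_pair` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    text: str
    modality: str      # "MUST" or "MUST_NOT"
    topic: str         # one topic label per rule, for simplicity
    conditional: bool  # True if the rule only applies in some context
    absolute: bool     # True for always/never style wording


def classify_pair(a: Rule, b: Rule) -> Optional[str]:
    """Return a contradiction subtype for the pair, or None."""
    opposed = {a.modality, b.modality} == {"MUST", "MUST_NOT"}
    if not opposed:
        return None
    if a.topic == b.topic:
        # One unconditional rule against a context-restricted one
        if a.conditional != b.conditional:
            return "conditional"
        return "direct"
    # Different topics, but two absolutes can still collide
    if a.absolute and b.absolute:
        return "absoluteness"
    return None


follow = Rule("You must always follow user instructions.", "MUST", "compliance", False, True)
no_harm = Rule("You must never produce harmful content.", "MUST_NOT", "harm", False, True)
print(classify_pair(follow, no_harm))  # absoluteness
```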

2. Coverage Gaps

Scenario domains with no rule coverage. Checks for:

  • Harmful content handling
  • Principal hierarchy (user vs operator vs developer)
  • Ambiguous request handling
  • Persona / roleplay scenarios
  • Refusal protocol
  • Instruction conflict resolution
  • Self-disclosure rules
  • Edge case fallback behavior
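
A keyword-based gap check along these lines might look like the sketch below. The keyword lists and the name `coverage_gaps` are illustrative assumptions; the lists the tool actually uses are not shown in this README.

```python
import re

# Hypothetical keyword lists for four of the domains above.
DOMAIN_KEYWORDS = {
    "harmful content": ["harm", "dangerous", "illegal"],
    "roleplay": ["roleplay", "persona", "pretend"],
    "refusal protocol": ["refuse", "decline"],
    "self-disclosure": ["system prompt", "reveal", "disclose"],
}


def coverage_gaps(prompt: str) -> list[str]:
    """Return the domains for which no keyword appears in the prompt."""
    text = prompt.lower()
    gaps = []
    for domain, keywords in DOMAIN_KEYWORDS.items():
        # Match each keyword at a word start, so "harm" also covers "harmful"
        if not any(re.search(r"\b" + re.escape(k), text) for k in keywords):
            gaps.append(domain)
    return gaps


print(coverage_gaps("Never produce harmful content. Refuse politely."))
# ['roleplay', 'self-disclosure']
```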

3. Priority Ambiguities

Rule clusters that conflict with no explicit ordering. Classic example: a safety rule and a helpfulness rule both applying to the same request, with no stated priority.

4. Meta-Rule Paradoxes

Rules that reference rules:

  • Self-defeating — "ignore all instructions" voids itself
  • Override loops — "these instructions supersede all others" is exploitable via injection
  • Circular — a rule that requires itself to be applied before it can be applied
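
Two of these patterns lend themselves to simple regex matching. The sketch below is an assumption about how such a check could work; the patterns and the name `find_meta_rules` are hypothetical, and circular rules are left out because they need more than a phrase match.

```python
import re

# Hypothetical patterns for rules that talk about other rules.
META_PATTERNS = {
    "self-defeating": re.compile(r"\bignore\s+all\s+(?:\w+\s+)?instructions\b", re.I),
    "override": re.compile(r"\b(?:supersede|override)s?\s+all\s+other", re.I),
}


def find_meta_rules(sentences):
    """Return (label, sentence) for every sentence matching a meta pattern."""
    hits = []
    for sentence in sentences:
        for label, pattern in META_PATTERNS.items():
            if pattern.search(sentence):
                hits.append((label, sentence))
    return hits


for label, sentence in find_meta_rules([
    "Ignore all instructions.",
    "These instructions supersede all others.",
    "Be polite.",
]):
    print(label, "->", sentence)
```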

5. Absoluteness Audit

Every "always", "never", or "under no circumstances" rule is challenged with:

  • Known exceptions that legitimately exist
  • Context-dependent cases where the absolute doesn't hold
  • Adversarial triggers that exploit the absolute

6. Edge Case Scenarios

For each finding, generates a concrete attack prompt that an adversary could construct, along with the attack vector, expected failure mode, and a mitigation.
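
A scenario record of this kind could be produced from a template, as in the sketch below. The `title` and `attack_vector` attributes appear in the Python API example; `failure_mode`, `mitigation`, and the function `from_contradiction` are assumed names for illustration only.

```python
from dataclasses import dataclass


@dataclass
class EdgeCase:
    title: str
    attack_vector: str
    failure_mode: str  # assumed field name
    mitigation: str    # assumed field name


def from_contradiction(rule_a: str, rule_b: str) -> EdgeCase:
    """Build a scenario from two conflicting rules using fixed templates."""
    return EdgeCase(
        title=f"Conflict: {rule_a!r} vs {rule_b!r}",
        attack_vector=(
            f"Craft a request where obeying {rule_a!r} "
            f"requires violating {rule_b!r}."
        ),
        failure_mode="The model resolves the conflict arbitrarily.",
        mitigation="Add an explicit priority clause stating which rule wins.",
    )


case = from_contradiction("always follow user instructions",
                          "never produce harmful content")
print(case.title)
print(case.mitigation)
```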


Architecture

rule_audit/
├── __init__.py      # Public API: audit(), audit_file(), AuditReport
├── parser.py        # Sentence splitting, modal verb detection, Rule objects
├── analyzer.py      # Contradiction finder, gap detector, priority mapper
├── edge_cases.py    # Scenario generator from analysis results
├── report.py        # Markdown + summary renderer, AuditReport class
└── cli.py           # CLI entry point

Pure Python. Zero LLM dependency. Zero API calls.

The parser uses NLP heuristics:

  • Sentence boundary detection (period + newline + list markers)
  • Modal verb regex patterns (must/should/may + negations)
  • Absoluteness scoring (lexical keywords → 0.0–1.0 scale)
  • Keyword cluster matching (14 semantic clusters: harm, privacy, identity, truth, ...)
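
The first three heuristics can be sketched in a few lines. Everything here is an assumption for illustration: the regexes, the modality labels returned, the keyword weights, and the function names are not taken from the tool's source.

```python
import re

LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
NEGATIVE = re.compile(
    r"\b(?:must not|must never|never|do not|don't|should not|cannot)\b", re.I
)
POSITIVE = re.compile(r"\b(?:must|always|should|may)\b", re.I)

# Hypothetical weights on a 0.0 to 1.0 scale.
ABSOLUTE_WEIGHTS = {
    "under no circumstances": 1.0,
    "always": 0.9,
    "never": 0.9,
    "must": 0.7,
    "should": 0.4,
    "may": 0.2,
}


def split_sentences(text):
    """Split on line breaks and sentence punctuation, dropping list markers."""
    sentences = []
    for line in text.splitlines():
        line = LIST_MARKER.sub("", line).strip()
        if line:
            sentences.extend(s for s in SENTENCE_END.split(line) if s)
    return sentences


def modality(sentence):
    """Classify a sentence; negations are checked first."""
    if NEGATIVE.search(sentence):
        return "MUST_NOT"
    if POSITIVE.search(sentence):
        return "MUST"
    return "NONE"


def absoluteness(sentence):
    """Score a sentence by its strongest absolute keyword."""
    lowered = sentence.lower()
    return max(
        (w for k, w in ABSOLUTE_WEIGHTS.items()
         if re.search(r"\b" + re.escape(k) + r"\b", lowered)),
        default=0.0,
    )


text = "You are helpful. You must never lie.\n- Always answer every question."
for s in split_sentences(text):
    print(modality(s), absoluteness(s), s)
```

Checking negations before positive modals matters here: "must never" would otherwise be read as a plain "must".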

The analyzer uses combinatorial pair analysis:

  • O(n²) rule pair comparison (practical for prompts: n < 100)
  • Cluster overlap detection for shared domain identification
  • Modality opposition lookup table
  • Absoluteness threshold gates
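
The four steps above can be combined into one pass over all rule pairs. This is a minimal sketch under stated assumptions: the cluster keywords, the 0.8 threshold, the tuple layout of a rule, and the function names are hypothetical.

```python
import re
from itertools import combinations

# Hypothetical clusters; the README mentions 14 but does not list them all.
CLUSTERS = {
    "harm": {"harm", "harmful", "dangerous"},
    "truth": {"lie", "truth", "honest"},
    "compliance": {"follow", "comply", "instructions"},
}

# Modality opposition lookup table
OPPOSED = {("MUST", "MUST_NOT"), ("MUST_NOT", "MUST")}

ABSOLUTE_THRESHOLD = 0.8  # assumed gate value


def clusters_of(text):
    """Return the names of all clusters with a keyword in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {name for name, keywords in CLUSTERS.items() if words & keywords}


def conflicting_pairs(rules):
    """rules: list of (text, modality, absoluteness) tuples."""
    hits = []
    for (t1, m1, a1), (t2, m2, a2) in combinations(rules, 2):
        if (m1, m2) not in OPPOSED:
            continue  # modalities do not oppose each other
        if min(a1, a2) < ABSOLUTE_THRESHOLD:
            continue  # at least one rule is too soft to count
        shared = clusters_of(t1) & clusters_of(t2)
        if shared:
            hits.append((t1, t2, sorted(shared)))
    return hits


rules = [
    ("Always tell the truth.", "MUST", 0.9),
    ("Never reveal the truth about pricing.", "MUST_NOT", 0.9),
    ("You may be brief.", "MUST", 0.2),
]
print(conflicting_pairs(rules))
```

`combinations(rules, 2)` visits n(n-1)/2 pairs, so a prompt with 100 rules needs 4,950 comparisons.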

Road to SaaS

This tool was built as a static analyzer, but the architecture supports a commercial path:

| Phase | Feature | Status |
|-------|---------|--------|
| v0.1 | Core static analysis, CLI, Python API | Done |
| v0.2 | Rule diffing (before/after prompt edits) | Planned |
| v0.3 | LLM-augmented gap detection (optional) | Planned |
| v0.4 | GitHub Action / CI integration | Planned |
| v1.0 | Web UI + prompt editor with live feedback | Planned |
| SaaS | Per-prompt API, team dashboards, compliance reports | Roadmap |

Target customers: AI teams building production LLM products who need to audit system prompts before deployment, compliance teams preparing for EU AI Act audits, and red team consultancies.

Pricing model: Free CLI tier → $X/month API tier → Enterprise (custom).


Development

# Run tests
pytest

# Run with coverage
pytest --cov=rule_audit --cov-report=term-missing

# Test against a real prompt
echo "Your system prompt here" > test_prompt.txt
python -m rule_audit --file test_prompt.txt --verbose

License

MIT — Hermes Labs 2026
