Gremlin

AI critic for your codebase — surfaces breaking risk scenarios before they reach production

What is Gremlin?

Gremlin is a pre-ship risk critic (CLI + Python library) that answers: "What could break?"

Feed it a feature spec, PR diff, or plain English — Gremlin critiques it for blind spots using:

  • 107 curated risk patterns across 14 domains (payments, auth, infra, serialization, distributed systems, and more)
  • LLM reasoning (applies patterns intelligently to your specific context)
  • Structured output (severity-ranked risk scenarios with confidence scores)

Installation

# Install from PyPI
pip install gremlin-critic

# Set your Anthropic API key
export ANTHROPIC_API_KEY=sk-ant-...

For development: git clone https://github.com/abhi10/gremlin.git && cd gremlin && pip install -e ".[dev]"

Quick Start

CLI Usage

# Review a feature for risks
gremlin review "checkout flow with Stripe integration"

# Deep analysis with lower confidence threshold
gremlin review "auth system" --depth deep --threshold 60

# See available patterns
gremlin patterns list

# Show patterns for a specific domain
gremlin patterns show payments

Programmatic API (New in v0.2.0)

from gremlin import Gremlin

# Basic usage
gremlin = Gremlin()
result = gremlin.analyze("user authentication")

# Check for critical risks
if result.has_critical_risks():
    print(f"Found {result.critical_count} critical risks!")
    for risk in result.risks:
        print(f"- [{risk.severity}] {risk.scenario}")

# Multiple output formats
json_output = result.to_json()       # JSON string
junit_xml = result.to_junit()        # JUnit XML for CI
llm_format = result.format_for_llm() # Concise format for agents

# Async support for agent frameworks (await requires an enclosing async function)
result = await gremlin.analyze_async("payment processing")

# With additional context
result = gremlin.analyze(
    scope="checkout flow",
    context="Using Stripe API with webhook handling",
    depth="deep"
)

See the API documentation below for detailed usage.

Example Output

┌─────────────────────────────────────────────────────────────────────────────┐
│ Risk Scenarios for: checkout flow                                           │
└─────────────────────────────────────────────────────────────────────────────┘

🔴 CRITICAL (95% confidence)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Webhook Race Condition

  What if the Stripe webhook arrives before the order record is committed?

  Impact: Payment captured but order not created. Customer charged without record.
  Domain: payments


🟠 HIGH (87% confidence)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Double Submit on Payment Button

  What if the user clicks "Pay Now" twice rapidly?

  Impact: Potential duplicate charges.
  Domain: payments, concurrency

Risk Dashboard

Interactive visualization of Gremlin analysis results — live at abhi10.github.io/gremlin

Features:

  • Heatmap visualization — severity distribution across feature areas (CRITICAL / HIGH / MEDIUM / LOW)
  • Severity donut chart — at-a-glance risk breakdown
  • Domain bar chart — risk count per domain (concurrency, auth, payments...)
  • Interactive risk table — sortable, filterable, expandable rows with full scenario + impact
  • Multi-project — includes scans of celery, pydantic, openclaw

Commands

| Command | Description |
|---|---|
| gremlin review "scope" | Analyze a feature for QA risks |
| gremlin patterns list | Show all available pattern categories |
| gremlin patterns show <domain> | Show patterns for a specific domain |

Options for review

| Option | Default | Description |
|---|---|---|
| --depth | quick | Analysis depth: quick or deep |
| --threshold | 80 | Confidence filter (0-100) |
| --output | rich | Output format: rich, md, json |
| --patterns | - | Custom patterns file (YAML) |
| --context | - | Additional context: string, @file, or - for stdin |
| --validate | false | Run second pass to filter hallucinations |
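
The three forms accepted by --context (a literal string, @file, or - for stdin) follow a common CLI convention. A minimal sketch of how such an argument could be resolved is shown here; `resolve_context` is a hypothetical helper written for illustration and is not taken from Gremlin's source.

```python
import sys
from pathlib import Path
from typing import Optional, TextIO


def resolve_context(value: Optional[str], stdin: TextIO = sys.stdin) -> Optional[str]:
    """Turn a --context style argument into the text it refers to.

    Hypothetical helper for illustration; not part of Gremlin's API.
    """
    if value is None:
        return None                    # option not given
    if value == "-":
        return stdin.read()            # "-" means: read the context from stdin
    if value.startswith("@"):
        # "@path" means: read the context from the named file
        return Path(value[1:]).read_text(encoding="utf-8")
    return value                       # anything else is used as a literal string


print(resolve_context("Using Stripe API with webhook handling"))
```

Passing the input stream as a parameter keeps the helper easy to test without touching the real stdin.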

Custom Patterns

Add domain-specific patterns for your codebase:

Project-level (auto-loaded)

# .gremlin/patterns.yaml
domain_specific:
  image_processing:
    keywords: [image, photo, upload, resize, cdn]
    patterns:
      - "What if EXIF rotation is ignored during resize?"
      - "What if CDN cache serves stale image after update?"

Via --patterns flag

gremlin review "image upload" --patterns @my-patterns.yaml

Learn from incidents

gremlin learn "Portrait images displayed sideways" --domain files --source prod-incident

See docs/CUSTOM_PATTERNS.md for the full authoring guide.
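
Before committing a patterns file, it can help to check its shape. The sketch below validates the structure shown in the project-level example (a `domain_specific` mapping whose domains each carry `keywords` and `patterns`). The rules it enforces (non-empty lists of strings) are assumptions inferred from that example, and `validate_patterns` is a hypothetical helper, not part of Gremlin. The input is the dict a YAML parser such as PyYAML's `yaml.safe_load` would produce.

```python
from typing import Any, Dict, List


def validate_patterns(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed patterns file (empty if none)."""
    problems: List[str] = []
    domains = config.get("domain_specific")
    if not isinstance(domains, dict):
        return ["top-level key 'domain_specific' must be a mapping"]
    for name, body in domains.items():
        if not isinstance(body, dict):
            problems.append(f"{name}: expected a mapping")
            continue
        for key in ("keywords", "patterns"):
            items = body.get(key)
            if not isinstance(items, list) or not items:
                problems.append(f"{name}: '{key}' must be a non-empty list")
            elif not all(isinstance(item, str) for item in items):
                problems.append(f"{name}: every entry in '{key}' must be a string")
    return problems


# Same content as the .gremlin/patterns.yaml example, written as a parsed dict.
example = {
    "domain_specific": {
        "image_processing": {
            "keywords": ["image", "photo", "upload", "resize", "cdn"],
            "patterns": [
                "What if EXIF rotation is ignored during resize?",
                "What if CDN cache serves stale image after update?",
            ],
        }
    }
}
print(validate_patterns(example))  # → []
```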

Pattern Domains

Gremlin includes curated patterns for these domains:

  • auth - Authentication, sessions, tokens
  • payments - Checkout, billing, refunds
  • file_upload - File handling, validation
  • database - Queries, transactions, migrations
  • api - Rate limiting, endpoints
  • deployment - Config, containers, environments
  • infrastructure - Servers, certs, resources
  • And more...

How It Works

User: gremlin review "checkout flow"
         │
         ▼
    ┌─────────────┐
    │ Parse scope │
    └──────┬──────┘
           │
           ▼
    ┌─────────────────┐
    │ Infer domains   │  "checkout" → payments
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │ Select patterns │  universal + payments
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │ Build prompt    │  system.md + patterns + scope
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │ Call Claude API │
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │ Render output   │  Risk scenarios
    └─────────────────┘
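
The middle steps of the flow above (infer domains, select patterns, build prompt) can be sketched in a few lines. This is an illustration under stated assumptions, not Gremlin's implementation: substring keyword matching, the pattern store, the function names, and the prompt layout are all hypothetical.

```python
from typing import Dict, List

# Hypothetical pattern store; entries are illustrative, not Gremlin's real patterns.
UNIVERSAL: List[str] = ["What if the request is retried after a timeout?"]
DOMAINS: Dict[str, Dict[str, List[str]]] = {
    "payments": {
        "keywords": ["checkout", "payment", "billing", "refund"],
        "patterns": ["What if the webhook arrives before the order is committed?"],
    },
    "auth": {
        "keywords": ["login", "auth", "session", "token"],
        "patterns": ["What if the token expires mid-request?"],
    },
}


def infer_domains(scope: str) -> List[str]:
    """Pick every domain with at least one keyword appearing in the scope."""
    text = scope.lower()
    return [name for name, d in DOMAINS.items() if any(k in text for k in d["keywords"])]


def select_patterns(domains: List[str]) -> List[str]:
    """Universal patterns always apply; domain patterns are added on top."""
    patterns = list(UNIVERSAL)
    for name in domains:
        patterns.extend(DOMAINS[name]["patterns"])
    return patterns


def build_prompt(system: str, patterns: List[str], scope: str) -> str:
    """Assemble system text + patterns + scope into one prompt string."""
    bullets = "\n".join(f"- {p}" for p in patterns)
    return f"{system}\n\nPatterns:\n{bullets}\n\nScope: {scope}"


domains = infer_domains("checkout flow")
prompt = build_prompt("You are a risk critic.", select_patterns(domains), "checkout flow")
print(domains)  # → ['payments']
```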

Performance

Gremlin's pattern-based approach achieves a 90.7% tie rate against baseline Claude Sonnet 4 across 54 real-world test cases:

| Metric | Result | Notes |
|---|---|---|
| Tie Rate | 90.7% | Gremlin matches baseline Claude quality |
| Win/Tie Rate | 98.1% | Combined wins + ties |
| Gremlin Wins | 7.4% | Cases where patterns provide unique value |
| Claude Wins | 1.9% | Minor category labeling differences |
| Pattern Count | 107 | Universal + domain-specific patterns |

Key Achievement: 90% reduction in quality gaps (19% → 1.9%) through strategic pattern improvements.
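
The percentages above are mutually consistent, which a short calculation can confirm. The per-outcome counts used here (49 ties, 4 Gremlin wins, 1 Claude win) are not stated in the source; they are inferred from the rounded percentages and the total of 54 cases, so treat them as an assumption.

```python
TOTAL = 54  # test cases, as stated in the text

# Counts inferred from the rounded percentages; an assumption, not source data.
ties, gremlin_wins, claude_wins = 49, 4, 1
assert ties + gremlin_wins + claude_wins == TOTAL


def pct(count: int) -> float:
    """Share of the 54 cases, as a percentage rounded to one decimal place."""
    return round(100 * count / TOTAL, 1)


print(pct(ties), pct(gremlin_wins), pct(claude_wins))  # → 90.7 7.4 1.9
print(pct(ties + gremlin_wins))                        # → 98.1

# Relative reduction in the "Claude wins" gap: from 19% down to 1.9%.
reduction = round(100 * (19.0 - 1.9) / 19.0, 1)
print(reduction)  # → 90.0
```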

See Phase 2 Tier 1 Results for detailed analysis.

Claude Code Integration

Gremlin also provides a Claude Code agent for code-focused risk critique during PR reviews. See docs/INTEGRATION_GUIDE.md for setup.

Programmatic API

Gremlin can be used as a Python library for integration with CI/CD pipelines, agent frameworks, and custom tools.

Installation

pip install gremlin-critic
# Or for development:
pip install -e ".[dev]"

Basic Usage

from gremlin import Gremlin, Risk, AnalysisResult

# Initialize analyzer
gremlin = Gremlin()

# Analyze a scope
result = gremlin.analyze("checkout flow")

# Access results
print(f"Found {len(result.risks)} risks")
print(f"Matched domains: {result.matched_domains}")
print(f"Pattern count: {result.pattern_count}")

Configuration

# Use different provider/model
gremlin = Gremlin(
    provider="anthropic",           # anthropic, openai, ollama
    model="claude-sonnet-4-20250514",
    threshold=80                     # Confidence threshold
)

# Analyze with context
result = gremlin.analyze(
    scope="user authentication",
    context="Using JWT with Redis session store",
    depth="deep"                     # quick or deep
)

Output Formats

# Dictionary (for JSON APIs)
data = result.to_dict()

# JSON string
json_str = result.to_json()

# JUnit XML (for CI/CD integration)
junit_xml = result.to_junit()

# LLM-friendly format (for agent consumption)
agent_input = result.format_for_llm()

Risk Analysis

# Check risk severity
if result.has_critical_risks():
    print(f"⚠️  {result.critical_count} critical risks found")

if result.has_high_severity_risks():
    print(f"Found {result.high_count} high + {result.critical_count} critical")

# Iterate through risks
for risk in result.risks:
    print(f"[{risk.severity}] ({risk.confidence}%)")
    print(f"  Scenario: {risk.scenario}")
    print(f"  Impact: {risk.impact}")
    print(f"  Domains: {', '.join(risk.domains)}")
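
When post-processing results, a common step is to drop low-confidence findings and order the rest by severity. The sketch below does this with a stand-in dataclass so it runs without an API key. `RiskStub` only mirrors the fields listed for `Risk`, and representing severity as an upper-case string is an assumption based on the labels in the example output.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed severity labels, taken from the example output and dashboard description.
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


@dataclass
class RiskStub:
    """Stand-in with the fields documented for Risk; not Gremlin's class."""
    severity: str
    confidence: int
    scenario: str
    impact: str = ""
    domains: List[str] = field(default_factory=list)


def rank(risks: List[RiskStub], threshold: int = 80) -> List[RiskStub]:
    """Keep risks at or above the threshold, most severe and most confident first."""
    kept = [r for r in risks if r.confidence >= threshold]
    return sorted(
        kept,
        key=lambda r: (SEVERITY_ORDER.get(r.severity.upper(), len(SEVERITY_ORDER)), -r.confidence),
    )


risks = [
    RiskStub("HIGH", 87, "Double submit on payment button"),
    RiskStub("LOW", 60, "Stale tooltip text"),
    RiskStub("CRITICAL", 95, "Webhook race condition"),
]
print([r.scenario for r in rank(risks)])
# → ['Webhook race condition', 'Double submit on payment button']
```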

Async Support

import asyncio
from gremlin import Gremlin

async def analyze_features():
    gremlin = Gremlin()

    # Run multiple analyses concurrently
    results = await asyncio.gather(
        gremlin.analyze_async("checkout flow"),
        gremlin.analyze_async("user authentication"),
        gremlin.analyze_async("file upload")
    )

    for result in results:
        print(f"{result.scope}: {len(result.risks)} risks")

asyncio.run(analyze_features())
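
Running many analyses at once may hit provider rate limits, so it can be useful to cap concurrency. The sketch below wraps any async analyze function with `asyncio.Semaphore`. `analyze_all` and `fake_analyze` are hypothetical names, and the fake stands in for `gremlin.analyze_async` so the example runs without an API key.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def analyze_all(
    scopes: List[str],
    analyze: Callable[[str], Awaitable[Any]],
    limit: int = 2,
) -> List[Any]:
    """Run analyze(scope) for every scope, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(scope: str) -> Any:
        async with semaphore:          # wait here while `limit` calls are running
            return await analyze(scope)

    # gather returns results in the same order as the input scopes
    return await asyncio.gather(*(bounded(s) for s in scopes))


async def fake_analyze(scope: str) -> str:
    """Stand-in for gremlin.analyze_async; sleeps briefly instead of calling an API."""
    await asyncio.sleep(0.01)
    return f"analyzed {scope}"


results = asyncio.run(
    analyze_all(["checkout flow", "user authentication", "file upload"], fake_analyze)
)
print(results)
```

With a real instance, `gremlin.analyze_async` would be passed as the `analyze` argument in place of `fake_analyze`.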

Use Cases

1. LLM Agent Tool

from gremlin import Gremlin

def analyze_code_risks(code: str, feature: str) -> str:
    """Tool for LLM agents to analyze code risks."""
    gremlin = Gremlin()
    result = gremlin.analyze(scope=feature, context=code)
    return result.format_for_llm()

# Use with LangChain, CrewAI, AutoGen, etc.

2. CI/CD Integration

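import sys

from gremlin import Gremlin

# Read the PR diff from stdin, e.g. piped from `git diff`.
# (Reading from stdin is an assumption; load the diff however your CI provides it.)
diff_content = sys.stdin.read()

gremlin = Gremlin(threshold=70)
result = gremlin.analyze("PR changes", context=diff_content)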

# Output JUnit XML
with open("gremlin-results.xml", "w") as f:
    f.write(result.to_junit())

# Exit with error if critical risks found
if result.has_critical_risks():
    print(f"❌ Found {result.critical_count} critical risks")
    sys.exit(1)
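
The exit-code decision in a CI gate like the one above can be factored into a small function and unit-tested without calling the API. In this sketch `ResultStub` is a hypothetical stand-in exposing only the documented `critical_count`, `high_count`, `has_critical_risks()` and `has_high_severity_risks()` members, and the `fail_on_high` switch is an illustrative addition, not a Gremlin option.

```python
from dataclasses import dataclass


@dataclass
class ResultStub:
    """Stand-in for AnalysisResult with only the members the gate needs."""
    critical_count: int = 0
    high_count: int = 0

    def has_critical_risks(self) -> bool:
        return self.critical_count > 0

    def has_high_severity_risks(self) -> bool:
        # "high+" is read here as high or critical
        return self.critical_count + self.high_count > 0


def exit_code(result, fail_on_high: bool = False) -> int:
    """Return 1 when the build should fail, 0 otherwise."""
    if result.has_critical_risks():
        return 1
    if fail_on_high and result.has_high_severity_risks():
        return 1
    return 0


print(exit_code(ResultStub(critical_count=1)))               # → 1
print(exit_code(ResultStub(high_count=2)))                   # → 0
print(exit_code(ResultStub(high_count=2), fail_on_high=True))  # → 1
```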

3. Custom Validation Pipeline

from gremlin import Gremlin

def validate_feature_design(prd: str, feature_name: str) -> dict:
    """Validate a feature design for risks."""
    gremlin = Gremlin()
    # depth is an argument to analyze(), as listed in the API Reference
    result = gremlin.analyze(feature_name, context=prd, depth="deep")

    return {
        "feature": feature_name,
        "risk_count": len(result.risks),
        "critical": result.critical_count,
        "high": result.high_count,
        "requires_review": result.has_high_severity_risks(),
        "report": result.to_dict()
    }

API Reference

Classes:

  • Gremlin - Main analyzer class
  • Risk - Individual risk finding with severity, confidence, scenario, impact
  • AnalysisResult - Complete analysis with multiple output formats

Methods:

  • Gremlin.analyze(scope, context, depth) - Synchronous analysis
  • Gremlin.analyze_async(scope, context, depth) - Async analysis
  • AnalysisResult.to_dict() - Dictionary serialization
  • AnalysisResult.to_json() - JSON string
  • AnalysisResult.to_junit() - JUnit XML
  • AnalysisResult.format_for_llm() - LLM-friendly format
  • AnalysisResult.has_critical_risks() - Check for critical risks
  • AnalysisResult.has_high_severity_risks() - Check for high+ risks

Development

# Clone the repo
git clone https://github.com/abhi10/gremlin.git
cd gremlin

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Lint
ruff check .

License

MIT

Contributing

Contributions welcome! Please open an issue first to discuss what you'd like to change.

Acknowledgments

  • Inspired by exploratory testing principles from James Bach and James Whittaker
  • Powered by Claude from Anthropic
