Pre-ship risk critic (CLI + Python library) — surfaces breaking risk scenarios before they reach production

Gremlin

AI critic for your codebase — surfaces breaking risk scenarios before they reach production

What is Gremlin?

Gremlin is a pre-ship risk critic (CLI + Python library) that answers: "What could break?"

Feed it a feature spec, PR diff, or plain English — Gremlin critiques it for blind spots using:

  • 93 curated risk patterns across 12 domains (payments, auth, infra, security, and more)
  • LLM reasoning (applies patterns intelligently to your specific context)
  • Structured output (severity-ranked risk scenarios with confidence scores)

Installation

# Clone the repo
git clone https://github.com/abhi10/gremlin.git
cd gremlin

# Install
pip install -e .

# Set your Anthropic API key
export ANTHROPIC_API_KEY=sk-ant-...

Quick Start

CLI Usage

# Review a feature for risks
gremlin review "checkout flow with Stripe integration"

# Deep analysis with lower confidence threshold
gremlin review "auth system" --depth deep --threshold 60

# See available patterns
gremlin patterns list

# Show patterns for a specific domain
gremlin patterns show payments

Programmatic API (New in v0.2.0)

from gremlin import Gremlin

# Basic usage
gremlin = Gremlin()
result = gremlin.analyze("user authentication")

# Check for critical risks
if result.has_critical_risks():
    print(f"Found {result.critical_count} critical risks!")
    for risk in result.risks:
        print(f"- [{risk.severity}] {risk.scenario}")

# Multiple output formats
json_output = result.to_json()       # JSON string
junit_xml = result.to_junit()        # JUnit XML for CI
llm_format = result.format_for_llm() # Concise format for agents

# Async support for agent frameworks (call from inside an async function)
result = await gremlin.analyze_async("payment processing")

# With additional context
result = gremlin.analyze(
    scope="checkout flow",
    context="Using Stripe API with webhook handling",
    depth="deep"
)

See the API documentation below for detailed usage.

Example Output

┌─────────────────────────────────────────────────────────────────────────────┐
│ Risk Scenarios for: checkout flow                                           │
└─────────────────────────────────────────────────────────────────────────────┘

🔴 CRITICAL (95% confidence)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Webhook Race Condition

  What if the Stripe webhook arrives before the order record is committed?

  Impact: Payment captured but order not created. Customer charged without record.
  Domain: payments


🟠 HIGH (87% confidence)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Double Submit on Payment Button

  What if the user clicks "Pay Now" twice rapidly?

  Impact: Potential duplicate charges.
  Domain: payments, concurrency

Commands

Command                          Description
gremlin review "scope"           Analyze a feature for QA risks
gremlin patterns list            Show all available pattern categories
gremlin patterns show <domain>   Show patterns for a specific domain

Options for review

Option        Default   Description
--depth       quick     Analysis depth: quick or deep
--threshold   80        Confidence filter (0-100)
--output      rich      Output format: rich, md, json
--patterns    -         Custom patterns file (YAML)
--context     -         Additional context: string, @file, or - for stdin
--validate    false     Run second pass to filter hallucinations
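
The --context option accepts three forms: a literal string, an @file reference, or - for stdin. As a minimal sketch of how such an argument could be resolved, assuming nothing about Gremlin's internals (resolve_context is a hypothetical helper, not part of the library):

```python
import sys
from pathlib import Path
from typing import Optional, TextIO


def resolve_context(value: Optional[str], stdin: TextIO = sys.stdin) -> Optional[str]:
    """Hypothetical helper: turn a --context argument into context text."""
    if value is None:
        # Option not given: no extra context.
        return None
    if value == "-":
        # A single dash means "read the context from stdin".
        return stdin.read()
    if value.startswith("@"):
        # "@path" means "read the context from this file".
        return Path(value[1:]).read_text(encoding="utf-8")
    # Anything else is used verbatim as the context string.
    return value
```

With this shape, `resolve_context("@spec.md")` would return the contents of spec.md, while `resolve_context("Using Stripe API")` returns the string unchanged.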

Custom Patterns

Add domain-specific patterns for your codebase:

Project-level (auto-loaded)

# .gremlin/patterns.yaml
domain_specific:
  image_processing:
    keywords: [image, photo, upload, resize, cdn]
    patterns:
      - "What if EXIF rotation is ignored during resize?"
      - "What if CDN cache serves stale image after update?"
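
Before committing a patterns file, it can help to check its shape. The sketch below validates the structure shown in the example above after it has been parsed into a Python dict. The rules it enforces (a top-level domain_specific mapping, non-empty keywords and patterns lists of strings) are assumptions drawn from that one example, not Gremlin's actual schema, and validate_patterns is a hypothetical helper.

```python
from typing import Any, Dict, List


def validate_patterns(data: Dict[str, Any]) -> List[str]:
    """Return a list of problems in a parsed patterns mapping (empty if none)."""
    domains = data.get("domain_specific")
    if not isinstance(domains, dict):
        return ["top-level 'domain_specific' mapping is missing"]

    problems: List[str] = []
    for name, entry in domains.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: expected a mapping with 'keywords' and 'patterns'")
            continue
        for key in ("keywords", "patterns"):
            items = entry.get(key)
            if not isinstance(items, list) or not items:
                problems.append(f"{name}: '{key}' must be a non-empty list")
            elif not all(isinstance(item, str) for item in items):
                problems.append(f"{name}: every entry in '{key}' must be a string")
    return problems


# The example file above, as it would look after YAML parsing.
EXAMPLE = {
    "domain_specific": {
        "image_processing": {
            "keywords": ["image", "photo", "upload", "resize", "cdn"],
            "patterns": [
                "What if EXIF rotation is ignored during resize?",
                "What if CDN cache serves stale image after update?",
            ],
        }
    }
}

print(validate_patterns(EXAMPLE))
```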

Via --patterns flag

gremlin review "image upload" --patterns @my-patterns.yaml

Learn from incidents

gremlin learn "Portrait images displayed sideways" --domain files --source prod-incident

See docs/CUSTOM_PATTERNS.md for the full authoring guide.

Pattern Domains

Gremlin includes curated patterns for these domains:

  • auth - Authentication, sessions, tokens
  • payments - Checkout, billing, refunds
  • file_upload - File handling, validation
  • database - Queries, transactions, migrations
  • api - Rate limiting, endpoints
  • deployment - Config, containers, environments
  • infrastructure - Servers, certs, resources
  • And more...

How It Works

User: gremlin review "checkout flow"
         │
         ▼
    ┌─────────────┐
    │ Parse scope │
    └──────┬──────┘
           │
           ▼
    ┌─────────────────┐
    │ Infer domains   │  "checkout" → payments
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │ Select patterns │  universal + payments
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │ Build prompt    │  system.md + patterns + scope
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │ Call Claude API │
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │ Render output   │  Risk scenarios
    └─────────────────┘
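
The first four steps of the flow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Gremlin's implementation: the function names, the keyword lists, the universal pattern, the auth pattern, and the prompt layout are all hypothetical, and the keyword match is a naive substring test.

```python
from typing import Dict, List

# Illustrative data only; the real patterns ship with Gremlin.
UNIVERSAL: List[str] = ["What if the same request is processed twice?"]
DOMAINS: Dict[str, Dict[str, List[str]]] = {
    "payments": {
        "keywords": ["checkout", "payment", "billing", "refund"],
        "patterns": [
            "What if the Stripe webhook arrives before the order record is committed?"
        ],
    },
    "auth": {
        "keywords": ["auth", "login", "session", "token"],
        "patterns": ["What if a session token is reused after logout?"],
    },
}


def infer_domains(scope: str) -> List[str]:
    """Match the scope text against each domain's keywords."""
    text = scope.lower()
    return [
        name
        for name, domain in DOMAINS.items()
        if any(keyword in text for keyword in domain["keywords"])
    ]


def select_patterns(domains: List[str]) -> List[str]:
    """Universal patterns always apply; domain patterns are added on top."""
    selected = list(UNIVERSAL)
    for name in domains:
        selected.extend(DOMAINS[name]["patterns"])
    return selected


def build_prompt(system: str, patterns: List[str], scope: str) -> str:
    """Combine the system prompt, the selected patterns, and the scope."""
    bullets = "\n".join(f"- {pattern}" for pattern in patterns)
    return f"{system}\n\nPatterns:\n{bullets}\n\nScope: {scope}"


scope = "checkout flow"
prompt = build_prompt(
    "You are a risk critic.", select_patterns(infer_domains(scope)), scope
)
print(prompt)
```

The resulting prompt string is what would be sent to the model in the "Call Claude API" step.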

Performance

Gremlin's pattern-based approach achieves a 90.7% tie rate with baseline Claude Sonnet 4 across 54 real-world test cases:

Metric          Result   Notes
Tie Rate        90.7%    Gremlin matches baseline Claude quality
Win/Tie Rate    98.1%    Combined wins + ties
Gremlin Wins    7.4%     Cases where patterns provide unique value
Claude Wins     1.9%     Minor category labeling differences
Pattern Count   93       Universal + domain-specific patterns

Key Achievement: 90% reduction in quality gaps (19% → 1.9%) through strategic pattern improvements.
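
The reported percentages are consistent with 49 ties, 4 Gremlin wins, and 1 Claude win out of 54 cases. Those counts are inferred from the rounded figures and are not stated in the results, so treat the check below as an arithmetic sanity test rather than the original data.

```python
TOTAL_CASES = 54
# Inferred counts (assumption): chosen because they reproduce the reported percentages.
TIES, GREMLIN_WINS, CLAUDE_WINS = 49, 4, 1


def pct(count: int, total: int = TOTAL_CASES) -> float:
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


tie_rate = pct(TIES)
gremlin_win_rate = pct(GREMLIN_WINS)
claude_win_rate = pct(CLAUDE_WINS)
win_or_tie_rate = pct(TIES + GREMLIN_WINS)

# Relative reduction in the quality gap, from 19% down to 1.9%.
gap_reduction = round(100 * (19.0 - 1.9) / 19.0, 1)

print(tie_rate, gremlin_win_rate, claude_win_rate, win_or_tie_rate, gap_reduction)
```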

See Phase 2 Tier 1 Results for detailed analysis.

Claude Code Integration

Gremlin also provides a Claude Code agent for code-focused risk critique during PR reviews. See docs/INTEGRATION_GUIDE.md for setup.

Programmatic API

Gremlin can be used as a Python library for integration with CI/CD pipelines, agent frameworks, and custom tools.

Installation

pip install gremlin-critic
# Or for development:
pip install -e ".[dev]"

Basic Usage

from gremlin import Gremlin, Risk, AnalysisResult

# Initialize analyzer
gremlin = Gremlin()

# Analyze a scope
result = gremlin.analyze("checkout flow")

# Access results
print(f"Found {len(result.risks)} risks")
print(f"Matched domains: {result.matched_domains}")
print(f"Pattern count: {result.pattern_count}")

Configuration

# Use different provider/model
gremlin = Gremlin(
    provider="anthropic",           # anthropic, openai, ollama
    model="claude-sonnet-4-20250514",
    threshold=80                     # Confidence threshold
)

# Analyze with context
result = gremlin.analyze(
    scope="user authentication",
    context="Using JWT with Redis session store",
    depth="deep"                     # quick or deep
)
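
To make the threshold setting concrete, here is a minimal sketch of confidence filtering followed by severity ranking. It is an illustration only: RiskSketch is a stand-in for the library's Risk class, the MEDIUM and LOW levels are assumed (the example output shows only CRITICAL and HIGH), and treating the threshold as inclusive is also an assumption.

```python
from dataclasses import dataclass
from typing import List

# Lower number sorts first. MEDIUM and LOW are assumed levels.
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


@dataclass
class RiskSketch:
    """Stand-in for gremlin.Risk with only the fields this sketch needs."""
    severity: str
    confidence: int
    scenario: str


def filter_and_rank(risks: List[RiskSketch], threshold: int = 80) -> List[RiskSketch]:
    """Drop risks below the threshold, then sort by severity and confidence."""
    kept = [risk for risk in risks if risk.confidence >= threshold]
    return sorted(
        kept,
        key=lambda risk: (
            SEVERITY_ORDER.get(risk.severity, len(SEVERITY_ORDER)),
            -risk.confidence,
        ),
    )


risks = [
    RiskSketch("HIGH", 87, "Double Submit on Payment Button"),
    RiskSketch("CRITICAL", 95, "Webhook Race Condition"),
    RiskSketch("HIGH", 60, "Low-confidence finding"),
]
for risk in filter_and_rank(risks, threshold=80):
    print(f"[{risk.severity}] ({risk.confidence}%) {risk.scenario}")
```

Lowering the threshold to 60, as in the `--threshold 60` CLI example, would keep the third finding as well.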

Output Formats

# Dictionary (for JSON APIs)
data = result.to_dict()

# JSON string
json_str = result.to_json()

# JUnit XML (for CI/CD integration)
junit_xml = result.to_junit()

# LLM-friendly format (for agent consumption)
agent_input = result.format_for_llm()
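
If a pipeline step needs to inspect the JUnit report itself, the standard library can summarize it. The sketch below assumes the output follows the conventional testsuite/testcase/failure layout. The sample document is hand-written for illustration and is not actual Gremlin output, and summarize_junit is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

# Hand-written sample in the conventional JUnit layout (not real Gremlin output).
SAMPLE = """<testsuite name="gremlin" tests="2" failures="1">
  <testcase name="Webhook Race Condition">
    <failure message="CRITICAL"/>
  </testcase>
  <testcase name="Double Submit on Payment Button"/>
</testsuite>"""


def summarize_junit(xml_text: str) -> Tuple[int, int]:
    """Return (number of test cases, number that failed or errored)."""
    root = ET.fromstring(xml_text)
    cases = list(root.iter("testcase"))
    failed = sum(
        1
        for case in cases
        if case.find("failure") is not None or case.find("error") is not None
    )
    return len(cases), failed


total, failed = summarize_junit(SAMPLE)
print(f"{failed} of {total} cases failed")
```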

Risk Analysis

# Check risk severity
if result.has_critical_risks():
    print(f"⚠️  {result.critical_count} critical risks found")

if result.has_high_severity_risks():
    print(f"Found {result.high_count} high + {result.critical_count} critical")

# Iterate through risks
for risk in result.risks:
    print(f"[{risk.severity}] ({risk.confidence}%)")
    print(f"  Scenario: {risk.scenario}")
    print(f"  Impact: {risk.impact}")
    print(f"  Domains: {', '.join(risk.domains)}")

Async Support

import asyncio
from gremlin import Gremlin

async def analyze_features():
    gremlin = Gremlin()

    # Run multiple analyses concurrently
    results = await asyncio.gather(
        gremlin.analyze_async("checkout flow"),
        gremlin.analyze_async("user authentication"),
        gremlin.analyze_async("file upload")
    )

    for result in results:
        print(f"{result.scope}: {len(result.risks)} risks")

asyncio.run(analyze_features())
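
When many scopes are analyzed at once, an unbounded asyncio.gather may run into provider rate limits. One common way to cap concurrency is an asyncio.Semaphore. The sketch below is generic: gather_limited is a hypothetical helper, and fake_analyze stands in for gremlin.analyze_async so the example runs without an API key.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def gather_limited(
    scopes: List[str],
    analyze: Callable[[str], Awaitable[T]],
    limit: int = 2,
) -> List[T]:
    """Run analyze(scope) for every scope, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(scope: str) -> T:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await analyze(scope)

    # gather preserves input order in its result list.
    return list(await asyncio.gather(*(run_one(scope) for scope in scopes)))


async def fake_analyze(scope: str) -> str:
    """Stand-in for gremlin.analyze_async; sleeps instead of calling an API."""
    await asyncio.sleep(0.01)
    return f"analyzed: {scope}"


results = asyncio.run(
    gather_limited(["checkout flow", "user authentication", "file upload"], fake_analyze)
)
print(results)
```

With a real analyzer, the call would look like `gather_limited(scopes, gremlin.analyze_async, limit=2)`.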

Use Cases

1. LLM Agent Tool

from gremlin import Gremlin

def analyze_code_risks(code: str, feature: str) -> str:
    """Tool for LLM agents to analyze code risks."""
    gremlin = Gremlin()
    result = gremlin.analyze(scope=feature, context=code)
    return result.format_for_llm()

# Use with LangChain, CrewAI, AutoGen, etc.

2. CI/CD Integration

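import sys

from gremlin import Gremlin

# Read the PR diff from stdin, e.g. pipe `git diff` into this script
diff_content = sys.stdin.read()

gremlin = Gremlin(threshold=70)
result = gremlin.analyze("PR changes", context=diff_content)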

# Output JUnit XML
with open("gremlin-results.xml", "w") as f:
    f.write(result.to_junit())

# Exit with error if critical risks found
if result.has_critical_risks():
    print(f"❌ Found {result.critical_count} critical risks")
    sys.exit(1)
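
The exit-code decision in the script above can be pulled into a small function so it is testable without calling the API. This is a sketch: gate_exit_code and StubResult are hypothetical names, and the stub implements only the two members the gate uses (has_critical_risks() and critical_count, both documented above).

```python
from typing import Protocol


class ResultLike(Protocol):
    """The subset of AnalysisResult that the gate relies on."""
    critical_count: int

    def has_critical_risks(self) -> bool: ...


def gate_exit_code(result: ResultLike) -> int:
    """Return 1 if the build should fail, 0 otherwise."""
    if result.has_critical_risks():
        print(f"Found {result.critical_count} critical risks")
        return 1
    return 0


class StubResult:
    """Test double standing in for a real AnalysisResult."""

    def __init__(self, critical_count: int) -> None:
        self.critical_count = critical_count

    def has_critical_risks(self) -> bool:
        return self.critical_count > 0


print(gate_exit_code(StubResult(0)), gate_exit_code(StubResult(2)))
```

In the CI script, the final lines would then reduce to `sys.exit(gate_exit_code(result))`.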

3. Custom Validation Pipeline

from gremlin import Gremlin

def validate_feature_design(prd: str, feature_name: str) -> dict:
    """Validate a feature design for risks."""
    gremlin = Gremlin()
    result = gremlin.analyze(feature_name, context=prd, depth="deep")

    return {
        "feature": feature_name,
        "risk_count": len(result.risks),
        "critical": result.critical_count,
        "high": result.high_count,
        "requires_review": result.has_high_severity_risks(),
        "report": result.to_dict()
    }

API Reference

Classes:

  • Gremlin - Main analyzer class
  • Risk - Individual risk finding with severity, confidence, scenario, impact
  • AnalysisResult - Complete analysis with multiple output formats

Methods:

  • Gremlin.analyze(scope, context, depth) - Synchronous analysis
  • Gremlin.analyze_async(scope, context, depth) - Async analysis
  • AnalysisResult.to_dict() - Dictionary serialization
  • AnalysisResult.to_json() - JSON string
  • AnalysisResult.to_junit() - JUnit XML
  • AnalysisResult.format_for_llm() - LLM-friendly format
  • AnalysisResult.has_critical_risks() - Check for critical risks
  • AnalysisResult.has_high_severity_risks() - Check for high+ risks

Development

# Clone the repo
git clone https://github.com/abhi10/gremlin.git
cd gremlin

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Lint
ruff check .

License

MIT

Contributing

Contributions welcome! Please open an issue first to discuss what you'd like to change.

Acknowledgments

  • Inspired by exploratory testing principles from James Bach and James Whittaker
  • Powered by Claude from Anthropic
