A CLI tool for prompt regression testing

prompt-regress 🔍

AI Model Output Regression Testing Tool

When upgrading prompts or switching models (e.g. GPT-4 → Claude Opus), developers need a quick way to know if outputs broke. prompt-regress solves this by comparing model outputs across prompt versions or model versions.

🚀 Features

Model-Agnostic: Works with OpenAI, Anthropic, local models (Ollama), and more
Semantic Similarity: Goes beyond text matching to detect changes in meaning
Cost Tracking: Monitor token usage and cost differences
JSON Validation: Ensure structured outputs remain valid

📦 Installation

pip install prompt-regress

🏃 Quick Start

1. Initialize Configuration

prompt-regress init

This creates a prompt-regress.yml configuration file:

metrics:
  semantic_similarity:
    threshold: 0.8
  text_similarity:
    threshold: 0.7
models:
- name: gpt-4
  parameters:
    max_tokens: 1000
    temperature: 0.7
  provider: openai
- name: claude-4
  parameters:
    max_tokens: 1000
    temperature: 0.7
  provider: anthropic
- host: http://localhost:11434
  name: deepseek-r1:1.5b
  parameters:
    max_tokens: 1000
    temperature: 0.7
  provider: local
regression_options:
  max_concurrency: 5
test_cases:
- inputs:
  - text: Sample text to summarize
  name: summarization
  prompt_template: 'Summarize this text in 2-3 sentences: {text}'
- inputs:
  - context: This is a sample context.
    question: What is the context about?
  - context: My name is Foo.
    question: What is my name?
  name: question_answering
  prompt_template: 'Answer the question based on the context: {context} Question:
    {question}'
- expect_json: true
  inputs:
  - data: John Doe, age 30, works at TechCorp
  name: json_extraction
  prompt_template: 'Extract key information as JSON: {data}'

💡 Note: For more examples please check here.
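
Each test case pairs a prompt_template with one or more input sets. The sketch below shows how those placeholders could expand into concrete prompts. It assumes Python str.format-style substitution, which the config syntax suggests but this README does not confirm; TEST_CASES and render_prompts are illustrative names, not part of prompt-regress.

```python
# Stand-in for two entries under `test_cases:` in prompt-regress.yml.
TEST_CASES = [
    {
        "name": "summarization",
        "prompt_template": "Summarize this text in 2-3 sentences: {text}",
        "inputs": [{"text": "Sample text to summarize"}],
    },
    {
        "name": "question_answering",
        "prompt_template": (
            "Answer the question based on the context: {context} "
            "Question: {question}"
        ),
        "inputs": [
            {"context": "This is a sample context.",
             "question": "What is the context about?"},
            {"context": "My name is Foo.", "question": "What is my name?"},
        ],
    },
]


def render_prompts(test_case: dict) -> list[str]:
    """Return one prompt per input set (assumes str.format-style placeholders)."""
    template = test_case["prompt_template"]
    return [template.format(**values) for values in test_case["inputs"]]


if __name__ == "__main__":
    for case in TEST_CASES:
        for prompt in render_prompts(case):
            print(f"[{case['name']}] {prompt}")
```

Under this assumption, a test case with two input sets produces two prompts, each sent to both the baseline and the target model.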

2. Compare Models

# Compare two models
prompt-regress check --baseline gpt-4 --target claude-opus

# Output example:
🔍 Prompt Regression Test Report
==================================================
✅ Passed: 2/2
❌ Failed: 0/2

✅ PASS summarization
  Text Similarity: 0.856
  Semantic Similarity: 0.923

✅ PASS json_extraction
  Text Similarity: 0.734
  Semantic Similarity: 0.891

3. CI/CD Integration

# Fail build if regressions detected
prompt-regress check \
  --baseline gpt-4 \
  --target claude-opus \
  --fail-on-regression
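
How --fail-on-regression maps results to an exit status is not spelled out here. A minimal sketch, assuming the usual CI convention (non-zero exit when any test case fails) and a result object with the passed field used in the Python SDK section; Result and exit_code are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for an SDK result; only the fields used here."""
    test_case: str
    passed: bool


def exit_code(results: list[Result]) -> int:
    """Return 1 if any test case regressed, otherwise 0."""
    return 1 if any(not r.passed for r in results) else 0


if __name__ == "__main__":
    results = [Result("summarization", True), Result("json_extraction", False)]
    # In a real CI script this value would be passed to sys.exit().
    print(exit_code(results))
```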

🔧 CLI Commands

Initialize Project

prompt-regress init [--config prompt-regress.yml]

Compare Models

prompt-regress check \
  --baseline MODEL_NAME \
  --target MODEL_NAME \
  [--config CONFIG_FILE] \
  [--format console|json] \
  [--fail-on-regression]
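
When the check command is driven from a script, building the argument list programmatically avoids shell quoting mistakes. The sketch below uses only the flags listed above; build_check_command and run_check are illustrative helpers, not part of prompt-regress.

```python
import shlex
import subprocess
from typing import Optional


def build_check_command(
    baseline: str,
    target: str,
    config: Optional[str] = None,
    output_format: Optional[str] = None,
    fail_on_regression: bool = False,
) -> list[str]:
    """Assemble the argv for `prompt-regress check` from the documented flags."""
    cmd = ["prompt-regress", "check", "--baseline", baseline, "--target", target]
    if config is not None:
        cmd += ["--config", config]
    if output_format is not None:
        if output_format not in ("console", "json"):
            raise ValueError("output_format must be 'console' or 'json'")
        cmd += ["--format", output_format]
    if fail_on_regression:
        cmd.append("--fail-on-regression")
    return cmd


def run_check(cmd: list[str]) -> int:
    """Run the command and return its exit status (requires prompt-regress on PATH)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    cmd = build_check_command("gpt-4", "claude-opus", output_format="json",
                              fail_on_regression=True)
    print(shlex.join(cmd))
```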

List Available Models

prompt-regress models [--config prompt-regress.yml]

List Test Cases

prompt-regress tests [--config prompt-regress.yml]

🐍 Python SDK

from prompt_regress import PromptRegress

# Initialize
regress = PromptRegress("prompt-regress.yml")

# Compare models
results = regress.compare_models("gpt-4", "claude-opus")

# Generate report
report = regress.generate_report(results, format="json")
print(report)

# Check individual results
for result in results:
    if not result.passed:
        print(f"❌ {result.test_case} failed!")
        print(f"   Semantic similarity: {result.semantic_similarity:.3f}")
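
The result fields used above (test_case, passed, semantic_similarity) are enough to build a short summary for logs or PR comments. A minimal sketch with a hypothetical Result stand-in, so it runs without API keys; the real result objects may carry more fields:

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in mirroring the fields used in the SDK example."""
    test_case: str
    passed: bool
    semantic_similarity: float


def summarize(results: list[Result]) -> str:
    """Return pass/fail counts followed by one line per test, lowest score first."""
    total = len(results)
    passed = sum(r.passed for r in results)
    lines = [f"Passed: {passed}/{total}", f"Failed: {total - passed}/{total}"]
    for r in sorted(results, key=lambda r: r.semantic_similarity):
        status = "PASS" if r.passed else "FAIL"
        lines.append(
            f"{status} {r.test_case} "
            f"(semantic similarity: {r.semantic_similarity:.3f})"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(summarize([
        Result("summarization", True, 0.923),
        Result("json_extraction", False, 0.41),
    ]))
```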

🤖 Supported Providers

Provider	Models	API Key Required
OpenAI	gpt-4, gpt-3.5-turbo, etc.	✅
Anthropic	claude-opus, claude-sonnet	✅
Local (Ollama)	llama2, codellama, etc.	❌

📊 Comparison Metrics

Text Similarity: Surface-level text comparison using difflib
Semantic Similarity: Meaning comparison using sentence transformers
Token Usage: Track token consumption changes
Cost Analysis: Monitor API cost differences
JSON Validation: Ensure structured outputs remain valid
Performance: Response time and throughput
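
To make the two similarity metrics concrete, here is a rough sketch of how such scores can be computed. It assumes text similarity is a difflib.SequenceMatcher ratio and that semantic similarity is the cosine similarity between sentence-transformer embeddings. Both are common choices, but neither is confirmed as the exact implementation in prompt-regress. Plain lists of floats stand in for embeddings so the example needs no model download.

```python
import math
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Surface-level similarity in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, a, b).ratio()


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(text_similarity("The cat sat on the mat.", "A cat sat on a mat."))
    # Toy 3-dimensional vectors standing in for real sentence embeddings.
    print(cosine_similarity([0.2, 0.7, 0.1], [0.25, 0.65, 0.05]))
```

Two outputs can score low on text similarity and still score high on semantic similarity when they say the same thing in different words, which is why the tool reports both.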

🚦 GitHub Actions Integration

Add to .github/workflows/prompt-regression.yml:

name: Prompt Regression Tests

on:
  pull_request:
    paths: ['prompts/**', 'prompt-regress.yml']

jobs:
  regression-test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    
    - name: Install prompt-regress
      run: pip install prompt-regress
    
    - name: Run regression tests
      env:
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      run: |
        prompt-regress check \
          --baseline gpt-4 \
          --target claude-opus \
          --fail-on-regression

🎯 Use Cases

1. Model Migration

# Switching from GPT-4 to Claude Opus
prompt-regress check --baseline gpt-4 --target claude-opus

2. Prompt Optimization

# Test prompt changes with same model
prompt-regress check --baseline gpt-4-v1 --target gpt-4-v2

3. Cost Optimization

# Compare expensive vs cheaper models
prompt-regress check --baseline gpt-4 --target gpt-3.5-turbo

4. Local Model Testing

# Compare cloud vs local models
prompt-regress check --baseline gpt-4 --target llama2-local
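
For the cost optimization use case, the arithmetic behind a cost comparison is simple: tokens used times price per token, per model. The sketch below uses hypothetical model names and prices as placeholders; substitute your provider's current rates. estimate_cost and cost_savings are illustrative helpers, not prompt-regress APIs.

```python
# Hypothetical USD prices per 1,000 tokens; replace with your provider's real rates.
PRICE_PER_1K_TOKENS = {"expensive-model": 0.03, "cheap-model": 0.002}


def estimate_cost(model: str, tokens: int) -> float:
    """Cost in USD for a number of tokens at the model's per-1K price."""
    return tokens / 1000 * PRICE_PER_1K_TOKENS[model]


def cost_savings(baseline: str, target: str,
                 baseline_tokens: int, target_tokens: int) -> float:
    """Positive when the target run is cheaper than the baseline run."""
    return estimate_cost(baseline, baseline_tokens) - estimate_cost(target, target_tokens)


if __name__ == "__main__":
    saved = cost_savings("expensive-model", "cheap-model", 2000, 2500)
    print(f"Estimated savings per run: ${saved:.4f}")
```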

⚙️ Configuration

Model Configuration

models:
  - name: custom-gpt-4
    provider: openai
    parameters:
      temperature: 0.5
      max_tokens: 2000
    
  - name: local-llama
    provider: local
    host: http://localhost:11434
    model: llama2

Test Case Configuration

test_cases:
  - name: code_generation
    prompt_template: "Generate Python code for: {task}"
    inputs:
      - task: "sort a list of dictionaries by key"
      - task: "create a REST API endpoint"
    expect_json: false
    timeout: 30
    
  - name: data_extraction
    prompt_template: "Extract data as JSON: {text}"
    inputs:
      - text: "Company: Acme Corp, Revenue: $1M, Employees: 50"
    expect_json: true
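
When expect_json is true, the model output has to parse as JSON. A minimal sketch of such a check, assuming validity simply means json.loads succeeds; the tool's actual validation may be stricter. is_valid_json and check_output are hypothetical helper names.

```python
import json


def is_valid_json(output: str) -> bool:
    """Return True if the model output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def check_output(output: str, expect_json: bool) -> bool:
    """Apply the JSON check only to test cases that ask for it."""
    return is_valid_json(output) if expect_json else True


if __name__ == "__main__":
    print(check_output('{"company": "Acme Corp", "employees": 50}', expect_json=True))
    print(check_output("Company: Acme Corp", expect_json=True))
```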

Metrics Configuration

metrics:
  semantic_similarity:        # Minimum semantic similarity (0-1)
    threshold: 0.8             
  text_similarity:            # Minimum text similarity (0-1)
    threshold: 0.7
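
A rough sketch of how these thresholds could decide pass or fail. It assumes a test case passes only when every configured metric meets or exceeds its threshold, which matches the "Minimum" wording above but is not confirmed as the exact rule. THRESHOLDS, failing_metrics, and passes are illustrative names.

```python
# Mirrors the `metrics:` section above.
THRESHOLDS = {"semantic_similarity": 0.8, "text_similarity": 0.7}


def failing_metrics(scores: dict[str, float],
                    thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the names of metrics whose score is below its minimum."""
    return [name for name, minimum in thresholds.items() if scores[name] < minimum]


def passes(scores: dict[str, float],
           thresholds: dict[str, float] = THRESHOLDS) -> bool:
    """A test case passes when no metric falls below its threshold."""
    return not failing_metrics(scores, thresholds)


if __name__ == "__main__":
    # Scores taken from the sample report in the Quick Start.
    print(passes({"text_similarity": 0.856, "semantic_similarity": 0.923}))
    print(failing_metrics({"text_similarity": 0.65, "semantic_similarity": 0.9}))
```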

🧪 Advanced Usage

Custom Similarity Functions

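from difflib import SequenceMatcher

from prompt_regress import PromptRegress

class CustomPromptRegress(PromptRegress):
    def _calculate_custom_similarity(self, text1: str, text2: str) -> float:
        # Your custom similarity logic here. Placeholder: a case-insensitive
        # difflib ratio in [0, 1]; replace with your own scoring.
        return SequenceMatcher(None, text1.lower(), text2.lower()).ratio()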

Custom Providers

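from typing import Tuple

from prompt_regress import ModelProvider

class CustomProvider(ModelProvider):
    def generate(self, prompt: str) -> Tuple[str, int]:
        # Your custom model API integration. `your_api_call` is a placeholder
        # for your own client; it must return an object with text and token_count.
        response = your_api_call(prompt)
        return response.text, response.token_count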
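
A deterministic provider is handy for testing a pipeline offline. The sketch below follows the generate(prompt) -> (text, token_count) shape shown above, but it defines its own stand-in base class so it runs without prompt-regress installed. The stand-in ModelProvider and EchoProvider are assumptions for illustration, and the whitespace word count is only a crude proxy for real token counts.

```python
from abc import ABC, abstractmethod
from typing import Tuple


class ModelProvider(ABC):
    """Stand-in for prompt_regress.ModelProvider (assumed interface)."""

    @abstractmethod
    def generate(self, prompt: str) -> Tuple[str, int]:
        """Return the model output and the number of tokens used."""


class EchoProvider(ModelProvider):
    """Returns the prompt upper-cased; useful as a fake model in tests."""

    def generate(self, prompt: str) -> Tuple[str, int]:
        text = prompt.upper()
        # Whitespace-separated words as a rough token estimate.
        return text, len(text.split())


if __name__ == "__main__":
    print(EchoProvider().generate("Summarize this text"))
```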

📈 Monitoring & Alerts

Slack Integration

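# Send results to a Slack webhook. Slack expects a JSON object with a "text"
# field, so wrap the report with jq first (assumes jq is installed and that
# --format json prints only JSON to stdout).
prompt-regress check \
  --baseline gpt-4 \
  --target claude-opus \
  --format json | \
  jq -c '{text: tostring}' | \
  curl -X POST -H 'Content-type: application/json' \
  --data @- "$SLACK_WEBHOOK_URL"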
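
Slack incoming webhooks expect a JSON object with a text field rather than an arbitrary JSON document, so the report generally needs to be wrapped before posting. A standard-library sketch of the same idea in Python; the report dictionary shape is a made-up example, and build_slack_payload and post_to_slack are hypothetical helpers:

```python
import json
import urllib.request


def build_slack_payload(report: dict) -> bytes:
    """Wrap a report in the {"text": ...} shape Slack webhooks accept."""
    text = "prompt-regress report\n" + json.dumps(report, indent=2)
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(webhook_url: str, report: dict) -> int:
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(report),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Illustrative report shape; the real JSON output may differ.
    print(build_slack_payload({"passed": 2, "failed": 0}).decode("utf-8"))
```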

Email Alerts

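import smtplib
from email.message import EmailMessage
from prompt_regress import PromptRegress

def send_email_alert(body: str) -> None:
    # Placeholder addresses and SMTP host; replace with your own
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = "prompt-regress alert", "alerts@example.com", "team@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

regress = PromptRegress()
results = regress.compare_models("gpt-4", "claude-opus")

failed_tests = [r for r in results if not r.passed]
if failed_tests:
    send_email_alert(f"Regression detected in {len(failed_tests)} tests")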

🔒 Security

API Key Management

# Environment variables
export OPENAI_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"

# Or use .env file (keep it out of version control)
echo "OPENAI_API_KEY=your-key-here" >> .env
echo "ANTHROPIC_API_KEY=your-key-here" >> .env
echo ".env" >> .gitignore

Rate Limiting

models:
  - name: gpt-4
    provider: openai
    rate_limit:
      requests_per_minute: 60
      tokens_per_minute: 40000
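
To illustrate what a requests_per_minute limit means in practice, here is a sketch of a sliding-window limiter. This is one common way to enforce such a limit, not necessarily how prompt-regress does it; RequestRateLimiter is a hypothetical class, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from collections import deque
from typing import Callable


class RequestRateLimiter:
    """Allow at most `requests_per_minute` requests in any 60-second window."""

    def __init__(self, requests_per_minute: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = requests_per_minute
        self.clock = clock
        self.sent: deque[float] = deque()  # timestamps of recent requests

    def seconds_until_allowed(self) -> float:
        """Return 0.0 if a request may be sent now, else the wait in seconds."""
        now = self.clock()
        # Drop timestamps that have left the 60-second window.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        return 60 - (now - self.sent[0])

    def record(self) -> None:
        """Register that a request was just sent."""
        self.sent.append(self.clock())


if __name__ == "__main__":
    limiter = RequestRateLimiter(requests_per_minute=60)
    wait = limiter.seconds_until_allowed()
    if wait:
        time.sleep(wait)
    limiter.record()
    print("request sent")
```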

🐛 Troubleshooting

Common Issues

1. Sentence Transformers Download

# Pre-download models
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

2. API Key Issues

# Test API connectivity
prompt-regress models --config test-config.yml

3. Memory Issues with Large Inputs

# Reduce batch size
batch_size: 1
max_input_length: 2000

Debug Mode

# Enable verbose logging
prompt-regress check --baseline gpt-4 --target claude-opus --verbose

🏆 Why prompt-regress?

Before prompt-regress:

❌ Manual testing of prompt changes
❌ No visibility into model output quality
❌ Expensive mistakes in production
❌ Time-consuming model comparisons

After prompt-regress:

✅ Automated regression testing
✅ Quantified quality metrics
✅ Catch issues before deployment
✅ Efficient model evaluation

📄 License

MIT License - see LICENSE file for details.

⭐ Star this repo if prompt-regress helps you build better AI applications!

Project details

This version

0.1.0

Jul 26, 2025

Download files

Source Distribution

prompt_regress-0.1.0.tar.gz (15.8 kB view details)

Uploaded Jul 26, 2025 Source

Built Distribution

prompt_regress-0.1.0-py3-none-any.whl (14.7 kB view details)

Uploaded Jul 26, 2025 Python 3

File details

Details for the file prompt_regress-0.1.0.tar.gz.

File metadata

Download URL: prompt_regress-0.1.0.tar.gz
Upload date: Jul 26, 2025
Size: 15.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for prompt_regress-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1a89b10968a6d7c65607ba68f00be7fda04751c3b6ed2f3e4867fea52a91469c`
MD5	`759a488dc8ca1d55ea9e7fc6dbedd778`
BLAKE2b-256	`0f4529f349c294b86bfe073d4673c98e1ad22f4df0315c0ecd0ee0f979bc964a`

File details

Details for the file prompt_regress-0.1.0-py3-none-any.whl.

File metadata

Download URL: prompt_regress-0.1.0-py3-none-any.whl
Upload date: Jul 26, 2025
Size: 14.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for prompt_regress-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2dfc0f4493bf75420786c1af9dcf7a5b8d23c36d17050769da7c58932061bdce`
MD5	`0762d6409b1da39acf02472b234b6dd6`
BLAKE2b-256	`d85c53f13575e0d55427fc3408d68a606d1d5102ed0919555320558dd779d446`
