
contradish

CAI testing for LLM applications: prebuilt policy packs, shareable HTML reports, regression gating, a live Contradiction Firewall, and automated prompt repair.

A CAI failure is when your app says "refunds within 30 days" to one phrasing and "we can work something out" to a slightly different one. Same policy, same session, opposite answers. Contradish finds these, scores them, and gives you the tools to fix them before users do.

pip install contradish

What it does

Offline testing — run before deploy. Contradish generates adversarial paraphrases of your test inputs, sends them all to your app, and scores consistency across responses.

Regression gating — compare baseline vs candidate on the same test suite. Block merges if the CAI score drops below your threshold.

Production monitoring — wrap your live app with the Firewall. It checks each response against recent ones and flags (or blocks) contradictions in real time.

Prompt repair — failing tests? Contradish generates 3 improved prompt variants, tests each one, and ranks them by CAI score so you know exactly which fix worked.


Quickstart

from contradish import Suite, TestCase

suite = Suite(app=my_llm_function)
suite.add(TestCase(input="Can I get a refund after 45 days?", name="refund policy"))
report = suite.run()

print(report.cai_score)           # 0.0-1.0, higher = more consistent
for r in report.results:
    print(r.test_case.name, r.cai_score)
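
In these examples, my_llm_function can be any callable that takes the user's message as a string and returns your app's response as a string. A minimal sketch of such a callable, assuming that str -> str contract and borrowing the Anthropic call style from the prompt-repair example below (the model and system prompt are illustrative):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def my_llm_function(question: str) -> str:
    # Any str -> str callable works; this one wraps a single Claude call.
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        system="You are a support agent. Refunds within 30 days only.",
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text.strip()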

Or give it your system prompt and let it figure out the test cases:

suite = Suite.from_prompt(
    system_prompt="You are a support agent. Refunds within 30 days only.",
    app=my_llm_function,
)
report = suite.run()

Or from the CLI:

export ANTHROPIC_API_KEY=sk-ant-...

# test a system prompt directly (uses your LLM as the demo app)
contradish "You are a support agent. Refunds within 30 days only."

# test from a file
contradish --prompt system_prompt.txt

# test your own app
contradish --prompt system_prompt.txt --app mymodule:my_app_function

# JSON output for CI pipelines
contradish --prompt system_prompt.txt --json

Policy packs (new in v0.4.2)

No system prompt? No test cases? Start here.

Contradish ships with prebuilt domain packs that let you get real CAI results in under 2 minutes.

# No --app needed. Runs in demo mode against the raw LLM.
contradish --policy ecommerce

# Test your actual app against the pack.
contradish --policy ecommerce --app mymodule:my_support_bot
contradish --policy hr --app mymodule:my_hr_assistant
contradish --policy healthcare --app mymodule:my_benefits_bot
contradish --policy legal --app mymodule:my_legal_tool

From Python:

from contradish import Suite

# Loads 12 e-commerce test cases. No test case writing required.
suite = Suite.from_policy("ecommerce", app=my_app)
report = suite.run()

Or load the pack directly to inspect or extend it:

from contradish import load_policy, list_policies

print(list_policies())     # ['ecommerce', 'hr', 'healthcare', 'legal']

pack = load_policy("ecommerce")
print(pack.display_name)   # "E-Commerce Support"
print(len(pack))           # 12

# Add a custom case to the prebuilt pack
from contradish import Suite, TestCase
suite = Suite(app=my_app)
for tc in pack.cases:
    suite.add(tc)
suite.add(TestCase(name="custom", input="My own test question"))
suite.run()

Available packs:

Pack        Cases  Covers
ecommerce   12     Refunds, returns, price matching, shipping, warranties
hr          12     PTO, benefits, parental leave, termination, overtime
healthcare  12     Coverage, referrals, deductibles, prior auth, eligibility
legal       12     Disclaimers, liability, advice boundaries, data privacy

Each case targets a real inconsistency vector — the places where LLM support bots most often give different answers to the same underlying question.
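
To see exactly what a pack will probe before you run it, print its cases. A small sketch, assuming each case exposes the same name and input fields used throughout these examples:

from contradish import load_policy

pack = load_policy("hr")
for tc in pack.cases:
    print(f"{tc.name}: {tc.input}")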


CAI score

A number from 0 to 1 measuring how consistently your app responds to semantically equivalent inputs.

  • 0.80+ — stable. Safe to ship.
  • 0.60-0.79 — marginal. Review the flagged rules.
  • < 0.60 — unstable. CAI failures detected.
A failure looks like this in the report:

CAI FAILURE: "refund window"
  input:      "Can I get a refund after 45 days?"
  paraphrase: "I bought this 6 weeks ago, can I still return it?"
  output_a:   "Refunds are only available within 30 days of purchase."
  output_b:   "We can usually make exceptions for recent purchases."
  CAI score:  0.54 (unstable)

1 CAI failure found. 2 rules clean.
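
The RegressionSuite below has gating built in, but you can also enforce these bands in a plain script by checking the score yourself. A minimal sketch:

report = suite.run()
if report.cai_score < 0.60:
    # Unstable: CAI failures detected, fail the run.
    raise SystemExit(f"CAI score {report.cai_score:.2f} is below 0.60")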

Regression testing

Compare two versions of your app before merging. CI fails automatically if the CAI score drops.

from contradish import RegressionSuite, TestCase

suite = RegressionSuite(
    test_cases=[
        TestCase(input="Can I get a refund after 45 days?"),
        TestCase(input="Do you price match competitors?"),
    ]
)

result = suite.compare(
    baseline_app=production_app,
    candidate_app=new_app,
    baseline_label="prod-v12",
    candidate_label="pr-456",
)

print(result)
result.fail_if_below(consistency=0.80)  # raises AssertionError in CI if score drops

Load test cases from a YAML file:

# evals.yaml
test_cases:
  - input: "Can I get a refund after 45 days?"
    name: "refund policy"
  - input: "Do you price match competitors?"
    name: "price matching"

suite = RegressionSuite.load("evals.yaml")

From the CLI:

contradish compare evals.yaml \
  --baseline mymodule:production_app \
  --candidate mymodule:new_app \
  --threshold 0.80

GitHub Actions

Drop this in .github/workflows/cai.yml to gate every PR:

name: CAI regression

on: [pull_request]

jobs:
  cai:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install contradish anthropic
      - name: Run CAI regression
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          contradish compare evals.yaml \
            --baseline mymodule:baseline_app \
            --candidate mymodule:candidate_app \
            --threshold 0.80

Production Firewall

Wrap your live app to catch contradictions in real traffic before users notice.

from contradish import Firewall

# Monitor mode: log contradictions, pass all responses through
firewall = Firewall(app=my_llm_app, mode="monitor")

result = firewall.check(user_query)
print(result.response)

if result.contradiction_detected:
    # log it, alert your team, route to human review
    print(f"Contradiction: {result.explanation}")
    print(f"Contradicts: {result.cached_query}")
# Block mode: return a safe fallback when a contradiction is detected
firewall = Firewall(
    app=my_llm_app,
    mode="block",
    fallback_response="Let me get a team member to help with that.",
)

result = firewall.check(user_query)
return result.response  # inside your request handler; safe regardless of what the app said

Get a traffic summary:

print(firewall.summary())
# {
#   "total_queries": 1240,
#   "contradictions_detected": 18,
#   "responses_blocked": 0,
#   "contradiction_rate": 0.015
# }
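
These counters are easy to wire into existing alerting. A sketch, assuming the field names from the example output above; alert_oncall is a hypothetical stand-in for whatever pager, Slack, or logging hook you use:

stats = firewall.summary()
if stats["contradiction_rate"] > 0.02:  # illustrative threshold
    # alert_oncall is hypothetical; substitute your own alerting call
    alert_oncall(f"{stats['contradictions_detected']} contradictions "
                 f"in {stats['total_queries']} queries")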

Prompt repair

Found failures? Contradish generates improved prompt variants, tests each one, and returns them ranked by CAI score.

import anthropic
from contradish import Suite, PromptRepair

client = anthropic.Anthropic()

def make_app(system_prompt):
    def app(question):
        msg = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            system=system_prompt,
            messages=[{"role": "user", "content": question}],
        )
        return msg.content[0].text.strip()
    return app

# Step 1: find the failures
original_prompt = "You are a support agent. Refunds within 30 days only."  # your current system prompt
suite = Suite.from_prompt(
    system_prompt=original_prompt,
    app=make_app(original_prompt),
)
report = suite.run()

# Step 2: fix them
repair = PromptRepair(n=3)
results = repair.fix(
    system_prompt=original_prompt,
    report=report,
    app_factory=make_app,
)

best = results[0]
print(f"CAI: {best.original_cai_score:.2f} -> {best.improved_cai_score:.2f} (+{best.delta:.2f})")
print(best.improved_prompt)

Output:

  Prompt repair results:
  #1: CAI 0.54 -> 0.88 (+0.34)
  #2: CAI 0.54 -> 0.81 (+0.27)
  #3: CAI 0.54 -> 0.76 (+0.22)
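
Once a variant wins, deploying it is just rebuilding your app with the improved prompt. A sketch reusing the make_app factory from above, with an optional re-check:

# Rebuild the app with the winning prompt, then re-verify it
improved_app = make_app(best.improved_prompt)
confirm = Suite.from_prompt(system_prompt=best.improved_prompt, app=improved_app).run()
print(confirm.cai_score)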

JSON output

Any command supports --json for machine-readable output:

contradish --prompt system_prompt.txt --json
{
  "cai_score": 0.71,
  "total": 4,
  "passed": 3,
  "failed": 1,
  "results": [...]
}

Pipe it through jq to pull out a single field:

contradish --prompt system_prompt.txt --json | jq '.cai_score'
0.71

Test case format

YAML (recommended):

test_cases:
  - input: "Can I get a refund after 45 days?"
    name: "refund window"
  - input: "Do you match competitor prices?"
    name: "price matching"
    expected_traits:
      - "should say no"
      - "should not invent exceptions"

JSON also works:

[
  {"input": "Can I get a refund after 45 days?", "name": "refund window"},
  {"input": "Do you match competitor prices?", "name": "price matching"}
]

The CAI benchmark

Contradish ships with a human-validated benchmark of 300 adversarial question pairs across support, legal, finance, and policy domains. It is used to produce the CAI leaderboard.

Current scores (higher = more consistent):

  • Intercom Fin: 0.84
  • ChatGPT (GPT-4o): 0.79

License

MIT
