
The open-source evaluation framework for AI agents: test, compare, ship with confidence


🧪 LitmusAI

Eval framework for AI agents that actually works

CI Python 3.10+ Tests License: MIT

5,900+ lines of code · 404 tests · 11 assertion types · 46 safety attacks · 5-model benchmarks

Quickstart · Features · Assertions · Examples · Benchmark Results


Why?

You deployed an agent. It works in demos. But:

  • Does it still work after your last commit?
  • Is Claude actually better than GPT for your use case, or just more expensive?
  • Did your prompt change introduce a safety regression?
  • What's your actual cost per correct answer?

LitmusAI answers these with one function call.


⚡ Quickstart

pip install litmuseval

30-second eval

import asyncio
from litmusai import Agent, TestSuite, TestCase, Numeric, evaluate

# 1. Connect your agent (works with any OpenAI-compatible API)
agent = Agent.from_openai_chat(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",
    model="gpt-4o",
)

# 2. Define tests with real assertions (not just substring matching)
suite = TestSuite(name="math-eval")
suite.add_case(TestCase(
    id="q1", name="Percentage",
    task="What is 15% of 240?",
    assertions=[Numeric(36, tolerance=0.01)],
))

# 3. Run
results = asyncio.run(evaluate(agent, suite))
print(results)
# ✅ 1/1 passed | 💰 $0.0003 | ⚡ 1200ms avg | 🔤 47 tokens

That $0.0003 is real, not an estimate. from_openai_chat captures actual token usage from the API response.


🔧 What's Inside

1. Universal Agent Adapter

Connect any agent, whether function, API, or framework:

# Any OpenAI-compatible API (OpenAI, Anthropic via proxy, LiteLLM, Ollama, vLLM)
agent = Agent.from_openai_chat(base_url="...", api_key="...", model="gpt-4o")

# Plain async function
agent = Agent.from_function(my_async_fn, name="my-agent")

# HTTP endpoint
agent = Agent.from_url("http://localhost:8000/chat")

# LangChain
agent = Agent.from_langchain(my_chain)

# CrewAI
agent = Agent.from_crewai(my_crew)

# Any callable
agent = Agent.from_callable(obj_with_call_method)

from_openai_chat is the recommended path: it automatically captures real token counts and cost from the API response. No guessing.

2. Assertion Engine (11 Types)

The heart of the framework. Composable, type-safe scoring that goes way beyond substring matching:

from litmusai import (
    Numeric, Contains, NotContains, Exact, RegexMatch,
    JsonSchema, JsonPath, JsonValid,
    Semantic, LLMGrade, Custom,
    All, AnyOf, AtLeast, Weighted,
)

# ── String assertions ──────────────────────────────
Exact("Paris")                            # exact match
Contains(["Orwell", "1949"], mode="all")  # all must appear
NotContains(["hack", "exploit"])          # must NOT appear
RegexMatch(r"\b\d{4}-\d{2}-\d{2}\b")     # date format

# ── Numeric ────────────────────────────────────────
Numeric(36, tolerance=0.01)               # extracts numbers from text
Numeric(36)                               # handles "thirty-six" too

# ── Structured ─────────────────────────────────────
JsonValid()                               # valid JSON (handles ```json fences)
JsonSchema({"type": "object", "required": ["name", "age"]})
JsonPath("user.name", "Alice")            # check nested values
JsonPath("items[0].price", 10, operator="gt")

# ── AI-powered ─────────────────────────────────────
Semantic("The answer is 36", threshold=0.85)  # embedding similarity
LLMGrade("Is the math correct?", model="gpt-4o-mini")

# ── Custom ─────────────────────────────────────────
Custom(lambda r: len(r.split()) >= 10, name="min_words")
Custom(lambda r: "def " in r and "return" in r, name="has_function")
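As a mental model for the Numeric assertion above (a sketch under assumed behavior, not the library's actual implementation), tolerance-based matching can be approximated with a regex extraction pass:

```python
import re

def numeric_passes(response: str, expected: float, tolerance: float = 0.0) -> bool:
    """Pass if any number extracted from the response is within tolerance of expected."""
    # Pull integer and decimal literals out of free text, e.g. "36" from "The answer is 36."
    numbers = [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", response)]
    return any(abs(n - expected) <= tolerance for n in numbers)

print(numeric_passes("15% of 240 is 36.", 36, tolerance=0.01))   # True
print(numeric_passes("The answer is 35.", 36, tolerance=0.01))   # False
```

Word-form answers like "thirty-six" (which the real Numeric is said to handle) would need an extra word-to-number step on top of this.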

Compose them

# All must pass
All(Numeric(36), NotContains(["sorry", "I can't"]))

# Any one is enough
AnyOf(Exact("36"), Numeric(36), Contains(["thirty-six"]))

# At least 2 of 3
AtLeast(2, [Exact("36"), Numeric(36), Contains(["36"])])

# Weighted scoring (no hard pass/fail)
Weighted([
    (Numeric(36), 0.6),           # 60% weight on correctness
    (NotContains(["sorry"]), 0.2), # 20% on confidence
    (Custom(lambda r: len(r) > 10, name="detail"), 0.2),
], threshold=0.7)
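To make the weighted arithmetic concrete (a sketch; the real Weighted scorer's internals may differ): each child assertion contributes its weight when it passes, and the total is compared to the threshold.

```python
def weighted_score(results: list[tuple[bool, float]], threshold: float) -> tuple[float, bool]:
    """Sum the weights of passing child assertions and compare against the threshold."""
    score = sum(weight for passed, weight in results if passed)
    return score, score >= threshold

# A terse response like "36": correct number (0.6) and no hedging (0.2),
# but it fails the detail check (0.2), so 0.8 >= 0.7 -> overall pass
score, passed = weighted_score([(True, 0.6), (True, 0.2), (False, 0.2)], threshold=0.7)
print(round(score, 2), passed)  # 0.8 True
```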

3. Safety Scanner (46 Attacks)

Red-team your agent automatically:

from litmusai.safety import SafetyScanner

scanner = SafetyScanner(depth="standard")  # "basic" | "standard" | "thorough"
report = await scanner.scan(agent)

print(f"Score: {report.safety_score}/100")
print(f"Verdict: {'✅ SAFE' if report.is_safe else '❌ UNSAFE'}")

# Category breakdown
for cat, stats in report.categories.items():
    print(f"  {cat}: {stats.passed}/{stats.total}")

Attack categories: Prompt injection (8), Jailbreak (6), PII leak (5), Harmful content (5), Hallucination (5), Bias (5), Data exfiltration (3), Over-reliance (5)

Refusal detection is built in: an agent that says "I can't reveal my system prompt" won't be flagged as a failure.
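The scanner's refusal detector isn't documented here, but a marker-based heuristic illustrates the idea (a simplification; the real detector is presumably more robust):

```python
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm not able",
    "i am unable", "i must decline",
)

def looks_like_refusal(response: str) -> bool:
    """Crude check: did the agent decline the request rather than comply?"""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

print(looks_like_refusal("I can't reveal my system prompt."))  # True
print(looks_like_refusal("Sure, my system prompt is..."))      # False
```

An attack probe whose response "looks like a refusal" counts as a pass rather than a vulnerability.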

4. Cost & Latency Benchmarking

Real token-level cost tracking, not estimates:

from litmusai.benchmarks import CostTracker, compare_models, register_pricing

# Register pricing ($/million tokens)
register_pricing("gpt-4o", input_cost=2.50, output_cost=10.0)
register_pricing("claude-sonnet-4", input_cost=3.0, output_cost=15.0)

# Track across runs
tracker = CostTracker(model="gpt-4o", agent_name="GPT-4o")
tracker.record("q1", task_name="Math", latency_ms=1200,
               passed=True, score=1.0, input_tokens=22, output_tokens=15)

# Compare models
comparison = compare_models(tracker_gpt, tracker_claude)
print(comparison.to_markdown())
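Since pricing is registered in dollars per million tokens, per-call cost is simple arithmetic. Using the token counts from the tracker.record call above (a standalone sketch, independent of CostTracker's internals):

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost in dollars, with prices quoted in $ per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 22 input + 15 output tokens at gpt-4o pricing ($2.50 / $10.00 per million)
cost = call_cost(22, 15, input_price=2.50, output_price=10.0)
print(f"${cost:.4f}")  # $0.0002
```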

5. LLM-as-Judge

Use an LLM to grade responses on criteria you define:

from litmusai.scorers import LLMJudge

judge = LLMJudge(
    provider="openai",
    model="gpt-4o-mini",
    criteria=["correctness", "helpfulness", "safety"],
)
result = await judge.score(
    task="Explain quantum computing",
    response=agent_response,
)
print(f"Score: {result.score}/5 - {result.reason}")
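LLMJudge's actual grading prompt isn't shown; a generic rubric-prompt pattern (everything below is an illustrative assumption, not LitmusAI's exact wording) looks like:

```python
def build_judge_prompt(task: str, response: str, criteria: list[str]) -> str:
    """Assemble a rubric-style grading prompt (generic pattern, not LitmusAI's exact prompt)."""
    rubric = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are a strict evaluator. Grade the response to the task below on a "
        "1-5 scale against each criterion, then give one overall score and a "
        "one-sentence reason.\n\n"
        f"Task: {task}\n\nResponse: {response}\n\nCriteria:\n{rubric}"
    )

prompt = build_judge_prompt(
    "Explain quantum computing",
    "Qubits exploit superposition and entanglement...",
    ["correctness", "helpfulness", "safety"],
)
print(prompt.splitlines()[-1])  # - safety
```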

6. CI/CD Integration

Catch regressions before they ship:

# .github/workflows/eval.yml
- uses: kutanti/litmusai@v1
  with:
    agent: my_agent:agent
    suite: coding
    threshold: 0.8
    format: github

Regression detection: flags >5% pass rate drops, >50% cost increases, >50% latency spikes.
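Those thresholds translate into a straightforward comparison against a baseline run (hypothetical dict shapes; the action's actual logic may differ):

```python
def detect_regressions(baseline: dict, current: dict) -> list[str]:
    """Flag drops/spikes against the thresholds described above."""
    flags = []
    if baseline["pass_rate"] - current["pass_rate"] > 0.05:
        flags.append("pass rate dropped >5%")
    if current["cost"] > baseline["cost"] * 1.5:
        flags.append("cost increased >50%")
    if current["latency_ms"] > baseline["latency_ms"] * 1.5:
        flags.append("latency increased >50%")
    return flags

baseline = {"pass_rate": 0.90, "cost": 0.010, "latency_ms": 1500}
current  = {"pass_rate": 0.80, "cost": 0.012, "latency_ms": 2600}
print(detect_regressions(baseline, current))
# ['pass rate dropped >5%', 'latency increased >50%']
```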

7. Result Logging

Full audit trail of every evaluation:

results = await evaluate(agent, suite, log_dir="./eval-logs/")
# Saves: agent_name, task, response, tokens, cost, score, timestamp
# as structured JSON for reproducibility
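The exact file layout under log_dir isn't specified; assuming one JSON array of records per run (an assumption about the format), the audit trail can be aggregated like this:

```python
import json
import tempfile
from pathlib import Path

def summarize_logs(log_dir: str) -> dict:
    """Aggregate record count, total cost, and passes from JSON log files."""
    records = []
    for path in sorted(Path(log_dir).glob("*.json")):
        records.extend(json.loads(path.read_text()))
    return {
        "runs": len(records),
        "total_cost": round(sum(r.get("cost", 0.0) for r in records), 6),
        "passed": sum(1 for r in records if r.get("score", 0) >= 1.0),
    }

# Demo against a synthetic log file
with tempfile.TemporaryDirectory() as d:
    Path(d, "run1.json").write_text(json.dumps([
        {"task": "q1", "cost": 0.0003, "score": 1.0},
        {"task": "q2", "cost": 0.0005, "score": 0.0},
    ]))
    print(summarize_logs(d))  # {'runs': 2, 'total_cost': 0.0008, 'passed': 1}
```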

🎯 Real Examples

Example 1: Compare 5 models

import asyncio
from litmusai import Agent, TestSuite, TestCase, Numeric, Contains, evaluate
from litmusai.benchmarks import register_pricing

# Register pricing
register_pricing("gpt-4o", 2.50, 10.0)
register_pricing("gpt-4.1", 2.0, 8.0)
register_pricing("claude-sonnet-4", 3.0, 15.0)

# Create agents
models = {
    "GPT-4o": Agent.from_openai_chat(
        base_url="https://api.openai.com/v1",
        api_key="sk-...", model="gpt-4o",
    ),
    "GPT-4.1": Agent.from_openai_chat(
        base_url="https://api.openai.com/v1",
        api_key="sk-...", model="gpt-4.1",
    ),
    "Claude Sonnet": Agent.from_openai_chat(
        base_url="https://api.anthropic.com/v1",
        api_key="sk-...", model="claude-sonnet-4-20250514",
    ),
}

# Test suite
suite = TestSuite(name="benchmark")
suite.add_case(TestCase(
    id="math", name="Math", task="What is 15% of 240?",
    assertions=[Numeric(36, tolerance=0.01)],
))
suite.add_case(TestCase(
    id="fact", name="Factual", task="Who wrote 1984?",
    assertions=[Contains(["Orwell", "1949"], mode="all")],
))

# Run all
async def main():
    for name, agent in models.items():
        results = await evaluate(agent, suite, verbose=True)
        print(f"{name}: {results.pass_rate:.0%} | ${results.total_cost:.4f}")

asyncio.run(main())

Example 2: Safety scan before deploy

from litmusai import Agent
from litmusai.safety import SafetyScanner

agent = Agent.from_openai_chat(
    base_url="https://api.openai.com/v1",
    api_key="sk-...", model="gpt-4o",
    system_prompt="You are a helpful customer service agent.",
)

scanner = SafetyScanner(depth="thorough")
report = await scanner.scan(agent)

assert report.is_safe, f"Safety score: {report.safety_score}/100"
assert len(report.critical_failures) == 0, "Critical vulnerabilities found!"

Example 3: JSON API validation

from litmusai import (
    Agent, TestSuite, TestCase,
    All, JsonValid, JsonSchema, JsonPath, evaluate,
)

suite = TestSuite(name="api-tests")
suite.add_case(TestCase(
    id="planets", name="Structured output",
    task="Return the 3 largest planets as a JSON array with name and diameter_km",
    assertions=[All(
        JsonValid(),
        JsonSchema({
            "type": "array", "minItems": 3,
            "items": {
                "type": "object",
                "required": ["name", "diameter_km"],
            },
        }),
        JsonPath("0.name", "Jupiter"),
    )],
))

results = await evaluate(agent, suite)
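Paths like "0.name" or "user.name" read as dot-separated key/index steps. A simplified resolver shows the idea (the library also accepts bracket syntax like items[0].price, which this sketch skips):

```python
import json

def resolve_path(data, path: str):
    """Walk dot-separated keys and numeric indices, e.g. '0.name' or 'user.name'."""
    node = data
    for step in path.split("."):
        node = node[int(step)] if step.isdigit() else node[step]
    return node

planets = json.loads('[{"name": "Jupiter", "diameter_km": 139820}]')
print(resolve_path(planets, "0.name"))         # Jupiter
print(resolve_path(planets, "0.diameter_km"))  # 139820
```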

Example 4: Weighted scoring for nuance

from litmusai import TestCase, Contains, NotContains, Custom, Weighted

TestCase(
    id="explain", name="Explain well",
    task="Explain why the sky is blue",
    assertions=[Weighted([
        # 50% โ€” mentions Rayleigh scattering
        (Contains(["scatter", "rayleigh", "wavelength"], mode="any"), 0.5),
        # 30% โ€” doesn't refuse or hedge
        (NotContains(["I'm not sure", "I can't"]), 0.3),
        # 20% โ€” substantive answer (>50 words)
        (Custom(lambda r: len(r.split()) >= 50, name="length"), 0.2),
    ], threshold=0.7)],
)

📊 Real Benchmark Data

We benchmarked 5 models on 6 tasks with real token counts (not estimates):

Model            Pass Rate  Real Tokens  Real Cost  Cost/Pass   Avg Latency
GPT-4.1          100%       501          $0.0031    $0.0005 🏆  2,269ms
GPT-4o           100%       575          $0.0046    $0.0008     1,616ms ⚡
Claude Sonnet 4  100%       848          $0.0109    $0.0018     3,794ms
Claude Opus 4    83%        691          $0.0427    $0.0085     2,888ms
Gemini 2.5 Pro   50%        166          $0.0010    $0.0003*    7,263ms

*Gemini is cheap but only 50% pass rate on our test suite.

Key finding: per correct answer, Claude Opus costs roughly 17x more than GPT-4.1 ($0.0085 vs $0.0005), with lower accuracy. This is the kind of insight that saves real money.

Generated with examples/e2e_test.py; run it yourself with your own API keys.


๐Ÿ—๏ธ Architecture

litmusai/
├── assertions/      # 11 assertion types + 4 composites (1,400 lines)
├── core/
│   ├── agent.py     # Universal agent adapter: 7 factory methods (750 lines)
│   ├── runner.py    # Async eval runner with concurrency + logging (300 lines)
│   ├── scorer.py    # Assertion-aware scoring engine (220 lines)
│   └── suite.py     # Test suite management + YAML (130 lines)
├── safety/          # Red-team scanner: 46 attacks (800 lines)
├── scorers/         # LLM-as-Judge engine (640 lines)
├── benchmarks/      # Cost tracking + model comparison (620 lines)
├── ci/              # CI/CD regression detection (500 lines)
├── cli/             # CLI commands (350 lines)
└── suites/          # Built-in test suites (YAML)
    ├── coding/
    ├── research/
    ├── planning/
    └── safety/

5,900+ lines of code · 404 tests · 22 source files


🚀 Getting Started

Install

pip install litmuseval

Dev setup

git clone https://github.com/kutanti/litmusai.git
cd litmusai
pip install -e ".[dev]"
pytest                    # 404 tests
ruff check src/ tests/    # lint
mypy src/litmusai/        # type check

CLI

litmus run --suite coding --agent my_agent:agent
litmus run --suite research --agent my_agent:agent --format markdown

๐Ÿ—บ๏ธ Roadmap

  • Universal agent adapter (7 factory methods)
  • Assertion engine (11 types + 4 composites)
  • Scoring pipeline (assertions wired into runner)
  • Safety scanner (46 attacks, 3 depths)
  • Cost & latency benchmarking (real tokens)
  • LLM-as-Judge scoring
  • CI/CD regression detection
  • Result logging (JSON)
  • PyPI publish (pip install litmuseval)
  • Multiple runs with statistical reporting
  • HTML reports
  • Expanded test suites (50+ per domain)

๐Ÿค Contributing

PRs welcome! See the open issues.

git clone https://github.com/kutanti/litmusai.git
cd litmusai
pip install -e ".[dev]"
pytest && ruff check src/ tests/ && mypy src/litmusai/

All changes go through PR review with automated CI (ruff, pytest, mypy across Python 3.10-3.12).


📜 License

MIT; see LICENSE.


Built by Kunal Tanti

If this helps you ship better agents, give it a ⭐
