
The open-source evaluation framework for AI agents: test, compare, ship with confidence


🧪 LitmusAI

Eval framework for AI agents that actually works

CI Python 3.10+ Tests License: MIT

5,900+ lines of code · 404 tests · 11 assertion types · 46 safety attacks · 5-model benchmarks

Quickstart · Features · Assertions · Examples · Benchmark Results


Why?

You deployed an agent. It works in demos. But:

  • Does it still work after your last commit?
  • Is Claude actually better than GPT for your use case, or just more expensive?
  • Did your prompt change introduce a safety regression?
  • What's your actual cost per correct answer?

LitmusAI answers these with one function call.


⚡ Quickstart

pip install litmuseval

30-second eval

import asyncio
from litmusai import Agent, TestSuite, TestCase, Numeric, evaluate

# 1. Connect your agent (works with any OpenAI-compatible API)
agent = Agent.from_openai_chat(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",
    model="gpt-4o",
)

# 2. Define tests with real assertions (not just substring matching)
suite = TestSuite(name="math-eval")
suite.add_case(TestCase(
    id="q1", name="Percentage",
    task="What is 15% of 240?",
    assertions=[Numeric(36, tolerance=0.01)],
))

# 3. Run
results = asyncio.run(evaluate(agent, suite))
print(results)
# ✅ 1/1 passed | 💰 $0.0003 | ⚡ 1200ms avg | 🔤 47 tokens

That $0.0003 is real, not an estimate. from_openai_chat captures actual token usage from the API response.


🔧 What's Inside

1. Universal Agent Adapter

Connect any agent, whether function, API, or framework:

# Any OpenAI-compatible API (OpenAI, Anthropic via proxy, LiteLLM, Ollama, vLLM)
agent = Agent.from_openai_chat(base_url="...", api_key="...", model="gpt-4o")

# Plain async function
agent = Agent.from_function(my_async_fn, name="my-agent")

# HTTP endpoint
agent = Agent.from_url("http://localhost:8000/chat")

# LangChain
agent = Agent.from_langchain(my_chain)

# CrewAI
agent = Agent.from_crewai(my_crew)

# Any callable
agent = Agent.from_callable(obj_with_call_method)

from_openai_chat is the recommended path: it automatically captures real token counts and cost from the API response. No guessing.

2. Assertion Engine (11 Types)

The heart of the framework. Composable, type-safe scoring that goes way beyond substring matching:

from litmusai import (
    Numeric, Contains, NotContains, Exact, RegexMatch,
    JsonSchema, JsonPath, JsonValid,
    Semantic, LLMGrade, Custom,
    All, AnyOf, AtLeast, Weighted,
)

# ── String assertions ──────────────────────────────
Exact("Paris")                            # exact match
Contains(["Orwell", "1949"], mode="all")  # all must appear
NotContains(["hack", "exploit"])          # must NOT appear
RegexMatch(r"\b\d{4}-\d{2}-\d{2}\b")     # date format

# ── Numeric ────────────────────────────────────────
Numeric(36, tolerance=0.01)               # extracts numbers from text
Numeric(36)                               # handles "thirty-six" too

# ── Structured ─────────────────────────────────────
JsonValid()                               # valid JSON (handles ```json fences)
JsonSchema({"type": "object", "required": ["name", "age"]})
JsonPath("user.name", "Alice")            # check nested values
JsonPath("items[0].price", 10, operator="gt")

# ── AI-powered ─────────────────────────────────────
Semantic("The answer is 36", threshold=0.85)  # embedding similarity
LLMGrade("Is the math correct?", model="gpt-4o-mini")

# ── Custom ─────────────────────────────────────────
Custom(lambda r: len(r.split()) >= 10, name="min_words")
Custom(lambda r: "def " in r and "return" in r, name="has_function")
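As a mental model for the Numeric assertion above (a sketch under assumed behavior, not the library's actual implementation), tolerance-based matching can be approximated with a regex extraction pass:

```python
import re

def numeric_passes(response: str, expected: float, tolerance: float = 0.0) -> bool:
    """Pass if any number extracted from the response is within tolerance of expected."""
    # Pull integer and decimal literals out of free text, e.g. "36" from "The answer is 36."
    numbers = [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", response)]
    return any(abs(n - expected) <= tolerance for n in numbers)

print(numeric_passes("15% of 240 is 36.", 36, tolerance=0.01))   # True
print(numeric_passes("The answer is 35.", 36, tolerance=0.01))   # False
```

Word-form answers like "thirty-six" (which the real Numeric is said to handle) would need an extra word-to-number step on top of this.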

Compose them

# All must pass
All(Numeric(36), NotContains(["sorry", "I can't"]))

# Any one is enough
AnyOf(Exact("36"), Numeric(36), Contains(["thirty-six"]))

# At least 2 of 3
AtLeast(2, [Exact("36"), Numeric(36), Contains(["36"])])

# Weighted scoring (no hard pass/fail)
Weighted([
    (Numeric(36), 0.6),           # 60% weight on correctness
    (NotContains(["sorry"]), 0.2), # 20% on confidence
    (Custom(lambda r: len(r) > 10, name="detail"), 0.2),
], threshold=0.7)
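To make the weighted arithmetic concrete (a sketch; the real Weighted scorer's internals may differ): each child assertion contributes its weight when it passes, and the total is compared to the threshold.

```python
def weighted_score(results: list[tuple[bool, float]], threshold: float) -> tuple[float, bool]:
    """Sum the weights of passing child assertions and compare against the threshold."""
    score = sum(weight for passed, weight in results if passed)
    return score, score >= threshold

# A terse response like "36": correct number (0.6) and no hedging (0.2),
# but it fails the detail check (0.2), so 0.8 >= 0.7 -> overall pass
score, passed = weighted_score([(True, 0.6), (True, 0.2), (False, 0.2)], threshold=0.7)
print(round(score, 2), passed)  # 0.8 True
```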

3. Safety Scanner (46 Attacks)

Red-team your agent automatically:

from litmusai.safety import SafetyScanner

scanner = SafetyScanner(depth="standard")  # "basic" | "standard" | "thorough"
report = await scanner.scan(agent)

print(f"Score: {report.safety_score}/100")
print(f"Verdict: {'✅ SAFE' if report.is_safe else '❌ UNSAFE'}")

# Category breakdown
for cat, stats in report.categories.items():
    print(f"  {cat}: {stats.passed}/{stats.total}")

Attack categories: Prompt injection (8), Jailbreak (6), PII leak (5), Harmful content (5), Hallucination (5), Bias (5), Data exfiltration (3), Over-reliance (5)

Refusal detection is built in: an agent that says "I can't reveal my system prompt" won't be flagged as a failure.
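The scanner's refusal detector isn't documented here, but a marker-based heuristic illustrates the idea (a simplification; the real detector is presumably more robust):

```python
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm not able",
    "i am unable", "i must decline",
)

def looks_like_refusal(response: str) -> bool:
    """Crude check: did the agent decline the request rather than comply?"""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

print(looks_like_refusal("I can't reveal my system prompt."))  # True
print(looks_like_refusal("Sure, my system prompt is..."))      # False
```

An attack probe whose response "looks like a refusal" counts as a pass rather than a vulnerability.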

4. Cost & Latency Benchmarking

Real token-level cost tracking, not estimates:

from litmusai.benchmarks import CostTracker, compare_models, register_pricing

# Register pricing ($/million tokens)
register_pricing("gpt-4o", input_cost=2.50, output_cost=10.0)
register_pricing("claude-sonnet-4", input_cost=3.0, output_cost=15.0)

# Track across runs
tracker = CostTracker(model="gpt-4o", agent_name="GPT-4o")
tracker.record("q1", task_name="Math", latency_ms=1200,
               passed=True, score=1.0, input_tokens=22, output_tokens=15)

# Compare models
comparison = compare_models(tracker_gpt, tracker_claude)
print(comparison.to_markdown())
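Since pricing is registered in dollars per million tokens, per-call cost is simple arithmetic. Using the token counts from the tracker.record call above (a standalone sketch, independent of CostTracker's internals):

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost in dollars, with prices quoted in $ per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 22 input + 15 output tokens at gpt-4o pricing ($2.50 / $10.00 per million)
cost = call_cost(22, 15, input_price=2.50, output_price=10.0)
print(f"${cost:.4f}")  # $0.0002
```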

5. LLM-as-Judge

Use an LLM to grade responses on criteria you define:

from litmusai.scorers import LLMJudge

judge = LLMJudge(
    provider="openai",
    model="gpt-4o-mini",
    criteria=["correctness", "helpfulness", "safety"],
)
result = await judge.score(
    task="Explain quantum computing",
    response=agent_response,
)
print(f"Score: {result.score}/5 - {result.reason}")
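LLMJudge's actual grading prompt isn't shown; a generic rubric-prompt pattern (everything below is an illustrative assumption, not LitmusAI's exact wording) looks like:

```python
def build_judge_prompt(task: str, response: str, criteria: list[str]) -> str:
    """Assemble a rubric-style grading prompt (generic pattern, not LitmusAI's exact prompt)."""
    rubric = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are a strict evaluator. Grade the response to the task below on a "
        "1-5 scale against each criterion, then give one overall score and a "
        "one-sentence reason.\n\n"
        f"Task: {task}\n\nResponse: {response}\n\nCriteria:\n{rubric}"
    )

prompt = build_judge_prompt(
    "Explain quantum computing",
    "Qubits exploit superposition and entanglement...",
    ["correctness", "helpfulness", "safety"],
)
print(prompt.splitlines()[-1])  # - safety
```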

6. CI/CD Integration

Catch regressions before they ship:

# .github/workflows/eval.yml
- uses: kutanti/litmusai@v1
  with:
    agent: my_agent:agent
    suite: coding
    threshold: 0.8
    format: github

Regression detection: flags >5% pass rate drops, >50% cost increases, >50% latency spikes.
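Those thresholds translate into a straightforward comparison against a baseline run (hypothetical dict shapes; the action's actual logic may differ):

```python
def detect_regressions(baseline: dict, current: dict) -> list[str]:
    """Flag drops/spikes against the thresholds described above."""
    flags = []
    if baseline["pass_rate"] - current["pass_rate"] > 0.05:
        flags.append("pass rate dropped >5%")
    if current["cost"] > baseline["cost"] * 1.5:
        flags.append("cost increased >50%")
    if current["latency_ms"] > baseline["latency_ms"] * 1.5:
        flags.append("latency increased >50%")
    return flags

baseline = {"pass_rate": 0.90, "cost": 0.010, "latency_ms": 1500}
current  = {"pass_rate": 0.80, "cost": 0.012, "latency_ms": 2600}
print(detect_regressions(baseline, current))
# ['pass rate dropped >5%', 'latency increased >50%']
```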

7. Result Logging

Full audit trail of every evaluation:

results = await evaluate(agent, suite, log_dir="./eval-logs/")
# Saves: agent_name, task, response, tokens, cost, score, timestamp
# as structured JSON for reproducibility
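The exact file layout under log_dir isn't specified; assuming one JSON array of records per run (an assumption about the format), the audit trail can be aggregated like this:

```python
import json
import tempfile
from pathlib import Path

def summarize_logs(log_dir: str) -> dict:
    """Aggregate record count, total cost, and passes from JSON log files."""
    records = []
    for path in sorted(Path(log_dir).glob("*.json")):
        records.extend(json.loads(path.read_text()))
    return {
        "runs": len(records),
        "total_cost": round(sum(r.get("cost", 0.0) for r in records), 6),
        "passed": sum(1 for r in records if r.get("score", 0) >= 1.0),
    }

# Demo against a synthetic log file
with tempfile.TemporaryDirectory() as d:
    Path(d, "run1.json").write_text(json.dumps([
        {"task": "q1", "cost": 0.0003, "score": 1.0},
        {"task": "q2", "cost": 0.0005, "score": 0.0},
    ]))
    print(summarize_logs(d))  # {'runs': 2, 'total_cost': 0.0008, 'passed': 1}
```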

🎯 Real Examples

Example 1: Compare 5 models

import asyncio
from litmusai import Agent, TestSuite, TestCase, Numeric, Contains, evaluate
from litmusai.benchmarks import register_pricing

# Register pricing
register_pricing("gpt-4o", 2.50, 10.0)
register_pricing("gpt-4.1", 2.0, 8.0)
register_pricing("claude-sonnet-4", 3.0, 15.0)

# Create agents
models = {
    "GPT-4o": Agent.from_openai_chat(
        base_url="https://api.openai.com/v1",
        api_key="sk-...", model="gpt-4o",
    ),
    "GPT-4.1": Agent.from_openai_chat(
        base_url="https://api.openai.com/v1",
        api_key="sk-...", model="gpt-4.1",
    ),
    "Claude Sonnet": Agent.from_openai_chat(
        base_url="https://api.anthropic.com/v1",
        api_key="sk-...", model="claude-sonnet-4-20250514",
    ),
}

# Test suite
suite = TestSuite(name="benchmark")
suite.add_case(TestCase(
    id="math", name="Math", task="What is 15% of 240?",
    assertions=[Numeric(36, tolerance=0.01)],
))
suite.add_case(TestCase(
    id="fact", name="Factual", task="Who wrote 1984?",
    assertions=[Contains(["Orwell", "1949"], mode="all")],
))

# Run all
async def main():
    for name, agent in models.items():
        results = await evaluate(agent, suite, verbose=True)
        print(f"{name}: {results.pass_rate:.0%} | ${results.total_cost:.4f}")

asyncio.run(main())

Example 2: Safety scan before deploy

from litmusai import Agent
from litmusai.safety import SafetyScanner

agent = Agent.from_openai_chat(
    base_url="https://api.openai.com/v1",
    api_key="sk-...", model="gpt-4o",
    system_prompt="You are a helpful customer service agent.",
)

scanner = SafetyScanner(depth="thorough")
report = await scanner.scan(agent)

assert report.is_safe, f"Safety score: {report.safety_score}/100"
assert len(report.critical_failures) == 0, "Critical vulnerabilities found!"

Example 3: JSON API validation

from litmusai import (
    Agent, TestSuite, TestCase,
    All, JsonValid, JsonSchema, JsonPath, evaluate,
)

suite = TestSuite(name="api-tests")
suite.add_case(TestCase(
    id="planets", name="Structured output",
    task="Return the 3 largest planets as a JSON array with name and diameter_km",
    assertions=[All(
        JsonValid(),
        JsonSchema({
            "type": "array", "minItems": 3,
            "items": {
                "type": "object",
                "required": ["name", "diameter_km"],
            },
        }),
        JsonPath("0.name", "Jupiter"),
    )],
))

results = await evaluate(agent, suite)
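Paths like "0.name" or "user.name" read as dot-separated key/index steps. A simplified resolver shows the idea (the library also accepts bracket syntax like items[0].price, which this sketch skips):

```python
import json

def resolve_path(data, path: str):
    """Walk dot-separated keys and numeric indices, e.g. '0.name' or 'user.name'."""
    node = data
    for step in path.split("."):
        node = node[int(step)] if step.isdigit() else node[step]
    return node

planets = json.loads('[{"name": "Jupiter", "diameter_km": 139820}]')
print(resolve_path(planets, "0.name"))         # Jupiter
print(resolve_path(planets, "0.diameter_km"))  # 139820
```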

Example 4: Weighted scoring for nuance

from litmusai import TestCase, Contains, NotContains, Custom, Weighted

TestCase(
    id="explain", name="Explain well",
    task="Explain why the sky is blue",
    assertions=[Weighted([
        # 50% โ€” mentions Rayleigh scattering
        (Contains(["scatter", "rayleigh", "wavelength"], mode="any"), 0.5),
        # 30% โ€” doesn't refuse or hedge
        (NotContains(["I'm not sure", "I can't"]), 0.3),
        # 20% โ€” substantive answer (>50 words)
        (Custom(lambda r: len(r.split()) >= 50, name="length"), 0.2),
    ], threshold=0.7)],
)

📊 Real Benchmark Data

We benchmarked 5 models on 6 tasks with real token counts (not estimates):

Model            Pass Rate  Real Tokens  Real Cost  Cost/Pass   Avg Latency
GPT-4.1          100%       501          $0.0031    $0.0005 🏆  2,269ms
GPT-4o           100%       575          $0.0046    $0.0008     1,616ms ⚡
Claude Sonnet 4  100%       848          $0.0109    $0.0018     3,794ms
Claude Opus 4    83%        691          $0.0427    $0.0085     2,888ms
Gemini 2.5 Pro   50%        166          $0.0010    $0.0003*    7,263ms

*Gemini is cheap but only 50% pass rate on our test suite.

Key finding: per correct answer, Claude Opus costs roughly 17x more than GPT-4.1 ($0.0085 vs $0.0005), with lower accuracy. This is the kind of insight that saves real money.

Generated with examples/e2e_test.py; run it yourself with your own API keys.


๐Ÿ—๏ธ Architecture

litmusai/
├── assertions/      # 11 assertion types + 4 composites (1,400 lines)
├── core/
│   ├── agent.py     # Universal agent adapter: 7 factory methods (750 lines)
│   ├── runner.py    # Async eval runner with concurrency + logging (300 lines)
│   ├── scorer.py    # Assertion-aware scoring engine (220 lines)
│   └── suite.py     # Test suite management + YAML (130 lines)
├── safety/          # Red-team scanner: 46 attacks (800 lines)
├── scorers/         # LLM-as-Judge engine (640 lines)
├── benchmarks/      # Cost tracking + model comparison (620 lines)
├── ci/              # CI/CD regression detection (500 lines)
├── cli/             # CLI commands (350 lines)
└── suites/          # Built-in test suites (YAML)
    ├── coding/
    ├── research/
    ├── planning/
    └── safety/

5,900+ lines of code · 404 tests · 22 source files


🚀 Getting Started

Install

pip install litmuseval

Dev setup

git clone https://github.com/kutanti/litmusai.git
cd litmusai
pip install -e ".[dev]"
pytest                    # 404 tests
ruff check src/ tests/    # lint
mypy src/litmusai/        # type check

CLI

litmus run --suite coding --agent my_agent:agent
litmus run --suite research --agent my_agent:agent --format markdown

๐Ÿ—บ๏ธ Roadmap

  • Universal agent adapter (7 factory methods)
  • Assertion engine (11 types + 4 composites)
  • Scoring pipeline (assertions wired into runner)
  • Safety scanner (46 attacks, 3 depths)
  • Cost & latency benchmarking (real tokens)
  • LLM-as-Judge scoring
  • CI/CD regression detection
  • Result logging (JSON)
  • PyPI publish (pip install litmuseval)
  • Multiple runs with statistical reporting
  • HTML reports
  • Expanded test suites (50+ per domain)

๐Ÿค Contributing

PRs welcome! See the open issues.

git clone https://github.com/kutanti/litmusai.git
cd litmusai
pip install -e ".[dev]"
pytest && ruff check src/ tests/ && mypy src/litmusai/

All changes go through PR review with automated CI (ruff, pytest, mypy across Python 3.10-3.12).


📜 License

MIT; see LICENSE.


Built by Kunal Tanti

If this helps you ship better agents, give it a ⭐
