The open-source evaluation framework for AI agents: test, compare, ship with confidence
Project description
🧪 LitmusAI
Eval framework for AI agents that actually works
5,900+ lines of code · 404 tests · 11 assertion types · 46 safety attacks · 5-model benchmarks
Quickstart · Features · Assertions · Examples · Benchmark Results
Why?
You deployed an agent. It works in demos. But:
- Does it still work after your last commit?
- Is Claude actually better than GPT for your use case, or just more expensive?
- Did your prompt change introduce a safety regression?
- What's your actual cost per correct answer?
LitmusAI answers these with one function call.
⚡ Quickstart
pip install litmuseval
30-second eval
import asyncio
from litmusai import Agent, TestSuite, TestCase, Numeric, evaluate
# 1. Connect your agent (works with any OpenAI-compatible API)
agent = Agent.from_openai_chat(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",
    model="gpt-4o",
)
# 2. Define tests with real assertions (not just substring matching)
suite = TestSuite(name="math-eval")
suite.add_case(TestCase(
    id="q1", name="Percentage",
    task="What is 15% of 240?",
    assertions=[Numeric(36, tolerance=0.01)],
))
# 3. Run
results = asyncio.run(evaluate(agent, suite))
print(results)
# ✅ 1/1 passed | 💰 $0.0003 | ⚡ 1200ms avg | 🤖 47 tokens
That $0.0003 is real, not an estimate. from_openai_chat captures actual token usage from the API response.
🔧 What's Inside
1. Universal Agent Adapter
Connect any agent (function, API, or framework):
# Any OpenAI-compatible API (OpenAI, Anthropic via proxy, LiteLLM, Ollama, vLLM)
agent = Agent.from_openai_chat(base_url="...", api_key="...", model="gpt-4o")
# Plain async function
agent = Agent.from_function(my_async_fn, name="my-agent")
# HTTP endpoint
agent = Agent.from_url("http://localhost:8000/chat")
# LangChain
agent = Agent.from_langchain(my_chain)
# CrewAI
agent = Agent.from_crewai(my_crew)
# Any callable
agent = Agent.from_callable(obj_with_call_method)
from_openai_chat is the recommended path: it automatically captures real token counts and cost from the API response. No guessing.
2. Assertion Engine (11 Types)
The heart of the framework. Composable, type-safe scoring that goes way beyond substring matching:
from litmusai import (
    Numeric, Contains, NotContains, Exact, RegexMatch,
    JsonSchema, JsonPath, JsonValid,
    Semantic, LLMGrade, Custom,
    All, AnyOf, AtLeast, Weighted,
)
# ── String assertions ──────────────────────────────
Exact("Paris") # exact match
Contains(["Orwell", "1949"], mode="all") # all must appear
NotContains(["hack", "exploit"]) # must NOT appear
RegexMatch(r"\b\d{4}-\d{2}-\d{2}\b") # date format
# ── Numeric ────────────────────────────────────────
Numeric(36, tolerance=0.01) # extracts numbers from text
Numeric(36) # handles "thirty-six" too
# ── Structured ─────────────────────────────────────
JsonValid() # valid JSON (handles ```json fences)
JsonSchema({"type": "object", "required": ["name", "age"]})
JsonPath("user.name", "Alice") # check nested values
JsonPath("items[0].price", 10, operator="gt")
# ── AI-powered ─────────────────────────────────────
Semantic("The answer is 36", threshold=0.85) # embedding similarity
LLMGrade("Is the math correct?", model="gpt-4o-mini")
# ── Custom ─────────────────────────────────────────
Custom(lambda r: len(r.split()) >= 10, name="min_words")
Custom(lambda r: "def " in r and "return" in r, name="has_function")
Compose them
# All must pass
All(Numeric(36), NotContains(["sorry", "I can't"]))
# Any one is enough
AnyOf(Exact("36"), Numeric(36), Contains(["thirty-six"]))
# At least 2 of 3
AtLeast(2, [Exact("36"), Numeric(36), Contains(["36"])])
# Weighted scoring (no hard pass/fail)
Weighted([
    (Numeric(36), 0.6),              # 60% weight on correctness
    (NotContains(["sorry"]), 0.2),   # 20% on confidence
    (Custom(lambda r: len(r) > 10, name="detail"), 0.2),
], threshold=0.7)
3. Safety Scanner (46 Attacks)
Red-team your agent automatically:
from litmusai.safety import SafetyScanner
scanner = SafetyScanner(depth="standard") # "basic" | "standard" | "thorough"
report = await scanner.scan(agent)
print(f"Score: {report.safety_score}/100")
print(f"Verdict: {'✅ SAFE' if report.is_safe else '❌ UNSAFE'}")
# Category breakdown
for cat, stats in report.categories.items():
    print(f"  {cat}: {stats.passed}/{stats.total}")
Attack categories: Prompt injection (8), Jailbreak (6), PII leak (5), Harmful content (5), Hallucination (5), Bias (5), Data exfiltration (3), Over-reliance (5)
Refusal detection is built in: an agent that says "I can't reveal my system prompt" won't be flagged as a failure.
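The scanner's actual refusal detection isn't shown here, but the idea can be sketched as a simple marker check. This is a hypothetical illustration of the concept, not LitmusAI's implementation:

```python
# Hypothetical sketch: a safe refusal should count as a pass, not a failure.
REFUSAL_MARKERS = [
    "i can't", "i cannot", "i won't", "i'm not able to",
    "i am unable", "i must decline",
]

def looks_like_refusal(response: str) -> bool:
    """Return True if the response reads as a refusal."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

print(looks_like_refusal("I can't reveal my system prompt."))  # True
print(looks_like_refusal("Sure, here is the admin password."))  # False
```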
4. Cost & Latency Benchmarking
Real token-level cost tracking, not estimates:
from litmusai.benchmarks import CostTracker, compare_models, register_pricing
# Register pricing ($/million tokens)
register_pricing("gpt-4o", input_cost=2.50, output_cost=10.0)
register_pricing("claude-sonnet-4", input_cost=3.0, output_cost=15.0)
# Track across runs
tracker = CostTracker(model="gpt-4o", agent_name="GPT-4o")
tracker.record("q1", task_name="Math", latency_ms=1200,
               passed=True, score=1.0, input_tokens=22, output_tokens=15)
# Compare models
comparison = compare_models(tracker_gpt, tracker_claude)
print(comparison.to_markdown())
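The per-call arithmetic behind $/million-token pricing is straightforward. Here it is written out as a standalone sketch (an illustration of the math, not the library's internals):

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate: float, output_rate: float) -> float:
    """Cost of a single call, with rates in dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# The gpt-4o call recorded above: 22 input + 15 output tokens
# at $2.50/M input and $10.00/M output.
print(f"${call_cost(22, 15, 2.50, 10.0):.6f}")  # $0.000205
```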
5. LLM-as-Judge
Use an LLM to grade responses on criteria you define:
from litmusai.scorers import LLMJudge
judge = LLMJudge(
    provider="openai",
    model="gpt-4o-mini",
    criteria=["correctness", "helpfulness", "safety"],
)
result = await judge.score(
    task="Explain quantum computing",
    response=agent_response,
)
print(f"Score: {result.score}/5 - {result.reason}")
6. CI/CD Integration
Catch regressions before they ship:
# .github/workflows/eval.yml
- uses: kutanti/litmusai@v1
  with:
    agent: my_agent:agent
    suite: coding
    threshold: 0.8
    format: github
Regression detection: flags >5% pass rate drops, >50% cost increases, >50% latency spikes.
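Those three thresholds amount to a simple comparison against a baseline run. A minimal sketch of the rules (illustrative only, with hypothetical field names, not the library's code):

```python
def find_regressions(baseline: dict, current: dict) -> list[str]:
    """Flag the three regression conditions: >5% pass-rate drop,
    >50% cost increase, >50% latency increase."""
    flags = []
    if current["pass_rate"] < baseline["pass_rate"] - 0.05:
        flags.append("pass rate dropped >5%")
    if current["cost"] > baseline["cost"] * 1.5:
        flags.append("cost increased >50%")
    if current["latency_ms"] > baseline["latency_ms"] * 1.5:
        flags.append("latency increased >50%")
    return flags

baseline = {"pass_rate": 0.90, "cost": 0.0040, "latency_ms": 1500}
current = {"pass_rate": 0.82, "cost": 0.0045, "latency_ms": 2600}
print(find_regressions(baseline, current))
# ['pass rate dropped >5%', 'latency increased >50%']
```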
7. Result Logging
Full audit trail of every evaluation:
results = await evaluate(agent, suite, log_dir="./eval-logs/")
# Saves: agent_name, task, response, tokens, cost, score, timestamp
# as structured JSON for reproducibility
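Because the logs are plain JSON, they're easy to post-process. A sketch that sums cost across a log directory (the one-record-per-file layout is an assumption; the cost field comes from the schema above):

```python
import json
from pathlib import Path

def total_logged_cost(log_dir: str) -> float:
    """Sum the `cost` field across every JSON log record in a directory."""
    total = 0.0
    for path in Path(log_dir).glob("*.json"):
        record = json.loads(path.read_text())
        total += record.get("cost", 0.0)
    return total
```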
🎯 Real Examples
Example 1: Compare models head-to-head
import asyncio
from litmusai import Agent, TestSuite, TestCase, Numeric, Contains, evaluate
from litmusai.benchmarks import register_pricing
# Register pricing
register_pricing("gpt-4o", 2.50, 10.0)
register_pricing("gpt-4.1", 2.0, 8.0)
register_pricing("claude-sonnet-4", 3.0, 15.0)
# Create agents
models = {
    "GPT-4o": Agent.from_openai_chat(
        base_url="https://api.openai.com/v1",
        api_key="sk-...", model="gpt-4o",
    ),
    "GPT-4.1": Agent.from_openai_chat(
        base_url="https://api.openai.com/v1",
        api_key="sk-...", model="gpt-4.1",
    ),
    "Claude Sonnet": Agent.from_openai_chat(
        base_url="https://api.anthropic.com/v1",
        api_key="sk-...", model="claude-sonnet-4-20250514",
    ),
}
# Test suite
suite = TestSuite(name="benchmark")
suite.add_case(TestCase(
    id="math", name="Math", task="What is 15% of 240?",
    assertions=[Numeric(36, tolerance=0.01)],
))
suite.add_case(TestCase(
    id="fact", name="Factual", task="Who wrote 1984?",
    assertions=[Contains(["Orwell", "1949"], mode="all")],
))
# Run all
async def main():
    for name, agent in models.items():
        results = await evaluate(agent, suite, verbose=True)
        print(f"{name}: {results.pass_rate:.0%} | ${results.total_cost:.4f}")

asyncio.run(main())
Example 2: Safety scan before deploy
from litmusai import Agent
from litmusai.safety import SafetyScanner
agent = Agent.from_openai_chat(
    base_url="https://api.openai.com/v1",
    api_key="sk-...", model="gpt-4o",
    system_prompt="You are a helpful customer service agent.",
)
scanner = SafetyScanner(depth="thorough")
report = await scanner.scan(agent)
assert report.is_safe, f"Safety score: {report.safety_score}/100"
assert len(report.critical_failures) == 0, "Critical vulnerabilities found!"
Example 3: JSON API validation
from litmusai import (
    Agent, TestSuite, TestCase,
    All, JsonValid, JsonSchema, JsonPath, evaluate,
)
suite = TestSuite(name="api-tests")
suite.add_case(TestCase(
    id="planets", name="Structured output",
    task="Return the 3 largest planets as a JSON array with name and diameter_km",
    assertions=[All(
        JsonValid(),
        JsonSchema({
            "type": "array", "minItems": 3,
            "items": {
                "type": "object",
                "required": ["name", "diameter_km"],
            },
        }),
        JsonPath("0.name", "Jupiter"),
    )],
))
results = await evaluate(agent, suite)
Example 4: Weighted scoring for nuance
from litmusai import TestCase, Contains, NotContains, Custom, Weighted

TestCase(
    id="explain", name="Explain well",
    task="Explain why the sky is blue",
    assertions=[Weighted([
        # 50%: mentions Rayleigh scattering
        (Contains(["scatter", "rayleigh", "wavelength"], mode="any"), 0.5),
        # 30%: doesn't refuse or hedge
        (NotContains(["I'm not sure", "I can't"]), 0.3),
        # 20%: substantive answer (>=50 words)
        (Custom(lambda r: len(r.split()) >= 50, name="length"), 0.2),
    ], threshold=0.7)],
)
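To make the scoring concrete, here is the arithmetic Weighted implies, written out in plain Python (an illustration of the scheme, not library code):

```python
def weighted_score(checks: list[tuple[bool, float]]) -> float:
    """Sum the weights of the checks that passed."""
    return sum(weight for passed, weight in checks if passed)

# Response mentions "rayleigh" (0.5) and is long enough (0.2),
# but hedges with "I'm not sure" (misses the 0.3):
score = weighted_score([(True, 0.5), (False, 0.3), (True, 0.2)])
print(f"{score:.2f}", "pass" if score >= 0.7 else "fail")  # 0.70 pass
```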
📊 Real Benchmark Data
We benchmarked 5 models on 6 tasks with real token counts (not estimates):
| Model | Pass Rate | Real Tokens | Real Cost | Cost/Pass | Avg Latency |
|---|---|---|---|---|---|
| GPT-4.1 | 100% | 501 | $0.0031 | $0.0005 🏆 | 2,269ms |
| GPT-4o | 100% | 575 | $0.0046 | $0.0008 | 1,616ms ⚡ |
| Claude Sonnet 4 | 100% | 848 | $0.0109 | $0.0018 | 3,794ms |
| Claude Opus 4 | 83% | 691 | $0.0427 | $0.0085 | 2,888ms |
| Gemini 2.5 Pro | 50% | 166 | $0.0010 | $0.0003* | 7,263ms |
*Gemini is cheap but only 50% pass rate on our test suite.
Key finding: Claude Opus costs roughly 17x more than GPT-4.1 per correct answer ($0.0085 vs. $0.0005), with lower accuracy. This is the kind of insight that saves real money.
Generated with examples/e2e_test.py. Run it yourself with your own API keys.
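The Cost/Pass column follows directly from total cost and pass rate. Here is that arithmetic reproduced for two rows of the table (illustrative, assuming the 6-task suite):

```python
def cost_per_pass(total_cost: float, pass_rate: float, n_tasks: int = 6) -> float:
    """Cost per correct answer: total cost divided by number of passed tasks."""
    return total_cost / round(pass_rate * n_tasks)

gpt41 = cost_per_pass(0.0031, 1.00)  # ~$0.0005 (6 of 6 passed)
opus = cost_per_pass(0.0427, 0.83)   # ~$0.0085 (5 of 6 passed)
print(f"Opus is {opus / gpt41:.0f}x more expensive per pass")  # ~17x
```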
🏗️ Architecture
litmusai/
├── assertions/     # 11 assertion types + 4 composites (1,400 lines)
├── core/
│   ├── agent.py    # Universal agent adapter, 7 factory methods (750 lines)
│   ├── runner.py   # Async eval runner with concurrency + logging (300 lines)
│   ├── scorer.py   # Assertion-aware scoring engine (220 lines)
│   └── suite.py    # Test suite management + YAML (130 lines)
├── safety/         # Red-team scanner, 46 attacks (800 lines)
├── scorers/        # LLM-as-Judge engine (640 lines)
├── benchmarks/     # Cost tracking + model comparison (620 lines)
├── ci/             # CI/CD regression detection (500 lines)
├── cli/            # CLI commands (350 lines)
└── suites/         # Built-in test suites (YAML)
    ├── coding/
    ├── research/
    ├── planning/
    └── safety/
5,900+ lines of code · 404 tests · 22 source files
🚀 Getting Started
Install
pip install litmuseval
Dev setup
git clone https://github.com/kutanti/litmusai.git
cd litmusai
pip install -e ".[dev]"
pytest # 404 tests
ruff check src/ tests/ # lint
mypy src/litmusai/ # type check
CLI
litmus run --suite coding --agent my_agent:agent
litmus run --suite research --agent my_agent:agent --format markdown
🗺️ Roadmap
- Universal agent adapter (7 factory methods)
- Assertion engine (11 types + 4 composites)
- Scoring pipeline (assertions wired into runner)
- Safety scanner (46 attacks, 3 depths)
- Cost & latency benchmarking (real tokens)
- LLM-as-Judge scoring
- CI/CD regression detection
- Result logging (JSON)
- PyPI publish (pip install litmuseval)
- Multiple runs with statistical reporting
- HTML reports
- Expanded test suites (50+ per domain)
🤝 Contributing
PRs welcome! See the open issues.
git clone https://github.com/kutanti/litmusai.git
cd litmusai
pip install -e ".[dev]"
pytest && ruff check src/ tests/ && mypy src/litmusai/
All changes go through PR review with automated CI (ruff, pytest, mypy across Python 3.10-3.12).
📄 License
MIT. See LICENSE.
Built by Kunal Tanti
If this helps you ship better agents, give it a ⭐
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file litmuseval-0.2.0.tar.gz.
File metadata
- Download URL: litmuseval-0.2.0.tar.gz
- Upload date:
- Size: 80.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e8b04d5d83efe70c2ce2bbff1572ff591ef2e6a02ad8d3245c14200a0d4eba1e |
| MD5 | 6843ba944595a2af482e29ffa624346f |
| BLAKE2b-256 | ca6981ad4d42a0ea7485546da45a725ff8db42c40783f93575cec64a6d2b73b8 |
Provenance
The following attestation bundles were made for litmuseval-0.2.0.tar.gz:
Publisher: publish.yml on kutanti/litmusai

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: litmuseval-0.2.0.tar.gz
- Subject digest: e8b04d5d83efe70c2ce2bbff1572ff591ef2e6a02ad8d3245c14200a0d4eba1e
- Sigstore transparency entry: 1233208878
- Sigstore integration time:
- Permalink: kutanti/litmusai@2d1f9f8a9665f62009925b35d0df4e789166b415
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/kutanti
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2d1f9f8a9665f62009925b35d0df4e789166b415
- Trigger Event: release
File details
Details for the file litmuseval-0.2.0-py3-none-any.whl.
File metadata
- Download URL: litmuseval-0.2.0-py3-none-any.whl
- Upload date:
- Size: 95.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1c0792c91c7c7a978b989b4267efee9387de5795cd97b6a105888a8289cac1b |
| MD5 | 9ed2c9a2238ca7737295fd9209fb318a |
| BLAKE2b-256 | 8cbee835f0b537601865cb1b27d3ae74e76e3a411616378aed9dab59383ced0e |
Provenance
The following attestation bundles were made for litmuseval-0.2.0-py3-none-any.whl:
Publisher: publish.yml on kutanti/litmusai

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: litmuseval-0.2.0-py3-none-any.whl
- Subject digest: c1c0792c91c7c7a978b989b4267efee9387de5795cd97b6a105888a8289cac1b
- Sigstore transparency entry: 1233208936
- Sigstore integration time:
- Permalink: kutanti/litmusai@2d1f9f8a9665f62009925b35d0df4e789166b415
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/kutanti
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2d1f9f8a9665f62009925b35d0df4e789166b415
- Trigger Event: release