llmgrader
llmgrader is an open-source LLM evaluation framework with 50+ research-backed metrics for RAG pipelines, AI agents, safety, and conversational systems. It is pytest-native and provider-agnostic.
pip install llmgrader
Quick Start
from llmgrader import LLMTestCase, assert_test
from llmgrader.metrics import AnswerRelevancyMetric, FaithfulnessMetric
tc = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["France is a country in Western Europe. Its capital is Paris."],
)
assert_test(tc, metrics=[
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])
Features
| Category | What it does |
|---|---|
| RAG Evaluation | Answer relevancy, faithfulness, contextual precision/recall/relevancy |
| Custom Metrics | GEval (LLM-as-judge + CoT), DAG (deterministic decision-tree) |
| Safety | Hallucination, bias, toxicity, PII leakage, misuse detection |
| Agent Evaluation | Task completion, tool correctness, step efficiency, argument correctness |
| Conversational | Relevancy, completeness, role adherence, knowledge retention |
| Other | JSON correctness (with schema), summarization quality |
| Pytest Native | assert_test(), fixtures, parametrize helpers |
| Tracing | @observe decorator, span/trace trees, component-level evaluation |
| Bulk Evaluation | evaluate() with concurrent execution and aggregated reports |
| Dataset Tools | EvaluationDataset, versioned JSON storage, CSV import |
| Synthesizer | Auto-generate Goldens from documents (4-step pipeline) |
| Providers | OpenAI, Azure OpenAI, Anthropic, Ollama, custom LLM base class |
| Integrations | LangChain, LlamaIndex, CrewAI |
| CLI | llmgrader test, llmgrader set-openai, llmgrader list-metrics |
Installation
# Core
pip install llmgrader
# With LangChain integration
pip install "llmgrader[langchain]"
# With LlamaIndex
pip install "llmgrader[llamaindex]"
# Everything
pip install "llmgrader[all]"
Metrics Reference
RAG Metrics
from llmgrader.metrics import (
    AnswerRelevancyMetric,      # Is the answer relevant to the question?
    FaithfulnessMetric,         # Are claims grounded in retrieved context?
    ContextualRelevancyMetric,  # Are retrieved chunks relevant to the query?
    ContextualPrecisionMetric,  # Do relevant chunks rank higher?
    ContextualRecallMetric,     # Does context cover expected answer claims?
)

tc = LLMTestCase(
    input="What causes rain?",
    actual_output="Rain is caused by water vapor condensing in clouds.",
    expected_output="Rain is caused by condensation of water vapor.",
    retrieval_context=[
        "The water cycle involves evaporation and condensation.",
        "Rain forms when water vapor cools and condenses around particles.",
    ],
)
result = FaithfulnessMetric(threshold=0.8).measure(tc)
print(result.score, result.reason)
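To make the "do relevant chunks rank higher?" question concrete, here is a minimal sketch of one common rank-aware formulation: the mean of precision@k taken at each relevant position. It is an illustration only and not necessarily how ContextualPrecisionMetric is implemented. The function name `contextual_precision` and the boolean relevance flags are assumptions; in practice the flags would come from an LLM judge.

```python
from typing import Sequence


def contextual_precision(relevance: Sequence[bool]) -> float:
    """Rank-aware precision: mean of precision@k over the relevant ranks.

    relevance[i] is True when the chunk at rank i (0-based) is relevant.
    Returns 0.0 when no chunk is relevant.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


# The same two relevant chunks score higher when they are ranked first.
good_ranking = contextual_precision([True, True, False])
bad_ranking = contextual_precision([False, True, True])
```

With this formulation, pushing an irrelevant chunk to the top lowers the score even though the set of retrieved chunks is unchanged.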
Custom: GEval (LLM-as-Judge)
from llmgrader import GEvalMetric, LLMTestCaseParams
metric = GEvalMetric(
    name="Correctness",
    criteria="The output should be factually correct and directly answer the question.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)
result = metric.measure(tc)
Custom: DAG (Deterministic)
from llmgrader import DAGMetric
from llmgrader.metrics.custom.dag import DAGNode
dag = DAGNode(
    condition=lambda tc: len(tc.actual_output) > 0,
    score_if_false=0.0,
    next_if_true=DAGNode(
        condition=lambda tc: "error" not in tc.actual_output.lower(),
        score_if_true=1.0,
        score_if_false=0.2,
    ),
)
metric = DAGMetric(name="ResponseQuality", root=dag, threshold=0.5)
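The traversal behind a deterministic decision tree like the one above can be sketched in plain Python. This is a hypothetical stand-in, not llmgrader's DAGNode or DAGMetric: the `Node` class, the `walk` function, and the `next_if_false` field are assumptions made for illustration, and conditions here take the output string directly rather than a test case.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Hypothetical decision node; not llmgrader's DAGNode."""
    condition: Callable[[str], bool]
    score_if_true: float = 1.0
    score_if_false: float = 0.0
    next_if_true: Optional["Node"] = None
    next_if_false: Optional["Node"] = None


def walk(node: Node, output: str) -> float:
    """Follow branches until no child exists, then return that node's score."""
    while True:
        passed = node.condition(output)
        child = node.next_if_true if passed else node.next_if_false
        if child is None:
            return node.score_if_true if passed else node.score_if_false
        node = child


# Same shape as the tree above: empty output fails, "error" is penalized.
root = Node(
    condition=lambda out: len(out) > 0,
    score_if_false=0.0,
    next_if_true=Node(
        condition=lambda out: "error" not in out.lower(),
        score_if_true=1.0,
        score_if_false=0.2,
    ),
)
```

Because every branch is an ordinary predicate, the same input always yields the same score, which is the main reason to prefer this style over an LLM judge for checks that can be expressed as rules.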
Safety Metrics
from llmgrader.metrics import (
    HallucinationMetric,  # Detects factual hallucinations vs context
    BiasMetric,           # Gender, racial, political, religious bias
    ToxicityMetric,       # Hate speech, harassment, harmful content
    PIILeakageMetric,     # SSN, email, phone, credit card detection
    MisuseMetric,         # Weapons, illegal activity enablement
)
result = BiasMetric(threshold=0.7).measure(tc)
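As a rough illustration of what PII detection involves, the sketch below flags emails and US-style SSNs with regular expressions. It is a rule-based toy and an assumption on our part; PIILeakageMetric may work quite differently (for example with an LLM judge). The names `PII_PATTERNS` and `find_pii` are hypothetical.

```python
import re

# Deliberately simple patterns for illustration; real PII detection needs
# broader coverage and validation (e.g. Luhn checks for card numbers).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return the matches for each pattern that fires on the text."""
    found = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


leaks = find_pii("Contact alice@example.com, SSN 123-45-6789.")
clean = find_pii("Nothing sensitive here.")
```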
Agentic Metrics
from llmgrader import ToolCall
from llmgrader.metrics import (
    TaskCompletionMetric,       # Did the agent accomplish the goal?
    ToolCorrectnessMetric,      # Were the right tools called?
    StepEfficiencyMetric,       # Were unnecessary steps avoided?
    ArgumentCorrectnessMetric,  # Were tool arguments correct?
)

tc = LLMTestCase(
    input="Search for the latest news on AI and summarize it.",
    actual_output="Here is a summary of recent AI news...",
    tools_called=[
        ToolCall(name="web_search", input_parameters={"query": "latest AI news"}),
        ToolCall(name="summarize", input_parameters={"max_length": 200}),
    ],
    expected_tools=["web_search", "summarize"],
)
result = ToolCorrectnessMetric(threshold=0.8).measure(tc)
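One simple way to score "were the right tools called?" is the fraction of expected tools that appear among the called tools. The sketch below assumes that definition; ToolCorrectnessMetric may also weigh ordering or extra calls, which this toy ignores. The function name `tool_correctness` is hypothetical.

```python
def tool_correctness(called: list[str], expected: list[str]) -> float:
    """Fraction of expected tools that appear among the called tools."""
    if not expected:
        return 1.0  # nothing was required, so nothing can be missing
    called_set = set(called)
    matched = sum(1 for name in expected if name in called_set)
    return matched / len(expected)


full = tool_correctness(["web_search", "summarize"], ["web_search", "summarize"])
partial = tool_correctness(["web_search"], ["web_search", "summarize"])
```

Against a threshold of 0.8, the first call would pass and the second (0.5) would fail.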
Conversational Metrics
from llmgrader import ConversationalTestCase, Message
from llmgrader.metrics import (
    ConversationalRelevancyMetric,
    ConversationCompletenessMetric,
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
)

tc = ConversationalTestCase(
    messages=[
        Message(role="user", content="My name is Alice and I like Python."),
        Message(role="assistant", content="Nice to meet you, Alice! Python is great."),
        Message(role="user", content="What's my name again?"),
        Message(role="assistant", content="Your name is Alice."),
    ],
    chatbot_role="A helpful assistant that remembers user preferences.",
)
result = KnowledgeRetentionMetric(threshold=0.7).measure(tc)
Bulk Evaluation
from llmgrader import evaluate, LLMTestCase
from llmgrader.metrics import AnswerRelevancyMetric, JSONCorrectnessMetric
test_cases = [
    LLMTestCase(input="What is 2+2?", actual_output="4"),
    LLMTestCase(input="Capital of Japan?", actual_output="Tokyo"),
    LLMTestCase(input="Return JSON", actual_output='{"status": "ok"}'),
]
result = evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric(), JSONCorrectnessMetric()],
    max_concurrent=4,
    verbose=True,
)
print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Overall score: {result.overall_score:.3f}")
result.print_summary()
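The pattern behind concurrent bulk evaluation can be sketched with the standard library: score each case in a thread pool, then aggregate a pass rate and a mean score. This is not llmgrader's evaluate(); the `run_bulk` function, its return shape, and the toy scorer are assumptions chosen to mirror the names used above.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable


def run_bulk(cases: list[str], score_fn: Callable[[str], float],
             threshold: float = 0.5, max_concurrent: int = 4) -> dict:
    """Score every case in a thread pool and aggregate the results."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        scores = list(pool.map(score_fn, cases))  # map preserves input order
    passed = [s >= threshold for s in scores]
    return {
        "scores": scores,
        "pass_rate": sum(passed) / len(cases) if cases else 0.0,
        "overall_score": mean(scores) if scores else 0.0,
    }


# Toy scorer (an assumption for the demo): non-empty outputs score 1.0.
report = run_bulk(["4", "Tokyo", ""], lambda out: 1.0 if out else 0.0)
```

Threads suit this workload because metric calls that hit an LLM API spend most of their time waiting on the network.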
Pytest Integration
# test_my_llm.py
import pytest
from llmgrader import LLMTestCase, assert_test
from llmgrader.metrics import AnswerRelevancyMetric, FaithfulnessMetric
# my_rag_pipeline and get_context are placeholders for your own application code.
def test_rag_answer():
    tc = LLMTestCase(
        input="What causes lightning?",
        actual_output=my_rag_pipeline("What causes lightning?"),
        retrieval_context=get_context("lightning"),
    )
    assert_test(tc, metrics=[
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ])
# Run with: llmgrader test test_my_llm.py
# Or: pytest test_my_llm.py
Tracing & Component-Level Evaluation
from llmgrader import observe, Tracer, set_tracer, clear_tracer
tracer = Tracer()
set_tracer(tracer)
trace = tracer.start_trace()
@observe(span_type="retriever")
def retrieve(query: str) -> list:
    return vector_db.search(query)

@observe(span_type="llm")
def generate(context: list, query: str) -> str:
    return llm.generate(f"Context: {context}\nQuestion: {query}")

def rag_pipeline(query: str) -> str:
    context = retrieve(query)
    return generate(context, query)
answer = rag_pipeline("What is quantum computing?")
tracer.end_trace()
clear_tracer()
tracer.print_last_trace() # Shows span tree with latencies
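For readers curious how a span-recording decorator can work, here is a minimal sketch that logs each call's name, type, and latency. It is not llmgrader's @observe: the `observe_sketch` decorator and the module-level `SPANS` list are hypothetical, and a real tracer would also track parent/child relationships to build the span tree.

```python
import functools
import time

SPANS: list[dict] = []  # module-level span log, for this sketch only


def observe_sketch(span_type: str):
    """Decorator that records name, type, and latency of each call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Runs even if fn raises, so failed calls are still recorded.
                SPANS.append({
                    "name": fn.__name__,
                    "type": span_type,
                    "seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@observe_sketch(span_type="retriever")
def retrieve(query: str) -> list[str]:
    return [f"doc about {query}"]


docs = retrieve("quantum computing")
```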
Dataset Management
from llmgrader import EvaluationDataset, Golden
# Build a dataset
ds = EvaluationDataset()
ds.add_goldens([
    Golden(input="What is AI?", expected_output="Artificial intelligence."),
    Golden(input="Capital of Germany?", expected_output="Berlin"),
])
ds.save("my_dataset.json")
# Load and use
ds = EvaluationDataset.load("my_dataset.json")
test_cases = ds.to_test_cases(generate_fn=my_llm.generate)
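A versioned JSON round trip of the kind described here can be sketched with dataclasses and the json module. The on-disk layout below (a `version` key next to a `goldens` list) is an assumption and may not match what EvaluationDataset writes; `GoldenSketch`, `save_goldens`, and `load_goldens` are hypothetical names.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class GoldenSketch:
    """Hypothetical record type; llmgrader's Golden may carry more fields."""
    input: str
    expected_output: str


def save_goldens(goldens: list[GoldenSketch], path: str, version: str = "1") -> None:
    """Write the goldens to a JSON file tagged with a format version."""
    payload = {"version": version, "goldens": [asdict(g) for g in goldens]}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)


def load_goldens(path: str) -> list[GoldenSketch]:
    """Read the goldens back into dataclass instances."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return [GoldenSketch(**item) for item in payload["goldens"]]


# Round trip through a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "my_dataset.json")
save_goldens([GoldenSketch("Capital of Germany?", "Berlin")], path)
restored = load_goldens(path)
```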
Synthetic Dataset Generation
from llmgrader import Synthesizer
synth = Synthesizer()
docs = [
    "The Python programming language was created by Guido van Rossum...",
    "Machine learning is a branch of artificial intelligence...",
]
goldens = synth.generate_goldens_from_docs(
    documents=docs,
    max_goldens_per_doc=5,
    filter_questions=True,
    evolve_questions=True,
    generate_expected_outputs=True,
)
print(f"Generated {len(goldens)} golden test cases")
for g in goldens[:2]:
    print(f"Q: {g.input}")
    print(f"A: {g.expected_output}\n")
LLM Providers
from llmgrader.providers import OpenAIProvider, AnthropicProvider, OllamaProvider
# OpenAI
provider = OpenAIProvider(model="gpt-4o", api_key="sk-...")
# Anthropic Claude
provider = AnthropicProvider(model="claude-sonnet-4-6")
# Ollama (local)
provider = OllamaProvider(model="llama3")
# Custom provider
from llmgrader.metrics import AnswerRelevancyMetric
from llmgrader.providers import LLMProvider

class MyProvider(LLMProvider):
    def generate(self, prompt: str, **kwargs) -> str:
        return my_llm_api.call(prompt)  # my_llm_api is a placeholder for your own client

# Use in any metric
metric = AnswerRelevancyMetric(model=MyProvider())
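The custom-provider pattern is an ordinary abstract base class with a single required method. The sketch below shows that shape without importing llmgrader; `ProviderSketch` and `EchoProvider` are hypothetical names, and the real LLMProvider base class may require more than `generate`. A deterministic stub like this can be handy for exercising evaluation plumbing offline.

```python
from abc import ABC, abstractmethod


class ProviderSketch(ABC):
    """Hypothetical base class: subclasses must implement generate()."""

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Return the model's completion for the prompt."""


class EchoProvider(ProviderSketch):
    """Deterministic stub that echoes the prompt back; makes no network calls."""

    def generate(self, prompt: str, **kwargs) -> str:
        return f"echo: {prompt}"


reply = EchoProvider().generate("What is the capital of France?")
```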
LangChain Integration
from langchain_openai import ChatOpenAI
from llmgrader.integrations.langchain import LangChainCallbackHandler, evaluate_chain
from llmgrader.metrics import AnswerRelevancyMetric
llm = ChatOpenAI(model="gpt-4o")
chain = llm # or any LangChain runnable
result = evaluate_chain(
    chain=chain,
    inputs=["What is the capital of France?", "Who invented Python?"],
    metrics=[AnswerRelevancyMetric(threshold=0.7)],
)
CLI
# Run evaluation tests
llmgrader test tests/test_llm.py
llmgrader test tests/ -n 4 # 4 parallel workers
# Configure providers
llmgrader set-openai --key sk-... --model gpt-4o
llmgrader set-anthropic --key sk-... --model claude-sonnet-4-6
llmgrader set-ollama --model llama3
# List all metrics
llmgrader list-metrics
# Version
llmgrader version
License
Apache 2.0 — see LICENSE.
Author: Mahesh Makvana