llmeval

Open-source LLM evaluation framework — 50+ research-backed metrics for RAG pipelines, AI agents, safety, and conversational systems. Pytest-native. Provider-agnostic.

pip install llmeval

Quick Start

from llmeval import LLMTestCase, assert_test
from llmeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

tc = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["France is a country in Western Europe. Its capital is Paris."],
)

assert_test(tc, metrics=[
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])

Features

| Category | What it does |
| --- | --- |
| RAG Evaluation | Answer relevancy, faithfulness, contextual precision/recall/relevancy |
| Custom Metrics | GEval (LLM-as-judge + CoT), DAG (deterministic decision-tree) |
| Safety | Hallucination, bias, toxicity, PII leakage, misuse detection |
| Agent Evaluation | Task completion, tool correctness, step efficiency, argument correctness |
| Conversational | Relevancy, completeness, role adherence, knowledge retention |
| Other | JSON correctness (with schema), summarization quality |
| Pytest Native | assert_test(), fixtures, parametrize helpers |
| Tracing | @observe decorator, span/trace trees, component-level evaluation |
| Bulk Evaluation | evaluate() with concurrent execution and aggregated reports |
| Dataset Tools | EvaluationDataset, versioned JSON storage, CSV import |
| Synthesizer | Auto-generate Goldens from documents (4-step pipeline) |
| Providers | OpenAI, Azure OpenAI, Anthropic, Ollama, custom LLM base class |
| Integrations | LangChain, LlamaIndex, CrewAI |
| CLI | llmeval test, llmeval set-openai, llmeval list-metrics |

Installation

# Core
pip install llmeval

# With LangChain integration
pip install "llmeval[langchain]"

# With LlamaIndex
pip install "llmeval[llamaindex]"

# Everything
pip install "llmeval[all]"

Metrics Reference

RAG Metrics

from llmeval.metrics import (
    AnswerRelevancyMetric,       # Is the answer relevant to the question?
    FaithfulnessMetric,          # Are claims grounded in retrieved context?
    ContextualRelevancyMetric,   # Are retrieved chunks relevant to the query?
    ContextualPrecisionMetric,   # Do relevant chunks rank higher?
    ContextualRecallMetric,      # Does context cover expected answer claims?
)

tc = LLMTestCase(
    input="What causes rain?",
    actual_output="Rain is caused by water vapor condensing in clouds.",
    expected_output="Rain is caused by condensation of water vapor.",
    retrieval_context=[
        "The water cycle involves evaporation and condensation.",
        "Rain forms when water vapor cools and condenses around particles.",
    ],
)

result = FaithfulnessMetric(threshold=0.8).measure(tc)
print(result.score, result.reason)

Custom: GEval (LLM-as-Judge)

from llmeval import GEvalMetric, LLMTestCaseParams

metric = GEvalMetric(
    name="Correctness",
    criteria="The output should be factually correct and directly answer the question.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

result = metric.measure(tc)  # tc from the RAG example above

Custom: DAG (Deterministic)

from llmeval import DAGMetric
from llmeval.metrics.custom.dag import DAGNode

dag = DAGNode(
    condition=lambda tc: len(tc.actual_output) > 0,
    score_if_false=0.0,
    next_if_true=DAGNode(
        condition=lambda tc: "error" not in tc.actual_output.lower(),
        score_if_true=1.0,
        score_if_false=0.2,
    )
)
metric = DAGMetric(name="ResponseQuality", root=dag, threshold=0.5)

Safety Metrics

from llmeval.metrics import (
    HallucinationMetric,   # Detects factual hallucinations vs context
    BiasMetric,            # Gender, racial, political, religious bias
    ToxicityMetric,        # Hate speech, harassment, harmful content
    PIILeakageMetric,      # SSN, email, phone, credit card detection
    MisuseMetric,          # Weapons, illegal activity enablement
)

result = BiasMetric(threshold=0.7).measure(tc)  # tc from the RAG example above

Agentic Metrics

from llmeval import ToolCall
from llmeval.metrics import (
    TaskCompletionMetric,      # Did the agent accomplish the goal?
    ToolCorrectnessMetric,     # Were the right tools called?
    StepEfficiencyMetric,      # Were unnecessary steps avoided?
    ArgumentCorrectnessMetric, # Were tool arguments correct?
)

tc = LLMTestCase(
    input="Search for the latest news on AI and summarize it.",
    actual_output="Here is a summary of recent AI news...",
    tools_called=[
        ToolCall(name="web_search", input_parameters={"query": "latest AI news"}),
        ToolCall(name="summarize", input_parameters={"max_length": 200}),
    ],
    expected_tools=["web_search", "summarize"],
)

result = ToolCorrectnessMetric(threshold=0.8).measure(tc)

Conversational Metrics

from llmeval import ConversationalTestCase, Message
from llmeval.metrics import (
    ConversationalRelevancyMetric,
    ConversationCompletenessMetric,
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
)

tc = ConversationalTestCase(
    messages=[
        Message(role="user", content="My name is Alice and I like Python."),
        Message(role="assistant", content="Nice to meet you, Alice! Python is great."),
        Message(role="user", content="What's my name again?"),
        Message(role="assistant", content="Your name is Alice."),
    ],
    chatbot_role="A helpful assistant that remembers user preferences.",
)

result = KnowledgeRetentionMetric(threshold=0.7).measure(tc)

Bulk Evaluation

from llmeval import evaluate, LLMTestCase
from llmeval.metrics import AnswerRelevancyMetric, JSONCorrectnessMetric

test_cases = [
    LLMTestCase(input="What is 2+2?", actual_output="4"),
    LLMTestCase(input="Capital of Japan?", actual_output="Tokyo"),
    LLMTestCase(input="Return JSON", actual_output='{"status": "ok"}'),
]

result = evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric(), JSONCorrectnessMetric()],
    max_concurrent=4,
    verbose=True,
)

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Overall score: {result.overall_score:.3f}")
result.print_summary()
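
The features table notes JSON correctness "with schema", but the schema argument is not demonstrated in this README. A hypothetical sketch, assuming the metric accepts a pydantic model via an expected_schema parameter (name unverified):

from pydantic import BaseModel
from llmeval import LLMTestCase
from llmeval.metrics import JSONCorrectnessMetric

class StatusResponse(BaseModel):
    status: str

# expected_schema is an assumed parameter name, not confirmed by this README
metric = JSONCorrectnessMetric(expected_schema=StatusResponse, threshold=1.0)
tc = LLMTestCase(input="Return JSON", actual_output='{"status": "ok"}')
result = metric.measure(tc)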

Pytest Integration

# test_my_llm.py
import pytest
from llmeval import LLMTestCase, assert_test
from llmeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_rag_answer():
    # my_rag_pipeline and get_context stand in for your own application code
    tc = LLMTestCase(
        input="What causes lightning?",
        actual_output=my_rag_pipeline("What causes lightning?"),
        retrieval_context=get_context("lightning"),
    )
    assert_test(tc, metrics=[
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ])


# Run with: llmeval test test_my_llm.py
# Or:       pytest test_my_llm.py
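
The features table also lists parametrize helpers. The dedicated helpers are not shown in this README, but a minimal sketch with standard pytest.mark.parametrize and the documented assert_test works the same way:

import pytest
from llmeval import LLMTestCase, assert_test
from llmeval.metrics import AnswerRelevancyMetric

@pytest.mark.parametrize(
    "question,answer",
    [
        ("What is 2+2?", "4"),
        ("Capital of Japan?", "Tokyo"),
    ],
)
def test_batch(question, answer):
    # One test case per parametrized pair, each scored independently
    tc = LLMTestCase(input=question, actual_output=answer)
    assert_test(tc, metrics=[AnswerRelevancyMetric(threshold=0.7)])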

Tracing & Component-Level Evaluation

from llmeval import observe, Tracer, set_tracer, clear_tracer

tracer = Tracer()
set_tracer(tracer)
trace = tracer.start_trace()

@observe(span_type="retriever")
def retrieve(query: str) -> list:
    return vector_db.search(query)

@observe(span_type="llm")
def generate(context: list, query: str) -> str:
    return llm.generate(f"Context: {context}\nQuestion: {query}")

def rag_pipeline(query: str) -> str:
    context = retrieve(query)
    return generate(context, query)

answer = rag_pipeline("What is quantum computing?")
tracer.end_trace()
clear_tracer()

tracer.print_last_trace()  # Shows span tree with latencies
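
The heading also mentions component-level evaluation. Using only the APIs documented above, one way to approximate it is to score an individual component from the pipeline's intermediate values; a minimal sketch, reusing retrieve and generate from the example and the ContextualRelevancyMetric documented earlier:

from llmeval import LLMTestCase
from llmeval.metrics import ContextualRelevancyMetric

query = "What is quantum computing?"
context = retrieve(query)          # intermediate value from the pipeline
answer = generate(context, query)

# Judge the retriever in isolation: are the retrieved chunks relevant?
retriever_tc = LLMTestCase(
    input=query,
    actual_output=answer,
    retrieval_context=context,
)
result = ContextualRelevancyMetric(threshold=0.7).measure(retriever_tc)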

Dataset Management

from llmeval import EvaluationDataset, Golden

# Build a dataset
ds = EvaluationDataset()
ds.add_goldens([
    Golden(input="What is AI?", expected_output="Artificial intelligence."),
    Golden(input="Capital of Germany?", expected_output="Berlin"),
])
ds.save("my_dataset.json")

# Load and use
ds = EvaluationDataset.load("my_dataset.json")
test_cases = ds.to_test_cases(generate_fn=my_llm.generate)
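
The features table also lists CSV import; the method is not shown in this README, so the following is a hypothetical sketch assuming a from_csv constructor with column-mapping arguments (names unverified):

# Hypothetical API: method and argument names are assumptions
ds = EvaluationDataset.from_csv(
    "my_dataset.csv",
    input_col="question",
    expected_output_col="answer",
)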

Synthetic Dataset Generation

from llmeval import Synthesizer

synth = Synthesizer()

docs = [
    "The Python programming language was created by Guido van Rossum...",
    "Machine learning is a branch of artificial intelligence...",
]

goldens = synth.generate_goldens_from_docs(
    documents=docs,
    max_goldens_per_doc=5,
    filter_questions=True,
    evolve_questions=True,
    generate_expected_outputs=True,
)

print(f"Generated {len(goldens)} golden test cases")
for g in goldens[:2]:
    print(f"Q: {g.input}")
    print(f"A: {g.expected_output}\n")

LLM Providers

from llmeval.providers import OpenAIProvider, AnthropicProvider, OllamaProvider

# OpenAI
provider = OpenAIProvider(model="gpt-4o", api_key="sk-...")

# Anthropic Claude
provider = AnthropicProvider(model="claude-sonnet-4-6")

# Ollama (local)
provider = OllamaProvider(model="llama3")

# Custom provider
from llmeval.providers import LLMProvider

class MyProvider(LLMProvider):
    def generate(self, prompt: str, **kwargs) -> str:
        return my_llm_api.call(prompt)

# Use in any metric
from llmeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(model=MyProvider())

LangChain Integration

from langchain_openai import ChatOpenAI
from llmeval.integrations.langchain import LangChainCallbackHandler, evaluate_chain
from llmeval.metrics import AnswerRelevancyMetric

llm = ChatOpenAI(model="gpt-4o")
chain = llm  # or any LangChain runnable

result = evaluate_chain(
    chain=chain,
    inputs=["What is the capital of France?", "Who invented Python?"],
    metrics=[AnswerRelevancyMetric(threshold=0.7)],
)
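
LangChainCallbackHandler is imported above but not demonstrated; a minimal sketch, assuming it is a standard LangChain callback handler that records spans when wired in through the runnable config (constructor arguments unverified):

handler = LangChainCallbackHandler()  # assumed no-arg constructor
response = chain.invoke(
    "What is the capital of France?",
    config={"callbacks": [handler]},  # standard LangChain callback wiring
)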

CLI

# Run evaluation tests
llmeval test tests/test_llm.py
llmeval test tests/ -n 4          # 4 parallel workers

# Configure providers
llmeval set-openai --key sk-... --model gpt-4o
llmeval set-anthropic --key sk-... --model claude-sonnet-4-6
llmeval set-ollama --model llama3

# List all metrics
llmeval list-metrics

# Version
llmeval version

License

Apache 2.0 — see LICENSE.

Author: Mahesh Makvana
