A comprehensive, reference-free LLM evaluation library for RAG pipelines, chatbots, and generative AI systems.

llmevalkit

A Python library for evaluating LLM outputs. 15 built-in metrics. Works with or without an API key.

  • 7 math-based metrics: free, instant, and fully offline
  • 8 LLM-as-judge metrics: use any supported LLM provider as the grader
  • Supports OpenAI, Azure, Anthropic, Groq, Ollama, or no provider at all

Install

pip install llmevalkit

Quick start

Math evaluation (free, no API key needed)

from llmevalkit import Evaluator

evaluator = Evaluator(provider="none", preset="math")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a high-level programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.overall_score)
print(result.summary())

LLM-as-judge evaluation (needs API key)

from llmevalkit import Evaluator

evaluator = Evaluator(provider="groq", model="llama-3.1-70b-versatile", preset="rag")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.summary())

Hybrid (math + LLM together)

from llmevalkit import (
    Evaluator, BLEUScore, ROUGEScore, TokenOverlap,
    Faithfulness, Hallucination, GEval,
)

evaluator = Evaluator(
    provider="groq",
    model="llama-3.1-70b-versatile",
    metrics=[
        BLEUScore(), ROUGEScore(), TokenOverlap(),
        Faithfulness(), Hallucination(),
        GEval(criteria="Is this helpful for a beginner?"),
    ],
)
result = evaluator.evaluate(question="...", answer="...", context="...")

All 15 metrics

See metrics/README.md for detailed documentation on each metric, including what it measures, how it works, the formula, and a code example.

Math metrics (no API needed)

#  Metric              What it measures
1  BLEUScore           N-gram precision between answer and reference
2  ROUGEScore          Recall-oriented overlap (ROUGE-1, ROUGE-2, ROUGE-L)
3  TokenOverlap        Word-level F1 with stopword filtering
4  SemanticSimilarity  Cosine similarity of text embeddings
5  KeywordCoverage     Percentage of key terms covered
6  AnswerLength        Whether answer meets min/max word count
7  ReadabilityScore    Flesch-Kincaid readability grade level
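
To make these concrete, here is a self-contained sketch of the word-level F1 that TokenOverlap is described as computing. This is an illustration in plain Python, not llmevalkit's implementation; the tokenizer regex and stopword list are stand-ins.

import re

STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}  # stand-in list

def token_f1(answer: str, reference: str) -> float:
    # Lowercase, keep alphanumeric tokens, drop stopwords.
    def tokens(text: str) -> set:
        return {w for w in re.findall(r"[a-z0-9]+", text.lower())
                if w not in STOPWORDS}
    pred, ref = tokens(answer), tokens(reference)
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # matched tokens / answer tokens
    recall = overlap / len(ref)      # matched tokens / reference tokens
    return 2 * precision * recall / (precision + recall)

print(token_f1(
    "Python is a high-level programming language.",
    "Python is a high-level, interpreted programming language.",
))  # ~0.91: every answer token matches; the reference adds "interpreted"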

LLM-as-judge metrics (needs API)

#   Metric            What it measures
8   Faithfulness      Is the answer grounded in the context?
9   Hallucination     Are there fabricated claims? (works without context)
10  AnswerRelevance   Does the answer address the question?
11  ContextRelevance  Is the retrieved context useful?
12  Coherence         Is the answer logically structured?
13  Completeness      Does the answer cover all aspects?
14  Toxicity          Is the content safe and appropriate?
15  GEval             Custom criteria you define
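
GEval is the escape hatch when none of the built-in criteria fit: you describe the grading rubric in plain text. A minimal sketch, reusing only the constructor and calls already shown above (the criteria string, model choice, and inputs here are placeholders):

from llmevalkit import Evaluator, GEval

evaluator = Evaluator(
    provider="openai",
    model="gpt-4o-mini",
    metrics=[GEval(criteria="Does the answer avoid unexplained jargon?")],
)
result = evaluator.evaluate(
    question="What is a vector database?",
    answer="A vector database stores embeddings and retrieves them by similarity.",
    context="Vector databases index embeddings for nearest-neighbor search.",
)
print(result.summary())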

Providers

Evaluator(provider="openai", model="gpt-4o-mini")
Evaluator(provider="groq", model="llama-3.1-70b-versatile")
Evaluator(provider="anthropic", model="claude-sonnet-4-20250514")
Evaluator(provider="ollama", model="llama3.1")
Evaluator(provider="none", preset="math")   # no API needed

Presets

Evaluator(preset="rag")           # Faithfulness, AnswerRelevance, ContextRelevance, Hallucination
Evaluator(preset="chatbot")       # AnswerRelevance, Coherence, Toxicity, Hallucination
Evaluator(preset="math")          # All 7 math metrics
Evaluator(preset="hybrid_rag")    # Math + LLM combined

Batch evaluation

batch = evaluator.evaluate_batch([
    {"question": "What is AI?", "answer": "AI is artificial intelligence.", "context": "..."},
    {"question": "What is ML?", "answer": "ML uses data to learn.", "context": "..."},
])
print(batch.pass_rate)
df = batch.to_dataframe()  # needs pandas
df.to_csv("results.csv")
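
Since to_dataframe() returns an ordinary pandas DataFrame, the usual pandas tooling applies to the results. The exact column layout isn't documented here, so the sketch below assumes one numeric column per metric score:

print(df.describe())               # distribution of each numeric column
print(df.mean(numeric_only=True))  # average score per metric (assumed layout)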

License

MIT

Author

Venkatkumar Rajan - https://linkedin.com/in/venkatkumarvk
