
LLM evaluation and compliance testing library. Quality metrics + PII detection + HIPAA/GDPR/DPDP/EU AI Act compliance. Works with or without API.


llmevalkit

A Python library for evaluating LLM outputs, with 15 built-in metrics. Works with or without an API key.

  • 7 math-based metrics: free, instant, and fully offline
  • 8 LLM-as-judge metrics: use any supported LLM provider as the judge
  • Providers: OpenAI, Azure OpenAI, Anthropic, Groq, HuggingFace, Ollama, any custom OpenAI-compatible endpoint, or no provider at all


Install

pip install llmevalkit

Quick start

Math evaluation (free, no API key needed)

from llmevalkit import Evaluator

evaluator = Evaluator(provider="none", preset="math")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a high-level programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.overall_score)
print(result.summary())

LLM-as-judge evaluation (needs API key)

from llmevalkit import Evaluator

evaluator = Evaluator(provider="groq", model="llama-3.3-70b-versatile", preset="rag")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.summary())

Hybrid (math + LLM together)

from llmevalkit import (
    Evaluator, BLEUScore, ROUGEScore, TokenOverlap,
    Faithfulness, Hallucination, GEval,
)

evaluator = Evaluator(
    provider="groq",
    model="llama-3.3-70b-versatile",
    metrics=[
        BLEUScore(), ROUGEScore(), TokenOverlap(),
        Faithfulness(), Hallucination(),
        GEval(criteria="Is this helpful for a beginner?"),
    ],
)
result = evaluator.evaluate(question="...", answer="...", context="...")

All 15 metrics

See metrics/README.md for detailed documentation on each metric including what it measures, how it works, the formula, and a code example.

Math metrics (no API needed)

| # | Metric | What it measures |
|---|--------|------------------|
| 1 | BLEUScore | N-gram precision between answer and reference |
| 2 | ROUGEScore | Recall-oriented overlap (ROUGE-1, ROUGE-2, ROUGE-L) |
| 3 | TokenOverlap | Word-level F1 with stopword filtering |
| 4 | SemanticSimilarity | Cosine similarity of text embeddings |
| 5 | KeywordCoverage | Percentage of key terms covered |
| 6 | AnswerLength | Whether the answer meets min/max word count |
| 7 | ReadabilityScore | Flesch-Kincaid readability grade level |
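To make the table concrete, here is a self-contained sketch of the word-level F1 that a metric like TokenOverlap computes. This illustrates the general computation, not llmevalkit's actual implementation; the stopword list and tokenizer are illustrative assumptions.

```python
import re

# Illustrative stopword list; a real metric would use a larger one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and"}


def token_f1(answer: str, reference: str) -> float:
    """Word-level F1 over content words shared by answer and reference."""
    ans = {w for w in re.findall(r"[a-z0-9-]+", answer.lower()) if w not in STOPWORDS}
    ref = {w for w in re.findall(r"[a-z0-9-]+", reference.lower()) if w not in STOPWORDS}
    if not ans or not ref:
        return 0.0
    overlap = len(ans & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(ans)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1(
    "Python is a high-level programming language.",
    "Python is a high-level, interpreted programming language.",
))  # 4 shared content words: precision 1.0, recall 0.8, F1 ≈ 0.889
```

Because the score is pure set arithmetic over tokens, it runs offline and costs nothing, which is why the math metrics need no provider.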

LLM-as-judge metrics (needs API)

| # | Metric | What it measures |
|---|--------|------------------|
| 8 | Faithfulness | Is the answer grounded in the context? |
| 9 | Hallucination | Are there fabricated claims? (works without context) |
| 10 | AnswerRelevance | Does the answer address the question? |
| 11 | ContextRelevance | Is the retrieved context useful? |
| 12 | Coherence | Is the answer logically structured? |
| 13 | Completeness | Does the answer cover all aspects? |
| 14 | Toxicity | Is the content safe and appropriate? |
| 15 | GEval | Custom criteria you define |
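All of these metrics follow the same LLM-as-judge pattern: build a grading prompt, send it to the configured provider, and parse a numeric score out of the reply. The sketch below shows that general pattern, not llmevalkit's internal prompts; the prompt wording and the score range are assumptions for illustration.

```python
import re


def build_faithfulness_prompt(question: str, answer: str, context: str) -> str:
    """A hypothetical judge prompt; real libraries tune this wording carefully."""
    return (
        "Rate from 0.0 to 1.0 how well the answer is grounded in the context.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        "Reply with only the number."
    )


def parse_score(reply: str) -> float:
    """Pull the first number out of a judge reply and clamp it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no score in judge reply: {reply!r}")
    return max(0.0, min(1.0, float(match.group())))


# Judges often reply with chatter around the number; parsing must tolerate it.
print(parse_score("Score: 0.9 (well grounded)"))  # 0.9
```

This is also why these eight metrics need an API key: every evaluation is one or more chat-completion calls to the judge model.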

Supported providers

| # | Provider | Example |
|---|----------|---------|
| 1 | OpenAI | `Evaluator(provider="openai", model="gpt-4o-mini")` |
| 2 | Azure OpenAI | `Evaluator(provider="azure", model="gpt-4o-mini", api_key="...", base_url="...")` |
| 3 | Groq | `Evaluator(provider="groq", model="llama-3.3-70b-versatile")` |
| 4 | Anthropic | `Evaluator(provider="anthropic", model="claude-sonnet-4-20250514")` |
| 5 | HuggingFace | `Evaluator(provider="huggingface", model="meta-llama/Llama-3.1-8B-Instruct")` |
| 6 | Ollama | `Evaluator(provider="ollama", model="llama3.1")` |
| 7 | Custom | `Evaluator(provider="custom", model="my-model", base_url="http://localhost:8000/v1")` |
| 8 | None (math only) | `Evaluator(provider="none", preset="math")` |

Presets

| # | Preset | Metrics included |
|---|--------|------------------|
| 1 | rag | Faithfulness, AnswerRelevance, ContextRelevance, Hallucination |
| 2 | chatbot | AnswerRelevance, Coherence, Toxicity, Hallucination |
| 3 | safety | Toxicity, Hallucination |
| 4 | summarization | Faithfulness, Completeness, Coherence |
| 5 | math | All 7 math metrics |
| 6 | math_minimal | TokenOverlap, AnswerLength |
| 7 | hybrid_rag | TokenOverlap, BLEUScore, KeywordCoverage, Faithfulness, Hallucination |

Batch evaluation

from llmevalkit import Evaluator

evaluator = Evaluator(provider="none", preset="math")
batch = evaluator.evaluate_batch([
    {"question": "What is AI?", "answer": "AI is artificial intelligence.", "context": "..."},
    {"question": "What is ML?", "answer": "ML uses data to learn.", "context": "..."},
])
print(batch.pass_rate)
df = batch.to_dataframe()  # needs pandas
df.to_csv("results.csv")
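A pass rate is presumably the fraction of cases whose scores clear some threshold. The sketch below shows that idea in isolation; the 0.5 cutoff is an illustrative assumption, not a documented llmevalkit default.

```python
# Sketch of a batch pass rate: the fraction of per-case scores at or above
# a cutoff. The 0.5 threshold here is an assumption for illustration.
def pass_rate(scores: list[float], threshold: float = 0.5) -> float:
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


print(pass_rate([0.9, 0.4, 0.7]))  # 2 of 3 cases pass -> 0.666...
```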

CLI

llmevalkit evaluate --question "What is AI?" --answer "AI is artificial intelligence." --preset math
llmevalkit evaluate --file test_cases.json --output results.json
llmevalkit info
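The file passed to `--file` plausibly mirrors the dict keys that `evaluate_batch` accepts. A hypothetical `test_cases.json` might look like the following; the exact schema is an assumption, so check the project docs before relying on it.

```json
[
  {
    "question": "What is AI?",
    "answer": "AI is artificial intelligence.",
    "context": "Artificial intelligence (AI) is intelligence demonstrated by machines."
  },
  {
    "question": "What is ML?",
    "answer": "ML uses data to learn.",
    "context": "Machine learning (ML) builds models that learn from data."
  }
]
```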

Project structure

llmevalkit/
    __init__.py
    evaluator.py
    models.py
    llm_client.py
    prompts.py
    cli.py
    metrics/
        README.md
        base.py
        faithfulness.py
        hallucination.py
        answer_relevance.py
        context_relevance.py
        coherence.py
        completeness.py
        toxicity.py
        geval.py
        math_metrics.py
    utils/
        token_counter.py
tests/
    test_llmeval.py
examples/
    all_15_metrics.py

License

MIT

Author

Venkatkumar (VK) - https://linkedin.com/in/venkatkumarvk

