
A comprehensive, reference-free LLM evaluation library for RAG pipelines, chatbots, and generative AI systems.


llmevalkit

A Python library for evaluating LLM outputs. 15 built-in metrics. Works with or without an API key.

  • 7 math-based metrics: free, instant, runs offline
  • 8 LLM-as-judge metrics: use any supported LLM provider as the judge
  • Supports: OpenAI, Azure, Anthropic, Groq, Ollama, or no provider at all


Install

pip install llmevalkit

Quick start

Math evaluation (free, no API key needed)

from llmevalkit import Evaluator

evaluator = Evaluator(provider="none", preset="math")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a high-level programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.overall_score)
print(result.summary())

LLM-as-judge evaluation (needs API key)

from llmevalkit import Evaluator

evaluator = Evaluator(provider="groq", model="llama-3.1-70b-versatile", preset="rag")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.summary())

Hybrid (math + LLM together)

from llmevalkit import (
    Evaluator, BLEUScore, ROUGEScore, TokenOverlap,
    Faithfulness, Hallucination, GEval,
)

evaluator = Evaluator(
    provider="groq",
    model="llama-3.1-70b-versatile",
    metrics=[
        BLEUScore(), ROUGEScore(), TokenOverlap(),
        Faithfulness(), Hallucination(),
        GEval(criteria="Is this helpful for a beginner?"),
    ],
)
result = evaluator.evaluate(question="...", answer="...", context="...")

All 15 metrics

See metrics/README.md for detailed documentation on each metric, including what it measures, how it works, the formula, and a code example.

Math metrics (no API needed)

1. BLEUScore: N-gram precision between answer and reference
2. ROUGEScore: Recall-oriented overlap (ROUGE-1, 2, L)
3. TokenOverlap: Word-level F1 with stopword filtering
4. SemanticSimilarity: Cosine similarity of text embeddings
5. KeywordCoverage: Percentage of key terms covered
6. AnswerLength: Whether the answer meets min/max word count
7. ReadabilityScore: Flesch-Kincaid readability grade level
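To make the math metrics concrete, here is a minimal sketch of what a word-level F1 with stopword filtering (metric 3, TokenOverlap) computes. The tokenizer, stopword list, and exact weighting llmevalkit uses may differ; this only illustrates the idea.

```python
# Sketch of a word-level F1 with stopword filtering, in the spirit
# of TokenOverlap. llmevalkit's actual implementation may differ.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def tokenize(text: str) -> set[str]:
    """Lowercase, split on non-alphanumeric characters, drop stopwords."""
    words = "".join(c if c.isalnum() else " " for c in text.lower()).split()
    return {w for w in words if w not in STOPWORDS}

def token_overlap_f1(answer: str, reference: str) -> float:
    ans, ref = tokenize(answer), tokenize(reference)
    if not ans or not ref:
        return 0.0
    overlap = len(ans & ref)  # words shared by answer and reference
    if overlap == 0:
        return 0.0
    precision = overlap / len(ans)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_overlap_f1(
    "Python is a high-level programming language.",
    "Python is a high-level, interpreted programming language.",
)
```

With the quick-start example above, all five content words of the answer appear in the reference (precision 1.0) while the reference also contains "interpreted" (recall 5/6), giving an F1 of about 0.91.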

LLM-as-judge metrics (needs API)

8. Faithfulness: Is the answer grounded in the context?
9. Hallucination: Are there fabricated claims? (works without context)
10. AnswerRelevance: Does the answer address the question?
11. ContextRelevance: Is the retrieved context useful?
12. Coherence: Is the answer logically structured?
13. Completeness: Does the answer cover all aspects?
14. Toxicity: Is the content safe and appropriate?
15. GEval: Custom criteria you define

Supported providers

1. OpenAI: Evaluator(provider="openai", model="gpt-4o-mini")
2. Azure OpenAI: Evaluator(provider="azure", model="gpt-4o-mini", api_key="...", base_url="...")
3. Groq: Evaluator(provider="groq", model="llama-3.1-70b-versatile")
4. Anthropic: Evaluator(provider="anthropic", model="claude-sonnet-4-20250514")
5. HuggingFace: Evaluator(provider="huggingface", model="meta-llama/Llama-3.1-8B-Instruct")
6. Ollama: Evaluator(provider="ollama", model="llama3.1")
7. Custom: Evaluator(provider="custom", model="my-model", base_url="http://localhost:8000/v1")
8. None (math only): Evaluator(provider="none", preset="math")

Presets

1. rag: Faithfulness, AnswerRelevance, ContextRelevance, Hallucination
2. chatbot: AnswerRelevance, Coherence, Toxicity, Hallucination
3. safety: Toxicity, Hallucination
4. summarization: Faithfulness, Completeness, Coherence
5. math: All 7 math metrics
6. math_minimal: TokenOverlap, AnswerLength
7. hybrid_rag: TokenOverlap, BLEUScore, KeywordCoverage, Faithfulness, Hallucination

Batch evaluation

from llmevalkit import Evaluator

evaluator = Evaluator(provider="none", preset="math")
batch = evaluator.evaluate_batch([
    {"question": "What is AI?", "answer": "AI is artificial intelligence.", "context": "..."},
    {"question": "What is ML?", "answer": "ML uses data to learn.", "context": "..."},
])
print(batch.pass_rate)
df = batch.to_dataframe()  # needs pandas
df.to_csv("results.csv")
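For intuition, batch.pass_rate presumably reports the fraction of items whose score clears some threshold. As a rough illustration only (the threshold value and aggregation rule here are assumptions, not llmevalkit's documented behavior), a pass rate over per-item overall scores looks like this:

```python
# Hypothetical pass-rate calculation over per-item overall scores.
# The 0.7 threshold and the aggregation rule are assumptions.
def pass_rate(scores: list[float], threshold: float = 0.7) -> float:
    if not scores:
        return 0.0
    # Fraction of items at or above the threshold.
    return sum(s >= threshold for s in scores) / len(scores)

rate = pass_rate([0.9, 0.65, 0.8, 0.71])  # 3 of 4 items pass -> 0.75
```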

CLI

llmevalkit evaluate --question "What is AI?" --answer "AI is artificial intelligence." --preset math
llmevalkit evaluate --file test_cases.json --output results.json
llmevalkit info
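The expected shape of the --file input isn't documented here. Assuming it accepts the same question/answer/context records that evaluate_batch takes (an inference from the batch API above, not confirmed behavior), a test_cases.json could be generated like this:

```python
import json

# Assumed input shape for `llmevalkit evaluate --file`: the same
# question/answer/context records that evaluate_batch accepts.
# The field names here mirror the batch example; the contexts are
# illustrative placeholders.
cases = [
    {
        "question": "What is AI?",
        "answer": "AI is artificial intelligence.",
        "context": "AI stands for artificial intelligence.",
    },
    {
        "question": "What is ML?",
        "answer": "ML uses data to learn.",
        "context": "Machine learning builds models from data.",
    },
]

with open("test_cases.json", "w") as f:
    json.dump(cases, f, indent=2)
```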

Project structure

llmevalkit/
    __init__.py
    evaluator.py
    models.py
    llm_client.py
    prompts.py
    cli.py
    metrics/
        README.md
        base.py
        faithfulness.py
        hallucination.py
        answer_relevance.py
        context_relevance.py
        coherence.py
        completeness.py
        toxicity.py
        geval.py
        math_metrics.py
    utils/
        token_counter.py
tests/
    test_llmeval.py
examples/
    all_15_metrics.py

License

MIT

Author

Venkatkumar Rajan (VK) - https://linkedin.com/in/venkatkumarvk
