LLM evaluation and compliance testing library. Quality metrics + PII detection + HIPAA/GDPR/DPDP/EU AI Act compliance. Works with or without API.

llmevalkit

LLM evaluation and compliance testing library for Python. 21 built-in metrics: 15 quality + 6 compliance. Works with or without an API key.

7 local quality metrics: free, instant, and run offline
8 API quality metrics: use any LLM provider as the judge
6 compliance metrics: PII, HIPAA, GDPR, DPDP Act, EU AI Act, Custom Rules
Parallel execution: API metrics run simultaneously for speed
8 providers: OpenAI, Azure, Groq, Anthropic, HuggingFace, Ollama, Custom, None
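
The parallel-execution point above can be pictured with the standard library alone. This is a minimal sketch, not llmevalkit's implementation: `slow_metric` is a hypothetical stand-in for a network-bound LLM call, and using a thread pool is an assumption about how such calls could be overlapped.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_metric(name: str, delay: float = 0.2) -> tuple[str, float]:
    """Hypothetical stand-in for an API-backed metric: sleeps instead of calling an LLM."""
    time.sleep(delay)
    return name, 1.0


def run_parallel(names: list[str]) -> dict[str, float]:
    """Run each metric in its own thread and collect the scores by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        return dict(pool.map(slow_metric, names))


start = time.perf_counter()
scores = run_parallel(["faithfulness", "hallucination", "answer_relevance"])
elapsed = time.perf_counter() - start
print(scores)
print("elapsed: {:.2f}s".format(elapsed))  # roughly one delay rather than three
```

Threads suit this case because the work is dominated by waiting on the network rather than by CPU time.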

Install

pip install llmevalkit

For deeper PII detection with NLP (optional):

pip install llmevalkit[nlp]
python -m spacy download en_core_web_sm

Quick Start

Quality evaluation (free, no API)

from llmevalkit import Evaluator

evaluator = Evaluator(provider="none", preset="math")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a high-level programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.overall_score)
print(result.summary())

LLM-as-judge evaluation (needs API key)

from llmevalkit import Evaluator

evaluator = Evaluator(provider="groq", model="llama-3.3-70b-versatile", preset="rag")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.summary())

Compliance testing (free, no API)

from llmevalkit import Evaluator

evaluator = Evaluator(provider="none", preset="hipaa")
result = evaluator.evaluate(
    answer="Patient John Smith, SSN 123-45-6789, was admitted on 03/15/1980."
)
print(result.summary())
# Score: 0.0 -- HIPAA identifiers detected

Quality + Compliance together

from llmevalkit import Evaluator, BLEUScore, ROUGEScore
from llmevalkit.compliance import PIIDetector, HIPAACheck

evaluator = Evaluator(
    provider="none",
    metrics=[BLEUScore(), ROUGEScore(), PIIDetector(), HIPAACheck()],
)
result = evaluator.evaluate(
    answer="Solar energy reduces carbon emissions.",
    context="Solar energy is a renewable source."
)
for name, m in result.metrics.items():
    print("{:<22} {:.3f}".format(name, m.score))

Custom metrics (pick and choose)

from llmevalkit import (
    Evaluator, BLEUScore, ROUGEScore, TokenOverlap,
    Faithfulness, Hallucination, GEval,
)

evaluator = Evaluator(
    provider="groq",
    model="llama-3.3-70b-versatile",
    metrics=[
        BLEUScore(), ROUGEScore(), TokenOverlap(),
        Faithfulness(), Hallucination(),
        GEval(criteria="Is this helpful for a beginner?"),
    ],
)
result = evaluator.evaluate(question="...", answer="...", context="...")

All 21 Metrics

Local quality metrics (no API needed)

S.No.	Metric	What it measures
1	BLEUScore	N-gram precision between answer and reference
2	ROUGEScore	Recall-oriented overlap (ROUGE-1, 2, L)
3	TokenOverlap	Word-level F1 with stopword filtering
4	SemanticSimilarity	Cosine similarity of text embeddings
5	KeywordCoverage	Percentage of key terms covered
6	AnswerLength	Whether answer meets min/max word count
7	ReadabilityScore	Flesch-Kincaid readability grade level
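
To make the first rows of the table concrete, here is a minimal standalone sketch of clipped n-gram precision (the idea behind BLEU) and word-level F1 (the idea behind token overlap). It does not use llmevalkit and is not its implementation: the tokenizer is a simplifying assumption, and the stopword filtering and BLEU brevity penalty are left out.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokens(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (t.strip(PUNCT) for t in text.lower().split())
    return [t for t in stripped if t]


def ngram_precision(answer: str, reference: str, n: int = 1) -> float:
    """Clipped precision: matched answer n-grams divided by all answer n-grams."""
    def grams(toks: list[str]) -> Counter:
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    a, r = grams(tokens(answer)), grams(tokens(reference))
    total = sum(a.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, r[g]) for g, count in a.items())
    return matched / total


def token_f1(answer: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    a, r = Counter(tokens(answer)), Counter(tokens(reference))
    overlap = sum((a & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


answer = "Python is a programming language."
reference = "Python is a high-level, interpreted programming language."
print("unigram precision: {:.3f}".format(ngram_precision(answer, reference, 1)))
print("bigram precision:  {:.3f}".format(ngram_precision(answer, reference, 2)))
print("token F1:          {:.3f}".format(token_f1(answer, reference)))
```

Precision rewards answers that stay within the reference wording, while F1 also penalizes answers that leave reference words out.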

API quality metrics (need a provider)

S.No.	Metric	What it measures
8	Faithfulness	Is the answer grounded in the context?
9	Hallucination	Are there fabricated claims? (works without context)
10	AnswerRelevance	Does the answer address the question?
11	ContextRelevance	Is the retrieved context useful?
12	Coherence	Is the answer logically structured?
13	Completeness	Does the answer cover all aspects?
14	Toxicity	Is the content safe and appropriate?
15	GEval	Custom criteria you define

Compliance metrics (work without an API, or with one for deeper analysis)

S.No.	Metric	What it checks	Regulation
16	PIIDetector	Names, SSN, Aadhaar, PAN, email, phone, credit card, IP	Universal
17	HIPAACheck	All 18 Safe Harbor identifiers (45 CFR 164.514)	US HIPAA
18	GDPRCheck	Data minimization, consent, right to erasure, transparency	EU GDPR
19	DPDPCheck	Aadhaar/PAN exposure, consent, children's data, data principal rights	India DPDP Act 2023
20	EUAIActCheck	Risk classification (4 levels), transparency, human oversight, prohibited practices	EU AI Act
21	CustomRule	Any compliance rule you define (keyword-based or LLM-based)	User-defined

Quality Metric Examples

Individual local metrics

from llmevalkit import BLEUScore, ROUGEScore, TokenOverlap, KeywordCoverage

answer = "Python is a high-level programming language for web and data science."
context = "Python is a high-level, interpreted programming language."

bleu = BLEUScore()
r = bleu.evaluate(answer=answer, context=context)
print("BLEU: {:.3f}".format(r.score))
print("Precisions: {}".format(r.details["precisions"]))

rouge = ROUGEScore()
r = rouge.evaluate(answer=answer, context=context)
print("ROUGE: {:.3f}".format(r.score))
print("ROUGE-1 F1: {}".format(r.details["rouge1"]["f1"]))

overlap = TokenOverlap()
r = overlap.evaluate(answer=answer, context=context)
print("Token Overlap: {:.3f}".format(r.score))

kw = KeywordCoverage()
r = kw.evaluate(answer=answer, context=context)
print("Keyword Coverage: {:.3f}".format(r.score))
print("Missing: {}".format(r.details["missing"]))
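
For intuition about the "L" in ROUGE-L, the sketch below scores two texts by their longest common subsequence. It is an illustration under stated assumptions rather than the library's code: whitespace tokenization and a balanced F1 are simplifications, and the function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(answer: str, reference: str) -> float:
    """Balanced F1 over LCS-based precision and recall."""
    a, r = answer.lower().split(), reference.lower().split()
    lcs = lcs_length(a, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


print("{:.3f}".format(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")))
```

Unlike n-gram overlap, the subsequence does not have to be contiguous, so word order is rewarded even when extra words intervene.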

Semantic similarity and readability

from llmevalkit import SemanticSimilarity, ReadabilityScore, AnswerLength

sim = SemanticSimilarity()
r = sim.evaluate(answer="Python is a coding language.", context="Python is a programming language.")
print("Similarity: {:.3f}".format(r.score))

read = ReadabilityScore()
r = read.evaluate(answer="Python is a simple language for beginners.")
print("Readability: {:.3f}".format(r.score))
print("Grade level: {}".format(r.details.get("flesch_kincaid_grade")))

length = AnswerLength(min_words=10, max_words=200)
r = length.evaluate(answer="Yes.")
print("Length score: {:.3f}, words: {}".format(r.score, r.details.get("word_count")))
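
The grade level reported above follows the Flesch-Kincaid formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies it with a deliberately rough syllable heuristic (counting vowel groups); that heuristic and the function names are assumptions, so the numbers will not necessarily match what llmevalkit reports.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level from sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


for sample in ("The cat sat.", "Python is a simple language for beginners."):
    print("{:.2f}  {}".format(flesch_kincaid_grade(sample), sample))
```

Short sentences of one-syllable words can produce a negative grade, which is expected behavior for this formula.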

LLM-as-judge metrics

from llmevalkit import Evaluator, Faithfulness, Hallucination, AnswerRelevance, GEval

evaluator = Evaluator(
    provider="groq",
    model="llama-3.3-70b-versatile",
    metrics=[Faithfulness(), Hallucination(), AnswerRelevance()],
)
result = evaluator.evaluate(
    question="What are the benefits of solar energy?",
    answer="Solar energy is renewable and reduces electricity bills.",
    context="Solar energy is a renewable source that lowers electricity costs."
)
for name, m in result.metrics.items():
    print("{}: {:.3f} - {}".format(name, m.score, m.reason[:80]))

GEval with custom criteria

from llmevalkit import Evaluator, GEval

evaluator = Evaluator(
    provider="groq",
    model="llama-3.3-70b-versatile",
    metrics=[
        GEval(criteria="Is the response helpful for someone considering solar energy?"),
        GEval(criteria="Does the answer include specific facts or numbers?"),
    ],
)
result = evaluator.evaluate(
    question="What are the benefits of solar energy?",
    answer="Solar panels can last 25-30 years and reduce electricity bills by 50-75%."
)

Compliance Metric Examples

PIIDetector

from llmevalkit.compliance import PIIDetector

pii = PIIDetector()                  # pattern + NLP, free
# pii = PIIDetector(use_llm=True)    # alternative: pattern + NLP + LLM, deeper

result = pii.evaluate(
    answer="Contact raj@gmail.com or call +91 98765 43210. PAN: ABCDE1234F."
)
print("Score: {}".format(result.score))  # 0.0 = PII found
for item in result.details["pii_found"]:
    print("  {}: {}".format(item["type"], item["value"]))
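
The pattern layer of a detector like this can be approximated with regular expressions. The following is a minimal sketch under stated assumptions, not the patterns llmevalkit ships: the three regexes and the `find_pii` helper are hypothetical, and the 0.0/1.0 score simply mirrors the "0.0 = PII found" convention in the comment above.

```python
import re

# Hypothetical patterns for illustration; real detectors need many more cases.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "india_pan": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),  # 5 letters, 4 digits, 1 letter
}


def find_pii(text: str) -> list[dict[str, str]]:
    """Return one {"type", "value"} entry per regex match, in pattern order."""
    found = []
    for pii_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"type": pii_type, "value": match.group()})
    return found


text = "Contact raj@gmail.com or call +91 98765 43210. PAN: ABCDE1234F."
items = find_pii(text)
score = 0.0 if items else 1.0
print("Score: {}".format(score))
for item in items:
    print("  {}: {}".format(item["type"], item["value"]))
```

Note that the phone number goes undetected here because no phone pattern is defined, which is why regex-only checks are usually paired with NLP or LLM passes.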

HIPAACheck

from llmevalkit.compliance import HIPAACheck

hipaa = HIPAACheck()                  # pattern + NLP, free
# hipaa = HIPAACheck(use_llm=True)    # alternative: adds LLM for contextual analysis

result = hipaa.evaluate(
    answer="Patient SSN: 123-45-6789, MRN: 12345678"
)
print("Identifiers found: {}".format(result.details["identifiers_found"]))
# [7, 8] -- SSN is #7, MRN is #8 in HIPAA's 18 identifiers
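
To turn identifier numbers such as `[7, 8]` into readable labels, a lookup table is enough. The sketch below numbers the 18 Safe Harbor categories in the order they appear in 45 CFR 164.514(b)(2)(i); that this matches llmevalkit's numbering beyond the two values shown above is an assumption, and `describe` is a hypothetical helper.

```python
# Safe Harbor identifier categories, in regulation order (labels abbreviated).
HIPAA_IDENTIFIERS = {
    1: "Names",
    2: "Geographic subdivisions smaller than a state",
    3: "Dates (except year) related to an individual",
    4: "Telephone numbers",
    5: "Fax numbers",
    6: "Email addresses",
    7: "Social Security numbers",
    8: "Medical record numbers",
    9: "Health plan beneficiary numbers",
    10: "Account numbers",
    11: "Certificate/license numbers",
    12: "Vehicle identifiers and serial numbers",
    13: "Device identifiers and serial numbers",
    14: "Web URLs",
    15: "IP addresses",
    16: "Biometric identifiers",
    17: "Full-face photographs and comparable images",
    18: "Any other unique identifying number, characteristic, or code",
}


def describe(identifier_numbers: list[int]) -> list[str]:
    """Map identifier numbers to 'N: label' strings, flagging unknown numbers."""
    return [
        "{}: {}".format(n, HIPAA_IDENTIFIERS.get(n, "unknown identifier"))
        for n in identifier_numbers
    ]


for line in describe([7, 8]):
    print(line)
```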

GDPRCheck

from llmevalkit.compliance import GDPRCheck

gdpr = GDPRCheck()
result = gdpr.evaluate(
    question="How do I delete my data?",
    answer="We store all data securely."
)
# Flags: Article 17 right to erasure not acknowledged

DPDPCheck

from llmevalkit.compliance import DPDPCheck

dpdp = DPDPCheck()
result = dpdp.evaluate(
    answer="We collect student data for targeted advertising to children."
)
# Flags: Section 9 children's data violation

EUAIActCheck

from llmevalkit.compliance import EUAIActCheck

eu = EUAIActCheck()
result = eu.evaluate(
    answer="We calculate a social score for each citizen."
)
print("Risk level: {}".format(result.details["risk_level"]))  # unacceptable

CustomRule

from llmevalkit.compliance import CustomRule

rule = CustomRule(
    rule="Output must not contain API keys or secrets",
    keywords=["api_key", "secret", "password", "sk-"],
    use_llm=False,
)
result = rule.evaluate(answer="Set your api_key=sk-12345")
# Score: 0.0 (keyword matched)
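
A keyword-based rule like the one above reduces to a case-insensitive substring check. This is a minimal sketch assuming that behavior; `keyword_rule` is a hypothetical helper, not the library's `CustomRule`.

```python
def keyword_rule(answer: str, keywords: list[str]) -> dict:
    """Score 0.0 if any keyword occurs as a substring (case-insensitive), else 1.0."""
    lowered = answer.lower()
    matched = [k for k in keywords if k.lower() in lowered]
    return {"score": 0.0 if matched else 1.0, "matched": matched}


keywords = ["api_key", "secret", "password", "sk-"]
print(keyword_rule("Set your api_key=sk-12345", keywords))
print(keyword_rule("Update the task-list", keywords))  # "sk-" also matches inside "task-list"
```

The second call shows the main weakness of plain substring matching: short keywords produce false positives, so longer or more specific keywords are safer.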

Supported Providers

S.No.	Provider	Example
1	OpenAI	`Evaluator(provider="openai", model="gpt-4o-mini")`
2	Azure OpenAI	`Evaluator(provider="azure", model="gpt-4o-mini", api_key="...", base_url="...")`
3	Groq	`Evaluator(provider="groq", model="llama-3.3-70b-versatile")`
4	Anthropic	`Evaluator(provider="anthropic", model="claude-sonnet-4-20250514")`
5	HuggingFace	`Evaluator(provider="huggingface", model="meta-llama/Llama-3.1-8B-Instruct")`
6	Ollama	`Evaluator(provider="ollama", model="llama3.1")`
7	Custom	`Evaluator(provider="custom", model="my-model", base_url="http://localhost:8000/v1")`
8	None (local only)	`Evaluator(provider="none", preset="math")`

Presets

Quality presets

S.No.	Preset	Metrics included
1	math / local	BLEUScore, ROUGEScore, TokenOverlap, KeywordCoverage, AnswerLength, ReadabilityScore
2	rag	Faithfulness, AnswerRelevance, ContextRelevance, Hallucination
3	chatbot	AnswerRelevance, Coherence, Toxicity, Hallucination
4	summarization	Faithfulness, Completeness, Coherence
5	safety	Toxicity, Hallucination
6	hybrid_rag	TokenOverlap, BLEUScore, KeywordCoverage, Faithfulness, Hallucination

Compliance presets

S.No.	Preset	Metrics included
7	pii	PIIDetector
8	hipaa	PIIDetector, HIPAACheck
9	gdpr	PIIDetector, GDPRCheck
10	india / dpdp	PIIDetector, DPDPCheck
11	eu_ai	PIIDetector, GDPRCheck, EUAIActCheck
12	compliance_all	PIIDetector, HIPAACheck, GDPRCheck, DPDPCheck, EUAIActCheck

Combined presets (quality + compliance)

S.No.	Preset	Metrics included
13	rag_hipaa	Faithfulness, Hallucination, AnswerRelevance, PIIDetector, HIPAACheck
14	rag_gdpr	Faithfulness, Hallucination, AnswerRelevance, PIIDetector, GDPRCheck
15	rag_india	Faithfulness, Hallucination, AnswerRelevance, PIIDetector, DPDPCheck
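
One practical question the preset tables answer is which presets can run with `provider="none"`. The sketch below transcribes the tables into plain dictionaries and checks each preset against the API metric list; it is a reading aid built from this page, not code from llmevalkit, and `needs_api` is a hypothetical helper.

```python
# Metrics listed under "API quality metrics" on this page.
API_METRICS = {
    "Faithfulness", "Hallucination", "AnswerRelevance", "ContextRelevance",
    "Coherence", "Completeness", "Toxicity", "GEval",
}

# Presets transcribed from the tables above (aliases "local" and "india" omitted).
PRESETS = {
    "math": ["BLEUScore", "ROUGEScore", "TokenOverlap", "KeywordCoverage",
             "AnswerLength", "ReadabilityScore"],
    "rag": ["Faithfulness", "AnswerRelevance", "ContextRelevance", "Hallucination"],
    "chatbot": ["AnswerRelevance", "Coherence", "Toxicity", "Hallucination"],
    "summarization": ["Faithfulness", "Completeness", "Coherence"],
    "safety": ["Toxicity", "Hallucination"],
    "hybrid_rag": ["TokenOverlap", "BLEUScore", "KeywordCoverage",
                   "Faithfulness", "Hallucination"],
    "pii": ["PIIDetector"],
    "hipaa": ["PIIDetector", "HIPAACheck"],
    "gdpr": ["PIIDetector", "GDPRCheck"],
    "dpdp": ["PIIDetector", "DPDPCheck"],
    "eu_ai": ["PIIDetector", "GDPRCheck", "EUAIActCheck"],
    "compliance_all": ["PIIDetector", "HIPAACheck", "GDPRCheck",
                       "DPDPCheck", "EUAIActCheck"],
    "rag_hipaa": ["Faithfulness", "Hallucination", "AnswerRelevance",
                  "PIIDetector", "HIPAACheck"],
    "rag_gdpr": ["Faithfulness", "Hallucination", "AnswerRelevance",
                 "PIIDetector", "GDPRCheck"],
    "rag_india": ["Faithfulness", "Hallucination", "AnswerRelevance",
                  "PIIDetector", "DPDPCheck"],
}


def needs_api(preset: str) -> bool:
    """True if the preset includes at least one LLM-as-judge metric."""
    return any(metric in API_METRICS for metric in PRESETS[preset])


local_only = sorted(name for name in PRESETS if not needs_api(name))
print("Run without an API key:", ", ".join(local_only))
```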

Batch Evaluation

from llmevalkit import Evaluator

# Quality batch
evaluator = Evaluator(provider="none", preset="math")
batch = evaluator.evaluate_batch([
    {"question": "What is AI?", "answer": "AI is artificial intelligence.", "context": "AI is..."},
    {"question": "What is AI?", "answer": "Yes.", "context": "AI is..."},
])
print("Pass rate: {:.0%}".format(batch.pass_rate))

# Compliance batch
evaluator = Evaluator(provider="none", preset="hipaa")
batch = evaluator.evaluate_batch([
    {"answer": "Recovery rate improved by 20%."},
    {"answer": "Patient John, SSN 123-45-6789."},
])
print("Pass rate: {:.0%}".format(batch.pass_rate))
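
This page does not say how `pass_rate` is computed. One plausible reading, sketched below, is the share of items whose overall score reaches a threshold; the 0.5 threshold and the `pass_rate` helper are assumptions for illustration only.

```python
def pass_rate(scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of scores at or above the threshold (0.0 for an empty batch)."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical overall scores for a two-item batch: one clean answer, one flagged.
print("Pass rate: {:.0%}".format(pass_rate([0.9, 0.0])))
```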

CLI

llmevalkit evaluate --question "What is AI?" --answer "AI is artificial intelligence." --preset math
llmevalkit info

Disclaimer

llmevalkit is a testing and evaluation tool. It helps developers detect potential compliance issues in LLM outputs. It does not provide legal advice, regulatory certification, or compliance guarantees.

HIPAA, GDPR, DPDP Act, EU AI Act, and NIST AI RMF are government regulations and frameworks. llmevalkit is not affiliated with, endorsed by, or certified by any government body.

Using this library does not make your system compliant with any regulation. Consult qualified legal and compliance professionals for compliance decisions.

License

MIT

Author

Venkatkumar Rajan - https://linkedin.com/in/venkatkumarvk | https://github.com/VK-Ant

Release history

Version	Released
5.0.2	Apr 25, 2026
5.0.1	Apr 25, 2026
5.0.0	Apr 25, 2026
4.0.1	Apr 18, 2026
4.0.0	Apr 18, 2026
3.0.4	Apr 3, 2026
3.0.3	Apr 3, 2026
3.0.2	Apr 3, 2026
3.0.1	Apr 3, 2026
3.0.0	Apr 3, 2026
2.0.3 (this version)	Mar 29, 2026
2.0.2	Mar 29, 2026
2.0.1	Mar 29, 2026
2.0.0	Mar 29, 2026
1.0.3	Mar 21, 2026
1.0.2	Mar 21, 2026
1.0.1	Mar 21, 2026
1.0.0	Mar 21, 2026
