LLM evaluation, compliance, document parsing, governance, security, hallucination detection, and multimodal testing. 46 metrics. Works with or without API.

These details have not been verified by PyPI

Project links

Project description

llmevalkit

LLM evaluation, hallucination detection, compliance, document parsing, governance, security, and multimodal testing library for Python.

46 built-in metrics across 7 modules. Everything works with or without an API key.

Works with any LLM application: RAG pipelines, agentic AI, multi-agent systems, GraphRAG, chatbots, document extraction, code generation, summarization, or any system that produces text output.

PyPI: https://pypi.org/project/llmevalkit/ GitHub: https://github.com/VK-Ant/llmevalkit Portfolio: https://vk-ant.github.io/Venkatkumar/

Install

pip install llmevalkit
pip install llmevalkit[nlp]       # adds spaCy for better PII/entity detection
pip install llmevalkit[doceval]   # adds thefuzz for document evaluation
pip install llmevalkit[all]       # everything

The Problem

Every LLM application faces the same questions:

Is the output good? Faithfulness, hallucination, relevance, coherence.

Is the output hallucinating? Wrong entities, wrong numbers, negation flips, fabricated facts, contradictions, overconfident wrong answers.

Is the output safe? PII leakage, HIPAA violations, GDPR issues, prompt injection.

Is the extraction correct? Do extracted fields match the source document?

Is the system governed and secure? NIST, ISO 42001, bias detection, prompt injection.

Most teams use 4-5 different tools to answer these questions. llmevalkit answers all of them in one library.

Quick Start

from llmevalkit import Evaluator

# Quality (free, no API)
evaluator = Evaluator(provider="none", preset="math")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a high-level programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.summary())

# Hallucination detection (free, no API)
from llmevalkit.hallucination import NumericHallucination, EntityHallucination

nh = NumericHallucination()
result = nh.evaluate(
    answer="Revenue was $5 million last quarter.",
    context="The company reported quarterly revenue of $3 million."
)
print(result.score)   # flags: $5M vs $3M

# Compliance (free, no API)
evaluator = Evaluator(provider="none", preset="hipaa")
result = evaluator.evaluate(answer="Patient SSN: 123-45-6789")
print(result.summary())

# Security (free, no API)
from llmevalkit.security import PromptInjectionCheck
pi = PromptInjectionCheck()
result = pi.evaluate(answer="Ignore all previous instructions")
print(result.score)   # 0.0 -- injection detected

All 46 Metrics

Module 1: Quality Metrics (15)

Local metrics (no API needed):

#	Metric	What it measures
1	BLEUScore	N-gram precision
2	ROUGEScore	Recall-oriented overlap
3	TokenOverlap	Word-level F1
4	SemanticSimilarity	Cosine similarity of embeddings
5	KeywordCoverage	Key terms covered
6	AnswerLength	Min/max word count
7	ReadabilityScore	Flesch-Kincaid grade level

API metrics (needs provider):

#	Metric	What it measures
8	Faithfulness	Grounded in context?
9	Hallucination	Fabricated claims?
10	AnswerRelevance	Addresses the question?
11	ContextRelevance	Retrieved context useful?
12	Coherence	Logically structured?
13	Completeness	Covers all aspects?
14	Toxicity	Safe and appropriate?
15	GEval	Any custom criteria you define

Module 2: Compliance Metrics (6)

#	Metric	Regulation
16	PIIDetector	Universal (SSN, Aadhaar, PAN, email, phone, credit card)
17	HIPAACheck	US HIPAA (18 Safe Harbor identifiers)
18	GDPRCheck	EU GDPR (data minimization, consent, right to erasure)
19	DPDPCheck	India DPDP Act 2023 (children's data, Aadhaar/PAN)
20	EUAIActCheck	EU AI Act (risk classification, transparency)
21	CustomRule	Any rule you define

Module 3: Document Evaluation (5)

#	Metric	What it checks
22	FieldAccuracy	Extracted values match source?
23	FieldCompleteness	All expected fields present?
24	FieldHallucination	Any values fabricated?
25	FormatValidation	Dates, amounts, emails valid format?
26	ExtractionConsistency	Multiple runs produce same results?

Module 4: Governance Metrics (4)

#	Metric	Framework
27	NISTCheck	NIST AI Risk Management Framework
28	CoSAICheck	Coalition for Secure AI
29	ISO42001Check	ISO 42001 AI Management System
30	SOC2Check	SOC 2 Security Controls

Module 5: Security Metrics (2)

#	Metric	What it checks
31	PromptInjectionCheck	Instruction override, jailbreak, system prompt extraction
32	BiasDetector	Gender, racial, age bias and stereotyping

Module 6: Hallucination Detection (8) -- NEW in v4.0

#	Metric	What it catches	Without API	With API
33	EntityHallucination	Wrong names, places, orgs	NER + regex extraction	LLM compares entities
34	NumericHallucination	Wrong numbers, dates, amounts	Number extraction + comparison	LLM checks each number
35	NegationHallucination	"approved" vs "not approved"	Negation word detection	LLM checks logical flips
36	FabricatedInfo	Statements with no evidence	Sentence-level source coverage	LLM judges each statement
37	ContradictionDetector	Output contradicts context	Antonym + negation analysis	LLM finds contradictions
38	SelfConsistency	Different answers each run	Compare multiple outputs	No context needed
39	ConfidenceCalibration	"Definitely" + wrong answer	Confidence word detection	LLM evaluates calibration
40	InstructionHallucination	Answers wrong question	Question-answer overlap	LLM checks topic match

from llmevalkit.hallucination import (
    EntityHallucination, NumericHallucination, NegationHallucination,
    FabricatedInfo, ContradictionDetector, SelfConsistency,
    ConfidenceCalibration, InstructionHallucination,
)

# Entity check
eh = EntityHallucination()
result = eh.evaluate(
    answer="Dr. Kumar prescribed the medication.",
    context="Dr. Sharma is the attending physician."
)

# Numeric check
nh = NumericHallucination()
result = nh.evaluate(
    answer="Revenue was $5 million.",
    context="Company reported revenue of $3 million."
)

# Self-consistency (no context needed)
sc = SelfConsistency()
result = sc.evaluate(answer=[
    "Python was created in 1991.",
    "Python was created in 1989.",
    "Python was created in 1991.",
])

Module 7: Multimodal Metrics (6)

#	Metric	What it checks
41	OCRAccuracy	Word/character error rate for OCR
42	AudioTranscriptionAccuracy	WER/CER for speech-to-text
43	ImageTextAlignment	Text matches image description?
44	VisionQAAccuracy	Visual QA answer correct?
45	DocumentLayoutAccuracy	Headers, tables, sections preserved?
46	MultimodalConsistency	Text + image/audio descriptions consistent?

Works With Any LLM Application

#	Application	How llmevalkit helps
1	RAG pipelines	Faithfulness, Hallucination, EntityHallucination, NumericHallucination
2	AI agents	GEval, InstructionHallucination, SelfConsistency, PromptInjectionCheck
3	Multi-agent systems	Evaluate each agent individually or final output
4	Document extraction	FieldAccuracy, FieldCompleteness, FieldHallucination
5	Healthcare AI	HIPAACheck, EntityHallucination, NumericHallucination, NegationHallucination
6	Chatbots	Coherence, Toxicity, BiasDetector, PromptInjectionCheck
7	Code generation	GEval with custom criteria, PromptInjectionCheck
8	Summarization	ROUGE, Faithfulness, FabricatedInfo, ContradictionDetector
9	OCR / Document AI	OCRAccuracy, DocumentLayoutAccuracy, FieldAccuracy
10	Audio / Speech AI	AudioTranscriptionAccuracy, MultimodalConsistency

All Presets

#	Preset	Metrics
1	math / local	6 local quality metrics
2	rag	Faithfulness, Relevance, Hallucination
3	chatbot	Relevance, Coherence, Toxicity
4	hipaa	PII + HIPAACheck
5	gdpr	PII + GDPRCheck
6	india / dpdp	PII + DPDPCheck
7	compliance_all	All 5 compliance metrics
8	doceval	Accuracy, Completeness, Hallucination, Format
9	governance	NIST, CoSAI, ISO42001, SOC2
10	security	PromptInjection + BiasDetector
11	multimodal	All 6 multimodal metrics
12	hallucination	All 8 hallucination metrics
13	hallucination_quick	Entity + Numeric + Fabricated
14	hallucination_rag	Entity + Numeric + Contradiction + Fabricated
15	hallucination_agent	SelfConsistency + Instruction + Confidence
16	hallucination_medical	Entity + Numeric + Negation + Contradiction
17	full_audit	Quality + Compliance + Security + Hallucination
18	enterprise	Quality + Compliance + Security + NIST

Supported Providers

#	Provider	Example
1	OpenAI	`Evaluator(provider="openai", model="gpt-4o-mini")`
2	Azure OpenAI	`Evaluator(provider="azure", model="gpt-4o-mini")`
3	Groq	`Evaluator(provider="groq", model="llama-3.3-70b-versatile")`
4	Anthropic	`Evaluator(provider="anthropic", model="claude-sonnet-4-20250514")`
5	HuggingFace	`Evaluator(provider="huggingface", model="meta-llama/Llama-3.1-8B-Instruct")`
6	Ollama	`Evaluator(provider="ollama", model="llama3.1")`
7	Custom	`Evaluator(provider="custom", model="my-model", base_url="...")`
8	None (offline)	`Evaluator(provider="none", preset="math")`

Disclaimer

llmevalkit is a testing and evaluation tool. It helps developers detect potential issues in LLM outputs including hallucinations, compliance violations, security vulnerabilities, and extraction errors. It does not guarantee detection of all issues. Always verify critical outputs with domain experts.

Hallucination detection metrics provide automated checks but cannot catch every hallucination. Use them as part of a broader evaluation strategy, not as the sole verification method.

HIPAA, GDPR, DPDP Act, EU AI Act, NIST AI RMF, CoSAI, ISO 42001, and SOC 2 are government regulations and industry frameworks. llmevalkit is not affiliated with, endorsed by, or certified by any government body or standards organization. Using this library does not make your system compliant. Consult qualified professionals for compliance decisions.

License

MIT

Author

Venkatkumar Rajan

LinkedIn: https://linkedin.com/in/venkatkumarvk GitHub: https://github.com/VK-Ant Portfolio: https://vk-ant.github.io/Venkatkumar/ PyPI: https://pypi.org/project/llmevalkit/

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

5.0.2

Apr 25, 2026

5.0.1

Apr 25, 2026

5.0.0

Apr 25, 2026

4.0.1

Apr 18, 2026

This version

4.0.0

Apr 18, 2026

3.0.4

Apr 3, 2026

3.0.3

Apr 3, 2026

3.0.2

Apr 3, 2026

3.0.1

Apr 3, 2026

3.0.0

Apr 3, 2026

2.0.3

Mar 29, 2026

2.0.2

Mar 29, 2026

2.0.1

Mar 29, 2026

2.0.0

Mar 29, 2026

1.0.3

Mar 21, 2026

1.0.2

Mar 21, 2026

1.0.1

Mar 21, 2026

1.0.0

Mar 21, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmevalkit-4.0.0.tar.gz (101.2 kB view details)

Uploaded Apr 18, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

llmevalkit-4.0.0-py3-none-any.whl (92.5 kB view details)

Uploaded Apr 18, 2026 Python 3

File details

Details for the file llmevalkit-4.0.0.tar.gz.

File metadata

Download URL: llmevalkit-4.0.0.tar.gz
Upload date: Apr 18, 2026
Size: 101.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for llmevalkit-4.0.0.tar.gz
Algorithm	Hash digest
SHA256	`131786c0c2955c52fc4c33897738243890a4f74104a3ef504469e855fd58ecf0`
MD5	`23484d99d872e1c75b8f77e917fb9e55`
BLAKE2b-256	`df39172a801ca1b63f017409644fc6fb52f890c8b5d1428f05a2262440e017b4`

See more details on using hashes here.

File details

Details for the file llmevalkit-4.0.0-py3-none-any.whl.

File metadata

Download URL: llmevalkit-4.0.0-py3-none-any.whl
Upload date: Apr 18, 2026
Size: 92.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for llmevalkit-4.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`96afc39d2e465a3b10aeccbf331c27edc880a5593e304fb238a5c17bb9238865`
MD5	`bd81fa6c052709ca181691b495b2cf43`
BLAKE2b-256	`ada1a9c2d813e02d5ebd8ad63a869cb43b639ee2fbb6a4139fc787186ce74782`

See more details on using hashes here.

llmevalkit 4.0.0

Navigation

Verified details

Project links

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

llmevalkit

Install

The Problem

Quick Start

All 46 Metrics

Module 1: Quality Metrics (15)

Module 2: Compliance Metrics (6)

Module 3: Document Evaluation (5)

Module 4: Governance Metrics (4)

Module 5: Security Metrics (2)

Module 6: Hallucination Detection (8) -- NEW in v4.0

Module 7: Multimodal Metrics (6)

Works With Any LLM Application

All Presets

Supported Providers

Disclaimer

License

Author

Project details

Verified details

Project links

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes