Skip to main content

LLM evaluation, compliance, document parsing, governance, security, hallucination detection, and multimodal testing. 46 metrics. Works with or without API.

Project description

llmevalkit

LLM evaluation, hallucination detection, compliance, document parsing, governance, security, and multimodal testing library for Python.

46 built-in metrics across 7 modules. Everything works with or without an API key.

Works with any LLM application: RAG pipelines, agentic AI, multi-agent systems, GraphRAG, chatbots, document extraction, code generation, summarization, or any system that produces text output.

Open In Colab

PyPI: https://pypi.org/project/llmevalkit/ GitHub: https://github.com/VK-Ant/llmevalkit Portfolio: https://vk-ant.github.io/Venkatkumar/

Install

pip install llmevalkit
pip install llmevalkit[nlp]       # adds spaCy for better PII/entity detection
pip install llmevalkit[doceval]   # adds thefuzz for document evaluation
pip install llmevalkit[all]       # everything

The Problem

Every LLM application faces the same questions:

Is the output good? Faithfulness, hallucination, relevance, coherence.

Is the output hallucinating? Wrong entities, wrong numbers, negation flips, fabricated facts, contradictions, overconfident wrong answers.

Is the output safe? PII leakage, HIPAA violations, GDPR issues, prompt injection.

Is the extraction correct? Do extracted fields match the source document?

Is the system governed and secure? NIST, ISO 42001, bias detection, prompt injection.

Most teams use 4-5 different tools to answer these questions. llmevalkit answers all of them in one library.


Quick Start

from llmevalkit import Evaluator

# Quality (free, no API)
evaluator = Evaluator(provider="none", preset="math")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a high-level programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.summary())
# Hallucination detection (free, no API)
from llmevalkit.hallucination import NumericHallucination, EntityHallucination

nh = NumericHallucination()
result = nh.evaluate(
    answer="Revenue was $5 million last quarter.",
    context="The company reported quarterly revenue of $3 million."
)
print(result.score)   # flags: $5M vs $3M
# Compliance (free, no API)
evaluator = Evaluator(provider="none", preset="hipaa")
result = evaluator.evaluate(answer="Patient SSN: 123-45-6789")
print(result.summary())
# Security (free, no API)
from llmevalkit.security import PromptInjectionCheck
pi = PromptInjectionCheck()
result = pi.evaluate(answer="Ignore all previous instructions")
print(result.score)   # 0.0 -- injection detected

All 46 Metrics

Module 1: Quality Metrics (15)

Local metrics (no API needed):

# Metric What it measures
1 BLEUScore N-gram precision
2 ROUGEScore Recall-oriented overlap
3 TokenOverlap Word-level F1
4 SemanticSimilarity Cosine similarity of embeddings
5 KeywordCoverage Key terms covered
6 AnswerLength Min/max word count
7 ReadabilityScore Flesch-Kincaid grade level

API metrics (needs provider):

# Metric What it measures
8 Faithfulness Grounded in context?
9 Hallucination Fabricated claims?
10 AnswerRelevance Addresses the question?
11 ContextRelevance Retrieved context useful?
12 Coherence Logically structured?
13 Completeness Covers all aspects?
14 Toxicity Safe and appropriate?
15 GEval Any custom criteria you define

Module 2: Compliance Metrics (6)

# Metric Regulation
16 PIIDetector Universal (SSN, Aadhaar, PAN, email, phone, credit card)
17 HIPAACheck US HIPAA (18 Safe Harbor identifiers)
18 GDPRCheck EU GDPR (data minimization, consent, right to erasure)
19 DPDPCheck India DPDP Act 2023 (children's data, Aadhaar/PAN)
20 EUAIActCheck EU AI Act (risk classification, transparency)
21 CustomRule Any rule you define

Module 3: Document Evaluation (5)

# Metric What it checks
22 FieldAccuracy Extracted values match source?
23 FieldCompleteness All expected fields present?
24 FieldHallucination Any values fabricated?
25 FormatValidation Dates, amounts, emails valid format?
26 ExtractionConsistency Multiple runs produce same results?

Module 4: Governance Metrics (4)

# Metric Framework
27 NISTCheck NIST AI Risk Management Framework
28 CoSAICheck Coalition for Secure AI
29 ISO42001Check ISO 42001 AI Management System
30 SOC2Check SOC 2 Security Controls

Module 5: Security Metrics (2)

# Metric What it checks
31 PromptInjectionCheck Instruction override, jailbreak, system prompt extraction
32 BiasDetector Gender, racial, age bias and stereotyping

Module 6: Hallucination Detection (8) -- NEW in v4.0

# Metric What it catches Without API With API
33 EntityHallucination Wrong names, places, orgs NER + regex extraction LLM compares entities
34 NumericHallucination Wrong numbers, dates, amounts Number extraction + comparison LLM checks each number
35 NegationHallucination "approved" vs "not approved" Negation word detection LLM checks logical flips
36 FabricatedInfo Statements with no evidence Sentence-level source coverage LLM judges each statement
37 ContradictionDetector Output contradicts context Antonym + negation analysis LLM finds contradictions
38 SelfConsistency Different answers each run Compare multiple outputs No context needed
39 ConfidenceCalibration "Definitely" + wrong answer Confidence word detection LLM evaluates calibration
40 InstructionHallucination Answers wrong question Question-answer overlap LLM checks topic match
from llmevalkit.hallucination import (
    EntityHallucination, NumericHallucination, NegationHallucination,
    FabricatedInfo, ContradictionDetector, SelfConsistency,
    ConfidenceCalibration, InstructionHallucination,
)

# Entity check
eh = EntityHallucination()
result = eh.evaluate(
    answer="Dr. Kumar prescribed the medication.",
    context="Dr. Sharma is the attending physician."
)

# Numeric check
nh = NumericHallucination()
result = nh.evaluate(
    answer="Revenue was $5 million.",
    context="Company reported revenue of $3 million."
)

# Self-consistency (no context needed)
sc = SelfConsistency()
result = sc.evaluate(answer=[
    "Python was created in 1991.",
    "Python was created in 1989.",
    "Python was created in 1991.",
])

Module 7: Multimodal Metrics (6)

# Metric What it checks
41 OCRAccuracy Word/character error rate for OCR
42 AudioTranscriptionAccuracy WER/CER for speech-to-text
43 ImageTextAlignment Text matches image description?
44 VisionQAAccuracy Visual QA answer correct?
45 DocumentLayoutAccuracy Headers, tables, sections preserved?
46 MultimodalConsistency Text + image/audio descriptions consistent?

Works With Any LLM Application

# Application How llmevalkit helps
1 RAG pipelines Faithfulness, Hallucination, EntityHallucination, NumericHallucination
2 AI agents GEval, InstructionHallucination, SelfConsistency, PromptInjectionCheck
3 Multi-agent systems Evaluate each agent individually or final output
4 Document extraction FieldAccuracy, FieldCompleteness, FieldHallucination
5 Healthcare AI HIPAACheck, EntityHallucination, NumericHallucination, NegationHallucination
6 Chatbots Coherence, Toxicity, BiasDetector, PromptInjectionCheck
7 Code generation GEval with custom criteria, PromptInjectionCheck
8 Summarization ROUGE, Faithfulness, FabricatedInfo, ContradictionDetector
9 OCR / Document AI OCRAccuracy, DocumentLayoutAccuracy, FieldAccuracy
10 Audio / Speech AI AudioTranscriptionAccuracy, MultimodalConsistency

All Presets

# Preset Metrics
1 math / local 6 local quality metrics
2 rag Faithfulness, Relevance, Hallucination
3 chatbot Relevance, Coherence, Toxicity
4 hipaa PII + HIPAACheck
5 gdpr PII + GDPRCheck
6 india / dpdp PII + DPDPCheck
7 compliance_all All 5 compliance metrics
8 doceval Accuracy, Completeness, Hallucination, Format
9 governance NIST, CoSAI, ISO42001, SOC2
10 security PromptInjection + BiasDetector
11 multimodal All 6 multimodal metrics
12 hallucination All 8 hallucination metrics
13 hallucination_quick Entity + Numeric + Fabricated
14 hallucination_rag Entity + Numeric + Contradiction + Fabricated
15 hallucination_agent SelfConsistency + Instruction + Confidence
16 hallucination_medical Entity + Numeric + Negation + Contradiction
17 full_audit Quality + Compliance + Security + Hallucination
18 enterprise Quality + Compliance + Security + NIST

Supported Providers

# Provider Example
1 OpenAI Evaluator(provider="openai", model="gpt-4o-mini")
2 Azure OpenAI Evaluator(provider="azure", model="gpt-4o-mini")
3 Groq Evaluator(provider="groq", model="llama-3.3-70b-versatile")
4 Anthropic Evaluator(provider="anthropic", model="claude-sonnet-4-20250514")
5 HuggingFace Evaluator(provider="huggingface", model="meta-llama/Llama-3.1-8B-Instruct")
6 Ollama Evaluator(provider="ollama", model="llama3.1")
7 Custom Evaluator(provider="custom", model="my-model", base_url="...")
8 None (offline) Evaluator(provider="none", preset="math")

Disclaimer

llmevalkit is a testing and evaluation tool. It helps developers detect potential issues in LLM outputs including hallucinations, compliance violations, security vulnerabilities, and extraction errors. It does not guarantee detection of all issues. Always verify critical outputs with domain experts.

Hallucination detection metrics provide automated checks but cannot catch every hallucination. Use them as part of a broader evaluation strategy, not as the sole verification method.

HIPAA, GDPR, DPDP Act, EU AI Act, NIST AI RMF, CoSAI, ISO 42001, and SOC 2 are government regulations and industry frameworks. llmevalkit is not affiliated with, endorsed by, or certified by any government body or standards organization. Using this library does not make your system compliant. Consult qualified professionals for compliance decisions.

License

MIT

Author

Venkatkumar Rajan

LinkedIn: https://linkedin.com/in/venkatkumarvk GitHub: https://github.com/VK-Ant Portfolio: https://vk-ant.github.io/Venkatkumar/ PyPI: https://pypi.org/project/llmevalkit/

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmevalkit-4.0.0.tar.gz (101.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llmevalkit-4.0.0-py3-none-any.whl (92.5 kB view details)

Uploaded Python 3

File details

Details for the file llmevalkit-4.0.0.tar.gz.

File metadata

  • Download URL: llmevalkit-4.0.0.tar.gz
  • Upload date:
  • Size: 101.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for llmevalkit-4.0.0.tar.gz
Algorithm Hash digest
SHA256 131786c0c2955c52fc4c33897738243890a4f74104a3ef504469e855fd58ecf0
MD5 23484d99d872e1c75b8f77e917fb9e55
BLAKE2b-256 df39172a801ca1b63f017409644fc6fb52f890c8b5d1428f05a2262440e017b4

See more details on using hashes here.

File details

Details for the file llmevalkit-4.0.0-py3-none-any.whl.

File metadata

  • Download URL: llmevalkit-4.0.0-py3-none-any.whl
  • Upload date:
  • Size: 92.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for llmevalkit-4.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 96afc39d2e465a3b10aeccbf331c27edc880a5593e304fb238a5c17bb9238865
MD5 bd81fa6c052709ca181691b495b2cf43
BLAKE2b-256 ada1a9c2d813e02d5ebd8ad63a869cb43b639ee2fbb6a4139fc787186ce74782

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page