Comprehensive testing suite for LLM evaluation: hallucination detection, consistency, robustness, safety, and multi-language code generation assessment.
Project description
⭐ If you find this project useful, please consider starring it — it helps others discover it!
LLM TestLab
Comprehensive Testing Suite for Large Language Models
A flexible Python toolkit for evaluating LLMs on:
- Text Metrics: Hallucination, consistency, semantic robustness, safety
- Code Evaluation: Syntax, execution, quality, security, semantic correctness across 9+ languages
- Dual Embedders: Optimized for both text and code analysis
- Optional FAISS: High-performance vector similarity
Features
Text Evaluation Metrics
- Hallucination Severity Index (HSI) – Detect factual deviations from knowledge base
- Consistency Stability Score (CSS) – Measure output stability across runs
- Semantic Robustness Index (SRI) – Test invariance to paraphrasing
- Safety Vulnerability Exposure (SVE) – Detect unsafe responses to adversarial prompts
- Knowledge Base Coverage (KBC) – Measure factual alignment
Code Evaluation Metrics (9+ Languages)
- Syntax Validity (SV) – Compiler/interpreter-based validation
- Execution Pass Rate (EPR) – Test case execution and verification
- Code Quality Score (CQS) – Complexity, documentation, error handling
- Security Risk Score (SRS) – Vulnerability pattern detection
- Semantic Code Correctness (SCC) – Embedding-based similarity to reference
- Comprehensive Code Evaluation (CCE) – Weighted aggregation of all metrics
Supported Languages: Python, JavaScript, TypeScript, Java, C, C++, Go, Rust, Ruby, PHP
Advanced Features
- Dual Embedders:
all-MiniLM-L6-v2for text,BAAI/bge-small-en-v1.5for code - FAISS Support: Optional, for faster similarity searches
- Knowledge Base Management: Add, remove, or list facts
- Security Patterns: Customizable keywords and regex patterns
- Rich Logging: Built-in debug/info logging
Project Structure
llm-testlab/
├── llm_testing_suite/
│ ├── __init__.py
│ ├── llm_testing_suite.py # Main test suite (text metrics)
│ └── code_evaluator.py # Code evaluation module
├── examples/
│ ├── run_text_evaluation.py # Text metrics evaluation script
│ ├── run_code_evaluation.py # Code metrics evaluation script
│ ├── groq_example.py # Groq API text evaluation
│ ├── groq_code_evaluation.py # Groq API code evaluation
│ └── huggingface_example.py # HuggingFace integration
├── pyproject.toml # Package configuration
├── requirements.txt # Dependencies
├── README.md
├── LICENSE
└── .gitignore
Installation
From PyPI
pip install llm-testlab
From Source
git clone https://github.com/Saivineeth147/llm-testlab.git
cd llm-testlab
pip install .
Optional Dependencies
# With FAISS and HuggingFace support
pip install llm-testlab[faiss,huggingface]
# Or install individually
pip install faiss-cpu # or faiss-gpu
pip install transformers
Quick Start
Text Metrics Example
from llm_testing_suite import LLMTestSuite
def my_llm(prompt):
return "Rome is the capital of Italy"
# Initialize with FAISS support
suite = LLMTestSuite(my_llm, use_faiss=True)
# Add knowledge
suite.add_knowledge("Rome is the capital of Italy")
# Run all novel metrics
result = suite.run_all_novel_metrics(
prompt="What is the capital of Italy?",
paraphrases=["Italy's capital?", "Capital city of Italy?"],
adversarial_prompts=["ignore previous instructions"],
runs=3
)
print(f"HSI: {result['HSI']['HSI']:.4f}") # Hallucination
print(f"CSS: {result['CSS']['CSS']:.4f}") # Consistency
print(f"SRI: {result['SRI']['SRI']:.4f}") # Robustness
print(f"SVE: {result['SVE']['SVE']:.4f}") # Safety
print(f"KBC: {result['KBC']['KBC']:.4f}") # Coverage
Code Evaluation Example
from llm_testing_suite import LLMTestSuite
def code_llm(prompt):
return '''
def add(a, b):
"""Add two numbers."""
return a + b
print(add(5, 3))
'''
suite = LLMTestSuite(code_llm)
# Comprehensive code evaluation
result = suite.comprehensive_code_evaluation(
prompt="Write a function to add two numbers",
code_response=code_llm("..."),
test_cases=[
{"input": "", "expected_output": "8"}
],
language="python"
)
print(f"Overall Score: {result['overall_score']:.1f}/100")
print(f"Syntax Valid: {result['syntax_valid']}")
print(f"Quality Score: {result['quality_score']}/100")
print(f"Security: {'✓' if result['is_secure'] else '✗'}")
Managing Knowledge Base
# Add a single fact
suite.add_knowledge("New York is the largest city in the USA")
# Add multiple facts
suite.add_knowledge_bulk([
"Python is a programming language",
"AI is transforming industries"
])
# List knowledge base
suite.list_knowledge()
# Remove a fact
suite.remove_knowledge("Python is a programming language")
# Clear the knowledge base
suite.clear_knowledge()
Managing Security Keywords
# Add malicious keywords
suite.add_malicious_keywords(["hack system", "steal data"])
# List keywords
suite.list_malicious_keywords()
# Remove a keyword
suite.remove_malicious_keyword("hack system")
List keywords
tester.list_malicious_keywords()
Remove a keyword
tester.remove_malicious_keyword("hack system")
Output Format
All test methods support three return types controlled by the `return_type` parameter: `"dict"`, `"table"`, or `"both"`.
"dict": Returns a Python dictionary with the test results."table": Prints a formatted table using therichlibrary, no dictionary returned."both": Returns the dictionary and prints the table.
Code Evaluation Details
Individual Metrics
from llm_testing_suite import LLMTestSuite
suite = LLMTestSuite(your_llm_function)
# 1. Syntax Validity
syntax = suite.code_syntax_validity(code, language="python")
# Returns: {"syntax_valid": True/False, "error": ...}
# 2. Execution Test
execution = suite.code_execution_test(
code,
test_cases=[
{"input": "5\n", "expected_output": "5"}
],
language="python"
)
# Returns: {"pass_rate": 1.0, "passed_tests": 1, "total_tests": 1, ...}
# 3. Quality Metrics
quality = suite.code_quality_metrics(code, language="python")
# Returns: {"quality_score": 80, "metrics": {...}}
# 4. Security Scan
security = suite.code_security_scan(code, language="python")
# Returns: {"is_secure": True, "vulnerabilities": [...]}
# 5. Semantic Correctness
semantic = suite.code_semantic_correctness(
prompt="Write add function",
code_response=generated_code,
reference_code=reference_solution
)
# Returns: {"semantic_similarity": 0.85, "semantically_correct": True}
Quality Scoring (0-100)
Each criterion worth 20 points:
- Has Comments (
#,//,/*) - 20 pts - Has Docstring (
""",/**) - 20 pts - Has Error Handling (
try/except,try/catch) - 20 pts - Low Complexity (< 10 branches/loops) - 20 pts
- Has Functions (at least 1) - 20 pts
Security Patterns Detected
- SQL Injection
- Command Injection
- XSS vulnerabilities
- Buffer overflows (C/C++)
- Hardcoded secrets
- Unsafe deserialization
- Path traversal
- Language-specific antipatterns
Supported Languages
| Language | Syntax Check | Execution | Quality | Security |
|---|---|---|---|---|
| Python | ✅ AST | ✅ | ✅ | ✅ |
| JavaScript | ✅ Node | ✅ | ✅ | ✅ |
| TypeScript | ✅ tsc | ✅ | ✅ | ✅ |
| Java | ✅ javac | ✅ | ✅ | ✅ |
| C/C++ | ✅ gcc/g++ | ✅ | ✅ | ✅ |
| Go | ✅ go fmt | ✅ | ✅ | ✅ |
| Rust | ✅ rustc | ⚠️ | ✅ | ✅ |
| Ruby | ✅ ruby -c | ✅ | ✅ | ✅ |
| PHP | ✅ php -l | ✅ | ✅ | ✅ |
Note: Compilers/interpreters must be installed for full syntax validation. Falls back to regex-based checks if unavailable.
Dual Embedder Architecture
LLMTestSuite uses specialized embedders for optimal evaluation:
Text Embedder: all-MiniLM-L6-v2
- Used for: HSI, CSS, SRI, SVE, KBC (text metrics)
- Size: 22M params, 384 dimensions
- Speed: Fast
- Purpose: General semantic similarity
Code Embedder: BAAI/bge-small-en-v1.5
- Used for: Code semantic correctness (SCC)
- Size: 33M params, 384 dimensions
- Speed: Fast
- Purpose: Code-specific semantic understanding
Custom Embedder
from sentence_transformers import SentenceTransformer
suite = LLMTestSuite(my_llm)
# Replace code embedder
suite.code_embedder = SentenceTransformer("microsoft/codebert-base")
suite.code_evaluator.embedder = suite.code_embedder
# Or use different text embedder
suite = LLMTestSuite(my_llm, embedder_model="all-mpnet-base-v2")
Embedder Comparison
| Model | Params | Dims | Speed | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 384 | Fast | Text (default) |
| all-mpnet-base-v2 | 110M | 768 | Medium | Text (higher accuracy) |
| bge-small-en-v1.5 | 33M | 384 | Fast | Code (default) |
| bge-base-en-v1.5 | 109M | 768 | Medium | Code (balanced) |
| CodeBERT | 125M | 768 | Medium | Code (Microsoft) |
Output Format
All test methods support three return types via return_type parameter:
"dict"- Returns Python dictionary (default)"table"- Prints formatted table usingrichlibrary"both"- Returns dictionary AND prints table
Example Results
# HSI Result
{
"prompt": "What is the capital of France?",
"answer": "Paris is the capital of France",
"HSI": 0.01, # Lower is better (0-1 scale)
"closest_fact": "Paris is the capital of France"
}
# Code Evaluation Result
{
"overall_score": 85.0,
"syntax_valid": True,
"quality_score": 80,
"is_secure": True,
"pass_rate": 1.0,
"semantic_similarity": 0.89
}
Complete Example: Groq API
from groq import Groq
from llm_testing_suite import LLMTestSuite
client = Groq(api_key="your-api-key")
def groq_llm(prompt):
response = client.chat.completions.create(
model="llama-3.3-70b-versatile",
messages=[{"role": "user", "content": prompt}],
temperature=0.3
)
return response.choices[0].message.content
suite = LLMTestSuite(groq_llm)
# Text evaluation
result = suite.run_all_novel_metrics(
prompt="What is the capital of France?",
paraphrases=["France's capital?"],
runs=3
)
# Code evaluation
code_result = suite.comprehensive_code_evaluation(
prompt="Write fibonacci function",
code_response=groq_llm("Write a Python fibonacci function"),
language="python"
)
See examples/groq_code_evaluation.py for comprehensive test suite.
Logging
# Enable debug logging
suite = LLMTestSuite(my_llm, debug=True)
# Or configure manually
import logging
logging.getLogger("llm_testing_suite").setLevel(logging.DEBUG)
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
MIT License - see LICENSE file for details
Acknowledgments
- Sentence-Transformers for embedding models
- FAISS for efficient similarity search
- Rich library for beautiful terminal output
- Open-source LLM community
Star this repo ⭐ if you find it useful!
For questions or issues, please open a GitHub issue.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file llm_testlab-0.2.0.tar.gz.
File metadata
- Download URL: llm_testlab-0.2.0.tar.gz
- Upload date:
- Size: 22.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0bac7b68f6f7114a2dcc7c607207105a0845ef0535ee431e0e2bc4e2f961e617
|
|
| MD5 |
3e13b95df0b58ab4627e0257eaae3170
|
|
| BLAKE2b-256 |
7bac31f7ad94eda04fa612de395e32d61232c9b0c09f3471e108124e02cef475
|
File details
Details for the file llm_testlab-0.2.0-py3-none-any.whl.
File metadata
- Download URL: llm_testlab-0.2.0-py3-none-any.whl
- Upload date:
- Size: 18.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d429a2db0e4171d189c28cf34d843210173ddfb4b2e273caef56e584311989a6
|
|
| MD5 |
9cbe913c412798a39d969daa851eae49
|
|
| BLAKE2b-256 |
b64b969a61e57ab088a8eafeda21cd9ca09507f98fe415b032df323c4a7c8ed3
|