# LLM-Eval: Professional LLM Evaluation Framework

A professional-grade LLM evaluation framework with beautiful HTML reports, designed for researchers, developers, and businesses who need publication-quality evaluation results.
## Features

- **ZENO-Style Professional Reports** - Card-based layout with enhanced sample analysis
- **Professional Choice Display** - Clear A/B/C/D choice presentation with visual indicators
- **High Performance** - Optimized for GPU evaluation with batch processing
- **Easy Integration** - Simple Python API and CLI for seamless workflows
- **Mobile-Friendly** - Responsive design for viewing reports on any device
- **Business-Ready** - Commercial-grade presentation quality for client deliverables
- **Enhanced Sample Analysis** - Comprehensive question, choice, and confidence visualization
- **Smart Highlighting** - Color-coded correct answers and model selections
## Quick Start

### Installation Options

#### Automatic Installation (Recommended)

```bash
# Install the package
pip install llm-testkit

# Auto-install PyTorch with CUDA 12.8 for optimal performance
python -c "import llm_testkit; llm_testkit.install_pytorch_for_gpu()"
```
This will:

- Detect your GPU automatically
- Show compatibility information
- Install PyTorch with CUDA 12.8 (optimal for all modern GPUs)
- Verify the installation
#### Manual Installation

```bash
# Install PyTorch with CUDA 12.8 support (optimal for all NVIDIA GPUs)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install LLM Testkit
pip install llm-testkit
```
#### Quick Installation

```bash
# Install LLM Testkit (then manually install PyTorch CUDA 12.8)
pip install llm-testkit
```
#### CPU-Only Installation

```bash
# For CPU-only evaluation (no CUDA) - install the PyTorch CPU build manually
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install llm-testkit
```
> **Why CUDA 12.8?** CUDA 12.8 provides the best performance and is backward compatible with all modern NVIDIA GPUs (RTX 20 series and newer). It is required for the RTX 5090 and later, and improves performance on all supported GPUs.
### Check GPU Compatibility Only

```python
import llm_testkit

# Check what GPU you have and get installation recommendations
gpu_info = llm_testkit.check_gpu_compatibility()
print(f"GPUs detected: {gpu_info['gpus_detected']}")
print(f"Recommendation: {gpu_info['recommendation']}")
print(f"Installation command: {gpu_info['installation_command']}")
```
### CLI Usage

```bash
# Basic evaluation
llm-eval --model hf --model_name mistralai/Mistral-7B-v0.1 --tasks arc_easy --limit 100

# Multiple tasks with professional reports
llm-eval --model hf --model_name microsoft/DialoGPT-small --tasks arc_easy,hellaswag --report_format professional

# GPU-optimized evaluation
llm-eval --model hf --model_name mistralai/Mistral-7B-v0.1 --tasks mmlu --device cuda:0 --batch_size 8
```
### Python API

```python
import llm_testkit

# Quick evaluation
results = llm_testkit.quick_eval(
    model_name="mistralai/Mistral-7B-v0.1",
    tasks="arc_easy",
    limit=100
)

# Evaluation with automatic HTML report
results, report_path = llm_testkit.quick_html_report(
    model_name="mistralai/Mistral-7B-v0.1",
    tasks="arc_easy,hellaswag",
    limit=100
)
print(f"Results: {results['results']}")
print(f"Report: {report_path}")
```
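If you want to post-process the returned dictionary yourself, a small sketch like the following works. The key layout (`"results"` → task → metric) follows lm-evaluation-harness conventions, which this framework builds on; treat the exact metric names such as `acc,none` as assumptions and match them to what your run actually returns.

```python
# Hand-written placeholder mimicking a results dict; the scores are invented.
sample_results = {
    "results": {
        "arc_easy": {"acc,none": 0.812, "acc_norm,none": 0.795},
        "hellaswag": {"acc,none": 0.571, "acc_norm,none": 0.742},
    }
}

def accuracy_table(results: dict, metric: str = "acc,none") -> dict:
    """Map each task name to one metric's value for quick comparison."""
    return {
        task: scores[metric]
        for task, scores in results["results"].items()
        if metric in scores
    }

print(accuracy_table(sample_results))
```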
## Supported Tasks

- **Reasoning**: ARC, HellaSwag, PIQA, SIQA, CommonsenseQA
- **Knowledge**: MMLU, TruthfulQA, LAMBADA
- **Math**: GSM8K, MATH, MathQA
- **Code**: HumanEval, MBPP
- **Language**: WinoGrande, SuperGLUE
- And 35+ more tasks
## Sample Reports

The framework generates ZENO-style professional HTML reports with:

- **Professional Card Layout** - ZENO-inspired sample presentation with hover effects
- **Enhanced Question Display** - Clear section headers for questions and contexts
- **Professional Choice Grid** - Prominent A/B/C/D labels with visual styling
- **Smart Answer Highlighting** - Green backgrounds for correct answers, blue for model selections
- **Confidence Visualization** - Detailed probability scores for all choices
- **Activity Badges** - Professional labels for HellaSwag activity categories
- **Interactive Charts** - Performance visualizations with Chart.js
- **Performance Badges** - Excellent/Good/Needs Improvement indicators
- **Executive Summaries** - Business-ready insights and recommendations
- **Responsive Design** - Clean viewing on desktop, tablet, and mobile
## CLI Commands

```bash
# Main evaluation
llm-eval --model hf --model_name MODEL --tasks TASKS

# GPU detection and PyTorch setup
llm-eval-gpu-setup

# Generate reports from existing results
llm-eval-demo --latest

# Convert JSON results to HTML
llm-eval-html results.json -o report.html

# Showcase framework capabilities
llm-eval-showcase
```
## Requirements

- **Python**: 3.8+
- **PyTorch**: 2.7.0+ with CUDA 12.8 (recommended for all NVIDIA GPUs); for best performance, install with `--index-url https://download.pytorch.org/whl/cu128`
- **Memory**: 16GB+ RAM for 7B models
- **GPU**: CUDA-capable GPU recommended for optimal performance
  - CUDA 12.8 provides the best performance for all modern NVIDIA GPUs (RTX 20 series and newer)
  - RTX 5090 requires CUDA 12.8 (compute capability sm_120)
  - Older GPUs still benefit from CUDA 12.8 optimizations
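The memory guidance above follows from simple arithmetic on parameter count and precision. This back-of-the-envelope sketch (illustrative only, not part of the package) estimates weight memory alone; activations, KV cache, and framework overhead come on top of it.

```python
def model_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Weight memory only: parameter count x bits per parameter, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

# A 7B-parameter model at common precisions:
for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:<10} ~{model_memory_gb(7e9, bits):.1f} GiB")
```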
## Use Cases

### Research & Development

- **Model Comparison**: Compare different model architectures and sizes
- **Performance Analysis**: Detailed task-by-task breakdown and insights
- **Publication Materials**: Professional reports ready for academic papers

### Commercial Applications

- **Client Demonstrations**: Polished HTML reports for stakeholder presentations
- **Consulting Deliverables**: Business-ready evaluation reports and recommendations
- **Proof of Concepts**: Quick evaluation capabilities for rapid prototyping

### Educational Use

- **Teaching Materials**: Clear examples and comprehensive documentation
- **Student Projects**: Easy-to-use evaluation framework for coursework
- **Research Training**: Professional-grade tools for academic research
## Advanced Usage

### Custom Evaluation Pipeline

```python
from llm_testkit import evaluate_model

# Advanced evaluation with custom settings
results, output_path = evaluate_model(
    model_type="hf",
    model_name="mistralai/Mistral-7B-v0.1",
    tasks=["arc_easy", "hellaswag", "mmlu"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
    generate_report=True,
    report_format="professional"
)
```
### Comprehensive Configuration Example

The enhanced LLM Testkit automatically captures 60+ configuration parameters for detailed reporting. Here's a complete example showcasing all configuration attributes:

```python
import llm_testkit

# Complete evaluation with all configuration options
results, report_path = llm_testkit.quick_html_report(
    # Basic model configuration
    model_name="mistralai/Mistral-7B-v0.1",
    model_type="vllm",                # or "hf" for HuggingFace

    # Evaluation settings
    tasks="hellaswag,arc_easy,mmlu,gsm8k",
    limit=500,                        # samples per task

    # Performance & hardware configuration
    tensor_parallel_size=2,           # multi-GPU setup
    gpu_memory_utilization=0.8,       # GPU memory usage
    batch_size=16,                    # batch processing
    device="cuda",                    # device selection

    # Generation parameters
    temperature=0.7,                  # sampling temperature
    top_p=0.9,                        # nucleus sampling
    top_k=50,                         # top-k sampling
    max_new_tokens=512,               # max output length
    do_sample=True,                   # enable sampling
    repetition_penalty=1.1,           # discourage repetition

    # Advanced model configuration
    dtype="auto",                     # data type optimization
    trust_remote_code=True,           # enable custom code
    use_flash_attention_2=True,       # Flash Attention optimization

    # Quantization (for HuggingFace models)
    quantize=True,                    # enable quantization
    quantization_method="4bit",       # 4-bit quantization

    # vLLM-specific settings
    max_model_len=4096,               # context length
    swap_space=4,                     # swap space (GB)
    enable_prefix_caching=True,       # prefix caching

    # Evaluation configuration
    num_fewshot=0,                    # few-shot examples
    preserve_default_fewshot=False,   # use task defaults

    # Output settings
    output_dir="comprehensive_reports",
    generate_report=True,
    report_format="professional"
)

print("Comprehensive evaluation completed!")
print(f"Professional report: {report_path}")
```
### What Gets Captured Automatically

The enhanced framework automatically extracts and displays all configuration details in the HTML reports:

#### Basic Model Information

```python
# Automatically detected and displayed:
{
    "name": "mistralai/Mistral-7B-v0.1",
    "architecture": "Mistral (Transformer)",
    "parameters": "~7 billion",
    "context_length": "32,768 tokens",
    "backend": "VLLM",
    "quantization": "4-bit",
    "data_type": "auto"
}
```
#### Hardware & Performance Configuration

```python
# Multi-GPU and performance settings:
{
    "device_mapping": "Multi-GPU (TP=2)",
    "tensor_parallel_size": 2,
    "gpu_memory_utilization": "0.80",
    "max_model_len": "4096",
    "batch_size": "16",
    "evaluation_device": "cuda"
}
```
#### Advanced Features & Optimization

```python
# Advanced model features captured:
{
    "attention_implementation": "flash_attention_2",
    "use_flash_attention": "True",
    "trust_remote_code": "True",
    "enable_prefix_caching": "True",
    "swap_space": "4 GB",
    "use_cache": "True"
}
```
#### Generation Parameters

```python
# All generation settings tracked:
{
    "temperature": "0.7",
    "top_p": "0.9",
    "top_k": "50",
    "max_new_tokens": "512",
    "do_sample": "True",
    "repetition_penalty": "1.1",
    "num_beams": "1"
}
```
#### Architecture Details

```python
# Model-family-specific information:
{
    "family": "Mistral",
    "attention": "Grouped-query attention (GQA) with sliding window",
    "activation": "SwiGLU",
    "positional_encoding": "RoPE (Rotary Position Embedding)",
    "vocab_size": "32,000",
    "special_features": "Sliding window attention (4096 tokens)"
}
```
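The GQA detail above has a practical memory consequence: with 8 KV heads instead of one per query head, the KV cache is a fraction of what full multi-head attention would need. A rough sizing sketch (illustrative arithmetic, not framework code; the layer and head counts are Mistral-7B's published shape, restated here as assumptions):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    """K and V caches: 2 tensors x layers x kv_heads x head_dim per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * seq_len

# Assumed Mistral-7B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_window = kv_cache_bytes(32, 8, 128, 4096)
print(f"KV cache for one 4096-token window: {per_window / 1024**3:.2f} GiB")
```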
### Enhanced HTML Reports

The comprehensive configuration example above generates professional HTML reports with:

- **Executive Summary** - Overall performance with badges and insights
- **Model Configuration** - 6 detailed sections with 60+ parameters:
  - Basic Model Information
  - Hardware & Performance Configuration
  - Advanced Features & Optimization
  - Generation Parameters
  - Evaluation Configuration
  - Architecture Details
- **Performance Charts** - Interactive radar and bar charts
- **Sample Analysis** - Detailed per-task breakdowns with proper HellaSwag context display
- **Responsive Design** - Works on desktop, tablet, and mobile
### Production-Ready Example

For production use with maximum performance:

```python
import llm_testkit

# Production evaluation with optimal settings
results, report_path = llm_testkit.quick_html_report(
    model_name="mistralai/Mistral-7B-v0.1",
    model_type="vllm",
    tasks="hellaswag,arc_easy,mmlu,truthfulqa",

    # High-performance configuration
    tensor_parallel_size=4,        # 4-GPU setup
    gpu_memory_utilization=0.95,   # max GPU usage
    batch_size=32,                 # large batches
    max_model_len=8192,            # extended context

    # Optimized generation
    temperature=0.0,               # deterministic
    max_new_tokens=256,            # efficient generation
    enable_prefix_caching=True,    # speed optimization

    # Professional reporting
    limit=1000,                    # comprehensive evaluation
    report_format="professional",
    output_dir="production_reports"
)

print(f"Production evaluation complete: {report_path}")
```

**Key Benefits:**

- **Zero Configuration Loss** - All 60+ parameters automatically captured
- **Professional Reporting** - Publication-ready HTML with detailed sections
- **Architecture Intelligence** - Automatic model family detection and optimization
- **Performance Optimization** - GPU validation and memory management
- **Complete Traceability** - Full configuration tracking for reproducibility
### Batch Processing

```python
import llm_testkit

models = [
    "mistralai/Mistral-7B-v0.1",
    "microsoft/DialoGPT-medium",
    "facebook/opt-1.3b"
]

for model in models:
    results, report = llm_testkit.quick_html_report(
        model_name=model,
        tasks="arc_easy,hellaswag",
        output_dir=f"reports/{model.replace('/', '_')}"
    )
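After a batch run like the one above, you may want a single comparison table across models. This sketch works over a hand-written `all_results` dict; the scores and key layout are placeholders, so adapt it to whatever structure your own runs return.

```python
# Placeholder per-model scores (invented numbers for illustration).
all_results = {
    "mistralai/Mistral-7B-v0.1": {"arc_easy": 0.81, "hellaswag": 0.74},
    "facebook/opt-1.3b": {"arc_easy": 0.57, "hellaswag": 0.41},
}

# Collect the union of task names so every row has the same columns.
tasks = sorted({task for scores in all_results.values() for task in scores})
print("model".ljust(30) + "".join(task.ljust(12) for task in tasks))
for model, scores in all_results.items():
    cells = "".join(f"{scores.get(task, float('nan')):<12.2f}" for task in tasks)
    print(model.ljust(30) + cells)
```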
## Troubleshooting

### CUDA Compatibility Issues

**Problem**: Getting a warning like:

```
NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
```

**Solution**: Install PyTorch with CUDA 12.8 support:

```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```
**Why CUDA 12.8 for everyone?**

- **Required for RTX 5090+**: Only CUDA 12.8 supports compute capability sm_120
- **Optimal for all GPUs**: Provides the best performance even on older GPUs
- **Backward compatible**: Works with RTX 20 series and newer
- **Latest optimizations**: The most recent performance improvements from NVIDIA
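The warning above boils down to a set-membership check: a PyTorch build ships kernels for a fixed list of compute capabilities, and the GPU's capability must be in that list. A minimal sketch of the check (the supported set mirrors the quoted warning, and sm_120 for the RTX 5090 comes from the same message):

```python
# Capabilities from the warning text above (an older PyTorch build).
SUPPORTED_SM = {"sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"}

def build_supports(gpu_sm: str, supported: set = SUPPORTED_SM) -> bool:
    """True if the build ships kernels for this compute capability."""
    return gpu_sm in supported

print(build_supports("sm_90"))   # covered by the old build
print(build_supports("sm_120"))  # RTX 5090: needs a CUDA 12.8 build
```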
### Memory Issues

**Problem**: Out-of-memory errors during evaluation.

**Solutions**:

- Reduce the `batch_size` parameter
- Use quantization: `quantize=True, quantization_method="4bit"`
- For vLLM: reduce `gpu_memory_utilization` (default 0.9)
- Use tensor parallelism across multiple GPUs
### Performance Optimization

For maximum performance:

- Use the vLLM backend for inference: `model_type="vllm"`
- Enable tensor parallelism: `tensor_parallel_size=2` (or higher)
- Use Flash Attention: `use_flash_attention_2=True`
- Optimize memory: `gpu_memory_utilization=0.95`
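The `gpu_memory_utilization` knob caps how much of each GPU the vLLM backend will claim; whatever is left over is headroom for CUDA overhead and other processes. The arithmetic is trivial but worth seeing once (illustrative sketch, not framework code):

```python
def vllm_budget_gib(total_gib: float, utilization: float) -> tuple:
    """Split total GPU memory into the engine's budget and the headroom."""
    budget = total_gib * utilization
    return budget, total_gib - budget

# e.g. a hypothetical 24 GiB card at the 0.95 setting suggested above:
budget, headroom = vllm_budget_gib(24.0, 0.95)
print(f"engine budget: {budget:.1f} GiB, headroom: {headroom:.1f} GiB")
```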
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Built on top of the excellent lm-evaluation-harness
- Inspired by the need for professional-quality LLM evaluation reports
- Special thanks to the open-source ML community

## Contact

**Matthias De Paolis**

- GitHub: @mattdepaolis
- Blog: mattdepaolis.github.io/blog
- HuggingFace: @llmat

Star this repository if you find it useful!