
🚀 LLM-Eval: Professional LLM Evaluation Framework


A professional-grade LLM evaluation framework with beautiful HTML reports, designed for researchers, developers, and businesses who need publication-quality evaluation results.

✨ Features

🎨 ZENO-Style Professional Reports - Beautiful card-based layout with enhanced sample analysis
📊 Professional Choice Display - Clear A/B/C/D choice presentation with visual indicators
⚡ High Performance - Optimized for GPU evaluation with batch processing
🔧 Easy Integration - Simple Python API and CLI for seamless workflows
📱 Mobile-Friendly - Enhanced responsive design for viewing reports on any device
💼 Business-Ready - Commercial-grade presentation quality for client deliverables
✅ Enhanced Sample Analysis - Comprehensive question, choice, and confidence visualization
🎯 Smart Highlighting - Color-coded correct answers and model selections

🚀 Quick Start

Installation Options

🤖 Automatic Installation (Recommended)

# Install the package
pip install llm-testkit

# Auto-install PyTorch with CUDA 12.8 for optimal performance
python -c "import llm_testkit; llm_testkit.install_pytorch_for_gpu()"

This will:

  • ๐Ÿ” Detect your GPU automatically
  • ๐Ÿ“‹ Show compatibility information
  • ๐Ÿš€ Install PyTorch with CUDA 12.8 (optimal for all modern GPUs)
  • โœ… Verify the installation

🎯 Manual Installation

# Install PyTorch with CUDA 12.8 support (optimal for all NVIDIA GPUs)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install LLM Testkit
pip install llm-testkit

๐Ÿ–ฅ๏ธ Quick Installation

# Install LLM Testkit (then manually install PyTorch CUDA 12.8)
pip install llm-testkit

💻 CPU-Only Installation

# For CPU-only evaluation (no CUDA) - install PyTorch CPU version manually
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install llm-testkit

🚀 Why CUDA 12.8? CUDA 12.8 delivers the best performance and is backward compatible with all modern NVIDIA GPUs (RTX 20 series and newer). It is required for the RTX 5090 and later, and older GPUs still benefit from its optimizations.

๐Ÿ” Check GPU Compatibility Only

import llm_testkit

# Check what GPU you have and get installation recommendations
gpu_info = llm_testkit.check_gpu_compatibility()
print(f"GPUs detected: {gpu_info['gpus_detected']}")
print(f"Recommendation: {gpu_info['recommendation']}")
print(f"Installation command: {gpu_info['installation_command']}")

CLI Usage

# Basic evaluation
llm-eval --model hf --model_name mistralai/Mistral-7B-v0.1 --tasks arc_easy --limit 100

# Multiple tasks with professional reports
llm-eval --model hf --model_name microsoft/DialoGPT-small --tasks arc_easy,hellaswag --report_format professional

# GPU-optimized evaluation
llm-eval --model hf --model_name mistralai/Mistral-7B-v0.1 --tasks mmlu --device cuda:0 --batch_size 8

Python API

import llm_testkit

# Quick evaluation
results = llm_testkit.quick_eval(
    model_name="mistralai/Mistral-7B-v0.1",
    tasks="arc_easy",
    limit=100
)

# Evaluation with automatic HTML report
results, report_path = llm_testkit.quick_html_report(
    model_name="mistralai/Mistral-7B-v0.1",
    tasks="arc_easy,hellaswag",
    limit=100
)

print(f"📊 Results: {results['results']}")
print(f"📄 Report: {report_path}")

📊 Supported Tasks

  • Reasoning: ARC, HellaSwag, PIQA, SIQA, CommonsenseQA
  • Knowledge: MMLU, TruthfulQA, LAMBADA
  • Math: GSM8K, MATH, MathQA
  • Code: HumanEval, MBPP
  • Language: WinoGrande, SuperGLUE
  • And 35+ more tasks

🎨 Sample Reports

The framework generates ZENO-style professional HTML reports with:

  • 🎨 Professional Card Layout - ZENO-inspired sample presentation with hover effects
  • 📋 Enhanced Question Display - Clear section headers for questions and contexts
  • 🔤 Professional Choice Grid - Prominent A/B/C/D labels with visual styling
  • ✅ Smart Answer Highlighting - Green backgrounds for correct, blue for selected answers
  • 📊 Confidence Visualization - Detailed probability scores for all choices
  • 🏷️ Activity Badges - Professional labels for HellaSwag activity categories
  • 📈 Interactive Charts - Performance visualizations with Chart.js
  • 🏆 Performance Badges - Excellent/Good/Needs Improvement indicators
  • 📋 Executive Summaries - Business-ready insights and recommendations
  • 📱 Responsive Design - Perfect viewing on desktop, tablet, and mobile

💻 CLI Commands

# Main evaluation
llm-eval --model hf --model_name MODEL --tasks TASKS

# GPU detection and PyTorch setup
llm-eval-gpu-setup

# Generate reports from existing results  
llm-eval-demo --latest

# Convert JSON results to HTML
llm-eval-html results.json -o report.html

# Showcase framework capabilities
llm-eval-showcase

🔧 Requirements

  • Python: 3.8+
  • PyTorch: 2.7.0+ with CUDA 12.8 (recommended for all NVIDIA GPUs)
    • Best Performance: Install with --index-url https://download.pytorch.org/whl/cu128
  • Memory: 16GB+ RAM for 7B models
  • GPU: CUDA-capable GPU recommended for optimal performance
    • CUDA 12.8: Provides best performance for all modern NVIDIA GPUs (RTX 20 series+)
    • RTX 5090: Requires CUDA 12.8 (compute capability sm_120)
    • Older GPUs: Still benefit from CUDA 12.8 optimizations

📈 Use Cases

🔬 Research & Development

  • Model Comparison: Compare different model architectures and sizes
  • Performance Analysis: Detailed task-by-task breakdown and insights
  • Publication Materials: Professional reports ready for academic papers

💼 Commercial Applications

  • Client Demonstrations: Impressive HTML reports for stakeholder presentations
  • Consulting Deliverables: Business-ready evaluation reports and recommendations
  • Proof of Concepts: Quick evaluation capabilities for rapid prototyping

🎓 Educational Use

  • Teaching Materials: Clear examples and comprehensive documentation
  • Student Projects: Easy-to-use evaluation framework for coursework
  • Research Training: Professional-grade tools for academic research

🔧 Advanced Usage

Custom Evaluation Pipeline

from llm_testkit import evaluate_model

# Advanced evaluation with custom settings
results, output_path = evaluate_model(
    model_type="hf",
    model_name="mistralai/Mistral-7B-v0.1",
    tasks=["arc_easy", "hellaswag", "mmlu"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
    generate_report=True,
    report_format="professional"
)

📋 Comprehensive Configuration Example

The enhanced LLM Testkit automatically captures 60+ configuration parameters for detailed reporting. Here's a complete example showcasing all configuration attributes:

import llm_testkit

# 🎯 Complete evaluation with all configuration options
results, report_path = llm_testkit.quick_html_report(
    # Basic Model Configuration
    model_name="mistralai/Mistral-7B-v0.1",
    model_type="vllm",  # or "hf" for HuggingFace
    
    # Evaluation Settings
    tasks="hellaswag,arc_easy,mmlu,gsm8k",
    limit=500,  # samples per task
    
    # Performance & Hardware Configuration
    tensor_parallel_size=2,           # Multi-GPU setup
    gpu_memory_utilization=0.8,       # GPU memory usage
    batch_size=16,                    # Batch processing
    device="cuda",                    # Device selection
    
    # Generation Parameters
    temperature=0.7,                  # Sampling temperature
    top_p=0.9,                       # Nucleus sampling
    top_k=50,                        # Top-k sampling
    max_new_tokens=512,              # Max output length
    do_sample=True,                  # Enable sampling
    repetition_penalty=1.1,          # Prevent repetition
    
    # Advanced Model Configuration
    dtype="auto",                    # Data type optimization
    trust_remote_code=True,          # Enable custom code
    use_flash_attention_2=True,      # Flash attention optimization
    
    # Quantization (for HuggingFace models)
    quantize=True,                   # Enable quantization
    quantization_method="4bit",      # 4-bit quantization
    
    # vLLM Specific Settings
    max_model_len=4096,             # Context length
    swap_space=4,                   # Swap space (GB)
    enable_prefix_caching=True,     # Prefix caching
    
    # Evaluation Configuration
    num_fewshot=0,                  # Few-shot examples
    preserve_default_fewshot=False, # Use task defaults
    
    # Output Settings
    output_dir="comprehensive_reports",
    generate_report=True,
    report_format="professional"
)

print("📊 Comprehensive evaluation completed!")
print(f"📄 Professional report: {report_path}")

📈 What Gets Captured Automatically

The enhanced framework automatically extracts and displays all configuration details in beautiful HTML reports:

🔧 Basic Model Information

# Automatically detected and displayed:
{
    "name": "mistralai/Mistral-7B-v0.1",
    "architecture": "Mistral (Transformer)", 
    "parameters": "~7 billion",
    "context_length": "32,768 tokens",
    "backend": "VLLM",
    "quantization": "4-bit",
    "data_type": "auto"
}

๐Ÿ–ฅ๏ธ Hardware & Performance Configuration

# Multi-GPU and performance settings:
{
    "device_mapping": "Multi-GPU (TP=2)",
    "tensor_parallel_size": 2,
    "gpu_memory_utilization": "0.80",
    "max_model_len": "4096",
    "batch_size": "16",
    "evaluation_device": "cuda"
}

⚡ Advanced Features & Optimization

# Advanced model features captured:
{
    "attention_implementation": "flash_attention_2",
    "use_flash_attention": "True",
    "trust_remote_code": "True",
    "enable_prefix_caching": "True",
    "swap_space": "4 GB",
    "use_cache": "True"
}

🎯 Generation Parameters

# All generation settings tracked:
{
    "temperature": "0.7",
    "top_p": "0.9", 
    "top_k": "50",
    "max_new_tokens": "512",
    "do_sample": "True",
    "repetition_penalty": "1.1",
    "num_beams": "1"
}

๐Ÿ—๏ธ Architecture Details

# Model family specific information:
{
    "family": "Mistral",
    "attention": "Grouped-query attention (GQA) with sliding window",
    "activation": "SwiGLU", 
    "positional_encoding": "RoPE (Rotary Position Embedding)",
    "vocab_size": "32,000",
    "special_features": "Sliding window attention (4096 tokens)"
}

🎨 Enhanced HTML Reports

The comprehensive configuration example above generates professional HTML reports with:

  • 📋 Executive Summary - Overall performance with badges and insights
  • ⚙️ Model Configuration - 6 detailed sections with 60+ parameters:
    • Basic Model Information
    • Hardware & Performance Configuration
    • Advanced Features & Optimization
    • Generation Parameters
    • Evaluation Configuration
    • Architecture Details
  • 📊 Performance Charts - Interactive radar and bar charts
  • 🔍 Sample Analysis - Detailed per-task breakdowns with proper HellaSwag context display
  • 📱 Responsive Design - Perfect on desktop, tablet, and mobile

🚀 Production-Ready Example

For production use with maximum performance:

import llm_testkit

# Production evaluation with optimal settings
results, report_path = llm_testkit.quick_html_report(
    model_name="mistralai/Mistral-7B-v0.1",
    model_type="vllm", 
    tasks="hellaswag,arc_easy,mmlu,truthfulqa",
    
    # High-performance configuration
    tensor_parallel_size=4,           # 4-GPU setup
    gpu_memory_utilization=0.95,      # Max GPU usage
    batch_size=32,                    # Large batches
    max_model_len=8192,              # Extended context
    
    # Optimized generation
    temperature=0.0,                  # Deterministic
    max_new_tokens=256,              # Efficient generation
    enable_prefix_caching=True,       # Speed optimization
    
    # Professional reporting
    limit=1000,                       # Comprehensive evaluation
    report_format="professional",
    output_dir="production_reports"
)

print(f"🎯 Production evaluation complete: {report_path}")

Key Benefits:

  • ✅ Zero Configuration Loss - All 60+ parameters automatically captured
  • ✅ Professional Reporting - Publication-ready HTML with detailed sections
  • ✅ Architecture Intelligence - Automatic model family detection and optimization
  • ✅ Performance Optimization - GPU validation and memory management
  • ✅ Complete Traceability - Full configuration tracking for reproducibility

Batch Processing

import llm_testkit

models = [
    "mistralai/Mistral-7B-v0.1",
    "microsoft/DialoGPT-medium",
    "facebook/opt-1.3b"
]

for model in models:
    results, report = llm_testkit.quick_html_report(
        model_name=model,
        tasks="arc_easy,hellaswag",
        output_dir=f"reports/{model.replace('/', '_')}"
    )
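After a loop like the one above, the per-model scores can be collated into a quick side-by-side table. A minimal stdlib-only sketch: the exact metric keys inside `results['results']` depend on the underlying lm-evaluation-harness version, so the `scores` dict below uses illustrative numbers rather than real benchmark output.

```python
def comparison_table(scores):
    """Render {model: {task: accuracy}} as an aligned plain-text table."""
    tasks = sorted({t for per_task in scores.values() for t in per_task})
    width = max(len(m) for m in scores)
    header = "model".ljust(width) + "".join(f"  {t:>10}" for t in tasks)
    rows = [header]
    for model, per_task in scores.items():
        # Missing tasks show up as nan rather than raising a KeyError.
        cells = "".join(f"  {per_task.get(t, float('nan')):>10.3f}" for t in tasks)
        rows.append(model.ljust(width) + cells)
    return "\n".join(rows)

# Illustrative numbers only -- not real benchmark results.
scores = {
    "mistralai/Mistral-7B-v0.1": {"arc_easy": 0.808, "hellaswag": 0.613},
    "facebook/opt-1.3b": {"arc_easy": 0.570, "hellaswag": 0.415},
}
print(comparison_table(scores))
```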

🔧 Troubleshooting

CUDA Compatibility Issues

Problem: Getting a warning like:

NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.

Solution: Install PyTorch with CUDA 12.8 support:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Why CUDA 12.8 for Everyone?

  • ✅ Required for RTX 5090+: Only CUDA 12.8 supports compute capability sm_120
  • ✅ Optimal for all GPUs: Provides the best performance even for older GPUs
  • ✅ Backward Compatible: Works with RTX 20 series and newer
  • ✅ Latest Optimizations: The most recent performance improvements from NVIDIA
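The mechanics behind the warning quoted above: a prebuilt PyTorch wheel ships GPU kernels only for a fixed list of compute capabilities, and a GPU whose capability is missing from that list cannot run those kernels. A stdlib-only sketch of that check (simplified: it ignores PTX forward compatibility, and the capability list is copied from the warning text):

```python
def wheel_supports_gpu(supported_sms, gpu_sm):
    """True if the wheel ships kernels for this GPU's compute capability.
    Simplified: real PyTorch can also JIT-compile from embedded PTX."""
    return gpu_sm in supported_sms

# sm list from the warning message above (an older default wheel):
old_wheel = {50, 60, 70, 75, 80, 86, 90}

print(wheel_supports_gpu(old_wheel, 86))    # RTX 30 series -> True
print(wheel_supports_gpu(old_wheel, 120))   # RTX 5090 (sm_120) -> False
```

The cu128 wheels extend that list to include sm_120, which is why reinstalling from the cu128 index resolves the warning.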

Memory Issues

Problem: Out of memory errors during evaluation.

Solutions:

  • Reduce batch_size parameter
  • Use quantization: quantize=True, quantization_method="4bit"
  • For vLLM: reduce gpu_memory_utilization (default 0.9)
  • Use tensor parallelism across multiple GPUs
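To size batch_size and quantization settings, a common rule of thumb is bytes-per-parameter for the weights alone; KV cache, activations, and framework overhead come on top. A hedged stdlib-only sketch of that arithmetic:

```python
def weight_memory_gib(params_billion, bits_per_param):
    """Approximate weight memory in GiB: parameters * bits / 8.
    Weights only -- KV cache, activations, and runtime overhead are extra."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

# A 7B model under the precisions discussed above:
for bits, label in [(16, "fp16/bf16"), (8, "8-bit"), (4, "4-bit")]:
    print(f"7B @ {label:<9}: ~{weight_memory_gib(7, bits):.1f} GiB")
```

By this estimate a 7B model drops from roughly 13 GiB of weights at fp16 to about a quarter of that with quantization_method="4bit", which is why 4-bit quantization is the first lever to pull on a single consumer GPU.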

Performance Optimization

For maximum performance:

  • Use vLLM backend for inference: model_type="vllm"
  • Enable tensor parallelism: tensor_parallel_size=2 (or higher)
  • Use Flash Attention: use_flash_attention_2=True
  • Optimize memory: gpu_memory_utilization=0.95

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built on top of the excellent lm-evaluation-harness
  • Inspired by the need for professional-quality LLM evaluation reports
  • Special thanks to the open-source ML community

📞 Contact

Matthias De Paolis


โญ Star this repository if you find it useful! โญ
