
🚀 LLM-Eval: Professional LLM Evaluation Framework


A professional-grade LLM evaluation framework with beautiful HTML reports, designed for researchers, developers, and businesses who need publication-quality evaluation results.

✨ Features

🎨 ZENO-Style Professional Reports - Beautiful card-based layout with enhanced sample analysis
📊 Professional Choice Display - Clear A/B/C/D choice presentation with visual indicators
⚡ High Performance - Optimized for GPU evaluation with batch processing
🔧 Easy Integration - Simple Python API and CLI for seamless workflows
📱 Mobile-Friendly - Enhanced responsive design for viewing reports on any device
💼 Business-Ready - Commercial-grade presentation quality for client deliverables
✅ Enhanced Sample Analysis - Comprehensive question, choice, and confidence visualization
🎯 Smart Highlighting - Color-coded correct answers and model selections

🚀 Quick Start

Installation Options

🤖 Automatic Installation (Recommended)

# Install the package
pip install llm-testkit

# Auto-install PyTorch with CUDA 12.8 for optimal performance
python -c "import llm_testkit; llm_testkit.install_pytorch_for_gpu()"

This will:

  • ๐Ÿ” Detect your GPU automatically
  • ๐Ÿ“‹ Show compatibility information
  • ๐Ÿš€ Install PyTorch with CUDA 12.8 (optimal for all modern GPUs)
  • โœ… Verify the installation

🎯 Manual Installation

# Install PyTorch with CUDA 12.8 support (optimal for all NVIDIA GPUs)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install LLM Testkit
pip install llm-testkit

๐Ÿ–ฅ๏ธ Quick Installation

# Install LLM Testkit (then manually install PyTorch CUDA 12.8)
pip install llm-testkit

💻 CPU-Only Installation

# For CPU-only evaluation (no CUDA) - install PyTorch CPU version manually
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install llm-testkit

🚀 Why CUDA 12.8? CUDA 12.8 delivers the best performance and is backward compatible with all modern NVIDIA GPUs (RTX 20 series and newer). It is required for the RTX 5090 and later, and older GPUs still benefit from its optimizations.

๐Ÿ” Check GPU Compatibility Only

import llm_testkit

# Check what GPU you have and get installation recommendations
gpu_info = llm_testkit.check_gpu_compatibility()
print(f"GPUs detected: {gpu_info['gpus_detected']}")
print(f"Recommendation: {gpu_info['recommendation']}")
print(f"Installation command: {gpu_info['installation_command']}")

CLI Usage

# Basic evaluation
llm-eval --model hf --model_name mistralai/Mistral-7B-v0.1 --tasks arc_easy --limit 100

# Multiple tasks with professional reports
llm-eval --model hf --model_name microsoft/DialoGPT-small --tasks arc_easy,hellaswag --report_format professional

# GPU-optimized evaluation
llm-eval --model hf --model_name mistralai/Mistral-7B-v0.1 --tasks mmlu --device cuda:0 --batch_size 8

Python API

import llm_testkit

# Quick evaluation
results = llm_testkit.quick_eval(
    model_name="mistralai/Mistral-7B-v0.1",
    tasks="arc_easy",
    limit=100
)

# Evaluation with automatic HTML report
results, report_path = llm_testkit.quick_html_report(
    model_name="mistralai/Mistral-7B-v0.1",
    tasks="arc_easy,hellaswag",
    limit=100
)

print(f"📊 Results: {results['results']}")
print(f"📄 Report: {report_path}")

📊 Supported Tasks

  • Reasoning: ARC, HellaSwag, PIQA, SIQA, CommonsenseQA
  • Knowledge: MMLU, TruthfulQA, LAMBADA
  • Math: GSM8K, MATH, MathQA
  • Code: HumanEval, MBPP
  • Language: WinoGrande, SuperGLUE
  • And 35+ more tasks

🎨 Sample Reports

The framework generates ZENO-style professional HTML reports with:

  • 🎨 Professional Card Layout - ZENO-inspired sample presentation with hover effects
  • 📋 Enhanced Question Display - Clear section headers for questions and contexts
  • 🔤 Professional Choice Grid - Prominent A/B/C/D labels with visual styling
  • ✅ Smart Answer Highlighting - Green backgrounds for correct, blue for selected answers
  • 📊 Confidence Visualization - Detailed probability scores for all choices
  • 🏷️ Activity Badges - Professional labels for HellaSwag activity categories
  • 📈 Interactive Charts - Performance visualizations with Chart.js
  • 🏆 Performance Badges - Excellent/Good/Needs Improvement indicators
  • 📋 Executive Summaries - Business-ready insights and recommendations
  • 📱 Responsive Design - Perfect viewing on desktop, tablet, and mobile

💻 CLI Commands

# Main evaluation
llm-eval --model hf --model_name MODEL --tasks TASKS

# GPU detection and PyTorch setup
llm-eval-gpu-setup

# Generate reports from existing results  
llm-eval-demo --latest

# Convert JSON results to HTML
llm-eval-html results.json -o report.html

# Showcase framework capabilities
llm-eval-showcase

🔧 Requirements

  • Python: 3.8+
  • PyTorch: 2.7.0+ with CUDA 12.8 (recommended for all NVIDIA GPUs)
    • Best Performance: Install with --index-url https://download.pytorch.org/whl/cu128
  • Memory: 16GB+ RAM for 7B models
  • GPU: CUDA-capable GPU recommended for optimal performance
    • CUDA 12.8: Provides best performance for all modern NVIDIA GPUs (RTX 20 series+)
    • RTX 5090: Requires CUDA 12.8 (compute capability sm_120)
    • Older GPUs: Still benefit from CUDA 12.8 optimizations

📈 Use Cases

🔬 Research & Development

  • Model Comparison: Compare different model architectures and sizes
  • Performance Analysis: Detailed task-by-task breakdown and insights
  • Publication Materials: Professional reports ready for academic papers

💼 Commercial Applications

  • Client Demonstrations: Impressive HTML reports for stakeholder presentations
  • Consulting Deliverables: Business-ready evaluation reports and recommendations
  • Proof of Concepts: Quick evaluation capabilities for rapid prototyping

🎓 Educational Use

  • Teaching Materials: Clear examples and comprehensive documentation
  • Student Projects: Easy-to-use evaluation framework for coursework
  • Research Training: Professional-grade tools for academic research

🔧 Advanced Usage

Custom Evaluation Pipeline

from llm_testkit import evaluate_model

# Advanced evaluation with custom settings
results, output_path = evaluate_model(
    model_type="hf",
    model_name="mistralai/Mistral-7B-v0.1",
    tasks=["arc_easy", "hellaswag", "mmlu"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
    generate_report=True,
    report_format="professional"
)

📋 Comprehensive Configuration Example

The enhanced LLM Testkit automatically captures 60+ configuration parameters for detailed reporting. Here's a complete example showcasing all configuration attributes:

import llm_testkit

# 🎯 Complete evaluation with all configuration options
results, report_path = llm_testkit.quick_html_report(
    # Basic Model Configuration
    model_name="mistralai/Mistral-7B-v0.1",
    model_type="vllm",  # or "hf" for HuggingFace
    
    # Evaluation Settings
    tasks="hellaswag,arc_easy,mmlu,gsm8k",
    limit=500,  # samples per task
    
    # Performance & Hardware Configuration
    tensor_parallel_size=2,           # Multi-GPU setup
    gpu_memory_utilization=0.8,       # GPU memory usage
    batch_size=16,                    # Batch processing
    device="cuda",                    # Device selection
    
    # Generation Parameters
    temperature=0.7,                  # Sampling temperature
    top_p=0.9,                       # Nucleus sampling
    top_k=50,                        # Top-k sampling
    max_new_tokens=512,              # Max output length
    do_sample=True,                  # Enable sampling
    repetition_penalty=1.1,          # Prevent repetition
    
    # Advanced Model Configuration
    dtype="auto",                    # Data type optimization
    trust_remote_code=True,          # Enable custom code
    use_flash_attention_2=True,      # Flash attention optimization
    
    # Quantization (for HuggingFace models)
    quantize=True,                   # Enable quantization
    quantization_method="4bit",      # 4-bit quantization
    
    # vLLM Specific Settings
    max_model_len=4096,             # Context length
    swap_space=4,                   # Swap space (GB)
    enable_prefix_caching=True,     # Prefix caching
    
    # Evaluation Configuration
    num_fewshot=0,                  # Few-shot examples
    preserve_default_fewshot=False, # Use task defaults
    
    # Output Settings
    output_dir="comprehensive_reports",
    generate_report=True,
    report_format="professional"
)

print("📊 Comprehensive evaluation completed!")
print(f"📄 Professional report: {report_path}")

📈 What Gets Captured Automatically

The enhanced framework automatically extracts and displays all configuration details in beautiful HTML reports:

🔧 Basic Model Information

# Automatically detected and displayed:
{
    "name": "mistralai/Mistral-7B-v0.1",
    "architecture": "Mistral (Transformer)", 
    "parameters": "~7 billion",
    "context_length": "32,768 tokens",
    "backend": "VLLM",
    "quantization": "4-bit",
    "data_type": "auto"
}

๐Ÿ–ฅ๏ธ Hardware & Performance Configuration

# Multi-GPU and performance settings:
{
    "device_mapping": "Multi-GPU (TP=2)",
    "tensor_parallel_size": 2,
    "gpu_memory_utilization": "0.80",
    "max_model_len": "4096",
    "batch_size": "16",
    "evaluation_device": "cuda"
}

⚡ Advanced Features & Optimization

# Advanced model features captured:
{
    "attention_implementation": "flash_attention_2",
    "use_flash_attention": "True",
    "trust_remote_code": "True",
    "enable_prefix_caching": "True",
    "swap_space": "4 GB",
    "use_cache": "True"
}

🎯 Generation Parameters

# All generation settings tracked:
{
    "temperature": "0.7",
    "top_p": "0.9", 
    "top_k": "50",
    "max_new_tokens": "512",
    "do_sample": "True",
    "repetition_penalty": "1.1",
    "num_beams": "1"
}

๐Ÿ—๏ธ Architecture Details

# Model family specific information:
{
    "family": "Mistral",
    "attention": "Grouped-query attention (GQA) with sliding window",
    "activation": "SwiGLU", 
    "positional_encoding": "RoPE (Rotary Position Embedding)",
    "vocab_size": "32,000",
    "special_features": "Sliding window attention (4096 tokens)"
}

🎨 Enhanced HTML Reports

The comprehensive configuration example above generates professional HTML reports with:

  • 📋 Executive Summary - Overall performance with badges and insights
  • ⚙️ Model Configuration - 6 detailed sections with 60+ parameters:
    • Basic Model Information
    • Hardware & Performance Configuration
    • Advanced Features & Optimization
    • Generation Parameters
    • Evaluation Configuration
    • Architecture Details
  • 📊 Performance Charts - Interactive radar and bar charts
  • 🔍 Sample Analysis - Detailed per-task breakdowns with proper HellaSwag context display
  • 📱 Responsive Design - Perfect on desktop, tablet, and mobile

🚀 Production-Ready Example

For production use with maximum performance:

import llm_testkit

# Production evaluation with optimal settings
results, report_path = llm_testkit.quick_html_report(
    model_name="mistralai/Mistral-7B-v0.1",
    model_type="vllm", 
    tasks="hellaswag,arc_easy,mmlu,truthfulqa",
    
    # High-performance configuration
    tensor_parallel_size=4,           # 4-GPU setup
    gpu_memory_utilization=0.95,      # Max GPU usage
    batch_size=32,                    # Large batches
    max_model_len=8192,              # Extended context
    
    # Optimized generation
    temperature=0.0,                  # Deterministic
    max_new_tokens=256,              # Efficient generation
    enable_prefix_caching=True,       # Speed optimization
    
    # Professional reporting
    limit=1000,                       # Comprehensive evaluation
    report_format="professional",
    output_dir="production_reports"
)

print(f"🎯 Production evaluation complete: {report_path}")

Key Benefits:

  • ✅ Zero Configuration Loss - All 60+ parameters automatically captured
  • ✅ Professional Reporting - Publication-ready HTML with detailed sections
  • ✅ Architecture Intelligence - Automatic model family detection and optimization
  • ✅ Performance Optimization - GPU validation and memory management
  • ✅ Complete Traceability - Full configuration tracking for reproducibility

Batch Processing

import llm_testkit

models = [
    "mistralai/Mistral-7B-v0.1",
    "microsoft/DialoGPT-medium",
    "facebook/opt-1.3b"
]

for model in models:
    results, report = llm_testkit.quick_html_report(
        model_name=model,
        tasks="arc_easy,hellaswag",
        output_dir=f"reports/{model.replace('/', '_')}"
    )
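After a loop like the one above, the per-model scores can be collated into a quick side-by-side table. A minimal stdlib-only sketch: the exact metric keys inside `results['results']` depend on the underlying lm-evaluation-harness version, so the `scores` dict below uses illustrative numbers rather than real benchmark output.

```python
def comparison_table(scores):
    """Render {model: {task: accuracy}} as an aligned plain-text table."""
    tasks = sorted({t for per_task in scores.values() for t in per_task})
    width = max(len(m) for m in scores)
    header = "model".ljust(width) + "".join(f"  {t:>10}" for t in tasks)
    rows = [header]
    for model, per_task in scores.items():
        # Missing tasks show up as nan rather than raising a KeyError.
        cells = "".join(f"  {per_task.get(t, float('nan')):>10.3f}" for t in tasks)
        rows.append(model.ljust(width) + cells)
    return "\n".join(rows)

# Illustrative numbers only -- not real benchmark results.
scores = {
    "mistralai/Mistral-7B-v0.1": {"arc_easy": 0.808, "hellaswag": 0.613},
    "facebook/opt-1.3b": {"arc_easy": 0.570, "hellaswag": 0.415},
}
print(comparison_table(scores))
```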

🔧 Troubleshooting

CUDA Compatibility Issues

Problem: Getting a warning like:

NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.

Solution: Install PyTorch with CUDA 12.8 support:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Why CUDA 12.8 for Everyone?

  • ✅ Required for RTX 5090+: Only CUDA 12.8 supports compute capability sm_120
  • ✅ Optimal for all GPUs: Provides the best performance even for older GPUs
  • ✅ Backward Compatible: Works with RTX 20 series and newer
  • ✅ Latest Optimizations: The most recent performance improvements from NVIDIA
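The mechanics behind the warning quoted above: a prebuilt PyTorch wheel ships GPU kernels only for a fixed list of compute capabilities, and a GPU whose capability is missing from that list cannot run those kernels. A stdlib-only sketch of that check (simplified: it ignores PTX forward compatibility, and the capability list is copied from the warning text):

```python
def wheel_supports_gpu(supported_sms, gpu_sm):
    """True if the wheel ships kernels for this GPU's compute capability.
    Simplified: real PyTorch can also JIT-compile from embedded PTX."""
    return gpu_sm in supported_sms

# sm list from the warning message above (an older default wheel):
old_wheel = {50, 60, 70, 75, 80, 86, 90}

print(wheel_supports_gpu(old_wheel, 86))    # RTX 30 series -> True
print(wheel_supports_gpu(old_wheel, 120))   # RTX 5090 (sm_120) -> False
```

The cu128 wheels extend that list to include sm_120, which is why reinstalling from the cu128 index resolves the warning.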

Memory Issues

Problem: Out of memory errors during evaluation.

Solutions:

  • Reduce batch_size parameter
  • Use quantization: quantize=True, quantization_method="4bit"
  • For vLLM: reduce gpu_memory_utilization (default 0.9)
  • Use tensor parallelism across multiple GPUs
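To size batch_size and quantization settings, a common rule of thumb is bytes-per-parameter for the weights alone; KV cache, activations, and framework overhead come on top. A hedged stdlib-only sketch of that arithmetic:

```python
def weight_memory_gib(params_billion, bits_per_param):
    """Approximate weight memory in GiB: parameters * bits / 8.
    Weights only -- KV cache, activations, and runtime overhead are extra."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

# A 7B model under the precisions discussed above:
for bits, label in [(16, "fp16/bf16"), (8, "8-bit"), (4, "4-bit")]:
    print(f"7B @ {label:<9}: ~{weight_memory_gib(7, bits):.1f} GiB")
```

By this estimate a 7B model drops from roughly 13 GiB of weights at fp16 to about a quarter of that with quantization_method="4bit", which is why 4-bit quantization is the first lever to pull on a single consumer GPU.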

Performance Optimization

For maximum performance:

  • Use vLLM backend for inference: model_type="vllm"
  • Enable tensor parallelism: tensor_parallel_size=2 (or higher)
  • Use Flash Attention: use_flash_attention_2=True
  • Optimize memory: gpu_memory_utilization=0.95

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built on top of the excellent lm-evaluation-harness
  • Inspired by the need for professional-quality LLM evaluation reports
  • Special thanks to the open-source ML community

📞 Contact

Matthias De Paolis


โญ Star this repository if you find it useful! โญ
