NVIDIA: Benchmark for language models - Fork of Stanford CRFM HELM
NVIDIA HELM Benchmark Framework
This directory contains the HELM (Holistic Evaluation of Language Models) framework for evaluating large language models in medical applications across various healthcare tasks.
Overview
The HELM framework provides a comprehensive evaluation system for medical AI models, supporting multiple benchmark datasets and evaluation scenarios. It's designed to work with the EvalFactory infrastructure for standardized model evaluation.
Available Benchmarks
The framework supports the following medical evaluation benchmarks:
| Benchmark | Description | Type |
|---|---|---|
| medcalc_bench | Medical calculation benchmark with patient notes and ground truth answers | Medical QA |
| medec | Medical error detection and correction pairs | Error Detection |
| head_qa | Biomedical multiple-choice questions for medical knowledge testing | Medical QA |
| medbullets | USMLE-style medical questions with explanations | Medical QA |
| pubmed_qa | PubMed abstracts with yes/no/maybe questions | Medical QA |
| ehr_sql | Natural language to SQL query generation for clinical research | SQL Generation |
| race_based_med | Detection of race-based biases in medical LLM outputs | Bias Detection |
| medhallu | Classification of factual vs hallucinated medical answers | Hallucination Detection |
Quick Start
1. Environment Setup
First, ensure you have the required environment variables set:
# Set your API keys
export OPENAI_API_KEY="your-api-key-here"
# Set Python path if necessary
export PYTHONPATH=$PYTHONPATH:.
2. Running Your First Benchmark
Method 1: Using eval-factory (Recommended)
eval-factory is a wrapper that simplifies the HELM benchmark process by handling configuration generation, benchmark execution, and result formatting automatically.
What eval-factory does internally:
- Configuration Processing: Loads your YAML config and merges it with framework defaults
- Dynamic Config Generation: Creates the necessary HELM model configurations dynamically
- Benchmark Execution: Runs the HELM benchmark with proper parameters
- Result Processing: Formats and saves results in standardized YAML format
Create a configuration file (e.g., my_test.yml):
config:
  type: medcalc_bench  # Choose from available benchmarks
  output_dir: results/my_test
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY
Run the evaluation:
eval-factory run_eval \
--output_dir results/my_test \
--run_config my_test.yml
Internal Process Breakdown:
1. Config Loading & Validation:
   - Loads your YAML configuration
   - Validates against framework schema
   - Merges with default parameters from framework.yml
2. Dynamic Model Config Generation:
   - Calls scripts/generate_dynamic_model_configs.py
   - Creates model-specific configuration files
   - Handles provider-specific API endpoints and authentication
3. HELM Benchmark Execution:
   - Executes helm-run with generated configurations
   - Downloads and prepares benchmark datasets
   - Runs evaluations with specified parameters
   - Caches responses for efficiency
4. Result Processing:
   - Collects raw benchmark results
   - Formats into standardized YAML output
   - Saves results in your specified output directory
Method 2: Using helm-run directly
helm-run \
--run-entries medcalc_bench:model=openai/gpt-4 \
--suite my-suite \
--max-eval-instances 10 \
--num-train-trials 1 \
-o results/my_test
Comparison: eval-factory vs helm-run
| Feature | eval-factory | helm-run |
|---|---|---|
| Configuration | Simple YAML config | Complex command-line arguments |
| Model Setup | Automatic config generation | Manual model registration required |
| Provider Support | Built-in adapter handling | Requires custom model configs |
| Results Format | Standardized YAML output | Native HELM format only |
| Ease of Use | Beginner-friendly | Advanced users only |
| Integration | EvalFactory compatible | HELM-specific |
Recommendation: Use eval-factory for most use cases, especially when working with EvalFactory. Use helm-run only when you need fine-grained control over HELM's native features.
3. Understanding the Output
After running a benchmark, you'll find results in your specified output directory:
results/my_test/
├── responses/ # Raw model responses
├── cache.db # Cached responses for efficiency
├── instances.jsonl # Evaluation instances
├── results.jsonl # Final evaluation results
├── model_configs/ # Generated HELM model configurations
└── evaluation_config.yaml # Standardized evaluation results
Generated Files Explanation:
- responses/: Contains raw API responses from the model for each evaluation instance
- cache.db: SQLite database caching responses to avoid re-running identical queries
- instances.jsonl: The evaluation instances (questions, prompts, etc.) used in the benchmark
- results.jsonl: HELM's native results format with detailed metrics
- model_configs/: Dynamically generated configuration files for the specific model and provider
- evaluation_config.yaml: Standardized results in YAML format compatible with EvalFactory
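For a quick sanity check of a finished run, the native results file can be loaded with a few lines of Python. This is a minimal sketch: it assumes only that results.jsonl contains one JSON object per line; the exact field names vary by benchmark, so the snippet inspects them generically.

```python
import json
from pathlib import Path

def load_results(path="results/my_test/results.jsonl"):
    """Load HELM's JSONL results into a list of dicts, skipping blank lines."""
    records = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records

# Usage (after a run has produced the file):
#   results = load_results()
#   print(len(results), "records; fields:", sorted(results[0].keys()))
```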
Key Advantage: eval-factory automatically handles the complexity of HELM configuration generation, making it much easier to run benchmarks compared to using helm-run directly.
Step-by-Step Guide
Step 1: Choose Your Benchmark
Select from the available benchmarks based on your evaluation needs:
- For general medical QA: medcalc_bench, head_qa, medbullets
- For error detection: medec
- For research applications: pubmed_qa, ehr_sql
- For safety evaluation: race_based_med, medhallu
Step 2: Configure Your Model
Create a YAML configuration file with your model details. Here are examples for different providers:
OpenAI Configuration
config:
  type: medcalc_bench
  output_dir: results/openai_test
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY
NVIDIA AI Foundation Models (build.nvidia.com)
config:
  type: pubmed_qa
  output_dir: results/nim_test
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY
NVIDIA Cloud Function (nvcf)
config:
  type: ehr_sql
  output_dir: results/nvcf_test
target:
  api_endpoint:
    url: https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/13e4f873-9d52-4ba9-8194-61baf8dc2bc9/
    model_id: meta-llama/Llama-3.3-70B-Instruct
    type: chat
    api_key: OPENAI_API_KEY
  adapter_config:
    use_nvcf: true
Model Naming Conventions
Different providers use different model ID formats:
- OpenAI: gpt-4, gpt-3.5-turbo, text-davinci-003
- NVIDIA: meta-llama/Llama-3.3-70B-Instruct, mistral-7b-instruct
Note: NVCF requires a specific function ID in the URL and the use_nvcf: true adapter configuration.
Step 3: Set Up API Credentials
Ensure your API credentials are properly configured:
# For OpenAI models
export OPENAI_API_KEY="<very-long-sequence>"
# For NVIDIA AI Foundation Models (build.nvidia.com)
export OPENAI_API_KEY="nvapi-..." # Uses same env var as OpenAI
# For NVIDIA Cloud Function (nvcf)
export OPENAI_API_KEY="nvapi-..." # Uses same env var as OpenAI
# Note: NVIDIA services typically use the same OPENAI_API_KEY environment variable
# but with NVIDIA-specific API keys (nvapi-... format)
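A quick pre-flight check can save a failed run: verify the environment variable is actually set before launching. This helper is a hypothetical convenience, not part of the framework; the nvapi- prefix check reflects the note above about NVIDIA key formats.

```python
import os

def check_api_key(env_var="OPENAI_API_KEY", expect_nvidia=False):
    """Return the key if set; raise with a helpful message otherwise.
    If expect_nvidia is True, warn when the key lacks the nvapi- prefix."""
    key = os.environ.get(env_var)
    if not key:
        raise EnvironmentError(
            f"{env_var} is not set; export it before running eval-factory")
    if expect_nvidia and not key.startswith("nvapi-"):
        print(f"warning: {env_var} does not start with 'nvapi-'; "
              "NVIDIA services typically expect nvapi-... keys")
    return key
```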
Step 4: Run the Evaluation
Execute the benchmark using one of the methods above. The framework will:
- Load the configuration and validate parameters
- Generate model configs dynamically for the specified model
- Download and prepare the benchmark dataset
- Run evaluations on the specified number of instances
- Cache responses for efficiency and reproducibility
- Generate results in standardized format
Step 5: Analyze Results
Review the generated results:
# View raw results
cat results/my_test/results.jsonl
# Use HELM tools for analysis
helm-summarize --suite my-suite
helm-server # Start web interface to view results
Advanced Configuration
Customizing Evaluation Parameters
You can customize various parameters in your configuration:
config:
  type: medcalc_bench
  output_dir: results/advanced_test
  params:
    limit_samples: 100  # Limit number of evaluation instances
    parallelism: 4      # Number of parallel threads
    extra:
      num_train_trials: 3  # Number of training trials
      max_length: 2048     # Maximum token length
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY
Advanced Configuration Parameters
The config.params.extra section provides additional parameters for fine-tuning evaluations:
data_path
- Purpose: Custom data path for scenarios that support it
- Supported Scenarios: ehrshot, clear, medalign, n2c2_ct_matching
- Example: "/path/to/custom/data"
- Description: Overrides the default data location for the scenario
num_output_tokens
- Purpose: Maximum number of tokens the model is allowed to generate in its response
- Scope: Controls only the output length, not the total sequence length
- Example: 1000 limits model responses to 1000 tokens
- Use Case: Useful for controlling response length in generation tasks
max_length
- Purpose: Maximum total length for the entire input-output sequence (input + output combined)
- Scope: Controls the combined length of both prompt and response
- Example: 2048 limits the total conversation to 2048 tokens
- Difference from num_output_tokens: This controls total sequence length, while num_output_tokens only controls response length
subject
- Purpose: Specific task or subset to evaluate within a scenario
- Examples by Scenario:
  - ehrshot: "guo_readmission", "new_hypertension", "lab_anemia"
  - n2c2_ct_matching: "ABDOMINAL", "ADVANCED-CAD", "CREATININE"
  - clear: "major_depression", "bipolar_disorder", "substance_use_disorder"
- Description: Filters the evaluation to a specific prediction task or medical condition
condition
- Purpose: Specific condition or scenario variant to evaluate
- Supported Scenarios: clear
- Examples: "alcohol_dependence", "chronic_pain", "homelessness"
- Description: Used by scenarios like 'clear' to specify medical conditions for evaluation
num_train_trials
- Purpose: Number of training trials for few-shot evaluation
- Behavior: Each trial samples a different set of in-context examples
- Example: 3 runs the evaluation 3 times with different examples
- Use Case: Useful for robust evaluation with multiple few-shot configurations
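The interplay between max_length and num_output_tokens described above reduces to a simple token budget: prompt and response together must fit in max_length, while num_output_tokens caps the response alone. The helper below is purely illustrative, not part of the framework:

```python
def allowed_output_tokens(prompt_tokens, max_length=2048, num_output_tokens=1000):
    """Illustrative budget: the response is capped by num_output_tokens AND
    by whatever room max_length leaves after the prompt."""
    remaining = max(0, max_length - prompt_tokens)
    return min(num_output_tokens, remaining)

# A short prompt is limited by num_output_tokens;
# a long prompt is limited by what remains of max_length.
```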
Example Configuration with All Parameters
config:
  type: ehrshot
  output_dir: results/ehrshot_evaluation
  params:
    limit_samples: 500
    parallelism: 2
    extra:
      data_path: "/custom/path/to/ehrshot/data"
      num_output_tokens: 1000
      max_length: 4096
      subject: "guo_readmission"
      num_train_trials: 3
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY
Running Multiple Benchmarks
To run multiple benchmarks on the same model:
# Create separate config files for each benchmark
eval-factory run_eval --output_dir results/medcalc_test --run_config medcalc_config.yml
eval-factory run_eval --output_dir results/medec_test --run_config medec_config.yml
eval-factory run_eval --output_dir results/head_qa_test --run_config head_qa_config.yml
Dry Run Mode
Test your configuration without running the full evaluation:
eval-factory run_eval \
--output_dir results/test \
--run_config my_config.yml \
--dry_run
This will show you the rendered configuration and command without executing the benchmark.
Troubleshooting
Common Issues
- API Key Errors: Ensure your API keys are properly set and valid
- Model Not Found: Verify the model ID and endpoint URL are correct
- Memory Issues: Reduce parallelism or limit_samples for large models
- Timeout Errors: Increase timeout settings or reduce batch sizes
Debug Mode
Enable debug logging for detailed information:
eval-factory --debug run_eval \
--output_dir results/debug_test \
--run_config debug_config.yml
Checking Available Tasks
List all available evaluation types:
eval-factory ls
Examples from commands.sh
Here are some practical examples from the project:
Basic Medical Calculation Benchmark
eval-factory run_eval \
--output_dir test_cases/test_case_nim_llama_3_1_8b_medcalc_bench \
--run_config test_cases/test_case_nim_llama_3_1_8b_medcalc_bench.yml
Medical Error Detection
eval-factory run_eval \
--output_dir test_cases/test_case_nim_llama_3_1_8b_medec \
--run_config test_cases/test_case_nim_llama_3_1_8b_medec.yml
Biomedical QA
eval-factory run_eval \
--output_dir test_cases/test_case_nim_llama_3_1_8b_head_qa \
--run_config test_cases/test_case_nim_llama_3_1_8b_head_qa.yml
Running Evaluations with Judges
The HELM framework supports multi-judge evaluations for scenarios that require human-like assessment of model outputs. This is particularly useful for tasks like medical treatment plan generation, where multiple AI judges can provide more robust and reliable evaluations.
Overview of Multi-Judge Setup
The framework supports three types of judges:
- GPT Judge: Uses OpenAI GPT models for evaluation
- Llama Judge: Uses Llama models for evaluation
- Claude Judge: Uses Anthropic Claude models for evaluation
Each judge can use different API keys, providing better rate limiting, cost tracking, and flexibility.
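Conceptually, a multi-judge evaluation ends with the judges' scores being combined into one estimate. The sketch below shows a simple mean aggregation; it is purely illustrative (each scenario defines its own annotators and score formats, which may differ from this):

```python
from statistics import mean

def aggregate_judge_scores(scores_by_judge):
    """Combine per-judge scores into a single estimate by averaging.
    scores_by_judge: dict mapping judge name -> numeric score."""
    if not scores_by_judge:
        raise ValueError("no judge scores to aggregate")
    return mean(scores_by_judge.values())

# e.g. aggregate_judge_scores({"gpt": 0.8, "llama": 0.6, "claude": 0.7})
```

Averaging over several judges is what gives the robustness benefit described above: no single judge's bias dominates the final score.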
Authentication Systems
The framework supports two authentication methods for judge models:
1. Direct Judge API Keys (Recommended for Production)
Set individual API keys for each judge type:
# API key for the main model being evaluated
export OPENAI_API_KEY="your-main-model-api-key"
# API keys for the three judges (annotators)
export GPT_JUDGE_API_KEY="your-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="your-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="your-claude-judge-api-key"
2. OAuth 2.0 Client Credentials Flow (Advanced)
Use NVIDIA's OAuth system for automatic token management:
# OAuth 2.0 credentials for automatic token generation
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
export OPENAI_TOKEN_URL="https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token"
export OPENAI_SCOPE="awsanthropic-readwrite"
# Main API key (still required)
export OPENAI_API_KEY="your-main-model-api-key"
How Authentication Priority Works
The system follows this exact priority order:
1. First Priority: Judge-specific environment variables
   - GPT_JUDGE_API_KEY for GPT models
   - LLAMA_JUDGE_API_KEY for Llama models
   - CLAUDE_JUDGE_API_KEY for Claude models
2. Second Priority: Fallback to main API key
   - If judge keys aren't set, automatically uses OPENAI_API_KEY
   - System logs: "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY"
3. Third Priority: Credentials configuration
   - Falls back to credentials.conf or deployment-specific keys
Important: OAuth-generated tokens are NOT automatically used for judge API keys. The OAuth system is separate and serves different purposes.
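The first two priority levels can be sketched as a small resolver. The function name here is hypothetical (the actual lookup lives inside the framework), but the fallback behavior, judge-specific variable first, then OPENAI_API_KEY, matches the logged messages; the third level (credentials.conf) is omitted from this sketch.

```python
import os

# Judge-specific environment variables, as documented above.
JUDGE_ENV_VARS = {
    "gpt": "GPT_JUDGE_API_KEY",
    "llama": "LLAMA_JUDGE_API_KEY",
    "claude": "CLAUDE_JUDGE_API_KEY",
}

def resolve_judge_key(judge):
    """First priority: the judge-specific variable; second: OPENAI_API_KEY."""
    env_var = JUDGE_ENV_VARS[judge]
    key = os.environ.get(env_var)
    if key:
        return key
    print(f"{env_var} is not set, setting to OPENAI_API_KEY")
    return os.environ.get("OPENAI_API_KEY")
```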
OAuth 2.0 System Details
The OAuth system provides:
- Automatic Token Creation: Generates access tokens using client credentials
- Token Caching: Stores tokens in memory and on disk ({service_name}_oauth_token.json)
- Automatic Refresh: Refreshes expired tokens automatically
- Scope Control: Different permissions per service:
  - azureopenai-readwrite for GPT services
  - awsanthropic-readwrite for Claude services
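The cache-and-refresh behavior above can be sketched as follows. The file name follows the {service_name}_oauth_token.json convention and the expires_at field appears in the troubleshooting commands later in this README; the exact on-disk layout is otherwise an assumption, not the framework's implementation.

```python
import json
import time
from pathlib import Path

def load_cached_token(service_name, now=None):
    """Return a cached access token if the cache file exists and has not
    expired; otherwise return None so the caller requests a fresh token."""
    now = time.time() if now is None else now
    cache = Path(f"{service_name}_oauth_token.json")
    if not cache.exists():
        return None
    data = json.loads(cache.read_text())
    # Field names ('access_token', 'expires_at') are assumed for this sketch.
    if data.get("expires_at", 0) <= now:
        return None  # expired: trigger an automatic refresh
    return data.get("access_token")
```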
When to Use OAuth:
- Better security (client credentials vs. long-lived API keys)
- Automatic token management
- Centralized billing and rate limiting
- Enterprise-grade authentication
When to Use Direct API Keys:
- Simpler setup
- Direct control over each judge's API key
- Different providers for different judges
- Testing and development scenarios
Security Features
API Key Protection: The system automatically sanitizes error messages to prevent API keys from appearing in logs. Any API key patterns (like nvapi-..., sk-..., hf_...) are automatically replaced with [API_KEY_REDACTED] before logging.
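The sanitization described above amounts to replacing key-shaped substrings before a message reaches the logs. A minimal sketch using the prefixes named in the text (nvapi-..., sk-..., hf_...); the character class and the real implementation's full pattern list are assumptions:

```python
import re

# Key prefixes mentioned above; the trailing character class is an assumption.
_KEY_PATTERN = re.compile(r"\b(?:nvapi-|sk-|hf_)[A-Za-z0-9_\-]+")

def sanitize(message):
    """Replace anything that looks like an API key with a redaction marker."""
    return _KEY_PATTERN.sub("[API_KEY_REDACTED]", message)
```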
Configuration for Multi-Judge Evaluations
Basic Configuration (Direct API Keys)
config:
  type: mtsamples_replicate  # Example scenario that uses judges
  output_dir: results/multi_judge_test
  params:
    limit_samples: 10
    parallelism: 1
    extra:
      num_train_trials: 1
      max_length: 2048
      # Different API keys for each judge
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.1-8b-instruct
    type: chat
    api_key: OPENAI_API_KEY
Advanced Configuration (OAuth + Direct Keys)
config:
  type: mtsamples_replicate
  output_dir: results/oauth_multi_judge_test
  params:
    limit_samples: 50
    parallelism: 2
    extra:
      num_train_trials: 3
      max_length: 2048
      # Mix OAuth (automatic) and direct keys
      gpt_judge_api_key: GPT_JUDGE_API_KEY  # Direct key for GPT
      # Llama and Claude will use OAuth-generated tokens
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY
Supported Scenarios with Judges
Currently, the following scenarios support multi-judge evaluations:
| Scenario | Description | Judge Types Used |
|---|---|---|
| mtsamples_replicate | Generate treatment plans based on clinical notes | GPT, Llama, Claude |
| mtsamples_procedures | Document and extract information about medical procedures | GPT, Llama, Claude |
| aci_bench | Extract and structure information from patient-doctor conversations | GPT, Llama, Claude |
| medication_qa | Answer consumer medication-related questions | GPT, Llama, Claude |
| medi_qa | Retrieve and rank answers based on medical question understanding | GPT, Llama, Claude |
| med_dialog | Generate summaries of doctor-patient conversations | GPT, Llama, Claude |
Complete Setup Guide
Method 1: Direct API Keys (Simplest)
# 1. Set up environment variables
export OPENAI_API_KEY="nvapi-your-main-api-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-api-key"
# 2. Run the evaluation
eval-factory run_eval \
--output_dir results/multi_judge_test \
--run_config multi_judge_config.yml
Method 2: OAuth 2.0 System (Enterprise)
# 1. Set up OAuth credentials
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
export OPENAI_TOKEN_URL="https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token"
export OPENAI_SCOPE="awsanthropic-readwrite"
# 2. Set main API key (still required)
export OPENAI_API_KEY="nvapi-your-main-api-key"
# 3. Run the evaluation
eval-factory run_eval \
--output_dir results/oauth_multi_judge_test \
--run_config oauth_multi_judge_config.yml
Method 3: Hybrid Approach (Flexible)
# 1. Set OAuth credentials for automatic token generation
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
# 2. Override specific judge with direct API key
export GPT_JUDGE_API_KEY="nvapi-gpt-specific-key"
# 3. Set main API key
export OPENAI_API_KEY="nvapi-your-main-api-key"
# 4. Run the evaluation
eval-factory run_eval \
--output_dir results/hybrid_multi_judge_test \
--run_config hybrid_multi_judge_config.yml
Method 4: Using helm-run directly
# Set up environment variables (any of the above methods)
export OPENAI_API_KEY="nvapi-your-main-api-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-api-key"
# Run the evaluation
helm-run \
--run-entries mtsamples_replicate:model=openai/gpt-4 \
--suite my-suite \
--max-eval-instances 10 \
--num-train-trials 1 \
-o results/multi_judge_test
Advanced Judge Configuration
Using Different API Keys for Each Judge
You can use completely different API keys for each judge:
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-1"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-2"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-3"
Using the Same API Key for All Judges
If you want to use the same API key for all judges:
export GPT_JUDGE_API_KEY="nvapi-shared-key"
export LLAMA_JUDGE_API_KEY="nvapi-shared-key"
export CLAUDE_JUDGE_API_KEY="nvapi-shared-key"
OAuth Token Management
Check OAuth Token Status:
# Look for OAuth token files
ls -la *_oauth_token.json
# Check token expiration
cat openai_oauth_token.json | jq '.expires_at'
Force Token Refresh:
# The system automatically refreshes expired tokens
# You can also manually trigger refresh by deleting token files
rm *_oauth_token.json
OAuth Scopes for Different Services:
# For GPT services
export OPENAI_SCOPE="azureopenai-readwrite"
# For Claude services
export OPENAI_SCOPE="awsanthropic-readwrite"
# For general access
export OPENAI_SCOPE="awsanthropic-readwrite"
Example Multi-Judge Evaluation
Here's a complete example for running a multi-judge evaluation:
# 1. Create configuration file (multi_judge_config.yml)
cat > multi_judge_config.yml << EOF
config:
  type: mtsamples_replicate
  output_dir: results/multi_judge_test
  params:
    limit_samples: 50
    parallelism: 2
    extra:
      num_train_trials: 3
      max_length: 2048
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY
EOF
# 2. Set environment variables
export OPENAI_API_KEY="nvapi-main-model-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-key"
# 3. Run the evaluation
eval-factory run_eval \
--output_dir results/multi_judge_test \
--run_config multi_judge_config.yml
Troubleshooting Multi-Judge Evaluations
Check Environment Variables
Verify your environment variables are set correctly:
echo "Main API Key: $OPENAI_API_KEY"
echo "GPT Judge: $GPT_JUDGE_API_KEY"
echo "Llama Judge: $LLAMA_JUDGE_API_KEY"
echo "Claude Judge: $CLAUDE_JUDGE_API_KEY"
Check OAuth Configuration
Verify OAuth credentials are properly set:
echo "Client ID: $OPENAI_CLIENT_ID"
echo "Client Secret: $OPENAI_CLIENT_SECRET"
echo "Token URL: $OPENAI_TOKEN_URL"
echo "Scope: $OPENAI_SCOPE"
Debug Mode
Enable debug logging to see which API keys are being used:
eval-factory --debug run_eval \
--output_dir results/debug_multi_judge \
--run_config multi_judge_config.yml
Common Issues and Solutions
Issue: "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY"
- Cause: Judge API key not set, system falling back to main API key
- Solution: Set the specific judge API key or accept the fallback
Issue: "Missing environment variables for openai token"
- Cause: OAuth credentials not properly configured
- Solution: Set OPENAI_CLIENT_ID and OPENAI_CLIENT_SECRET
Issue: "Error creating openai OAuth token"
- Cause: Invalid credentials or network issues
- Solution: Verify credentials and check network connectivity
Issue: API key appears in logs
- Cause: This should not happen with the security fix
- Solution: Check if you're using the latest version with API key sanitization
Log Analysis
Look for these log patterns:
# Judge API key usage
grep "Using.*judge API key" logs/*.log
# OAuth token creation
grep "Creating new.*OAuth token" logs/*.log
# API key fallbacks
grep "is not set, setting to" logs/*.log
# Authentication errors
grep "Authentication error detected" logs/*.log
Performance Monitoring
Check API Key Usage:
# Monitor which API keys are being used
grep "Using.*API key.*ends with" logs/*.log
# Check for rate limiting
grep "rate limit\|429" logs/*.log
# Monitor OAuth token refresh
grep "token expired\|refreshing" logs/*.log
Look for messages like:
Using GPT judge API key from environment variable for model: nvidia/gpt4o-abc123
Using Llama judge API key from environment variable for model: nvdev/meta/llama-3.3-70b-instruct-def456
Using Claude judge API key from environment variable for model: nvidia/claude-3-7-sonnet-20250219-ghi789
Common Issues
- Environment variables not loaded: Make sure your environment variables are set before running the command
- API key format: Ensure your API keys start with nvapi- for NVIDIA services
- Configuration file: Verify your YAML configuration file references the correct environment variable names
- Judge model availability: Ensure the judge models are available through your API endpoints
Benefits of Multi-Judge Evaluations
- Better rate limiting: Each judge can have its own rate limits
- Cost tracking: Track costs separately for each judge
- Flexibility: Use different API keys for different purposes
- Security: Isolate API keys for different components
- Robustness: Multiple judges provide more reliable evaluations
- Diversity: Different judge models may catch different types of errors
Integration with EvalFactory
This framework is designed to work seamlessly with the EvalFactory infrastructure:
- Standardized Output: Results are generated in a format compatible with EvalFactory
- Configuration Management: Uses YAML-based configuration for easy integration
- Caching: Built-in caching for efficient re-runs and reproducibility
- Extensibility: Easy to add new benchmarks and evaluation metrics
Contributing
To add new benchmarks or modify existing ones:
- Update framework.yml with new benchmark definitions
- Implement the benchmark logic in the appropriate adapter
- Add test cases and documentation
- Update this README with new benchmark information
References
For more detailed information about specific benchmarks and their implementations, refer to the individual benchmark documentation and the main HELM repository.
Holistic Evaluation of Language Models (HELM)
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. This framework includes the following features:
- Datasets and benchmarks in a standardized format (e.g. MMLU-Pro, GPQA, IFEval, WildBench)
- Models from various providers accessible through a unified interface (e.g. OpenAI models, Anthropic Claude, Google Gemini)
- Metrics for measuring various aspects beyond accuracy (e.g. efficiency, bias, toxicity)
- Web UI for inspecting individual prompts and responses
- Web leaderboard for comparing results across models and benchmarks
Documentation
Please refer to the documentation on Read the Docs for instructions on how to install and run HELM.
Quick Start
Install the package from PyPI:
pip install crfm-helm
Run the following in your shell:
# Run benchmark
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
# Summarize benchmark results
helm-summarize --suite my-suite
# Start a web server to display benchmark results
helm-server --suite my-suite
Then go to http://localhost:8000/ in your browser.
Attribution
This NVIDIA fork of HELM is based on the original Stanford CRFM HELM framework. The original framework was created by the Center for Research on Foundation Models (CRFM) at Stanford and is licensed under the Apache License 2.0.
Leaderboards
We maintain official leaderboards with results from evaluating recent models on notable benchmarks using this framework. We also maintain leaderboards for a diverse range of domains (e.g. medicine, finance) and aspects (e.g. multi-linguality, world knowledge, regulation compliance). Refer to the HELM website for a full list of leaderboards, including our current flagship leaderboards.
Papers
The HELM framework was used in the following papers for evaluating models.
- Holistic Evaluation of Language Models - paper, leaderboard
- Holistic Evaluation of Vision-Language Models (VHELM) - paper, leaderboard, documentation
- Holistic Evaluation of Text-To-Image Models (HEIM) - paper, leaderboard, documentation
- Image2Struct: Benchmarking Structure Extraction for Vision-Language Models - paper
- Enterprise Benchmarks for Large Language Model Evaluation - paper, documentation
- The Mighty ToRR: A Benchmark for Table Reasoning and Robustness - paper, leaderboard
- Reliable and Efficient Amortized Model-based Evaluation - paper, documentation
- MedHELM - paper in progress, leaderboard, documentation
The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the main Reproducing Leaderboards documentation.
Citation
If you use this software in your research, please cite the Holistic Evaluation of Language Models paper as below.
@article{
liang2023holistic,
title={Holistic Evaluation of Language Models},
author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=iO4LZibEqW},
note={Featured Certification, Expert Certification}
}
Attribution and Acknowledgments
Original Project
This project is a fork of the Holistic Evaluation of Language Models (HELM) framework created by the Center for Research on Foundation Models (CRFM) at Stanford.
- Original Repository: https://github.com/stanford-crfm/helm
- Original Documentation: https://crfm.stanford.edu/helm
- Original Paper: Holistic Evaluation of Language Models
- Original Authors: Stanford CRFM Team
- Original License: Apache License 2.0
Citation
If you use this software in your research, please cite the original HELM paper (see the Citation section above).
Fork Information
- Fork Maintainer: NVIDIA
- Fork Purpose: Medical AI evaluation and EvalFactory integration
License
This fork is released under the same Apache License 2.0 as the original project, in accordance with the original license terms.