NVIDIA: Benchmark for language models - Fork of Stanford CRFM HELM

NVIDIA HELM Benchmark Framework

This directory contains the HELM (Holistic Evaluation of Language Models) framework for evaluating large language models in medical applications across various healthcare tasks.

Overview

The HELM framework provides a comprehensive evaluation system for medical AI models, supporting multiple benchmark datasets and evaluation scenarios. It's designed to work with the EvalFactory infrastructure for standardized model evaluation.

Available Benchmarks

The framework supports the following medical evaluation benchmarks:

Benchmark | Description | Type
medcalc_bench | Medical calculation benchmark with patient notes and ground truth answers | Medical QA
medec | Medical error detection and correction pairs | Error Detection
head_qa | Biomedical multiple-choice questions for medical knowledge testing | Medical QA
medbullets | USMLE-style medical questions with explanations | Medical QA
pubmed_qa | PubMed abstracts with yes/no/maybe questions | Medical QA
ehr_sql | Natural language to SQL query generation for clinical research | SQL Generation
race_based_med | Detection of race-based biases in medical LLM outputs | Bias Detection
medhallu | Classification of factual vs hallucinated medical answers | Hallucination Detection

Quick Start

1. Environment Setup

First, ensure you have the required environment variables set:

# Set your API keys
export OPENAI_API_KEY="your-api-key-here"

# Set Python path if necessary
export PYTHONPATH="$PYTHONPATH:."

2. Running Your First Benchmark

Method 1: Using eval-factory (Recommended)

eval-factory is a wrapper that simplifies the HELM benchmark process by handling configuration generation, benchmark execution, and result formatting automatically.

What eval-factory does internally:

  1. Configuration Processing: Loads your YAML config and merges it with framework defaults
  2. Dynamic Config Generation: Creates the necessary HELM model configurations dynamically
  3. Benchmark Execution: Runs the HELM benchmark with proper parameters
  4. Result Processing: Formats and saves results in standardized YAML format

Create a configuration file (e.g., my_test.yml):

config:
  type: medcalc_bench  # Choose from available benchmarks
  output_dir: results/my_test
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY

Run the evaluation:

eval-factory run_eval \
    --output_dir results/my_test \
    --run_config my_test.yml

Internal Process Breakdown:

  1. Config Loading & Validation:

    • Loads your YAML configuration
    • Validates against framework schema
    • Merges with default parameters from framework.yml
  2. Dynamic Model Config Generation:

    • Calls scripts/generate_dynamic_model_configs.py
    • Creates model-specific configuration files
    • Handles provider-specific API endpoints and authentication
  3. HELM Benchmark Execution:

    • Executes helm-run with generated configurations
    • Downloads and prepares benchmark datasets
    • Runs evaluations with specified parameters
    • Caches responses for efficiency
  4. Result Processing:

    • Collects raw benchmark results
    • Formats into standardized YAML output
    • Saves results in your specified output directory
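
The "merges with default parameters" step can be pictured as a recursive dictionary merge in which user-supplied values win. The following is a minimal sketch under that assumption, not the framework's actual code: `deep_merge` and the default values are hypothetical, and the real defaults live in framework.yml.

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict in which values from `overrides` win over `defaults`."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or mismatched types: the user value replaces the default.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults, for illustration only.
framework_defaults = {"config": {"params": {"parallelism": 1, "limit_samples": None}}}

user_config = {
    "config": {
        "type": "medcalc_bench",
        "output_dir": "results/my_test",
        "params": {"parallelism": 4},
    }
}

effective = deep_merge(framework_defaults, user_config)
print(effective["config"]["params"])
```

Keys the user does not mention keep their default values, which is why a short YAML file is enough to start a run.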

Method 2: Using helm-run directly

helm-run \
  --run-entries medcalc_bench:model=openai/gpt-4 \
  --suite my-suite \
  --max-eval-instances 10 \
  --num-train-trials 1 \
  -o results/my_test

Comparison: eval-factory vs helm-run

Feature | eval-factory | helm-run
Configuration | Simple YAML config | Complex command-line arguments
Model Setup | Automatic config generation | Manual model registration required
Provider Support | Built-in adapter handling | Requires custom model configs
Results Format | Standardized YAML output | Native HELM format only
Ease of Use | Beginner-friendly | Advanced users only
Integration | EvalFactory compatible | HELM-specific

Recommendation: Use eval-factory for most use cases, especially when working with EvalFactory. Use helm-run only when you need fine-grained control over HELM's native features.

3. Understanding the Output

After running a benchmark, you'll find results in your specified output directory:

results/my_test/
├── responses/          # Raw model responses
├── cache.db           # Cached responses for efficiency
├── instances.jsonl    # Evaluation instances
├── results.jsonl      # Final evaluation results
├── model_configs/     # Generated HELM model configurations
└── evaluation_config.yaml  # Standardized evaluation results

Generated Files Explanation:

  • responses/: Contains raw API responses from the model for each evaluation instance
  • cache.db: SQLite database caching responses to avoid re-running identical queries
  • instances.jsonl: The evaluation instances (questions, prompts, etc.) used in the benchmark
  • results.jsonl: HELM's native results format with detailed metrics
  • model_configs/: Dynamically generated configuration files for the specific model and provider
  • evaluation_config.yaml: Standardized results in YAML format compatible with EvalFactory
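
Since instances.jsonl and results.jsonl are JSON Lines files, they can be inspected with a few lines of Python. This is a small sketch: `read_jsonl` is a hypothetical helper, and no particular record schema is assumed, so the example only reports the record count and the top-level keys.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of Python objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    results_path = Path("results/my_test/results.jsonl")
    if results_path.exists():
        records = read_jsonl(results_path)
        print(f"{len(records)} records")
        if records and isinstance(records[0], dict):
            # Show which fields are present before relying on any of them.
            print("top-level keys:", sorted(records[0]))
    else:
        print(f"{results_path} not found; run a benchmark first")
```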

Key Advantage: eval-factory automatically handles the complexity of HELM configuration generation, making it much easier to run benchmarks compared to using helm-run directly.

Step-by-Step Guide

Step 1: Choose Your Benchmark

Select from the available benchmarks based on your evaluation needs:

  • For general medical QA: medcalc_bench, head_qa, medbullets
  • For error detection: medec
  • For research applications: pubmed_qa, ehr_sql
  • For safety evaluation: race_based_med, medhallu

Step 2: Configure Your Model

Create a YAML configuration file with your model details. Here are examples for different providers:

OpenAI Configuration

config:
  type: medcalc_bench
  output_dir: results/openai_test
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY

NVIDIA AI Foundation Models (build.nvidia.com)

config:
  type: pubmed_qa
  output_dir: results/nim_test
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY

NVIDIA Cloud Function (nvcf)

config:
  type: ehr_sql
  output_dir: results/nvcf_test
target:
  api_endpoint:
    url: https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/13e4f873-9d52-4ba9-8194-61baf8dc2bc9/
    model_id: meta-llama/Llama-3.3-70B-Instruct
    type: chat
    api_key: OPENAI_API_KEY
    adapter_config:
      use_nvcf: true

Model Naming Conventions

Different providers use different model ID formats:

  • OpenAI: gpt-4, gpt-3.5-turbo, text-davinci-003
  • NVIDIA: meta-llama/Llama-3.3-70B-Instruct, mistral-7b-instruct

Note: NVCF requires a specific function ID in the URL and the use_nvcf: true adapter configuration.

Step 3: Set Up API Credentials

Ensure your API credentials are properly configured:

# For OpenAI models
export OPENAI_API_KEY="<very-long-sequence>"

# For NVIDIA AI Foundation Models (build.nvidia.com)
export OPENAI_API_KEY="nvapi-..."  # Uses same env var as OpenAI

# For NVIDIA Cloud Function (nvcf)
export OPENAI_API_KEY="nvapi-..."  # Uses same env var as OpenAI

# Note: NVIDIA services typically use the same OPENAI_API_KEY environment variable
# but with NVIDIA-specific API keys (nvapi-... format)
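
Because the `api_key` field in the YAML configuration names an environment variable rather than holding the key itself, a quick pre-flight check can catch a missing variable before a run starts. The helper below is a hypothetical sketch and is not part of eval-factory.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Check the variable referenced by `api_key: OPENAI_API_KEY` in the config.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Set these variables before running eval-factory:", ", ".join(missing))
else:
    print("All required variables are set.")
```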

Step 4: Run the Evaluation

Execute the benchmark using one of the methods above. The framework will:

  1. Load the configuration and validate parameters
  2. Generate model configs dynamically for the specified model
  3. Download and prepare the benchmark dataset
  4. Run evaluations on the specified number of instances
  5. Cache responses for efficiency and reproducibility
  6. Generate results in standardized format

Step 5: Analyze Results

Review the generated results:

# View raw results
cat results/my_test/results.jsonl

# Use HELM tools for analysis
helm-summarize --suite my-suite
helm-server  # Start web interface to view results

Advanced Configuration

Customizing Evaluation Parameters

You can customize various parameters in your configuration:

config:
  type: medcalc_bench
  output_dir: results/advanced_test
  params:
    limit_samples: 100        # Limit number of evaluation instances
    parallelism: 4           # Number of parallel threads
    extra:
      num_train_trials: 3    # Number of training trials
      max_length: 2048       # Maximum token length
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY

Advanced Configuration Parameters

The config.params.extra section provides additional parameters for fine-tuning evaluations:

data_path

  • Purpose: Custom data path for scenarios that support it
  • Supported Scenarios: ehrshot, clear, medalign, n2c2_ct_matching
  • Example: "/path/to/custom/data"
  • Description: Overrides the default data location for the scenario

num_output_tokens

  • Purpose: Maximum number of tokens the model is allowed to generate in its response
  • Scope: Controls only the output length, not the total sequence length
  • Example: 1000 limits model responses to 1000 tokens
  • Use Case: Useful for controlling response length in generation tasks

max_length

  • Purpose: Maximum total length for the entire input-output sequence (input + output combined)
  • Scope: Controls the combined length of both prompt and response
  • Example: 2048 limits total conversation to 2048 tokens
  • Difference from num_output_tokens: This controls total sequence length, while num_output_tokens only controls response length
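
The relationship between the two limits can be made concrete with a little arithmetic. This sketch assumes the full `num_output_tokens` budget is reserved out of `max_length`, so the remainder is what the prompt can use; how the framework actually truncates prompts is not described here, and `max_prompt_tokens` is a hypothetical helper.

```python
def max_prompt_tokens(max_length: int, num_output_tokens: int) -> int:
    """Tokens left for the prompt when the full output budget is reserved."""
    if num_output_tokens > max_length:
        raise ValueError("num_output_tokens cannot exceed max_length")
    return max_length - num_output_tokens


# With max_length=4096 and num_output_tokens=1000, the prompt may use up to 3096 tokens.
print(max_prompt_tokens(4096, 1000))
```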

subject

  • Purpose: Specific task or subset to evaluate within a scenario
  • Examples by Scenario:
    • ehrshot: "guo_readmission", "new_hypertension", "lab_anemia"
    • n2c2_ct_matching: "ABDOMINAL", "ADVANCED-CAD", "CREATININE"
    • clear: "major_depression", "bipolar_disorder", "substance_use_disorder"
  • Description: Filters the evaluation to a specific prediction task or medical condition

condition

  • Purpose: Specific condition or scenario variant to evaluate
  • Supported Scenarios: clear
  • Examples: "alcohol_dependence", "chronic_pain", "homelessness"
  • Description: Used by scenarios like 'clear' to specify medical conditions for evaluation

num_train_trials

  • Purpose: Number of training trials for few-shot evaluation
  • Behavior: Each trial samples a different set of in-context examples
  • Example: 3 runs the evaluation 3 times with different examples
  • Use Case: Useful for robust evaluation with multiple few-shot configurations
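
To illustrate what "each trial samples a different set of in-context examples" means, the sketch below draws one seeded sample per trial. It is a simplified picture, not HELM's sampling code: the function name, the seeding scheme, and the example pool are all assumptions.

```python
import random


def sample_in_context_examples(train_pool, num_examples, num_train_trials, seed=0):
    """Draw one set of in-context examples per trial, each from its own seeded RNG."""
    trials = []
    for trial in range(num_train_trials):
        # A per-trial seed keeps every trial reproducible across runs.
        rng = random.Random(seed + trial)
        trials.append(rng.sample(train_pool, num_examples))
    return trials


# Hypothetical pool of training examples.
pool = [f"example_{i}" for i in range(20)]
for index, examples in enumerate(sample_in_context_examples(pool, 5, 3)):
    print(f"trial {index}: {examples}")
```

Averaging a metric over the trials reduces its dependence on any single choice of few-shot examples.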

Example Configuration with All Parameters

config:
  type: ehrshot
  output_dir: results/ehrshot_evaluation
  params:
    limit_samples: 500
    parallelism: 2
    extra:
      data_path: "/custom/path/to/ehrshot/data"
      num_output_tokens: 1000
      max_length: 4096
      subject: "guo_readmission"
      num_train_trials: 3
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY

Running Multiple Benchmarks

To run multiple benchmarks on the same model:

# Create separate config files for each benchmark
eval-factory run_eval --output_dir results/medcalc_test --run_config medcalc_config.yml
eval-factory run_eval --output_dir results/medec_test --run_config medec_config.yml
eval-factory run_eval --output_dir results/head_qa_test --run_config head_qa_config.yml
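
When the list of benchmarks grows, the same commands can be generated in a loop. The sketch below only builds and prints the command lines, using the flags shown above; the config file naming scheme (`<name>_config.yml`) mirrors the example and is otherwise an assumption.

```python
import shlex


def build_commands(benchmarks, results_root="results"):
    """Build one eval-factory argument list per benchmark name."""
    return [
        [
            "eval-factory", "run_eval",
            "--output_dir", f"{results_root}/{name}_test",
            "--run_config", f"{name}_config.yml",
        ]
        for name in benchmarks
    ]


commands = build_commands(["medcalc", "medec", "head_qa"])
for argv in commands:
    # Print the command; to execute it instead, import subprocess and
    # call subprocess.run(argv, check=True).
    print(shlex.join(argv))
```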

Dry Run Mode

Test your configuration without running the full evaluation:

eval-factory run_eval \
    --output_dir results/test \
    --run_config my_config.yml \
    --dry_run

This will show you the rendered configuration and command without executing the benchmark.

Troubleshooting

Common Issues

  1. API Key Errors: Ensure your API keys are properly set and valid
  2. Model Not Found: Verify the model ID and endpoint URL are correct
  3. Memory Issues: Reduce parallelism or limit_samples for large models
  4. Timeout Errors: Increase timeout settings or reduce batch sizes

Debug Mode

Enable debug logging for detailed information:

eval-factory --debug run_eval \
    --output_dir results/debug_test \
    --run_config debug_config.yml

Checking Available Tasks

List all available evaluation types:

eval-factory ls

Examples from commands.sh

Here are some practical examples from the project:

Basic Medical Calculation Benchmark

eval-factory run_eval \
    --output_dir test_cases/test_case_nim_llama_3_1_8b_medcalc_bench \
    --run_config test_cases/test_case_nim_llama_3_1_8b_medcalc_bench.yml

Medical Error Detection

eval-factory run_eval \
    --output_dir test_cases/test_case_nim_llama_3_1_8b_medec \
    --run_config test_cases/test_case_nim_llama_3_1_8b_medec.yml

Biomedical QA

eval-factory run_eval \
    --output_dir test_cases/test_case_nim_llama_3_1_8b_head_qa \
    --run_config test_cases/test_case_nim_llama_3_1_8b_head_qa.yml

Running Evaluations with Judges

The HELM framework supports multi-judge evaluations for scenarios that require human-like assessment of model outputs. This is particularly useful for tasks like medical treatment plan generation, where multiple AI judges can provide more robust and reliable evaluations.

Overview of Multi-Judge Setup

The framework supports three types of judges:

  • GPT Judge: Uses OpenAI GPT models for evaluation
  • Llama Judge: Uses Llama models for evaluation
  • Claude Judge: Uses Anthropic Claude models for evaluation

Each judge can use different API keys, providing better rate limiting, cost tracking, and flexibility.

Authentication Systems

The framework supports two authentication methods for judge models:

1. Direct Judge API Keys (Recommended for Production)

Set individual API keys for each judge type:

# API key for the main model being evaluated
export OPENAI_API_KEY="your-main-model-api-key"

# API keys for the three judges (annotators)
export GPT_JUDGE_API_KEY="your-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="your-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="your-claude-judge-api-key"

2. OAuth 2.0 Client Credentials Flow (Advanced)

Use NVIDIA's OAuth system for automatic token management:

# OAuth 2.0 credentials for automatic token generation
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
export OPENAI_TOKEN_URL="https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token"
export OPENAI_SCOPE="awsanthropic-readwrite"

# Main API key (still required)
export OPENAI_API_KEY="your-main-model-api-key"

How Authentication Priority Works

The system follows this exact priority order:

  1. First Priority: Judge-specific environment variables

    • GPT_JUDGE_API_KEY for GPT models
    • LLAMA_JUDGE_API_KEY for Llama models
    • CLAUDE_JUDGE_API_KEY for Claude models
  2. Second Priority: Fallback to main API key

    • If judge keys aren't set, automatically uses OPENAI_API_KEY
    • System logs: "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY"
  3. Third Priority: Credentials configuration

    • Falls back to credentials.conf or deployment-specific keys

Important: OAuth-generated tokens are NOT automatically used for judge API keys. The OAuth system is separate and serves different purposes.
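
The priority order can be summarized in a few lines of code. This is an illustrative sketch of the documented behavior, not the framework's implementation: `resolve_judge_key` is a hypothetical function, and the third priority (credentials.conf or deployment-specific keys) is only signaled, not modeled.

```python
import os

JUDGE_ENV_VARS = {
    "gpt": "GPT_JUDGE_API_KEY",
    "llama": "LLAMA_JUDGE_API_KEY",
    "claude": "CLAUDE_JUDGE_API_KEY",
}


def resolve_judge_key(judge, environ=None):
    """Return (key, source) for a judge, following the documented priority order."""
    env = os.environ if environ is None else environ
    judge_var = JUDGE_ENV_VARS[judge]
    # First priority: the judge-specific environment variable.
    if env.get(judge_var):
        return env[judge_var], judge_var
    # Second priority: fall back to the main API key.
    if env.get("OPENAI_API_KEY"):
        return env["OPENAI_API_KEY"], "OPENAI_API_KEY"
    # Third priority: credentials configuration (not modeled in this sketch).
    return None, "credentials"


example_env = {"OPENAI_API_KEY": "nvapi-main", "GPT_JUDGE_API_KEY": "nvapi-gpt"}
for judge in JUDGE_ENV_VARS:
    _, source = resolve_judge_key(judge, example_env)
    print(f"{judge}: key taken from {source}")
```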

OAuth 2.0 System Details

The OAuth system provides:

  • Automatic Token Creation: Generates access tokens using client credentials
  • Token Caching: Stores tokens in memory and disk ({service_name}_oauth_token.json)
  • Automatic Refresh: Refreshes expired tokens automatically
  • Scope Control: Different permissions per service:
    • azureopenai-readwrite for GPT services
    • awsanthropic-readwrite for Claude services
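
The caching and refresh behavior described above follows a common pattern: check memory, then disk, then request a new token. The sketch below shows that pattern under stated assumptions. The `CachedToken` class, the injected `fetch` callable, the `access_token` field, and the interpretation of `expires_at` as epoch seconds are all hypothetical; only the `{service_name}_oauth_token.json` file name comes from this document.

```python
import json
import time
from pathlib import Path


class CachedToken:
    """Memory-then-disk token cache with expiry-based refresh (illustrative only)."""

    def __init__(self, service_name, fetch, cache_dir=".", clock=time.time):
        self.path = Path(cache_dir) / f"{service_name}_oauth_token.json"
        # `fetch` is a hypothetical callable that performs the client-credentials
        # request and returns {"access_token": str, "expires_at": float}.
        self.fetch = fetch
        self.clock = clock
        self._token = None

    def _valid(self, token):
        return bool(token) and token.get("expires_at", 0) > self.clock()

    def get(self):
        # 1. Use the in-memory token if it has not expired.
        # 2. Otherwise try the copy cached on disk.
        if not self._valid(self._token) and self.path.exists():
            self._token = json.loads(self.path.read_text(encoding="utf-8"))
        # 3. If that is missing or expired too, request a new token and persist it.
        if not self._valid(self._token):
            self._token = self.fetch()
            self.path.write_text(json.dumps(self._token), encoding="utf-8")
        return self._token["access_token"]
```

Injecting `fetch` keeps the cache logic independent of the HTTP client and makes the expiry behavior easy to exercise without network access.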

When to Use OAuth:

  • Better security (client credentials vs. long-lived API keys)
  • Automatic token management
  • Centralized billing and rate limiting
  • Enterprise-grade authentication

When to Use Direct API Keys:

  • Simpler setup
  • Direct control over each judge's API key
  • Different providers for different judges
  • Testing and development scenarios

Security Features

API Key Protection: The system automatically sanitizes error messages to prevent API keys from appearing in logs. Any API key patterns (like nvapi-..., sk-..., hf_...) are automatically replaced with [API_KEY_REDACTED] before logging.
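
A redaction step of this kind is usually a regular-expression substitution applied before a message is logged. The sketch below is an approximation: the exact patterns the framework uses are not documented here, so the character class after each prefix is an assumption, and `sanitize` is a hypothetical name.

```python
import re

# Assumed shape of a key: a known prefix followed by key-like characters.
API_KEY_PATTERN = re.compile(r"\b(?:nvapi-|sk-|hf_)[A-Za-z0-9_\-]+")


def sanitize(message: str) -> str:
    """Replace anything that looks like an API key with a fixed placeholder."""
    return API_KEY_PATTERN.sub("[API_KEY_REDACTED]", message)


print(sanitize("401 Unauthorized for key nvapi-abc123XYZ"))
```

The leading word boundary keeps ordinary words that merely contain a prefix, such as "task-force", from being redacted.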

Configuration for Multi-Judge Evaluations

Basic Configuration (Direct API Keys)

config:
  type: mtsamples_replicate  # Example scenario that uses judges
  output_dir: results/multi_judge_test
  params:
    limit_samples: 10
    parallelism: 1
    extra:
      num_train_trials: 1
      max_length: 2048
      # Different API keys for each judge
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.1-8b-instruct
    type: chat
    api_key: OPENAI_API_KEY

Advanced Configuration (OAuth + Direct Keys)

config:
  type: mtsamples_replicate
  output_dir: results/oauth_multi_judge_test
  params:
    limit_samples: 50
    parallelism: 2
    extra:
      num_train_trials: 3
      max_length: 2048
      # Mix OAuth (automatic) and direct keys
      gpt_judge_api_key: GPT_JUDGE_API_KEY  # Direct key for GPT
      # Llama and Claude judge keys are not set here; they are resolved by the authentication priority order described above
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY

Supported Scenarios with Judges

Currently, the following scenarios support multi-judge evaluations:

Scenario | Description | Judge Types Used
mtsamples_replicate | Generate treatment plans based on clinical notes | GPT, Llama, Claude
mtsamples_procedures | Document and extract information about medical procedures | GPT, Llama, Claude
aci_bench | Extract and structure information from patient-doctor conversations | GPT, Llama, Claude
medication_qa | Answer consumer medication-related questions | GPT, Llama, Claude
medi_qa | Retrieve and rank answers based on medical question understanding | GPT, Llama, Claude
med_dialog | Generate summaries of doctor-patient conversations | GPT, Llama, Claude

Complete Setup Guide

Method 1: Direct API Keys (Simplest)

# 1. Set up environment variables
export OPENAI_API_KEY="nvapi-your-main-api-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-api-key"

# 2. Run the evaluation
eval-factory run_eval \
    --output_dir results/multi_judge_test \
    --run_config multi_judge_config.yml

Method 2: OAuth 2.0 System (Enterprise)

# 1. Set up OAuth credentials
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
export OPENAI_TOKEN_URL="https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token"
export OPENAI_SCOPE="awsanthropic-readwrite"

# 2. Set main API key (still required)
export OPENAI_API_KEY="nvapi-your-main-api-key"

# 3. Run the evaluation
eval-factory run_eval \
    --output_dir results/oauth_multi_judge_test \
    --run_config oauth_multi_judge_config.yml

Method 3: Hybrid Approach (Flexible)

# 1. Set OAuth credentials for automatic token generation
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"

# 2. Override specific judge with direct API key
export GPT_JUDGE_API_KEY="nvapi-gpt-specific-key"

# 3. Set main API key
export OPENAI_API_KEY="nvapi-your-main-api-key"

# 4. Run the evaluation
eval-factory run_eval \
    --output_dir results/hybrid_multi_judge_test \
    --run_config hybrid_multi_judge_config.yml

Method 4: Using helm-run directly

# Set up environment variables (any of the above methods)
export OPENAI_API_KEY="nvapi-your-main-api-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-api-key"

# Run the evaluation
helm-run \
  --run-entries mtsamples_replicate:model=openai/gpt-4 \
  --suite my-suite \
  --max-eval-instances 10 \
  --num-train-trials 1 \
  -o results/multi_judge_test

Advanced Judge Configuration

Using Different API Keys for Each Judge

You can use completely different API keys for each judge:

export GPT_JUDGE_API_KEY="nvapi-gpt-judge-1"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-2"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-3"

Using the Same API Key for All Judges

If you want to use the same API key for all judges:

export GPT_JUDGE_API_KEY="nvapi-shared-key"
export LLAMA_JUDGE_API_KEY="nvapi-shared-key"
export CLAUDE_JUDGE_API_KEY="nvapi-shared-key"

OAuth Token Management

Check OAuth Token Status:

# Look for OAuth token files
ls -la *_oauth_token.json

# Check token expiration
cat openai_oauth_token.json | jq '.expires_at'

Force Token Refresh:

# The system automatically refreshes expired tokens
# You can also manually trigger refresh by deleting token files
rm *_oauth_token.json

OAuth Scopes for Different Services:

# For GPT services
export OPENAI_SCOPE="azureopenai-readwrite"

# For Claude services  
export OPENAI_SCOPE="awsanthropic-readwrite"

# For general access
export OPENAI_SCOPE="awsanthropic-readwrite"

Example Multi-Judge Evaluation

Here's a complete example for running a multi-judge evaluation:

# 1. Create configuration file (multi_judge_config.yml)
cat > multi_judge_config.yml << EOF
config:
  type: mtsamples_replicate
  output_dir: results/multi_judge_test
  params:
    limit_samples: 50
    parallelism: 2
    extra:
      num_train_trials: 3
      max_length: 2048
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY
EOF

# 2. Set environment variables
export OPENAI_API_KEY="nvapi-main-model-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-key"

# 3. Run the evaluation
eval-factory run_eval \
    --output_dir results/multi_judge_test \
    --run_config multi_judge_config.yml

Troubleshooting Multi-Judge Evaluations

Check Environment Variables

Verify your environment variables are set correctly:

echo "Main API Key: $OPENAI_API_KEY"
echo "GPT Judge: $GPT_JUDGE_API_KEY"
echo "Llama Judge: $LLAMA_JUDGE_API_KEY"
echo "Claude Judge: $CLAUDE_JUDGE_API_KEY"

Check OAuth Configuration

Verify OAuth credentials are properly set:

echo "Client ID: $OPENAI_CLIENT_ID"
echo "Client Secret: $OPENAI_CLIENT_SECRET"
echo "Token URL: $OPENAI_TOKEN_URL"
echo "Scope: $OPENAI_SCOPE"

Debug Mode

Enable debug logging to see which API keys are being used:

eval-factory --debug run_eval \
    --output_dir results/debug_multi_judge \
    --run_config multi_judge_config.yml

Common Issues and Solutions

Issue: "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY"

  • Cause: Judge API key not set, system falling back to main API key
  • Solution: Set the specific judge API key or accept the fallback

Issue: "Missing environment variables for openai token"

  • Cause: OAuth credentials not properly configured
  • Solution: Set OPENAI_CLIENT_ID and OPENAI_CLIENT_SECRET

Issue: "Error creating openai OAuth token"

  • Cause: Invalid credentials or network issues
  • Solution: Verify credentials and check network connectivity

Issue: API key appears in logs

  • Cause: This should not happen with the security fix
  • Solution: Check if you're using the latest version with API key sanitization

Log Analysis

Look for these log patterns:

# Judge API key usage
grep "Using.*judge API key" logs/*.log

# OAuth token creation
grep "Creating new.*OAuth token" logs/*.log

# API key fallbacks
grep "is not set, setting to" logs/*.log

# Authentication errors
grep "Authentication error detected" logs/*.log

Performance Monitoring

Check API Key Usage:

# Monitor which API keys are being used
grep "Using.*API key.*ends with" logs/*.log

# Check for rate limiting
grep "rate limit\|429" logs/*.log

# Monitor OAuth token refresh
grep "token expired\|refreshing" logs/*.log

Look for messages like:

Using GPT judge API key from environment variable for model: nvidia/gpt4o-abc123
Using Llama judge API key from environment variable for model: nvdev/meta/llama-3.3-70b-instruct-def456
Using Claude judge API key from environment variable for model: nvidia/claude-3-7-sonnet-20250219-ghi789
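
For longer logs, the same check can be scripted. The sketch below counts the messages shown above per judge type; the regular expression is derived from those sample messages, and `count_judge_key_usage` is a hypothetical helper.

```python
import re
from collections import Counter

JUDGE_LINE = re.compile(r"Using (GPT|Llama|Claude) judge API key from environment variable")


def count_judge_key_usage(lines):
    """Count log lines that report a judge-specific key, grouped by judge type."""
    counts = Counter()
    for line in lines:
        match = JUDGE_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_log = [
    "Using GPT judge API key from environment variable for model: nvidia/gpt4o-abc123",
    "Using Llama judge API key from environment variable for model: nvdev/meta/llama-3.3-70b-instruct-def456",
    "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY",
]
print(dict(count_judge_key_usage(sample_log)))
```

A judge that never appears in the counts is likely running on the fallback key.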

Common Issues

  1. Environment variables not loaded: Make sure your environment variables are set before running the command
  2. API key format: Ensure your API keys start with nvapi- for NVIDIA services
  3. Configuration file: Verify your YAML configuration file references the correct environment variable names
  4. Judge model availability: Ensure the judge models are available through your API endpoints

Benefits of Multi-Judge Evaluations

  • Better rate limiting: Each judge can have its own rate limits
  • Cost tracking: Track costs separately for each judge
  • Flexibility: Use different API keys for different purposes
  • Security: Isolate API keys for different components
  • Robustness: Multiple judges provide more reliable evaluations
  • Diversity: Different judge models may catch different types of errors

Integration with EvalFactory

This framework is designed to work seamlessly with the EvalFactory infrastructure:

  • Standardized Output: Results are generated in a format compatible with EvalFactory
  • Configuration Management: Uses YAML-based configuration for easy integration
  • Caching: Built-in caching for efficient re-runs and reproducibility
  • Extensibility: Easy to add new benchmarks and evaluation metrics

Contributing

To add new benchmarks or modify existing ones:

  1. Update framework.yml with new benchmark definitions
  2. Implement the benchmark logic in the appropriate adapter
  3. Add test cases and documentation
  4. Update this README with new benchmark information

References

For more detailed information about specific benchmarks and their implementations, refer to the individual benchmark documentation and the main HELM repository.

Holistic Evaluation of Language Models (HELM)

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. This framework includes the following features:

  • Datasets and benchmarks in a standardized format (e.g. MMLU-Pro, GPQA, IFEval, WildBench)
  • Models from various providers accessible through a unified interface (e.g. OpenAI models, Anthropic Claude, Google Gemini)
  • Metrics for measuring various aspects beyond accuracy (e.g. efficiency, bias, toxicity)
  • Web UI for inspecting individual prompts and responses
  • Web leaderboard for comparing results across models and benchmarks

Documentation

Please refer to the documentation on Read the Docs for instructions on how to install and run HELM.

Quick Start

Install the package from PyPI:

pip install crfm-helm

Run the following in your shell:

# Run benchmark
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10

# Summarize benchmark results
helm-summarize --suite my-suite

# Start a web server to display benchmark results
helm-server --suite my-suite

Then go to http://localhost:8000/ in your browser.

Attribution

This NVIDIA fork of HELM is based on the original Stanford CRFM HELM framework. The original framework was created by the Center for Research on Foundation Models (CRFM) at Stanford and is licensed under the Apache License 2.0.

Leaderboards

We maintain official leaderboards with results from evaluating recent models on notable benchmarks using this framework.

We also maintain leaderboards for a diverse range of domains (e.g. medicine, finance) and aspects (e.g. multi-linguality, world knowledge, regulation compliance). Refer to the HELM website for a full list of leaderboards.

Papers

The HELM framework was used in the following papers for evaluating models.

The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the main Reproducing Leaderboards documentation.

Citation

If you use this software in your research, please cite the Holistic Evaluation of Language Models paper as below.

@article{liang2023holistic,
    title={Holistic Evaluation of Language Models},
    author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2023},
    url={https://openreview.net/forum?id=iO4LZibEqW},
    note={Featured Certification, Expert Certification}
}

Attribution and Acknowledgments

Original Project

This project is a fork of the Holistic Evaluation of Language Models (HELM) framework created by the Center for Research on Foundation Models (CRFM) at Stanford.

Fork Information

  • Fork Maintainer: NVIDIA
  • Fork Purpose: Medical AI evaluation and EvalFactory integration

License

This fork is released under the same Apache License 2.0 as the original project, in accordance with the original license terms.
