NVIDIA: Benchmark for language models - Fork of Stanford CRFM HELM

NVIDIA HELM Benchmark Framework

This directory contains the HELM (Holistic Evaluation of Language Models) framework for evaluating large language models in medical applications across various healthcare tasks.

Overview

The HELM framework provides a comprehensive evaluation system for medical AI models, supporting multiple benchmark datasets and evaluation scenarios. It's designed to work with the EvalFactory infrastructure for standardized model evaluation.

Available Benchmarks

The framework supports the following medical evaluation benchmarks:

Benchmark	Description	Type
medcalc_bench	Medical calculation benchmark with patient notes and ground truth answers	Medical QA
medec	Medical error detection and correction pairs	Error Detection
head_qa	Biomedical multiple-choice questions for medical knowledge testing	Medical QA
medbullets	USMLE-style medical questions with explanations	Medical QA
pubmed_qa	PubMed abstracts with yes/no/maybe questions	Medical QA
ehr_sql	Natural language to SQL query generation for clinical research	SQL Generation
race_based_med	Detection of race-based biases in medical LLM outputs	Bias Detection
medhallu	Classification of factual vs hallucinated medical answers	Hallucination Detection

Quick Start

1. Environment Setup

First, ensure you have the required environment variables set:

# Set your API keys
export OPENAI_API_KEY="your-api-key-here"

# Set Python path if necessary
export PYTHONPATH="$PYTHONPATH:."

2. Running Your First Benchmark

Method 1: Using `eval-factory` (Recommended)

eval-factory is a wrapper that simplifies the HELM benchmark process by handling configuration generation, benchmark execution, and result formatting automatically.

What eval-factory does internally:

Configuration Processing: Loads your YAML config and merges it with framework defaults
Dynamic Config Generation: Creates the necessary HELM model configurations dynamically
Benchmark Execution: Runs the HELM benchmark with proper parameters
Result Processing: Formats and saves results in standardized YAML format

Create a configuration file (e.g., my_test.yml):

config:
  type: medcalc_bench  # Choose from available benchmarks
  output_dir: results/my_test
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY

Run the evaluation:

eval-factory run_eval \
    --output_dir results/my_test \
    --run_config my_test.yml
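
If you launch evaluations from a script, the command above can be assembled as an argument list. The following is a minimal sketch, not part of the package: the helper name `build_run_eval_command` is hypothetical, and it uses only the flags that appear in this README (`--output_dir`, `--run_config`, `--dry_run`).

```python
import shlex
from typing import List


def build_run_eval_command(output_dir: str, run_config: str, dry_run: bool = False) -> List[str]:
    """Assemble an eval-factory argument list (hypothetical helper)."""
    cmd = ["eval-factory", "run_eval", "--output_dir", output_dir, "--run_config", run_config]
    if dry_run:
        # --dry_run is the flag described in the "Dry Run Mode" section
        cmd.append("--dry_run")
    return cmd


cmd = build_run_eval_command("results/my_test", "my_test.yml")
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To actually launch it (requires eval-factory on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.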

Internal Process Breakdown:

Config Loading & Validation:
- Loads your YAML configuration
- Validates against framework schema
- Merges with default parameters from framework.yml
Dynamic Model Config Generation:
- Calls scripts/generate_dynamic_model_configs.py
- Creates model-specific configuration files
- Handles provider-specific API endpoints and authentication
HELM Benchmark Execution:
- Executes helm-run with generated configurations
- Downloads and prepares benchmark datasets
- Runs evaluations with specified parameters
- Caches responses for efficiency
Result Processing:
- Collects raw benchmark results
- Formats into standardized YAML output
- Saves results in your specified output directory
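
The "merges with default parameters" step can be pictured as a recursive dictionary merge in which your values win over the defaults. This is a sketch under assumptions: the actual merge logic inside eval-factory is not shown in this README, and the default values below are made up for illustration.

```python
from typing import Any, Dict


def deep_merge(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values from overrides replace or extend defaults."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys: the override wins
            merged[key] = value
    return merged


# Hypothetical defaults; the real ones live in framework.yml
framework_defaults = {"config": {"params": {"parallelism": 1, "limit_samples": None}}}
user_config = {"config": {"type": "medcalc_bench", "params": {"parallelism": 4}}}

effective = deep_merge(framework_defaults, user_config)
print(effective)
```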

Method 2: Using `helm-run` directly

helm-run \
  --run-entries medcalc_bench:model=openai/gpt-4 \
  --suite my-suite \
  --max-eval-instances 10 \
  --num-train-trials 1 \
  -o results/my_test
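
The value passed to `--run-entries` follows the pattern `scenario:key=value,key=value`, as in the command above and in the upstream `mmlu` example later in this README. A small sketch for building such strings; the helper name `format_run_entry` is hypothetical.

```python
def format_run_entry(scenario: str, **args: str) -> str:
    """Build a run entry such as 'medcalc_bench:model=openai/gpt-4'."""
    if not args:
        return scenario
    # Keyword arguments keep their call order, so the entry is deterministic
    joined = ",".join(f"{key}={value}" for key, value in args.items())
    return f"{scenario}:{joined}"


print(format_run_entry("medcalc_bench", model="openai/gpt-4"))
print(format_run_entry("mmlu", subject="philosophy", model="openai/gpt2"))
```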

Comparison: eval-factory vs helm-run

Feature	`eval-factory`	`helm-run`
Configuration	Simple YAML config	Complex command-line arguments
Model Setup	Automatic config generation	Manual model registration required
Provider Support	Built-in adapter handling	Requires custom model configs
Results Format	Standardized YAML output	Native HELM format only
Ease of Use	Beginner-friendly	Advanced users only
Integration	EvalFactory compatible	HELM-specific

Recommendation: Use eval-factory for most use cases, especially when working with EvalFactory. Use helm-run only when you need fine-grained control over HELM's native features.

3. Understanding the Output

After running a benchmark, you'll find results in your specified output directory:

results/my_test/
├── responses/          # Raw model responses
├── cache.db           # Cached responses for efficiency
├── instances.jsonl    # Evaluation instances
├── results.jsonl      # Final evaluation results
├── model_configs/     # Generated HELM model configurations
└── evaluation_config.yaml  # Standardized evaluation results

Generated Files Explanation:

responses/: Contains raw API responses from the model for each evaluation instance
cache.db: SQLite database caching responses to avoid re-running identical queries
instances.jsonl: The evaluation instances (questions, prompts, etc.) used in the benchmark
results.jsonl: HELM's native results format with detailed metrics
model_configs/: Dynamically generated configuration files for the specific model and provider
evaluation_config.yaml: Standardized results in YAML format compatible with EvalFactory
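
The `.jsonl` files hold one JSON object per line, so they can be read with the standard library alone. The sketch below makes no assumption about the field names inside each record, since this README does not document them; `parse_jsonl` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in data; real records come from results/my_test/results.jsonl
sample = ['{"id": 1}', "", '{"id": 2}']
records = list(parse_jsonl(sample))
print(len(records), "records")
```

With a real run you would pass an open file instead of the sample list, for example `with open("results/my_test/results.jsonl", encoding="utf-8") as f: records = list(parse_jsonl(f))`.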

Key Advantage: eval-factory automatically handles the complexity of HELM configuration generation, making it much easier to run benchmarks compared to using helm-run directly.

Step-by-Step Guide

Step 1: Choose Your Benchmark

Select from the available benchmarks based on your evaluation needs:

For general medical QA: medcalc_bench, head_qa, medbullets
For error detection: medec
For research applications: pubmed_qa, ehr_sql
For safety evaluation: race_based_med, medhallu

Step 2: Configure Your Model

Create a YAML configuration file with your model details. Here are examples for different providers:

OpenAI Configuration

config:
  type: medcalc_bench
  output_dir: results/openai_test
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY

NVIDIA AI Foundation Models (build.nvidia.com)

config:
  type: pubmed_qa
  output_dir: results/nim_test
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY

NVIDIA Cloud Function (nvcf)

config:
  type: ehr_sql
  output_dir: results/nvcf_test
target:
  api_endpoint:
    url: https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/13e4f873-9d52-4ba9-8194-61baf8dc2bc9/
    model_id: meta-llama/Llama-3.3-70B-Instruct
    type: chat
    api_key: OPENAI_API_KEY
    adapter_config:
      use_nvcf: true

Model Naming Conventions

Different providers use different model ID formats:

OpenAI: gpt-4, gpt-3.5-turbo, text-davinci-003
NVIDIA: meta-llama/Llama-3.3-70B-Instruct, mistral-7b-instruct

Note: NVCF requires a specific function ID in the URL and the use_nvcf: true adapter configuration.

Step 3: Set Up API Credentials

Ensure your API credentials are properly configured:

# For OpenAI models
export OPENAI_API_KEY="sk-..."

# For NVIDIA AI Foundation Models (build.nvidia.com)
export OPENAI_API_KEY="nvapi-..."  # Uses same env var as OpenAI

# For NVIDIA Cloud Function (nvcf)
export OPENAI_API_KEY="nvapi-..."  # Uses same env var as OpenAI

# Note: NVIDIA services typically use the same OPENAI_API_KEY environment variable
# but with NVIDIA-specific API keys (nvapi-... format)
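
In the configuration examples, `api_key: OPENAI_API_KEY` appears to name an environment variable rather than hold the key itself. Assuming that reading is right, a pre-flight check can confirm the variable is set before a long run starts. The helper `resolve_api_key` is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env_var_name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Look up the key named in the config, failing early if it is missing."""
    env = os.environ if environ is None else environ
    value = env.get(env_var_name, "")
    if not value:
        raise RuntimeError(f"Environment variable {env_var_name} is not set or is empty")
    return value


# Explicit mapping used here so the example does not depend on your shell
key = resolve_api_key("OPENAI_API_KEY", {"OPENAI_API_KEY": "nvapi-example"})
print("key found, length", len(key))
```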

Step 4: Run the Evaluation

Execute the benchmark using one of the methods above. The framework will:

Load the configuration and validate parameters
Generate model configs dynamically for the specified model
Download and prepare the benchmark dataset
Run evaluations on the specified number of instances
Cache responses for efficiency and reproducibility
Generate results in standardized format

Step 5: Analyze Results

Review the generated results:

# View raw results
cat results/my_test/results.jsonl

# Use HELM tools for analysis
helm-summarize --suite my-suite
helm-server  # Start web interface to view results

Advanced Configuration

Customizing Evaluation Parameters

You can customize various parameters in your configuration:

config:
  type: medcalc_bench
  output_dir: results/advanced_test
  params:
    limit_samples: 100        # Limit number of evaluation instances
    parallelism: 4           # Number of parallel threads
    extra:
      num_train_trials: 3    # Number of training trials
      max_length: 2048       # Maximum token length
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY

Advanced Configuration Parameters

The config.params.extra section provides additional parameters for fine-tuning evaluations:

`data_path`

Purpose: Custom data path for scenarios that support it
Supported Scenarios: ehrshot, clear, medalign, n2c2_ct_matching
Example: "/path/to/custom/data"
Description: Overrides the default data location for the scenario

`num_output_tokens`

Purpose: Maximum number of tokens the model is allowed to generate in its response
Scope: Controls only the output length, not the total sequence length
Example: 1000 limits model responses to 1000 tokens
Use Case: Useful for controlling response length in generation tasks

`max_length`

Purpose: Maximum total length for the entire input-output sequence (input + output combined)
Scope: Controls the combined length of both prompt and response
Example: 2048 limits total conversation to 2048 tokens
Difference from num_output_tokens: This controls total sequence length, while num_output_tokens only controls response length
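
If `max_length` is enforced as a hard cap on prompt plus response, as described above, the room left for the prompt is the difference between the two settings. A small worked example under that assumption; `prompt_token_budget` is a hypothetical helper.

```python
def prompt_token_budget(max_length: int, num_output_tokens: int) -> int:
    """Tokens left for the prompt once the response allowance is reserved."""
    if num_output_tokens > max_length:
        raise ValueError("num_output_tokens cannot exceed max_length")
    return max_length - num_output_tokens


# Values taken from the full example configuration in this README
print(prompt_token_budget(4096, 1000))  # 3096
```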

`subject`

Purpose: Specific task or subset to evaluate within a scenario
Examples by Scenario:
- ehrshot: "guo_readmission", "new_hypertension", "lab_anemia"
- n2c2_ct_matching: "ABDOMINAL", "ADVANCED-CAD", "CREATININE"
- clear: "major_depression", "bipolar_disorder", "substance_use_disorder"
Description: Filters the evaluation to a specific prediction task or medical condition

`condition`

Purpose: Specific condition or scenario variant to evaluate
Supported Scenarios: clear
Examples: "alcohol_dependence", "chronic_pain", "homelessness"
Description: Used by scenarios like 'clear' to specify medical conditions for evaluation

`num_train_trials`

Purpose: Number of training trials for few-shot evaluation
Behavior: Each trial samples a different set of in-context examples
Example: 3 runs the evaluation 3 times with different examples
Use Case: Useful for robust evaluation with multiple few-shot configurations
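
To illustrate what "each trial samples a different set of in-context examples" means, the sketch below draws one independent, reproducible sample per trial. It is not HELM's sampling code: the seeding scheme, the function name, and the example pool are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def sample_trials(pool: Sequence[str], shots: int, num_train_trials: int, seed: int = 0) -> List[List[str]]:
    """Draw one few-shot example set per trial."""
    trials = []
    for trial in range(num_train_trials):
        # A separate seeded generator per trial keeps every draw reproducible
        rng = random.Random(seed + trial)
        trials.append(rng.sample(list(pool), shots))
    return trials


pool = [f"example_{i}" for i in range(20)]
trials = sample_trials(pool, shots=5, num_train_trials=3)
for index, examples in enumerate(trials):
    print(index, examples)
```

Scores are then typically reported across trials, which reduces sensitivity to any single choice of examples.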

Example Configuration with All Parameters

config:
  type: ehrshot
  output_dir: results/ehrshot_evaluation
  params:
    limit_samples: 500
    parallelism: 2
    extra:
      data_path: "/custom/path/to/ehrshot/data"
      num_output_tokens: 1000
      max_length: 4096
      subject: "guo_readmission"
      num_train_trials: 3
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY

Running Multiple Benchmarks

To run multiple benchmarks on the same model:

# Create separate config files for each benchmark
eval-factory run_eval --output_dir results/medcalc_test --run_config medcalc_config.yml
eval-factory run_eval --output_dir results/medec_test --run_config medec_config.yml
eval-factory run_eval --output_dir results/head_qa_test --run_config head_qa_config.yml
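
When the configs differ only in the benchmark name, they can be rendered from one template. A sketch under assumptions: the file naming scheme and the helper `render_configs` are hypothetical, and the target block simply repeats the OpenAI example from this README.

```python
from typing import Dict, List

CONFIG_TEMPLATE = """\
config:
  type: {benchmark}
  output_dir: results/{benchmark}_test
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY
"""


def render_configs(benchmarks: List[str]) -> Dict[str, str]:
    """Map a config file name to its YAML text for each benchmark."""
    return {f"{name}_config.yml": CONFIG_TEMPLATE.format(benchmark=name) for name in benchmarks}


configs = render_configs(["medcalc_bench", "medec", "head_qa"])
for filename in configs:
    print(filename)
```

Write each entry to disk (for example with `pathlib.Path(filename).write_text(text)`) and pass the file to `--run_config`.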

Dry Run Mode

Test your configuration without running the full evaluation:

eval-factory run_eval \
    --output_dir results/test \
    --run_config my_config.yml \
    --dry_run

This will show you the rendered configuration and command without executing the benchmark.

Troubleshooting

Common Issues

API Key Errors: Ensure your API keys are properly set and valid
Model Not Found: Verify the model ID and endpoint URL are correct
Memory Issues: Reduce parallelism or limit_samples for large models
Timeout Errors: Increase timeout settings or reduce batch sizes

Debug Mode

Enable debug logging for detailed information:

eval-factory --debug run_eval \
    --output_dir results/debug_test \
    --run_config debug_config.yml

Checking Available Tasks

List all available evaluation types:

eval-factory ls

Examples from commands.sh

Here are some practical examples from the project:

Basic Medical Calculation Benchmark

eval-factory run_eval \
    --output_dir test_cases/test_case_nim_llama_3_1_8b_medcalc_bench \
    --run_config test_cases/test_case_nim_llama_3_1_8b_medcalc_bench.yml

Medical Error Detection

eval-factory run_eval \
    --output_dir test_cases/test_case_nim_llama_3_1_8b_medec \
    --run_config test_cases/test_case_nim_llama_3_1_8b_medec.yml

Biomedical QA

eval-factory run_eval \
    --output_dir test_cases/test_case_nim_llama_3_1_8b_head_qa \
    --run_config test_cases/test_case_nim_llama_3_1_8b_head_qa.yml

Running Evaluations with Judges

The HELM framework supports multi-judge evaluations for scenarios that require human-like assessment of model outputs. This is particularly useful for tasks like medical treatment plan generation, where multiple AI judges can provide more robust and reliable evaluations.

Overview of Multi-Judge Setup

The framework supports three types of judges:

GPT Judge: Uses OpenAI GPT models for evaluation
Llama Judge: Uses Llama models for evaluation
Claude Judge: Uses Anthropic Claude models for evaluation

Each judge can use its own API key, which lets you apply separate rate limits and track costs per judge.

Authentication Systems

The framework supports two authentication methods for judge models:

1. Direct Judge API Keys (Recommended for Production)

Set individual API keys for each judge type:

# API key for the main model being evaluated
export OPENAI_API_KEY="your-main-model-api-key"

# API keys for the three judges (annotators)
export GPT_JUDGE_API_KEY="your-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="your-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="your-claude-judge-api-key"

2. OAuth 2.0 Client Credentials Flow (Advanced)

Use NVIDIA's OAuth system for automatic token management:

# OAuth 2.0 credentials for automatic token generation
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
export OPENAI_TOKEN_URL="https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token"
export OPENAI_SCOPE="awsanthropic-readwrite"

# Main API key (still required)
export OPENAI_API_KEY="your-main-model-api-key"
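
For orientation, a client credentials request in standard OAuth 2.0 (RFC 6749, section 4.4) is a form-encoded POST to the token URL. The sketch below only builds the form body from the variables above and sends nothing. Whether NVIDIA's endpoint expects the credentials in the body, as here, or in an HTTP Basic header is not stated in this README, so treat the layout as an assumption; `build_token_request_body` is a hypothetical helper.

```python
from typing import Mapping
from urllib.parse import parse_qs, urlencode


def build_token_request_body(environ: Mapping[str, str]) -> str:
    """Form-encode a client credentials grant from OPENAI_* variables."""
    required = ("OPENAI_CLIENT_ID", "OPENAI_CLIENT_SECRET")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    fields = {
        "grant_type": "client_credentials",
        "client_id": environ["OPENAI_CLIENT_ID"],
        "client_secret": environ["OPENAI_CLIENT_SECRET"],
    }
    if environ.get("OPENAI_SCOPE"):
        fields["scope"] = environ["OPENAI_SCOPE"]
    return urlencode(fields)


body = build_token_request_body({
    "OPENAI_CLIENT_ID": "nvssa-prd-your-client-id",
    "OPENAI_CLIENT_SECRET": "ssap-your-client-secret",
    "OPENAI_SCOPE": "awsanthropic-readwrite",
})
print(sorted(parse_qs(body)))  # field names only, so no secret is printed
```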

How Authentication Priority Works

The system follows this exact priority order:

First Priority: Judge-specific environment variables
- GPT_JUDGE_API_KEY for GPT models
- LLAMA_JUDGE_API_KEY for Llama models
- CLAUDE_JUDGE_API_KEY for Claude models
Second Priority: Fallback to main API key
- If judge keys aren't set, automatically uses OPENAI_API_KEY
- System logs: "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY"
Third Priority: Credentials configuration
- Falls back to credentials.conf or deployment-specific keys

Important: OAuth-generated tokens are NOT automatically used for judge API keys. The OAuth system is separate and serves different purposes.
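
The priority order can be sketched as a simple lookup chain. This mirrors the three steps listed above but is not the framework's code: the function name is hypothetical, and the third step is only reported, not implemented, because the format of credentials.conf is not described in this README.

```python
from typing import Mapping, Optional, Tuple

JUDGE_ENV_VARS = {
    "gpt": "GPT_JUDGE_API_KEY",
    "llama": "LLAMA_JUDGE_API_KEY",
    "claude": "CLAUDE_JUDGE_API_KEY",
}


def resolve_judge_key(judge: str, environ: Mapping[str, str]) -> Tuple[Optional[str], str]:
    """Return (key, source) following the documented priority order."""
    judge_var = JUDGE_ENV_VARS[judge]
    if environ.get(judge_var):
        return environ[judge_var], judge_var  # first priority: judge-specific key
    if environ.get("OPENAI_API_KEY"):
        return environ["OPENAI_API_KEY"], "OPENAI_API_KEY"  # second: main key
    return None, "credentials.conf"  # third: left to the credentials configuration


env = {"OPENAI_API_KEY": "nvapi-main", "GPT_JUDGE_API_KEY": "nvapi-gpt"}
for judge in JUDGE_ENV_VARS:
    _, source = resolve_judge_key(judge, env)
    print(judge, "->", source)
```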

OAuth 2.0 System Details

The OAuth system provides:

Automatic Token Creation: Generates access tokens using client credentials
Token Caching: Stores tokens in memory and disk ({service_name}_oauth_token.json)
Automatic Refresh: Refreshes expired tokens automatically
Scope Control: Different permissions per service:
- azureopenai-readwrite for GPT services
- awsanthropic-readwrite for Claude services

When to Use OAuth:

Better security (client credentials vs. long-lived API keys)
Automatic token management
Centralized billing and rate limiting
Enterprise-grade authentication

When to Use Direct API Keys:

Simpler setup
Direct control over each judge's API key
Different providers for different judges
Testing and development scenarios

Security Features

API Key Protection: The system automatically sanitizes error messages to prevent API keys from appearing in logs. Any API key patterns (like nvapi-..., sk-..., hf_...) are automatically replaced with [API_KEY_REDACTED] before logging.
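
The idea behind this sanitization can be shown with a regular expression over the three prefixes named above. This is an illustrative sketch, not the pattern the framework actually uses; the character class for the key body is an assumption.

```python
import re

# Matches nvapi-..., sk-..., and hf_... tokens that start at a word boundary
API_KEY_PATTERN = re.compile(r"\b(?:nvapi-|sk-|hf_)[A-Za-z0-9_-]+")


def redact_api_keys(message: str) -> str:
    """Replace anything that looks like an API key before it is logged."""
    return API_KEY_PATTERN.sub("[API_KEY_REDACTED]", message)


print(redact_api_keys("401 Unauthorized for key nvapi-abc123XYZ"))
```

The leading word boundary keeps ordinary words that merely contain `sk-` (such as "task-force") from being redacted.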

Configuration for Multi-Judge Evaluations

Basic Configuration (Direct API Keys)

config:
  type: mtsamples_replicate  # Example scenario that uses judges
  output_dir: results/multi_judge_test
  params:
    limit_samples: 10
    parallelism: 1
    extra:
      num_train_trials: 1
      max_length: 2048
      # Different API keys for each judge
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.1-8b-instruct
    type: chat
    api_key: OPENAI_API_KEY

Advanced Configuration (OAuth + Direct Keys)

config:
  type: mtsamples_replicate
  output_dir: results/oauth_multi_judge_test
  params:
    limit_samples: 50
    parallelism: 2
    extra:
      num_train_trials: 3
      max_length: 2048
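      # Only the GPT judge key is set explicitly here
      gpt_judge_api_key: GPT_JUDGE_API_KEY  # Direct key for GPT
      # Llama and Claude judge keys are left unset; per the authentication priority order, they fall back to OPENAI_API_KEY rather than to OAuth-generated tokens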
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY

Supported Scenarios with Judges

Currently, the following scenarios support multi-judge evaluations:

Scenario	Description	Judge Types Used
mtsamples_replicate	Generate treatment plans based on clinical notes	GPT, Llama, Claude
mtsamples_procedures	Document and extract information about medical procedures	GPT, Llama, Claude
aci_bench	Extract and structure information from patient-doctor conversations	GPT, Llama, Claude
medication_qa	Answer consumer medication-related questions	GPT, Llama, Claude
medi_qa	Retrieve and rank answers based on medical question understanding	GPT, Llama, Claude
med_dialog	Generate summaries of doctor-patient conversations	GPT, Llama, Claude

Complete Setup Guide

Method 1: Direct API Keys (Simplest)

# 1. Set up environment variables
export OPENAI_API_KEY="nvapi-your-main-api-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-api-key"

# 2. Run the evaluation
eval-factory run_eval \
    --output_dir results/multi_judge_test \
    --run_config multi_judge_config.yml

Method 2: OAuth 2.0 System (Enterprise)

# 1. Set up OAuth credentials
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
export OPENAI_TOKEN_URL="https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token"
export OPENAI_SCOPE="awsanthropic-readwrite"

# 2. Set main API key (still required)
export OPENAI_API_KEY="nvapi-your-main-api-key"

# 3. Run the evaluation
eval-factory run_eval \
    --output_dir results/oauth_multi_judge_test \
    --run_config oauth_multi_judge_config.yml

Method 3: Hybrid Approach (Flexible)

# 1. Set OAuth credentials for automatic token generation
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"

# 2. Override specific judge with direct API key
export GPT_JUDGE_API_KEY="nvapi-gpt-specific-key"

# 3. Set main API key
export OPENAI_API_KEY="nvapi-your-main-api-key"

# 4. Run the evaluation
eval-factory run_eval \
    --output_dir results/hybrid_multi_judge_test \
    --run_config hybrid_multi_judge_config.yml

Method 4: Using `helm-run` directly

# Set up environment variables (any of the above methods)
export OPENAI_API_KEY="nvapi-your-main-api-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-api-key"

# Run the evaluation
helm-run \
  --run-entries mtsamples_replicate:model=openai/gpt-4 \
  --suite my-suite \
  --max-eval-instances 10 \
  --num-train-trials 1 \
  -o results/multi_judge_test

Advanced Judge Configuration

Using Different API Keys for Each Judge

You can use completely different API keys for each judge:

export GPT_JUDGE_API_KEY="nvapi-gpt-judge-1"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-2"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-3"

Using the Same API Key for All Judges

If you want to use the same API key for all judges:

export GPT_JUDGE_API_KEY="nvapi-shared-key"
export LLAMA_JUDGE_API_KEY="nvapi-shared-key"
export CLAUDE_JUDGE_API_KEY="nvapi-shared-key"

OAuth Token Management

Check OAuth Token Status:

# Look for OAuth token files
ls -la *_oauth_token.json

# Check token expiration
jq '.expires_at' openai_oauth_token.json

Force Token Refresh:

# The system automatically refreshes expired tokens
# You can also manually trigger refresh by deleting token files
rm *_oauth_token.json
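
The same expiry check can be done in Python. The sketch assumes `expires_at` is a Unix timestamp in seconds. This README shows the field name but not its format, so adjust the comparison if your token files store something else (for example an ISO 8601 string). The function name and the 60-second safety margin are also assumptions.

```python
import json
import time
from typing import Optional


def token_is_expired(token_json: str, now: Optional[float] = None, skew_seconds: float = 60.0) -> bool:
    """True if the cached token has expired or will within skew_seconds."""
    token = json.loads(token_json)
    current = time.time() if now is None else now
    return float(token["expires_at"]) <= current + skew_seconds


# Stand-in token contents; real ones come from {service_name}_oauth_token.json
cached = '{"access_token": "example", "expires_at": 1000}'
print(token_is_expired(cached, now=500))
print(token_is_expired(cached, now=950))
```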

OAuth Scopes for Different Services:

# For GPT services
export OPENAI_SCOPE="azureopenai-readwrite"

# For Claude services  
export OPENAI_SCOPE="awsanthropic-readwrite"

# For general access
export OPENAI_SCOPE="awsanthropic-readwrite"

Example Multi-Judge Evaluation

Here's a complete example for running a multi-judge evaluation:

# 1. Create configuration file (multi_judge_config.yml)
cat > multi_judge_config.yml << EOF
config:
  type: mtsamples_replicate
  output_dir: results/multi_judge_test
  params:
    limit_samples: 50
    parallelism: 2
    extra:
      num_train_trials: 3
      max_length: 2048
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY
EOF

# 2. Set environment variables
export OPENAI_API_KEY="nvapi-main-model-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-key"

# 3. Run the evaluation
eval-factory run_eval \
    --output_dir results/multi_judge_test \
    --run_config multi_judge_config.yml

Troubleshooting Multi-Judge Evaluations

Check Environment Variables

Verify your environment variables are set correctly:

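# Prints "set" for each variable that is defined, without revealing its value
echo "Main API Key: ${OPENAI_API_KEY:+set}"
echo "GPT Judge: ${GPT_JUDGE_API_KEY:+set}"
echo "Llama Judge: ${LLAMA_JUDGE_API_KEY:+set}"
echo "Claude Judge: ${CLAUDE_JUDGE_API_KEY:+set}"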

Check OAuth Configuration

Verify OAuth credentials are properly set:

echo "Client ID: $OPENAI_CLIENT_ID"
echo "Client Secret: ${OPENAI_CLIENT_SECRET:+set}"  # avoid printing the secret itself
echo "Token URL: $OPENAI_TOKEN_URL"
echo "Scope: $OPENAI_SCOPE"

Debug Mode

Enable debug logging to see which API keys are being used:

eval-factory --debug run_eval \
    --output_dir results/debug_multi_judge \
    --run_config multi_judge_config.yml

Common Issues and Solutions

Issue: "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY"

Cause: Judge API key not set, system falling back to main API key
Solution: Set the specific judge API key or accept the fallback

Issue: "Missing environment variables for openai token"

Cause: OAuth credentials not properly configured
Solution: Set OPENAI_CLIENT_ID and OPENAI_CLIENT_SECRET

Issue: "Error creating openai OAuth token"

Cause: Invalid credentials or network issues
Solution: Verify credentials and check network connectivity

Issue: API key appears in logs

Cause: This should not occur in versions that include API key sanitization
Solution: Check that you are running the latest version, which includes API key sanitization

Log Analysis

Look for these log patterns:

# Judge API key usage
grep "Using.*judge API key" logs/*.log

# OAuth token creation
grep "Creating new.*OAuth token" logs/*.log

# API key fallbacks
grep "is not set, setting to" logs/*.log

# Authentication errors
grep "Authentication error detected" logs/*.log

Performance Monitoring

Check API Key Usage:

# Monitor which API keys are being used
grep "Using.*API key.*ends with" logs/*.log

# Check for rate limiting
grep "rate limit\|429" logs/*.log

# Monitor OAuth token refresh
grep "token expired\|refreshing" logs/*.log

Look for messages like:

Using GPT judge API key from environment variable for model: nvidia/gpt4o-abc123
Using Llama judge API key from environment variable for model: nvdev/meta/llama-3.3-70b-instruct-def456
Using Claude judge API key from environment variable for model: nvidia/claude-3-7-sonnet-20250219-ghi789
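
To tally how often each judge key is used, messages of this shape can be counted with a regular expression. The pattern is inferred from the three sample lines above and may need adjusting if the real log format differs; `count_judge_usage` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable

JUDGE_LINE = re.compile(
    r"Using (GPT|Llama|Claude) judge API key from environment variable for model: (\S+)"
)


def count_judge_usage(lines: Iterable[str]) -> Counter:
    """Count log lines per judge type."""
    counts: Counter = Counter()
    for line in lines:
        match = JUDGE_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


log_lines = [
    "Using GPT judge API key from environment variable for model: nvidia/gpt4o-abc123",
    "Using Llama judge API key from environment variable for model: nvdev/meta/llama-3.3-70b-instruct-def456",
    "Using GPT judge API key from environment variable for model: nvidia/gpt4o-abc123",
    "unrelated line",
]
usage = count_judge_usage(log_lines)
print(dict(usage))
```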

Common Issues

Environment variables not loaded: Make sure your environment variables are set before running the command
API key format: Ensure your API keys start with nvapi- for NVIDIA services
Configuration file: Verify your YAML configuration file references the correct environment variable names
Judge model availability: Ensure the judge models are available through your API endpoints

Benefits of Multi-Judge Evaluations

Better rate limiting: Each judge can have its own rate limits
Cost tracking: Track costs separately for each judge
Flexibility: Use different API keys for different purposes
Security: Isolate API keys for different components
Robustness: Multiple judges provide more reliable evaluations
Diversity: Different judge models may catch different types of errors

Integration with EvalFactory

This framework is designed to work seamlessly with the EvalFactory infrastructure:

Standardized Output: Results are generated in a format compatible with EvalFactory
Configuration Management: Uses YAML-based configuration for easy integration
Caching: Built-in caching for efficient re-runs and reproducibility
Extensibility: Easy to add new benchmarks and evaluation metrics

Contributing

To add new benchmarks or modify existing ones:

Update framework.yml with new benchmark definitions
Implement the benchmark logic in the appropriate adapter
Add test cases and documentation
Update this README with new benchmark information

References

For more detailed information about specific benchmarks and their implementations, refer to the individual benchmark documentation and the main HELM repository.

Holistic Evaluation of Language Models (HELM)

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. This framework includes the following features:

Datasets and benchmarks in a standardized format (e.g. MMLU-Pro, GPQA, IFEval, WildBench)
Models from various providers accessible through a unified interface (e.g. OpenAI models, Anthropic Claude, Google Gemini)
Metrics for measuring various aspects beyond accuracy (e.g. efficiency, bias, toxicity)
Web UI for inspecting individual prompts and responses
Web leaderboard for comparing results across models and benchmarks

Documentation

Please refer to the documentation on Read the Docs for instructions on how to install and run HELM.

Quick Start

Install the package from PyPI:

pip install crfm-helm

Run the following in your shell:

# Run benchmark
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10

# Summarize benchmark results
helm-summarize --suite my-suite

# Start a web server to display benchmark results
helm-server --suite my-suite

Then go to http://localhost:8000/ in your browser.

Attribution

This NVIDIA fork of HELM is based on the original Stanford CRFM HELM framework. The original framework was created by the Center for Research on Foundation Models (CRFM) at Stanford and is licensed under the Apache License 2.0.

Leaderboards

We maintain official leaderboards with results from evaluating recent models on notable benchmarks using this framework.

We also maintain leaderboards for a diverse range of domains (e.g. medicine, finance) and aspects (e.g. multi-linguality, world knowledge, regulation compliance). Refer to the HELM website for a full list of leaderboards.

Papers

The HELM framework was used in the following papers for evaluating models.

Holistic Evaluation of Language Models - paper, leaderboard
Holistic Evaluation of Vision-Language Models (VHELM) - paper, leaderboard, documentation
Holistic Evaluation of Text-To-Image Models (HEIM) - paper, leaderboard, documentation
Image2Struct: Benchmarking Structure Extraction for Vision-Language Models - paper
Enterprise Benchmarks for Large Language Model Evaluation - paper, documentation
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness - paper, leaderboard
Reliable and Efficient Amortized Model-based Evaluation - paper, documentation
MedHELM - paper in progress, leaderboard, documentation

The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the main Reproducing Leaderboards documentation.

Citation

If you use this software in your research, please cite the Holistic Evaluation of Language Models paper as below.

@article{liang2023holistic,
title={Holistic Evaluation of Language Models},
author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=iO4LZibEqW},
note={Featured Certification, Expert Certification}
}

Attribution and Acknowledgments

Original Project

This project is a fork of the Holistic Evaluation of Language Models (HELM) framework created by the Center for Research on Foundation Models (CRFM) at Stanford.

Original Repository: https://github.com/stanford-crfm/helm
Original Documentation: https://crfm.stanford.edu/helm
Original Paper: Holistic Evaluation of Language Models
Original Authors: Stanford CRFM Team
Original License: Apache License 2.0

Fork Information

Fork Maintainer: NVIDIA
Fork Purpose: Medical AI evaluation and EvalFactory integration

License

This fork is released under the same Apache License 2.0 as the original project, in accordance with the original license terms.
