Skip to main content

NeMo Evaluator — benchmark environments, pluggable solvers, interceptor proxy, and decision-grade scoring for LLMs

Project description

NeMo Evaluator

License Python 3.12-3.13 Code style: ruff

Documentation | GitHub | Issues

LLM evaluation framework with benchmark environments, pluggable solvers, composable interceptor proxy, and multi-format reporting.


Install

pip install -e .                   # core
pip install -e ".[scoring]"        # + sympy for symbolic math
pip install -e ".[stats]"          # + scipy (regression analysis)
pip install -e ".[scoring,stats]"  # + sympy + scipy for confidence intervals
pip install -e ".[harbor]"         # + Harbor agents (OpenHands, Terminus-2)
pip install -e ".[inspect]"        # + Inspect AI log export
pip install -e ".[all]"            # common runtime integrations

Quick Start

export NVIDIA_API_KEY="your-api-key-here"

# Run a benchmark from the CLI
nel eval run --bench mmlu \
  --model-url https://integrate.api.nvidia.com/v1 \
  --model-id nvidia/nemotron-3-super-120b-a12b \
  --api-key $NVIDIA_API_KEY \
  --repeats 3 --max-problems 100

# Run from a YAML config
nel eval run config.yaml
nel eval run config.yaml --resume

# Generate a report
nel eval report ./eval_results/ -f markdown -o report.md

Benchmarks

17 built-in benchmarks plus external harness integrations:

Benchmark Type Scoring
mmlu, mmlu_pro, gpqa Multichoice multichoice_regex
gsm8k, math500, mgsm Math numeric_match / answer_line
drop, triviaqa QA fuzzy_match
humaneval Code code_sandbox (Docker)
simpleqa, healthbench Judge needs_judge
pinchbench Agentic code_sandbox / needs_judge
xstest Safety needs_judge
terminal-bench-hard, terminal-bench-v1 Terminal tasks Task test harness
nmp_harbor Agentic NMP Harbor task tests

External environments via URI schemes: lm-eval://, skills://, vlmevalkit://, gym://, harbor://, container://.

Adapter Proxy

Built-in local interceptor proxy for LLM traffic. Intercepts all agent-to-model requests for caching, logging, payload modification, turn limiting, and custom transformations — no external dependencies required.

services:
  nemotron:
    type: api
    url: https://integrate.api.nvidia.com/v1/chat/completions
    protocol: chat_completions
    model: nvidia/nemotron-3-super-120b-a12b
    api_key: ${NVIDIA_API_KEY}
    proxy:
      request_timeout: 600
      interceptors:
        - name: turn_counter
          config:
            max_turns: 100
        - name: drop_params
          config:
            params: [max_tokens]
      verbose: true

Available interceptors:

Interceptor Stage Description
endpoint request→response Async HTTP forwarding with retry, backoff, connection pooling
caching request→response Disk-backed SQLite cache with canonical keys
turn_counter request Per-session turn counting with budget injection
drop_params request Strip named parameters from requests
modify_tools request Add/remove properties in tool schemas
system_message request Inject/replace/prepend system messages
payload_modifier request Recursive parameter add/remove/rename
raise_client_errors response Convert 4xx to exceptions
log_tokens response Log token usage per request
response_stats response Aggregate timing and token statistics
reasoning response Normalize <think> blocks to reasoning_content
progress_tracking response Progress counter with optional webhook
logging request + response Request/response logging with body preview

Solvers

Configured via solver.type in each benchmark:

Solver Type Config type Use Case
SimpleSolver simple Standard chat/completion/VLM (default)
HarborSolver harbor Harbor agents (OpenHands, Terminus-2, etc.)
ToolCallingSolver tool_calling Tool-use with Gym resource servers
GymDelegationSolver gym_delegation Delegate to nemo-gym server
OpenClawSolver openclaw OpenClaw CLI agent
ContainerSolver container Legacy container harness

Export

Evaluation results can be exported to experiment trackers and compatible formats:

output:
  export: [inspect, wandb, mlflow]
  • inspect — Produces inspect_ai-compatible EvalLog JSON files. Install with pip install -e ".[inspect]".
  • wandb / mlflow — Push scores and artifacts to experiment trackers. Install with pip install -e ".[export]".

BYOB (Bring Your Own Benchmark)

from nemo_evaluator import benchmark, scorer, ScorerInput, exact_match

@benchmark(name="my-bench", dataset="hf://my-org/data?split=test",
           prompt="Q: {question}\nA:", target_field="answer")
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return exact_match(sample)

Sandboxes

Per-problem Docker/SLURM sandboxes for code execution and agentic evaluation. Two modes: stateful (shared sandbox for solve + verify) and stateless (separate agent and verification containers with shared volume).

SLURM

Pyxis/Enroot-based execution with auto-selected container images per URI scheme. Uses node_pools topology for flexible resource allocation across model, agent, and sandbox nodes.

Tag suffix Contents
:latest Base + gym + vlmevalkit
:latest-lm-eval + lm-evaluation-harness
:latest-skills + NeMo Skills
:latest-full All harnesses

CLI

Command Purpose
nel eval run Run evaluation (name or YAML)
nel eval merge <dir> Merge sharded results
nel eval report <dir> Generate reports
nel list List benchmarks
nel serve -b <name> Serve as HTTP endpoint
nel validate -b <name> Sanity check
nel export <paths> --dest <exporter> Export bundles
nel cache-sqsh <image> Build a SLURM .sqsh cache image
nel report <dir> Generate multi-benchmark reports
nel compare Paired run comparison
nel gate Multi-benchmark quality gate
nel config Persistent user config
nel package Containerize BYOB benchmark

Compare Results Between Runs

Use nel compare when you want to compare two runs of the same benchmark and inspect score deltas, flips, and statistical evidence.

nel compare ./results/baseline ./results/candidate --strict

Full tutorial: docs/tutorials/compare.md

Implement Quality Gates

Use nel gate when you want one GO / NO-GO / INCONCLUSIVE decision across multiple benchmarks from an explicit policy file.

nel gate ./results/baseline ./results/candidate \
  --policy gate_policy.yaml \
  --strict \
  --output gate_report.json

Full tutorial: docs/tutorials/quality-gate.md

Examples

See examples/configs/ for 25+ end-to-end configs covering all solver types, verification methods, and execution backends.

License

Apache 2.0

Project details


Release history Release notifications | RSS feed

This version

0.3.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nemo_evaluator-0.3.0.tar.gz (836.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

nemo_evaluator-0.3.0-py3-none-any.whl (522.7 kB view details)

Uploaded Python 3

File details

Details for the file nemo_evaluator-0.3.0.tar.gz.

File metadata

  • Download URL: nemo_evaluator-0.3.0.tar.gz
  • Upload date:
  • Size: 836.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for nemo_evaluator-0.3.0.tar.gz
Algorithm Hash digest
SHA256 0b978f1ca9b817cd92d5eab13bd58fb07c5a318580d89b79dcdd04fb7af7fa8f
MD5 a0e35d6fa5629923f56d642d928c2e86
BLAKE2b-256 ad79b223bb65b4b4567e755ed2ab7038387985443d182a6db653605c1b4ecdc9

See more details on using hashes here.

File details

Details for the file nemo_evaluator-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: nemo_evaluator-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 522.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for nemo_evaluator-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 60fdb8613e78348aa0b0b82144756279b3ea63835a826d37c5e7a8b2ae579c41
MD5 73617a4fa52da1ee954e67601aee5531
BLAKE2b-256 6266aec5d840820ce95541cc3473abfab46a6395561d193bcc05bec0dbe61fd7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page