NeMo Evaluator — benchmark environments, pluggable solvers, interceptor proxy, and decision-grade scoring for LLMs
NeMo Evaluator
Documentation | GitHub | Issues
LLM evaluation framework with benchmark environments, pluggable solvers, composable interceptor proxy, and multi-format reporting.
Install
pip install -e . # core
pip install -e ".[scoring]" # + sympy for symbolic math
pip install -e ".[stats]" # + scipy (regression analysis)
pip install -e ".[scoring,stats]" # + sympy + scipy for confidence intervals
pip install -e ".[harbor]" # + Harbor agents (OpenHands, Terminus-2)
pip install -e ".[inspect]" # + Inspect AI log export
pip install -e ".[all]" # common runtime integrations
Quick Start
export NVIDIA_API_KEY="your-api-key-here"
# Run a benchmark from the CLI
nel eval run --bench mmlu \
--model-url https://integrate.api.nvidia.com/v1 \
--model-id nvidia/nemotron-3-super-120b-a12b \
--api-key $NVIDIA_API_KEY \
--repeats 3 --max-problems 100
# Run from a YAML config
nel eval run config.yaml
nel eval run config.yaml --resume
# Generate a report
nel eval report ./eval_results/ -f markdown -o report.md
Benchmarks
17 built-in benchmarks plus external harness integrations:
| Benchmark | Type | Scoring |
|---|---|---|
| mmlu, mmlu_pro, gpqa | Multichoice | multichoice_regex |
| gsm8k, math500, mgsm | Math | numeric_match / answer_line |
| drop, triviaqa | QA | fuzzy_match |
| humaneval | Code | code_sandbox (Docker) |
| simpleqa, healthbench | Judge | needs_judge |
| pinchbench | Agentic | code_sandbox / needs_judge |
| xstest | Safety | needs_judge |
| terminal-bench-hard, terminal-bench-v1 | Terminal tasks | Task test harness |
| nmp_harbor | Agentic NMP | Harbor task tests |
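To make the Scoring column more concrete, here is a rough sketch of what regex-based multiple-choice scoring and numeric matching could look like. The function names, the regular expressions, and the tolerance are illustrative assumptions; they are not taken from NeMo Evaluator's own scorers, which may extract and normalize answers differently.

```python
import re

# Hypothetical patterns: an "Answer: X" marker and a plain decimal number.
_CHOICE_RE = re.compile(r"answer\s*:\s*\(?([A-D])\)?(?![A-Za-z])", re.IGNORECASE)
_NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def multichoice_regex_score(response: str, target: str) -> bool:
    """Return True if the last extracted choice letter equals the target."""
    matches = _CHOICE_RE.findall(response)
    return bool(matches) and matches[-1].upper() == target.upper()


def numeric_match_score(response: str, target: str, tol: float = 1e-6) -> bool:
    """Return True if the last number in the response equals the target."""
    numbers = _NUMBER_RE.findall(response)
    if not numbers:
        return False
    predicted = float(numbers[-1].replace(",", ""))  # drop thousands separators
    return abs(predicted - float(target)) <= tol
```

Taking the last match rather than the first is a deliberate choice in this sketch: models often restate candidate answers while reasoning and commit to one only at the end.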
External environments via URI schemes: lm-eval://, skills://, vlmevalkit://, gym://, harbor://, container://.
Adapter Proxy
Built-in local interceptor proxy for LLM traffic. Intercepts all agent-to-model requests for caching, logging, payload modification, turn limiting, and custom transformations — no external dependencies required.
services:
nemotron:
type: api
url: https://integrate.api.nvidia.com/v1/chat/completions
protocol: chat_completions
model: nvidia/nemotron-3-super-120b-a12b
api_key: ${NVIDIA_API_KEY}
proxy:
request_timeout: 600
interceptors:
- name: turn_counter
config:
max_turns: 100
- name: drop_params
config:
params: [max_tokens]
verbose: true
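The config above chains two request-stage interceptors. As a minimal sketch of how such a chain might behave (not the proxy's actual implementation), each interceptor can be modeled as a callable that takes a request payload and returns a possibly modified one. The class and function names, the `session_id` key, and the exception type below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

Request = dict  # JSON-like chat-completions payload
Interceptor = Callable[[Request], Request]


class TurnLimitExceeded(RuntimeError):
    """Raised when a session goes past its turn budget."""


def drop_params(params: list[str]) -> Interceptor:
    """Build an interceptor that strips the named top-level parameters."""
    def intercept(request: Request) -> Request:
        return {k: v for k, v in request.items() if k not in params}
    return intercept


@dataclass
class TurnCounter:
    """Count requests per session and refuse to go past max_turns."""
    max_turns: int
    turns: dict = field(default_factory=dict)

    def __call__(self, request: Request) -> Request:
        session = request.get("session_id", "default")  # hypothetical key
        self.turns[session] = self.turns.get(session, 0) + 1
        if self.turns[session] > self.max_turns:
            raise TurnLimitExceeded(
                f"session {session!r} exceeded {self.max_turns} turns"
            )
        return request


def run_chain(request: Request, chain: list[Interceptor]) -> Request:
    """Pass the request through each interceptor in order."""
    for interceptor in chain:
        request = interceptor(request)
    return request


chain = [TurnCounter(max_turns=2), drop_params(["max_tokens"])]
payload = {"model": "demo", "max_tokens": 64, "messages": []}
cleaned = run_chain(payload, chain)  # max_tokens is removed, payload is untouched
```

Returning a new dict instead of mutating the input keeps each stage independent, which is one way to make interceptors safe to reorder.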
Available interceptors:
| Interceptor | Stage | Description |
|---|---|---|
| endpoint | request→response | Async HTTP forwarding with retry, backoff, connection pooling |
| caching | request→response | Disk-backed SQLite cache with canonical keys |
| turn_counter | request | Per-session turn counting with budget injection |
| drop_params | request | Strip named parameters from requests |
| modify_tools | request | Add/remove properties in tool schemas |
| system_message | request | Inject/replace/prepend system messages |
| payload_modifier | request | Recursive parameter add/remove/rename |
| raise_client_errors | response | Convert 4xx to exceptions |
| log_tokens | response | Log token usage per request |
| response_stats | response | Aggregate timing and token statistics |
| reasoning | response | Normalize <think> blocks to reasoning_content |
| progress_tracking | response | Progress counter with optional webhook |
| logging | request + response | Request/response logging with body preview |
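The caching interceptor is described as a disk-backed SQLite cache with canonical keys. One plausible reading, sketched below under that assumption, is that the request is serialized in a stable form (sorted keys, fixed separators) and hashed, so that two payloads differing only in key order hit the same cache row. The table layout and class name are hypothetical and not taken from the project.

```python
import hashlib
import json
import sqlite3


def canonical_key(request: dict) -> str:
    """Hash a request so that key order and whitespace do not matter."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    """Tiny SQLite-backed cache keyed by the canonical request hash."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to persist across runs.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)"
        )

    def get(self, request: dict):
        """Return the cached response, or None on a miss."""
        row = self.db.execute(
            "SELECT response FROM cache WHERE key = ?", (canonical_key(request),)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, request: dict, response: dict) -> None:
        """Store or overwrite the response for this request."""
        self.db.execute(
            "INSERT OR REPLACE INTO cache (key, response) VALUES (?, ?)",
            (canonical_key(request), json.dumps(response)),
        )
        self.db.commit()
```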
Solvers
Configured via solver.type in each benchmark:
| Solver Type | Config type | Use Case |
|---|---|---|
| SimpleSolver | simple | Standard chat/completion/VLM (default) |
| HarborSolver | harbor | Harbor agents (OpenHands, Terminus-2, etc.) |
| ToolCallingSolver | tool_calling | Tool-use with Gym resource servers |
| GymDelegationSolver | gym_delegation | Delegate to nemo-gym server |
| OpenClawSolver | openclaw | OpenClaw CLI agent |
| ContainerSolver | container | Legacy container harness |
Export
Evaluation results can be exported to experiment trackers and compatible formats:
output:
export: [inspect, wandb, mlflow]
- inspect: Produces inspect_ai-compatible EvalLog JSON files. Install with pip install -e ".[inspect]".
- wandb / mlflow: Push scores and artifacts to experiment trackers. Install with pip install -e ".[export]".
BYOB (Bring Your Own Benchmark)
from nemo_evaluator import benchmark, scorer, ScorerInput, exact_match
@benchmark(name="my-bench", dataset="hf://my-org/data?split=test",
prompt="Q: {question}\nA:", target_field="answer")
@scorer
def my_scorer(sample: ScorerInput) -> dict:
return exact_match(sample)
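The decorator arguments suggest that prompt is a format string filled in from each dataset row and that target_field names the column holding the reference answer. The sketch below illustrates that reading in plain Python without importing nemo_evaluator. The helper names, the sample row, and the whitespace and case normalization are assumptions, and the library's exact_match may behave differently.

```python
def render_prompt(template: str, row: dict) -> str:
    """Fill {placeholders} in the template from a dataset row."""
    return template.format(**row)


def normalized_exact_match(response: str, target: str) -> bool:
    """Compare after trimming whitespace and lowercasing (assumed normalization)."""
    return response.strip().lower() == target.strip().lower()


# Hypothetical dataset row with the two fields the decorator refers to.
row = {"question": "What is the capital of France?", "answer": "Paris"}
prompt = render_prompt("Q: {question}\nA:", row)
target = row["answer"]  # what target_field="answer" would presumably select
```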
Sandboxes
Per-problem Docker/SLURM sandboxes for code execution and agentic evaluation. Two modes: stateful (shared sandbox for solve + verify) and stateless (separate agent and verification containers with shared volume).
SLURM
Pyxis/Enroot-based execution with auto-selected container images per URI scheme. Uses node_pools topology for flexible resource allocation across model, agent, and sandbox nodes.
| Tag suffix | Contents |
|---|---|
| :latest | Base + gym + vlmevalkit |
| :latest-lm-eval | + lm-evaluation-harness |
| :latest-skills | + NeMo Skills |
| :latest-full | All harnesses |
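As an illustration of selecting an image per URI scheme, the sketch below maps the schemes listed under Benchmarks onto the tag suffixes in the table above. The mapping, the fallback to latest-full when several harnesses are needed, and the example URIs are assumptions made for this sketch; the project's real selection logic is not shown in this README.

```python
from urllib.parse import urlsplit

# Assumed mapping from URI scheme to image tag, following the tag table.
SCHEME_TO_TAG = {
    "gym": "latest",
    "vlmevalkit": "latest",
    "lm-eval": "latest-lm-eval",
    "skills": "latest-skills",
}


def image_tag_for(uris: list[str]) -> str:
    """Pick the smallest tag that covers every harness a run needs."""
    # Names without a scheme (built-in benchmarks) fall back to the base image.
    tags = {SCHEME_TO_TAG.get(urlsplit(uri).scheme, "latest") for uri in uris}
    tags.discard("latest")  # assumed: the base image is part of every other tag
    if not tags:
        return "latest"
    if len(tags) == 1:
        return tags.pop()
    return "latest-full"  # more than one extra harness: use the full image
```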
CLI
| Command | Purpose |
|---|---|
| nel eval run | Run evaluation (name or YAML) |
| nel eval merge <dir> | Merge sharded results |
| nel eval report <dir> | Generate reports |
| nel list | List benchmarks |
| nel serve -b <name> | Serve as HTTP endpoint |
| nel validate -b <name> | Sanity check |
| nel export <paths> --dest <exporter> | Export bundles |
| nel cache-sqsh <image> | Build a SLURM .sqsh cache image |
| nel report <dir> | Generate multi-benchmark reports |
| nel compare | Paired run comparison |
| nel gate | Multi-benchmark quality gate |
| nel config | Persistent user config |
| nel package | Containerize BYOB benchmark |
Compare Results Between Runs
Use nel compare when you want to compare two runs of the same benchmark and inspect score deltas, flips, and statistical evidence.
nel compare ./results/baseline ./results/candidate --strict
Full tutorial: docs/tutorials/compare.md
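To show what score deltas, flips, and statistical evidence can mean for a paired comparison, here is a small sketch that counts per-problem flips and runs an exact two-sided sign test on them (McNemar's exact test). This is a generic technique chosen for illustration. It is an assumption that the inputs are per-problem pass/fail lists in the same order, and nel compare may use different statistics.

```python
from math import comb


def paired_comparison(baseline: list[bool], candidate: list[bool]) -> dict:
    """Summarize per-problem flips and run an exact two-sided sign test."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("runs must be non-empty and cover the same problems")
    n = len(baseline)
    gains = sum(1 for b, c in zip(baseline, candidate) if not b and c)
    losses = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    flips = gains + losses
    # Under the null hypothesis each flip is equally likely to be a gain or a loss.
    k = min(gains, losses)
    tail = sum(comb(flips, i) for i in range(k + 1)) / 2**flips if flips else 1.0
    return {
        "delta": (sum(candidate) - sum(baseline)) / n,  # change in accuracy
        "gains": gains,    # failed in baseline, passed in candidate
        "losses": losses,  # passed in baseline, failed in candidate
        "p_value": min(1.0, 2 * tail),
    }
```

Only problems whose outcome changed carry information in this test, which is why a paired comparison can detect differences that two independent accuracy numbers would miss.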
Implement Quality Gates
Use nel gate when you want one GO / NO-GO / INCONCLUSIVE decision across multiple benchmarks from an explicit policy file.
nel gate ./results/baseline ./results/candidate \
--policy gate_policy.yaml \
--strict \
--output gate_report.json
Full tutorial: docs/tutorials/quality-gate.md
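One way to arrive at a three-valued GO / NO-GO / INCONCLUSIVE outcome is to compare a confidence interval on each benchmark's score delta against a tolerated regression. The sketch below illustrates that idea only; the data class, the max_regression parameter, and the aggregation rule are hypothetical and do not reflect the actual policy file schema or the logic of nel gate.

```python
from dataclasses import dataclass


@dataclass
class DeltaEstimate:
    """Candidate-minus-baseline score with a confidence interval."""
    benchmark: str
    ci_low: float
    ci_high: float


def gate_decision(estimates: list[DeltaEstimate], max_regression: float) -> str:
    """Combine per-benchmark intervals into one decision.

    max_regression is the largest tolerated drop, given as a positive number.
    """
    if not estimates:
        return "INCONCLUSIVE"  # nothing to judge
    verdicts = []
    for est in estimates:
        if est.ci_high < -max_regression:
            verdicts.append("NO-GO")         # even the optimistic bound is too low
        elif est.ci_low >= -max_regression:
            verdicts.append("GO")            # even the pessimistic bound is acceptable
        else:
            verdicts.append("INCONCLUSIVE")  # interval straddles the threshold
    if "NO-GO" in verdicts:
        return "NO-GO"
    if "INCONCLUSIVE" in verdicts:
        return "INCONCLUSIVE"
    return "GO"
```

In this sketch a single clear regression blocks the release, and any benchmark whose interval is too wide to decide keeps the overall result from being GO.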
Examples
See examples/configs/ for 25+ end-to-end configs covering all solver types, verification methods, and execution backends.
License