NeMo Evaluator — benchmark environments, pluggable solvers, interceptor proxy, and decision-grade scoring for LLMs

NeMo Evaluator

Documentation | GitHub | Issues

LLM evaluation framework with benchmark environments, pluggable solvers, composable interceptor proxy, and multi-format reporting.

Install

pip install -e .                   # core
pip install -e ".[scoring]"        # + sympy for symbolic math
pip install -e ".[stats]"          # + scipy (regression analysis)
pip install -e ".[scoring,stats]"  # + sympy + scipy for confidence intervals
pip install -e ".[harbor]"         # + Harbor agents (OpenHands, Terminus-2)
pip install -e ".[inspect]"        # + Inspect AI log export
pip install -e ".[all]"            # common runtime integrations
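
The extras above pull in optional libraries (sympy for `scoring`, scipy for `stats`). A minimal sketch for checking which of them are importable in the current environment follows; the helper name `available_extras` and the mapping are illustrative, not part of the package.

```python
from importlib.util import find_spec

# Extra name -> top-level module it is described as adding in the
# install comments above. This mapping is an assumption for illustration.
OPTIONAL_MODULES = {"scoring": "sympy", "stats": "scipy"}


def available_extras(modules=OPTIONAL_MODULES):
    """Return {extra: bool}, True when the backing module can be found."""
    return {extra: find_spec(mod) is not None for extra, mod in modules.items()}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra}: {'installed' if ok else 'missing'}")
```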

Quick Start

export NVIDIA_API_KEY="your-api-key-here"

# Run a benchmark from the CLI
nel eval run --bench mmlu \
  --model-url https://integrate.api.nvidia.com/v1 \
  --model-id nvidia/nemotron-3-super-120b-a12b \
  --api-key "$NVIDIA_API_KEY" \
  --repeats 3 --max-problems 100

# Run from a YAML config
nel eval run config.yaml
nel eval run config.yaml --resume

# Generate a report
nel eval report ./eval_results/ -f markdown -o report.md

Benchmarks

17 built-in benchmarks plus external harness integrations:

Benchmark	Type	Scoring
mmlu, mmlu_pro, gpqa	Multichoice	`multichoice_regex`
gsm8k, math500, mgsm	Math	`numeric_match` / `answer_line`
drop, triviaqa	QA	`fuzzy_match`
humaneval	Code	`code_sandbox` (Docker)
simpleqa, healthbench	Judge	`needs_judge`
pinchbench	Agentic	`code_sandbox` / `needs_judge`
xstest	Safety	`needs_judge`
terminal-bench-hard, terminal-bench-v1	Terminal tasks	Task test harness
nmp_harbor	Agentic NMP	Harbor task tests
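
To give a feel for what regex-based and numeric scoring involve, here is a rough sketch. The function names mirror the scoring labels in the table, but the bodies are assumptions written for illustration and are not the library's implementations.

```python
import re


def multichoice_regex(response: str, target: str) -> bool:
    """Compare the last 'Answer: X' letter in the response with the target."""
    # The lookahead rejects words such as "Answer: Because".
    letters = re.findall(r"[Aa]nswer:\s*\(?([A-J])\)?(?![A-Za-z])", response)
    return bool(letters) and letters[-1] == target.upper()


def numeric_match(response: str, target: str, tol: float = 1e-6) -> bool:
    """Compare the last number in the response with the target value."""
    # Thousands separators are removed before extracting numbers.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return bool(numbers) and abs(float(numbers[-1]) - float(target)) <= tol


if __name__ == "__main__":
    print(multichoice_regex("Reasoning...\nAnswer: (C)", "C"))
    print(numeric_match("The total is 1,234 apples.", "1234"))
```

Both helpers take the last match, on the assumption that a model states its final answer at the end of the response.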

External environments are available via URI schemes: `lm-eval://`, `skills://`, `vlmevalkit://`, `gym://`, `harbor://`, `container://`.

Adapter Proxy

The built-in local interceptor proxy sits between the agent and the model. It intercepts every agent-to-model request for caching, logging, payload modification, turn limiting, and custom transformations, and requires no external dependencies.

services:
  nemotron:
    type: api
    url: https://integrate.api.nvidia.com/v1/chat/completions
    protocol: chat_completions
    model: nvidia/nemotron-3-super-120b-a12b
    api_key: ${NVIDIA_API_KEY}
    proxy:
      request_timeout: 600
      interceptors:
        - name: turn_counter
          config:
            max_turns: 100
        - name: drop_params
          config:
            params: [max_tokens]
      verbose: true
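
Conceptually, the config above describes a chain: request-stage interceptors run in order, the request is forwarded, then response-stage interceptors run. The sketch below models that flow in plain Python. `run_chain`, `fake_endpoint`, and the two interceptor factories are hypothetical stand-ins that borrow the interceptor names for readability; they are not the package's API, and this `turn_counter` counts globally rather than per session.

```python
from typing import Callable, Iterable

Request = dict
Response = dict


class TurnLimitExceeded(RuntimeError):
    """Raised when more requests pass through than the budget allows."""


def turn_counter(max_turns: int) -> Callable[[Request], Request]:
    seen = {"turns": 0}

    def intercept(request: Request) -> Request:
        seen["turns"] += 1
        if seen["turns"] > max_turns:
            raise TurnLimitExceeded(f"turn {seen['turns']} exceeds budget of {max_turns}")
        return request

    return intercept


def drop_params(params: Iterable[str]) -> Callable[[Request], Request]:
    dropped = frozenset(params)

    def intercept(request: Request) -> Request:
        # Return a copy without the named top-level parameters.
        return {k: v for k, v in request.items() if k not in dropped}

    return intercept


def run_chain(request, request_stage, endpoint, response_stage=()):
    """Apply request interceptors, call the endpoint, apply response interceptors."""
    for intercept in request_stage:
        request = intercept(request)
    response = endpoint(request)
    for intercept in response_stage:
        response = intercept(response)
    return response


def fake_endpoint(request: Request) -> Response:
    # Stand-in for the real model call: reports which keys arrived.
    return {"received_keys": sorted(request)}


if __name__ == "__main__":
    chain = [turn_counter(max_turns=100), drop_params(["max_tokens"])]
    payload = {"model": "m", "messages": [], "max_tokens": 16}
    print(run_chain(payload, chain, fake_endpoint))
```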

Available interceptors:

Interceptor	Stage	Description
`endpoint`	request→response	Async HTTP forwarding with retry, backoff, connection pooling
`caching`	request→response	Disk-backed SQLite cache with canonical keys
`turn_counter`	request	Per-session turn counting with budget injection
`drop_params`	request	Strip named parameters from requests
`modify_tools`	request	Add/remove properties in tool schemas
`system_message`	request	Inject/replace/prepend system messages
`payload_modifier`	request	Recursive parameter add/remove/rename
`raise_client_errors`	response	Convert 4xx to exceptions
`log_tokens`	response	Log token usage per request
`response_stats`	response	Aggregate timing and token statistics
`reasoning`	response	Normalize `<think>` blocks to `reasoning_content`
`progress_tracking`	response	Progress counter with optional webhook
`logging`	request + response	Request/response logging with body preview

Solvers

Solvers are configured via `solver.type` in each benchmark:

Solver Type	Config `type`	Use Case
SimpleSolver	`simple`	Standard chat/completion/VLM (default)
HarborSolver	`harbor`	Harbor agents (OpenHands, Terminus-2, etc.)
ToolCallingSolver	`tool_calling`	Tool-use with Gym resource servers
GymDelegationSolver	`gym_delegation`	Delegate to nemo-gym server
OpenClawSolver	`openclaw`	OpenClaw CLI agent
ContainerSolver	`container`	Legacy container harness

Export

Evaluation results can be exported to experiment trackers and compatible formats:

output:
  export: [inspect, wandb, mlflow]

`inspect`: produces `inspect_ai`-compatible EvalLog JSON files. Install with `pip install -e ".[inspect]"`.
`wandb` / `mlflow`: push scores and artifacts to experiment trackers. Install with `pip install -e ".[export]"`.

BYOB (Bring Your Own Benchmark)

from nemo_evaluator import benchmark, scorer, ScorerInput, exact_match

@benchmark(name="my-bench", dataset="hf://my-org/data?split=test",
           prompt="Q: {question}\nA:", target_field="answer")
@scorer
def my_scorer(sample: ScorerInput) -> dict:
    return exact_match(sample)
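
The decorator arguments pack several pieces of information into strings. The sketch below shows how the dataset URI and prompt template from the example decompose using only the standard library. It assumes the template is filled with `str.format`-style substitution, which the example suggests but does not confirm; `split_dataset_uri` and the sample `row` are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def split_dataset_uri(uri: str) -> dict:
    """Break a dataset URI into scheme, location, and query parameters."""
    parts = urlsplit(uri)
    params = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "location": parts.netloc + parts.path,
        "params": params,
    }


if __name__ == "__main__":
    print(split_dataset_uri("hf://my-org/data?split=test"))

    # Hypothetical dataset row; "answer" would be the target_field.
    row = {"question": "What is 2 + 2?", "answer": "4"}
    print("Q: {question}\nA:".format(**row))
```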

Sandboxes

Per-problem Docker/SLURM sandboxes for code execution and agentic evaluation. Two modes: stateful (shared sandbox for solve + verify) and stateless (separate agent and verification containers with shared volume).

SLURM

Pyxis/Enroot-based execution with auto-selected container images per URI scheme. It uses a `node_pools` topology for flexible resource allocation across model, agent, and sandbox nodes.

Tag suffix	Contents
`:latest`	Base + gym + vlmevalkit
`:latest-lm-eval`	+ lm-evaluation-harness
`:latest-skills`	+ NeMo Skills
`:latest-full`	All harnesses
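
Reading the table as a lookup, image selection per URI scheme could work roughly as sketched below. The mapping is inferred from the table's "Contents" column, and the fallbacks (plain benchmark names use the base image, unknown or mixed schemes use `latest-full`) are assumptions; the actual selection logic may differ. The URIs in the example are placeholders.

```python
from urllib.parse import urlsplit

# Inferred from the tag table; not taken from the package source.
SCHEME_TO_TAG = {
    "": "latest",  # plain benchmark name, no scheme (assumed)
    "gym": "latest",
    "vlmevalkit": "latest",
    "lm-eval": "latest-lm-eval",
    "skills": "latest-skills",
}


def pick_tag(benchmarks) -> str:
    """Pick one image tag that covers every benchmark in the list."""
    tags = {
        SCHEME_TO_TAG.get(urlsplit(name).scheme, "latest-full")
        for name in benchmarks
    }
    if not tags:
        return "latest"
    # A single needed tag is used as-is; a mix falls back to the full image.
    return tags.pop() if len(tags) == 1 else "latest-full"


if __name__ == "__main__":
    print(pick_tag(["lm-eval://some_task"]))
    print(pick_tag(["lm-eval://some_task", "skills://other_task"]))
```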

CLI

Command	Purpose
`nel eval run`	Run evaluation (name or YAML)
`nel eval merge <dir>`	Merge sharded results
`nel eval report <dir>`	Generate reports
`nel list`	List benchmarks
`nel serve -b <name>`	Serve as HTTP endpoint
`nel validate -b <name>`	Sanity check
`nel export <paths> --dest <exporter>`	Export bundles
`nel cache-sqsh <image>`	Build a SLURM `.sqsh` cache image
`nel report <dir>`	Generate multi-benchmark reports
`nel compare`	Paired run comparison
`nel gate`	Multi-benchmark quality gate
`nel config`	Persistent user config
`nel package`	Containerize BYOB benchmark

Compare Results Between Runs

Use `nel compare` when you want to compare two runs of the same benchmark and inspect score deltas, flips, and statistical evidence.

nel compare ./results/baseline ./results/candidate --strict

Full tutorial: docs/tutorials/compare.md
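
For intuition about deltas, flips, and statistical evidence, here is a small worked sketch on per-problem pass/fail results. It uses an exact two-sided sign test on the flipped problems, which is one standard choice for paired binary outcomes; it is not necessarily the test `nel compare` applies, and `paired_summary` is a hypothetical helper.

```python
from math import comb


def paired_summary(baseline, candidate) -> dict:
    """Summarize two aligned lists of per-problem pass/fail results."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("runs must be non-empty and cover the same problems")
    n = len(baseline)
    gained = sum(1 for b, c in zip(baseline, candidate) if not b and c)
    lost = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    delta = (sum(map(bool, candidate)) - sum(map(bool, baseline))) / n

    # Exact two-sided sign test: under "no difference", each flip is
    # equally likely to be a gain or a loss.
    flips = gained + lost
    if flips == 0:
        p_value = 1.0
    else:
        tail = sum(comb(flips, i) for i in range(min(gained, lost) + 1)) / 2**flips
        p_value = min(1.0, 2 * tail)
    return {"delta": delta, "gained": gained, "lost": lost, "p_value": p_value}


if __name__ == "__main__":
    base = [1, 1, 0, 0, 1, 0, 0, 0]
    cand = [1, 1, 1, 1, 1, 1, 0, 1]
    print(paired_summary(base, cand))
```

In this toy run the candidate gains four problems and loses none, for a delta of 0.5, yet the p-value is 0.125: four flips alone are weak evidence, which is why larger problem counts or repeats matter.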

Implement Quality Gates

Use `nel gate` when you want one GO / NO-GO / INCONCLUSIVE decision across multiple benchmarks from an explicit policy file.

nel gate ./results/baseline ./results/candidate \
  --policy gate_policy.yaml \
  --strict \
  --output gate_report.json

Full tutorial: docs/tutorials/quality-gate.md
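
A three-way decision of this kind can be sketched as a confidence interval compared against a tolerated regression. The code below is an illustration under stated assumptions: a normal-approximation interval on per-problem score differences, and a hypothetical `max_regression` threshold. It does not reflect the policy file schema or the statistics `nel gate` uses.

```python
from math import sqrt


def gate_decision(deltas, max_regression: float = 0.02, z: float = 1.96) -> str:
    """Decide from per-problem differences (candidate minus baseline)."""
    n = len(deltas)
    if n < 2:
        return "INCONCLUSIVE"
    mean = sum(deltas) / n
    variance = sum((d - mean) ** 2 for d in deltas) / (n - 1)
    half_width = z * sqrt(variance / n)
    low, high = mean - half_width, mean + half_width
    if low >= -max_regression:
        return "GO"  # even the pessimistic bound is within tolerance
    if high < -max_regression:
        return "NO-GO"  # even the optimistic bound regresses too much
    return "INCONCLUSIVE"


def gate(per_benchmark: dict, **policy) -> str:
    """Combine per-benchmark decisions into a single verdict."""
    decisions = [gate_decision(d, **policy) for d in per_benchmark.values()]
    if not decisions:
        return "INCONCLUSIVE"
    if "NO-GO" in decisions:
        return "NO-GO"
    return "GO" if all(d == "GO" for d in decisions) else "INCONCLUSIVE"


if __name__ == "__main__":
    results = {"bench_a": [0.0] * 100, "bench_b": [1, -1] * 5}
    print(gate(results))
```

Any single NO-GO blocks the release in this sketch, while a noisy benchmark only downgrades the verdict to INCONCLUSIVE.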

Examples

See `examples/configs/` for 25+ end-to-end configs covering all solver types, verification methods, and execution backends.

License

Apache 2.0

This version

0.3.0

Jun 3, 2026

Download files

Source Distribution

nemo_evaluator-0.3.0.tar.gz (836.1 kB)

Uploaded Jun 3, 2026 Source

Built Distribution

nemo_evaluator-0.3.0-py3-none-any.whl (522.7 kB)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file nemo_evaluator-0.3.0.tar.gz.

File metadata

Download URL: nemo_evaluator-0.3.0.tar.gz
Upload date: Jun 3, 2026
Size: 836.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for nemo_evaluator-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`0b978f1ca9b817cd92d5eab13bd58fb07c5a318580d89b79dcdd04fb7af7fa8f`
MD5	`a0e35d6fa5629923f56d642d928c2e86`
BLAKE2b-256	`ad79b223bb65b4b4567e755ed2ab7038387985443d182a6db653605c1b4ecdc9`

File details

Details for the file nemo_evaluator-0.3.0-py3-none-any.whl.

File metadata

Download URL: nemo_evaluator-0.3.0-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 522.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for nemo_evaluator-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`60fdb8613e78348aa0b0b82144756279b3ea63835a826d37c5e7a8b2ae579c41`
MD5	`73617a4fa52da1ee954e67601aee5531`
BLAKE2b-256	`6266aec5d840820ce95541cc3473abfab46a6395561d193bcc05bec0dbe61fd7`
