
nexa-gauge - Graph-Based Evaluation for LLM and RAG Systems

A cache-aware evaluation engine for measuring LLM and RAG output quality with repeatable metrics, cost estimates, and structured reports.


Read the Documentation  ·  Quickstart  ·  CLI Usage  ·  Report Bug  ·  Request Feature


What is nexa-gauge?

nexa-gauge is a Python package and command-line toolkit for evaluating generated outputs from LLM, RAG, and agentic systems. It replaces ad hoc manual checks with a typed evaluation graph that can estimate cost, execute only the required nodes, reuse cached artifacts, and emit structured per-case reports.

It is designed for prompt iteration, benchmark runs, regression testing, release gates, and production evaluation workflows where teams need measurable quality and safety signals.

Core evaluation coverage includes:

  • Relevance - measures whether generated claims answer the user question.
  • Grounding - checks whether generated claims are supported by supplied context.
  • Red team scoring - evaluates safety and risk behavior with configurable rubrics.
  • GEval / LLM-as-a-judge - scores outputs against explicit criteria or evaluation steps.
  • Reference metrics - computes overlap-based metrics against known reference answers.

Quickstart

Install the package from PyPI:

pip install nexa-gauge

Install optional Hugging Face dataset support:

pip install "nexa-gauge[huggingface]"

Set your provider key:

export OPENAI_API_KEY="<your-key>"

Estimate cost before running billable evaluation work:

nexagauge estimate eval --input sample.json --limit 10

Run the full evaluation graph and write per-case reports:

nexagauge run eval --input sample.json --limit 10 --output-dir ./report

Core Capabilities

  • Graph-based evaluation pipeline - predictable node topology for scanning, chunking, claim extraction, metric execution, aggregation, and reporting.
  • Estimate-first execution - preview the cost of eligible uncached work before making LLM-backed calls.
  • Cache-aware runs - avoid duplicate LLM spend and recomputation when inputs, prompts, and model routes are unchanged.
  • Structured JSON reports - write per-case report files for CI, dashboards, notebooks, and downstream analytics.
  • Per-node model routing - configure global models, node-specific models, fallback models, and temperatures.
  • Scalable CLI execution - tune case-level workers, in-flight cases, and global LLM concurrency.
  • Local and hosted datasets - evaluate JSON, JSONL, CSV, text files, and Hugging Face datasets.

Evaluation Pipeline

nexa-gauge runs evaluations through a deterministic node graph. Each target executes only its required upstream dependencies.

| Category | Nodes | Purpose |
| --- | --- | --- |
| Input and orchestration | scan, eval, report | Normalize records, aggregate metric branches, and project final reports. |
| Utility | chunk, refiner, claims, geval_steps | Prepare generated text, select top-k chunks via configurable refinement, extract claims, and resolve GEval steps. |
| Metrics | relevance, grounding, redteam, geval, reference | Score answer quality, evidence support, safety, rubric alignment, and reference overlap. |

Typical execution paths:

grounding: scan -> chunk -> refiner -> claims -> grounding
relevance: scan -> chunk -> refiner -> claims -> relevance
geval:     scan -> geval_steps -> geval
eval:      full graph execution and aggregate metric summary
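The "execute only required upstream dependencies" behavior can be illustrated with a small dependency-resolution sketch. The node names come from the paths above; the resolution logic itself is an assumption for illustration, not the actual nexa-gauge scheduler:

```python
# Illustrative target-driven graph resolution (not nexa-gauge internals).
# Each node lists the upstream nodes it depends on.
DEPS = {
    "scan": [],
    "chunk": ["scan"],
    "refiner": ["chunk"],
    "claims": ["refiner"],
    "grounding": ["claims"],
    "relevance": ["claims"],
    "geval_steps": ["scan"],
    "geval": ["geval_steps"],
}

def resolve(target: str) -> list[str]:
    """Return an execution order for `target`, visiting each node once."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        for dep in DEPS[node]:
            visit(dep)
        order.append(node)

    visit(target)
    return order
```

With this sketch, resolve("grounding") walks scan, chunk, refiner, and claims before grounding, while resolve("geval") never touches the chunking branch at all, which is the property that keeps single-metric runs cheap.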

For a full architecture diagram, see docs/architecture.md.


CLI Usage

The CLI entry point is nexagauge.

nexagauge --help
nexagauge run --help
nexagauge estimate --help
nexagauge cache --help

Primary commands:

| Command | Purpose |
| --- | --- |
| nexagauge estimate <target_node> --input <source> | Estimate uncached cost for a target branch before execution. |
| nexagauge run <target_node> --input <source> | Execute a target branch and optionally write reports. |
| nexagauge cache dir | Print the resolved cache root directory. |
| nexagauge cache delete | Inspect or clear cached node outputs. |

Common examples:

# Estimate full evaluation cost for a dataset slice
nexagauge estimate eval --input sample.json --limit 100

# Run full evaluation and write JSON reports
nexagauge run eval --input sample.json --limit 100 --output-dir ./report

# Run full evaluation with explicit chunk/refiner strategies
nexagauge run eval --input sample.json --limit 100 --chunker semchunk --refiner mmr --refiner-top-k 3

# Run only the grounding metric branch
nexagauge run grounding --input sample.json --limit 25

# Preview cache cleanup
nexagauge cache delete --dry-run

Common flags:

| Area | Flags |
| --- | --- |
| Data selection | --input, --adapter, --start, --end, --limit |
| Model routing | --model, --llm-model, --llm-fallback |
| Strategy routing | --chunker, --refiner, --refiner-top-k |
| Caching | --force, --no-cache |
| Execution | --max-workers, --max-in-flight, --llm-concurrency, --continue-on-error |
| Debugging | --debug |
| Reports | --output-dir |

See docs/cli-code-flow.md and the hosted CLI documentation for deeper usage details.


Input Formats

Use --input with local files or hosted datasets.

| Source | Example | Notes |
| --- | --- | --- |
| JSON | sample.json | Object or array of records. |
| JSONL | dataset.jsonl | One JSON object per line. |
| CSV | dataset.csv | Rows are loaded as dictionaries. |
| Text | generation.txt | Treated as a single generated output. |
| Hugging Face | hf://org/dataset | Requires pip install "nexa-gauge[huggingface]". |

Canonical record fields include:

| Field | Used by |
| --- | --- |
| generation | Required for all metric branches. |
| question | Relevance and some GEval configurations. |
| context | Grounding and context-aware GEval checks. |
| reference | Reference metrics and reference-aware GEval checks. |
| geval | Rubric-driven GEval metrics. |
| redteam | Custom safety and risk rubrics. |

Common aliases such as response, answer, output, completion, query, prompt, ground_truth, and label are normalized during scanning.
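A minimal input file using the canonical field names can be produced with the standard library alone. The record contents below are made-up sample data; only the field names come from the table above:

```python
import json

# One evaluation case using the canonical field names.
# Metric-specific fields (geval, redteam rubrics) are optional extras.
record = {
    "generation": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "context": "France's capital and largest city is Paris.",
    "reference": "The capital of France is Paris.",
}

with open("sample.json", "w", encoding="utf-8") as f:
    json.dump([record], f, indent=2)  # an array of records is accepted
```

The resulting sample.json can then be passed directly to nexagauge estimate eval --input sample.json.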


Metrics

nexa-gauge combines deterministic metrics with LLM-as-a-judge evaluation.

| Metric node | What it evaluates |
| --- | --- |
| relevance | Whether generated claims directly answer the question. |
| grounding | Whether generated claims are supported by the provided context. |
| redteam | Bias, toxicity, and custom risk behavior using rubrics. |
| geval | Criteria-based LLM judging with generated or provided evaluation steps. |
| reference | BLEU, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L against reference answers. |
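To make the overlap-based reference metrics concrete, here is a simplified unigram-recall score in the spirit of ROUGE-1. This is an illustrative sketch only; how nexa-gauge's reference node computes these metrics internally may differ:

```python
# Simplified ROUGE-1-style recall: what fraction of reference unigrams
# also appear in the generation. Illustrative, not nexa-gauge's code.
def rouge1_recall(generation: str, reference: str) -> float:
    gen_tokens = set(generation.lower().split())
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    hits = sum(1 for tok in ref_tokens if tok in gen_tokens)
    return hits / len(ref_tokens)
```

For example, a generation that reuses every reference word scores 1.0 regardless of word order, which is exactly why overlap metrics are paired with LLM-as-a-judge scoring rather than used alone.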

GEval is split into two phases:

  1. geval_steps resolves reusable evaluation steps from criteria or accepts provided steps.
  2. geval scores each case against those resolved steps and selected input fields.

This design makes rubric-based evaluation repeatable and cache-friendly across datasets.


Caching and Cost Estimation

Cost control is a first-class part of the runtime.

# Preview uncached work before execution
nexagauge estimate eval --input sample.json --limit 50

# Reuse cache during normal runs
nexagauge run eval --input sample.json --limit 50 --output-dir ./report

# Ignore cache reads but still write fresh outputs
nexagauge run eval --input sample.json --limit 50 --force

# Disable cache reads and writes for debugging
nexagauge run eval --input sample.json --limit 50 --no-cache

The cache is deterministic and route-aware. Inputs, evaluation criteria, model routing, prompt versions, parser versions, and relevant upstream artifacts are included in cache keys so stale outputs are not reused across incompatible runs.
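The general technique behind such route-aware keys can be sketched with a content hash over every input that affects a node's output. The exact key schema nexa-gauge uses is not documented here, so treat this as an assumption about the approach, not the real implementation:

```python
import hashlib
import json

# Illustrative deterministic cache key (an assumption about the technique,
# not nexa-gauge's actual key schema). Hashing every output-affecting input
# means any change -- record, model route, prompt version -- yields a new key,
# so stale entries are never reused across incompatible runs.
def cache_key(node: str, record: dict, model: str, prompt_version: str) -> str:
    payload = json.dumps(
        {
            "node": node,
            "record": record,
            "model": model,
            "prompt_version": prompt_version,
        },
        sort_keys=True,  # field order must not change the key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting the keys before hashing is what makes the key deterministic: two runs with identical inputs always hit the same cache entry, while changing the model route alone is enough to invalidate it.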

For run commands, the cache location can be overridden with:

export NEXAGAUGE_CACHE_DIR="./.nexagauge-cache"

Inspect the active cache root:

nexagauge cache dir

Clear cached node outputs:

nexagauge cache delete --dry-run
nexagauge cache delete --yes

Configuration

For local development or repeatable runs, copy the environment template:

cp .env.example .env

Minimum configuration for OpenAI-backed runs:

OPENAI_API_KEY=<your-key>
LLM_MODEL=gpt-4o-mini

Supported per-node overrides follow this pattern:

LLM_CLAIMS_MODEL=openai/gpt-4o-mini
LLM_CLAIMS_FALLBACK_MODEL=openai/gpt-4o
LLM_GROUNDING_TEMPERATURE=0.0
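A resolver for the LLM_<NODE>_MODEL naming pattern above might look like the following. The precedence shown (node-specific variable wins over the global LLM_MODEL) is an assumption for illustration, not nexa-gauge's actual config loader:

```python
import os

# Illustrative resolver for the LLM_<NODE>_MODEL pattern (assumed
# precedence, not nexa-gauge's real loader): a node-specific variable
# overrides the global LLM_MODEL, which overrides a built-in default.
def resolve_model(node: str, default: str = "gpt-4o-mini") -> str:
    node_var = f"LLM_{node.upper()}_MODEL"
    return os.environ.get(node_var) or os.environ.get("LLM_MODEL", default)
```

Under this scheme, setting LLM_CLAIMS_MODEL changes only the claims node while every other node keeps using the global LLM_MODEL.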

Runtime overrides can also be passed through the CLI:

nexagauge run eval \
  --input sample.json \
  --llm-model openai/gpt-4o-mini \
  --llm-model grounding=openai/gpt-4o \
  --llm-fallback openai/gpt-4o

Development

Clone the repository and install it from source:

git clone https://github.com/harnexa/nexa-gauge.git
cd nexa-gauge
pip install -e .

Contributor workflow with uv:

uv sync
make lint
make test
make ci

Build distributions:

uv build

Expected artifacts:

dist/nexa_gauge-<version>-py3-none-any.whl
dist/nexa_gauge-<version>.tar.gz

Releases use release-please. Conventional Commit titles such as feat:, fix:, docs:, deps:, and chore: are recommended for cleaner generated release notes, but they are not required by CI.


License

Distributed under the MIT License. See LICENSE for details.
