nexa-gauge - Graph-Based Evaluation for LLM and RAG Systems
A cache-aware evaluation engine for measuring LLM and RAG output quality with repeatable metrics, cost estimates, and structured reports.
Read the Documentation · Quickstart · CLI Usage · Report Bug · Request Feature
What is nexa-gauge?
nexa-gauge is a Python package and command-line toolkit for evaluating generated outputs from LLM, RAG, and agentic systems. It replaces ad-hoc manual checks with a typed evaluation graph that can estimate cost, execute only the required nodes, reuse cached artifacts, and emit structured per-case reports.
It is designed for prompt iteration, benchmark runs, regression testing, release gates, and production evaluation workflows where teams need measurable quality and safety signals.
Core evaluation coverage includes:
- Relevance - measures whether generated claims answer the user question.
- Grounding - checks whether generated claims are supported by supplied context.
- Red team scoring - evaluates safety and risk behavior with configurable rubrics.
- GEval / LLM-as-a-judge - scores outputs against explicit criteria or evaluation steps.
- Reference metrics - computes overlap-based metrics against known reference answers.
Quickstart
Install the package from PyPI:
pip install nexa-gauge
Install optional Hugging Face dataset support:
pip install "nexa-gauge[huggingface]"
Set your provider key:
export OPENAI_API_KEY="<your-key>"
Estimate cost before running billable evaluation work:
nexagauge estimate eval --input sample.json --limit 10
Run the full evaluation graph and write per-case reports:
nexagauge run eval --input sample.json --limit 10 --output-dir ./report
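The quickstart commands above assume a local sample.json. A minimal input file might look like the following; the field names follow the canonical record fields described under Input Formats, while the content itself is purely illustrative:

```python
import json

# One evaluation record with the canonical fields used by the
# relevance, grounding, and reference metric branches.
records = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Europe. Its capital is Paris.",
        "generation": "The capital of France is Paris.",
        "reference": "Paris",
    }
]

with open("sample.json", "w") as f:
    json.dump(records, f, indent=2)
```

A JSON array of such objects (or a single object) is accepted; JSONL and CSV sources carry the same fields per record.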
Core Capabilities
- Graph-based evaluation pipeline - predictable node topology for scanning, chunking, claim extraction, metric execution, aggregation, and reporting.
- Estimate-first execution - preview the cost of uncached work before making any LLM-backed calls.
- Cache-aware runs - avoid duplicate LLM spend and recomputation when inputs, prompts, and model routes are unchanged.
- Structured JSON reports - write per-case report files for CI, dashboards, notebooks, and downstream analytics.
- Per-node model routing - configure global models, node-specific models, fallback models, and temperatures.
- Scalable CLI execution - tune case-level workers, in-flight cases, and global LLM concurrency.
- Local and hosted datasets - evaluate JSON, JSONL, CSV, text files, and Hugging Face datasets.
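Because reports are written as one JSON file per case, downstream consumption is straightforward. The sketch below shows one way a CI job might aggregate them; the "score" field is hypothetical, so consult the hosted documentation for the actual report schema:

```python
import json
from pathlib import Path

def load_reports(report_dir: str) -> list[dict]:
    """Load every per-case JSON report in a directory."""
    return [
        json.loads(p.read_text())
        for p in sorted(Path(report_dir).glob("*.json"))
    ]

def mean_score(reports: list[dict], key: str = "score") -> float:
    """Average a numeric field across cases, ignoring cases without it."""
    scores = [r[key] for r in reports if key in r]
    return sum(scores) / len(scores) if scores else 0.0
```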
Evaluation Pipeline
nexa-gauge runs evaluations through a deterministic node graph. Each target executes only its required upstream dependencies.
| Category | Nodes | Purpose |
|---|---|---|
| Input and orchestration | scan, eval, report | Normalize records, aggregate metric branches, and project final reports. |
| Utility | chunk, claims, dedup, geval_steps | Prepare generated text, extract claims, remove duplicates, and resolve GEval steps. |
| Metrics | relevance, grounding, redteam, geval, reference | Score answer quality, evidence support, safety, rubric alignment, and reference overlap. |
Typical execution paths:
grounding: scan -> chunk -> claims -> dedup -> grounding
relevance: scan -> chunk -> claims -> dedup -> relevance
geval: scan -> geval_steps -> geval
eval: full graph execution and aggregate metric summary
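The "execute only required upstream dependencies" behavior can be pictured as a depth-first walk over a dependency map. This is a simplified illustration of the idea, not the actual nexa-gauge scheduler:

```python
# Simplified dependency map mirroring the execution paths above.
DEPS = {
    "scan": [],
    "chunk": ["scan"],
    "claims": ["chunk"],
    "dedup": ["claims"],
    "grounding": ["dedup"],
    "relevance": ["dedup"],
    "geval_steps": ["scan"],
    "geval": ["geval_steps"],
}

def execution_order(target: str) -> list[str]:
    """Return the target's upstream nodes plus the target, in dependency order."""
    order: list[str] = []

    def visit(node: str) -> None:
        if node in order:
            return
        for dep in DEPS[node]:
            visit(dep)
        order.append(node)

    visit(target)
    return order

print(execution_order("grounding"))
# -> ['scan', 'chunk', 'claims', 'dedup', 'grounding']
```

Running the geval target with the same map would visit only scan, geval_steps, and geval, matching the paths listed above.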
For a full architecture diagram, see docs/architecture.md.
CLI Usage
The CLI entry point is nexagauge.
nexagauge --help
nexagauge run --help
nexagauge estimate --help
nexagauge cache --help
Primary commands:
| Command | Purpose |
|---|---|
| nexagauge estimate <target_node> --input <source> | Estimate uncached cost for a target branch before execution. |
| nexagauge run <target_node> --input <source> | Execute a target branch and optionally write reports. |
| nexagauge cache dir | Print the resolved cache root directory. |
| nexagauge cache delete | Inspect or clear cached node outputs. |
Common examples:
# Estimate full evaluation cost for a dataset slice
nexagauge estimate eval --input sample.json --limit 100
# Run full evaluation and write JSON reports
nexagauge run eval --input sample.json --limit 100 --output-dir ./report
# Run only the grounding metric branch
nexagauge run grounding --input sample.json --limit 25
# Preview cache cleanup
nexagauge cache delete --dry-run
Common flags:
| Area | Flags |
|---|---|
| Data selection | --input, --adapter, --start, --end, --limit |
| Model routing | --model, --llm-model, --llm-fallback |
| Caching | --force, --no-cache |
| Execution | --max-workers, --max-in-flight, --llm-concurrency, --continue-on-error |
| Debugging | --debug |
| Reports | --output-dir |
See docs/cli-code-flow.md and the hosted CLI documentation for deeper usage details.
Input Formats
Use --input with local files or hosted datasets.
| Source | Example | Notes |
|---|---|---|
| JSON | sample.json | Object or array of records. |
| JSONL | dataset.jsonl | One JSON object per line. |
| CSV | dataset.csv | Rows are loaded as dictionaries. |
| Text | generation.txt | Treated as a single generated output. |
| Hugging Face | hf://org/dataset | Requires pip install "nexa-gauge[huggingface]". |
Canonical record fields include:
| Field | Used by |
|---|---|
| generation | Required for all metric branches. |
| question | Relevance and some GEval configurations. |
| context | Grounding and context-aware GEval checks. |
| reference | Reference metrics and reference-aware GEval checks. |
| geval | Rubric-driven GEval metrics. |
| redteam | Custom safety and risk rubrics. |
Common aliases such as response, answer, output, completion, query, prompt, ground_truth, and label are normalized during scanning.
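That alias normalization can be pictured roughly as a key-rename pass during scanning. The mapping below is an illustrative guess at the behavior based on the aliases listed above, not the package's actual table:

```python
# Illustrative alias table; the real normalization lives in nexa-gauge's scan node.
ALIASES = {
    "response": "generation",
    "answer": "generation",
    "output": "generation",
    "completion": "generation",
    "query": "question",
    "prompt": "question",
    "ground_truth": "reference",
    "label": "reference",
}

def normalize_record(record: dict) -> dict:
    """Rename known alias keys to their canonical field names."""
    return {ALIASES.get(key, key): value for key, value in record.items()}

print(normalize_record({"query": "Why?", "response": "Because."}))
# -> {'question': 'Why?', 'generation': 'Because.'}
```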
Metrics
nexa-gauge combines deterministic metrics with LLM-as-a-judge evaluation.
| Metric node | What it evaluates |
|---|---|
| relevance | Whether generated claims directly answer the question. |
| grounding | Whether generated claims are supported by the provided context. |
| redteam | Bias, toxicity, and custom risk behavior using rubrics. |
| geval | Criteria-based LLM judging with generated or provided evaluation steps. |
| reference | BLEU, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L against reference answers. |
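As a concrete example of the overlap family, ROUGE-1 is the unigram-overlap F1 between a candidate and a reference. This is a minimal sketch of the standard definition; nexa-gauge's own implementation may differ in tokenization and smoothing:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between candidate and reference tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the capital of france is paris", "paris"), 3))
# -> 0.286
```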
GEval is split into two phases:
- geval_steps resolves reusable evaluation steps from criteria, or accepts provided steps.
- geval scores each case against those resolved steps and the selected input fields.
This design makes rubric-based evaluation repeatable and cache-friendly across datasets.
Caching and Cost Estimation
Cost control is a first-class part of the runtime.
# Preview uncached work before execution
nexagauge estimate eval --input sample.json --limit 50
# Reuse cache during normal runs
nexagauge run eval --input sample.json --limit 50 --output-dir ./report
# Ignore cache reads but still write fresh outputs
nexagauge run eval --input sample.json --limit 50 --force
# Disable cache reads and writes for debugging
nexagauge run eval --input sample.json --limit 50 --no-cache
The cache is deterministic and route-aware. Inputs, evaluation criteria, model routing, prompt versions, parser versions, and relevant upstream artifacts are included in cache keys so stale outputs are not reused across incompatible runs.
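A deterministic, route-aware cache key can be sketched as a hash over everything that should invalidate a cached output. This illustrates the idea only; it is not nexa-gauge's actual key schema:

```python
import hashlib
import json

def cache_key(node: str, record: dict, model_route: str,
              prompt_version: str, parser_version: str) -> str:
    """Hash every input that should invalidate a cached node output."""
    payload = json.dumps(
        {
            "node": node,
            "record": record,
            "model_route": model_route,
            "prompt_version": prompt_version,
            "parser_version": parser_version,
        },
        sort_keys=True,  # key order must not affect the hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Same inputs -> same key; changing the model route changes the key.
k1 = cache_key("grounding", {"generation": "x"}, "openai/gpt-4o-mini", "v1", "v1")
k2 = cache_key("grounding", {"generation": "x"}, "openai/gpt-4o", "v1", "v1")
assert k1 != k2
```

Because the serialization is canonical (sorted keys) and the hash covers the model route and prompt/parser versions, a cached artifact is only reused when every relevant input is unchanged.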
For run commands, the cache location can be overridden with:
export NEXAGAUGE_CACHE_DIR="./.nexagauge-cache"
Inspect the active cache root:
nexagauge cache dir
Clear cached node outputs:
nexagauge cache delete --dry-run
nexagauge cache delete --yes
Configuration
For local development or repeatable runs, copy the environment template:
cp .env.example .env
Minimum configuration for OpenAI-backed runs:
OPENAI_API_KEY=<your-key>
LLM_MODEL=gpt-4o-mini
Supported per-node overrides follow this pattern:
LLM_CLAIMS_MODEL=openai/gpt-4o-mini
LLM_CLAIMS_FALLBACK_MODEL=openai/gpt-4o
LLM_GROUNDING_TEMPERATURE=0.0
Runtime overrides can also be passed through the CLI:
nexagauge run eval \
--input sample.json \
--llm-model openai/gpt-4o-mini \
--llm-model grounding=openai/gpt-4o \
--llm-fallback openai/gpt-4o
Development
Clone the repository and install it from source:
git clone https://github.com/harnexa/nexa-gauge.git
cd nexa-gauge
pip install -e .
Contributor workflow with uv:
uv sync
make lint
make test
make ci
Build distributions:
uv build
Expected artifacts:
dist/nexa_gauge-<version>-py3-none-any.whl
dist/nexa_gauge-<version>.tar.gz
Releases use release-please. Conventional Commit titles such as feat:, fix:, docs:, deps:, and chore: are recommended for cleaner generated release notes, but they are not required by CI.
Documentation
- Hosted documentation: harnexa.dev/nexa-gauge/docs/introduction
- Local getting started guide: docs/get-started.md
- Architecture: docs/architecture.md
- CLI code flow: docs/cli-code-flow.md
- Product summary: docs/PRODUCT_SUMMARY.md
Project Standards
- License: MIT
- Security policy: SECURITY.md
- Contributing guide: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md
License
Distributed under the MIT License. See LICENSE for details.