
nexa-gauge - Graph-Based Evaluation for LLM and RAG Systems

A cache-aware evaluation engine for measuring LLM and RAG output quality with repeatable metrics, cost estimates, and structured reports.


Read the Documentation  ·  Quickstart  ·  CLI Usage  ·  Report Bug  ·  Request Feature


What is nexa-gauge?

nexa-gauge is a Python package and command-line toolkit for evaluating generated outputs from LLM, RAG, and agentic systems. It replaces ad hoc manual checks with a typed evaluation graph that can estimate cost, execute only the required nodes, reuse cached artifacts, and emit structured per-case reports.

It is designed for prompt iteration, benchmark runs, regression testing, release gates, and production evaluation workflows where teams need measurable quality and safety signals.

Core evaluation coverage includes:

  • Relevance - measures whether generated claims answer the user question.
  • Grounding - checks whether generated claims are supported by supplied context.
  • Red team scoring - evaluates safety and risk behavior with configurable rubrics.
  • GEval / LLM-as-a-judge - scores outputs against explicit criteria or evaluation steps.
  • Reference metrics - computes overlap-based metrics against known reference answers.

Quickstart

Install the package from PyPI:

pip install nexa-gauge

Install optional Hugging Face dataset support:

pip install "nexa-gauge[huggingface]"

Set your provider key:

export OPENAI_API_KEY="<your-key>"

Estimate cost before running billable evaluation work:

nexagauge estimate eval --input sample.json --limit 10

Run the full evaluation graph and write per-case reports:

nexagauge run eval --input sample.json --limit 10 --output-dir ./report

Core Capabilities

  • Graph-based evaluation pipeline - predictable node topology for scanning, chunking, claim extraction, metric execution, aggregation, and reporting.
  • Estimate-first execution - preview the cost of eligible uncached work before making LLM-backed calls.
  • Cache-aware runs - avoid duplicate LLM spend and recomputation when inputs, prompts, and model routes are unchanged.
  • Structured JSON reports - write per-case report files for CI, dashboards, notebooks, and downstream analytics.
  • Per-node model routing - configure global models, node-specific models, fallback models, and temperatures.
  • Scalable CLI execution - tune case-level workers, in-flight cases, and global LLM concurrency.
  • Local and hosted datasets - evaluate JSON, JSONL, CSV, text files, and Hugging Face datasets.

Evaluation Pipeline

nexa-gauge runs evaluations through a deterministic node graph. Each target executes only its required upstream dependencies.

| Category | Nodes | Purpose |
| --- | --- | --- |
| Input and orchestration | scan, eval, report | Normalize records, aggregate metric branches, and project final reports. |
| Utility | chunk, refiner, claims, geval_steps | Prepare generated text, select top-k chunks via configurable refinement, extract claims, and resolve GEval steps. |
| Metrics | relevance, grounding, redteam, geval, reference | Score answer quality, evidence support, safety, rubric alignment, and reference overlap. |

Typical execution paths:

grounding: scan -> chunk -> refiner -> claims -> grounding
relevance: scan -> chunk -> refiner -> claims -> relevance
geval:     scan -> geval_steps -> geval
eval:      full graph execution and aggregate metric summary
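The "execute only required upstream dependencies" behavior can be illustrated with a small dependency-resolution sketch. The node names come from the paths above; the resolution logic itself is an assumption for illustration, not the actual nexa-gauge scheduler:

```python
# Illustrative target-driven graph resolution (not nexa-gauge internals).
# Each node lists the upstream nodes it depends on.
DEPS = {
    "scan": [],
    "chunk": ["scan"],
    "refiner": ["chunk"],
    "claims": ["refiner"],
    "grounding": ["claims"],
    "relevance": ["claims"],
    "geval_steps": ["scan"],
    "geval": ["geval_steps"],
}

def resolve(target: str) -> list[str]:
    """Return an execution order for `target`, visiting each node once."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        for dep in DEPS[node]:
            visit(dep)
        order.append(node)

    visit(target)
    return order
```

With this sketch, resolve("grounding") walks scan, chunk, refiner, and claims before grounding, while resolve("geval") never touches the chunking branch at all, which is the property that keeps single-metric runs cheap.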

For a full architecture diagram, see docs/architecture.md.


CLI Usage

The CLI entry point is nexagauge.

nexagauge --help
nexagauge run --help
nexagauge estimate --help
nexagauge cache --help

Primary commands:

| Command | Purpose |
| --- | --- |
| nexagauge estimate <target_node> --input <source> | Estimate uncached cost for a target branch before execution. |
| nexagauge run <target_node> --input <source> | Execute a target branch and optionally write reports. |
| nexagauge cache dir | Print the resolved cache root directory. |
| nexagauge cache delete | Inspect or clear cached node outputs. |

Common examples:

# Estimate full evaluation cost for a dataset slice
nexagauge estimate eval --input sample.json --limit 100

# Run full evaluation and write JSON reports
nexagauge run eval --input sample.json --limit 100 --output-dir ./report

# Run full evaluation with explicit chunk/refiner strategies
nexagauge run eval --input sample.json --limit 100 --chunker semchunk --refiner mmr --refiner-top-k 3

# Run only the grounding metric branch
nexagauge run grounding --input sample.json --limit 25

# Preview cache cleanup
nexagauge cache delete --dry-run

Common flags:

| Area | Flags |
| --- | --- |
| Data selection | --input, --adapter, --start, --end, --limit |
| Model routing | --model, --llm-model, --llm-fallback |
| Strategy routing | --chunker, --refiner, --refiner-top-k |
| Caching | --force, --no-cache |
| Execution | --max-workers, --max-in-flight, --llm-concurrency, --continue-on-error |
| Debugging | --debug |
| Reports | --output-dir |

See docs/cli-code-flow.md and the hosted CLI documentation for deeper usage details.


Input Formats

Use --input with local files or hosted datasets.

| Source | Example | Notes |
| --- | --- | --- |
| JSON | sample.json | Object or array of records. |
| JSONL | dataset.jsonl | One JSON object per line. |
| CSV | dataset.csv | Rows are loaded as dictionaries. |
| Text | generation.txt | Treated as a single generated output. |
| Hugging Face | hf://org/dataset | Requires pip install "nexa-gauge[huggingface]". |

Canonical record fields include:

| Field | Used by |
| --- | --- |
| generation | Required for all metric branches. |
| question | Relevance and some GEval configurations. |
| context | Grounding and context-aware GEval checks. |
| reference | Reference metrics and reference-aware GEval checks. |
| geval | Rubric-driven GEval metrics. |
| redteam | Custom safety and risk rubrics. |

Common aliases such as response, answer, output, completion, query, prompt, ground_truth, and label are normalized during scanning.
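A minimal input file using the canonical field names can be produced with the standard library alone. The record contents below are made-up sample data; only the field names come from the table above:

```python
import json

# One evaluation case using the canonical field names.
# Metric-specific fields (geval, redteam rubrics) are optional extras.
record = {
    "generation": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "context": "France's capital and largest city is Paris.",
    "reference": "The capital of France is Paris.",
}

with open("sample.json", "w", encoding="utf-8") as f:
    json.dump([record], f, indent=2)  # an array of records is accepted
```

The resulting sample.json can then be passed directly to nexagauge estimate eval --input sample.json.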


Metrics

nexa-gauge combines deterministic metrics with LLM-as-a-judge evaluation.

| Metric node | What it evaluates |
| --- | --- |
| relevance | Whether generated claims directly answer the question. |
| grounding | Whether generated claims are supported by the provided context. |
| redteam | Bias, toxicity, and custom risk behavior using rubrics. |
| geval | Criteria-based LLM judging with generated or provided evaluation steps. |
| reference | BLEU, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L against reference answers. |
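To make the overlap-based reference metrics concrete, here is a simplified unigram-recall score in the spirit of ROUGE-1. This is an illustrative sketch only; how nexa-gauge's reference node computes these metrics internally may differ:

```python
# Simplified ROUGE-1-style recall: what fraction of reference unigrams
# also appear in the generation. Illustrative, not nexa-gauge's code.
def rouge1_recall(generation: str, reference: str) -> float:
    gen_tokens = set(generation.lower().split())
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    hits = sum(1 for tok in ref_tokens if tok in gen_tokens)
    return hits / len(ref_tokens)
```

For example, a generation that reuses every reference word scores 1.0 regardless of word order, which is exactly why overlap metrics are paired with LLM-as-a-judge scoring rather than used alone.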

GEval is split into two phases:

  1. geval_steps resolves reusable evaluation steps from criteria or accepts provided steps.
  2. geval scores each case against those resolved steps and selected input fields.

This design makes rubric-based evaluation repeatable and cache-friendly across datasets.


Caching and Cost Estimation

Cost control is a first-class part of the runtime.

# Preview uncached work before execution
nexagauge estimate eval --input sample.json --limit 50

# Reuse cache during normal runs
nexagauge run eval --input sample.json --limit 50 --output-dir ./report

# Ignore cache reads but still write fresh outputs
nexagauge run eval --input sample.json --limit 50 --force

# Disable cache reads and writes for debugging
nexagauge run eval --input sample.json --limit 50 --no-cache

The cache is deterministic and route-aware. Inputs, evaluation criteria, model routing, prompt versions, parser versions, and relevant upstream artifacts are included in cache keys so stale outputs are not reused across incompatible runs.
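The general technique behind such route-aware keys can be sketched with a content hash over every input that affects a node's output. The exact key schema nexa-gauge uses is not documented here, so treat this as an assumption about the approach, not the real implementation:

```python
import hashlib
import json

# Illustrative deterministic cache key (an assumption about the technique,
# not nexa-gauge's actual key schema). Hashing every output-affecting input
# means any change -- record, model route, prompt version -- yields a new key,
# so stale entries are never reused across incompatible runs.
def cache_key(node: str, record: dict, model: str, prompt_version: str) -> str:
    payload = json.dumps(
        {
            "node": node,
            "record": record,
            "model": model,
            "prompt_version": prompt_version,
        },
        sort_keys=True,  # field order must not change the key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting the keys before hashing is what makes the key deterministic: two runs with identical inputs always hit the same cache entry, while changing the model route alone is enough to invalidate it.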

For run commands, the cache location can be overridden with:

export NEXAGAUGE_CACHE_DIR="./.nexagauge-cache"

Inspect the active cache root:

nexagauge cache dir

Clear cached node outputs:

nexagauge cache delete --dry-run
nexagauge cache delete --yes

Configuration

For local development or repeatable runs, copy the environment template:

cp .env.example .env

Minimum configuration for OpenAI-backed runs:

OPENAI_API_KEY=<your-key>
LLM_MODEL=gpt-4o-mini

Supported per-node overrides follow this pattern:

LLM_CLAIMS_MODEL=openai/gpt-4o-mini
LLM_CLAIMS_FALLBACK_MODEL=openai/gpt-4o
LLM_GROUNDING_TEMPERATURE=0.0
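A resolver for the LLM_<NODE>_MODEL naming pattern above might look like the following. The precedence shown (node-specific variable wins over the global LLM_MODEL) is an assumption for illustration, not nexa-gauge's actual config loader:

```python
import os

# Illustrative resolver for the LLM_<NODE>_MODEL pattern (assumed
# precedence, not nexa-gauge's real loader): a node-specific variable
# overrides the global LLM_MODEL, which overrides a built-in default.
def resolve_model(node: str, default: str = "gpt-4o-mini") -> str:
    node_var = f"LLM_{node.upper()}_MODEL"
    return os.environ.get(node_var) or os.environ.get("LLM_MODEL", default)
```

Under this scheme, setting LLM_CLAIMS_MODEL changes only the claims node while every other node keeps using the global LLM_MODEL.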

Runtime overrides can also be passed through the CLI:

nexagauge run eval \
  --input sample.json \
  --llm-model openai/gpt-4o-mini \
  --llm-model grounding=openai/gpt-4o \
  --llm-fallback openai/gpt-4o

Development

Clone the repository and install it from source:

git clone https://github.com/harnexa/nexa-gauge.git
cd nexa-gauge
pip install -e .

Contributor workflow with uv:

uv sync
make lint
make test
make ci

Build distributions:

uv build

Expected artifacts:

dist/nexa_gauge-<version>-py3-none-any.whl
dist/nexa_gauge-<version>.tar.gz

Releases use release-please. Conventional Commit titles such as feat:, fix:, docs:, deps:, and chore: are recommended for cleaner generated release notes, but they are not required by CI.


License

Distributed under the MIT License. See LICENSE for details.
