
Systematic evaluation of LangGraph nodes using Arize Phoenix experiments.


evalwire

Systematic, reproducible evaluation of LangGraph nodes and subgraphs against human-curated testsets, tracked in Arize Phoenix.



What it does

When iterating on a LangGraph agent, it is hard to know whether a change to a specific node improved or degraded its behaviour. Running the full graph end-to-end is expensive and makes it difficult to attribute a score change to a specific component.

evalwire solves this by:

  • Turning a human-curated CSV of queries and expected outputs into versioned Arize Phoenix datasets.
  • Letting you define a task that isolates and invokes individual LangGraph nodes independently of the rest of the graph.
  • Running those tasks against the stored datasets, scoring each output with one or more evaluators, and recording results in Phoenix — giving you a reproducible, comparable experiment per run.

Installation

pip install evalwire
# With LangGraph node-isolation helpers:
pip install 'evalwire[langgraph]'
# With LLM-as-a-judge evaluator:
pip install 'evalwire[llm-judge]'
# Everything:
pip install 'evalwire[all]'

Quick start

1. Upload your testset

evalwire upload --csv data/testset.csv

The CSV must contain a tags column whose values name the target Phoenix dataset (multiple tags can be pipe-delimited: es_search|source_router).
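A minimal testset might look like this (the input and expected_output column names here are illustrative; the only column named above as required is tags):

```csv
input,expected_output,tags
"reset my password","account_support",source_router
"q3 revenue report","Q3 Financial Summary",es_search|source_router
```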

2. Structure your experiments

experiments/
├── es_search/
│   ├── task.py        # defines: async def task(example) -> Any
│   └── top_k.py       # defines: def top_k(output, expected) -> float
└── source_router/
    ├── task.py
    └── accuracy.py

3. Run experiments

evalwire run --experiments experiments/

Built-in evaluators

All factories are importable from evalwire.evaluators and return a callable with signature (output, expected: dict) -> float | bool.

  • make_top_k_evaluator(K=20) -> float: position-weighted retrieval scoring
  • make_membership_evaluator() -> bool: classification / routing label check
  • make_exact_match_evaluator() -> bool: extractive QA, single ground-truth string
  • make_contains_evaluator() -> bool: free-text generation, required phrase present
  • make_regex_evaluator() -> bool: structured format validation (dates, IDs, …)
  • make_json_match_evaluator(keys) -> float: tool-call / structured-output key matching
  • make_schema_evaluator(schema) -> bool: JSON Schema conformance
  • make_numeric_tolerance_evaluator(atol, rtol) -> bool: math / calculation tasks with tolerance
  • make_llm_judge_evaluator(model, prompt, schema) -> float | bool: LLM-as-a-judge with structured output
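The list above only names each evaluator's intent. As an illustration of what position-weighted scoring can mean (this is one plausible scheme, not evalwire's actual formula): a relevant item at rank r earns 1/r, normalized by the score of a perfect ranking.

```python
def positional_top_k(output: list[str], expected: dict, k: int = 20) -> float:
    """Illustrative position-weighted top-K score; not evalwire's implementation.

    A relevant item at rank r contributes 1/r; the sum is normalized by the
    best achievable total, so the result lies in [0.0, 1.0].
    """
    wanted = set(expected["titles"])
    gained = sum(1.0 / rank
                 for rank, item in enumerate(output[:k], start=1)
                 if item in wanted)
    best = sum(1.0 / rank for rank in range(1, min(k, len(wanted)) + 1))
    return gained / best if best else 0.0
```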

Example

from evalwire.evaluators import make_top_k_evaluator, make_exact_match_evaluator

# Drop the factory return value into your experiment directory as the evaluator
top_k = make_top_k_evaluator(K=5)
exact = make_exact_match_evaluator()

LLM judge

from pydantic import BaseModel
from langchain.chat_models import init_chat_model
from evalwire.evaluators import make_llm_judge_evaluator

class Verdict(BaseModel):
    explanation: str
    score: bool  # True = correct

llm_judge = make_llm_judge_evaluator(
    model=init_chat_model("gpt-4o-mini"),
    prompt_template=(
        "Output: {output}\n"
        "Expected: {expected_output}\n"
        "Is the output correct? Think step by step, then set score."
    ),
    output_schema=Verdict,
)

Requires pip install 'evalwire[llm-judge]'.


Node isolation

Use invoke_node to call a single LangGraph node without compiling a full graph:

from evalwire.langgraph import invoke_node

async def task(example) -> list[str]:
    # `retrieve` is your node function; RAGState is your graph's state schema.
    result = await invoke_node(retrieve, example.input["user_query"], RAGState)
    return result["retrieved_titles"]

CLI reference

  • evalwire upload --csv PATH: upload a CSV testset to Phoenix
  • evalwire run --experiments DIR: discover and run all experiments
  • evalwire run --name NAME: run a single named experiment
  • evalwire run --dry-run N: run N examples without recording results
  • evalwire run --concurrency N: run N experiments in parallel

Configuration

Create evalwire.toml in your project root to avoid repeating flags:

[dataset]
csv_path = "data/testset.csv"
on_exist = "skip"

[experiments]
dir = "experiments"
prefix = "eval"
concurrency = 4

Requirements

  • Python >= 3.10
  • arize-phoenix >= 13.0, < 14
  • A running Phoenix instance (local or cloud)

Project details


Download files

Download the file for your platform.

Source Distribution

evalwire-0.3.1.tar.gz (277.9 kB)


Built Distribution


evalwire-0.3.1-py3-none-any.whl (26.7 kB)


File details

Details for the file evalwire-0.3.1.tar.gz.

File metadata

  • Download URL: evalwire-0.3.1.tar.gz
  • Size: 277.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.2 on Ubuntu 24.04 ("noble"), from CI

File hashes

Hashes for evalwire-0.3.1.tar.gz
  • SHA256: 58fe5374b99cd4a8adf3dc276ddc40ac0f619faf9c8d348da408953d7ed81b49
  • MD5: 04e54eaca8c6f5049fc41e6b96787000
  • BLAKE2b-256: 81c3ab91cbb4e79972344cd38e8fe7ed1b633dda6fcbe15e5bb1f804277dc8cd


File details

Details for the file evalwire-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: evalwire-0.3.1-py3-none-any.whl
  • Size: 26.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.2 on Ubuntu 24.04 ("noble"), from CI

File hashes

Hashes for evalwire-0.3.1-py3-none-any.whl
  • SHA256: b19d08a554fb3f7be58377d3d6fe36a7ce3a006e30e4b6abf798a000484a1d12
  • MD5: d4e75fc7b96eb825dde8f289b8f034d8
  • BLAKE2b-256: 5c61fd65cd3834506cf692ff7232e415453d459a16e26976fdf4b1049fbf59f1

