The autonomous quality agent for Arize Phoenix. Phoenix shows you what's wrong. Nengok fixes it.
Nengok
Phoenix shows you what's wrong. Nengok fixes it.
Nengok (Malay: "to watch over") is a pip-installable SDK that autonomously detects, diagnoses, and fixes silent failures in AI agents. It connects to your Arize Phoenix instance, samples production traces, clusters failure patterns, generates regression tests from real failures, runs controlled experiments to verify fixes, and presents verified solutions for human approval. Every cycle opens with an ADK LlmAgent (nengok/agents/triage.py) that reads recent traffic through the Arize Phoenix MCP server (McpToolset running @arizeai/phoenix-mcp) and decides whether the full pipeline should wake. Diagnosis runs Gemini 3.1 Pro via google-genai on Vertex AI, and the hosted demo lives on Cloud Run.
Trace data never leaves your infrastructure. Nengok runs locally next to your Phoenix instance, calls your Gemini key, and writes fix artifacts to your local filesystem.
$ pip install nengok
$ nengok init --phoenix-url http://localhost:6006
$ nengok run
Triage: investigate (adk) -> error burst in 'flights' over the last 15m
Observer: 200 spans -> 16 anomalies -> 16 new after dedup
Diagnoser: 3 clusters with hypotheses
Fixer: cluster 'flights-schema-drift' -> baseline 25% -> fix 100% (golden: no regression)
Verifier: PASSED -> artifacts/flights-schema-drift/
Cycle complete: 3 clusters detected, 1 fix proposed, 0 escalations.
Run `nengok dashboard` to review and approve.
Why Nengok?
Every AI agent in production loses money through confident wrong answers. HTTP 200, no error log, just a quietly hallucinated hotel name or a date parsed from the wrong schema. The fix loop is brutal:
- A user complains (days or weeks later).
- A senior engineer digs through trace logs.
- They reproduce, write an eval, hand-craft a fix.
- They never write a regression test, so the same class fails again next month.
Phoenix gives you the observability layer. Nengok adds the autonomous remediation loop on top:
Triage         -> Observer       -> Diagnoser   -> Fixer      -> Verifier
|                 |                 |              |             |
ADK agent         Pull anomalous    Cluster +      Generate      Pass/fail
reads Phoenix     spans from        hypothesize    tests +       gate +
via MCP,          Phoenix           root cause     experiment    artifact
wakes the loop                                                   output
Each cycle takes minutes instead of hours, every fix becomes a permanent regression test, and a human approves every change before anything ships.
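The pass/fail gate at the end of the loop can be pictured as a comparison of two pass rates plus a no-regression check on the golden dataset. The sketch below is illustrative only: `ExperimentResult`, `pass_rate`, and `gate` are hypothetical names, and the acceptance rule is an assumption, not Nengok's actual Verifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentResult:
    """Per-case outcomes (True = passed) for one prompt variant. Hypothetical type."""

    regression_cases: list[bool]  # cases generated from the failing cluster
    golden_cases: list[bool]      # curated cases that must never regress


def pass_rate(cases: list[bool]) -> float:
    """Fraction of passing cases; an empty list counts as 0.0."""
    return sum(cases) / len(cases) if cases else 0.0


def gate(baseline: ExperimentResult, fix: ExperimentResult) -> bool:
    """Pass only if the fix beats the baseline on the cluster's cases and the
    golden dataset does no worse than under the baseline prompt."""
    improved = pass_rate(fix.regression_cases) > pass_rate(baseline.regression_cases)
    no_regression = pass_rate(fix.golden_cases) >= pass_rate(baseline.golden_cases)
    return improved and no_regression


baseline = ExperimentResult([True, False, False, False], [True, True])
fix = ExperimentResult([True, True, True, True], [True, True])
print(gate(baseline, fix))  # baseline 25% -> fix 100%, golden set intact
```

Keeping the golden check separate from the improvement check means a fix that helps one cluster but breaks a curated case is rejected rather than averaged away.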
Features
- Plug-and-play with Phoenix. Works with Phoenix Cloud, self-hosted Phoenix, or `phoenix serve` running on your laptop.
- ADK triage gate. An `LlmAgent` armed with Phoenix MCP tools inspects recent traffic at the head of every cycle and decides whether the full pipeline is worth waking. If the agent errors, the cycle falls back to the rule-based anomaly filter; pass `--no-triage` to skip the gate entirely.
- Two-stage failure filtering. Anomaly query at the SDK layer, then deduplication against previously-seen span IDs. You never re-process healthy traffic.
- Clusters with a memory. A recurring failure mode lands in its existing cluster row instead of minting a new one per cycle. An approved fix that regresses escalates the cluster and fires a notification; rejected and dismissed clusters re-accrete silently instead of re-alerting.
- More than one agent per install. List several Phoenix projects in `project_identifiers` and a single cycle observes them all. When two agents fail for the same upstream reason, the cross-agent linker confirms the pair and both cluster pages show an "Also affects" panel.
- Reviewer feedback becomes signal. Reject and dismiss decisions (optionally tagged `duplicate_cluster`, `mixed_root_causes`, or `not_a_failure`) replay into the next cycle's clusterer prompt, and `nengok improve` reads the last 30 days of outcomes to propose a clustering prompt amendment that only a human can activate.
- Code-first, LLM-second evaluators. Structural checks are programmatic; only subjective dimensions (coherence, intent match) reach an LLM-as-Judge. This mitigates the well-documented position/verbosity bias of LLM judges.
- A/B experiments via Phoenix. Baseline vs. fix prompt, full per-case breakdown, dry-run safeguard.
- Human approval gate. Every fix lands in `artifacts/` and waits for a one-click approve / reject / dismiss in the local dashboard.
- Zero data egress. Your traces stay in your Phoenix. Your Gemini key calls Google directly from your machine. Nothing in this loop goes to a Nengok-controlled endpoint.
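The two-stage filtering described in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the span field names (`span_id`, `status`, `eval_failed`) and both function names are hypothetical, and the real first stage is a query against Phoenix rather than an in-memory check.

```python
from collections.abc import Iterable


def is_anomalous(span: dict) -> bool:
    """Stage 1 stand-in: a rule-based check on assumed span fields."""
    return span.get("status") == "ERROR" or bool(span.get("eval_failed"))


def new_anomalies(spans: Iterable[dict], seen_ids: set[str]) -> list[dict]:
    """Stage 2: drop anomalies whose span ID was handled in an earlier cycle."""
    fresh = []
    for span in spans:
        if is_anomalous(span) and span["span_id"] not in seen_ids:
            fresh.append(span)
            seen_ids.add(span["span_id"])  # remember it for the next cycle
    return fresh


seen = {"a"}  # span IDs processed by previous cycles
spans = [
    {"span_id": "a", "status": "ERROR"},  # anomalous, but already seen
    {"span_id": "b", "status": "OK"},     # healthy traffic, skipped
    {"span_id": "c", "status": "ERROR"},  # new anomaly
]
print([s["span_id"] for s in new_anomalies(spans, seen)])
```

Mutating `seen_ids` in place is a shortcut for the sketch; a persistent store would play that role across real cycles.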
Stack
- Python 3.11+ for the SDK and engine, TypeScript for the dashboard.
- Gemini 3.1 for reasoning (`gemini-3.1-pro-preview`) and LLM-as-Judge (`gemini-3-flash-preview`).
- ADK `LlmAgent` (`nengok/agents/triage.py`) gates every cycle through the Arize Phoenix MCP server (`McpToolset` → `@arizeai/phoenix-mcp`); diagnosis runs Gemini 3.1 Pro via `google-genai` on Vertex AI; hosted on Cloud Run.
- Arize Phoenix for observability (Python SDK + `@arizeai/phoenix-mcp@4.0.13`, CLI).
- FastAPI bundled inside the SDK to serve the dashboard API.
- Vite, React, TypeScript, and Tailwind for the frontend.
- SQLite (default) or any Postgres / MySQL via `DATABASE_URL`, served through `nengok/state/store.py`.
- `pip install nengok` for local use; Cloud Run for the hackathon hosted URL.
Quickstart
Prerequisites
- Python 3.11+
- A reachable Phoenix instance (Phoenix Cloud, self-hosted, or `phoenix serve`)
- A Google AI Studio API key for Gemini
1. Install
pip install nengok
Nengok writes cluster state to ~/.nengok/state.db (SQLite) on first run, so the default install has no database setup step. Point DATABASE_URL at Postgres or MySQL when you want shared state across pods; the optional deploy/local/docker-compose.postgres.yml and deploy/local/docker-compose.mysql.yml files bring a local instance up for backend testing. For local development against this repo, see .github/CONTRIBUTING.md.
2. Configure
nengok init --phoenix-url http://localhost:6006 --project my-agent
export PHOENIX_API_KEY=... # if your Phoenix requires auth
export GOOGLE_API_KEY=...
nengok init writes ~/.nengok/config.toml. Secrets stay in your environment. Every config field is documented with its env var and default in docs/configuration.md.
3. Run a cycle
nengok run
This executes one full Observer -> Diagnoser -> Fixer -> Verifier pass. With the adk extra installed, the cycle opens with the triage agent described in docs/agent-builder.md; pass --no-triage to skip it.
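One way to read the triage behaviour is as a small decision function in front of the four stages: skip the gate when asked, trust the agent when it answers, and fall back to rules when it errors. The sketch below encodes that reading; `should_run_pipeline` and its callable parameters are hypothetical, and the real orchestrator may wire the fallback differently.

```python
from collections.abc import Callable


def should_run_pipeline(
    triage: Callable[[], bool],
    rule_based_filter: Callable[[], bool],
    no_triage: bool = False,
) -> tuple[bool, str]:
    """Return (run the pipeline?, which path made the decision)."""
    if no_triage:
        return True, "triage skipped"  # mirrors passing --no-triage
    try:
        return triage(), "triage agent"
    except Exception:
        # If the agent errors, defer to the rule-based anomaly filter.
        return rule_based_filter(), "rule-based fallback"


def failing_triage() -> bool:
    raise RuntimeError("MCP server unreachable")


print(should_run_pipeline(lambda: True, lambda: False))
print(should_run_pipeline(failing_triage, lambda: True))
```

Returning the decision source alongside the boolean makes it easy to log why a cycle did or did not wake.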
4. Watch continuously (optional)
nengok watch --interval 300
5. Review and approve
nengok dashboard
# Opens http://127.0.0.1:8765
The dashboard renders every fix-proposed cluster (the proposed prompt diff, the regression dataset, the root-cause analysis) and gives you one-click approve / reject / dismiss.
If you operate Nengok over SSH and would rather stay in the terminal, install the optional TUI extra and run nengok review in the same session:
pip install "nengok[tui]"
nengok review
The TUI hits the same FastAPI routes the browser uses, and every decision lands in the same nengok_approvals table tagged with source='tui'. See docs/tui-review.md for keybindings and the audit-log contract.
Project Layout
nengok-codebase/
├── nengok/ # The SDK (pip install nengok)
│ ├── cli.py # nengok run, watch, dashboard, review, init
│ ├── config.py
│ ├── core/ # Orchestrator + the four pipeline stages
│ │ ├── observer/
│ │ ├── diagnoser/
│ │ ├── fixer/
│ │ ├── verifier/
│ │ └── evaluators/ # Code-based + LLM-as-Judge
│ ├── phoenix/ # Phoenix SDK + MCP integration
│ ├── server/ # Bundled FastAPI dashboard API
│ └── state/ # Multi-backend cluster lifecycle (SQLite default; Postgres or MySQL via DATABASE_URL)
├── frontend/ # Vite + React + TS + Tailwind dashboard
├── sample_agent/ # Travel Planner demo agent (3 injectable failures)
├── phoenix_harness/ # Live integration tests against a real Phoenix
├── golden_dataset/ # Curated cases the Verifier never lets regress
├── tests/ # Unit tests (fakes, no network)
├── artifacts/ # Fix output (per-cluster prompt + dataset + RCA)
├── deploy/ # Cloud Run image for the hosted-demo URL
├── pyproject.toml
└── README.md
The Demo Scenario
The sample_agent/ package ships a Travel Planner with three runtime-toggleable failure modes:
| Failure mode | What goes wrong | Effect on the agent |
|---|---|---|
| `flights` | `departure_time` changes from `"14:30"` to `{"hour": 14, "minute": 30}` | Agent emits a malformed itinerary |
| `weather` | Temperature unit silently switches from F to C | Agent suggests a parka for 75 °F weather |
| `hotels` | Endpoint times out 40% of the time | Agent hallucinates hotel names instead of erroring |
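The `flights` failure is the kind of drift a code-based evaluator can catch without an LLM. As a rough sketch, assuming the agent receives each flight as a dict with a `departure_time` field, a structural check might look like this; the function name is hypothetical and this is not the evaluator shipped in the SDK.

```python
import re

# 24-hour "HH:MM", e.g. "14:30"
_HHMM = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d")


def departure_time_is_valid(flight: dict) -> bool:
    """True only when departure_time is a string in HH:MM form."""
    value = flight.get("departure_time")
    return isinstance(value, str) and _HHMM.fullmatch(value) is not None


print(departure_time_is_valid({"departure_time": "14:30"}))
print(departure_time_is_valid({"departure_time": {"hour": 14, "minute": 30}}))
```

The first call accepts the original schema and the second rejects the drifted one, which is the distinction the Observer needs to flag a span as anomalous.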
A second sample agent lives under sample_agent/qa_agent/. It is a tiny retrieval-augmented Q&A with four injectable failure modes: retriever drops the retrieved context, hallucination patches the prompt to answer from memory, wrong_attribution rotates snippet ids so the citation no longer matches its body, and flights_schema rides the same mock flights API as the Travel Planner so the cross-agent linker has a shared upstream failure to find. Nengok can point at it without code changes.
Run the demo with one copy-paste:
pip install "nengok[gemini,phoenix,adk,tui]"
python -m sample_agent.seed --count 5
nengok init --phoenix-url http://localhost:6006 --project travel-planner-agent
nengok run
sample_agent.seed fires five runs of the Travel Planner with every failure mode injected, then prints the Phoenix project URL. Hand the same project name to nengok init and nengok run opens with the ADK triage agent, then walks the four-stage loop end to end. Run nengok dashboard afterwards to approve the verified fix. That install line is the one the demo recording uses: it skips the optional clustering extra on purpose, so every model call in the loop is a Gemini call.
Plug in Your Own Agent
Nengok loads any class that satisfies the AgentRunner protocol: a name property and a run(agent_input: dict, prompt: str) -> dict method. Drop the class in your own package, then point Nengok at it from ~/.nengok/config.toml:
# my_pkg/runner.py
from typing import Any


class MyAgent:
    @property
    def name(self) -> str:
        return "my-agent"

    def run(self, agent_input: dict[str, Any], prompt: str) -> dict[str, Any]:
        # Delegate to your own agent code, passing the prompt under test.
        from my_pkg.agent import answer

        return answer(agent_input["query"], system_prompt=prompt)
# ~/.nengok/config.toml
[nengok]
project_identifier = "my-agent"
agent_runner = "my_pkg.runner:MyAgent"
baseline_prompt_path = "my_pkg/prompts/system.md"
Then nengok doctor confirms the runner imports and the protocol check passes, and nengok run --project my-agent cycles against your traces. The bundled sample_agent/qa_agent/ is a worked example you can copy from.
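For intuition about what a protocol check on a `module:Class` string involves, the sketch below defines a structurally equivalent `AgentRunner` with `typing.Protocol` and a loader built on `importlib`. Both the protocol definition and `load_runner` are assumptions written for illustration; they are not the code behind `nengok doctor`.

```python
import importlib
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class AgentRunner(Protocol):
    """Assumed shape of the protocol: a name property and a run method."""

    @property
    def name(self) -> str: ...

    def run(self, agent_input: dict[str, Any], prompt: str) -> dict[str, Any]: ...


def load_runner(spec: str) -> AgentRunner:
    """Import 'package.module:ClassName', instantiate it, and check its shape."""
    module_name, _, class_name = spec.partition(":")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    cls = getattr(importlib.import_module(module_name), class_name)
    runner = cls()
    # isinstance on a runtime-checkable Protocol only checks that the
    # attributes exist, not their signatures.
    if not isinstance(runner, AgentRunner):
        raise TypeError(f"{spec} does not satisfy AgentRunner")
    return runner


class EchoAgent:
    """Tiny stand-in runner used only to exercise the check."""

    @property
    def name(self) -> str:
        return "echo"

    def run(self, agent_input: dict[str, Any], prompt: str) -> dict[str, Any]:
        return {"answer": agent_input["query"], "prompt": prompt}


print(isinstance(EchoAgent(), AgentRunner))
```

Because the check is structural, your class does not need to inherit from anything in Nengok; having the right attributes is enough.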
Architecture
Your Infrastructure
+---------------------------------------------------------------+
| |
| $ pip install nengok |
| |
| +-------------------------------------------------------+ |
| | Nengok SDK | |
| | | |
| | +--------------------------------+ | |
| | | Triage: ADK LlmAgent | | |
| | | McpToolset -> @arizeai/ | | |
| | | phoenix-mcp -> your Phoenix | | |
| | +---------------+----------------+ | |
| | | investigate? project + window | |
| | v | |
| | +--------+ +----------+ +-------+ +----------+ | |
| | |Observer|->|Diagnoser |->|Fixer |->|Verifier | | |
| | +---+----+ +-----+----+ +---+---+ +----+-----+ | |
| +------+-------------+-----------+------------+---------+ |
| v v v v |
| +---------+ +----------+ +-------+ +-----------+ |
| | Your | | Your | | Your | | Local | |
| | Phoenix | | Gemini | |Phoenix| | artifacts | |
| | (read) | | key | |(write)| | + dash | |
| +---------+ +----------+ +-------+ +-----------+ |
| |
+---------------------------------------------------------------+
Nothing leaves this box.
Project Rules
These are non-negotiable for every contribution. See .github/CONTRIBUTING.md for the full guide.
- Code-first, LLM-second evaluators. Anything objectively verifiable lives in `nengok/core/evaluators/code_evals.py`. LLM-as-Judge is reserved for subjective criteria.
- No data egress. Nengok must never send trace data to a third-party endpoint. Period.
- Human-in-the-loop always. No code path auto-applies a fix.
- Phoenix SDK for writes, MCP for reads. Centralized in `nengok/phoenix/client.py` and `nengok/phoenix/mcp.py`; the triage agent in `nengok/agents/triage.py` reads through the same MCP server via its ADK toolset.
- Pinned Phoenix versions. `arize-phoenix-client` is pinned in `pyproject.toml`. Do not chase upstream releases mid-cycle.
Roadmap
- v0.1 (current): the closed loop end to end. ADK triage gate, Observer -> Diagnoser -> Fixer -> Verifier, cluster identity across cycles, monitoring for several agents at once with cross-agent cluster links, reviewer feedback feeding the clusterer, `nengok improve` retros, local artifacts, and approval from the browser dashboard or the `nengok review` TUI.
- v0.2: `TraceBackend` abstraction so Langfuse and raw OTLP can stand in for Phoenix; optional HDBSCAN embedding pre-pass in front of the Gemini clusterer.
- v0.3: Git MCP integration (approved artifacts open as PRs), event-driven cycle scheduling with a heartbeat threshold.
- v0.4: Plugin architecture for fix strategies, write-back targets, and evaluators; DSPy GEPA and TextGrad fix-generation backends; managed cloud tier (open-core, following the Langfuse playbook). The self-hosted SDK stays the source of truth.
- v1.0: EU AI Act audit bundle built on the `nengok export` format.
Out of scope for v0.1
The v0.1 hackathon release intentionally defers:
- Git MCP integration. Approved fixes write to `artifacts/`; opening them as PRs lands in v0.3.
- A `TraceBackend` abstraction. v0.1 is Phoenix-native; Langfuse and raw OTLP support land in v0.2.
- Event-driven cycle scheduling. The current loop polls on a fixed interval; the heartbeat threshold lands in v0.3.
- HDBSCAN clustering. v0.1 ships the Gemini-only clusterer, with cluster identity and reviewer feedback layered on top. The `clustering` extra exists in `pyproject.toml` but nothing imports it yet.
- Plugin architecture and the DSPy / TextGrad fix backends (v0.4).
Acknowledgements
Nengok is built on top of Arize Phoenix and would not exist without the MCP server, Python SDK, OpenInference instrumentation, and Phoenix Skills published by the Arize team. Nengok automates the workflow Phoenix's own documentation teaches developers to perform by hand.
The clustering and root-cause hypothesis pipeline is informed by Pathak et al. (2025), Detecting Silent Failures in Multi-Agentic AI Trajectories, and by the SAGE benchmark on LLM-as-Judge reliability.
License