Math-driven Agent Harness — CUSUM + HMM + POMDP decision engine with built-in execution loop

judgment

Instead of relying on prompt heuristics, this module uses sequential change-point detection, discrete Hidden Markov Models, and exact POMDP value iteration to make quantifiable, auditable decisions about:

  • Whether the agent is healthy, degraded, or broken (latent state inference)
  • When to continue, correct, escalate, or gather information (optimal action under uncertainty)
  • Whether the current observation stream has drifted from normal (anomaly detection)

This targets the core engineering challenge: knowing when an LLM agent is going off the rails before it wastes context or produces bad output.

Architecture

┌─────────────────────────────┐
│ Layer 1: CUSUM + Hawkes     │  Page (1954) change-point detection
│   Anomaly detection         │  Hawkes (1971) baseline likelihood
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Layer 2: 3-State HMM        │  Rabiner (1989) Forward algorithm
│   Healthy/Degraded/Broken   │  Log-space filtering
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Layer 3: POMDP Policy       │  Kaelbling et al. (1998)
│   Exact value iteration     │  Belief-MDP on 231-point simplex
│   → continue/correct/       │
│     escalate/gather         │
└─────────────────────────────┘

Each layer has a citable mathematical foundation. No heuristic masquerades as math: the heuristic components (content signals, corrective router) are labelled as such.
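
To make Layer 1 concrete, here is a minimal sketch of Page's one-sided CUSUM for an upward shift in the mean. It is not the code in core/cusum.py: the class name, the parameters (target_mean, slack, threshold) and the sample data are assumptions chosen for illustration, and the Hawkes baseline correction is left out.

```python
from dataclasses import dataclass


@dataclass
class OneSidedCusum:
    """Page's one-sided CUSUM for an upward mean shift (illustrative sketch)."""
    target_mean: float      # expected value under normal operation (assumed known)
    slack: float            # allowance k: deviations smaller than this are ignored
    threshold: float        # decision interval h: alarm once the statistic exceeds it
    statistic: float = 0.0

    def update(self, x: float) -> bool:
        # Accumulate evidence of an upward shift; clamp at zero so that long
        # healthy stretches cannot bank "credit" against a later drift.
        self.statistic = max(0.0, self.statistic + (x - self.target_mean - self.slack))
        return self.statistic > self.threshold


detector = OneSidedCusum(target_mean=0.0, slack=0.5, threshold=3.0)
errors_per_step = [0, 0, 1, 0, 0, 2, 2, 3, 2]   # made-up error counts per step
alarms = [detector.update(x) for x in errors_per_step]
first_alarm = alarms.index(True) if True in alarms else None
print(first_alarm)  # 7
```

The isolated error at step 2 decays back to zero, while the sustained run starting at step 5 accumulates until the statistic crosses the threshold at step 7.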

Project Structure

judgment/
├── core/
│   ├── engine.py          # DecisionEngine: 3-layer integration
│   ├── hawkes.py          # Multivariate marked Hawkes (likelihood provider)
│   ├── hmm.py             # 3-state discrete HMM + Forward/Viterbi (7 obs dims)
│   ├── cusum.py           # CUSUM anomaly detector
│   ├── pomdp.py           # Exact belief-MDP value iteration + RewardConfig
│   ├── pomcp.py           # POMCP — online particle MCTS (no grid, scalable)
│   ├── content_signals.py # Content-quality metrics (length, novelty, negation)
│   ├── corrective.py      # Heuristic corrective action router (4 rules)
│   ├── training.py        # Baum-Welch EM for HMM parameter learning
│   └── diagnostics.py     # Structured diagnostic outputs
├── integration/
│   ├── base.py            # Abstract adapter protocol
│   └── langgraph.py       # LangGraph node + conditional-edge router
├── harness/
│   ├── loop.py            # JudgmentHarness: unified execution loop
│   ├── executor.py        # LLMExecutor (OpenAI-compat) + SimulatedExecutor
│   └── tools.py           # ToolRegistry with built-in tools
├── examples/
│   ├── coding_agent_demo.py
│   └── langgraph_agent.py # LangGraph-style agent with judgment oversight
├── cli/
│   └── main.py            # CLI: judgment run / train / dashboard
├── dashboard/
│   └── app.py             # Streamlit live visualizer
├── docs/
│   ├── architecture-redesign.md
│   ├── hawkes-redesign.md
│   └── three-gaps-design.md
├── tests/                 # 139 tests, all passing
├── pyproject.toml
└── README.md

Math → Harness Problems

| Component | Math | Problem Solved |
|---|---|---|
| CUSUM + Hawkes | Page (1954) + Hawkes (1971) | Detects observation drift; Hawkes corrects for expected event clustering |
| HMM Forward | Rabiner (1989) | Infers hidden health state (H/D/B) from noisy structural + content signals |
| POMCP (online MCTS) | Silver & Veness (2010) | Scalable online POMDP solving: particle-based UCT, no grid discretisation |
| Grid POMDP | Kaelbling et al. (1998) | Exact value iteration on 231-point simplex (fast fallback for 3-state case) |
| Content Signals | Heuristic (lightweight) | Detects LLM derailment from text output: length anomaly, repetition, self-contradiction |
| Corrective Router | Heuristic (explicitly labelled) | Maps CORRECT signal to concrete advice (verify/rethink/retry/rollback) |
| Baum-Welch (EM) | Rabiner (1989) §III-C | Learns HMM parameters from agent run logs; semi-supervised mode |
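
The 231-point figure is what a regular grid over the 3-state belief simplex gives at a step of 1/20, since the number of points is C(20 + 2, 2) = 231. The sketch below enumerates such a grid. That the package uses exactly this discretisation is an assumption, and simplex_grid is a hypothetical helper, not part of the judgment API.

```python
from itertools import product
from math import comb


def simplex_grid(n_states: int, resolution: int) -> list[tuple[float, ...]]:
    """All belief vectors whose entries are multiples of 1/resolution and sum to 1."""
    points = []
    # Choose integer counts for the first n_states - 1 entries; the last entry
    # takes whatever is left, and the point is kept only if that is non-negative.
    for counts in product(range(resolution + 1), repeat=n_states - 1):
        remainder = resolution - sum(counts)
        if remainder >= 0:
            points.append(tuple(c / resolution for c in (*counts, remainder)))
    return points


grid = simplex_grid(n_states=3, resolution=20)
print(len(grid), comb(22, 2))  # 231 231
```

The grid size grows combinatorially with the number of states, which is the usual reason for switching to a sampling-based solver such as POMCP in larger state spaces.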

Quick Start

git clone https://github.com/pearthink123/judgment.git
cd judgment
pip install -e ".[dashboard]"   # base + Streamlit + pandas + matplotlib

# Run demo (no API key needed)
python examples/coding_agent_demo.py

# Launch dashboard
streamlit run dashboard/app.py

# Run CLI
judgment run "Implement an LRU cache in Python" --max-steps 10

Using as a Library

from judgment import JudgmentHarness

# One line to create a harness
harness = JudgmentHarness(max_steps=30, seed=42)

# Run a task
result = harness.run("Write a function that merges two sorted lists.")

print(f"Status: {result.status}")
print(f"Steps:  {result.steps}")
print(f"Belief: {result.final_belief}")
print(f"Summary: {result.summary}")

# Inspect the full decision log
for d in result.decision_log:
    print(f"  {d.action:10s}  H={d.belief['healthy']:.3f}  "
          f"D={d.belief['degraded']:.3f}  B={d.belief['broken']:.3f}")
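
The H/D/B values printed above are a posterior over the hidden health state. As a rough illustration of how a forward filter produces such a belief, here is a log-space sketch with a single binary observation (tool success). The transition and emission numbers are invented for the example and are not the package's defaults; the real HMM uses seven observation dimensions.

```python
import math

STATES = ("healthy", "degraded", "broken")

# Hypothetical parameters for illustration only.
TRANSITION = [
    [0.90, 0.08, 0.02],   # from healthy
    [0.10, 0.75, 0.15],   # from degraded
    [0.02, 0.08, 0.90],   # from broken
]
# P(tool_ok | state) and P(not tool_ok | state)
EMISSION = {
    True:  [0.95, 0.60, 0.20],
    False: [0.05, 0.40, 0.80],
}


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_step(log_belief, tool_ok):
    """One predict-then-update step of the forward algorithm in log space."""
    n = len(STATES)
    unnormalised = []
    for j in range(n):
        # Predict: marginalise over the previous state.
        predicted = logsumexp(
            [log_belief[i] + math.log(TRANSITION[i][j]) for i in range(n)]
        )
        # Update: weight by the likelihood of the new observation.
        unnormalised.append(predicted + math.log(EMISSION[tool_ok][j]))
    norm = logsumexp(unnormalised)
    return [v - norm for v in unnormalised]


log_belief = [math.log(p) for p in (0.8, 0.15, 0.05)]
for tool_ok in (True, False, False, False):   # one success, then three failures
    log_belief = forward_step(log_belief, tool_ok)

belief = dict(zip(STATES, (math.exp(v) for v in log_belief)))
print({name: round(p, 3) for name, p in belief.items()})
```

After three consecutive failures the mass shifts from healthy towards broken, which is the kind of signal the policy layer acts on.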

Power users can access individual layers directly:

from judgment import DecisionEngine, RewardConfig, train_hmm

# Custom reward function
reward = RewardConfig.preset("conservative")
engine = DecisionEngine(reward=reward)

# Process observations step by step
decision = engine.step({
    "tool_ok": True,
    "progress_delta": 0.15,
    "has_user_msg": False,
    "error_count_delta": 0,
})
print(decision.action)  # "continue"

# Learn HMM parameters from logs
# (`logs` holds previously recorded agent runs; it is not defined in this snippet)
hmm = train_hmm(logs)

# POMCP mode — scalable online MCTS (no grid)
engine = DecisionEngine(use_pomcp=True, pomcp_n_simulations=2000)

# Content-quality monitoring
engine = DecisionEngine(use_content_signals=True)
decision = engine.step({
    "tool_ok": True,
    "progress_delta": 0.12,
    "has_user_msg": False,
    "error_count_delta": 0,
    "llm_text": "The agent's output text for this step...",
})
# decision.content_signals → length_z_cat, novelty_cat, negation_cat
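
For intuition about what the three content signals measure, the sketch below computes raw (uncategorised) versions of them for one step. It is a guess at the general idea, not the logic in core/content_signals.py: the function names, the marker list and the formulas are all assumptions.

```python
import re
import statistics

# Hypothetical marker list; the package's actual vocabulary may differ.
NEGATION_MARKERS = ("not", "actually", "wait", "incorrect")


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


def content_signals(text: str, history: list[str]) -> dict[str, float]:
    """Raw length, novelty and negation metrics for one output (illustrative)."""
    tokens = tokenize(text)
    past_lengths = [len(tokenize(h)) for h in history]
    # Length anomaly: z-score of this output's token count against earlier steps.
    spread = statistics.stdev(past_lengths) if len(past_lengths) >= 2 else 0.0
    length_z = (len(tokens) - statistics.mean(past_lengths)) / spread if spread > 0 else 0.0
    # Novelty: share of tokens not seen in any earlier output (low means repetition).
    seen = {t for h in history for t in tokenize(h)}
    novelty = sum(t not in seen for t in tokens) / len(tokens) if tokens else 0.0
    # Negation: how often the text walks back on itself.
    negation = float(sum(tokens.count(m) for m in NEGATION_MARKERS))
    return {"length_z": length_z, "novelty": novelty, "negation": negation}


history = ["Read the file and list its functions.", "Added a test for the parser."]
repeat = content_signals("Read the file and list its functions.", history)
fresh = content_signals("Actually wait, that regex is not anchored correctly.", history)
print(repeat["novelty"], fresh["novelty"], fresh["negation"])  # 0.0 1.0 3.0
```

A verbatim repeat scores zero novelty, while the second output is entirely new text with three self-correction markers. Bucketing such raw values into discrete categories would give HMM-ready observations like those named above.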

LangGraph Integration

Drop judgment oversight into an existing LangGraph agent without restructuring your graph:

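from langgraph.graph import StateGraph

from judgment.integration.langgraph import (
    create_judgment_node, create_judgment_router,
)
from judgment import DecisionEngine

engine = DecisionEngine()

# MyState and the my_*_node callables are your existing graph's state and nodes
graph = StateGraph(MyState)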
graph.add_node("agent", my_agent_node)
graph.add_node("tools", my_tool_node)
graph.add_node("human", my_human_node)
graph.add_node("judgment", create_judgment_node(engine))

# After each tool execution, check health
graph.add_edge("tools", "judgment")

# judgment routes back to agent, tools, or human based on engine output
graph.add_conditional_edges(
    "judgment",
    create_judgment_router(engine),
    {"agent": "agent", "tools": "tools", "human": "human"},
)

See examples/langgraph_agent.py for a complete runnable example (no LangGraph install required).
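
A conditional-edge router is a callable that takes the graph state and returns a key of the path map. The sketch below shows one plausible shape for such a router. The mapping from engine actions to node names and the judgment_action state key are assumptions made for illustration; the behaviour of create_judgment_router may differ.

```python
# Hypothetical mapping from the engine's four actions to the three node names
# used in the graph above.
ROUTES = {
    "continue": "agent",   # healthy: let the agent take its next step
    "correct": "agent",    # hand back to the agent, presumably with corrective advice
    "gather": "tools",     # collect more information before deciding
    "escalate": "human",   # hand over to a person
}


def route(state: dict) -> str:
    """Return the next node name, falling back to a human hand-off."""
    # "judgment_action" is an assumed state key, not a documented one.
    return ROUTES.get(state.get("judgment_action"), "human")


print(route({"judgment_action": "gather"}))  # tools
print(route({}))                             # human
```

Defaulting to the human node means an unrecognised or missing action fails safe rather than silently continuing.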

What This Is Not

  • Not a replacement for ReAct / LangGraph / CrewAI. The engine is a critic that watches the execution loop and decides when to intervene. The harness loop (JudgmentHarness) wraps an LLM executor with math-driven oversight.
  • Not a learning system (yet). Baum-Welch can learn HMM parameters from logs, but the POMDP reward function must be configured, not learned.
  • Not a content-quality judge. The HMM tracks structural health signals (tool success rate, progress, error trend) plus optional lightweight content signals (length, novelty, negation). It does not assess whether the LLM's output text is correct.

Roadmap

  • POMDP reward learning from annotated trajectories
  • LangGraph adapter (drop-in decision node)
  • Multi-agent health monitoring (one engine per agent, shared HMM)
  • Streaming observation model (replace discrete categories with continuous soft signals)

License

MIT

References

  • Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1/2), 100–115.
  • Rabiner, L. R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE, 77(2), 257–286.
  • Hawkes, A. G. (1971). "Spectra of Some Self-Exciting and Mutually Exciting Point Processes." Biometrika, 58(1), 83–90.
  • Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). "Planning and Acting in Partially Observable Stochastic Domains." Artificial Intelligence, 101(1-2), 99–134.
  • Silver, D. & Veness, J. (2010). "Monte-Carlo Planning in Large POMDPs." NeurIPS, 23.
