Math-driven Agent Harness — CUSUM + HMM + POMDP decision engine with built-in execution loop

These details have not been verified by PyPI

Project links

Project description

judgment

Math-driven Agent Harness — CUSUM + HMM + POMDP decision engine with built-in execution loop.

Instead of relying on prompt heuristics, this module uses sequential change-point detection, discrete Hidden Markov Models, and exact POMDP value iteration to make quantifiable, auditable decisions about:

Whether the agent is healthy, degraded, or broken (latent state inference)
When to continue, correct, escalate, or gather information (optimal action under uncertainty)
Whether the current observation stream has drifted from normal (anomaly detection)

This targets the core engineering challenge: knowing when an LLM agent is going off the rails before it wastes context or produces bad output.

Architecture

┌─────────────────────────────┐
│ Layer 1: CUSUM + Hawkes     │  Page (1954) change-point detection
│   Anomaly detection         │  Hawkes (1971) baseline likelihood
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Layer 2: 3-State HMM        │  Rabiner (1989) Forward algorithm
│   Healthy/Degraded/Broken   │  Log-space filtering
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Layer 3: POMDP Policy       │  Kaelbling et al. (1998)
│   Exact value iteration     │  Belief-MDP on 231-point simplex
│   → continue/correct/       │
│     escalate/gather         │
└─────────────────────────────┘

Each layer has a citable mathematical foundation. No heuristic masquerading as math.

Project Structure

judgment/
├── core/
│   ├── engine.py          # DecisionEngine: 3-layer integration
│   ├── hawkes.py          # Multivariate marked Hawkes (likelihood provider)
│   ├── hmm.py             # 3-state discrete HMM + Forward/Viterbi (7 obs dims)
│   ├── cusum.py           # CUSUM anomaly detector
│   ├── pomdp.py           # Exact belief-MDP value iteration + RewardConfig
│   ├── pomcp.py           # POMCP — online particle MCTS (no grid, scalable)
│   ├── content_signals.py # Content-quality metrics (length, novelty, negation)
│   ├── corrective.py      # Heuristic corrective action router (4 rules)
│   ├── training.py        # Baum-Welch EM for HMM parameter learning
│   └── diagnostics.py     # Structured diagnostic outputs
├── integration/
│   ├── base.py            # Abstract adapter protocol
│   └── langgraph.py       # LangGraph node + conditional-edge router
├── harness/
│   ├── loop.py            # JudgmentHarness: unified execution loop
│   ├── executor.py        # LLMExecutor (OpenAI-compat) + SimulatedExecutor
│   └── tools.py           # ToolRegistry with built-in tools
├── examples/
│   ├── coding_agent_demo.py
│   └── langgraph_agent.py # LangGraph-style agent with judgment oversight
├── cli/
│   └── main.py            # CLI: judgment run / train / dashboard
├── dashboard/
│   └── app.py             # Streamlit live visualizer
├── docs/
│   ├── architecture-redesign.md
│   ├── hawkes-redesign.md
│   └── three-gaps-design.md
├── tests/                 # 139 tests, all passing
├── pyproject.toml
└── README.md

Math → Harness Problems

Component	Math	Problem Solved
CUSUM + Hawkes	Page (1954) + Hawkes (1971)	Detects observation drift; Hawkes corrects for expected event clustering
HMM Forward	Rabiner (1989)	Infers hidden health state (H/D/B) from noisy structural + content signals
POMCP (online MCTS)	Silver & Veness (2010)	Scalable online POMDP solving — particle-based UCT, no grid discretisation
Grid POMDP	Kaelbling et al. (1998)	Exact value iteration on 231-point simplex (fast fallback for 3-state case)
Content Signals	Heuristic (lightweight)	Detects LLM derailment from text output: length anomaly, repetition, self-contradiction
Corrective Router	Heuristic (explicitly labelled)	Maps CORRECT signal to concrete advice (verify/rethink/retry/rollback)
Baum-Welch (EM)	Rabiner (1989) §III-C	Learns HMM parameters from agent run logs; semi-supervised mode

Quick Start

git clone https://github.com/pearthink123/judgment.git
cd judgment
pip install -e ".[dashboard]"   # base + Streamlit + pandas + matplotlib

# Run demo (no API key needed)
python examples/coding_agent_demo.py

# Launch dashboard
streamlit run dashboard/app.py

# Run CLI
judgment run "Implement an LRU cache in Python" --max-steps 10

Using as a Library

from judgment import JudgmentHarness

# One line to create a harness
harness = JudgmentHarness(max_steps=30, seed=42)

# Run a task
result = harness.run("Write a function that merges two sorted lists.")

print(f"Status: {result.status}")
print(f"Steps:  {result.steps}")
print(f"Belief: {result.final_belief}")
print(f"Summary: {result.summary}")

# Inspect the full decision log
for d in result.decision_log:
    print(f"  {d.action:10s}  H={d.belief['healthy']:.3f}  "
          f"D={d.belief['degraded']:.3f}  B={d.belief['broken']:.3f}")

Power users can access individual layers directly:

from judgment import DecisionEngine, RewardConfig, train_hmm

# Custom reward function
reward = RewardConfig.preset("conservative")
engine = DecisionEngine(reward=reward)

# Process observations step by step
decision = engine.step({
    "tool_ok": True,
    "progress_delta": 0.15,
    "has_user_msg": False,
    "error_count_delta": 0,
})
print(decision.action)  # "continue"

# Learn from logs
hmm = train_hmm(logs)

# POMCP mode — scalable online MCTS (no grid)
engine = DecisionEngine(use_pomcp=True, pomcp_n_simulations=2000)

# Content-quality monitoring
engine = DecisionEngine(use_content_signals=True)
decision = engine.step({
    "tool_ok": True,
    "progress_delta": 0.12,
    "has_user_msg": False,
    "error_count_delta": 0,
    "llm_text": "The agent's output text for this step...",
})
# decision.content_signals → length_z_cat, novelty_cat, negation_cat

LangGraph Integration

Drop judgment oversight into an existing LangGraph agent without restructuring your graph:

from judgment.integration.langgraph import (
    create_judgment_node, create_judgment_router,
)
from judgment import DecisionEngine

engine = DecisionEngine()

graph = StateGraph(MyState)
graph.add_node("agent", my_agent_node)
graph.add_node("tools", my_tool_node)
graph.add_node("human", my_human_node)
graph.add_node("judgment", create_judgment_node(engine))

# After each tool execution, check health
graph.add_edge("tools", "judgment")

# judgment routes back to agent, tools, or human based on engine output
graph.add_conditional_edges(
    "judgment",
    create_judgment_router(engine),
    {"agent": "agent", "tools": "tools", "human": "human"},
)

See examples/langgraph_agent.py for a complete runnable example (no LangGraph install required).

What This Is Not

Not a replacement for ReAct / LangGraph / CrewAI. The engine is a critic that watches the execution loop and decides when to intervene. The harness loop (JudgmentHarness) wraps an LLM executor with math-driven oversight.
Not a learning system (yet). Baum-Welch can learn HMM parameters from logs, but the POMDP reward function must be configured, not learned.
Not a content-quality judge. The HMM cares about structural health signals (tool success rate, progress, error trend), not whether the LLM's output text is correct.

Roadmap

POMDP reward learning from annotated trajectories
LangGraph adapter (drop-in decision node)
Multi-agent health monitoring (one engine per agent, shared HMM)
Streaming observation model (replace discrete categories with continuous soft signals)

License

MIT

References

Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1/2), 100–115.
Rabiner, L. R. (1989). "A Tutorial on Hidden Markov Models." Proceedings of the IEEE, 77(2), 257–286.
Hawkes, A. G. (1971). "Spectra of Some Self-Exciting and Mutually Exciting Point Processes." Biometrika, 58(1), 83–90.
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). "Planning and Acting in Partially Observable Stochastic Domains." Artificial Intelligence, 101(1-2), 99–134.
Silver, D. & Veness, J. (2010). "Monte-Carlo Planning in Large POMDPs." NeurIPS, 23.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.2.0

Jun 27, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

judgment-0.2.0-py3-none-any.whl (62.8 kB view details)

Uploaded Jun 27, 2026 Python 3

File details

Details for the file judgment-0.2.0-py3-none-any.whl.

File metadata

Download URL: judgment-0.2.0-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 62.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for judgment-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1a037955c39371c65eb2126ef6da2dafbc1ad2bb8c973b2a58bce877fb72d9d1`
MD5	`bf44798839fcd3b8b4ed52ac3a3a7627`
BLAKE2b-256	`0a93162282310c79a5d585e984d6b56be2341803f53fd3a467552752e7b2b532`

See more details on using hashes here.

judgment 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

judgment

Architecture

Project Structure

Math → Harness Problems

Quick Start

Using as a Library

LangGraph Integration

What This Is Not

Roadmap

License

References

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

File details

File metadata

File hashes