GMS-Harness — provider-agnostic DOE-driven black-box testing platform for LLM agents
knowlytix-harness
Geometric Memory Systems Harness: black-box agentic testing driven by Design of Experiments (DOE), with graph-verified ground truth and factor analysis. Provider-agnostic: swap between Anthropic, OpenAI, Bedrock, Azure, or local Ollama by changing environment variables, without touching code.
knowlytix-harness is the headline package in the Geometric Memory Systems
family. Use it to turn ad-hoc "does this agent work?" evaluations into
repeatable, statistically grounded campaigns with typed verdicts, a failure
taxonomy, cost tracking, and release gates. The same install also bundles the
runtime-governance surface (knowlytix.harness.governance) for governed
agentic systems in production; no extra step is needed.
- Package: knowlytix-harness
- License: Apache-2.0
- Python: 3.12+
- Status: alpha (v0.x)
Install
pip install knowlytix-harness
Pulls knowlytix-core, knowlytix-knowledge, and knowlytix-benchmark at
matching ~=0.1.0 versions (lockstep releases — no version mismatches). LLM
calls route through LiteLLM: one library, every provider.
Provider setup (pick one)
The same knowlytix-harness wheel runs against any supported provider. Set the right env
vars and go — no code changes.
Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
export GMS_LLM_MODEL=anthropic/claude-opus-4-6
OpenAI
export OPENAI_API_KEY=sk-...
export GMS_LLM_MODEL=openai/gpt-4o-mini
AWS Bedrock
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-west-2
export GMS_LLM_MODEL=bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0
Azure OpenAI
export AZURE_API_KEY=...
export AZURE_API_BASE=https://your-resource.openai.azure.com
export AZURE_API_VERSION=2024-02-15-preview
export GMS_LLM_MODEL=azure/your-deployment-name
Local Ollama (no API key)
export OLLAMA_BASE_URL=http://localhost:11434
export GMS_LLM_MODEL=ollama/llama3
Full list including Google, Mistral, Cohere, Together, and more in
.env.example from the source repo.
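Before launching a campaign it can help to confirm that the variables for the chosen provider are actually set. The following is a minimal sketch, not part of the package: `REQUIRED_ENV` and `missing_env` are hypothetical names, and the mapping is simply transcribed from the provider blocks above (providers listed only in .env.example are not covered).

```python
import os
from collections.abc import Mapping

# Hypothetical helper, not part of knowlytix-harness: the variables each
# provider block above exports, keyed by the prefix of GMS_LLM_MODEL.
REQUIRED_ENV: dict[str, tuple[str, ...]] = {
    "anthropic": ("ANTHROPIC_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
    "bedrock": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"),
    "azure": ("AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"),
    "ollama": ("OLLAMA_BASE_URL",),
}


def missing_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the variables that are unset or empty for the configured provider."""
    model = env.get("GMS_LLM_MODEL", "")
    if not model:
        return ["GMS_LLM_MODEL"]
    provider = model.split("/", 1)[0]
    # Providers missing from REQUIRED_ENV are not checked at all.
    return [name for name in REQUIRED_ENV.get(provider, ()) if not env.get(name)]
```

Calling `missing_env()` with no argument inspects the current process environment; an empty list means every variable from the matching block is present.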
Tutorials
Two hands-on tutorial tracks ship inside the wheel:
| Track | Notebooks | Path |
|---|---|---|
| Testing — DOE-driven black-box testing, calibration, release gates | 24 | knowlytix/harness/testing/tutorials/notebooks/ |
| Governance — USER_GUIDE companion exercises | 27 | knowlytix/harness/governance/tutorials/notebooks/ |
Install the tutorial extras (Anthropic SDK, JupyterLab, matplotlib, shap):
pip install "knowlytix-harness[tutorials]"
export ANTHROPIC_API_KEY=sk-ant-... # tutorials call claude-sonnet-4-6 directly
Launch:
jupyter lab "$(python -c "from importlib.resources import files; print(files('knowlytix.harness.testing.tutorials').joinpath('notebooks'))")"
# or navigate manually to the notebooks/ path inside your site-packages
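The launch command resolves the notebooks directory through `importlib.resources`. The same lookup can be done from Python to see which notebooks an installed package ships. This is a sketch under assumptions: `list_notebooks` is a hypothetical helper, and it only assumes the `notebooks/` subdirectory layout shown in the table above.

```python
from importlib.resources import files


def list_notebooks(package: str) -> list[str]:
    """List .ipynb files in the `notebooks` directory shipped inside `package`."""
    notebooks = files(package).joinpath("notebooks")
    if not notebooks.is_dir():
        # The package is importable but ships no notebooks directory.
        return []
    return sorted(p.name for p in notebooks.iterdir() if p.name.endswith(".ipynb"))
```

With the tutorials extra installed, `list_notebooks("knowlytix.harness.testing.tutorials")` would be the call matching the Testing track; it raises `ModuleNotFoundError` if the package is not installed.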
Post-install verification
After pip install knowlytix-harness, three commands confirm your stack is
healthy and open the human-facing exploration notebook:
pip install jupyterlab # if not already installed
knowlytix-smoke # 5-step key-free assertion suite
jupyter lab $(knowlytix-smoke --notebook-path) # interactive walkthrough (requires [tutorials])
knowlytix-smoke exits 0 on a healthy install; exit 1 names which of the
5 checks failed (imports + __all__, Settings defaults,
importlib.resources fixture reachability, knowlytix.benchmark.score_answer
on shipped predictions, harness DOE fixture schema). The notebook
is shipped as package data inside this wheel — no repo clone
needed — and its --notebook-path output is an absolute,
symlink-resolved filesystem path.
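Because knowlytix-smoke signals health through its exit status, it is easy to wrap in a CI step. A minimal sketch, assuming only the exit-code contract described above; `run_check` is a hypothetical helper name:

```python
import subprocess
import sys
from collections.abc import Sequence


def run_check(command: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        # Surface whatever the command printed so the failed check is visible.
        print(completed.stdout + completed.stderr, file=sys.stderr)
    return completed.returncode == 0
```

In a pipeline this might be used as `sys.exit(0 if run_check(["knowlytix-smoke"]) else 1)`.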
CLI quickstart
# 1. Verify install
knowlytix-harness --help
# 2. Smoke test against the bundled fixture (no external data, no API key needed
# if you use a dry-run evaluator)
knowlytix-harness run --fixture doe_smoke.json --dry-run
# 3. Live run with an LLM evaluator
knowlytix-harness run --markdown report.md --factor-group query_core --n-runs 32
# Alias — knowlytix-harness and knowlytix-testing are the same entry point
knowlytix-testing run --campaign campaigns/regression.yaml
Programmatic quickstart — one DOE campaign end-to-end
import os

from gms import get_llm, ModelPurpose
from knowlytix.harness.testing import (
    DOEGMSBenchmark, DOEHarnessConfig,
    make_evaluator, HallucinationOracle,
)

config = DOEHarnessConfig(
    markdown_path="report.md",      # document under test
    factor_group="query_core",      # DOE factor group
    n_runs=32,
    enable_hallucination_testing=True,
    enable_cost_tracking=True,
)
bench = DOEGMSBenchmark(config)
bench.ingest()

# make_evaluator(target_type, target_model, client=None, harness=None)
evaluator = make_evaluator(
    target_type="llm",
    target_model=os.environ["GMS_LLM_MODEL"],
    client=get_llm(ModelPurpose.DEFAULT),
)
result = bench.run(evaluator=evaluator)
analyzer = bench.analyze(result)    # returns a DOEAnalyzer (from graphdoe)
print(analyzer.summary())           # check the DOEAnalyzer API for exact method
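For readers new to Design of Experiments, the idea behind factor analysis can be illustrated without the harness. The sketch below builds a two-level full factorial design and computes main effects on a synthetic response. It is a generic illustration, not the graphdoe or DOEAnalyzer API; whether the harness uses a full factorial design is an assumption, and the five anonymous factors are hypothetical (five two-level factors happen to give 32 runs).

```python
from collections.abc import Sequence
from itertools import product
from statistics import mean

Run = tuple[int, ...]


def full_factorial(n_factors: int) -> list[Run]:
    """All combinations of low (-1) and high (+1) levels: 2**n_factors runs."""
    return list(product((-1, 1), repeat=n_factors))


def main_effects(runs: Sequence[Run], responses: Sequence[float]) -> list[float]:
    """Main effect per factor: mean response at +1 minus mean response at -1."""
    effects = []
    for j in range(len(runs[0])):
        high = [y for run, y in zip(runs, responses) if run[j] == 1]
        low = [y for run, y in zip(runs, responses) if run[j] == -1]
        effects.append(mean(high) - mean(low))
    return effects


# Synthetic response for illustration only: factor 0 dominates, factor 1 is minor,
# and the remaining three factors have no influence.
runs = full_factorial(5)
responses = [2.0 * r[0] + 0.5 * r[1] for r in runs]
effects = main_effects(runs, responses)
```

A large main effect flags a factor whose setting changes the measured outcome, which is the kind of signal a factor-group sweep is meant to expose.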
Configuration reference
GMSH_* — harness tuning
| Variable | Default | Meaning |
|---|---|---|
| GMSH_DOE_N_RUNS | 32 | Runs per DOE campaign. |
| GMSH_DOE_SEED | 42 | RNG seed for run selection. |
| GMSH_DOE_SLA_LATENCY_MS | 5000 | Per-call latency SLA. |
| GMSH_DOE_COST_BUDGET_USD | 10.0 | Campaign-level USD ceiling. |
| GMSH_DOE_HALLUCINATION_THRESHOLD | 0.1 | Max tolerated hallucination rate. |
| GMSH_MAX_WORKERS | 4 | Parallel eval worker count. |
| GMSH_QUESTION_TIMEOUT_S | 60 | Per-question timeout. |
| GMSH_MAX_RETRIES | 2 | Retry count on evaluator error. |
| GMSH_MAX_TURNS | 8 | Multi-turn conversation cap. |
| GMSH_TRUNCATE_RESULT_AT | 10000 | Character cap on captured outputs. |
| GMSH_STORES_DIR | ./gms_stores | Where ingested stores live. |
| GMSH_TRACING_DIR | ./doe_tracing_store | Trace artifact root. |
| GMSH_RUNS_DIR | ./runs | Run records output dir. |
| GMSH_CAMPAIGNS_DIR | ./campaigns | Campaign manifests. |
| GMSH_SESSION_STORE_PATH | ./harness_session_store | Session state. |
| GMSH_LIVE_DASHBOARD_PORT | 8765 | Live WebSocket dashboard port. |
Twenty-one GMSH_ENABLE_* feature flags toggle optional subsystems (typed
verdicts, provenance, gateway fault injection, policy engine, stateful
testing, hallucination oracle, calibration, multi-agent, cross-document,
disambiguation, invariance, streaming, live dashboard, and more). See the
USER_GUIDE.md shipped in the wheel for the full list.
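The tuning variables follow the usual pattern of an environment override with a typed default. The sketch below shows that pattern for a handful of them. It is not the package's GMSHSettings class: `HarnessTuning` and `load_tuning` are hypothetical names, and only the variable names and defaults are taken from the table above.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessTuning:
    """Hypothetical stand-in for a settings object; defaults copied from the table."""
    n_runs: int = 32
    seed: int = 42
    sla_latency_ms: int = 5000
    cost_budget_usd: float = 10.0
    hallucination_threshold: float = 0.1
    max_workers: int = 4


_ENV_NAMES = {
    "n_runs": "GMSH_DOE_N_RUNS",
    "seed": "GMSH_DOE_SEED",
    "sla_latency_ms": "GMSH_DOE_SLA_LATENCY_MS",
    "cost_budget_usd": "GMSH_DOE_COST_BUDGET_USD",
    "hallucination_threshold": "GMSH_DOE_HALLUCINATION_THRESHOLD",
    "max_workers": "GMSH_MAX_WORKERS",
}


def load_tuning(env: Mapping[str, str] = os.environ) -> HarnessTuning:
    """Build settings from the environment, falling back to the documented defaults."""
    defaults = HarnessTuning()
    values = {}
    for field_name, env_name in _ENV_NAMES.items():
        default = getattr(defaults, field_name)
        raw = env.get(env_name)
        # Cast with the default's type so "64" becomes 64 and "2.5" becomes 2.5.
        values[field_name] = type(default)(raw) if raw is not None else default
    return HarnessTuning(**values)
```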
GMS_LLM_* — LLM routing
| Variable | Meaning |
|---|---|
| GMS_LLM_MODEL | Base LiteLLM model string. Required unless every purpose is overridden. |
| GMS_LLM_MODEL_JUDGE | Override for judge/verifier calls. |
| GMS_LLM_MODEL_GENERATOR | Override for question-generation. |
| GMS_LLM_MODEL_SCORER | Override for scoring. |
| GMS_LLM_TIMEOUT_SECONDS | Per-call timeout. Default 60. |
| GMS_LLM_MAX_RETRIES | Transient retries. Default 2. |
| GMS_LLM_TEMPERATURE | Sampling temperature. Default 0.0. |
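The routing table implies a simple precedence: a per-purpose override wins, otherwise the base model is used. A minimal sketch of that lookup, assuming this precedence is what the package implements; `resolve_model` is a hypothetical function, not the `get_llm` API.

```python
from collections.abc import Mapping


def resolve_model(purpose: str, env: Mapping[str, str]) -> str:
    """Pick the model for a purpose: per-purpose override first, then the base model."""
    override_name = f"GMS_LLM_MODEL_{purpose.upper()}"
    override = env.get(override_name)
    if override:
        return override
    base = env.get("GMS_LLM_MODEL")
    if not base:
        # Mirrors "Required unless every purpose is overridden" from the table.
        raise KeyError(f"Set GMS_LLM_MODEL or {override_name}")
    return base
```

For example, with a cheap base model and `GMS_LLM_MODEL_JUDGE` set to a stronger one, only judge calls would be routed to the stronger model.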
Architecture in one paragraph
knowlytix-harness decomposes "did my agent behave correctly?" into (1) document
ingestion via knowlytix-knowledge → geometric memory store, (2) auto-generation of
graph-verified questions via knowlytix-benchmark + geometric generators, (3)
DOE factor-group sweep producing a structured run matrix, (4) typed verdict
verification against provable graph traversals, (5) failure taxonomy +
severity classification + cost/latency tracking, (6) release-gate decision
with audit packet. Every step is provider-agnostic — the same campaign YAML
runs unchanged against any supported LLM.
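Step (6), the release-gate decision, can be pictured as a comparison of campaign statistics against thresholds. The sketch below is illustrative only: `CampaignStats` and `release_gate` are hypothetical names, the default thresholds are borrowed from the GMSH_DOE_* defaults above, and the real gate may weigh more signals than these three.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CampaignStats:
    """Hypothetical summary of one campaign."""
    hallucination_rate: float   # fraction of answers flagged, 0.0 to 1.0
    total_cost_usd: float
    max_latency_ms: float       # slowest single call


def release_gate(
    stats: CampaignStats,
    *,
    max_hallucination_rate: float = 0.1,
    cost_budget_usd: float = 10.0,
    sla_latency_ms: float = 5000.0,
) -> tuple[bool, list[str]]:
    """Return (passed, reasons); the gate passes only when no threshold is exceeded."""
    reasons = []
    if stats.hallucination_rate > max_hallucination_rate:
        reasons.append("hallucination rate above threshold")
    if stats.total_cost_usd > cost_budget_usd:
        reasons.append("cost budget exceeded")
    if stats.max_latency_ms > sla_latency_ms:
        reasons.append("latency SLA violated")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict keeps the decision auditable, in the spirit of the audit packet mentioned above.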
For runtime governance of agentic systems in production, the wheel also ships knowlytix.harness.governance (the governed harness): triple-gate tool gateway (schema + policy + plausibility), typed claim verification routed to GMS primitives, behavioral FSM contracts, governance bundle signing, runtime gates, and drift monitoring. Same wheel; same install — pip install knowlytix-harness gives you both the black-box testing and the governed-runtime surface.
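The triple-gate idea (schema, then policy, then plausibility) can be sketched as a chain of checks where the first rejection stops the call. Everything here is hypothetical and is not the GovernedToolGateway API: the `transfer` tool, its schema, the policy limit, and the `KNOWN_ACCOUNTS` set (a stand-in for a lookup against trusted ground truth) exist only for illustration.

```python
from collections.abc import Callable, Mapping, Sequence
from typing import Any, Optional

# A gate returns a rejection reason, or None when the call may proceed.
Gate = Callable[[str, Mapping[str, Any]], Optional[str]]

SCHEMAS = {"transfer": {"amount": float, "to": str}}   # hypothetical tool schema
KNOWN_ACCOUNTS = {"acct-1", "acct-2"}                  # hypothetical ground truth


def schema_gate(tool: str, args: Mapping[str, Any]) -> Optional[str]:
    schema = SCHEMAS.get(tool)
    if schema is None:
        return f"unknown tool {tool!r}"
    for name, expected in schema.items():
        if not isinstance(args.get(name), expected):
            return f"argument {name!r} must be {expected.__name__}"
    return None


def policy_gate(tool: str, args: Mapping[str, Any]) -> Optional[str]:
    # Hypothetical policy: transfers above 1000 are not allowed automatically.
    if tool == "transfer" and args["amount"] > 1000:
        return "amount exceeds policy limit"
    return None


def plausibility_gate(tool: str, args: Mapping[str, Any]) -> Optional[str]:
    if tool == "transfer" and args["to"] not in KNOWN_ACCOUNTS:
        return "unknown recipient"
    return None


def authorize(
    tool: str,
    args: Mapping[str, Any],
    gates: Sequence[Gate] = (schema_gate, policy_gate, plausibility_gate),
) -> tuple[bool, Optional[str]]:
    """Run the gates in order; later gates only see calls that passed earlier ones."""
    for gate in gates:
        reason = gate(tool, args)
        if reason is not None:
            return False, reason
    return True, None
```

Ordering matters in this sketch: the policy and plausibility gates index into `args` directly, which is safe only because the schema gate has already validated the argument names and types.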
Public API
The wheel ships two subpackages — black-box testing (knowlytix.harness.testing) and runtime governance (knowlytix.harness.governance):
# Black-box DOE testing (the headline product)
from knowlytix.harness.testing import (
    # Core
    DOEGMSBenchmark, DOEHarnessConfig, GMSHSettings,
    # Evaluators + judges
    LLMEvaluator, AgentEvaluator, make_evaluator, GMSJudge,
    # Oracles + taxonomy
    HallucinationOracle, SeverityClassifier, CompositeOracle,
    # Agentic testing
    ToolGateway, PolicyEngine, CampaignManager,
    # …195 symbols total in __all__
)

# Runtime governance: the governed harness
from knowlytix.harness.governance import (
    # Triple-gate tool gateway: schema validation + policy + GMS plausibility
    GovernedToolGateway,
    # Typed claim verification routed to GMS primitives
    ClaimRouter, TypedClaim,
    # Behavioral FSM contracts (advisory / recommendation / action-taking)
    BehavioralContract,
    # End-to-end orchestrator + lifecycle state machine
    GovernedOrchestrator,
    # Runtime gates + drift monitoring + bundle signing
    RuntimeGate, DriftMonitor, GovernanceBundle,
)
See the top of harness/testing/__init__.py and harness/governance/__init__.py for the full declarations or USER_GUIDE.md for task-oriented navigation.
Related packages
| Package | Role |
|---|---|
| knowlytix-core | Geometric memory engine |
| knowlytix-knowledge | Document ingest + query front-end |
| knowlytix-benchmark | Structured-retrieval benchmark |
Links
- Source: knowlytix/gms
- Book: Geometric Memory Systems (forthcoming)
- Papers: DOE-GMS Benchmark, GMSH Black-Box Agentic Testing