Phionyx evaluation tooling — LLM-as-judge primitive (eval-side) plus the assessment-signal vocabulary from the Phionyx Evaluation Standard v0.2.0. Producers run a judge against a (claim, evidence, rubric) triple and emit a signed Judgment envelope. Caller supplies the LLM client; no hard dependency on any provider SDK.
phionyx-eval
LLM-as-judge primitive (eval-side) for Phionyx runtime-evidence chains. Score a (claim, evidence) pair under a rubric; produce a signed Judgment envelope; verify the chain end-to-end. The caller supplies the LLM client — there is no hard dependency on any provider SDK.
Status
v0.1.0a1 — alpha. Phionyx v0.6.0 W2 deliverable. Ships in the Viryel monorepo at tools/phionyx_eval/; promoted to the public halvrenofviryel/phionyx-eval repo at v0.6.0 release.
What this package is
A small eval-side toolkit:
- LLMClient: Protocol surface (complete(prompt: str) -> str). Plug in Anthropic SDK, OpenAI SDK, LiteLLM, an HTTP wrapper, or a mock.
- Rubric: Pydantic model for a scoring rubric (criteria, integer scale, normalised pass threshold). Four canonical Phionyx rubrics ship by default.
- LLMAsJudge: judges one (claim, evidence) pair under a rubric. Produces a Judgment with per-criterion scores, an aggregate normalised score, a deterministic verdict (pass / fail / uncertain), and the model's overall rationale.
- build_judgment_envelope: wraps a Judgment in a signed, hash-chained envelope. Mirrors the audit-chain pattern used by phionyx-langchain-langgraph and phionyx-mcp-server.
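To illustrate what "Protocol surface" means for the caller, here is a minimal sketch of a structural client type and a mock that satisfies it. The names LLMClientLike and CannedClient are hypothetical stand-ins defined locally for illustration; they are not imported from phionyx_eval, and the package's own LLMClient definition may differ in detail.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMClientLike(Protocol):
    """Local stand-in for a complete(prompt) -> str protocol (assumption)."""

    def complete(self, prompt: str) -> str: ...


class CannedClient:
    """Hypothetical mock: returns a fixed reply and records every prompt."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


client = CannedClient("canned reply")
# Structural typing: no inheritance from the protocol is needed.
print(isinstance(client, LLMClientLike))
```

A mock like this is useful in unit tests, since the judge only needs an object exposing complete(prompt); a real client would wrap a provider SDK call in the same method.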
What this package is NOT
- NOT a runtime cognitive component. LLM-as-judge is a measurement tool. It does not enter the Phionyx mind-loop, does not update memory, and does not affect determinism in phionyx-core. Per the AGI invariants, this is infrastructure, not cognitive progress.
- NOT a benchmark runner. It scores one (claim, evidence) pair at a time. Batch evaluation, score aggregation across many calls, and dashboarding are out of scope for v0.1.
- NOT a compliance certifier. Phionyx publishes mappings; it does not issue compliance guarantees. A passing judgment means "passed structural rubric evaluation", not "approved for production".
Install
pip install phionyx-eval
Requires Python ≥3.10 and phionyx-core >= 0.5.0.
60-second usage
from phionyx_eval import (
    EVIDENCE_COVERAGE_RUBRIC,
    LLMAsJudge,
    build_judgment_envelope,
    GENESIS_HASH,
    __version__,
)


class MyClient:
    """Your existing LLM client — anything with .complete(prompt) -> str."""

    def complete(self, prompt: str) -> str:
        return your_llm.invoke(prompt)  # replace with your call


judge = LLMAsJudge(MyClient())
verdict = judge.judge(
    claim="Fixed the off-by-one in paginate() for the empty-input case",
    evidence="pytest tests/unit/test_paginate.py -k off_by_one — 1/1 pass",
    rubric=EVIDENCE_COVERAGE_RUBRIC,
)
print(verdict.verdict, verdict.aggregate_score)

# Wrap the judgment in a signed envelope for the audit chain:
envelope = build_judgment_envelope(
    judgment=verdict,
    package_version=__version__,
    previous_hash=GENESIS_HASH,  # or the previous envelope's integrity.current
    turn_index=0,
)
Standard rubrics
| Rubric | Pass threshold | Criteria |
|---|---|---|
| EVIDENCE_COVERAGE_RUBRIC | 0.7 | evidence_addresses_claim_scope, evidence_exercises_claimed_paths, evidence_independent_of_claim_text |
| CORRECTNESS_RUBRIC | 0.7 | claim_consistent_with_evidence, no_internal_contradictions, scope_appropriately_qualified |
| COMPLETENESS_RUBRIC | 0.6 | claim_addresses_full_user_scope, omissions_explicitly_acknowledged, edge_cases_considered |
| INDEPENDENT_VERIFIABILITY_RUBRIC | 0.7 | evidence_contains_reproduction_steps, evidence_names_specific_paths_or_commands, evidence_independent_of_agent_narration |
All four use a 0–5 integer scale per criterion. Caller-authored rubrics work the same way; pass a Rubric instance to judge.judge(...).
Verdict derivation
Verdicts are deterministic, not LLM-emitted:
- Average the per-criterion integer scores.
- Normalise into [0, 1] against (scale_max - scale_min).
- If aggregate >= pass_threshold → pass.
- Else if aggregate >= pass_threshold - 0.05 → uncertain (near-miss band).
- Else → fail.
The LLM does not vote on its own pass/fail.
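The derivation steps above can be sketched as a small pure function. This is an illustrative reimplementation, not the package's code: derive_verdict and its parameter names are hypothetical, and subtracting scale_min before dividing is an assumption (it makes no difference for the shipped 0–5 rubrics, where scale_min is 0).

```python
from statistics import fmean


def derive_verdict(scores, scale_min, scale_max, pass_threshold, band=0.05):
    """Return (aggregate, verdict) from per-criterion integer scores.

    Hypothetical helper: mirrors the documented steps, not the real API.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    # Step 1: average the per-criterion scores.
    mean = fmean(scores)
    # Step 2: normalise into [0, 1] (assumes scale_min is subtracted first).
    aggregate = (mean - scale_min) / (scale_max - scale_min)
    # Steps 3-5: threshold, near-miss band, otherwise fail.
    if aggregate >= pass_threshold:
        verdict = "pass"
    elif aggregate >= pass_threshold - band:
        verdict = "uncertain"
    else:
        verdict = "fail"
    return aggregate, verdict


for scores in ([4, 4, 3], [3, 4, 3], [3, 3, 3]):
    print(scores, derive_verdict(scores, 0, 5, 0.7))
```

With a 0.7 threshold on a 0–5 scale, scores of 4, 4, 3 normalise to about 0.73 (pass), 3, 4, 3 to about 0.67 (uncertain), and 3, 3, 3 to 0.6 (fail).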
Composing with the Phionyx audit chain
The JudgmentEnvelope follows the same hash-chained pattern Phionyx uses for AgentMessageEnvelope and the subagent_chain block. A producer accumulating many judgments builds a single linear chain by passing the prior envelope's integrity.current as the next call's previous_hash. Tampering with any envelope's payload (claim text, rubric name, score, rationale) changes the recomputed envelope_hash, so the mismatch is detected on verification.
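The chaining idea can be shown with a generic sketch built on hashlib and json. Everything here is an assumption made for illustration: the helper names, the all-zero GENESIS value, the envelope layout, and the choice of SHA-256 over sorted-key JSON are not taken from the package, and signing is omitted entirely. In real use, build_judgment_envelope and the exported GENESIS_HASH do this work.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder, not the package's GENESIS_HASH


def canonical_hash(obj) -> str:
    """SHA-256 over a sorted-key, compact JSON encoding (assumed scheme)."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def append_envelope(chain: list, payload: dict) -> dict:
    """Append a payload, linking it to the previous envelope's hash."""
    previous = chain[-1]["integrity"]["current"] if chain else GENESIS
    turn_index = len(chain)
    current = canonical_hash(
        {"payload": payload, "previous": previous, "turn_index": turn_index}
    )
    envelope = {
        "payload": payload,
        "turn_index": turn_index,
        "integrity": {"previous": previous, "current": current},
    }
    chain.append(envelope)
    return envelope


def verify_chain(chain: list) -> bool:
    """Recompute every hash in order; any edit or reordering fails."""
    previous = GENESIS
    for index, env in enumerate(chain):
        expected = canonical_hash(
            {"payload": env["payload"], "previous": previous, "turn_index": index}
        )
        integrity = env["integrity"]
        if integrity["previous"] != previous or integrity["current"] != expected:
            return False
        previous = expected
    return True


chain: list = []
append_envelope(chain, {"claim": "fixed paginate()", "score": 0.73})
append_envelope(chain, {"claim": "added regression test", "score": 0.8})
print(verify_chain(chain))
```

Because each hash covers the previous one, changing a payload in the middle of the chain invalidates that envelope and every envelope after it.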
Cross-runtime importers (F13 v0.6.0 W3)
Import Langfuse traces and LangSmith runs into Phionyx envelope chains. Round-trip lossless for the mappable fields named below; non-mappable foreign fields are preserved verbatim under subject.metadata.imported_extras so a future Phionyx-side exporter could reconstruct the foreign shape.
Langfuse
from phionyx_eval import import_langfuse_trace
result = import_langfuse_trace(langfuse_trace_dict)
# result.envelopes[0] → trace_root envelope
# result.envelopes[1:] → one envelope per observation, in original order
# result.mapping_report → MappingReport (mapped_fields, preserved_extras, dropped_fields)
Mappable Langfuse fields:
| Foreign | Phionyx |
|---|---|
| id | subject.foreign_trace_id |
| name, userId, sessionId, release, version, input, output, metadata, tags, public, createdAt, updatedAt | record.<snake_case> |
| Observation id | record.observation_id |
| Observation type | subject.event_type |
| Observation name, startTime, endTime, input, output, level, statusMessage, model, modelParameters, usage, parentObservationId | record.<snake_case> |
Schema: phionyx.imported_langfuse_envelope.v1.
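As a rough illustration of the record.<snake_case> convention and of how non-mappable fields can be set aside rather than dropped, the sketch below converts camelCase keys and splits a trace dict into mapped and extra fields. The helpers camel_to_snake and split_trace_fields are hypothetical, as is the customField key; the importer's actual internals are not documented here and may work differently.

```python
import re

# Trace-level fields listed as mappable in the table above.
MAPPABLE = {
    "name", "userId", "sessionId", "release", "version", "input",
    "output", "metadata", "tags", "public", "createdAt", "updatedAt",
}


def camel_to_snake(name: str) -> str:
    """Insert '_' before each interior capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def split_trace_fields(trace: dict) -> tuple[dict, dict]:
    """Return (record, extras): mapped snake_case fields and the leftovers."""
    record, extras = {}, {}
    for key, value in trace.items():
        if key == "id":
            continue  # mapped separately to subject.foreign_trace_id
        if key in MAPPABLE:
            record[camel_to_snake(key)] = value
        else:
            extras[key] = value  # would live under imported_extras
    return record, extras


record, extras = split_trace_fields(
    {"id": "t1", "userId": "u-42", "createdAt": "2025-01-01", "customField": True}
)
print(record, extras)
```

Keeping the leftovers under their original keys is what allows a later exporter to rebuild the foreign shape without guessing at names.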
LangSmith
from phionyx_eval import import_langsmith_run
result = import_langsmith_run(
    root_run_dict,
    descendants=descendant_run_dicts,  # optional; resolved via child_run_ids
)
# result.envelopes is in depth-first pre-order traversal of the run tree.
Mappable LangSmith fields per run:
| Foreign | Phionyx |
|---|---|
| id | subject.foreign_trace_id |
| run_type | subject.event_type |
| name, inputs, outputs, start_time, end_time, error, extra, parent_run_id, child_run_ids, events, feedback | record.<snake_case> |
Schema: phionyx.imported_langsmith_envelope.v1. Tree shape preserved in record.parent_run_id / record.child_run_ids so a downstream consumer can reconstruct the tree.
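A downstream consumer could rebuild the traversal order from child_run_ids alone. The sketch below walks plain dicts keyed by run id in depth-first pre-order; the preorder helper and the flat runs_by_id layout are assumptions for illustration, since the exact shape of an imported envelope's record is not spelled out here.

```python
def preorder(root_id: str, runs_by_id: dict) -> list:
    """Depth-first pre-order over runs linked by child_run_ids.

    Hypothetical helper: runs_by_id maps a run id to a dict that may
    carry a 'child_run_ids' list. Unknown ids are skipped.
    """
    order = []
    stack = [root_id]
    while stack:
        run_id = stack.pop()
        run = runs_by_id.get(run_id)
        if run is None:
            continue  # descendant not supplied
        order.append(run_id)
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(run.get("child_run_ids") or []))
    return order


runs = {
    "a": {"child_run_ids": ["b", "c"]},
    "b": {"child_run_ids": ["d"]},
    "c": {},
    "d": {"child_run_ids": None},
}
print(preorder("a", runs))
```

An explicit stack avoids recursion limits on deep run trees while visiting each parent before its children, matching the pre-order described above.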
Composition with the judge
The output of either importer is a list of Phionyx envelopes. The LLMAsJudge can then run over any envelope's record payload to score a specific claim (e.g. the drafting step's output addresses the input) under an evidence-coverage rubric — turning a third-party trace into a Phionyx-evaluable evidence record without re-running the original system.
Composing with the Phionyx Evaluation Standard
The four standard rubrics implement Phionyx's cross-domain evidence baseline. They are not the same as the assessment_signal taxonomy in the Phionyx Evaluation Standard v0.2.0 — the standard names which signal a coverage claim is interpreted against; this package names how the judge grades evidence quality on the runtime-evidence dimension. The two compose: a Compliance-Mapping row whose assessment_signal is governance_envelope.integrity.canonical_json_hash_chain can use the EVIDENCE_COVERAGE_RUBRIC to grade whether a specific claim is supported by that signal.
License
AGPL-3.0-or-later, consistent with the rest of the Phionyx open-source distribution.