Phionyx evaluation tooling — LLM-as-judge primitive (eval-side) plus the assessment-signal vocabulary from the Phionyx Evaluation Standard v0.2.0. Producers run a judge against a (claim, evidence, rubric) triple and emit a signed Judgment envelope. Caller supplies the LLM client; no hard dependency on any provider SDK.

Maintainers

phionyx

phionyx-eval

LLM-as-judge primitive (eval-side) for Phionyx runtime-evidence chains. Score a (claim, evidence) pair under a rubric; produce a signed Judgment envelope; verify the chain end-to-end. The caller supplies the LLM client — there is no hard dependency on any provider SDK.

Status

v0.1.0a1 (alpha). Phionyx v0.6.0 W2 deliverable. It ships in the Viryel monorepo at tools/phionyx_eval/ and is promoted to the public halvrenofviryel/phionyx-eval repo at the v0.6.0 release.

What this package is

A small eval-side toolkit:

- LLMClient: Protocol surface (complete(prompt: str) -> str). Plug in Anthropic SDK, OpenAI SDK, LiteLLM, an HTTP wrapper, or a mock.
- Rubric: Pydantic model for a scoring rubric: criteria, integer scale, normalised pass threshold. Four canonical Phionyx rubrics ship by default.
- LLMAsJudge: judges one (claim, evidence) pair under a rubric. Produces a Judgment with per-criterion scores, an aggregate normalised score, a deterministic verdict (pass / fail / uncertain), and the model's overall rationale.
- build_judgment_envelope: wraps a Judgment in a signed, hash-chained envelope. Mirrors the audit-chain pattern used by phionyx-langchain-langgraph and phionyx-mcp-server.
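
As an illustration of the LLMClient surface described above, here is a minimal sketch of a mock client for tests. `LLMClientLike` is a local stand-in written for this example, not the package's own Protocol class, and `CannedClient` is a hypothetical name; the canned reply text is arbitrary and says nothing about the response format the judge expects.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMClientLike(Protocol):
    """Local stand-in for the documented surface: complete(prompt) -> str."""

    def complete(self, prompt: str) -> str: ...


class CannedClient:
    """Hypothetical mock: returns a fixed reply and records every prompt."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)  # keep prompts so a test can inspect them
        return self.reply


client = CannedClient("canned reply")
```

Because the surface is structural, any object with a matching `complete` method should be usable; no inheritance from a package base class is assumed here.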

What this package is NOT

- NOT a runtime cognitive component. LLM-as-judge is a measurement tool. It does not enter the Phionyx mind-loop, does not update memory, and does not affect determinism in phionyx-core. Per the AGI invariants, this is infrastructure, not cognitive progress.
- NOT a benchmark runner. It scores one (claim, evidence) pair at a time. Batch evaluation, score aggregation across many calls, and dashboarding are out of scope for v0.1.
- NOT a compliance certifier. Phionyx publishes mappings; it does not issue compliance guarantees. A passing judgment means "passed structural rubric evaluation", not "approved for production".

Install

```shell
pip install phionyx-eval
```

Requires Python >= 3.10 and phionyx-core >= 0.5.0.

60-second usage

```python
from phionyx_eval import (
    EVIDENCE_COVERAGE_RUBRIC,
    LLMAsJudge,
    build_judgment_envelope,
    GENESIS_HASH,
    __version__,
)

class MyClient:
    """Your existing LLM client: anything with .complete(prompt) -> str."""
    def complete(self, prompt: str) -> str:
        # `your_llm` is a placeholder; replace this line with your own call.
        return your_llm.invoke(prompt)

judge = LLMAsJudge(MyClient())
judgment = judge.judge(
    claim="Fixed the off-by-one in paginate() for the empty-input case",
    evidence="pytest tests/unit/test_paginate.py -k off_by_one; 1/1 pass",
    rubric=EVIDENCE_COVERAGE_RUBRIC,
)
print(judgment.verdict, judgment.aggregate_score)

# Wrap the judgment in a signed envelope for the audit chain:
envelope = build_judgment_envelope(
    judgment=judgment,
    package_version=__version__,
    previous_hash=GENESIS_HASH,  # or the previous envelope's integrity.current
    turn_index=0,
)
```

Standard rubrics

| Rubric | Pass threshold | Criteria |
| --- | --- | --- |
| `EVIDENCE_COVERAGE_RUBRIC` | 0.7 | `evidence_addresses_claim_scope`, `evidence_exercises_claimed_paths`, `evidence_independent_of_claim_text` |
| `CORRECTNESS_RUBRIC` | 0.7 | `claim_consistent_with_evidence`, `no_internal_contradictions`, `scope_appropriately_qualified` |
| `COMPLETENESS_RUBRIC` | 0.6 | `claim_addresses_full_user_scope`, `omissions_explicitly_acknowledged`, `edge_cases_considered` |
| `INDEPENDENT_VERIFIABILITY_RUBRIC` | 0.7 | `evidence_contains_reproduction_steps`, `evidence_names_specific_paths_or_commands`, `evidence_independent_of_agent_narration` |

All four use a 0–5 integer scale per criterion. Caller-authored rubrics work the same way; pass a Rubric instance to judge.judge(...).

Verdict derivation

Verdicts are deterministic, not LLM-emitted:

1. Average the per-criterion integer scores.
2. Normalise into [0, 1] against (scale_max - scale_min).
3. If aggregate >= pass_threshold → pass.
4. Else if aggregate >= pass_threshold - 0.05 → uncertain (near-miss band).
5. Else → fail.

The LLM does not vote on its own pass/fail.
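
The derivation above can be sketched as a small pure function. This is an illustration, not the package's code: `derive_verdict` and `NEAR_MISS_BAND` are hypothetical names, and the normalisation is assumed to be `(mean - scale_min) / (scale_max - scale_min)`, which is one natural reading of the normalisation step.

```python
from statistics import fmean

NEAR_MISS_BAND = 0.05  # width of the "uncertain" band below the pass threshold


def derive_verdict(
    scores: list[int], scale_min: int, scale_max: int, pass_threshold: float
) -> tuple[str, float]:
    """Return (verdict, aggregate) from per-criterion integer scores."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")

    # Average the scores, then normalise the mean into [0, 1] (assumed formula).
    aggregate = (fmean(scores) - scale_min) / (scale_max - scale_min)

    if aggregate >= pass_threshold:
        return "pass", aggregate
    if aggregate >= pass_threshold - NEAR_MISS_BAND:
        return "uncertain", aggregate
    return "fail", aggregate
```

On the 0–5 scale with a 0.7 threshold, scores of 4, 4, 3 normalise to roughly 0.73 (pass), scores of 3, 3, 4 to roughly 0.67 (uncertain), and scores of 3, 3, 3 to 0.6 (fail). An aggregate that lands exactly on a band edge is subject to floating-point rounding in this sketch.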

Composing with the Phionyx audit chain

The JudgmentEnvelope follows the same hash-chained pattern Phionyx uses for AgentMessageEnvelope and the subagent_chain block. A producer accumulating many judgments builds a single linear chain by passing the prior envelope's integrity.current as the next call's previous_hash. Tampering with any envelope's payload (claim text, rubric name, score, rationale) changes the recomputed envelope_hash, so verification of the chain fails.
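
To make the chaining idea concrete, here is a generic hash-chain sketch built only on the standard library. It is not the package's envelope format: the dictionary layout, the `GENESIS` value, and the function names are assumptions for illustration, signing is omitted entirely, and canonical JSON is approximated with sorted keys and compact separators.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder; the package exports its own GENESIS_HASH


def link_hash(payload: dict, previous_hash: str) -> str:
    """SHA-256 over a deterministic JSON encoding of payload + previous hash."""
    canonical = json.dumps(
        {"payload": payload, "previous": previous_hash},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_link(chain: list[dict], payload: dict) -> None:
    """Append a payload, linking it to the hash of the prior entry."""
    previous = chain[-1]["current"] if chain else GENESIS
    chain.append(
        {"payload": payload, "previous": previous,
         "current": link_hash(payload, previous)}
    )


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash and check that each link points at its predecessor."""
    expected_previous = GENESIS
    for link in chain:
        if link["previous"] != expected_previous:
            return False
        if link_hash(link["payload"], link["previous"]) != link["current"]:
            return False
        expected_previous = link["current"]
    return True
```

Editing any stored payload after the fact makes `verify_chain` return False, because the recomputed hash no longer matches the stored one.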

Cross-runtime importers (F13 v0.6.0 W3)

Import Langfuse traces and LangSmith runs into Phionyx envelope chains. The round trip is lossless for the mappable fields named below; non-mappable foreign fields are preserved verbatim under subject.metadata.imported_extras so that a future Phionyx-side exporter could reconstruct the foreign shape.

Langfuse

```python
from phionyx_eval import import_langfuse_trace

result = import_langfuse_trace(langfuse_trace_dict)
# result.envelopes[0]   → trace_root envelope
# result.envelopes[1:]  → one envelope per observation, in original order
# result.mapping_report → MappingReport (mapped_fields, preserved_extras, dropped_fields)
```

Mappable Langfuse fields:

| Foreign | Phionyx |
| --- | --- |
| `id` | `subject.foreign_trace_id` |
| `name`, `userId`, `sessionId`, `release`, `version`, `input`, `output`, `metadata`, `tags`, `public`, `createdAt`, `updatedAt` | `record.<snake_case>` |
| Observation `id` | `record.observation_id` |
| Observation `type` | `subject.event_type` |
| Observation `name`, `startTime`, `endTime`, `input`, `output`, `level`, `statusMessage`, `model`, `modelParameters`, `usage`, `parentObservationId` | `record.<snake_case>` |

Schema: phionyx.imported_langfuse_envelope.v1.
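
The trace-level part of this mapping can be sketched as follows. This is not the importer's implementation: `map_trace_fields` and `to_snake_case` are hypothetical helpers, the returned dictionary is a simplified stand-in for an envelope, and observations are not handled. The sample key `bookmarked` in the usage note is an arbitrary non-mappable field chosen for illustration.

```python
import re

# Trace-level fields listed as mappable in the table above.
MAPPABLE_TRACE_FIELDS = {
    "name", "userId", "sessionId", "release", "version", "input",
    "output", "metadata", "tags", "public", "createdAt", "updatedAt",
}


def to_snake_case(name: str) -> str:
    """Convert camelCase to snake_case, e.g. 'createdAt' becomes 'created_at'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def map_trace_fields(trace: dict) -> dict:
    """Split a trace dict into subject, record, and preserved extras."""
    subject: dict = {}
    record: dict = {}
    extras: dict = {}
    for key, value in trace.items():
        if key == "id":
            subject["foreign_trace_id"] = value
        elif key in MAPPABLE_TRACE_FIELDS:
            record[to_snake_case(key)] = value
        else:
            extras[key] = value  # kept verbatim so nothing is silently dropped
    subject["metadata"] = {"imported_extras": extras}
    return {"subject": subject, "record": record}
```

For an input such as `{"id": "t1", "userId": "u", "bookmarked": True}`, the sketch puts `t1` under `subject.foreign_trace_id`, `u` under `record.user_id`, and `bookmarked` under `subject.metadata.imported_extras`.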

LangSmith

```python
from phionyx_eval import import_langsmith_run

result = import_langsmith_run(
    root_run_dict,
    descendants=descendant_run_dicts,  # optional; resolved via child_run_ids
)
# result.envelopes is in depth-first pre-order traversal of the run tree.
```

Mappable LangSmith fields per run:

| Foreign | Phionyx |
| --- | --- |
| `id` | `subject.foreign_trace_id` |
| `run_type` | `subject.event_type` |
| `name`, `inputs`, `outputs`, `start_time`, `end_time`, `error`, `extra`, `parent_run_id`, `child_run_ids`, `events`, `feedback` | `record.<snake_case>` |

Schema: phionyx.imported_langsmith_envelope.v1. Tree shape preserved in record.parent_run_id / record.child_run_ids so a downstream consumer can reconstruct the tree.
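
The depth-first pre-order ordering mentioned above can be illustrated with a short traversal over run dictionaries. This is a sketch, not the importer's code: `preorder_runs` is a hypothetical name, child ids that are missing from `descendants` are simply skipped, and cycles in the input are not guarded against.

```python
def preorder_runs(root: dict, descendants: list[dict]) -> list[dict]:
    """Order a run tree depth-first, parent before children, left to right."""
    by_id = {run["id"]: run for run in descendants}
    ordered: list[dict] = []
    stack = [root]
    while stack:
        run = stack.pop()
        ordered.append(run)
        # Resolve children through child_run_ids, ignoring unknown ids.
        children = [
            by_id[child_id]
            for child_id in run.get("child_run_ids") or []
            if child_id in by_id
        ]
        # Push in reverse so the first child is popped (visited) first.
        stack.extend(reversed(children))
    return ordered
```

For a root with children `a` and `b`, where `a` has a child `c`, the resulting order is root, `a`, `c`, `b`.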

Composition with the judge

The output of either importer is a list of Phionyx envelopes. The LLMAsJudge can then run over any envelope's record payload to score a specific claim (e.g. "the drafting step's output addresses the input") under an evidence-coverage rubric. This turns a third-party trace into a Phionyx-evaluable evidence record without re-running the original system.

Composing with the Phionyx Evaluation Standard

The four standard rubrics implement Phionyx's cross-domain evidence baseline. They are not the same as the assessment_signal taxonomy in the Phionyx Evaluation Standard v0.2.0 — the standard names which signal a coverage claim is interpreted against; this package names how the judge grades evidence quality on the runtime-evidence dimension. The two compose: a Compliance-Mapping row whose assessment_signal is governance_envelope.integrity.canonical_json_hash_chain can use the EVIDENCE_COVERAGE_RUBRIC to grade whether a specific claim is supported by that signal.

License

AGPL-3.0-or-later, consistent with the rest of the Phionyx open-source distribution.

Project details

Release history

This version

0.1.0a1 pre-release

May 25, 2026

Download files

Source Distribution

phionyx_eval-0.1.0a1.tar.gz (36.7 kB view details)

Uploaded May 25, 2026 Source

Built Distribution

phionyx_eval-0.1.0a1-py3-none-any.whl (32.4 kB view details)

Uploaded May 25, 2026 Python 3

File details

Details for the file phionyx_eval-0.1.0a1.tar.gz.

File metadata

Download URL: phionyx_eval-0.1.0a1.tar.gz
Upload date: May 25, 2026
Size: 36.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for phionyx_eval-0.1.0a1.tar.gz
Algorithm	Hash digest
SHA256	`8bd8de38fc9cbb3eacedfc6f7ebee749c552b18708112b65df3b5bc345cb4183`
MD5	`d31dd0d6e78468324dab365649ba566a`
BLAKE2b-256	`12ed5aa8b38aae84600842297104a52d09ddf7c2a234fded7c7058aa1002f643`

Provenance

The following attestation bundles were made for phionyx_eval-0.1.0a1.tar.gz:

Publisher: release.yml on halvrenofviryel/phionyx-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: phionyx_eval-0.1.0a1.tar.gz
- Subject digest: 8bd8de38fc9cbb3eacedfc6f7ebee749c552b18708112b65df3b5bc345cb4183
- Sigstore transparency entry: 1629602068
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: halvrenofviryel/phionyx-eval@47b0c9b224d5e38fd1ac86b7805be1d945f0d72b
- Branch / Tag: refs/tags/v0.1.0a1
- Owner: https://github.com/halvrenofviryel
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@47b0c9b224d5e38fd1ac86b7805be1d945f0d72b
- Trigger Event: push

File details

Details for the file phionyx_eval-0.1.0a1-py3-none-any.whl.

File metadata

Download URL: phionyx_eval-0.1.0a1-py3-none-any.whl
Upload date: May 25, 2026
Size: 32.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for phionyx_eval-0.1.0a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b42c285be0c156548da00b5965daab66422beaa515f593c06a0c33cb967c5b36`
MD5	`b1940e788edef67ca1659935665dd63a`
BLAKE2b-256	`b54c122f679f2bfd5977d4cc795047fd8851f700243dfa4a7a3829005f14d39d`

Provenance

The following attestation bundles were made for phionyx_eval-0.1.0a1-py3-none-any.whl:

Publisher: release.yml on halvrenofviryel/phionyx-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: phionyx_eval-0.1.0a1-py3-none-any.whl
- Subject digest: b42c285be0c156548da00b5965daab66422beaa515f593c06a0c33cb967c5b36
- Sigstore transparency entry: 1629602074
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: halvrenofviryel/phionyx-eval@47b0c9b224d5e38fd1ac86b7805be1d945f0d72b
- Branch / Tag: refs/tags/v0.1.0a1
- Owner: https://github.com/halvrenofviryel
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@47b0c9b224d5e38fd1ac86b7805be1d945f0d72b
- Trigger Event: push
