Project description

rho-eval v2.3.0: Behavioral Auditing for LLMs

Measure where language models get surprising truths wrong — then fix it.

rho-eval measures 8 behavioral dimensions — factual accuracy, toxicity, bias, sycophancy, reasoning, refusal, deception, and over-refusal — using Spearman rank correlation over teacher-forced confidence probes. It ships 1,826 probes as JSON with no internet required.

Formerly knowledge-fidelity. All v1.x imports still work.
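The headline statistic is simple to sketch in pure Python: Spearman's rho is just the Pearson correlation computed on ranks. The confidence and truth values below are made up for illustration; this is not rho-eval's internal scoring code.

```python
def ranks(xs):
    # Assign 1..n by sorted order (the toy data below has no ties,
    # so average-rank handling is omitted)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman_rho(x, y):
    # Spearman rho = Pearson correlation of the two rank vectors
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical teacher-forced log-probs of each probe's true answer
confidences = [-1.2, -0.4, -3.1, -0.9, -2.2]
# Hypothetical "how unsurprising is the truth" score per probe
truth_scores = [3.0, 4.0, 1.0, 3.5, 2.0]

print(spearman_rho(confidences, truth_scores))  # 1.0: confidence tracks truth
```

A model that is systematically less confident exactly where the truth is surprising would score a low or negative rho on such a probe set.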

What It Does

| Module | Purpose |
|---|---|
| rho-audit | Behavioral auditing across 8 dimensions via confidence probes |
| rho-interpret | SVD subspace extraction and Grassmann angle analysis |
| rho-align | Rho-Guided SFT with contrastive auxiliary loss |
| rho-steer | SAE-based disentangled behavioral steering |
| rho-bench | Fidelity-Bench 2.0: adversarial pressure testing |
| rho-surgery | End-to-end behavioral repair: diagnose, compress, LoRA SFT, verify |
| rho-benchmark | Full benchmarking (8-dim audit + TruthfulQA MC2) with comparison |

Install

```shell
pip install rho-eval                    # Core (auditing + SVD + probes)
pip install "rho-eval[full]"            # Everything including MLX
```

Or from source:

```shell
git clone https://github.com/SolomonB14D3/knowledge-fidelity
cd knowledge-fidelity
pip install -e ".[full]"
```

Quick Start

Python API

```python
import rho_eval

# Audit any model across all 8 behaviors
report = rho_eval.audit("Qwen/Qwen2.5-7B-Instruct")
print(report)
# <AuditReport model='Qwen/Qwen2.5-7B-Instruct' behaviors=8 status=WARN>

# Compare two models
baseline = rho_eval.audit("Qwen/Qwen2.5-7B-Instruct")
repaired = rho_eval.audit("./repaired-7b/model/")
delta = rho_eval.compare(repaired, baseline)
print(delta.to_table())  # Colored delta table
```

CLI

```shell
# Full behavioral report card
rho-eval Qwen/Qwen2.5-7B-Instruct --behaviors all

# Specific behaviors, JSON output
rho-eval my-model/ --behaviors factual,bias,sycophancy --format json

# Compare against a baseline
rho-eval compressed-model/ --compare baseline.json

# Adversarial pressure test
rho-bench Qwen/Qwen2.5-7B-Instruct

# One-command behavioral repair
rho-surgery Qwen/Qwen2.5-7B-Instruct -o ./repaired-7b/

# Benchmark before vs after (8-dim audit + TruthfulQA MC2)
rho-benchmark ./repaired-7b/model/ --baseline Qwen/Qwen2.5-7B-Instruct
```

Why This Exists

Language models fail where truth is surprising. Sycophancy picks the expected answer over the true one. Bias picks the stereotype over the individual. Standard SFT makes this worse in ways benchmarks don't catch.

rho-eval measures exactly where a model gets surprising truths wrong, and rho-guided SFT repairs it — without the alignment tax. See our papers for the full experimental story.

Built-In Probes (1,826 total)

All probes ship as JSON. No internet download needed.

| Behavior | Probe Sets | Count |
|---|---|---|
| Factual | default, mandela, medical, commonsense, truthfulqa, expanded | 206 |
| Bias | BBQ 300 + bridge probes + biology-grounded | 357 |
| Sycophancy | Anthropic model-written-evals | 150 |
| Toxicity | ToxiGen balanced | 200 |
| Reasoning | GSM8K + adversarial flattery | 100 |
| Deception | HH-RLHF honest/deceptive pairs | 100 |
| Refusal | harmful/benign pairs + expanded | 150 |
| Over-refusal | benign-but-edgy + expanded | 150 |
| Bench (Fidelity-Bench) | logic, social, clinical | 120 |

Run rho-eval --list-probes to see all available sets.
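Because probes are plain JSON, a custom set can be loaded with nothing but the standard library. The schema below (`prompt`/`answer` fields) and the file name are illustrative assumptions, not rho-eval's documented probe format:

```python
import json
import os
import random
import tempfile

def load_probes(path, n=None, seed=42):
    """Load a JSON probe file; optionally subsample n probes reproducibly."""
    with open(path) as f:
        probes = json.load(f)  # assumed layout: a list of probe objects
    if n is not None:
        rng = random.Random(seed)
        probes = rng.sample(probes, min(n, len(probes)))
    return probes

# Demonstrate with a tiny made-up probe set written to a temp file:
demo = [
    {"prompt": "The capital of Australia is", "answer": "Canberra"},
    {"prompt": "Humans have how many cervical vertebrae?", "answer": "7"},
]
path = os.path.join(tempfile.mkdtemp(), "my_probes.json")
with open(path, "w") as f:
    json.dump(demo, f)

print(len(load_probes(path)))       # 2
print(len(load_probes(path, n=1)))  # 1
```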

Custom Behaviors (Plugin System)

```python
import rho_eval
# BehaviorResult's import path is assumed to match ABCBehavior's
from rho_eval.behaviors import ABCBehavior, BehaviorResult, register

@register
class MyDomainBehavior(ABCBehavior):
    name = "my_domain"
    description = "Domain-specific evaluation"
    probe_type = "confidence"
    default_n = 50

    def load_probes(self, n=None, seed=42, **kwargs):
        return self._load_json_probes("my_domain/probes.json", n=n, seed=seed)

    def evaluate(self, model, tokenizer, probes, device="cpu", **kwargs):
        # Your evaluation logic
        return BehaviorResult(behavior=self.name, rho=0.7, ...)

# Now available everywhere:
report = rho_eval.audit("my-model", behaviors=["factual", "my_domain"])
```

Model Compatibility

Works on any HuggingFace causal LM with standard attention layouts.

Validated: Qwen2.5 (0.5B-32B), Mistral 7B, Llama 3.1 8B, GPT-2 (7M-210M scale ladder)

Apple Silicon (MLX)

rho-eval auto-dispatches to MLX on Apple Silicon. No code changes needed.

```shell
pip install mlx mlx-lm  # or: pip install "rho-eval[full]"
```

```python
import mlx_lm
from rho_eval import audit

model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-4bit")
report = audit(model=model, tokenizer=tokenizer, behaviors="all")
# Same API — ~5-10x faster on Apple Silicon
```
| Component | MLX Speedup |
|---|---|
| audit() — 8-behavior probe suite | ~5x |
| mlx_rho_guided_sft() — alignment training | ~10x |
| analyze_confidence() — cartography | ~5x |

Compression Safety Guide

| Layer Type | Safe to Compress | Notes |
|---|---|---|
| Q, K, O projections | Yes (70% rank) | Main target |
| V projection | 90-95% only | High risk below 90% |
| MLP layers | Never | Destroys model at any level |
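The Q/K/O recipe above amounts to truncated SVD on the projection matrices. A minimal numpy sketch, using a random stand-in matrix rather than real model weights; `truncated_svd_compress` and the keep-ratio knob are illustrative, not rho-eval API:

```python
import numpy as np

def truncated_svd_compress(W, keep_ratio=0.7):
    """Return a low-rank approximation of W keeping keep_ratio of its rank."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = max(1, int(len(S) * keep_ratio))
    # Rank-k reconstruction from the top-k singular triplets
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
W_q = rng.normal(size=(64, 64))           # stand-in for a Q projection
W_q70 = truncated_svd_compress(W_q, 0.7)  # the "safe" setting in the table
W_q30 = truncated_svd_compress(W_q, 0.3)  # far below the recommended rank

err70 = np.linalg.norm(W_q - W_q70) / np.linalg.norm(W_q)
err30 = np.linalg.norm(W_q - W_q30) / np.linalg.norm(W_q)
assert err70 < err30  # keeping more singular values preserves more structure
```

The asymmetry in the table (V and MLP fragile, Q/K/O robust) is an empirical finding about where behaviorally relevant structure lives, not a property of SVD itself.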

Limitations

  • Probe scale: 1,826 probes across 37 sets. Spearman correlation is robust to small samples, but statistical power for subtle shifts is limited.
  • Western-centric: Probes cover primarily English-language, U.S.-centric social categories.
  • 7B scale: Merge and steering results are validated on 7B models; larger scales (70B+) should not be extrapolated without verification.
  • Toxicity is unaffected by weight edits — it relies on highly distributed lexical features that structural interventions cannot modulate.

Key Findings

  • Broad truth fixes propagate to narrow ones. Repairing sycophancy via rho-guided SFT spontaneously improves bias across multiple demographic categories, contradicting the prevailing "alignment tax" assumption.
  • Behavioral capabilities emerge through sharp phase transitions. Training small language models from scratch reveals that behaviors like over-refusal appear in discrete jumps, not gradual improvement.
  • Geometry precedes emergence. Effective dimensionality expansion in weight subspaces predicts behavioral phase transitions by hundreds of training steps — the geometry reorganizes before the behavior appears.
  • Surgery concentrates, not rotates. Grassmann angle analysis of rho-guided SFT shows behavioral subspaces sharpen (effective dimension compresses) rather than rotating to new orientations.
  • Compression preserves behavioral structure when protecting the right singular values. SVD at 70% rank on Q/K/O projections retains behavioral fidelity; V and MLP layers are fragile.
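Two of the geometric quantities behind these findings can be sketched directly: principal (Grassmann) angles between weight subspaces, and effective dimension as a participation ratio over singular values. The matrices here are toy stand-ins, not real checkpoints, and the function names are illustrative:

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians) between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))  # ascending: 0 = shared direction

def effective_dimension(S):
    """Participation ratio of singular values: (sum s^2)^2 / sum s^4."""
    p = S ** 2
    return (p.sum() ** 2) / (p ** 2).sum()

A = np.eye(6)[:, :3]   # span{e1, e2, e3}
B = np.eye(6)[:, 2:5]  # span{e3, e4, e5}: shares exactly one direction with A
angles = principal_angles(A, B)
assert np.isclose(angles[0], 0.0)  # the shared direction gives a zero angle

# "Sharpening" in the findings above means spectral mass concentrating,
# which lowers the effective dimension:
flat = effective_dimension(np.array([1.0, 1.0, 1.0]))     # 3.0
sharp = effective_dimension(np.array([3.0, 0.1, 0.1]))    # close to 1
assert sharp < flat
```

Under this reading, "surgery concentrates, not rotates" means the angles stay small while the participation ratio drops.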

Full experimental details, tables, and statistical analysis are in the papers below.

Papers

  1. Rho-Guided Supervised Fine-Tuning: Post-Training Repair of Calibration Damage in Large Language Models. DOI: 10.5281/zenodo.18854944
  2. Behavioral Entanglement in Transformers: Grassmann Geometry of Rho-Guided SFT. DOI: 10.5281/zenodo.18865862
  3. Behavioral Phase Transitions in Small Language Models: Geometric Scaffolding Precedes Behavioral Emergence. DOI: 10.5281/zenodo.18865199
  4. Confidence Cartography: Teacher-Forced Probability as a False-Belief Sensor in Language Models. DOI: 10.5281/zenodo.18703506 | Repo
  5. CF90: Knowledge-Preserving SVD Compression for Large Language Models. DOI: 10.5281/zenodo.18718545 | Repo

Citation

@article{sanchez2026rhoguided,
  author = {Sanchez, Bryan},
  title = {Rho-Guided Supervised Fine-Tuning: Post-Training Repair of
           Calibration Damage in Large Language Models},
  year = {2026},
  doi = {10.5281/zenodo.18854944},
  url = {https://doi.org/10.5281/zenodo.18854944}
}

@article{sanchez2026grassmann,
  author = {Sanchez, Bryan},
  title = {Behavioral Entanglement in Transformers: Grassmann Geometry
           of Rho-Guided Supervised Fine-Tuning},
  year = {2026},
  doi = {10.5281/zenodo.18865862},
  url = {https://doi.org/10.5281/zenodo.18865862}
}

@article{sanchez2026phasetransitions,
  author = {Sanchez, Bryan},
  title = {Behavioral Phase Transitions in Small Language Models:
           Geometric Scaffolding Precedes Behavioral Emergence},
  year = {2026},
  doi = {10.5281/zenodo.18865199},
  url = {https://doi.org/10.5281/zenodo.18865199}
}

@article{sanchez2026cartography,
  author = {Sanchez, Bryan},
  title = {Confidence Cartography: Teacher-Forced Probability as a
           False-Belief Sensor in Language Models},
  year = {2026},
  doi = {10.5281/zenodo.18703506},
  url = {https://doi.org/10.5281/zenodo.18703506}
}

@software{sanchez2026cf90,
  author = {Sanchez, Bryan},
  title = {CF90: Knowledge-Preserving SVD Compression for Large
           Language Models},
  year = {2026},
  doi = {10.5281/zenodo.18718545},
  url = {https://doi.org/10.5281/zenodo.18718545}
}

@software{sanchez2026rhoeval,
  author = {Sanchez, Bryan},
  title = {rho-eval: Behavioral Auditing for Large Language Models},
  year = {2026},
  doi = {10.5281/zenodo.18743959},
  url = {https://doi.org/10.5281/zenodo.18743959}
}

Contributing

PRs welcome for new probes, model support, or bug fixes. See open issues.

License

MIT
