Developer-first fairness regression testing for LLM applications.
fairtrace
fairtrace is a compact fairness regression library for LLM agents and RAG
pipelines.
It measures counterfactual disparity in the parts of an app that output-only evals miss:
- tool use parity
- retrieval exposure gaps
- plan length gaps
- escalation parity
- friction point gaps
- escalation reason parity
Text metrics stay in the package, but they are supporting signals rather than the main story.
Why fairtrace?
Output parity is not enough for agentic systems.
- Two requests can get the same final answer while one path uses more tools, retrieves worse-ranked documents, escalates more often, or adds more friction.
- Those process differences affect user effort, access, and service quality.
fairtrace turns those differences into CI checks.
Quick Start
Install in editable mode:
python -m pip install -e .
Run the test suite:
python -m unittest discover -s tests -t .
Run the bundled smoke example:
python -m fairtrace.cli run examples/launch_smoke.json --app examples.launch_smoke_app:respond --output /tmp/fairtrace-smoke
Run the bundled text-fairness demo:
python -m fairtrace.cli run examples/fairness.json --app examples.simple_app:respond --output /tmp/fairtrace-report
Generate a starter suite:
fairtrace init --output fairtrace.json
Smoke Example
examples/launch_smoke.json is the public smoke path used by CI.
It exercises the CLI, report writers, and trace metrics with a stable app so
the repo has a clean end-to-end check that passes on a fresh install.
The other examples/ suites remain useful as regression demos because they
show how the metrics fail when the app behaves asymmetrically.
Trace Schema
fairtrace validates a small trace schema before metrics read it.
Supported trace fields:
- tool_calls: list of objects with a non-empty name
- retrieved_documents: list of objects with a group and optional rank
- plan_steps: list of non-empty strings
- escalated: boolean
- escalation_reason: non-empty string
- friction_points: list of non-empty strings
Accepted aliases:
- toolCalls -> tool_calls
- retrievedDocuments -> retrieved_documents
- planSteps -> plan_steps
- escalationReason -> escalation_reason
- frictionPoints -> friction_points
Example app response metadata:
{
  "metadata": {
    "helpfulness_score": 0.8,
    "toxicity_score": 0.0,
    "trace": {
      "tool_calls": [{ "name": "kb_search" }],
      "retrieved_documents": [
        { "group": "policy_docs", "rank": 1 },
        { "group": "support_docs", "rank": 3 }
      ],
      "plan_steps": ["search", "summarize", "respond"],
      "escalated": false,
      "friction_points": ["extra identity check"]
    }
  }
}
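The schema and aliases above can be sketched in a few lines of Python. This is an illustration only: normalize_trace and validate_trace are hypothetical helpers, not fairtrace functions, and the real validator may report errors differently or reject fields this sketch lets through.

```python
# Hypothetical helpers that mirror the documented trace schema.
# They are NOT part of fairtrace's API.
ALIASES = {
    "toolCalls": "tool_calls",
    "retrievedDocuments": "retrieved_documents",
    "planSteps": "plan_steps",
    "escalationReason": "escalation_reason",
    "frictionPoints": "friction_points",
}


def _non_empty_str(value) -> bool:
    return isinstance(value, str) and value.strip() != ""


def normalize_trace(trace: dict) -> dict:
    """Rename camelCase aliases to their snake_case field names."""
    return {ALIASES.get(key, key): value for key, value in trace.items()}


def validate_trace(trace: dict) -> list[str]:
    """Return a list of human-readable problems (empty when the trace is valid)."""
    errors = []
    for call in trace.get("tool_calls", []):
        if not (isinstance(call, dict) and _non_empty_str(call.get("name"))):
            errors.append("tool_calls entries need a non-empty name")
    for doc in trace.get("retrieved_documents", []):
        if not (isinstance(doc, dict) and "group" in doc):
            errors.append("retrieved_documents entries need a group")
    for field in ("plan_steps", "friction_points"):
        for item in trace.get(field, []):
            if not _non_empty_str(item):
                errors.append(f"{field} entries must be non-empty strings")
    if "escalated" in trace and not isinstance(trace["escalated"], bool):
        errors.append("escalated must be a boolean")
    if "escalation_reason" in trace and not _non_empty_str(trace["escalation_reason"]):
        errors.append("escalation_reason must be a non-empty string")
    return errors
```

Normalizing before validating means an app can emit either spelling and still be checked against one set of rules.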
Config Shape
{
  "dataset": {
    "prompts": [
      {
        "id": "support-password-reset",
        "prompt": "Help a {region} customer reset a password",
        "attributes": {
          "region": ["consumer", "enterprise"]
        }
      }
    ]
  },
  "metrics": [
    { "type": "tool_use_parity", "threshold": 0.1 },
    { "type": "retrieval_exposure_gap", "threshold": 0.1 },
    { "type": "plan_length_gap", "threshold": 1.0 },
    { "type": "escalation_parity", "threshold": 0.1 },
    { "type": "friction_point_gap", "threshold": 1.0 },
    { "type": "escalation_reason_parity", "threshold": 0.1 }
  ]
}
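To make the template expansion in this config concrete, here is a minimal sketch of how one prompt entry could fan out into counterfactual variants. The expand_prompt function and the group_id and assignments keys of its output are assumptions for illustration; fairtrace's internal record shape may differ.

```python
from itertools import product


def expand_prompt(entry: dict) -> list[dict]:
    """Build one variant per combination of attribute values (hypothetical helper)."""
    attributes = entry.get("attributes", {})
    names = list(attributes)
    variants = []
    for values in product(*(attributes[name] for name in names)):
        assignments = dict(zip(names, values))
        variants.append(
            {
                "group_id": entry["id"],  # ties counterfactual siblings together
                "prompt": entry["prompt"].format(**assignments),
                "assignments": assignments,
            }
        )
    return variants


entry = {
    "id": "support-password-reset",
    "prompt": "Help a {region} customer reset a password",
    "attributes": {"region": ["consumer", "enterprise"]},
}
variants = expand_prompt(entry)
# Two variants that differ only in the substituted region value.
```

Because the variants share everything except the substituted attribute, any difference in the app's behavior between them can be attributed to that attribute.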
- helpfulness_gap reads response_metadata.helpfulness_score when present.
- toxicity_gap reads response_metadata.toxicity_score when present; otherwise it falls back to a small built-in heuristic list.
- tool_use_parity reads response_metadata.trace.tool_calls and compares tool use rates across groups.
- retrieval_exposure_gap reads response_metadata.trace.retrieved_documents and compares ranking exposure across document groups.
- plan_length_gap reads response_metadata.trace.plan_steps and compares average plan length across groups.
- escalation_parity reads response_metadata.trace.escalated and compares escalation rates across groups.
- friction_point_gap reads response_metadata.trace.friction_points and compares extra friction counts across groups.
- escalation_reason_parity reads response_metadata.trace.escalation_reason and compares escalation reasons across groups.
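As a rough illustration of what "compares across groups" can mean, the sketch below averages a per-record value within each attribute group and reports the largest difference between group averages. The exact formulas fairtrace uses are not documented here, so the max-minus-min gap, the helper names, and the record layout are all assumptions.

```python
from collections import defaultdict


def group_means(records, value_fn, attribute):
    """Average value_fn(record) within each value of the varied attribute."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["assignments"][attribute]].append(float(value_fn(record)))
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


def max_gap(means):
    """Largest difference between any two group averages."""
    return max(means.values()) - min(means.values())


def escalated(record):
    return record["response_metadata"]["trace"].get("escalated", False)


def plan_length(record):
    return len(record["response_metadata"]["trace"].get("plan_steps", []))


def make_record(region, escalated_flag, steps):
    # Assumed record layout, for this example only.
    return {
        "assignments": {"region": region},
        "response_metadata": {"trace": {"escalated": escalated_flag, "plan_steps": steps}},
    }


records = [
    make_record("consumer", True, ["search", "verify", "respond"]),
    make_record("consumer", False, ["search", "verify", "respond"]),
    make_record("enterprise", False, ["search", "respond"]),
    make_record("enterprise", False, ["search", "respond"]),
]

escalation_gap = max_gap(group_means(records, escalated, "region"))  # 0.5
plan_gap = max_gap(group_means(records, plan_length, "region"))      # 1.0
```

In this toy data, consumer requests escalate half the time and take one extra plan step, even though nothing about the final answer is inspected.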
Metric scores are effect-size estimates. Bootstrap intervals are descriptive, and optional permutation p-values are there to flag regressions, not to replace a full statistical study.
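For readers unfamiliar with these two tools, the following sketch shows a percentile bootstrap interval and a permutation p-value for a two-group mean gap, using only the standard library. It illustrates the general techniques; the function names, defaults, and resampling scheme are assumptions and not necessarily what fairtrace does.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def gap(a, b):
    """Absolute difference between two group means."""
    return abs(mean(a) - mean(b))


def bootstrap_interval(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the gap, resampling each group with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        gap(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Share of random relabelings whose gap is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = gap(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if gap(pooled[: len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

The interval describes how much the gap moves under resampling, while the p-value asks how often a gap this large appears when group labels are shuffled. With the small sample sizes typical of a regression suite, both are best read as warning lights.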
Trace metric rationale: docs/trace_fairness.md
refusal_gap, helpfulness_gap, and toxicity_gap also accept explicit
evaluator hooks. If you do not provide one, they fall back to the built-in
heuristics and mark that in the metric details.
Evaluator Hooks
You can point a metric at a module:function hook in suite config.
{
"metrics": [
{
"type": "toxicity_gap",
"threshold": 0.1,
"toxicity_evaluator": "examples.evaluator_hooks:toxicity_score"
}
]
}
Example hook shapes:
def toxicity_score(response: str, record: dict) -> float:
    return 0.0 if "safe" in response.lower() else 1.0


def refusal_detected(response: str, record: dict) -> bool:
    return record["assignments"].get("region") == "restricted"


def helpfulness_score(response: str, record: dict) -> float:
    return 0.9 if "help" in response.lower() else 0.2
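A module:function string such as the one in the config above is typically resolved with importlib. The sketch below shows one way to do that; load_hook is a hypothetical name, and fairtrace's own loader may validate or report errors differently.

```python
import importlib


def load_hook(spec: str):
    """Resolve a 'package.module:function' string to a callable (hypothetical helper)."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    hook = getattr(module, attr)
    if not callable(hook):
        raise TypeError(f"{spec!r} does not point at a callable")
    return hook


# Standard-library targets are used here only so the example runs anywhere.
sqrt = load_hook("math:sqrt")
basename = load_hook("os.path:basename")
```

Resolving the string once at suite load time surfaces a misspelled module or function before any prompts are sent to the app.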
Adapters
- CallableAdapter for plain Python functions
- OpenAICompatibleAdapter for clients exposing client.responses.create(...)
- OpenAIAgentsAdapter for agent objects exposing run(...)
- LangChainAdapter for objects exposing invoke(...)
- LangGraphAdapter for graph state objects exposing invoke(...)
Each adapter can take a trace_mapper callback when the source app emits a
different trace shape.
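A trace mapper is, in essence, a function from the app's native payload to the trace fields fairtrace validates. The sketch below assumes a made-up source shape (steps, hits, handoff) and a made-up function name; the callback signature an adapter actually expects is not specified here, so treat this as an illustration of the mapping rather than of the adapter API.

```python
def map_vendor_trace(raw: dict) -> dict:
    """Convert a hypothetical app payload into fairtrace's trace field names."""
    steps = raw.get("steps", [])
    return {
        # Each step that invoked a tool becomes a tool_calls entry.
        "tool_calls": [{"name": step["tool"]} for step in steps if step.get("tool")],
        "plan_steps": [step["label"] for step in steps if step.get("label")],
        # Retrieval hits are assumed to arrive already sorted, best first.
        "retrieved_documents": [
            {"group": hit["source"], "rank": position}
            for position, hit in enumerate(raw.get("hits", []), start=1)
        ],
        "escalated": bool(raw.get("handoff")),
    }


raw = {
    "steps": [
        {"label": "search", "tool": "kb_search"},
        {"label": "respond"},
    ],
    "hits": [{"source": "policy_docs"}, {"source": "support_docs"}],
    "handoff": None,
}
trace = map_vendor_trace(raw)
```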
Import helpers:
- load_promptfoo_variants(...)
- load_deepeval_variants(...)
- assert_fairtrace_passes(...)
CI wiring example:
Compare two runs:
python -m fairtrace.cli compare baseline.json candidate.json --format markdown
You can also point compare at two report directories and it will read each
directory's report.json.
Validation Rules
Suite files are rejected early when they contain:
- unknown top-level fields
- unknown dataset or metric fields
- duplicate prompt ids
- empty prompt lists or metric lists
- prompt placeholders that do not match defined attributes
- unsupported metric types
Dataset files may use either:
- dataset.prompts for template expansion
- dataset.variants for explicit imported cases
Never use both in the same suite file.
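Several of these rules are easy to picture in code. The sketch below checks a subset of them (prompts and variants together, empty lists, duplicate ids, and placeholders with no matching attribute). The check_suite function is hypothetical; it collects messages rather than raising, and it only flags placeholders that lack an attribute, which may be looser than fairtrace's own matching rule.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Names inside {braces} in a prompt template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_suite(suite: dict) -> list[str]:
    """Return validation messages for a suite dict (hypothetical helper)."""
    errors = []
    dataset = suite.get("dataset", {})
    if "prompts" in dataset and "variants" in dataset:
        errors.append("dataset may define prompts or variants, not both")
    prompts = dataset.get("prompts", [])
    if "prompts" in dataset and not prompts:
        errors.append("prompt list is empty")
    if not suite.get("metrics"):
        errors.append("metric list is empty")

    seen_ids = set()
    for entry in prompts:
        if entry["id"] in seen_ids:
            errors.append(f"duplicate prompt id: {entry['id']}")
        seen_ids.add(entry["id"])
        undeclared = placeholders(entry["prompt"]) - set(entry.get("attributes", {}))
        if undeclared:
            errors.append(f"{entry['id']}: no attribute for {sorted(undeclared)}")
    return errors
```

Rejecting these cases before any model call keeps a typo in a placeholder from silently producing identical prompts for every group.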
External Imports
To reuse cases from an external eval tool, first import them as explicit variants.
Promptfoo importer accepts:
{
  "tests": [
    {
      "id": "case-1",
      "group_id": "seed-1",
      "prompt": "hello",
      "vars": { "gender": "woman" }
    }
  ]
}
DeepEval importer accepts:
{
  "cases": [
    {
      "id": "case-1",
      "input": "hello",
      "metadata": {
        "seed_id": "seed-1",
        "assignments": { "gender": "man" }
      }
    }
  ]
}
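The two input shapes carry the same information under different keys. As a sketch of what importing to explicit variants might look like, the functions below map both onto one record layout. The function names, the output keys, and the choice to treat DeepEval's seed_id as the grouping id are assumptions; the bundled load_promptfoo_variants and load_deepeval_variants helpers may return something different.

```python
def from_promptfoo(payload: dict) -> list[dict]:
    """Map Promptfoo-style tests onto a common variant record (hypothetical)."""
    return [
        {
            "id": test["id"],
            "group_id": test["group_id"],
            "prompt": test["prompt"],
            "assignments": dict(test.get("vars", {})),
        }
        for test in payload["tests"]
    ]


def from_deepeval(payload: dict) -> list[dict]:
    """Map DeepEval-style cases onto the same record (hypothetical)."""
    variants = []
    for case in payload["cases"]:
        metadata = case.get("metadata", {})
        variants.append(
            {
                "id": case["id"],
                "group_id": metadata["seed_id"],  # assumed equivalent of group_id
                "prompt": case["input"],
                "assignments": dict(metadata.get("assignments", {})),
            }
        )
    return variants
```

Once both sources share a layout, variants with the same group_id can be compared as counterfactual siblings regardless of which tool produced them.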