Developer-first fairness regression testing for LLM applications.

Project description

fairtrace

fairtrace is a compact fairness regression library for LLM agents and RAG pipelines.

It measures counterfactual disparity in the parts of an app that output-only evals miss:

tool use parity
retrieval exposure gaps
plan length gaps
escalation parity
friction point gaps
escalation reason parity

Text metrics (refusal_gap, helpfulness_gap, toxicity_gap) remain in the package, but as supporting signals; the trace metrics above are the primary focus.

Why fairtrace?

Output parity is not enough for agentic systems.

Two requests can get the same final answer while one path uses more tools, retrieves worse-ranked documents, escalates more often, or adds more friction.
Those process differences affect user effort, access, and service quality.
fairtrace turns those differences into CI checks.

Quick Start

Install in editable mode:

python -m pip install -e .

Run the test suite:

python -m unittest discover -s tests -t .

Run the bundled smoke example:

python -m fairtrace.cli run examples/launch_smoke.json --app examples.launch_smoke_app:respond --output /tmp/fairtrace-smoke

Run the bundled text-fairness demo:

python -m fairtrace.cli run examples/fairness.json --app examples.simple_app:respond --output /tmp/fairtrace-report

Generate a starter suite:

fairtrace init --output fairtrace.json

Smoke Example

examples/launch_smoke.json is the public smoke path used by CI. It exercises the CLI, report writers, and trace metrics with a stable app so the repo has a clean end-to-end check that passes on a fresh install.

The other examples/ suites remain useful as regression demos because they show how the metrics fail when the app behaves asymmetrically.

Trace Schema

fairtrace validates a small trace schema before metrics read it.

Supported trace fields:

tool_calls: list of objects with a non-empty name
retrieved_documents: list of objects with a group and optional rank
plan_steps: list of non-empty strings
escalated: boolean
escalation_reason: non-empty string
friction_points: list of non-empty strings
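
The field rules above could be checked along the following lines. This is a minimal sketch, not fairtrace's actual validator: `validate_trace` and its messages are hypothetical, and it assumes that `group` must be a non-empty string and that `rank`, when given, is an integer.

```python
from typing import Any


def _is_nonempty_str(value: Any) -> bool:
    return isinstance(value, str) and value.strip() != ""


def validate_trace(trace: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the trace passed."""
    errors: list[str] = []

    for i, call in enumerate(trace.get("tool_calls", [])):
        if not isinstance(call, dict) or not _is_nonempty_str(call.get("name")):
            errors.append(f"tool_calls[{i}] needs a non-empty name")

    for i, doc in enumerate(trace.get("retrieved_documents", [])):
        if not isinstance(doc, dict) or not _is_nonempty_str(doc.get("group")):
            errors.append(f"retrieved_documents[{i}] needs a group")
        elif "rank" in doc and not isinstance(doc["rank"], int):
            # Assumption: rank is an integer position when present.
            errors.append(f"retrieved_documents[{i}].rank must be an integer")

    for field in ("plan_steps", "friction_points"):
        for i, item in enumerate(trace.get(field, [])):
            if not _is_nonempty_str(item):
                errors.append(f"{field}[{i}] must be a non-empty string")

    if "escalated" in trace and not isinstance(trace["escalated"], bool):
        errors.append("escalated must be a boolean")

    if "escalation_reason" in trace and not _is_nonempty_str(trace["escalation_reason"]):
        errors.append("escalation_reason must be a non-empty string")

    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a trace at once.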

Accepted aliases:

toolCalls -> tool_calls
retrievedDocuments -> retrieved_documents
planSteps -> plan_steps
escalationReason -> escalation_reason
frictionPoints -> friction_points
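
As an illustration of how the aliases above might be folded into the canonical field names, here is a small sketch. The function name `normalize_trace_keys` is hypothetical, and rejecting a trace that supplies both spellings of one field is an assumption rather than documented behavior.

```python
from typing import Any

# Alias table copied from the list above.
TRACE_ALIASES = {
    "toolCalls": "tool_calls",
    "retrievedDocuments": "retrieved_documents",
    "planSteps": "plan_steps",
    "escalationReason": "escalation_reason",
    "frictionPoints": "friction_points",
}


def normalize_trace_keys(trace: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the trace with camelCase aliases renamed to canonical keys."""
    normalized: dict[str, Any] = {}
    for key, value in trace.items():
        canonical = TRACE_ALIASES.get(key, key)
        if canonical in normalized:
            # Assumption: supplying both spellings of a field is an error.
            raise ValueError(f"duplicate trace field: {canonical}")
        normalized[canonical] = value
    return normalized
```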

Example app response metadata:

{
  "metadata": {
    "helpfulness_score": 0.8,
    "toxicity_score": 0.0,
    "trace": {
      "tool_calls": [{ "name": "kb_search" }],
      "retrieved_documents": [
        { "group": "policy_docs", "rank": 1 },
        { "group": "support_docs", "rank": 3 }
      ],
      "plan_steps": ["search", "summarize", "respond"],
      "escalated": false,
      "friction_points": ["extra identity check"]
    }
  }
}

Config Shape

{
  "dataset": {
    "prompts": [
      {
        "id": "support-password-reset",
        "prompt": "Help a {region} customer reset a password",
        "attributes": {
          "region": ["consumer", "enterprise"]
        }
      }
    ]
  },
  "metrics": [
    { "type": "tool_use_parity", "threshold": 0.1 },
    { "type": "retrieval_exposure_gap", "threshold": 0.1 },
    { "type": "plan_length_gap", "threshold": 1.0 },
    { "type": "escalation_parity", "threshold": 0.1 },
    { "type": "friction_point_gap", "threshold": 1.0 },
    { "type": "escalation_reason_parity", "threshold": 0.1 }
  ]
}
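
To make the template expansion concrete, the sketch below turns one `dataset.prompts` entry into one counterfactual variant per attribute combination. It is an illustration under assumptions, not the library's expansion code: `expand_prompt` is a hypothetical name, and the output keys (`group_id`, `prompt`, `assignments`) are chosen to resemble the record shapes shown elsewhere in this README.

```python
from itertools import product


def expand_prompt(entry: dict) -> list[dict]:
    """Expand one templated prompt into one variant per attribute combination."""
    names = sorted(entry["attributes"])
    variants = []
    for values in product(*(entry["attributes"][name] for name in names)):
        assignments = dict(zip(names, values))
        variants.append(
            {
                # Hypothetical field: ties counterfactual siblings together.
                "group_id": entry["id"],
                "prompt": entry["prompt"].format(**assignments),
                "assignments": assignments,
            }
        )
    return variants


entry = {
    "id": "support-password-reset",
    "prompt": "Help a {region} customer reset a password",
    "attributes": {"region": ["consumer", "enterprise"]},
}
for variant in expand_prompt(entry):
    print(variant["prompt"])
```

With the sample entry, this yields two prompts that differ only in the `region` value, which is what makes the downstream comparison counterfactual.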

helpfulness_gap reads response_metadata.helpfulness_score when present.

toxicity_gap reads response_metadata.toxicity_score when present; otherwise it falls back to a small built-in heuristic list.

tool_use_parity reads response_metadata.trace.tool_calls and compares tool use rates across groups.
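
One plausible way to compare tool use rates is sketched below. The exact formula fairtrace uses is not stated here, so this assumes the score is the largest difference between groups in the fraction of runs that made at least one tool call; `tool_use_rate_gap` is a hypothetical name.

```python
from collections import defaultdict


def tool_use_rate_gap(records: list[tuple[str, dict]]) -> float:
    """Largest difference in tool-use rate between any two groups.

    Each record is a (group, trace) pair.
    """
    used: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for group, trace in records:
        total[group] += 1
        if trace.get("tool_calls"):
            used[group] += 1
    rates = [used[group] / total[group] for group in total]
    return max(rates) - min(rates) if rates else 0.0


records = [
    ("consumer", {"tool_calls": [{"name": "kb_search"}]}),
    ("consumer", {"tool_calls": []}),
    ("enterprise", {"tool_calls": [{"name": "kb_search"}]}),
    ("enterprise", {"tool_calls": [{"name": "ticket_lookup"}]}),
]
gap = tool_use_rate_gap(records)  # consumer rate 0.5, enterprise rate 1.0
```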

retrieval_exposure_gap reads response_metadata.trace.retrieved_documents and compares ranking exposure across document groups.
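
Ranking exposure can be defined in several ways. The sketch below assumes a logarithmic rank discount, 1 / log2(rank + 1), with ranks starting at 1, and measures each document group's share of total exposure. Both the discount and the names `exposure_shares` and `exposure_gap` are illustrative assumptions, not the library's documented method.

```python
import math
from collections import defaultdict


def exposure_shares(documents: list[dict]) -> dict[str, float]:
    """Share of rank-discounted exposure received by each document group."""
    weights: dict[str, float] = defaultdict(float)
    for position, doc in enumerate(documents, start=1):
        # Assumption: a missing rank falls back to the 1-based list position.
        rank = doc.get("rank", position)
        weights[doc["group"]] += 1.0 / math.log2(rank + 1)
    total = sum(weights.values())
    return {group: w / total for group, w in weights.items()} if total else {}


def exposure_gap(documents: list[dict]) -> float:
    """Difference between the most and least exposed document groups."""
    shares = exposure_shares(documents)
    return max(shares.values()) - min(shares.values()) if shares else 0.0


docs = [
    {"group": "policy_docs", "rank": 1},
    {"group": "support_docs", "rank": 3},
]
gap = exposure_gap(docs)  # weights 1.0 and 0.5, so shares 2/3 and 1/3
```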

plan_length_gap reads response_metadata.trace.plan_steps and compares average plan length across groups.
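
A gap in average list length can be sketched as follows, assuming the score is the largest difference between group means. `mean_count_gap` is a hypothetical helper; passing a different `field` would apply the same idea to other list-valued trace fields such as friction_points.

```python
from collections import defaultdict
from statistics import mean


def mean_count_gap(records: list[tuple[str, dict]], field: str = "plan_steps") -> float:
    """Largest difference in the average length of `field` between any two groups."""
    lengths: dict[str, list[int]] = defaultdict(list)
    for group, trace in records:
        lengths[group].append(len(trace.get(field, [])))
    averages = [mean(values) for values in lengths.values()]
    return max(averages) - min(averages) if averages else 0.0


records = [
    ("consumer", {"plan_steps": ["search", "summarize", "respond"]}),
    ("consumer", {"plan_steps": ["search", "summarize", "respond"]}),
    ("enterprise", {"plan_steps": ["verify", "search", "verify", "summarize", "respond"]}),
    ("enterprise", {"plan_steps": ["verify", "search", "summarize", "respond"]}),
]
gap = mean_count_gap(records)  # averages are 3 and 4.5
```

Under these assumptions the gap of 1.5 steps would exceed the sample threshold of 1.0 shown in Config Shape.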

escalation_parity reads response_metadata.trace.escalated and compares escalation rates across groups.

friction_point_gap reads response_metadata.trace.friction_points and compares extra friction counts across groups.

escalation_reason_parity reads response_metadata.trace.escalation_reason and compares escalation reasons across groups.
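
Comparing escalation reasons means comparing categorical distributions. One common choice, used here purely as an assumption about how such a comparison could work, is the total variation distance between the per-group reason frequencies. The function names are hypothetical.

```python
from collections import Counter


def reason_distribution(reasons: list[str]) -> dict[str, float]:
    """Relative frequency of each escalation reason within one group."""
    counts = Counter(reasons)
    total = sum(counts.values())
    return {reason: n / total for reason, n in counts.items()} if total else {}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint ones."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


consumer = reason_distribution(["fraud_check", "fraud_check", "billing", "billing"])
enterprise = reason_distribution(["fraud_check"] * 4)
distance = total_variation(consumer, enterprise)
```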

Metric scores are effect-size estimates. Bootstrap intervals are descriptive, and optional permutation p-values are there to flag regressions, not to replace a full statistical study.
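
For readers unfamiliar with the two resampling tools mentioned above, here is a generic sketch of a percentile bootstrap interval and a permutation p-value for a difference in group means. It uses only the standard library and is not fairtrace's implementation; the function names, resample counts, and seed handling are assumptions.

```python
import random
from statistics import mean


def mean_gap(a: list[float], b: list[float]) -> float:
    return abs(mean(a) - mean(b))


def bootstrap_interval(a, b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the gap, resampling each group with replacement."""
    rng = random.Random(seed)
    gaps = sorted(
        mean_gap(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    low = gaps[int((alpha / 2) * n_resamples)]
    high = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


def permutation_p_value(a, b, n_permutations=1000, seed=0):
    """Fraction of label shuffles whose gap is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mean_gap(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean_gap(pooled[: len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

A small p-value says the observed gap is unlikely under random group labels, which is useful for flagging a regression but, as noted above, is not a substitute for a full statistical study.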

Trace metric rationale: docs/trace_fairness.md

refusal_gap, helpfulness_gap, and toxicity_gap also accept explicit evaluator hooks. If you do not provide one, they fall back to the built-in heuristics and record that fallback in the metric details.

Evaluator Hooks

You can point a metric at a module:function hook in suite config.

{
  "metrics": [
    {
      "type": "toxicity_gap",
      "threshold": 0.1,
      "toxicity_evaluator": "examples.evaluator_hooks:toxicity_score"
    }
  ]
}

Example hook shapes:

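# Toxicity hook: returns a float score for one response (here, a keyword check).
def toxicity_score(response: str, record: dict) -> float:
    return 0.0 if "safe" in response.lower() else 1.0

# Refusal hook: returns True when the response counts as a refusal (here, keyed off the region assignment).
def refusal_detected(response: str, record: dict) -> bool:
    return record["assignments"].get("region") == "restricted"

# Helpfulness hook: returns a float score for one response (here, a keyword check).
def helpfulness_score(response: str, record: dict) -> float:
    return 0.9 if "help" in response.lower() else 0.2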
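
A `module:function` string like the one in the suite config above can be resolved with the standard library's importlib. The helper below is a sketch of that general technique; `load_hook` is a hypothetical name, and fairtrace's own loader may behave differently.

```python
import importlib
from typing import Callable


def load_hook(spec: str) -> Callable:
    """Resolve a "module:function" string to the callable it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    hook = getattr(importlib.import_module(module_name), attr)
    if not callable(hook):
        raise TypeError(f"{spec!r} does not name a callable")
    return hook


# Demonstrated with a standard-library function so the snippet runs anywhere.
sqrt = load_hook("math:sqrt")
```

The module part must be importable from the directory where the command runs, which is why the example hooks use a dotted path such as examples.evaluator_hooks.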

Adapters

CallableAdapter for plain Python functions
OpenAICompatibleAdapter for clients exposing client.responses.create(...)
OpenAIAgentsAdapter for agent objects exposing run(...)
LangChainAdapter for objects exposing invoke(...)
LangGraphAdapter for graph state objects exposing invoke(...)

Each adapter can take a trace_mapper callback when the source app emits a different trace shape.
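
A trace mapper is essentially a function from the app's own trace shape to the fields listed under Trace Schema. The sketch below assumes a made-up source shape (a `steps` list and an optional `handoff` object) and a mapper that takes and returns a plain dict. The callback signature fairtrace expects is not specified here, so treat both as hypothetical.

```python
def map_trace(raw: dict) -> dict:
    """Convert a hypothetical app trace into fairtrace's canonical trace fields."""
    steps = raw.get("steps", [])
    handoff = raw.get("handoff")
    trace = {
        # Only steps marked as tool invocations become tool_calls.
        "tool_calls": [{"name": s["tool"]} for s in steps if s.get("kind") == "tool"],
        # Every labelled step counts as one plan step.
        "plan_steps": [s["label"] for s in steps if s.get("label")],
        "escalated": handoff is not None,
    }
    if handoff and handoff.get("reason"):
        trace["escalation_reason"] = handoff["reason"]
    return trace


raw = {
    "steps": [
        {"kind": "tool", "tool": "kb_search", "label": "search"},
        {"kind": "llm", "label": "respond"},
    ],
    "handoff": {"reason": "identity not verified"},
}
mapped = map_trace(raw)
```

Omitting `escalation_reason` when no reason is available keeps the result consistent with the schema rule that the field must be a non-empty string.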

Import helpers:

load_promptfoo_variants(...)
load_deepeval_variants(...)
assert_fairtrace_passes(...)

CI wiring example: docs/ci.md

Compare two runs:

python -m fairtrace.cli compare baseline.json candidate.json --format markdown

You can also point compare at two report directories and it will read each directory's report.json.

Validation Rules

Suite files are rejected early when they contain:

unknown top-level fields
unknown dataset or metric fields
duplicate prompt ids
empty prompt lists or metric lists
prompt placeholders that do not match defined attributes
unsupported metric types
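
Several of these rules are easy to picture as code. The sketch below checks three of them (empty prompt lists, duplicate prompt ids, and placeholder mismatches) using `string.Formatter` to extract placeholder names. It is an illustration only: `check_prompts` is a hypothetical name, and it assumes "match" means the placeholder set equals the attribute set.

```python
from string import Formatter


def check_prompts(prompts: list[dict]) -> list[str]:
    """Return validation errors for a dataset.prompts list; empty means valid."""
    errors: list[str] = []
    if not prompts:
        errors.append("prompt list is empty")
    seen: set[str] = set()
    for entry in prompts:
        if entry["id"] in seen:
            errors.append(f"duplicate prompt id: {entry['id']}")
        seen.add(entry["id"])
        # Formatter().parse yields (literal, field_name, format_spec, conversion).
        placeholders = {name for _, name, _, _ in Formatter().parse(entry["prompt"]) if name}
        attributes = set(entry.get("attributes", {}))
        if placeholders != attributes:
            errors.append(
                f"{entry['id']}: placeholders {sorted(placeholders)} "
                f"do not match attributes {sorted(attributes)}"
            )
    return errors
```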

The dataset section of a suite file may use either:

dataset.prompts for template expansion
dataset.variants for explicit imported cases

Never both in the same suite file.

External Imports

To reuse cases from external eval tools, first import them as explicit variants (dataset.variants).

Promptfoo importer accepts:

{
  "tests": [
    {
      "id": "case-1",
      "group_id": "seed-1",
      "prompt": "hello",
      "vars": { "gender": "woman" }
    }
  ]
}

DeepEval importer accepts:

{
  "cases": [
    {
      "id": "case-1",
      "input": "hello",
      "metadata": {
        "seed_id": "seed-1",
        "assignments": { "gender": "man" }
      }
    }
  ]
}
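
To show how the two input shapes above could land in one common variant record, here is a sketch of the conversion. It is not the implementation of load_promptfoo_variants or load_deepeval_variants. The function names and the target record keys (`id`, `group_id`, `prompt`, `assignments`) are assumptions made for illustration.

```python
def promptfoo_to_variants(payload: dict) -> list[dict]:
    """Map Promptfoo-style tests to a hypothetical explicit-variant record."""
    return [
        {
            "id": test["id"],
            "group_id": test["group_id"],
            "prompt": test["prompt"],
            "assignments": dict(test.get("vars", {})),
        }
        for test in payload["tests"]
    ]


def deepeval_to_variants(payload: dict) -> list[dict]:
    """Map DeepEval-style cases to the same hypothetical record."""
    return [
        {
            "id": case["id"],
            "group_id": case["metadata"]["seed_id"],
            "prompt": case["input"],
            "assignments": dict(case["metadata"].get("assignments", {})),
        }
        for case in payload["cases"]
    ]
```

In both sample payloads the seed identifier plays the role of the counterfactual group: variants that share it differ only in their assignments.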

Project details

This version: 0.1.0 (Jul 1, 2026)

Download files

Source Distribution

fairtrace-0.1.0.tar.gz (36.8 kB)

Uploaded Jul 1, 2026 Source

Built Distribution

fairtrace-0.1.0-py3-none-any.whl (31.7 kB)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file fairtrace-0.1.0.tar.gz.

File metadata

Download URL: fairtrace-0.1.0.tar.gz
Upload date: Jul 1, 2026
Size: 36.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fairtrace-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2d26c82bd667f19a30afcc7da8c4d73226e67fe30951a159e56275b9c41123c6`
MD5	`02fe92e5b0dc36a69cc42eb6fef68c61`
BLAKE2b-256	`f5661dd2946bf93423c74f4c7c2ab4778ba1efc15e10764b69d1c989122f5268`

Provenance

The following attestation bundles were made for fairtrace-0.1.0.tar.gz:

Publisher: release.yml on nicoalbo0/fairtrace

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fairtrace-0.1.0.tar.gz
- Subject digest: 2d26c82bd667f19a30afcc7da8c4d73226e67fe30951a159e56275b9c41123c6
- Sigstore transparency entry: 2036682717
- Sigstore integration time: Jul 1, 2026
Source repository:
- Permalink: nicoalbo0/fairtrace@f6d687b20dfbade22531648720ad8e5ca9b3f4ee
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/nicoalbo0
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f6d687b20dfbade22531648720ad8e5ca9b3f4ee
- Trigger Event: release

File details

Details for the file fairtrace-0.1.0-py3-none-any.whl.

File metadata

Download URL: fairtrace-0.1.0-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 31.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fairtrace-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`626037c22d7ae90b9b1287372218957931b5b3c0af28edb8cacb38b0a1152f55`
MD5	`f61387fd36a5722ee19358cdef858da9`
BLAKE2b-256	`c40f90687bdd71d8bacfc7f079451d0eab6195316d58f7e87a0a3c1c81d057db`

Provenance

The following attestation bundles were made for fairtrace-0.1.0-py3-none-any.whl:

Publisher: release.yml on nicoalbo0/fairtrace

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fairtrace-0.1.0-py3-none-any.whl
- Subject digest: 626037c22d7ae90b9b1287372218957931b5b3c0af28edb8cacb38b0a1152f55
- Sigstore transparency entry: 2036683036
- Sigstore integration time: Jul 1, 2026
Source repository:
- Permalink: nicoalbo0/fairtrace@f6d687b20dfbade22531648720ad8e5ca9b3f4ee
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/nicoalbo0
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f6d687b20dfbade22531648720ad8e5ca9b3f4ee
- Trigger Event: release
