Developer-first fairness regression testing for LLM applications.

fairtrace

fairtrace is a compact fairness regression library for LLM agents and RAG pipelines.

It measures counterfactual disparity in the parts of an app that output-only evals miss:

  • tool use parity
  • retrieval exposure gaps
  • plan length gaps
  • escalation parity
  • friction point gaps
  • escalation reason parity

Text metrics stay in the package, but they are supporting signals rather than the main story.

Why fairtrace?

Output parity is not enough for agentic systems.

  • Two requests can get the same final answer while one path uses more tools, retrieves worse-ranked documents, escalates more often, or adds more friction.
  • Those process differences affect user effort, access, and service quality.
  • fairtrace turns those differences into CI checks.

Quick Start

Install in editable mode:

python -m pip install -e .

Run the test suite:

python -m unittest discover -s tests -t .

Run the bundled smoke example:

python -m fairtrace.cli run examples/launch_smoke.json --app examples.launch_smoke_app:respond --output /tmp/fairtrace-smoke

Run the bundled text-fairness demo:

python -m fairtrace.cli run examples/fairness.json --app examples.simple_app:respond --output /tmp/fairtrace-report

Generate a starter suite:

fairtrace init --output fairtrace.json

Smoke Example

examples/launch_smoke.json is the public smoke path used by CI. It exercises the CLI, report writers, and trace metrics with a stable app so the repo has a clean end-to-end check that passes on a fresh install.

The other examples/ suites remain useful as regression demos because they show how the metrics fail when the app behaves asymmetrically.

Trace Schema

fairtrace validates a small trace schema before metrics read it.

Supported trace fields:

  • tool_calls: list of objects with a non-empty name
  • retrieved_documents: list of objects with a group and optional rank
  • plan_steps: list of non-empty strings
  • escalated: boolean
  • escalation_reason: non-empty string
  • friction_points: list of non-empty strings

Accepted aliases:

  • toolCalls -> tool_calls
  • retrievedDocuments -> retrieved_documents
  • planSteps -> plan_steps
  • escalationReason -> escalation_reason
  • frictionPoints -> friction_points
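A minimal sketch of how the alias mapping above could be applied before validation. The function name `normalize_trace_keys` and the choice to raise when an alias and its canonical name both appear are assumptions for illustration, not fairtrace's documented behavior.

```python
# Alias table copied from the list above.
TRACE_ALIASES = {
    "toolCalls": "tool_calls",
    "retrievedDocuments": "retrieved_documents",
    "planSteps": "plan_steps",
    "escalationReason": "escalation_reason",
    "frictionPoints": "friction_points",
}


def normalize_trace_keys(trace: dict) -> dict:
    """Return a copy of `trace` with camelCase aliases renamed to snake_case."""
    normalized = {}
    for key, value in trace.items():
        canonical = TRACE_ALIASES.get(key, key)
        if canonical in normalized:
            # Assumption: a trace carrying both spellings is ambiguous, so fail loudly.
            raise ValueError(f"trace field given twice: {canonical}")
        normalized[canonical] = value
    return normalized


print(normalize_trace_keys({"toolCalls": [{"name": "kb_search"}], "escalated": False}))
```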

Example app response metadata:

{
  "metadata": {
    "helpfulness_score": 0.8,
    "toxicity_score": 0.0,
    "trace": {
      "tool_calls": [{ "name": "kb_search" }],
      "retrieved_documents": [
        { "group": "policy_docs", "rank": 1 },
        { "group": "support_docs", "rank": 3 }
      ],
      "plan_steps": ["search", "summarize", "respond"],
      "escalated": false,
      "friction_points": ["extra identity check"]
    }
  }
}
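To make the schema concrete, here is a rough validator written directly from the field list above. It is a sketch, not the library's own validator: the name `validate_trace`, the error strings, and the decision to treat every field as optional are assumptions.

```python
from typing import Any

LIST_FIELDS = ("tool_calls", "retrieved_documents", "plan_steps", "friction_points")


def _nonempty_str(value: Any) -> bool:
    return isinstance(value, str) and value.strip() != ""


def validate_trace(trace: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the trace matches the schema."""
    errors: list[str] = []
    for field in LIST_FIELDS:
        if field in trace and not isinstance(trace[field], list):
            errors.append(f"{field}: expected a list")

    def items(field: str) -> list:
        value = trace.get(field, [])
        return value if isinstance(value, list) else []

    for i, call in enumerate(items("tool_calls")):
        if not isinstance(call, dict) or not _nonempty_str(call.get("name")):
            errors.append(f"tool_calls[{i}]: needs a non-empty 'name'")
    for i, doc in enumerate(items("retrieved_documents")):
        if not isinstance(doc, dict) or not _nonempty_str(doc.get("group")):
            errors.append(f"retrieved_documents[{i}]: needs a non-empty 'group'")
    for field in ("plan_steps", "friction_points"):
        for i, entry in enumerate(items(field)):
            if not _nonempty_str(entry):
                errors.append(f"{field}[{i}]: expected a non-empty string")
    if "escalated" in trace and not isinstance(trace["escalated"], bool):
        errors.append("escalated: expected a boolean")
    if "escalation_reason" in trace and not _nonempty_str(trace["escalation_reason"]):
        errors.append("escalation_reason: expected a non-empty string")
    return errors


trace = {
    "tool_calls": [{"name": "kb_search"}],
    "retrieved_documents": [{"group": "policy_docs", "rank": 1}],
    "plan_steps": ["search", "summarize", "respond"],
    "escalated": False,
    "friction_points": ["extra identity check"],
}
print(validate_trace(trace))  # an empty list: no problems found
```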

Config Shape

{
  "dataset": {
    "prompts": [
      {
        "id": "support-password-reset",
        "prompt": "Help a {region} customer reset a password",
        "attributes": {
          "region": ["consumer", "enterprise"]
        }
      }
    ]
  },
  "metrics": [
    { "type": "tool_use_parity", "threshold": 0.1 },
    { "type": "retrieval_exposure_gap", "threshold": 0.1 },
    { "type": "plan_length_gap", "threshold": 1.0 },
    { "type": "escalation_parity", "threshold": 0.1 },
    { "type": "friction_point_gap", "threshold": 1.0 },
    { "type": "escalation_reason_parity", "threshold": 0.1 }
  ]
}
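The `prompts` entry above is a template: each combination of attribute values yields one counterfactual variant. A sketch of that expansion, assuming plain `str.format` placeholders; the helper name `expand_prompt` and the output keys are hypothetical (`assignments` is borrowed from the hook examples in this README).

```python
from itertools import product


def expand_prompt(entry: dict) -> list[dict]:
    """Expand one templated prompt into one variant per attribute combination."""
    attributes = entry["attributes"]
    names = list(attributes)
    variants = []
    for values in product(*(attributes[name] for name in names)):
        assignments = dict(zip(names, values))
        variants.append(
            {
                "id": entry["id"],
                "prompt": entry["prompt"].format(**assignments),
                "assignments": assignments,
            }
        )
    return variants


entry = {
    "id": "support-password-reset",
    "prompt": "Help a {region} customer reset a password",
    "attributes": {"region": ["consumer", "enterprise"]},
}
for variant in expand_prompt(entry):
    print(variant["assignments"], "->", variant["prompt"])
```

Because every variant shares the same `id` and differs only in its assignments, any difference in the resulting traces can be attributed to the swapped attribute.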

helpfulness_gap reads response_metadata.helpfulness_score when present.

toxicity_gap reads response_metadata.toxicity_score when present; otherwise it falls back to a small built-in heuristic list.

tool_use_parity reads response_metadata.trace.tool_calls and compares tool use rates across groups.

retrieval_exposure_gap reads response_metadata.trace.retrieved_documents and compares ranking exposure across document groups.
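One way to picture "ranking exposure" is a rank-discounted share per document group. The sketch below uses a `1 / log2(rank + 1)` discount and falls back to list position when `rank` is missing; both choices, and the function names, are assumptions for illustration rather than the formula fairtrace uses.

```python
import math
from collections import defaultdict


def exposure_shares(retrieved_documents: list[dict]) -> dict[str, float]:
    """Share of rank-discounted exposure received by each document group."""
    exposure: dict[str, float] = defaultdict(float)
    for position, doc in enumerate(retrieved_documents, start=1):
        rank = doc.get("rank", position)  # assumption: position stands in for a missing rank
        exposure[doc["group"]] += 1.0 / math.log2(rank + 1)
    total = sum(exposure.values())
    return {group: value / total for group, value in exposure.items()} if total else {}


def exposure_gap(retrieved_documents: list[dict]) -> float:
    """Largest minus smallest exposure share; 0.0 when nothing was retrieved."""
    shares = exposure_shares(retrieved_documents)
    return max(shares.values()) - min(shares.values()) if shares else 0.0


docs = [
    {"group": "policy_docs", "rank": 1},
    {"group": "support_docs", "rank": 3},
]
print(exposure_shares(docs), exposure_gap(docs))
```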

plan_length_gap reads response_metadata.trace.plan_steps and compares average plan length across groups.

escalation_parity reads response_metadata.trace.escalated and compares escalation rates across groups.

friction_point_gap reads response_metadata.trace.friction_points and compares extra friction counts across groups.

escalation_reason_parity reads response_metadata.trace.escalation_reason and compares escalation reasons across groups.
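Several of the trace metrics above share one pattern: reduce each trace to a number, average per group, and compare the groups. The sketch below shows that pattern under stated assumptions: the record shape (`group` plus `trace`), the helper name `group_gap`, and the use of "max mean minus min mean" as the gap are illustrative and may differ from the library's definitions.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable


def group_gap(records: list[dict], value_of: Callable[[dict], float]) -> tuple[float, dict]:
    """Return (gap, per-group means) for a per-trace value."""
    by_group: dict[str, list[float]] = defaultdict(list)
    for record in records:
        by_group[record["group"]].append(float(value_of(record["trace"])))
    means = {group: mean(values) for group, values in by_group.items()}
    return max(means.values()) - min(means.values()), means


records = [
    {"group": "consumer", "trace": {"tool_calls": [{"name": "kb_search"}], "plan_steps": ["a", "b", "c"]}},
    {"group": "consumer", "trace": {"tool_calls": [{"name": "kb_search"}], "plan_steps": ["a", "b", "c"]}},
    {"group": "enterprise", "trace": {"tool_calls": [], "plan_steps": ["a", "b", "c", "d", "e"]}},
    {"group": "enterprise", "trace": {"tool_calls": [{"name": "kb_search"}], "plan_steps": ["a", "b", "c", "d", "e"]}},
]

# Rate-style metric: did the trace use any tool at all?
tool_gap, tool_rates = group_gap(records, lambda t: bool(t.get("tool_calls")))
# Mean-style metric: how many plan steps did the trace take?
plan_gap, plan_means = group_gap(records, lambda t: len(t.get("plan_steps", [])))
print(tool_gap, plan_gap)  # 0.5 2.0
```

A gap computed this way would then be compared against the metric's `threshold` from the suite config.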

Metric scores are effect-size estimates. Bootstrap intervals are descriptive, and optional permutation p-values are there to flag regressions, not to replace a full statistical study.
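For readers unfamiliar with the two statistics mentioned above, here is a standard-library sketch of a percentile bootstrap interval and a two-sided permutation test for a difference in group means. It illustrates the general techniques only; the function names, resample counts, and seeding are assumptions, not fairtrace's implementation.

```python
import random
from statistics import mean


def bootstrap_interval(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(a, k=len(a))) - mean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if abs(mean(pooled[: len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


plan_lengths_a = [3, 3, 4, 3, 3, 4]
plan_lengths_b = [5, 6, 5, 6, 5, 5]
print(bootstrap_interval(plan_lengths_a, plan_lengths_b))
print(permutation_p_value(plan_lengths_a, plan_lengths_b))
```

With the small sample sizes typical of a regression suite, both numbers are best read as a tripwire for change between runs, which matches the framing above.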

Trace metric rationale: docs/trace_fairness.md

refusal_gap, helpfulness_gap, and toxicity_gap also accept explicit evaluator hooks. If you do not provide one, they fall back to the built-in heuristics and mark that in the metric details.

Evaluator Hooks

You can point a metric at a module:function hook in suite config.

{
  "metrics": [
    {
      "type": "toxicity_gap",
      "threshold": 0.1,
      "toxicity_evaluator": "examples.evaluator_hooks:toxicity_score"
    }
  ]
}
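A `module:function` string like the one above is typically resolved with `importlib`. The sketch below shows that general technique; `resolve_hook` is a hypothetical name, and fairtrace's own loader may handle errors differently.

```python
from importlib import import_module
from typing import Callable


def resolve_hook(spec: str) -> Callable:
    """Turn 'package.module:function' into the callable it names."""
    module_name, separator, attribute = spec.partition(":")
    if not separator or not module_name or not attribute:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    target = getattr(import_module(module_name), attribute)
    if not callable(target):
        raise TypeError(f"{spec!r} does not name a callable")
    return target


# Demonstrated with a standard-library function so the snippet runs anywhere.
square_root = resolve_hook("math:sqrt")
print(square_root(9.0))  # 3.0
```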

Example hook shapes:

def toxicity_score(response: str, record: dict) -> float:
    return 0.0 if "safe" in response.lower() else 1.0

def refusal_detected(response: str, record: dict) -> bool:
    return record["assignments"].get("region") == "restricted"

def helpfulness_score(response: str, record: dict) -> float:
    return 0.9 if "help" in response.lower() else 0.2

Adapters

  • CallableAdapter for plain Python functions
  • OpenAICompatibleAdapter for clients exposing client.responses.create(...)
  • OpenAIAgentsAdapter for agent objects exposing run(...)
  • LangChainAdapter for objects exposing invoke(...)
  • LangGraphAdapter for graph state objects exposing invoke(...)

Each adapter can take a trace_mapper callback when the source app emits a different trace shape.
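As an illustration of what a trace_mapper might do, the function below converts a made-up source trace into the fairtrace field names. Everything about the input shape (`steps`, `type`, `tool`, `text`, `handoff`) is hypothetical, and the callback signature (one raw dict in, one trace dict out) is an assumption; check the adapter you use for the signature it actually expects.

```python
def map_trace(raw: dict) -> dict:
    """Map a hypothetical step-log format onto fairtrace's trace fields."""
    steps = raw.get("steps", [])
    return {
        "tool_calls": [{"name": step["tool"]} for step in steps if step.get("type") == "tool"],
        "plan_steps": [step["text"] for step in steps if step.get("type") == "plan"],
        "escalated": bool(raw.get("handoff", False)),
    }


raw_trace = {
    "steps": [
        {"type": "plan", "text": "search"},
        {"type": "tool", "tool": "kb_search"},
        {"type": "plan", "text": "respond"},
    ],
    "handoff": False,
}
print(map_trace(raw_trace))
```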

Import helpers:

  • load_promptfoo_variants(...)
  • load_deepeval_variants(...)
  • assert_fairtrace_passes(...)

CI wiring example:

Compare two runs:

python -m fairtrace.cli compare baseline.json candidate.json --format markdown

You can also point compare at two report directories and it will read each directory's report.json.

Validation Rules

Suite files are rejected early when they contain:

  • unknown top-level fields
  • unknown dataset or metric fields
  • duplicate prompt ids
  • empty prompt lists or metric lists
  • prompt placeholders that do not match defined attributes
  • unsupported metric types
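Two of the rules above, duplicate prompt ids and placeholder/attribute mismatches, can be checked with the standard library alone. The sketch below is an illustration, not the package's validator: `check_prompts` is a hypothetical name, and it assumes "match" means the set of placeholders equals the set of defined attributes.

```python
from string import Formatter


def placeholder_names(template: str) -> set[str]:
    """Names of the {placeholders} used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_prompts(prompts: list[dict]) -> list[str]:
    """Return validation problems for a dataset.prompts list."""
    errors: list[str] = []
    if not prompts:
        errors.append("prompt list is empty")
    seen: set[str] = set()
    for entry in prompts:
        prompt_id = entry["id"]
        if prompt_id in seen:
            errors.append(f"duplicate prompt id: {prompt_id}")
        seen.add(prompt_id)
        used = placeholder_names(entry["prompt"])
        defined = set(entry.get("attributes", {}))
        if used != defined:
            errors.append(
                f"{prompt_id}: placeholders {sorted(used)} do not match attributes {sorted(defined)}"
            )
    return errors


good = {"id": "p1", "prompt": "Help a {region} customer", "attributes": {"region": ["consumer"]}}
bad = {"id": "p1", "prompt": "Help a {segment} customer", "attributes": {"region": ["consumer"]}}
print(check_prompts([good, bad]))
```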

Dataset files may use either:

  • dataset.prompts for template expansion
  • dataset.variants for explicit imported cases

A suite file must not use both.

External Imports

For external eval tools, import their cases as explicit variants first.

Promptfoo importer accepts:

{
  "tests": [
    {
      "id": "case-1",
      "group_id": "seed-1",
      "prompt": "hello",
      "vars": { "gender": "woman" }
    }
  ]
}

DeepEval importer accepts:

{
  "cases": [
    {
      "id": "case-1",
      "input": "hello",
      "metadata": {
        "seed_id": "seed-1",
        "assignments": { "gender": "man" }
      }
    }
  ]
}
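Both input shapes carry the same three pieces of information: a case id, a seed or group id that ties counterfactual cases together, and the attribute assignments. The sketch below flattens each into one common record to show the correspondence. The target record shape and the function names `from_promptfoo` and `from_deepeval` are hypothetical; the real importers are `load_promptfoo_variants(...)` and `load_deepeval_variants(...)`, whose signatures and output are not shown in this README.

```python
def from_promptfoo(document: dict) -> list[dict]:
    """Flatten the Promptfoo-style shape shown above into common variant records."""
    return [
        {
            "id": test["id"],
            "group_id": test["group_id"],
            "prompt": test["prompt"],
            "assignments": dict(test.get("vars", {})),
        }
        for test in document["tests"]
    ]


def from_deepeval(document: dict) -> list[dict]:
    """Flatten the DeepEval-style shape shown above into common variant records."""
    return [
        {
            "id": case["id"],
            "group_id": case["metadata"]["seed_id"],
            "prompt": case["input"],
            "assignments": dict(case["metadata"].get("assignments", {})),
        }
        for case in document["cases"]
    ]


promptfoo_doc = {
    "tests": [{"id": "case-1", "group_id": "seed-1", "prompt": "hello", "vars": {"gender": "woman"}}]
}
deepeval_doc = {
    "cases": [
        {"id": "case-1", "input": "hello", "metadata": {"seed_id": "seed-1", "assignments": {"gender": "man"}}}
    ]
}
print(from_promptfoo(promptfoo_doc) + from_deepeval(deepeval_doc))
```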
