fmm-fairness-eval

SaMD-specific fairness evaluation CLI for foundation-model medical AI. Emits AI Act Art. 10 / Art. 9 evidence artifacts.

License: MIT | Python: 3.10+

fmm-fairness-eval (fmm-fairness on the command line) is a small, focused CLI that takes a predictions CSV from any SaMD or SaMD-adjacent medical-AI model and produces a regulator-friendly fairness evidence pack: a Markdown report + a machine-readable JSON pack + a SHA-256 audit chain. It is built around the failure mode regulators actually care about — inter-hospital / inter-site bias — and packaged so that the output drops straight into an EU AI Act Art. 10 / Art. 9 dossier.


Why this exists

Modern medical-AI systems are increasingly built on foundation-model embeddings (CONCH for histopathology, DINOv2 for general radiology, RadFM-style models for multi-modal radiology) plus a small downstream classifier. The dominant failure mode is no longer "the model is biased against women" or "the model misses dark skin tones" in isolation — it is inter-hospital generalization collapse: the model that scores F1=0.89 on one cohort drops to F1=0.70 on the cohort across the river, and the gap is largest in subgroups the training data under-represented.

The author's TFG (Trabajo Fin de Grado, i.e. bachelor's thesis; Universitat Politècnica de València, 2024) measured exactly this on dermatology AI using CONCH embeddings and multiple-instance learning over the AI4SkIN cohort: weighted F1 = 0.89, but an inter-hospital fairness gap of 0.19 between sites. That is not a corner case; it is the modal failure mode for any SaMD that crosses a hospital network boundary.

Existing fairness libraries (FairLearn, AIF360, Holistic AI, Microsoft Responsible AI Toolbox) are general-purpose ML fairness tools. None of them ship a SaMD-specific evaluation pipeline that:

  • Treats site / hospital as a first-class protected attribute distinct from individual demographics.
  • Emits AI Act Art. 10 / Art. 9 cross-cited evidence by default.
  • Defines a composite SaMD fairness score whose weighting reflects how regulators actually prioritize bias categories.
  • Ships a SHA-256 audit chain so the evidence pack is tamper-evident the moment it leaves your pipeline.

This tool fills that gap. Nothing more, nothing less.

Citation for the underlying TFG work: César Pereiro, Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset, Universitat Politècnica de València, 2024. https://riunet.upv.es/handle/10251/226903


Install

pip install fmm-fairness-eval

Or from source:

git clone https://github.com/<handle>/fmm-fairness-eval
cd fmm-fairness-eval
pip install -e .

Requires Python 3.10+, numpy ≥ 1.24, pandas ≥ 2.0, scikit-learn ≥ 1.3. No GPU dependency.


What it does

1. Run an evaluation

fmm-fairness evaluate predictions.csv \
    --protected-attrs site,sex,age_bucket \
    --site-attribute site \
    --output fairness-report/

predictions.csv must contain these columns:

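| column     | type             | meaning                                |
|------------|------------------|----------------------------------------|
| y_true     | int ∈ {0, 1}     | Ground-truth label                     |
| y_pred     | int ∈ {0, 1}     | Thresholded prediction                 |
| y_score    | float ∈ [0, 1]   | Raw model probability / score          |
| (declared) | str              | One column per --protected-attrs value |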
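
A quick pre-flight check can catch schema problems before the CLI is invoked. The sketch below is not part of fmm-fairness-eval; `validate_predictions` is a hypothetical helper that only encodes the column rules from the table above, and the sample values are invented for illustration.

```python
import pandas as pd

REQUIRED = ["y_true", "y_pred", "y_score"]

def validate_predictions(df, protected_attrs):
    """Return a list of schema problems; an empty list means the frame looks usable."""
    problems = [f"missing column: {c}"
                for c in REQUIRED + list(protected_attrs) if c not in df.columns]
    if problems:
        return problems
    for col in ("y_true", "y_pred"):
        # Labels and thresholded predictions must be binary (NaN fails this check too).
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col} must contain only 0 or 1")
    if not df["y_score"].between(0.0, 1.0).all():
        problems.append("y_score must lie in [0, 1]")
    return problems

# Toy frame with the three protected attributes used in the example command.
df = pd.DataFrame({
    "y_true": [1, 0, 1, 0],
    "y_score": [0.91, 0.20, 0.45, 0.62],
    "site": ["A", "A", "B", "B"],
    "sex": ["F", "M", "F", "M"],
    "age_bucket": ["40-59", "60+", "40-59", "60+"],
})
df["y_pred"] = (df["y_score"] >= 0.5).astype(int)  # threshold 0.5 is an assumption

problems = validate_predictions(df, ["site", "sex", "age_bucket"])
print(problems)
```

If the list comes back empty, `df.to_csv("predictions.csv", index=False)` writes a file in the shape the table describes.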

2. Read the output

The CLI produces three files in fairness-report/:

  • fairness-report.md — human-readable, regulator-friendly summary.
  • fairness-evidence.json — machine-readable evidence pack (stable schema, sorted keys for deterministic SHA).
  • audit.sha256 — SHA-256 of the above two files; pin in your QMS / change-control record.
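
To re-check the evidence pack later, the digests of the two report files can be recomputed independently and compared with what audit.sha256 records. This is a minimal sketch using only the standard library; the exact layout of audit.sha256 is not described here, so the helper just returns hex digests to compare by eye or by script. `sha256_of` and `digest_report` are hypothetical names.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def digest_report(directory,
                  names=("fairness-report.md", "fairness-evidence.json")):
    """Map each report file name to its SHA-256 hex digest."""
    return {name: sha256_of(Path(directory) / name) for name in names}
```

Calling `digest_report("fairness-report/")` after an evaluation yields one digest per file; any mismatch with the pinned values means the pack changed after it was produced.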

3. Cross-cite to the AI Act

fmm-fairness evaluate predictions.csv \
    --protected-attrs site,sex,age_bucket \
    --manifest-mode ai-act \
    --output fairness-report/

In ai-act mode the JSON pack gains a regulatory_mapping block that cross-cites each metric to the EU AI Act article it evidences:

  • Art. 9 (Risk management system): samd_fairness_score, inter_site_auc_variance.
  • Art. 10 (Data and data governance): equal_opportunity_gap, demographic_parity_gap, calibration_gap (evidences Art. 10(2)(f-g) examination of biases and shortcomings).
  • Art. 15 (Accuracy, robustness): inter_site_auc_variance (evidences generalization claims).
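
The article-to-metric mapping listed above can be written down as a plain dictionary, which is handy when post-processing the evidence pack. The actual layout of the regulatory_mapping block in the JSON is not shown here, so the structure below is an assumption that mirrors only the three bullets; `articles_for` is a hypothetical helper.

```python
# Article -> metrics, transcribed from the list above (structure is assumed).
REGULATORY_MAPPING = {
    "Art. 9": ["samd_fairness_score", "inter_site_auc_variance"],
    "Art. 10": ["equal_opportunity_gap", "demographic_parity_gap", "calibration_gap"],
    "Art. 15": ["inter_site_auc_variance"],
}

def articles_for(metric):
    """Return the articles a metric is cited under, in listing order."""
    return [art for art, metrics in REGULATORY_MAPPING.items() if metric in metrics]

print(articles_for("inter_site_auc_variance"))
```

Note that inter_site_auc_variance is cited under two articles, so a reverse lookup has to return a list rather than a single value.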

Metrics computed

| Metric                  | Formula (short)                                   | When it matters                                            |
|-------------------------|---------------------------------------------------|------------------------------------------------------------|
| equal_opportunity_gap   | max-min TPR across groups (Hardt et al. 2016)     | Under-diagnosis disparity (Seyyed-Kalantari et al. 2021)   |
| demographic_parity_gap  | max-min P(ŷ=1) across groups                      | Selection-rate disparity                                   |
| calibration_gap         | max-min ECE across groups                         | Score-trust differs by subgroup                            |
| inter_site_auc_variance | Var(AUC) across sites                             | Inter-hospital generalization risk (the SaMD failure mode) |
| samd_fairness_score     | composite ∈ [0,1] (see docs/samd-fairness-score.md) | Single-number summary for QMS dashboards                 |
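
The short formulas in the table can be sketched directly with pandas and scikit-learn. This is an illustration of the definitions, not the package's implementation: the function bodies, the equal-width 10-bin ECE, and the population variance (ddof=0) are all assumptions, and the toy data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def equal_opportunity_gap(df, attr):
    """max-min true-positive rate across groups."""
    tpr = df[df["y_true"] == 1].groupby(attr)["y_pred"].mean()
    return float(tpr.max() - tpr.min())

def demographic_parity_gap(df, attr):
    """max-min selection rate P(y_pred = 1) across groups."""
    rate = df.groupby(attr)["y_pred"].mean()
    return float(rate.max() - rate.min())

def expected_calibration_error(y_true, y_score, n_bins=10):
    """Equal-width-bin ECE: bin-weighted |observed rate - mean score|."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    bins = np.minimum((y_score * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in np.unique(bins):
        mask = bins == b
        ece += mask.mean() * abs(y_true[mask].mean() - y_score[mask].mean())
    return float(ece)

def calibration_gap(df, attr, n_bins=10):
    """max-min ECE across groups."""
    ece = [expected_calibration_error(g["y_true"], g["y_score"], n_bins)
           for _, g in df.groupby(attr)]
    return float(max(ece) - min(ece))

def inter_site_auc_variance(df, site_attr):
    """Variance of per-site ROC AUC (roc_auc_score needs both classes per site)."""
    aucs = [roc_auc_score(g["y_true"], g["y_score"])
            for _, g in df.groupby(site_attr)]
    return float(np.var(aucs))

df = pd.DataFrame({
    "site":    ["A"] * 4 + ["B"] * 4,
    "y_true":  [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred":  [1, 1, 0, 0, 1, 0, 1, 0],
    "y_score": [0.9, 0.8, 0.2, 0.1, 0.9, 0.4, 0.6, 0.1],
})
eo = equal_opportunity_gap(df, "site")
dp = demographic_parity_gap(df, "site")
cal = calibration_gap(df, "site")
auc_var = inter_site_auc_variance(df, "site")
print(eo, dp, cal, auc_var)
```

In this toy frame both sites select half their patients, so the demographic-parity gap is zero while the equal-opportunity gap is 0.5, which is why the metrics are reported side by side rather than collapsed early.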

All gap metrics ship with percentile bootstrap 95% CIs computed over a stratified resample.
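
A percentile bootstrap over a stratified resample can be sketched as follows. The choice to stratify by the grouping attribute (so each group keeps its original size in every replicate), the replicate count, and the function names are assumptions for illustration; the package's own resampling scheme may differ.

```python
import numpy as np
import pandas as pd

def dp_gap(df, attr="site"):
    """max-min selection rate across groups."""
    rate = df.groupby(attr)["y_pred"].mean()
    return float(rate.max() - rate.min())

def stratified_bootstrap_ci(df, metric, strata, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI: resample rows with replacement inside each stratum."""
    rng = np.random.default_rng(seed)
    groups = [g for _, g in df.groupby(strata)]
    stats = []
    for _ in range(n_boot):
        parts = [g.iloc[rng.integers(0, len(g), size=len(g))] for g in groups]
        stats.append(metric(pd.concat(parts, ignore_index=True)))
    # nanpercentile skips replicates where the metric was undefined.
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Synthetic predictions: site B is selected more often than site A.
data_rng = np.random.default_rng(42)
df = pd.DataFrame({
    "site": np.repeat(["A", "B"], 200),
    "y_pred": np.concatenate([data_rng.binomial(1, 0.30, 200),
                              data_rng.binomial(1, 0.45, 200)]),
})
low, high = stratified_bootstrap_ci(df, dp_gap, strata="site", n_boot=500)
print(f"dp_gap = {dp_gap(df):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Fixing the seed makes the interval reproducible, which matters if the numbers are going into a hashed evidence pack.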

The composite samd_fairness_score is defined explicitly with documented weights and a sensitivity analysis in docs/samd-fairness-score.md. It is not a black box and is not an FDA-blessed metric — it is a transparent aggregate the operator can defend, override, or replace.
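
To make "transparent aggregate" concrete, here is one way a weighted composite could be formed. The real definition lives in docs/samd-fairness-score.md and is not reproduced here, so everything below is an assumption: the "1 minus weighted penalty" form, the requirement that each component is already scaled to [0, 1], and the function name. Only the default weights are taken from the caveats section of this README.

```python
# Default weights as listed under the caveats (site, equal opportunity,
# demographic parity, calibration).
DEFAULT_WEIGHTS = {"site": 0.4, "eo": 0.3, "dp": 0.15, "cal": 0.15}

def composite_score(penalties, weights=DEFAULT_WEIGHTS):
    """Hypothetical aggregate: 1 - weighted mean of penalties clipped to [0, 1].

    `penalties` maps each component name to a value where 0 means no
    disparity and 1 means worst case; how raw metrics are scaled into
    that range is left to the caller.
    """
    total = sum(weights.values())
    weighted = sum(w * min(max(penalties[name], 0.0), 1.0)
                   for name, w in weights.items())
    return 1.0 - weighted / total

perfect = composite_score({"site": 0.0, "eo": 0.0, "dp": 0.0, "cal": 0.0})
site_heavy = composite_score({"site": 0.5, "eo": 0.0, "dp": 0.0, "cal": 0.0})
print(perfect, site_heavy)
```

With these weights a penalty of 0.5 on the site component alone lowers the score by 0.2, which shows how strongly the default weighting leans on inter-site behaviour.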


Scientific context

  • CONCH (Lu et al. 2024) is the visual-language pathology foundation model used in the underlying TFG work. Lu, M. Y. et al. "A visual-language foundation model for computational pathology." Nature Medicine 30, 863–874 (2024). doi:10.1038/s41591-024-02856-4
  • AI4SkIN is the multi-hospital dermatopathology dataset (Spain, multi-site) on which the TFG measured the 0.19 inter-hospital gap.
  • Under-diagnosis bias on chest X-rays (Seyyed-Kalantari et al. 2021, Nat. Med. 27, 2176-2182) is the canonical demonstration that single-site fairness audits miss the dominant failure mode.
  • Pain disparity reduction (Pierson et al. 2021, Nat. Med. 27, 136-140) demonstrates the inverse — that algorithmic predictions can outperform human-graded severity in capturing real disparities, motivating better measurement, not less.
  • Ethical implementation (Char, Shah, Magnus 2018, NEJM 378, 981-983) sets the still-canonical framing for healthcare-ML ethics.

A one-paragraph literature pointer for the formal definitions: the equal-opportunity criterion is Hardt, Price, Srebro (NeurIPS 2016); the demographic-parity definition follows Dwork et al. (ITCS 2012); calibration-by-group follows Pleiss et al. (NeurIPS 2017). The composite weighting is justified from FDA GMLP guidance (2021, updated 2024 IMDRF GMLP) + EU AI Act Art. 10 prioritisation of multi-site data governance.


What it does NOT do

  • Not a model-training framework. Bring your own predictions.
  • Not a foundation-model serving stack. Embeddings are outside scope.
  • Not auto-detection of protected attributes. You must declare them — silent attribute inference is itself a bias risk.
  • Not a certification. A fairness evaluation is evidence; certification is a regulatory process this tool helps you prepare for.
  • Not an explainability tool. It surfaces where bias lives, not why.

Honest scientific caveats (read before quoting numbers)

  1. Threshold sensitivity. equal_opportunity_gap and demographic_parity_gap both depend on the operating threshold used to produce y_pred. Re-run the evaluation at any threshold you would actually deploy at.
  2. Small-sample bootstrap. Percentile bootstrap is approximate for small groups; for n < 50 prefer the BCa interval or treat CIs as exploratory. Groups with n < min_group_n (default 20) are excluded with a warning rather than silently producing a near-zero gap.
  3. Prevalence confound. Inter-site bias is frequently confounded with prevalence shift. A site that sees twice the disease prevalence will show a different selection rate and predictive value even from a perfectly fair model, and prevalence shift often comes with case-mix shift that moves TPR as well. The tool reports both per-site AUC (less prevalence-sensitive) and per-site rates; interpret jointly.
  4. Composite score is opinionated. The default weights (w_site=0.4, w_eo=0.3, w_dp=0.15, w_cal=0.15) reflect this author's read of regulatory priority. Override with weights= in the Python API or treat the components separately. The single number is for dashboards; the components are for decisions.
  5. No causal inference. A measured gap does not identify the mechanism. Combine with subgroup analysis, training-set provenance audit, and (where possible) prospective evaluation.
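
Caveats 1 and 2 can be illustrated together: the sketch below re-thresholds y_score at several operating points and drops groups smaller than a minimum size before computing the gaps. It is a standalone illustration, not the package's code; `gaps_at_threshold` is a hypothetical helper and the toy data uses min_group_n=2 only so the example stays small.

```python
import pandas as pd

def gaps_at_threshold(df, attr, threshold, min_group_n=20):
    """Recompute selection-rate and TPR gaps at a given operating threshold."""
    sizes = df[attr].map(df[attr].value_counts())
    kept = df[sizes >= min_group_n]            # exclude under-sized groups
    y_pred = (kept["y_score"] >= threshold).astype(int)
    sel = y_pred.groupby(kept[attr]).mean()    # P(y_pred = 1) per group
    pos = kept["y_true"] == 1
    tpr = y_pred[pos].groupby(kept.loc[pos, attr]).mean()
    return {
        "threshold": threshold,
        "groups": sorted(kept[attr].unique()),
        "dp_gap": float(sel.max() - sel.min()),
        "eo_gap": float(tpr.max() - tpr.min()),
    }

df = pd.DataFrame({
    "site":    ["A"] * 4 + ["B"] * 4 + ["C"],
    "y_true":  [1, 1, 0, 0, 1, 1, 0, 0, 1],
    "y_score": [0.9, 0.7, 0.3, 0.1, 0.8, 0.45, 0.55, 0.2, 0.99],
})
at_050 = gaps_at_threshold(df, "site", 0.50, min_group_n=2)
at_040 = gaps_at_threshold(df, "site", 0.40, min_group_n=2)
print(at_050)
print(at_040)
```

Moving the threshold from 0.50 to 0.40 swaps which gap is non-zero in this toy data, and the single-row site C never enters either computation.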

Pricing

  • CLI: MIT, free, forever.
  • Hosted "fairness CI" (Phase 2 — not yet shipped): planned at €99/month for teams that want every commit to a model repo to fire an evaluation against a frozen multi-site cohort and post the evidence pack as a CI artifact. Mailing list opens at validation green-light.
  • Consulting: the author is available for SaMD fairness review / AI Act Art. 10 evidence-pack design at €60-100/hour. Contact via the linked GitHub profile; introductions through the academic-DM channel are welcome.

Citing

If you use this tool in published research:

@software{pereiro2026fmmfairness,
  author       = {Pereiro, C{\'e}sar},
  title        = {{fmm-fairness-eval}: SaMD-specific fairness evaluation for foundation-model medical AI},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {<assigned-on-first-release>},
  url          = {https://github.com/<handle>/fmm-fairness-eval}
}

@thesis{pereiro2024dermfairness,
  author = {Pereiro, C{\'e}sar},
  title  = {Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset},
  school = {Universitat Polit{\`e}cnica de Val{\`e}ncia},
  year   = {2024},
  url    = {https://riunet.upv.es/handle/10251/226903}
}

Roadmap

  • v0.1 (this release): CLI, 4 fairness metrics (three gap metrics plus inter-site AUC variance), composite score, AI Act manifest mode, SHA-256 audit chain.
  • v0.2: BCa bootstrap, sub-group intersectionality (site × sex), CSV-of-CSVs batch mode.
  • v0.3: HTML report option, hosted fairness-CI (Phase 2 — gated on validation pass).
  • v0.4: subgroup-aware threshold optimisation (opt-in, with the appropriate caveats).

License

MIT. See LICENSE.
