fmm-fairness-eval
SaMD-specific fairness evaluation CLI for foundation-model medical AI. Emits AI Act Art. 10 / Art. 9 evidence artifacts.
fmm-fairness-eval (fmm-fairness on the command line) is a small, focused CLI that takes a predictions CSV from any SaMD or SaMD-adjacent medical-AI model and produces a regulator-friendly fairness evidence pack: a Markdown report + a machine-readable JSON pack + a SHA-256 audit chain. It is built around the failure mode regulators actually care about — inter-hospital / inter-site bias — and packaged so that the output drops straight into an EU AI Act Art. 10 / Art. 9 dossier.
Why this exists
Modern medical-AI systems are increasingly built on foundation-model embeddings (CONCH for histopathology, DINOv2 for general radiology, RadFM-style models for multi-modal radiology) plus a small downstream classifier. The dominant failure mode is no longer "the model is biased against women" or "the model misses dark skin tones" in isolation — it is inter-hospital generalization collapse: the model that scores F1=0.89 on one cohort drops to F1=0.70 on the cohort across the river, and the gap is largest in subgroups the training data under-represented.
The author's bachelor's thesis (TFG, Universitat Politècnica de València, 2024) measured exactly this on dermatology AI using CONCH embeddings and multiple-instance learning over the AI4SkIN cohort: weighted F1 = 0.89, but an inter-hospital fairness gap of 0.19. That is not a corner case; it is the modal failure mode for any SaMD that crosses a hospital network boundary.
Existing fairness libraries (FairLearn, AIF360, Holistic AI, Microsoft Responsible AI Toolbox) are general-purpose ML fairness tools. None of them ship a SaMD-specific evaluation pipeline that:
- Treats site/hospital as a first-class protected attribute distinct from individual demographics.
- Emits AI Act Art. 10 / Art. 9 cross-cited evidence by default.
- Defines a composite SaMD fairness score whose weighting reflects how regulators actually prioritize bias categories.
- Ships a SHA-256 audit chain so the evidence pack is tamper-evident the moment it leaves your pipeline.
This tool fills that gap. Nothing more, nothing less.
Citation for the underlying TFG work: César Pereiro, Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset, Universitat Politècnica de València, 2024. https://riunet.upv.es/handle/10251/226903
Install
pip install fmm-fairness-eval
Or from source:
git clone https://github.com/<handle>/fmm-fairness-eval
cd fmm-fairness-eval
pip install -e .
Requires Python 3.10+, numpy ≥ 1.24, pandas ≥ 2.0, scikit-learn ≥ 1.3. No GPU dependency.
What it does
1. Run an evaluation
fmm-fairness evaluate predictions.csv \
--protected-attrs site,sex,age_bucket \
--site-attribute site \
--output fairness-report/
predictions.csv must contain these columns:
| column | type | meaning |
|---|---|---|
| y_true | int ∈ {0, 1} | Ground-truth label |
| y_pred | int ∈ {0, 1} | Thresholded prediction |
| y_score | float ∈ [0, 1] | Raw model probability / score |
| (declared) | str | One column per --protected-attrs value |
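Before running the CLI it can help to check a predictions file against the table above. The following is a minimal sketch using pandas; `validate_predictions` is a hypothetical helper written for this example, not part of the fmm-fairness-eval API, and the messages it returns are illustrative.

```python
import io

import pandas as pd

REQUIRED = ("y_true", "y_pred", "y_score")


def validate_predictions(df: pd.DataFrame, protected_attrs: list[str]) -> list[str]:
    """Return a list of schema problems; an empty list means the frame looks valid."""
    problems = [
        f"missing column: {col}"
        for col in (*REQUIRED, *protected_attrs)
        if col not in df.columns
    ]
    if problems:
        return problems
    # Labels and thresholded predictions must be binary.
    for col in ("y_true", "y_pred"):
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col} must contain only 0 or 1")
    # Scores must be probabilities; NaN fails this check as well.
    if not df["y_score"].between(0.0, 1.0).all():
        problems.append("y_score must lie in [0, 1]")
    return problems


sample_csv = io.StringIO(
    "y_true,y_pred,y_score,site,sex,age_bucket\n"
    "1,1,0.91,hospital_a,F,40-59\n"
    "0,0,0.12,hospital_b,M,60+\n"
)
df = pd.read_csv(sample_csv)
print(validate_predictions(df, ["site", "sex", "age_bucket"]))
```

An empty list means the three fixed columns and every declared protected attribute are present and in range.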
2. Read the output
The CLI produces three files in fairness-report/:
- fairness-report.md: human-readable, regulator-friendly summary.
- fairness-evidence.json: machine-readable evidence pack (stable schema, sorted keys for deterministic SHA).
- audit.sha256: SHA-256 of the above two files; pin in your QMS / change-control record.
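To illustrate why sorted keys matter for a deterministic SHA, here is a small standard-library sketch. The function names and the `sha256sum`-style line layout of the audit file are assumptions made for this example; the layout the CLI actually writes is not specified here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_evidence(pack: dict, path: Path) -> None:
    # sort_keys plus fixed separators make the bytes independent of dict order.
    text = json.dumps(pack, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_audit(files: list[Path], audit_path: Path) -> None:
    # Assumed layout: one "<hex digest>  <file name>" line per file.
    lines = [f"{sha256_of(p)}  {p.name}" for p in files]
    audit_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Same content, different key order.
    write_evidence({"b": 1, "a": {"d": 2, "c": 3}}, out / "first.json")
    write_evidence({"a": {"c": 3, "d": 2}, "b": 1}, out / "second.json")
    same_digest = sha256_of(out / "first.json") == sha256_of(out / "second.json")
    write_audit([out / "first.json", out / "second.json"], out / "audit.sha256")
    audit_lines = (out / "audit.sha256").read_text(encoding="utf-8").splitlines()

print(same_digest)  # True: key order does not change the digest
```

Recomputing the digests later and comparing them with the pinned values is what makes the pack tamper-evident.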
3. Cross-cite to the AI Act
fmm-fairness evaluate predictions.csv \
--protected-attrs site,sex,age_bucket \
--manifest-mode ai-act \
--output fairness-report/
In ai-act mode the JSON pack gains a regulatory_mapping block that cross-cites each metric to the EU AI Act article it evidences:
- Art. 9 (Risk management system) ↔ samd_fairness_score, inter_site_auc_variance.
- Art. 10 (Data and data governance) ↔ equal_opportunity_gap, demographic_parity_gap, calibration_gap (evidences Art. 10(2)(f-g) examination of biases and shortcomings).
- Art. 15 (Accuracy, robustness) ↔ inter_site_auc_variance (evidences generalization claims).
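The mapping above can also be read in the other direction, from metric to article. The sketch below transcribes the list into a Python dict and inverts it. The key names and nesting are illustrative assumptions; the actual schema of the regulatory_mapping block may differ.

```python
import json

# Article -> metrics, transcribed from the list above.
# Key names and nesting are hypothetical, not the real pack schema.
ARTICLE_TO_METRICS = {
    "Art. 9": ["samd_fairness_score", "inter_site_auc_variance"],
    "Art. 10": ["equal_opportunity_gap", "demographic_parity_gap", "calibration_gap"],
    "Art. 15": ["inter_site_auc_variance"],
}


def metric_to_articles(mapping: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert article -> metrics into metric -> articles."""
    inverted: dict[str, list[str]] = {}
    for article, metrics in mapping.items():
        for metric in metrics:
            inverted.setdefault(metric, []).append(article)
    return inverted


regulatory_mapping = metric_to_articles(ARTICLE_TO_METRICS)
print(json.dumps(regulatory_mapping, indent=2, sort_keys=True))
```

Note that inter_site_auc_variance is the only metric cited under two articles.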
Metrics computed
| Metric | Formula (short) | When it matters |
|---|---|---|
| equal_opportunity_gap | max-min TPR across groups (Hardt et al. 2016) | Under-diagnosis disparity (Seyyed-Kalantari et al. 2021) |
| demographic_parity_gap | max-min P(ŷ=1) across groups | Selection-rate disparity |
| calibration_gap | max-min ECE across groups | Score-trust differs by subgroup |
| inter_site_auc_variance | Var(AUC) across sites | Inter-hospital generalization risk (the SaMD failure mode) |
| samd_fairness_score | composite ∈ [0,1] (see docs/samd-fairness-score.md) | Single-number summary for QMS dashboards |
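The short formulas in the table can be written out as follows. This is a sketch under stated assumptions, not the package's implementation: the function bodies, the 10 equal-width ECE bins, and the population variance (ddof=0) are choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def equal_opportunity_gap(df, attr):
    """Max-min true-positive rate across the groups of `attr`."""
    positives = df[df["y_true"] == 1]
    tpr = positives.groupby(attr)["y_pred"].mean()
    return float(tpr.max() - tpr.min())


def demographic_parity_gap(df, attr):
    """Max-min positive-prediction rate across the groups of `attr`."""
    rate = df.groupby(attr)["y_pred"].mean()
    return float(rate.max() - rate.min())


def expected_calibration_error(y_true, y_score, n_bins=10):
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    # Equal-width bins (an assumption); the last bin also takes scores of 1.0.
    bins = np.minimum((y_score * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in np.unique(bins):
        in_bin = bins == b
        ece += in_bin.mean() * abs(y_true[in_bin].mean() - y_score[in_bin].mean())
    return float(ece)


def calibration_gap(df, attr, n_bins=10):
    eces = [
        expected_calibration_error(g["y_true"], g["y_score"], n_bins)
        for _, g in df.groupby(attr)
    ]
    return max(eces) - min(eces)


def inter_site_auc_variance(df, site_attr):
    # roc_auc_score raises ValueError for a site that contains only one class.
    aucs = [roc_auc_score(g["y_true"], g["y_score"]) for _, g in df.groupby(site_attr)]
    return float(np.var(aucs))  # population variance, ddof=0 (an assumption)


toy = pd.DataFrame({
    "site": ["a"] * 4 + ["b"] * 4,
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_score": [0.95, 0.85, 0.25, 0.15, 0.75, 0.45, 0.65, 0.35],
})
toy["y_pred"] = (toy["y_score"] >= 0.5).astype(int)

print("equal_opportunity_gap  ", equal_opportunity_gap(toy, "site"))
print("demographic_parity_gap ", demographic_parity_gap(toy, "site"))
print("calibration_gap        ", round(calibration_gap(toy, "site"), 4))
print("inter_site_auc_variance", inter_site_auc_variance(toy, "site"))
```

In this toy cohort both sites flag the same share of patients, so the demographic-parity gap is zero, yet site b misses half of its true positives. This is why the components are reported separately.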
All gap metrics ship with percentile bootstrap 95% CIs computed over a stratified resample.
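As a rough illustration of a percentile bootstrap over a stratified resample, the sketch below resamples rows with replacement inside each group and takes the 2.5th and 97.5th percentiles of the recomputed gap. Stratifying by group, the synthetic data, and the helper names are assumptions for this example; the tool's own stratification may differ.

```python
import numpy as np


def tpr_gap(y_true, y_pred, groups):
    tprs = []
    for g in np.unique(groups):
        pos = (groups == g) & (y_true == 1)
        if pos.any():
            tprs.append(y_pred[pos].mean())
    # NaN (not 0) when fewer than two groups have positives in this sample.
    return max(tprs) - min(tprs) if len(tprs) >= 2 else np.nan


def stratified_percentile_ci(y_true, y_pred, groups, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    strata = [np.flatnonzero(groups == g) for g in np.unique(groups)]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample within each group so group sizes stay fixed.
        idx = np.concatenate([rng.choice(s, size=s.size, replace=True) for s in strata])
        stats[b] = tpr_gap(y_true[idx], y_pred[idx], groups[idx])
    lo, hi = np.nanquantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic two-site cohort: site_a detects 90% of positives, site_b 70%.
rng = np.random.default_rng(42)
groups = np.repeat(np.array(["site_a", "site_b"]), 200)
y_true = rng.integers(0, 2, size=groups.size)
sensitivity = np.where(groups == "site_a", 0.9, 0.7)
draw = rng.random(groups.size)
y_pred = np.where(y_true == 1, draw < sensitivity, draw < 0.1).astype(int)

point = tpr_gap(y_true, y_pred, groups)
ci_low, ci_high = stratified_percentile_ci(y_true, y_pred, groups)
print(f"gap={point:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

With only about 100 positives per site the interval is wide, which is the practical reason to report it next to the point estimate.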
The composite samd_fairness_score is defined explicitly with documented weights and a sensitivity analysis in docs/samd-fairness-score.md. It is not a black box and is not an FDA-blessed metric — it is a transparent aggregate the operator can defend, override, or replace.
Scientific context
- CONCH (Lu et al. 2024) is the visual-language pathology foundation model used in the underlying TFG work. Lu, M. Y. et al. "A visual-language foundation model for computational pathology." Nature Medicine 30, 863–874 (2024). doi:10.1038/s41591-024-02856-4
- AI4SkIN is the multi-hospital dermatopathology dataset (Spain, multi-site) on which the TFG measured the 0.19 inter-hospital gap.
- Under-diagnosis bias on chest X-rays (Seyyed-Kalantari et al. 2021, Nat. Med. 27, 2176-2182) is the canonical demonstration that single-site fairness audits miss the dominant failure mode.
- Pain disparity reduction (Pierson et al. 2021, Nat. Med. 27, 136-140) demonstrates the inverse — that algorithmic predictions can outperform human-graded severity in capturing real disparities, motivating better measurement, not less.
- Ethical implementation (Char, Shah, Magnus 2018, NEJM 378, 981-983) sets the still-canonical framing for healthcare-ML ethics.
A one-paragraph literature pointer for the formal definitions: the equal-opportunity criterion is Hardt, Price, Srebro (NeurIPS 2016); the demographic-parity definition follows Dwork et al. (ITCS 2012); calibration-by-group follows Pleiss et al. (NeurIPS 2017). The composite weighting is justified from FDA GMLP guidance (2021, updated 2024 IMDRF GMLP) + EU AI Act Art. 10 prioritisation of multi-site data governance.
What it does NOT do
- Not a model-training framework. Bring your own predictions.
- Not a foundation-model serving stack. Embeddings are outside scope.
- Not auto-detection of protected attributes. You must declare them — silent attribute inference is itself a bias risk.
- Not a certification. A fairness evaluation is evidence; certification is a regulatory process this tool helps you prepare for.
- Not an explainability tool. It surfaces where bias lives, not why.
Honest scientific caveats (read before quoting numbers)
- Threshold sensitivity. equal_opportunity_gap and demographic_parity_gap both depend on the operating threshold used to produce y_pred. Re-run the evaluation at any threshold you would actually deploy at.
- Small-sample bootstrap. Percentile bootstrap is approximate for small groups; for n < 50 prefer the BCa interval or treat CIs as exploratory. Groups with n < min_group_n (default 20) are excluded with a warning rather than silently producing a near-zero gap.
- Prevalence confound. Inter-site bias is frequently confounded with prevalence shift. A site that sees twice the disease prevalence will show a different selection rate, and usually a different case mix and therefore a different TPR, even from a perfectly fair model. The tool reports both per-site AUC (less prevalence-sensitive) and per-site rates; interpret jointly.
- Composite score is opinionated. The default weights (w_site=0.4, w_eo=0.3, w_dp=0.15, w_cal=0.15) reflect this author's read of regulatory priority. Override with weights= in the Python API or treat the components separately. The single number is for dashboards; the components are for decisions.
- No causal inference. A measured gap does not identify the mechanism. Combine with subgroup analysis, training-set provenance audit, and (where possible) prospective evaluation.
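The threshold-sensitivity caveat is easy to see on a toy example. The sketch below re-thresholds y_score at several operating points and recomputes the equal-opportunity gap. The data are invented, and the `score >= threshold` convention is an assumption rather than documented tool behaviour.

```python
import numpy as np


def tpr_gap_at(y_true, y_score, groups, threshold):
    """Equal-opportunity gap when y_pred = (y_score >= threshold)."""
    y_pred = (y_score >= threshold).astype(int)
    tprs = [y_pred[(groups == g) & (y_true == 1)].mean() for g in np.unique(groups)]
    return float(max(tprs) - min(tprs))


# Invented data: three positives and one negative per site.
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
y_true = np.array([1, 1, 1, 0, 1, 1, 1, 0])
y_score = np.array([0.90, 0.80, 0.60, 0.20, 0.70, 0.55, 0.35, 0.10])

for threshold in (0.30, 0.50, 0.75):
    gap = tpr_gap_at(y_true, y_score, groups, threshold)
    print(f"threshold={threshold:.2f}  equal_opportunity_gap={gap:.3f}")
```

The same scores give a gap of zero at a threshold of 0.30 and a gap of two thirds at 0.75, so a gap quoted without its threshold is not interpretable.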
Pricing
- CLI: MIT, free, forever.
- Hosted "fairness CI" (Phase 2 — not yet shipped): planned at €99/month for teams that want every commit to a model repo to fire an evaluation against a frozen multi-site cohort and post the evidence pack as a CI artifact. Mailing list opens at validation green-light.
- Consulting: the author is available for SaMD fairness review / AI Act Art. 10 evidence-pack design at €60-100/hour. Contact via the linked GitHub profile; introductions through the academic-DM channel are welcome.
Citing
If you use this tool in published research:
@software{pereiro2026fmmfairness,
author = {Pereiro, C{\'e}sar},
title = {{fmm-fairness-eval}: SaMD-specific fairness evaluation for foundation-model medical AI},
year = {2026},
publisher = {Zenodo},
doi = {<assigned-on-first-release>},
url = {https://github.com/<handle>/fmm-fairness-eval}
}
@thesis{pereiro2024dermfairness,
author = {Pereiro, C{\'e}sar},
title = {Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset},
school = {Universitat Polit{\`e}cnica de Val{\`e}ncia},
year = {2024},
url = {https://riunet.upv.es/handle/10251/226903}
}
Roadmap
- v0.1 (this release): CLI, 4 fairness gap metrics, composite score, AI Act manifest mode, SHA-256 audit chain.
- v0.2: BCa bootstrap, sub-group intersectionality (site × sex), CSV-of-CSVs batch mode.
- v0.3: HTML report option, hosted fairness-CI (Phase 2, gated on validation pass).
- v0.4: subgroup-aware threshold optimisation (opt-in, with the appropriate caveats).
License
MIT. See LICENSE.