fmm-fairness-eval

SaMD-specific fairness evaluation CLI for foundation-model medical AI. Emits AI Act Art. 10 / Art. 9 evidence artifacts.

fmm-fairness-eval (fmm-fairness on the command line) is a small, focused CLI that takes a predictions CSV from any SaMD or SaMD-adjacent medical-AI model and produces a regulator-friendly fairness evidence pack: a Markdown report + a machine-readable JSON pack + a SHA-256 audit chain. It is built around the failure mode regulators actually care about — inter-hospital / inter-site bias — and packaged so that the output drops straight into an EU AI Act Art. 10 / Art. 9 dossier.

Why this exists

Modern medical-AI systems are increasingly built on foundation-model embeddings (CONCH for histopathology, general-purpose vision backbones such as DINOv2 adapted to radiology, RadFM-style models for multi-modal radiology) plus a small downstream classifier. The dominant failure mode is no longer "the model is biased against women" or "the model misses dark skin tones" in isolation. It is inter-hospital generalization collapse: a model that scores F1=0.89 on one cohort drops to F1=0.70 on the cohort across the river, and the gap is largest in the subgroups the training data under-represented.

The author's bachelor's thesis (TFG, Universitat Politècnica de València, 2024) measured exactly this on dermatology AI using CONCH embeddings and multiple-instance learning over the AI4SkIN cohort: weighted F1 = 0.89, but an inter-hospital fairness gap of 0.19 between sites. That is not a corner case: it is the failure mode to expect whenever a SaMD crosses a hospital network boundary.

Existing fairness libraries (Fairlearn, AIF360, Holistic AI, Microsoft Responsible AI Toolbox) are general-purpose ML fairness tools. None of them ships a SaMD-specific evaluation pipeline that:

Treats site / hospital as a first-class protected attribute distinct from individual demographics.
Emits AI Act Art. 10 / Art. 9 cross-cited evidence by default.
Defines a composite SaMD fairness score whose weighting reflects how regulators actually prioritize bias categories.
Ships a SHA-256 audit chain so the evidence pack is tamper-evident the moment it leaves your pipeline.

This tool fills that gap. Nothing more, nothing less.

Citation for the underlying TFG work: César Pereiro, Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset, Universitat Politècnica de València, 2024. https://riunet.upv.es/handle/10251/226903

Install

pip install fmm-fairness-eval

Or from source:

git clone https://github.com/<handle>/fmm-fairness-eval
cd fmm-fairness-eval
pip install -e .

Requires Python 3.10+, numpy ≥ 1.24, pandas ≥ 2.0, scikit-learn ≥ 1.3. No GPU dependency.

What it does

1. Run an evaluation

fmm-fairness evaluate predictions.csv \
    --protected-attrs site,sex,age_bucket \
    --site-attribute site \
    --output fairness-report/

predictions.csv must contain these columns:

| Column | Type | Meaning |
| --- | --- | --- |
| `y_true` | int ∈ {0, 1} | Ground-truth label |
| `y_pred` | int ∈ {0, 1} | Thresholded prediction |
| `y_score` | float ∈ [0, 1] | Raw model probability / score |
| (declared) | str | One column per `--protected-attrs` value |
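
A minimal sketch of a pre-flight check for the column contract above, using only the standard library. It is not part of the tool: the `check_row` helper, the sample rows, and the attribute values (`hospital_a`, `60-79`, and so on) are hypothetical and only illustrate the expected shape of `predictions.csv`.

```python
import csv
import io

def check_row(row, protected_attrs):
    """Return a list of problems found in one CSV row (empty list = row looks fine)."""
    problems = []
    for col in ("y_true", "y_pred"):
        if row.get(col) not in ("0", "1"):
            problems.append(f"{col} must be 0 or 1, got {row.get(col)!r}")
    try:
        score = float(row.get("y_score"))
    except (TypeError, ValueError):
        problems.append("y_score is not a number")
    else:
        if not 0.0 <= score <= 1.0:
            problems.append(f"y_score {score} is outside [0, 1]")
    for attr in protected_attrs:
        if not row.get(attr):
            problems.append(f"{attr} is empty or missing")
    return problems

# Hypothetical sample data; the third data row is deliberately broken.
SAMPLE = """\
y_true,y_pred,y_score,site,sex,age_bucket
1,1,0.91,hospital_a,F,60-79
0,0,0.12,hospital_b,M,40-59
1,0,1.40,hospital_b,,40-59
"""

protected = ["site", "sex", "age_bucket"]
# start=2 because line 1 of the file is the header.
report = {
    line_no: check_row(row, protected)
    for line_no, row in enumerate(csv.DictReader(io.StringIO(SAMPLE)), start=2)
}
for line_no, problems in report.items():
    print(line_no, problems or "ok")
```

Running a check like this before the evaluation keeps malformed rows from turning into confusing metric values later.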

2. Read the output

The CLI produces three files in fairness-report/:

fairness-report.md — human-readable, regulator-friendly summary.
fairness-evidence.json — machine-readable evidence pack (stable schema, sorted keys for deterministic SHA).
audit.sha256 — SHA-256 of the above two files; pin in your QMS / change-control record.
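
The reason sorted keys matter for the hash can be shown with a short sketch. It assumes nothing about the tool's internals: the helper names are hypothetical, the two-field payload is made up, and the layout of `audit.sha256` is not specified in this README, so the sketch only compares digests directly.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def canonical_json_bytes(payload: dict) -> bytes:
    """Serialise with sorted keys and fixed separators so equal content gives equal bytes."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large evidence packs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Same content, different insertion order (hypothetical payload).
pack_a = {"metric": "equal_opportunity_gap", "value": 0.12}
pack_b = {"value": 0.12, "metric": "equal_opportunity_gap"}

with tempfile.TemporaryDirectory() as tmp:
    path_a = Path(tmp) / "a.json"
    path_b = Path(tmp) / "b.json"
    path_a.write_bytes(canonical_json_bytes(pack_a))
    path_b.write_bytes(canonical_json_bytes(pack_b))
    digest_a = sha256_of_file(path_a)
    digest_b = sha256_of_file(path_b)

print(digest_a == digest_b)  # key order no longer affects the hash
```

The same `sha256_of_file` pattern can be used to recompute the digest of a stored evidence pack and compare it with the value pinned in a change-control record.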

3. Cross-cite to the AI Act

fmm-fairness evaluate predictions.csv \
    --protected-attrs site,sex,age_bucket \
    --manifest-mode ai-act \
    --output fairness-report/

In ai-act mode the JSON pack gains a regulatory_mapping block that cross-cites each metric to the EU AI Act article it evidences:

Art. 9 (Risk management system) ↔ samd_fairness_score, inter_site_auc_variance.
Art. 10 (Data and data governance) ↔ equal_opportunity_gap, demographic_parity_gap, calibration_gap (evidences the Art. 10(2)(f)-(h) examination of possible biases, bias mitigation measures, and identification of data gaps or shortcomings).
Art. 15 (Accuracy, robustness) ↔ inter_site_auc_variance (evidences generalization claims).
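
As an illustration of how a consumer might use this mapping, the sketch below inverts it to answer "which articles does this metric evidence?". The article-to-metric pairs are taken from the list above; the dict layout and the `articles_by_metric` helper are assumptions, since the actual schema of the `regulatory_mapping` block is not shown in this README.

```python
# Hypothetical in-memory form of the mapping described above.
REGULATORY_MAPPING = {
    "Art. 9": ["samd_fairness_score", "inter_site_auc_variance"],
    "Art. 10": ["equal_opportunity_gap", "demographic_parity_gap", "calibration_gap"],
    "Art. 15": ["inter_site_auc_variance"],
}

def articles_by_metric(mapping):
    """Invert article -> metrics into metric -> articles, keeping article order."""
    inverted = {}
    for article, metrics in mapping.items():
        for metric in metrics:
            inverted.setdefault(metric, []).append(article)
    return inverted

by_metric = articles_by_metric(REGULATORY_MAPPING)
for metric, articles in sorted(by_metric.items()):
    print(f"{metric}: {', '.join(articles)}")
```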

Metrics computed

| Metric | Formula (short) | When it matters |
| --- | --- | --- |
| `equal_opportunity_gap` | max-min TPR across groups (Hardt et al. 2016) | Under-diagnosis disparity (Seyyed-Kalantari et al. 2021) |
| `demographic_parity_gap` | max-min P(ŷ=1) across groups | Selection-rate disparity |
| `calibration_gap` | max-min ECE across groups | Score-trust differs by subgroup |
| `inter_site_auc_variance` | Var(AUC) across sites | Inter-hospital generalization risk (the SaMD failure mode) |
| `samd_fairness_score` | composite ∈ [0,1] (see `docs/samd-fairness-score.md`) | Single-number summary for QMS dashboards |
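
A minimal sketch of how the first four metrics in the table could be computed, written against the standard library so it runs without the tool's dependencies. It is not the tool's implementation: the record layout, the helper names, the 10 equal-width ECE bins, the rank-based AUC, and the use of population variance are all assumptions made for illustration, and the eight records are invented.

```python
from collections import defaultdict
from statistics import pvariance

def by_group(records, attr):
    """Split records into lists keyed by the value of one attribute."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r)
    return groups

def tpr(rows):
    """True-positive rate; assumes the group contains at least one positive."""
    positives = [r for r in rows if r["y_true"] == 1]
    return sum(r["y_pred"] for r in positives) / len(positives)

def selection_rate(rows):
    """Share of rows predicted positive, P(y_pred = 1)."""
    return sum(r["y_pred"] for r in rows) / len(rows)

def ece(rows, n_bins=10):
    """Expected calibration error over equal-width score bins (bin count is an assumption)."""
    bins = defaultdict(list)
    for r in rows:
        bins[min(int(r["y_score"] * n_bins), n_bins - 1)].append(r)
    total = 0.0
    for members in bins.values():
        observed = sum(r["y_true"] for r in members) / len(members)
        predicted = sum(r["y_score"] for r in members) / len(members)
        total += len(members) / len(rows) * abs(observed - predicted)
    return total

def auc(rows):
    """Rank-based AUC: share of (positive, negative) pairs ordered correctly, ties count half."""
    pos = [r["y_score"] for r in rows if r["y_true"] == 1]
    neg = [r["y_score"] for r in rows if r["y_true"] == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gap(records, attr, stat):
    """max-min of a per-group statistic."""
    values = [stat(rows) for rows in by_group(records, attr).values()]
    return max(values) - min(values)

# Invented records: (y_true, y_pred, y_score, site), thresholded at 0.5.
raw = [
    (1, 1, 0.95, "hospital_a"), (1, 1, 0.85, "hospital_a"),
    (0, 0, 0.25, "hospital_a"), (0, 0, 0.15, "hospital_a"),
    (1, 1, 0.75, "hospital_b"), (1, 0, 0.45, "hospital_b"),
    (0, 1, 0.65, "hospital_b"), (0, 0, 0.35, "hospital_b"),
]
records = [dict(zip(("y_true", "y_pred", "y_score", "site"), row)) for row in raw]

eo_gap = gap(records, "site", tpr)
dp_gap = gap(records, "site", selection_rate)
cal_gap = gap(records, "site", ece)
site_auc_var = pvariance([auc(rows) for rows in by_group(records, "site").values()])
print(eo_gap, dp_gap, cal_gap, site_auc_var)
```

In this toy data both sites flag half of their patients, so the demographic-parity gap is zero while the equal-opportunity gap is 0.5. That is the kind of disagreement between metrics that makes reporting them separately worthwhile.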

All gap metrics ship with percentile bootstrap 95% CIs computed over a stratified resample.
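
A sketch of a stratified percentile bootstrap for a gap metric, under stated assumptions: stratification is by group (each replicate resamples with replacement inside every group, preserving group sizes), the statistic is the selection rate, and the function name, replicate count, and seed are hypothetical. The tool's own resampling scheme may differ.

```python
import random

def selection_rate(preds):
    """Share of positive predictions in a list of 0/1 values."""
    return sum(preds) / len(preds)

def bootstrap_gap_ci(groups, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for max-min of a per-group statistic."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Stratified resample: draw with replacement inside each group,
        # so every replicate keeps the original group sizes.
        values = [stat(rng.choices(rows, k=len(rows))) for rows in groups.values()]
        gaps.append(max(values) - min(values))
    gaps.sort()
    lower = gaps[int(alpha / 2 * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Invented data: 100 predictions per site, 30% vs 45% flagged positive.
groups = {
    "hospital_a": [1] * 30 + [0] * 70,
    "hospital_b": [1] * 45 + [0] * 55,
}
rates = [selection_rate(rows) for rows in groups.values()]
observed = max(rates) - min(rates)
lower, upper = bootstrap_gap_ci(groups, selection_rate)
print(f"gap={observed:.2f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Because max-min can never be negative, the lower bound of such an interval tends to sit above zero even when the true gap is zero, which is one reason to read small gaps cautiously.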

The composite samd_fairness_score is defined explicitly with documented weights and a sensitivity analysis in docs/samd-fairness-score.md. It is not a black box and is not an FDA-blessed metric — it is a transparent aggregate the operator can defend, override, or replace.
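
To make "transparent aggregate" concrete, here is one plausible shape for such a score. The real formula is defined in docs/samd-fairness-score.md and is not reproduced here. This sketch assumes each component has already been normalised to a penalty in [0, 1], that 1.0 means "no measured disparity", and that the score is one minus the weighted sum of penalties. The weights are the defaults quoted in the caveats section of this README; the function name and dict keys are hypothetical.

```python
# Default weights as quoted in this README (w_site, w_eo, w_dp, w_cal).
DEFAULT_WEIGHTS = {"site": 0.4, "eo": 0.3, "dp": 0.15, "cal": 0.15}

def composite_score(penalties, weights=None):
    """Assumed form: 1 - sum(w_i * penalty_i), with penalties pre-normalised to [0, 1]."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    if set(penalties) != set(weights):
        raise ValueError("penalties and weights must use the same keys")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if any(not 0.0 <= p <= 1.0 for p in penalties.values()):
        raise ValueError("each penalty must already be normalised to [0, 1]")
    return 1.0 - sum(weights[key] * penalties[key] for key in weights)

# Invented penalties for illustration only.
penalties = {"site": 0.5, "eo": 0.2, "dp": 0.0, "cal": 0.1}
default_score = composite_score(penalties)

# Overriding the weights, e.g. to weigh all components equally.
equal_weights = {"site": 0.25, "eo": 0.25, "dp": 0.25, "cal": 0.25}
equal_score = composite_score(penalties, weights=equal_weights)
print(round(default_score, 3), round(equal_score, 3))
```

The two results differ because the default weights emphasise the site component, which shows how much a single-number summary depends on the weighting choice.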

Scientific context

CONCH (Lu et al. 2024) is the visual-language pathology foundation model used in the underlying TFG work. Lu, M. Y. et al. "A visual-language foundation model for computational pathology." Nature Medicine 30, 863–874 (2024). doi:10.1038/s41591-024-02856-4
AI4SkIN is the multi-hospital dermatopathology dataset (Spain, multi-site) on which the TFG measured the 0.19 inter-hospital gap.
Under-diagnosis bias on chest X-rays (Seyyed-Kalantari et al. 2021, Nat. Med. 27, 2176-2182) shows that classifiers trained on large public chest X-ray datasets consistently under-diagnose under-served subgroups, a pattern that a single-site fairness audit can easily miss.
Pain disparity reduction (Pierson et al. 2021, Nat. Med. 27, 136-140) demonstrates the inverse — that algorithmic predictions can outperform human-graded severity in capturing real disparities, motivating better measurement, not less.
Ethical implementation (Char, Shah, Magnus 2018, NEJM 378, 981-983) sets the still-canonical framing for healthcare-ML ethics.

A one-paragraph literature pointer for the formal definitions: the equal-opportunity criterion is Hardt, Price, Srebro (NeurIPS 2016); the demographic-parity definition follows Dwork et al. (ITCS 2012); calibration-by-group follows Pleiss et al. (NeurIPS 2017). The composite weighting draws on the Good Machine Learning Practice (GMLP) guiding principles (FDA, Health Canada, MHRA, 2021; updated 2024 IMDRF GMLP) and on the EU AI Act Art. 10 prioritisation of multi-site data governance.

What it does NOT do

Not a model-training framework. Bring your own predictions.
Not a foundation-model serving stack. Embeddings are outside scope.
Not auto-detection of protected attributes. You must declare them — silent attribute inference is itself a bias risk.
Not a certification. A fairness evaluation is evidence; certification is a regulatory process this tool helps you prepare for.
Not an explainability tool. It surfaces where bias lives, not why.

Honest scientific caveats (read before quoting numbers)

Threshold sensitivity. equal_opportunity_gap and demographic_parity_gap both depend on the operating threshold used to produce y_pred. Re-run the evaluation at any threshold you would actually deploy at.
Small-sample bootstrap. Percentile bootstrap is approximate for small groups; for n < 50, treat the CIs as exploratory or compute a BCa interval separately (BCa is on the roadmap for v0.2). Groups with n < min_group_n (default 20) are excluded with a warning rather than silently producing a near-zero gap.
Prevalence confound. Inter-site bias is frequently confounded with prevalence shift. A site that sees twice the disease prevalence will show a different selection rate and PPV even under an identical model, and often a different TPR as well, because higher prevalence usually comes with a different case mix. The tool reports both per-site AUC (less prevalence-sensitive) and per-site rates; interpret them jointly.
Composite score is opinionated. The default weights (w_site=0.4, w_eo=0.3, w_dp=0.15, w_cal=0.15) reflect this author's read of regulatory priority. Override with weights= in the Python API or treat the components separately. The single number is for dashboards; the components are for decisions.
No causal inference. A measured gap does not identify the mechanism. Combine with subgroup analysis, training-set provenance audit, and (where possible) prospective evaluation.
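
The threshold-sensitivity caveat above can be checked with a small sweep. The sketch below re-derives predictions from scores at several thresholds and recomputes the equal-opportunity gap each time. It assumes `y_pred = 1` when `y_score >= threshold` (the tool's convention is not stated in this README), and the data and helper names are invented.

```python
def tpr_at(rows, threshold):
    """TPR when predicting positive for score >= threshold (assumed convention)."""
    positive_scores = [score for y_true, score in rows if y_true == 1]
    return sum(score >= threshold for score in positive_scores) / len(positive_scores)

def eo_gap_at(groups, threshold):
    """max-min TPR across groups at one operating threshold."""
    rates = [tpr_at(rows, threshold) for rows in groups.values()]
    return max(rates) - min(rates)

# Invented (y_true, y_score) pairs per site.
groups = {
    "hospital_a": [(1, 0.90), (1, 0.70), (1, 0.60), (0, 0.20)],
    "hospital_b": [(1, 0.80), (1, 0.55), (1, 0.35), (0, 0.30)],
}

sweep = {t: eo_gap_at(groups, t) for t in (0.30, 0.50, 0.75)}
for threshold, value in sweep.items():
    print(f"threshold={threshold:.2f}  equal_opportunity_gap={value:.3f}")
```

In this toy data the gap is zero at thresholds 0.30 and 0.75 but one third at 0.50, so a gap reported at a single threshold says little about the others.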

Pricing

CLI: MIT, free, forever.
Hosted "fairness CI" (Phase 2 — not yet shipped): planned at €99/month for teams that want every commit to a model repo to fire an evaluation against a frozen multi-site cohort and post the evidence pack as a CI artifact. Mailing list opens at validation green-light.
Consulting: the author is available for SaMD fairness review / AI Act Art. 10 evidence-pack design at €60-100/hour. Contact via the linked GitHub profile; introductions through the academic-DM channel are welcome.

Citing

If you use this tool in published research:

@software{pereiro2026fmmfairness,
  author       = {Pereiro, C{\'e}sar},
  title        = {{fmm-fairness-eval}: SaMD-specific fairness evaluation for foundation-model medical AI},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {<assigned-on-first-release>},
  url          = {https://github.com/<handle>/fmm-fairness-eval}
}

@thesis{pereiro2024dermfairness,
  author = {Pereiro, C{\'e}sar},
  title  = {Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset},
  type   = {Bachelor's thesis},
  school = {Universitat Polit{\`e}cnica de Val{\`e}ncia},
  year   = {2024},
  url    = {https://riunet.upv.es/handle/10251/226903}
}

Roadmap

v0.1 (this release): CLI, 4 fairness metrics (3 gap metrics plus inter-site AUC variance), composite score, AI Act manifest mode, SHA-256 audit chain.
v0.2: BCa bootstrap, sub-group intersectionality (site × sex), CSV-of-CSVs batch mode.
v0.3: HTML report option, hosted fairness-CI (Phase 2 — gated on validation pass).
v0.4: subgroup-aware threshold optimisation (opt-in, with the appropriate caveats).

License

MIT. See LICENSE.

Release history

0.2.0a11 (pre-release): May 27, 2026
0.1.0 (this version): May 19, 2026
0.0.0: May 15, 2026

Download files


Source Distribution

fmm_fairness_eval-0.1.0.tar.gz (22.4 kB view details)

Uploaded May 19, 2026 Source

Built Distribution

fmm_fairness_eval-0.1.0-py3-none-any.whl (16.3 kB view details)

Uploaded May 19, 2026 Python 3

File details

Details for the file fmm_fairness_eval-0.1.0.tar.gz.

File metadata

Download URL: fmm_fairness_eval-0.1.0.tar.gz
Upload date: May 19, 2026
Size: 22.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for fmm_fairness_eval-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `4ab21b59509f5b1b5aac7271c9dc0ae461dea1c68c1e230d066eb381e5862c1f` |
| MD5 | `2a81523cb124000548322da6c536bf8f` |
| BLAKE2b-256 | `5dd41a272ab7509b437e0f078c90d556056905ffed1d1f8daaf16c22f8f2097e` |


File details

Details for the file fmm_fairness_eval-0.1.0-py3-none-any.whl.

File metadata

Download URL: fmm_fairness_eval-0.1.0-py3-none-any.whl
Upload date: May 19, 2026
Size: 16.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for fmm_fairness_eval-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `9dfbcef4b17b5ae8d2a212d02ea36f4ef06a811c60d7b69c5a57165592727cb9` |
| MD5 | `889ebb50b7519f0c3af4a49d6dec6fca` |
| BLAKE2b-256 | `c160e075d3141586cc872e92f38b496996d63a584855e807f310d94952677050` |

