A package to generate equity-focused cohort selection flow diagrams
equiflow
equiflow is a Python package for generating Equity-focused Cohort Selection Flow Diagrams. It facilitates transparent, reproducible documentation of cohort curation in clinical and machine learning research, helping researchers identify and quantify potential selection bias.
Features
- Cohort Flow Visualization: Generate publication-ready flow diagrams showing patient counts at each exclusion step
- Distribution Analysis: Track categorical, normal, and non-normal continuous variables through the selection process
- Demographic Drift Detection: Calculate standardized mean differences (SMDs) to quantify how exclusion criteria affect variable distributions
- Statistical Testing: Compute p-values with optional multiple testing correction (Bonferroni, Benjamini-Hochberg)
- Flexible Interfaces: Use the detailed `EquiFlow` class or the streamlined `EasyFlow` API
Installation
pip install equiflow
System Dependencies
For flow diagram generation, you need Graphviz installed:
# Ubuntu/Debian
sudo apt-get install graphviz
# macOS
brew install graphviz
# Windows
choco install graphviz
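Before plotting, it can help to confirm that the Graphviz executable is actually reachable. The following is a minimal sketch using only the Python standard library; the helper name `graphviz_available` is hypothetical and not part of equiflow.

```python
import shutil


def graphviz_available() -> bool:
    """Return True if the Graphviz `dot` executable is found on PATH."""
    # `dot` is the layout program installed by the Graphviz system package.
    return shutil.which("dot") is not None


if graphviz_available():
    print("Graphviz found at:", shutil.which("dot"))
else:
    print("Graphviz 'dot' not found on PATH; install it before generating diagrams.")
```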
Python Dependencies
- pandas
- numpy
- matplotlib
- graphviz
- scipy
Quick Start
Using EquiFlow (Full Control)
import pandas as pd
from equiflow import EquiFlow

# Initialize with your dataset (your_dataframe is a pandas DataFrame)
flow = EquiFlow(
    data=your_dataframe,
    categorical=['sex', 'race', 'insurance_type'],
    normal=['age', 'weight', 'height'],
    nonnormal=['hospital_stay_days', 'num_previous_admissions']
)
# Add exclusion steps (keep=True means KEEP the row)
flow.add_exclusion(
    keep=your_dataframe['age'] >= 18,
    exclusion_reason="Age < 18 years",
    new_cohort_label="Adult patients"
)
flow.add_exclusion(
    keep=your_dataframe['has_complete_data'] == True,
    exclusion_reason="Incomplete data",
    new_cohort_label="Complete cases"
)
# View tables
flow_table = flow.view_table_flows()
characteristics_table = flow.view_table_characteristics()
drifts_table = flow.view_table_drifts()
pvalues_table = flow.view_table_pvalues(correction="fdr_bh")
# Generate flow diagram
flow.plot_flows(
    output_file="patient_selection_flow",
    plot_dists=True,
    smds=True,
    legend=True
)
Using EasyFlow (Streamlined API)
from equiflow import EasyFlow

# Chainable API for quick analysis
# (each exclude() condition marks the rows to KEEP; the label names what is removed)
flow = (
    EasyFlow(your_dataframe, title="Initial Cohort")
    .categorize(['sex', 'race', 'insurance_type'])
    .measure_normal(['age', 'weight', 'height'])
    .measure_nonnormal(['hospital_stay_days'])
    .exclude(your_dataframe['age'] >= 18, "Age < 18 years")
    .exclude(lambda df: df['has_complete_data'] == True, "Incomplete data")
    .generate(output="patient_flow", show=True)
)
# Access results
print(flow.flow_table)
print(flow.characteristics)
print(flow.drifts)
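Both Quick Start snippets assume an existing pandas DataFrame named `your_dataframe`. For trying them out without real data, a synthetic one can be built as sketched below. The column names match the snippets above; the category levels, distributions, and sample size are invented purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the example is reproducible
n = 500  # arbitrary sample size

your_dataframe = pd.DataFrame({
    # Categorical variables (levels are illustrative, not prescribed by equiflow)
    "sex": rng.choice(["Female", "Male"], size=n),
    "race": rng.choice(["Asian", "Black", "White", "Other"], size=n),
    "insurance_type": rng.choice(["Private", "Medicare", "Medicaid"], size=n),
    # Roughly normal continuous variables
    "age": rng.normal(loc=45, scale=20, size=n).clip(0, 100).round(),
    "weight": rng.normal(loc=75, scale=15, size=n).round(1),
    "height": rng.normal(loc=170, scale=10, size=n).round(1),
    # Skewed (non-normal) variables
    "hospital_stay_days": rng.exponential(scale=5, size=n).round(),
    "num_previous_admissions": rng.poisson(lam=1.5, size=n),
    # Boolean flag used by the "Incomplete data" exclusion
    "has_complete_data": rng.random(n) > 0.1,
})

print(your_dataframe.head())
```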
Core Classes
EquiFlow
The main class for creating cohort flow diagrams with full customization:
| Method | Description |
|---|---|
| `add_exclusion(keep, exclusion_reason, new_cohort_label)` | Add an exclusion step; rows where keep=True are retained |
| `view_table_flows()` | Get cohort sizes at each step |
| `view_table_characteristics()` | Get variable distributions for each cohort |
| `view_table_drifts()` | Get SMDs between consecutive cohorts |
| `view_table_pvalues(correction)` | Get p-values with optional multiple testing correction |
| `plot_flows()` | Generate the visual flow diagram |
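The `keep` convention is easy to invert by accident: the mask is True for rows that stay, while `exclusion_reason` describes the rows that leave. The sketch below shows the same semantics in plain pandas on a toy DataFrame. It illustrates the convention only and is not equiflow's internal implementation.

```python
import pandas as pd

cohort = pd.DataFrame({
    "age": [12, 25, 40, 16, 67],
    "has_complete_data": [True, True, False, True, True],
})

# Step 1: the mask is True for rows to KEEP, so the reason names its negation.
keep = cohort["age"] >= 18
adults = cohort[keep]
n_excluded = int((~keep).sum())
print(f"Kept {len(adults)}, excluded {n_excluded} (Age < 18 years)")

# Step 2: the next mask is evaluated on the cohort that survived step 1.
keep_complete = adults["has_complete_data"]
complete_cases = adults[keep_complete]
n_incomplete = int((~keep_complete).sum())
print(f"Kept {len(complete_cases)}, excluded {n_incomplete} (Incomplete data)")
```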
EasyFlow
A simplified, chainable interface for rapid analysis:
| Method | Description |
|---|---|
| `categorize(variables)` | Set categorical variables |
| `measure_normal(variables)` | Set normally-distributed continuous variables |
| `measure_nonnormal(variables)` | Set non-normal continuous variables |
| `exclude(condition, label)` | Add exclusion step; rows where condition=True are kept |
| `generate(output, show)` | Create the flow diagram |
Distribution Analysis
EquiFlow supports three variable types:
| Type | Display Format | Example |
|---|---|---|
| Categorical | N (%), %, or N | Sex: Male 52.3% |
| Normal | Mean ± SD | Age: 45.2 ± 12.3 |
| Non-normal | Median [IQR] | LOS: 4.0 [2.0, 8.0] |
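To make the three display formats concrete, the sketch below computes each summary with plain pandas on a small toy dataset. The data and variable names are invented, and equiflow's own rounding and quantile conventions may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male", "Female"],
    "age": [34.0, 51.0, 47.0, 29.0, 62.0],
    "los": [2.0, 4.0, 3.0, 10.0, 6.0],  # length of stay in days
})

# Categorical: N (%) per level
counts = df["sex"].value_counts()
pcts = df["sex"].value_counts(normalize=True) * 100
for level in counts.index:
    print(f"Sex: {level} {counts[level]} ({pcts[level]:.1f}%)")

# Normal: mean ± sample standard deviation
age_summary = f"{df['age'].mean():.1f} ± {df['age'].std():.1f}"
print("Age:", age_summary)

# Non-normal: median [first quartile, third quartile]
q1, med, q3 = df["los"].quantile([0.25, 0.5, 0.75])
los_summary = f"{med:.1f} [{q1:.1f}, {q3:.1f}]"
print("LOS:", los_summary)
```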
Standardized Mean Differences (SMDs)
SMDs quantify distribution changes between consecutive cohorts:
- Categorical variables: Cohen's h with Hedges' correction
- Continuous variables: Cohen's d with Hedges' correction
- Interpretation: an absolute SMD above 0.1 suggests meaningful drift; above 0.2 indicates substantial change
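The effect sizes named above can be sketched with the standard textbook formulas: Cohen's d uses the pooled standard deviation, Cohen's h uses the arcsine-square-root transform of proportions, and the Hedges small-sample factor is approximated as 1 - 3 / (4(n1 + n2) - 9). The function names here are hypothetical, and this is an illustration under those assumptions rather than equiflow's actual code.

```python
import math
from statistics import mean, variance


def hedges_factor(n1: int, n2: int) -> float:
    """Approximate small-sample correction factor for standardized differences."""
    return 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)


def smd_continuous(before, after) -> float:
    """Cohen's d (pooled SD) between two cohorts, with Hedges' correction."""
    n1, n2 = len(before), len(after)
    pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
    d = (mean(after) - mean(before)) / math.sqrt(pooled_var)
    return d * hedges_factor(n1, n2)


def smd_proportion(p_before: float, p_after: float, n1: int, n2: int) -> float:
    """Cohen's h between two proportions, with the same correction applied."""
    h = 2 * math.asin(math.sqrt(p_after)) - 2 * math.asin(math.sqrt(p_before))
    return h * hedges_factor(n1, n2)


# Toy example: ages before and after excluding patients under 18.
ages_before = [12, 25, 40, 16, 67]
ages_after = [25, 40, 67]
print("Age SMD:", round(smd_continuous(ages_before, ages_after), 3))

# Toy example: share of one category drops from 50% to 40% across a step.
print("Category SMD:", round(smd_proportion(0.50, 0.40, 200, 150), 3))
```

A positive value means the variable's mean (or proportion) increased after the exclusion step; the thresholds in the list above are applied to the magnitude.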
Statistical Testing
The view_table_pvalues() method supports:
- Categorical variables: Chi-square test (Fisher's exact for 2×2 tables)
- Normal continuous: Welch's t-test
- Non-normal continuous: Kruskal-Wallis test
- Missingness: Two-proportion z-test
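Each test listed above has a direct counterpart in `scipy.stats`, except the two-proportion z-test, which is short enough to write by hand. The sketch below runs them on made-up counts and simulated values; it shows the tests themselves and does not claim to mirror how `view_table_pvalues()` calls them. The helper `two_proportion_ztest` is hypothetical.

```python
import math

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Categorical: chi-square test on a cohort-by-category contingency table.
table = np.array([[120, 90, 40],
                  [100, 85, 20]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Categorical, 2x2 table: Fisher's exact test.
table_2x2 = np.array([[30, 70],
                      [18, 62]])
odds_ratio, p_fisher = stats.fisher_exact(table_2x2)

# Normal continuous: Welch's t-test (unequal variances).
age_before = rng.normal(45, 12, size=300)
age_after = rng.normal(47, 11, size=250)
t_stat, p_welch = stats.ttest_ind(age_before, age_after, equal_var=False)

# Non-normal continuous: Kruskal-Wallis test.
los_before = rng.exponential(5, size=300)
los_after = rng.exponential(4, size=250)
h_stat, p_kw = stats.kruskal(los_before, los_after)


def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test comparing proportions x1/n1 and x2/n2 (pooled SE)."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return z, 2 * stats.norm.sf(abs(z))


# Missingness: 30 of 300 values missing before vs. 10 of 250 after.
z_stat, p_missing = two_proportion_ztest(30, 300, 10, 250)

for name, p in [("chi-square", p_chi2), ("Fisher", p_fisher), ("Welch", p_welch),
                ("Kruskal-Wallis", p_kw), ("missingness z", p_missing)]:
    print(f"{name}: p = {p:.4f}")
```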
Multiple testing correction options:
| Option | Description |
|---|---|
| `"none"` | No correction (default) |
| `"bonferroni"` | Bonferroni correction (controls FWER) |
| `"fdr_bh"` | Benjamini-Hochberg procedure (controls FDR) |
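To show what the two corrections do to a set of p-values, here is a small pure-Python sketch of the standard Bonferroni and Benjamini-Hochberg adjustments. The function names are hypothetical; equiflow's own implementation may be organized differently.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(p * m, 1.0) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and keeping
    # the running minimum so adjusted values stay monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:", [round(p, 4) for p in bonferroni(raw)])
print("BH (FDR):  ", [round(p, 4) for p in benjamini_hochberg(raw)])
```

Bonferroni is the more conservative of the two, so its adjusted values are never smaller than the Benjamini-Hochberg ones for the same input.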
Benefits for Research Equity
EquiFlow helps researchers:
- Make cohort selection decisions transparent and reproducible
- Identify when exclusion criteria disproportionately affect certain groups
- Quantify demographic drift at each selection step
- Document cohort curation in a standardized format
- Comply with equity-focused reporting guidelines
Motivation
Selection bias can arise at many stages of a study, including recruitment, inclusion/exclusion criteria, input-level exclusion, and outcome-level exclusion. It often reflects the underrepresentation of populations historically disadvantaged in medical research. The effects of selection bias can be further amplified when non-representative samples are used in artificial intelligence (AI) and machine learning (ML) applications to construct clinical algorithms.
Building on the "Data Cards" initiative for transparency in AI research, we advocate for the addition of a participant flow diagram for AI studies detailing relevant sociodemographic and clinical characteristics of excluded participants across study phases, with the goal of identifying potential algorithmic biases before clinical implementation.
Citation
If you use EquiFlow in your research, please cite our position paper:
Ellen JG, Matos J, Viola M, et al. Participant flow diagrams for health equity in AI. J Biomed Inform. 2024;152:104631. https://doi.org/10.1016/j.jbi.2024.104631
@article{ellen2024participant,
title={Participant flow diagrams for health equity in AI},
author={Ellen, Jacob G and Matos, Jo{\~a}o and Viola, Matteo and others},
journal={Journal of Biomedical Informatics},
volume={152},
pages={104631},
year={2024},
publisher={Elsevier},
doi={10.1016/j.jbi.2024.104631}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to:
- Set up a development environment
- Run tests
- Submit pull requests
- Report issues