A package to generate equity-focused cohort selection flow diagrams

equiflow

License: MIT | Python 3.9+

equiflow is a Python package for generating Equity-focused Cohort Selection Flow Diagrams. It facilitates transparent, reproducible documentation of cohort curation in clinical and machine learning research, helping researchers identify and quantify potential selection bias.

Features

  • Cohort Flow Visualization: Generate publication-ready flow diagrams showing patient counts at each exclusion step
  • Distribution Analysis: Track categorical, normal, and non-normal continuous variables through the selection process
  • Demographic Drift Detection: Calculate standardized mean differences (SMDs) to quantify how exclusion criteria affect variable distributions
  • Statistical Testing: Compute p-values with optional multiple testing correction (Bonferroni, Benjamini-Hochberg)
  • Flexible Interfaces: Use the detailed EquiFlow class or the streamlined EasyFlow API

Installation

pip install equiflow

System Dependencies

For flow diagram generation, you need Graphviz installed:

# Ubuntu/Debian
sudo apt-get install graphviz

# macOS
brew install graphviz

# Windows
choco install graphviz
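
Before generating a diagram, it can help to confirm that the Graphviz binaries are actually on your PATH. The snippet below is a minimal sketch that uses only the Python standard library; the helper name graphviz_available is illustrative and is not part of equiflow.

```python
import shutil


def graphviz_available(executable: str = "dot") -> bool:
    """Return True if the given Graphviz executable can be found on PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(executable) is not None


if graphviz_available():
    print("Graphviz 'dot' found on PATH")
else:
    print("Graphviz 'dot' not found; install it with one of the commands above")
```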

Python Dependencies

  • pandas
  • numpy
  • matplotlib
  • graphviz
  • scipy

Quick Start

Using EquiFlow (Full Control)

import pandas as pd
from equiflow import EquiFlow

# Initialize with your dataset (your_dataframe is a pandas DataFrame you supply)
flow = EquiFlow(
    data=your_dataframe,
    categorical=['sex', 'race', 'insurance_type'],
    normal=['age', 'weight', 'height'],
    nonnormal=['hospital_stay_days', 'num_previous_admissions']
)

# Add exclusion steps (keep=True means KEEP the row)
flow.add_exclusion(
    keep=your_dataframe['age'] >= 18,
    exclusion_reason="Age < 18 years",
    new_cohort_label="Adult patients"
)

flow.add_exclusion(
    keep=your_dataframe['has_complete_data'] == True,
    exclusion_reason="Incomplete data",
    new_cohort_label="Complete cases"
)

# View tables
flow_table = flow.view_table_flows()
characteristics_table = flow.view_table_characteristics()
drifts_table = flow.view_table_drifts()
pvalues_table = flow.view_table_pvalues(correction="fdr_bh")

# Generate flow diagram
flow.plot_flows(
    output_file="patient_selection_flow",
    plot_dists=True,
    smds=True,
    legend=True
)

Using EasyFlow (Streamlined API)

from equiflow import EasyFlow

# Chainable API for quick analysis
flow = (
    EasyFlow(your_dataframe, title="Initial Cohort")
    .categorize(['sex', 'race', 'insurance_type'])
    .measure_normal(['age', 'weight', 'height'])
    .measure_nonnormal(['hospital_stay_days'])
    .exclude(your_dataframe['age'] >= 18, "Age < 18 years")
    .exclude(lambda df: df['has_complete_data'] == True, "Incomplete data")
    .generate(output="patient_flow", show=True)
)

# Access results
print(flow.flow_table)
print(flow.characteristics)
print(flow.drifts)

Core Classes

EquiFlow

The main class for creating cohort flow diagrams with full customization:

Method                                                   Description
add_exclusion(keep, exclusion_reason, new_cohort_label)  Add an exclusion step; rows where keep=True are retained
view_table_flows()                                       Get cohort sizes at each step
view_table_characteristics()                             Get variable distributions for each cohort
view_table_drifts()                                      Get SMDs between consecutive cohorts
view_table_pvalues(correction)                           Get p-values with optional multiple testing correction
plot_flows()                                             Generate the visual flow diagram
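
To make the keep semantics concrete, here is a minimal plain-Python sketch of the bookkeeping behind sequential exclusion steps. It does not use equiflow and is not its implementation; the function apply_exclusions, the tuple layout, and the toy records are assumptions made for illustration only.

```python
def apply_exclusions(rows, steps):
    """Apply exclusion steps in order and record the cohort size at each one.

    rows  : list of dict records (one per participant)
    steps : list of (exclusion_reason, keep_fn) pairs; keep_fn returns True
            for rows that should be KEPT, mirroring the keep convention above
    Returns a list of (exclusion_reason, n_before, n_removed, n_after) tuples.
    """
    table = []
    current = list(rows)
    for reason, keep in steps:
        kept = [row for row in current if keep(row)]
        table.append((reason, len(current), len(current) - len(kept), len(kept)))
        current = kept  # the next step starts from the retained cohort
    return table


# Toy data, invented for this example
patients = [
    {"age": 15, "has_complete_data": True},
    {"age": 34, "has_complete_data": True},
    {"age": 52, "has_complete_data": False},
    {"age": 71, "has_complete_data": True},
]
steps = [
    ("Age < 18 years", lambda row: row["age"] >= 18),
    ("Incomplete data", lambda row: row["has_complete_data"]),
]
flow_rows = apply_exclusions(patients, steps)
for reason, n_before, n_removed, n_after in flow_rows:
    print(f"{reason}: {n_before} -> {n_after} (removed {n_removed})")
```

Note that the exclusion reason describes the rows that are removed, while the condition describes the rows that are kept.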

EasyFlow

A simplified, chainable interface for rapid analysis:

Method                        Description
categorize(variables)         Set categorical variables
measure_normal(variables)     Set normally-distributed continuous variables
measure_nonnormal(variables)  Set non-normal continuous variables
exclude(condition, label)     Add exclusion step; rows where condition=True are kept
generate(output, show)        Create the flow diagram

Distribution Analysis

EquiFlow supports three variable types:

Type         Display Format   Example
Categorical  N (%), %, or N   Sex: Male 52.3%
Normal       Mean ± SD        Age: 45.2 ± 12.3
Non-normal   Median [IQR]     LOS: 4.0 [2.0, 8.0]
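
As a rough illustration of the three display formats, the sketch below computes each summary with the standard library. The function names are hypothetical, and the quartile method (inclusive) is an assumption; equiflow may compute and round these values differently.

```python
import statistics


def fmt_categorical(values, level):
    """Format one level of a categorical variable as 'N (%)'."""
    n = sum(1 for value in values if value == level)
    return f"{n} ({100 * n / len(values):.1f}%)"


def fmt_normal(values):
    """Format a normally distributed variable as 'mean ± SD' (sample SD)."""
    return f"{statistics.mean(values):.1f} ± {statistics.stdev(values):.1f}"


def fmt_nonnormal(values):
    """Format a non-normal variable as 'median [Q1, Q3]'."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return f"{median:.1f} [{q1:.1f}, {q3:.1f}]"


# Toy values, invented for this example
sex = ["Male", "Female", "Male", "Male"]
age = [40, 50, 60]
los = [1, 2, 3, 4, 5]

print("Sex: Male", fmt_categorical(sex, "Male"))
print("Age:", fmt_normal(age))
print("LOS:", fmt_nonnormal(los))
```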

Standardized Mean Differences (SMDs)

SMDs quantify distribution changes between consecutive cohorts:

  • Categorical variables: Cohen's h with Hedges' correction
  • Continuous variables: Cohen's d with Hedges' correction
  • Interpretation: an absolute SMD above 0.1 suggests meaningful drift; an absolute SMD above 0.2 indicates substantial change
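
For readers who want to see the arithmetic, the following sketch implements the textbook forms of these effect sizes with the standard library. It is an assumption that equiflow uses exactly these formulas: the function names are illustrative, the small-sample factor is the common approximation 1 - 3 / (4(n1 + n2) - 9), and Cohen's h is shown without a correction because the text above does not say how the correction is applied to it.

```python
import math
import statistics


def hedges_g(x, y):
    """Cohen's d between two samples, scaled by Hedges' small-sample factor."""
    n1, n2 = len(x), len(y)
    # Pooled variance from the two sample variances
    pooled_var = (
        (n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)
    ) / (n1 + n2 - 2)
    d = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Toy values, invented for this example
before = [2, 4, 6]
after = [1, 3, 5]
print(f"Hedges' g: {hedges_g(before, after):.3f}")
print(f"Cohen's h: {cohens_h(0.52, 0.47):.3f}")
```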

Statistical Testing

The view_table_pvalues() method supports:

  • Categorical variables: Chi-square test (Fisher's exact for 2×2 tables)
  • Normal continuous: Welch's t-test
  • Non-normal continuous: Kruskal-Wallis test
  • Missingness: Two-proportion z-test
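
As an example of one of these tests, here is a minimal standard-library sketch of a two-sided, pooled two-proportion z-test, applied to the share of missing values in two cohorts. The function name and the choice of the pooled standard error are assumptions for illustration; equiflow's own implementation may differ.

```python
import math


def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided z-test for the difference between two proportions.

    k1, k2 : number of missing values in each cohort
    n1, n2 : cohort sizes
    Returns (z, p_value).
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both proportions are 0 or both are 1: no detectable difference
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Toy counts, invented for this example
z, p = two_proportion_z_test(30, 100, 10, 100)
print(f"z = {z:.3f}, p = {p:.5f}")
```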

Multiple testing correction options:

Option        Description
"none"        No correction (default)
"bonferroni"  Bonferroni correction (controls FWER)
"fdr_bh"      Benjamini-Hochberg procedure (controls FDR)

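To show what the two corrections do to a set of p-values, the sketch below implements the standard Bonferroni and Benjamini-Hochberg adjustments in plain Python. The function names are illustrative and this is not equiflow's code; it is only meant to make the difference between the options tangible.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values, invented for this example
raw = [0.01, 0.04, 0.03, 0.5]
print("Bonferroni:", [round(p, 4) for p in bonferroni(raw)])
print("Benjamini-Hochberg:", [round(p, 4) for p in benjamini_hochberg(raw)])
```

Bonferroni is the more conservative of the two, so its adjusted values are never smaller than the Benjamini-Hochberg ones.
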
Benefits for Research Equity

EquiFlow helps researchers:

  • Make cohort selection decisions transparent and reproducible
  • Identify when exclusion criteria disproportionately affect certain groups
  • Quantify demographic drift at each selection step
  • Document cohort curation in a standardized format
  • Comply with equity-focused reporting guidelines

Motivation

Selection bias can enter a study at many points, including recruitment, inclusion and exclusion criteria, input-level exclusion, and outcome-level exclusion, and it often reflects the underrepresentation of populations historically disadvantaged in medical research. Its effects can be amplified further when non-representative samples are used to construct clinical algorithms in artificial intelligence (AI) and machine learning (ML) applications.

Building on the "Data Cards" initiative for transparency in AI research, we advocate adding a participant flow diagram to AI studies that details the relevant sociodemographic and clinical characteristics of excluded participants across study phases, with the goal of identifying potential algorithmic biases before clinical implementation.

Citation

If you use EquiFlow in your research, please cite our position paper:

Ellen JG, Matos J, Viola M, et al. Participant flow diagrams for health equity in AI. J Biomed Inform. 2024;152:104631. https://doi.org/10.1016/j.jbi.2024.104631

@article{ellen2024participant,
  title={Participant flow diagrams for health equity in AI},
  author={Ellen, Jacob G and Matos, Jo{\~a}o and Viola, Matteo and others},
  journal={Journal of Biomedical Informatics},
  volume={152},
  pages={104631},
  year={2024},
  publisher={Elsevier},
  doi={10.1016/j.jbi.2024.104631}
}

License

This project is licensed under the MIT License; see the LICENSE file for details.

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to:

  • Set up a development environment
  • Run tests
  • Submit pull requests
  • Report issues

Related Tools

  • tableone - Summary statistics for patient populations
  • aequitas - Bias and fairness audit toolkit
