Batch-effect harmonisation for machine learning frameworks.

combatlearn

combatlearn makes the popular ComBat (and CovBat) batch-effect correction algorithms available for use in machine learning frameworks. It lets you harmonise high-dimensional data inside a scikit-learn Pipeline, so that cross-validation and grid search automatically take batch structure into account without data leakage.
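
To make the leakage point concrete, here is a minimal sketch of a scikit-learn transformer that estimates batch statistics in fit and only applies them in transform. It is not combatlearn's implementation: the class name BatchMeanCenterer is hypothetical, and plain per-batch mean-centering is used as a much simpler stand-in for ComBat. The synthetic data and the train/test split are also assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class BatchMeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a batch-correction step (not ComBat)."""

    def __init__(self, batch):
        self.batch = batch  # pd.Series of batch labels, indexed like X

    def fit(self, X, y=None):
        # Statistics are estimated only from the rows passed to fit.
        labels = self.batch.loc[X.index]
        self.grand_mean_ = X.mean(axis=0)
        self.batch_means_ = X.groupby(labels).mean()
        return self

    def transform(self, X):
        # Look up each row's batch mean, remove it, restore the grand mean.
        labels = self.batch.loc[X.index]
        offsets = self.batch_means_.loc[labels.to_numpy()].to_numpy()
        return X - offsets + self.grand_mean_


# Synthetic data: batch "B" carries an additive shift of 5 on every feature.
rng = np.random.default_rng(0)
idx = [f"s{i}" for i in range(40)]
batch = pd.Series(["A"] * 20 + ["B"] * 20, index=idx, name="batch")
X = pd.DataFrame(rng.normal(size=(40, 3)), index=idx, columns=["f1", "f2", "f3"])
X.loc[batch == "B"] += 5.0

# Fit on the training rows only, then apply to held-out rows.
train, test = X.iloc[::2], X.iloc[1::2]
centerer = BatchMeanCenterer(batch).fit(train)
train_corrected = centerer.transform(train)
test_corrected = centerer.transform(test)
```

Because the batch means live in fitted attributes, placing a step like this inside a Pipeline means each cross-validation fold re-estimates them from its own training rows, and the held-out rows never influence the correction.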

Three methods are available:

  • method="johnson" - classic ComBat (Johnson et al., 2007)
  • method="fortin" - covariate-aware ComBat (Fortin et al., 2018)
  • method="chen" - CovBat (Chen et al., 2022)

Installation

pip install combatlearn

Quick start

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from combatlearn import ComBat

df = pd.read_csv("data.csv", index_col=0)
X, y = df.drop(columns="y"), df["y"]

batch = pd.read_csv("batch.csv", index_col=0).squeeze("columns") # Series of batch labels
diag = pd.read_csv("diagnosis.csv", index_col=0) # categorical
age = pd.read_csv("age.csv", index_col=0) # continuous

pipe = Pipeline([
    ("combat", ComBat(
        batch=batch,
        discrete_covariates=diag,
        continuous_covariates=age,
        method="fortin", # or "johnson" or "chen"
        parametric=True
    )),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression())
])

param_grid = {
    "combat__mean_only": [True, False],
    "clf__C": [0.01, 0.1, 1, 10],
}

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=5,
    scoring="roc_auc",
)

grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print(f"Best CV AUROC: {grid.best_score_:.3f}")
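
The quick start assumes four CSV files that share the same sample index. As a hedged sketch for trying it out without real data, the snippet below writes synthetic files with those names. The sample IDs, site labels, column names other than "y", the shift sizes, and the output directory are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples, n_features = 120, 10
index = pd.Index([f"sample_{i:03d}" for i in range(n_samples)], name="id")

# Batch labels and covariates, all indexed by the same sample IDs.
batch = pd.Series(
    rng.choice(["site_A", "site_B", "site_C"], size=n_samples),
    index=index, name="batch",
)
diag = pd.DataFrame(
    {"diagnosis": rng.choice(["control", "patient"], size=n_samples)}, index=index
)
age = pd.DataFrame({"age": rng.uniform(20, 80, size=n_samples).round(1)}, index=index)

# Features with an additive, site-specific shift to mimic a batch effect.
features = pd.DataFrame(
    rng.normal(size=(n_samples, n_features)),
    index=index, columns=[f"feat_{j}" for j in range(n_features)],
)
site_shift = batch.map({"site_A": 0.0, "site_B": 1.5, "site_C": -1.0})
features = features.add(site_shift, axis=0)

# Binary target stored in a column named "y", as the quick start expects.
df = features.copy()
df["y"] = (features["feat_0"] + rng.normal(size=n_samples) > 0).astype(int)

# Written to a temporary directory here; point out_dir elsewhere as needed.
out_dir = Path(tempfile.mkdtemp())
df.to_csv(out_dir / "data.csv")
batch.to_csv(out_dir / "batch.csv")
diag.to_csv(out_dir / "diagnosis.csv")
age.to_csv(out_dir / "age.csv")
```

Reading these files back with index_col=0, as in the quick start, yields objects aligned on the same sample IDs.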

For a full example of how to use combatlearn, see the notebook demo.

Contributing

Pull requests, bug reports, and feature ideas are welcome: feel free to open a PR!

Acknowledgements

This project builds on the excellent work of the ComBat family of harmonisation methods. We gratefully acknowledge:

Citation

If combatlearn is useful in your research, please cite the original papers:

  • Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007 Jan;8(1):118-27. doi: 10.1093/biostatistics/kxj037

  • Fortin JP, Cullen N, Sheline YI, Taylor WD, Aselcioglu I, Cook PA, Adams P, Cooper C, Fava M, McGrath PJ, McInnis M, Phillips ML, Trivedi MH, Weissman MM, Shinohara RT. Harmonization of cortical thickness measurements across scanners and sites. Neuroimage. 2018 Feb 15;167:104-120. doi: 10.1016/j.neuroimage.2017.11.024

  • Chen AA, Beer JC, Tustison NJ, Cook PA, Shinohara RT, Shou H; Alzheimer's Disease Neuroimaging Initiative. Mitigating site effects in covariance for machine learning in neuroimaging data. Hum Brain Mapp. 2022 Mar;43(4):1179-1195. doi: 10.1002/hbm.25688
