Consistent Reference External Batch Harmonization for neuroimaging data
Project description
CREB: Consistent Reference External Batch Harmonization
CREB harmonizes data with empirical Bayes methods similar to ComBat. Unlike ComBat, CREB can harmonize training and test data independently, which prevents data leakage. CREB has two main functions, crebLearn and crebApply: the first learns 'site' priors and the second applies them to new, unseen data. The model can easily be deployed in machine learning pipelines to harmonize data before making predictions on test sets. The package corrects site effects in data while preserving biologically relevant covariate effects.
Citations
Cite the CREB manuscript: Kharade A., Pan Y., Andreescu C., Karim H., CREB: Consistent Reference External Batch Harmonization. Under review.
Contacts:
- Helmet Karim
- Yiyan Pan
- Ameya Kharade
Table of Contents
- Overview
- Installation
- Quick Start
- Working with Parquet Files
- API Reference
- How It Works
- Usage Tips
- Troubleshooting
- Contributing
Overview
This package provides a simple, memory-efficient way to correct for site effects in data using empirical Bayes methods. The core algorithm first builds a harmonization bundle from training data, then applies the learned corrections to new datasets. The package allows harmonization of completely unseen sites that were not present in the original training data.
Installation
Option 1: Install from PyPI
pip install creb
Option 2: Install from Source with uv
We recommend using uv for fast and reliable dependency management:
# Clone the repository
git clone https://github.com/tetra-tools/CREB.git
cd CREB
# Create virtual environment and install dependencies
uv venv
uv sync
Quick Start
The package requires two inputs:
- Data matrix: A pandas DataFrame containing features to harmonize (e.g., brain volumes, connectivity measures)
- Covariates matrix: A pandas DataFrame containing covariates with a required "SITE" column
Data Matrix Example
Your data should be a numeric matrix where rows are subjects and columns are features:
feature_1 feature_2 feature_3 ... feature_103741
0 3138.0 3164.2 206.4 ... 1847.3
1 1708.4 2351.2 364.0 ... 1942.1
... ... ... ... ... ...
Covariates Matrix Example
The covariates DataFrame must contain a "SITE" column and any other numeric covariates:
SITE AGE SEX_M
0 SITE_A 76.5 1
1 SITE_B 80.1 1
2 SITE_A 82.9 0
... ... ... ...
Important notes:
- Both matrices must have the same number of rows (subjects)
- All covariates must be numeric (handle categorical variables with pandas.get_dummies beforehand)
- No missing values are allowed - perform complete case analysis first
- The order of subjects must be identical in both matrices
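As a minimal illustration of preparing the covariates matrix (the column names below are illustrative, not required by the package), a categorical SEX column can be one-hot encoded with pandas.get_dummies while SITE stays a string column:

```python
import pandas as pd

# Illustrative raw covariates with a categorical SEX column
raw = pd.DataFrame({
    "SITE": ["SITE_A", "SITE_B", "SITE_A"],
    "AGE": [76.5, 80.1, 82.9],
    "SEX": ["M", "M", "F"],
})

# One-hot encode categorical covariates; drop_first avoids collinearity,
# and SITE is left untouched for the harmonizer
covars = pd.get_dummies(raw, columns=["SEX"], drop_first=True, dtype=int)
print(list(covars.columns))  # ['SITE', 'AGE', 'SEX_M']
```

After this step all columns except SITE are numeric, matching the requirements above.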
Basic Usage
import pandas as pd
import creb.creb as cr
# Load your data
data = pd.read_csv('brain_features.csv')
covars = pd.read_csv('subject_covariates.csv')
# Create harmonization bundle from training data
bundle = cr.crebLearn(
    covars=covars,
    data=data,
    output_path="harmonization_bundle.pkl",
    verbose=True
)
# Load bundle info
print(cr.getBundleInfo(cr.loadBundle("harmonization_bundle.pkl")))
# Harmonize new data using the bundle
harmonized_data = cr.crebApply(
    covars=covars,
    data=data,
    bundle_path="harmonization_bundle.pkl",
    method="joint",  # or "iterative"
    verbose=True,
)
Using Pre-uploaded Synthetic Bundle
For quick testing and prototyping, you can use our pre-uploaded bundle that was trained on 9 diverse neuroimaging sites:
import pandas as pd
import creb.creb as cr
# Load your data
data = pd.read_csv('your_brain_features.csv')
covars = pd.read_csv('your_subject_covariates.csv')
harmonized_data = cr.crebApply(
    covars=covars,
    data=data,
    bundle_path="pretrained_bundle_9_sites.pkl",
    method="joint"
)
# Note: The pre-uploaded bundle will be regularly updated and expanded
Working with Parquet Files
For large datasets, the package supports Parquet format:
import pandas as pd
import creb.creb as cr
# Load data from Parquet
data = pd.read_parquet('brain_features.parquet')
covars = pd.read_parquet('subject_covariates.parquet')
# Create bundle
bundle = cr.crebLearn(covars=covars, data=data, output_path="bundle.pkl")
# Harmonize and save as Parquet
harmonized = cr.crebApply(
    covars=covars,
    data=data,
    bundle_path="bundle.pkl",
    output_path="harmonized_data.parquet"
)
API Reference
crebLearn
Creates a harmonization bundle from training data.
def crebLearn(covars: pd.DataFrame,
              data: pd.DataFrame,
              include_site_dummies: bool = False,
              output_path: Optional[str] = None) -> Dict[str, Any]
Parameters:
- covars: DataFrame with covariates (must contain a 'SITE' column)
- data: DataFrame with feature data (same number of rows as covars)
- output_path: Optional path to save the bundle as a pickle
Returns:
- Dictionary containing the harmonization bundle with learned parameters
crebApply
Harmonizes new data using a pre-trained bundle.
def crebApply(covars: pd.DataFrame,
              data: pd.DataFrame,
              bundle_path: str,
              method: str = "joint",
              output_path: Optional[str] = None,
              verbose: bool = False,
              makeplot: bool = False,
              log_level: Optional[str] = None) -> pd.DataFrame
Parameters:
- covars: DataFrame with covariates (must contain a 'SITE' column)
- data: DataFrame with feature data (same number of rows as covars)
- bundle_path: Path to the harmonization bundle
- method: "joint" (default) or "iterative" posterior updates
- output_path: Optional path to save harmonized data
- verbose: Optional flag for verbose output printing
- makeplot: Optional flag to plot the distribution of site-effect mean and variance before and after the posterior update
- log_level: Optional logging level
Returns:
- DataFrame with harmonized data
loadBundle
Loads a harmonization bundle from file.
def loadBundle(bundle_path: str, verbose: bool = False) -> Dict[str, Any]
getBundleInfo
Gets summary information about a bundle.
def getBundleInfo(bundle: Dict[str, Any]) -> Dict[str, Any]
How It Works
1. Bundle Creation (crebLearn)
- Covariate regression: Fit the linear model Y = X * B + R, where X includes an intercept plus covariates; normalize R with the pooled variance
- Site effect aggregation: Compute site-level summary statistics (means, SST) from the residuals R
- Prior estimation: Learn Empirical Bayes priors from site statistics
- Bundle creation: Save all parameters needed for harmonization
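The covariate-regression step above can be sketched with NumPy. This is a minimal illustration under the stated model (ordinary least squares plus pooled-variance normalization), not the package's internal code:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))                            # 6 subjects x 2 features
X = np.column_stack([np.ones(6), rng.normal(size=6)])  # intercept + one covariate

# Fit Y = X @ B + R by ordinary least squares
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
R = Y - X @ B

# Normalize residuals by a pooled per-feature standard deviation
pooled_sd = R.std(axis=0, ddof=1)
Z = R / pooled_sd
```

Site-level statistics (means, SST) are then computed on the normalized residuals Z rather than on the raw features.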
2. Harmonization (crebApply)
- Residual computation: Compute residuals R = Y - X * B using the learned coefficients from the bundle; normalize with the pooled variance from the training bundle
- Site statistics: Compute per-site means and SST from the residuals
- Posterior updates: Apply empirical Bayes update with learned priors
- Reconstruction: Multiply by pooled variance, add biological covariate effects back
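To give intuition for the empirical Bayes posterior update, here is a toy normal-normal location update (an illustrative sketch with assumed hyperparameters, not the package's exact estimator): the observed site mean is shrunk toward the learned prior mean, with the amount of shrinkage controlled by site size and precision.

```python
def eb_site_mean(obs_mean, n, resid_var, prior_mean, prior_var):
    """Posterior mean under a normal-normal model: a precision-weighted
    average of the observed site mean and the prior mean."""
    w = (n / resid_var) / (n / resid_var + 1.0 / prior_var)
    return w * obs_mean + (1.0 - w) * prior_mean

# A small site (n=5) is pulled strongly toward the prior mean (0.0)...
small = eb_site_mean(obs_mean=0.8, n=5, resid_var=1.0, prior_mean=0.0, prior_var=0.1)
# ...while a large site (n=500) stays close to its observed mean
large = eb_site_mean(obs_mean=0.8, n=500, resid_var=1.0, prior_mean=0.0, prior_var=0.1)
```

This is why completely unseen sites with few subjects can still be harmonized sensibly: their noisy estimates are regularized toward the priors learned at training time.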
Correction Methods
- "joint": Uses group-wise priors for simultaneous location/scale correction
- "iterative": Alternately holds the mean or the variance of the residual distribution fixed while updating the other, repeating until convergence
Usage Tips
Typical Workflow:
To use CREB, users create a harmonization bundle from their training data. They can then upload or share only the trained bundle (not the training data). Anyone with access to the bundle can apply harmonization to new datasets at runtime.
import creb.creb as cr
harmonized_data = cr.crebApply(
    covars=new_data_covars,
    data=new_data_features,
    bundle_path='harmonization_bundle.pkl'
)
This approach enables:
- Privacy compliance: Training data stays secure while harmonization capabilities are deployed
- Reproducibility: Consistent harmonization parameters across deployments
Handling Missing Values
The package requires complete data. Handle missing values before harmonization:
# Remove subjects with missing covariates
mask = covars.notna().all(axis=1) & data.notna().all(axis=1)
covars_clean = covars[mask].copy()
data_clean = data[mask].copy()
Custom Feature Selection
# Select specific feature types
feature_cols = [col for col in data.columns if col.startswith('connectivity_')]
data_selected = data[feature_cols]
Multiple External Datasets
# Harmonize multiple external datasets with same bundle
external_datasets = ['camcan', 'aging', 'hcp']
for name in external_datasets:
    ext_data = pd.read_parquet(f'external_{name}.parquet')
    ext_covars = pd.read_parquet(f'covars_{name}.parquet')
    harmonized = cr.crebApply(
        covars=ext_covars,
        data=ext_data,
        bundle_path="bundle.pkl",
        output_path=f'harm_{name}.parquet'
    )
Troubleshooting
Common Issues
"SITE column not found"
- Ensure your covariates DataFrame contains a column named exactly "SITE"
"Missing required covariates"
- Make sure all covariates present in training are also in new data
- Check for exact column name matches (case-sensitive)
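Both issues can be caught before calling crebApply with a small pre-flight check. The helper below is a hypothetical sketch, not part of the package; `required_covars` stands in for the covariate names stored in the trained bundle:

```python
import pandas as pd

def check_inputs(covars, data, required_covars=None):
    """Hypothetical helper: validate inputs before calling crebApply."""
    if "SITE" not in covars.columns:
        raise ValueError("covariates must contain a 'SITE' column")
    if len(covars) != len(data):
        raise ValueError("covars and data must have the same number of rows")
    missing = set(required_covars or []) - set(covars.columns)
    if missing:
        raise ValueError(f"missing required covariates: {sorted(missing)}")

covars = pd.DataFrame({"SITE": ["A", "B"], "AGE": [70.0, 65.0]})
data = pd.DataFrame({"f1": [1.0, 2.0]})
check_inputs(covars, data, required_covars=["AGE"])  # passes silently
```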
Contributing
Contributions are welcome! Please submit pull requests to the GitHub repository.
File details
Details for the file creb-0.1.0.tar.gz.
File metadata
- Download URL: creb-0.1.0.tar.gz
- Size: 28.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc4cf5a8709053ec681998f44888a843826998fced1bb6347842d2d74ca24f21 |
| MD5 | c53945fde73904b0cf59e4b61ef3fa4b |
| BLAKE2b-256 | bd65d2b824189d58c8b229a180f97086f0f0a76a60fdfdb79f8b9833c9aefc1f |
File details
Details for the file creb-0.1.0-py3-none-any.whl.
File metadata
- Download URL: creb-0.1.0-py3-none-any.whl
- Size: 26.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 130a125ecd5084a2bcdb3e3044c2bc5e9e1549e01c04bc1dfbcab909263529a2 |
| MD5 | a8f8227b1cd6feca489d7e61b5fd73ea |
| BLAKE2b-256 | 60781a754e60eeebb510277c5956add6ed7fa4e9c92deb97e5d959a0a36d26d5 |