Multimodal Epigenetic Sequencing Analysis (MESA) is a flexible and sensitive method of capturing and integrating multimodal epigenetic information of cfDNA using a single experimental assay.

Project description

Multimodal Epigenetic Sequencing Analysis (MESA)

A flexible and sensitive method for capturing and integrating multimodal epigenetic information from cell-free DNA (cfDNA) using a single experimental assay.

See demo.ipynb for a worked example.

Overview

MESA (Multimodal Epigenetic Sequencing Analysis) provides a comprehensive framework for analyzing multimodal epigenetic data from cfDNA. The package features a sklearn-compatible API that seamlessly integrates preprocessing, scaling, feature selection, model training, and cross-validation workflows.

Key Features

  • Multimodal Integration: Combine multiple epigenetic data modalities using ensemble stacking
  • Advanced Feature Selection: Boruta algorithm combined with univariate selection to balance computation time against biomarker discovery
  • Robust Cross-Validation: Built-in evaluation framework with performance metrics for easy fine-tuning
  • Flexible Pipeline: Customizable preprocessing and classification components
  • Missing Value Handling: Intelligent filtering and imputation strategies
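
The shadow-feature idea behind Boruta can be illustrated without any dependencies. The sketch below is not MESA's or Boruta's actual implementation: it swaps random-forest importances for absolute Pearson correlation with the label and runs a single round instead of repeated statistical tests. All function names are hypothetical.

```python
import random
from statistics import mean


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_select(columns, y, seed=0):
    """Keep columns whose score beats the best shuffled ("shadow") column."""
    rng = random.Random(seed)
    shadows = []
    for col in columns:
        shuffled = list(col)
        rng.shuffle(shuffled)  # shuffling destroys any real link to y
        shadows.append(shuffled)
    threshold = max(abs_corr(s, y) for s in shadows)
    return [i for i, col in enumerate(columns) if abs_corr(col, y) > threshold]


# Toy data: 20 samples; feature 0 tracks the label, feature 1 is constant.
y = [0] * 10 + [1] * 10
informative = [float(label) for label in y]
constant = [1.0] * 20
selected = shadow_select([informative, constant], y)
print(selected)
```

A feature is kept only if it scores higher than every shuffled copy, which is why a cheap univariate pre-filter in front of this step reduces the number of columns that need shadows.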

Installation

# Install package with pip
pip install mesa-cfdna
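
One way to confirm that the install succeeded is to ask Python's packaging metadata for the distribution's version. This is a generic standard-library check, not something the MESA project prescribes; the helper name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if mesa-cfdna is installed, otherwise None.
print(installed_version("mesa-cfdna"))
```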

Quick Start

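from mesa import MESA_modality, MESA, MESA_CV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Load your data. load_data() is a placeholder for your own loading code;
# X_test (used below) is a held-out matrix with the same columns as X_train.
X_train, y_train = load_data()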

# Single modality analysis
modality_1 = MESA_modality(top_n=50, classifier=RandomForestClassifier(random_state=0), variance_threshold=0, normalization=True)
modality_1.fit(X_train, y_train)
predictions = modality_1.transform_predict_proba(X_test)

modality_2 = MESA_modality(top_n=100, classifier=LogisticRegression(random_state=0), variance_threshold=0, normalization=False, missing=0)
modality_2.fit(X_train, y_train)
predictions = modality_2.transform_predict_proba(X_test)

# Multi-modality ensemble (X1_* and X2_* are placeholders: one feature matrix per modality)
modalities = [modality_1, modality_2]
mesa = MESA(modalities)
mesa.fit([X1_train, X2_train], y_train)
mesa_predictions = mesa.predict_proba([X1_test, X2_test])

API Reference

MESA_modality

Single modality analysis with comprehensive preprocessing and feature selection pipeline.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| top_n | int | 100 | Number of features to select using Boruta algorithm |
| variance_threshold | float | 0 | Minimum variance threshold for feature filtering |
| normalization | bool | False | Whether to apply L2 normalization |
| missing | float | 0.1 | Maximum proportion of missing values allowed per feature |
| classifier | estimator | RandomForestClassifier() | Final classifier for predictions |
| selector | int/estimator | GenericUnivariateSelect() | Univariate feature selector |
| boruta_estimator | estimator | RandomForestClassifier() | Base estimator for Boruta selection |
| random_state | int | 0 | Random seed for reproducibility |
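
To make the missing, variance_threshold and normalization parameters concrete, here is a dependency-free sketch of what such filters could do to a small matrix. It illustrates the parameter descriptions only and is not MESA's code: the strict variance comparison (as in scikit-learn's VarianceThreshold) and per-sample L2 scaling (as in scikit-learn's Normalizer) are assumptions, and all function names are hypothetical.

```python
import math


def column_missing_fraction(rows, j):
    """Fraction of samples whose value in column j is NaN."""
    return sum(math.isnan(r[j]) for r in rows) / len(rows)


def column_variance(rows, j):
    """Population variance of column j, ignoring NaN entries."""
    vals = [r[j] for r in rows if not math.isnan(r[j])]
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)


def keep_columns(rows, missing=0.1, variance_threshold=0.0):
    """Indices of columns passing both filters (assumed semantics)."""
    n_cols = len(rows[0])
    return [
        j for j in range(n_cols)
        if column_missing_fraction(rows, j) <= missing
        and column_variance(rows, j) > variance_threshold
    ]


def l2_normalize(row):
    """Scale one sample to unit Euclidean length (zero rows are unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return list(row) if norm == 0 else [v / norm for v in row]


nan = float("nan")
X = [
    [3.0, nan, 1.0, 4.0],
    [3.0, nan, 1.0, 0.0],
    [3.0, 2.0, 1.0, 4.0],
    [3.0, 5.0, 1.0, 0.0],
]
# Columns 0 and 2 are constant and column 1 is 50% missing, so only column 3 survives.
kept = keep_columns(X, missing=0.1, variance_threshold=0.0)
print(kept)
print(l2_normalize([3.0, 4.0]))
```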

Methods

  • fit(X, y): Fit the preprocessing pipeline and classifier
  • transform(X): Apply preprocessing pipeline only
  • predict(X): Predict class labels for preprocessed data
  • predict_proba(X): Predict class probabilities for preprocessed data
  • transform_predict(X): Apply pipeline and predict in one step
  • transform_predict_proba(X): Apply pipeline and predict probabilities
  • get_support(step=None): Get indices of selected features
  • get_params(deep=True): Get model parameters
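
The method list distinguishes predict, which is described as taking preprocessed data, from transform_predict, which applies the pipeline first. The toy class below mimics that layout to show why the distinction matters. It is a hypothetical stand-in, not MESA_modality itself, and it assumes that transform_predict(X) is equivalent to predict(transform(X)).

```python
class ToyModality:
    """Hypothetical stand-in that mimics the method layout described above."""

    def __init__(self, keep):
        self.keep = keep          # column indices "selected" by the pipeline
        self.threshold = None

    def transform(self, X):
        """Preprocessing step: keep only the selected columns."""
        return [[row[j] for j in self.keep] for row in X]

    def fit(self, X, y):
        """Toy classifier: threshold halfway between the class means of row sums."""
        sums = [sum(r) for r in self.transform(X)]
        pos = [s for s, label in zip(sums, y) if label == 1]
        neg = [s for s, label in zip(sums, y) if label == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X_preprocessed):
        """Expects data that has already been through transform()."""
        return [int(sum(r) > self.threshold) for r in X_preprocessed]

    def transform_predict(self, X_raw):
        """Accepts raw data: preprocess, then predict."""
        return self.predict(self.transform(X_raw))


X = [[0, 9, 1], [1, 9, 0], [5, 9, 4], [4, 9, 6]]
y = [0, 0, 1, 1]
toy = ToyModality(keep=[0, 2]).fit(X, y)

print(toy.transform_predict(X))      # raw data goes through the pipeline first
print(toy.predict(toy.transform(X))) # same result in two explicit steps
print(toy.predict(X))                # raw data passed to predict: column 1 skews every sum
```

Under that assumption, passing raw data to predict or predict_proba silently produces wrong results rather than an error, so the transform_ variants are the safer default for new samples.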

MESA

Multi-modality ensemble with stacking architecture for integrating multiple data types.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| modalities | list | Required | List of MESA_modality objects |
| meta_estimator | estimator | LogisticRegression() | Meta-learner for ensemble combination |
| random_state | int | 0 | Random seed for reproducibility |
| cv | cv generator | RepeatedStratifiedKFold() | Cross-validation strategy for meta-features |

Methods

  • fit(X_list, y): Fit all modalities and meta-estimator
  • predict(X_list_test): Predict class labels using ensemble
  • predict_proba(X_list_test): Predict class probabilities using ensemble
  • get_support(step=None): Get feature support from all modalities
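
Stacking is commonly implemented by scoring every training sample with a base model that never saw it (out-of-fold prediction) and then training a meta-learner on those scores, one column per modality. The sketch below shows that data flow with toy components. It is an assumption about the general technique, not MESA's internals: the centroid base learner, the interleaved folds, and the hand-written logistic regression are all hypothetical simplifications.

```python
import math


def fit_centroid(rows, y):
    """Toy base learner: remember the mean row sum of each class."""
    s = [sum(r) for r in rows]
    m0 = sum(v for v, t in zip(s, y) if t == 0) / y.count(0)
    m1 = sum(v for v, t in zip(s, y) if t == 1) / y.count(1)
    return m0, m1


def predict_centroid(model, rows):
    """Score in (0, 1): closer to the class-1 mean gives a higher score."""
    m0, m1 = model
    return [1 / (1 + math.exp(abs(sum(r) - m1) - abs(sum(r) - m0))) for r in rows]


def out_of_fold_scores(rows, y, n_splits):
    """Each sample is scored by a model that never saw it during fitting."""
    scores = [None] * len(rows)
    for k in range(n_splits):
        # Plain interleaved folds for brevity (not stratified, not repeated).
        test = [i for i in range(len(rows)) if i % n_splits == k]
        train = [i for i in range(len(rows)) if i % n_splits != k]
        model = fit_centroid([rows[i] for i in train], [y[i] for i in train])
        for i, s in zip(test, predict_centroid(model, [rows[i] for i in test])):
            scores[i] = s
    return scores


def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Tiny gradient-descent logistic regression used as the meta-learner."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for z, t in zip(Z, y):
            p = 1 / (1 + math.exp(-(b + sum(wi * zi for wi, zi in zip(w, z)))))
            w = [wi - lr * (p - t) * zi for wi, zi in zip(w, z)]
            b -= lr * (p - t)
    return w, b


y = [0, 1, 0, 1, 0, 1, 0, 1]
modality_a = [[0.1], [2.0], [0.3], [1.8], [0.2], [2.2], [0.0], [1.9]]
modality_b = [[5.0], [9.0], [5.5], [8.5], [4.5], [9.5], [5.2], [8.8]]

# One meta-feature column per modality, built from out-of-fold scores.
Z = list(zip(out_of_fold_scores(modality_a, y, 4), out_of_fold_scores(modality_b, y, 4)))
w, b = fit_logistic(Z, y)
meta_pred = [int(b + sum(wi * zi for wi, zi in zip(w, z)) > 0) for z in Z]
print(meta_pred == y)
```

Using out-of-fold scores keeps the meta-learner from being trained on predictions that the base models made for their own training samples, which would overstate how reliable each modality is.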

MESA_CV

Cross-validation wrapper for performance evaluation of MESA models.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| modality | estimator | Required | MESA_modality or MESA object to evaluate |
| random_state | int | 0 | Random seed for reproducibility |
| cv | cv generator | StratifiedKFold(n_splits=5) | Cross-validation strategy |
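
Stratified splitting keeps the class ratio roughly equal in every fold, which matters for small or imbalanced cfDNA cohorts. The following is a minimal sketch of the idea, dealing each class out round-robin. scikit-learn's StratifiedKFold uses its own allocation and optional shuffling, so this is an illustration rather than a reimplementation; the function name is hypothetical.

```python
from collections import defaultdict


def stratified_folds(y, n_splits):
    """Assign each sample index to a fold, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, i in enumerate(indices):
            folds[position % n_splits].append(i)
    return [sorted(f) for f in folds]


# Six controls and three cases: every fold receives two controls and one case.
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
for k, test_idx in enumerate(stratified_folds(y, 3)):
    print(k, test_idx, [y[i] for i in test_idx])
```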

Methods

  • fit(X, y): Perform cross-validation on provided data
  • get_performance(): Calculate mean ROC AUC score across CV folds

Attributes

  • cv_result: List of (y_pred, y_true) tuples from each CV fold
  • modality: The fitted modality estimator being evaluated
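
As a rough illustration of how a mean ROC AUC could be derived from cv_result, the sketch below computes AUC per fold as the fraction of correctly ranked (positive, negative) pairs and averages the folds. It assumes that y_pred holds two probability columns with the positive class in column 1, as the indexing in Example 3 suggests. Whether get_performance averages in exactly this way is an assumption, and the function names are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(cv_result):
    """Average per-fold AUC over (y_pred, y_true) tuples."""
    aucs = [roc_auc(y_true, [row[1] for row in y_pred]) for y_pred, y_true in cv_result]
    return sum(aucs) / len(aucs)


# Two made-up folds: the first ranks perfectly, the second misranks one pair.
cv_result = [
    ([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]], [0, 1, 0, 1]),
    ([[0.4, 0.6], [0.7, 0.3], [0.8, 0.2], [0.1, 0.9]], [0, 1, 0, 1]),
]
print(mean_auc(cv_result))
```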

Usage Examples

Example 1: Single Modality Analysis

import pandas as pd
import numpy as np
from mesa import MESA_modality, MESA_CV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Load single modality data
X = pd.read_csv('methylation_data.csv', index_col=0)
y = pd.read_csv('labels.csv', index_col=0).values.ravel()

# Create and configure modality
modality = MESA_modality(
    top_n=50,
    missing=0.2,
    normalization=True,
    classifier=RandomForestClassifier(n_estimators=100, random_state=42)
)

# Fit the modality
modality.fit(X, y)

# Make predictions on new data
X_test = pd.read_csv('test_data.csv', index_col=0)
predictions = modality.transform_predict_proba(X_test)
print(f"Prediction probabilities shape: {predictions.shape}")

# Get selected features
selected_features = modality.get_support()
print(f"Number of selected features: {len(selected_features)}")

# Cross-validation evaluation
cv_eval = MESA_CV(
    modality=MESA_modality(top_n=50, missing=0.2),
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
)
cv_eval.fit(X, y)
auc_score = cv_eval.get_performance()
print(f"Cross-validation AUC: {auc_score:.3f}")

Example 2: Multi-Modality Ensemble

import pandas as pd
from mesa import MESA_modality, MESA, MESA_CV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Load multi-modal data
methylation_data = pd.read_csv('methylation.csv', index_col=0)
histone_data = pd.read_csv('histone_marks.csv', index_col=0)
chromatin_data = pd.read_csv('chromatin_accessibility.csv', index_col=0)
y = pd.read_csv('labels.csv', index_col=0).values.ravel()

# Define modality-specific configurations
modalities = [
    MESA_modality(
        top_n=100,
        missing=0.1,
        classifier=RandomForestClassifier(n_estimators=200, random_state=42),
        normalization=True
    ),
    MESA_modality(
        top_n=80,
        missing=0.15,
        classifier=SVC(probability=True, random_state=42),
        normalization=False
    ),
    MESA_modality(
        top_n=60,
        missing=0.2,
        classifier=LogisticRegression(random_state=42),
        normalization=True
    )
]

# Create MESA ensemble
mesa = MESA(
    modalities=modalities,
    meta_estimator=LogisticRegression(random_state=42),
    random_state=42
)

# Fit the ensemble
X_list = [methylation_data, histone_data, chromatin_data]
mesa.fit(X_list, y)

# Make ensemble predictions
X_test_list = [
    pd.read_csv('methylation_test.csv', index_col=0),
    pd.read_csv('histone_test.csv', index_col=0),
    pd.read_csv('chromatin_test.csv', index_col=0)
]
ensemble_predictions = mesa.predict_proba(X_test_list)
print(f"Ensemble predictions shape: {ensemble_predictions.shape}")

# Get feature support from all modalities
feature_supports = mesa.get_support()
for i, support in enumerate(feature_supports):
    print(f"Modality {i+1}: {len(support)} features selected")

Example 3: Cross-Validation for Multi-Modality

from mesa import MESA, MESA_modality, MESA_CV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# Define ensemble for CV evaluation
modalities = [
    MESA_modality(top_n=50, missing=0.1),
    MESA_modality(top_n=40, missing=0.15),
    MESA_modality(top_n=60, missing=0.2)
]

mesa_ensemble = MESA(
    modalities=modalities,
    cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
)

# Cross-validation evaluation
cv_eval = MESA_CV(
    modality=mesa_ensemble,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

# Perform CV on multi-modal data (methylation_data, histone_data, chromatin_data and y as loaded in Example 2)
X_list = [methylation_data, histone_data, chromatin_data]
cv_eval.fit(X_list, y)

# Get performance metrics
mean_auc = cv_eval.get_performance()
print(f"Multi-modal ensemble CV AUC: {mean_auc:.3f}")

# Access individual fold results
for i, (y_pred, y_true) in enumerate(cv_eval.cv_result):
    fold_auc = roc_auc_score(y_true, y_pred[:, 1])
    print(f"Fold {i+1} AUC: {fold_auc:.3f}")

Example 4: Custom Feature Selection Pipeline

from mesa import MESA_modality, MESA_CV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import GradientBoostingClassifier

# Custom modality with different feature selection
custom_modality = MESA_modality(
    top_n=30,
    variance_threshold=0.01,
    missing=0.05,
    normalization=True,
    selector=SelectKBest(score_func=f_classif, k=1000),
    classifier=GradientBoostingClassifier(n_estimators=100, random_state=42),
    boruta_estimator=GradientBoostingClassifier(n_estimators=50, random_state=42)
)

# Fit and evaluate (X, y and X_test as loaded in Example 1)
custom_modality.fit(X, y)
custom_predictions = custom_modality.transform_predict_proba(X_test)

# Compare with default configuration
default_modality = MESA_modality()
cv_custom = MESA_CV(custom_modality)
cv_default = MESA_CV(default_modality)

cv_custom.fit(X, y)
cv_default.fit(X, y)

print(f"Custom configuration AUC: {cv_custom.get_performance():.3f}")
print(f"Default configuration AUC: {cv_default.get_performance():.3f}")

Example 5: Feature Importance Analysis

from mesa import MESA_modality

# Inspect which features survive each pipeline step (X and y as loaded in Example 1)
modality = MESA_modality(top_n=100)
modality.fit(X, y)

# Get feature support at different pipeline steps
missing_support = modality.get_support(step=0)  # After missing value filtering
variance_support = modality.get_support(step=1)  # After variance filtering
univariate_support = modality.get_support(step=2)  # After univariate selection
final_support = modality.get_support()  # Final selected features

print(f"Features after missing value filter: {len(missing_support)}")
print(f"Features after variance filter: {len(variance_support)}")
print(f"Features after univariate selection: {len(univariate_support)}")
print(f"Final selected features: {len(final_support)}")

# Get feature names if using DataFrame
if hasattr(X, 'columns'):
    selected_feature_names = X.columns[final_support]
    print(f"Selected features: {selected_feature_names.tolist()}")

Performance Tips

  1. Memory Management: For large datasets, consider reducing top_n and using n_jobs=1 for Boruta
  2. Feature Selection: Adjust missing threshold based on data quality
  3. Cross-Validation: Use fewer repeats for initial exploration, more for final evaluation
  4. Ensemble Size: Start with 2-3 modalities, add more based on performance gains
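
The cost of repeats and extra modalities can be estimated with simple arithmetic. The sketch below counts base-model fits under two stated assumptions: each outer CV fold refits the whole ensemble, and each ensemble fit trains every modality once per inner split plus once on the full training fold. MESA's real fitting schedule may differ, and the function name is hypothetical.

```python
def total_base_fits(n_modalities, inner_splits, inner_repeats, outer_splits):
    """Rough count of base-model fits for a stacked ensemble under outer CV."""
    per_ensemble = n_modalities * (inner_splits * inner_repeats + 1)
    return outer_splits * per_ensemble


# Settings from Example 3: 3 modalities, 5x5 repeated inner CV, 5 outer folds.
print(total_base_fits(3, 5, 5, 5))
# Same setup with a single inner repeat, as for a first exploratory run.
print(total_base_fits(3, 5, 1, 5))
```

Because every one of those fits repeats feature selection, dropping from five repeats to one cuts the work by more than a factor of four in this estimate.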

Citation

If you use MESA in your research, please cite:

Li, Y., Xu, J., Chen, C. et al. Multimodal epigenetic sequencing analysis (MESA) of cell-free DNA for non-invasive colorectal cancer detection. Genome Med 16, 9 (2024). https://doi.org/10.1186/s13073-023-01280-6

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Support

For questions and support:

  • Open an issue on GitHub
  • Email: c.chen@uci.edu
  • Documentation: [Link to full documentation]

Keywords: cfDNA, epigenetics, multimodal analysis, machine learning, feature selection, ensemble learning, stacking, bioinformatics, biomarker discovery, methylation, computational biology, early detection

Download files

Source Distribution

mesa_cfdna-0.6.0.tar.gz (17.0 kB view details)

Uploaded Source

Built Distribution

mesa_cfdna-0.6.0-py3-none-any.whl (14.0 kB view details)

Uploaded Python 3

File details

Details for the file mesa_cfdna-0.6.0.tar.gz.

File metadata

  • Download URL: mesa_cfdna-0.6.0.tar.gz
  • Upload date:
  • Size: 17.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for mesa_cfdna-0.6.0.tar.gz
Algorithm Hash digest
SHA256 2dd6f5beb1c8ded54ad76145b9b63fbc7a00f024d0fba1162971ec5cb286cb77
MD5 9dd9388be99e78204cad4e567c6ef9da
BLAKE2b-256 1b19892fc50daf0ce5a91fb059395ca93c112ea81fa70e6852c478bdfa566bf3
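
A downloaded archive can be checked against the published SHA256 digest with Python's standard hashlib module. This is a generic sketch, not part of the MESA package: the helper names are hypothetical, and the file name in the usage lines assumes the archive sits in the current directory.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True if the file's SHA256 equals the published digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


archive = "mesa_cfdna-0.6.0.tar.gz"  # assumed local download location
published = "2dd6f5beb1c8ded54ad76145b9b63fbc7a00f024d0fba1162971ec5cb286cb77"
if os.path.exists(archive):
    print(matches_published_hash(archive, published))
```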

Provenance

The following attestation bundles were made for mesa_cfdna-0.6.0.tar.gz:

Publisher: python-publish.yml on ChaorongC/mesa_cfdna

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mesa_cfdna-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: mesa_cfdna-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 14.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for mesa_cfdna-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4d41b23bcc689611dfe9bff28fbbb987f95c9d47201917e61922298ea4dd33dd
MD5 4bebb4521ab2a0267910f1b57810a8dc
BLAKE2b-256 420a83c2fb92eb3689e65c3a1b5f6e2f7dfe2fed609a6baa1761f4e225042ced

Provenance

The following attestation bundles were made for mesa_cfdna-0.6.0-py3-none-any.whl:

Publisher: python-publish.yml on ChaorongC/mesa_cfdna

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
