Multimodal Epigenetic Sequencing Analysis (MESA)
A flexible and sensitive method for capturing and integrating multimodal epigenetic information from cell-free DNA (cfDNA) using a single experimental assay.
Overview
MESA (Multimodal Epigenetic Sequencing Analysis) provides a comprehensive framework for analyzing multimodal epigenetic data from cfDNA. The package features a sklearn-compatible API that seamlessly integrates preprocessing, scaling, feature selection, model training, and cross-validation workflows.
Key Features
- Multimodal Integration: Combine multiple epigenetic data modalities using ensemble stacking
- Advanced Feature Selection: Boruta algorithm combined with univariate selection to balance computation time against biomarker discovery
- Robust Cross-Validation: Built-in evaluation framework with performance metrics for easy fine-tuning
- Flexible Pipeline: Customizable preprocessing and classification components
- Missing Value Handling: Intelligent filtering and imputation strategies
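The two-stage selection idea in the feature list above can be illustrated with scikit-learn alone. This is a minimal sketch, not MESA's implementation: it runs a cheap univariate filter first, then a single shadow-feature comparison, which is the core idea behind Boruta rather than the full iterative algorithm. The synthetic data, the choice of 40 features, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import GenericUnivariateSelect, f_classif

# Synthetic stand-in for one modality: 120 samples, 200 features, 10 informative.
X, y = make_classification(
    n_samples=120, n_features=200, n_informative=10,
    n_redundant=0, shuffle=False, random_state=0,
)

# Stage 1: a cheap univariate filter keeps the 40 highest-scoring features,
# so the expensive forest-based step sees far fewer columns.
univariate = GenericUnivariateSelect(score_func=f_classif, mode="k_best", param=40)
X_filtered = univariate.fit_transform(X, y)
kept_idx = univariate.get_support(indices=True)

# Stage 2: one shadow-feature round. Each shadow column is an independently
# shuffled copy of a real column, so it carries no signal. A real feature is
# kept only if its importance exceeds the best shadow importance.
rng = np.random.default_rng(0)
shadows = rng.permuted(X_filtered, axis=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(np.hstack([X_filtered, shadows]), y)

n_real = X_filtered.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_max = forest.feature_importances_[n_real:].max()
selected_idx = kept_idx[real_importance > shadow_max]

print(f"{X.shape[1]} features -> {n_real} after univariate filter "
      f"-> {len(selected_idx)} after shadow comparison")
```

Full Boruta repeats the shadow comparison over many rounds and applies a statistical test to the win counts, so it is slower but more stable than this single pass.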
Installation
# Install package with pip
pip install mesa-cfdna
Quick Start
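from mesa import MESA_modality, MESA, MESA_CV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd
# Load your data
# load_data() stands in for your own loading function; X_test and the
# per-modality matrices (X1_train, X2_train, X1_test, X2_test) used below
# are assumed to be loaded the same way
X_train, y_train = load_data()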
# Single modality analysis
modality_1 = MESA_modality(top_n=50, classifier=RandomForestClassifier(random_state=0), variance_threshold=0, normalization=True)
modality_1.fit(X_train, y_train)
predictions = modality_1.transform_predict_proba(X_test)
modality_2 = MESA_modality(top_n=100, classifier=LogisticRegression(random_state=0), variance_threshold=0, normalization=False, missing=0)
modality_2.fit(X_train, y_train)
predictions = modality_2.transform_predict_proba(X_test)
# Multi-modality ensemble
modalities = [modality_1, modality_2]
mesa = MESA(modalities)
mesa.fit([X1_train, X2_train], y_train)
mesa_predictions = mesa.predict_proba([X1_test, X2_test])
API Reference
MESA_modality
Single modality analysis with comprehensive preprocessing and feature selection pipeline.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `top_n` | int | 100 | Number of features to select using Boruta algorithm |
| `variance_threshold` | float | 0 | Minimum variance threshold for feature filtering |
| `normalization` | bool | False | Whether to apply L2 normalization |
| `missing` | float | 0.1 | Maximum proportion of missing values allowed per feature |
| `classifier` | estimator | RandomForestClassifier() | Final classifier for predictions |
| `selector` | int/estimator | GenericUnivariateSelect() | Univariate feature selector |
| `boruta_estimator` | estimator | RandomForestClassifier() | Base estimator for Boruta selection |
| `random_state` | int | 0 | Random seed for reproducibility |
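To make the `missing`, `variance_threshold`, and `normalization` parameters concrete, here is a small sketch of what such preprocessing steps typically look like with pandas and scikit-learn. It is an illustration under assumptions, not MESA's internal code: the column names are hypothetical, mean imputation is an assumed strategy, and L2 normalization is taken to mean per-sample scaling as in scikit-learn's `Normalizer`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Normalizer

# Toy matrix: rows are samples, columns are features (hypothetical names).
X = pd.DataFrame({
    "region_a": [0.2, 0.4, np.nan, 0.8, 0.5],        # 20% missing
    "region_b": [np.nan, np.nan, np.nan, 0.1, 0.3],  # 60% missing
    "region_c": [1.0, 1.0, 1.0, 1.0, 1.0],           # constant
    "region_d": [0.9, 0.1, 0.4, 0.6, 0.2],           # complete
})

# missing: drop features whose fraction of NaN exceeds the threshold.
missing = 0.3
keep = X.isna().mean() <= missing
X_kept = X.loc[:, keep]

# Fill the remaining gaps (mean imputation is an assumption here).
imputed = SimpleImputer(strategy="mean").fit_transform(X_kept)

# variance_threshold: keep only features whose variance is above the threshold.
vt = VarianceThreshold(threshold=0.0)
filtered = vt.fit_transform(imputed)
names = X_kept.columns[vt.get_support()]

# normalization: scale each sample (row) to unit L2 norm.
normalized = Normalizer(norm="l2").fit_transform(filtered)

print(list(names))
print(np.linalg.norm(normalized, axis=1))
```

In this toy case `region_b` is dropped for having too many missing values and `region_c` for having zero variance, leaving two features per sample.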
Methods
- `fit(X, y)`: Fit the preprocessing pipeline and classifier
- `transform(X)`: Apply preprocessing pipeline only
- `predict(X)`: Predict class labels for preprocessed data
- `predict_proba(X)`: Predict class probabilities for preprocessed data
- `transform_predict(X)`: Apply pipeline and predict in one step
- `transform_predict_proba(X)`: Apply pipeline and predict probabilities
- `get_support(step=None)`: Get indices of selected features
- `get_params(deep=True)`: Get model parameters
MESA
Multi-modality ensemble with stacking architecture for integrating multiple data types.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `modalities` | list | Required | List of MESA_modality objects |
| `meta_estimator` | estimator | LogisticRegression() | Meta-learner for ensemble combination |
| `random_state` | int | 0 | Random seed for reproducibility |
| `cv` | cv generator | RepeatedStratifiedKFold() | Cross-validation strategy for meta-features |
Methods
- `fit(X_list, y)`: Fit all modalities and meta-estimator
- `predict(X_list_test)`: Predict class labels using ensemble
- `predict_proba(X_list_test)`: Predict class probabilities using ensemble
- `get_support(step=None)`: Get feature support from all modalities
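For readers unfamiliar with stacking, the general pattern behind a multi-modality ensemble can be sketched with plain scikit-learn. This is a generic illustration, not MESA's source: each modality gets its own base model, out-of-fold probabilities become meta-features, and a logistic regression combines them. The synthetic data, the train/test split, and every name below are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Two synthetic "modalities" measured on the same 150 samples.
X_all, y = make_classification(
    n_samples=150, n_features=40, n_informative=8, random_state=0
)
X_list = [X_all[:, :20], X_all[:, 20:]]
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    LogisticRegression(max_iter=1000),
]

train, test = np.arange(0, 110), np.arange(110, 150)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Level 0: out-of-fold probabilities, so the meta-learner never sees a
# prediction from a model that was trained on that same sample.
meta_train = np.column_stack([
    cross_val_predict(model, X[train], y[train], cv=cv, method="predict_proba")[:, 1]
    for model, X in zip(base_models, X_list)
])

# Level 1: the meta-estimator combines one probability column per modality.
meta = LogisticRegression().fit(meta_train, y[train])

# At prediction time, each base model is refit on all training samples.
meta_test = np.column_stack([
    model.fit(X[train], y[train]).predict_proba(X[test])[:, 1]
    for model, X in zip(base_models, X_list)
])
proba = meta.predict_proba(meta_test)
print(proba.shape)
```

Using out-of-fold predictions for the meta-features is what the `cv` parameter governs in a stacking setup; fitting the meta-learner on in-sample predictions would tend to overweight whichever base model overfits most.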
MESA_CV
Cross-validation wrapper for performance evaluation of MESA models.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `modality` | estimator | Required | MESA_modality or MESA object to evaluate |
| `random_state` | int | 0 | Random seed for reproducibility |
| `cv` | cv generator | StratifiedKFold(n_splits=5) | Cross-validation strategy |
Methods
- `fit(X, y)`: Perform cross-validation on provided data
- `get_performance()`: Calculate mean ROC AUC score across CV folds
Attributes
- `cv_result`: List of (y_pred, y_true) tuples from each CV fold
- `modality`: The fitted modality estimator being evaluated
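The relationship between per-fold results and a mean ROC AUC can be reproduced with scikit-learn directly. The sketch below is an assumption-laden stand-in rather than MESA_CV itself: it builds a list of (y_pred, y_true) tuples shaped like the `cv_result` attribute described above, using a plain logistic regression on synthetic data, and averages the fold-level AUCs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Collect one (y_pred, y_true) tuple per fold; y_pred holds class probabilities.
cv_result = []
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    cv_result.append((model.predict_proba(X[test_idx]), y[test_idx]))

# Score each fold on the positive-class column, then average across folds.
fold_aucs = [roc_auc_score(y_true, y_pred[:, 1]) for y_pred, y_true in cv_result]
mean_auc = float(np.mean(fold_aucs))
print(f"Mean AUC over {len(fold_aucs)} folds: {mean_auc:.3f}")
```

Keeping the raw per-fold tuples, rather than only the mean, makes it possible to inspect fold-to-fold variance or compute other metrics later.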
Usage Examples
Example 1: Single Modality Analysis
import pandas as pd
import numpy as np
from mesa import MESA_modality, MESA_CV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
# Load single modality data
X = pd.read_csv('methylation_data.csv', index_col=0)
y = pd.read_csv('labels.csv', index_col=0).values.ravel()
# Create and configure modality
modality = MESA_modality(
    top_n=50,
    missing=0.2,
    normalization=True,
    classifier=RandomForestClassifier(n_estimators=100, random_state=42)
)
# Fit the modality
modality.fit(X, y)
# Make predictions on new data
X_test = pd.read_csv('test_data.csv', index_col=0)
predictions = modality.transform_predict_proba(X_test)
print(f"Prediction probabilities shape: {predictions.shape}")
# Get selected features
selected_features = modality.get_support()
print(f"Number of selected features: {len(selected_features)}")
# Cross-validation evaluation
cv_eval = MESA_CV(
    modality=MESA_modality(top_n=50, missing=0.2),
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
)
cv_eval.fit(X, y)
auc_score = cv_eval.get_performance()
print(f"Cross-validation AUC: {auc_score:.3f}")
Example 2: Multi-Modality Ensemble
import pandas as pd
from mesa import MESA_modality, MESA, MESA_CV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Load multi-modal data
methylation_data = pd.read_csv('methylation.csv', index_col=0)
histone_data = pd.read_csv('histone_marks.csv', index_col=0)
chromatin_data = pd.read_csv('chromatin_accessibility.csv', index_col=0)
y = pd.read_csv('labels.csv', index_col=0).values.ravel()
# Define modality-specific configurations
modalities = [
    MESA_modality(
        top_n=100,
        missing=0.1,
        classifier=RandomForestClassifier(n_estimators=200, random_state=42),
        normalization=True
    ),
    MESA_modality(
        top_n=80,
        missing=0.15,
        classifier=SVC(probability=True, random_state=42),
        normalization=False
    ),
    MESA_modality(
        top_n=60,
        missing=0.2,
        classifier=LogisticRegression(random_state=42),
        normalization=True
    )
]
# Create MESA ensemble
mesa = MESA(
    modalities=modalities,
    meta_estimator=LogisticRegression(random_state=42),
    random_state=42
)
# Fit the ensemble
X_list = [methylation_data, histone_data, chromatin_data]
mesa.fit(X_list, y)
# Make ensemble predictions
X_test_list = [
    pd.read_csv('methylation_test.csv', index_col=0),
    pd.read_csv('histone_test.csv', index_col=0),
    pd.read_csv('chromatin_test.csv', index_col=0)
]
ensemble_predictions = mesa.predict_proba(X_test_list)
print(f"Ensemble predictions shape: {ensemble_predictions.shape}")
# Get feature support from all modalities
feature_supports = mesa.get_support()
for i, support in enumerate(feature_supports):
    print(f"Modality {i+1}: {len(support)} features selected")
Example 3: Cross-Validation for Multi-Modality
from mesa import MESA, MESA_modality, MESA_CV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold
# Define ensemble for CV evaluation
modalities = [
    MESA_modality(top_n=50, missing=0.1),
    MESA_modality(top_n=40, missing=0.15),
    MESA_modality(top_n=60, missing=0.2)
]
mesa_ensemble = MESA(
    modalities=modalities,
    cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
)
# Cross-validation evaluation
cv_eval = MESA_CV(
    modality=mesa_ensemble,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)
# Perform CV on multi-modal data (the three DataFrames and y are loaded as in Example 2)
X_list = [methylation_data, histone_data, chromatin_data]
cv_eval.fit(X_list, y)
# Get performance metrics
mean_auc = cv_eval.get_performance()
print(f"Multi-modal ensemble CV AUC: {mean_auc:.3f}")
# Access individual fold results
for i, (y_pred, y_true) in enumerate(cv_eval.cv_result):
    fold_auc = roc_auc_score(y_true, y_pred[:, 1])
    print(f"Fold {i+1} AUC: {fold_auc:.3f}")
Example 4: Custom Feature Selection Pipeline
from mesa import MESA_modality, MESA_CV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import GradientBoostingClassifier
# Custom modality with different feature selection
custom_modality = MESA_modality(
    top_n=30,
    variance_threshold=0.01,
    missing=0.05,
    normalization=True,
    selector=SelectKBest(score_func=f_classif, k=1000),
    classifier=GradientBoostingClassifier(n_estimators=100, random_state=42),
    boruta_estimator=GradientBoostingClassifier(n_estimators=50, random_state=42)
)
# Fit and evaluate
custom_modality.fit(X, y)
custom_predictions = custom_modality.transform_predict_proba(X_test)
# Compare with default configuration
default_modality = MESA_modality()
cv_custom = MESA_CV(custom_modality)
cv_default = MESA_CV(default_modality)
cv_custom.fit(X, y)
cv_default.fit(X, y)
print(f"Custom configuration AUC: {cv_custom.get_performance():.3f}")
print(f"Default configuration AUC: {cv_default.get_performance():.3f}")
Example 5: Feature Importance Analysis
from mesa import MESA_modality
# Inspect which features survive each pipeline step (X and y as loaded in Example 1)
modality = MESA_modality(top_n=100)
modality.fit(X, y)
# Get feature support at different pipeline steps
missing_support = modality.get_support(step=0) # After missing value filtering
variance_support = modality.get_support(step=1) # After variance filtering
univariate_support = modality.get_support(step=2) # After univariate selection
final_support = modality.get_support() # Final selected features
print(f"Features after missing value filter: {len(missing_support)}")
print(f"Features after variance filter: {len(variance_support)}")
print(f"Features after univariate selection: {len(univariate_support)}")
print(f"Final selected features: {len(final_support)}")
# Get feature names if using DataFrame
if hasattr(X, 'columns'):
    selected_feature_names = X.columns[final_support]
    print(f"Selected features: {selected_feature_names.tolist()}")
Performance Tips
- Memory Management: For large datasets, consider reducing `top_n` and using `n_jobs=1` for Boruta
- Feature Selection: Adjust the `missing` threshold based on data quality
- Cross-Validation: Use fewer repeats for initial exploration, more for final evaluation
- Ensemble Size: Start with 2-3 modalities, add more based on performance gains
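As a small illustration of the cross-validation tip, the cost of repeated stratified K-fold grows with the product of splits and repeats, which scikit-learn reports through `get_n_splits()`. The specific repeat counts below are arbitrary examples, not recommendations from the MESA authors.

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# Fewer repeats while exploring, more for the final estimate.
exploratory_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
final_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

# Each split means one full model fit, so cost scales with n_splits * n_repeats.
print(exploratory_cv.get_n_splits())  # 10
print(final_cv.get_n_splits())        # 50
```

Either object can be supplied wherever a `cv` generator is accepted, as Example 3 does for the ensemble.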
Citation
If you use MESA in your research, please cite:
Li, Y., Xu, J., Chen, C. et al. Multimodal epigenetic sequencing analysis (MESA) of cell-free DNA for non-invasive colorectal cancer detection. Genome Med 16, 9 (2024). https://doi.org/10.1186/s13073-023-01280-6
Authors
- Chaorong Chen - Lead Developer - c.chen@uci.edu
- Wei Li - Principal Investigator - wei.li@uci.edu
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Support
For questions and support:
- Open an issue on GitHub
- Email: c.chen@uci.edu
- Documentation: [Link to full documentation]
Keywords: cfDNA, epigenetics, multimodal analysis, machine learning, feature selection, ensemble learning, stacking, bioinformatics, biomarker discovery, methylation, computational biology, early detection