Evolution-Inspired Data Augmentation for Genomic Sequences - DataLoader Version
EvoAug2
EvoAug2 is a PyTorch package to pretrain sequence-based deep learning models for regulatory genomics with evolution-inspired data augmentations, followed by fine-tuning on the original, unperturbed data. The new version replaces the prior model-wrapper (RobustModel) with a loader-first design (RobustLoader) that applies augmentations on-the-fly within a drop-in DataLoader.
All augmentations are length-preserving: inputs with shape (N, A, L) always return outputs with the exact same shape.
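To make the (N, A, L) convention concrete, the sketch below one-hot encodes DNA strings into nested lists with that layout (N sequences, A alphabet channels, L positions). The helper names one_hot_batch and shape, and the A, C, G, T channel order, are assumptions for illustration and are not part of EvoAug2; plain Python lists are used so the snippet runs without PyTorch.

```python
ALPHABET = "ACGT"  # assumed channel order; check the convention of your own dataset


def one_hot_batch(seqs):
    """Encode equal-length DNA strings as nested lists shaped (N, A, L)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must share the same length L")
    return [
        [[1.0 if base == letter else 0.0 for base in seq] for letter in ALPHABET]
        for seq in seqs
    ]


def shape(batch):
    """Return (N, A, L) for a nested-list batch."""
    return (len(batch), len(batch[0]), len(batch[0][0]))


batch = one_hot_batch(["ACGTAC", "TTGACA"])
print(shape(batch))  # (2, 4, 6)
```

In real use the same layout would be held in a torch tensor; a length-preserving augmentation must return a batch whose shape() is unchanged.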
For questions, email: koo@cshl.edu
Install
pip install evoaug2
Installation Options
Option 1: Install from PyPI (Recommended)
# Install the latest stable release
pip install evoaug2
# Install with specific version
pip install evoaug2==2.0.0
# Install with optional dependencies for examples
# (quotes keep shells such as zsh from expanding the brackets)
pip install "evoaug2[examples]"
# Install with all optional dependencies
pip install "evoaug2[full]"
Option 2: Install from Source (Development)
# Clone the repository
git clone https://github.com/pkoo/evoaug2.git
cd evoaug2
# Install in development mode
pip install -e .
# Or install with development dependencies
pip install -e ".[dev]"
Option 3: Install with Conda/Mamba
# Create a new environment (recommended)
conda create -n evoaug2 python=3.8
conda activate evoaug2
# Install PyTorch first (choose appropriate version)
conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
# Install EvoAug2
pip install evoaug2
Dependencies
torch >= 1.9.0
pytorch-lightning >= 1.5.0
numpy >= 1.20.0
scipy >= 1.7.0
h5py >= 3.1.0
scikit-learn >= 1.0.0
Note: The examples use pytorch_lightning (imported as import pytorch_lightning as pl). If you use the newer lightning.pytorch package, adapt the Trainer import and arguments accordingly.
Quick Start
# Install the package
pip install evoaug2
# Import and use
from evoaug import evoaug, augment
from utils import utils
# Create augmentations
augment_list = [
    augment.RandomDeletion(delete_min=0, delete_max=20),
    augment.RandomRC(rc_prob=0.5),
    augment.RandomMutation(mut_frac=0.05),
]
# Create a RobustLoader
loader = evoaug.RobustLoader(
    base_dataset=your_dataset,
    augment_list=augment_list,
    max_augs_per_seq=2,
    hard_aug=True,
    batch_size=32,
)
# Use in training
for x, y in loader:
    # x has shape (N, A, L) with augmentations applied
    # Your training code here
    pass
Troubleshooting
Common Issues
Import Error: No module named 'evoaug'
# Make sure you installed the correct package name
pip install evoaug2 # NOT evoaug
CUDA/GPU Issues
# Install PyTorch with CUDA support first
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# Then install EvoAug2
pip install evoaug2
Version Conflicts
# Create a clean environment
conda create -n evoaug2 python=3.8
conda activate evoaug2
pip install evoaug2
Memory Issues with Large Datasets
# Reduce batch size or use gradient accumulation
loader = evoaug.RobustLoader(
    base_dataset=dataset,
    augment_list=augment_list,
    batch_size=16,  # Reduce from 32
    num_workers=2,  # Reduce workers if needed
)
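The gradient accumulation mentioned in the comment above can be illustrated without any framework. The sketch below uses a toy one-parameter model (an assumption chosen for illustration) to show that averaging the gradients of equally sized micro-batches reproduces the full-batch gradient, which is why a smaller batch_size plus accumulation can stand in for a larger batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# Gradient computed on the full batch of four samples.
full = grad_mse(w, xs, ys)

# Same gradient assembled from two micro-batches of two samples each.
accum_steps = 2
micro = len(xs) // accum_steps
accumulated = 0.0
for step in range(accum_steps):
    lo, hi = step * micro, (step + 1) * micro
    # Dividing by accum_steps turns the sum of micro-batch means into one mean.
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(abs(full - accumulated) < 1e-9)  # True
```

In a PyTorch loop the equivalent pattern is to divide the loss by accum_steps, call loss.backward() on every batch, and call optimizer.step() and optimizer.zero_grad() only once every accum_steps batches.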
Getting Help
- GitHub Issues: Report bugs at https://github.com/pkoo/evoaug2/issues
- Email: koo@cshl.edu
- Documentation: See example_training.py for complete usage examples
Package Structure
evoaug2/
├── evoaug/ # Core augmentation package
│ ├── __init__.py # Package exports
│ ├── augment.py # Augmentation implementations
│ └── evoaug.py # RobustLoader and dataset classes
├── utils/ # Utility functions
│ ├── __init__.py # Utility exports
│ ├── model_zoo.py # Model architectures
│ └── utils.py # H5Dataset and evaluation tools
├── example_training.py # Complete training example
├── setup.py # Package configuration
├── pyproject.toml # Modern Python packaging
├── requirements.txt # Core dependencies
└── README.md # This file
What changed (RobustModel → RobustLoader)
- The training wrapper is no longer required. Instead of wrapping a model in RobustModel, EvoAug2 provides a RobustLoader that augments data during loading.
- Works with any PyTorch model and any dataset returning (sequence, target) with sequence shaped as (A, L).
- Augmentations can be toggled per-loader: loader.enable_augmentations() / loader.disable_augmentations().
- The fine-tuning stage is implemented by disabling augmentations on the same dataset/loader.
Quick migration:
- Before: wrap the model with evoaug.RobustModel(...) and pass a normal DataLoader.
- Now: create a RobustLoader(base_dataset, augment_list, ...) and pass the loader to your Trainer or training loop.
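The loader-first idea can be pictured as a thin dataset wrapper with an on/off switch. The class below is a conceptual stand-in written for this illustration (the name ToggleAugmentDataset and the string-based augmentation are assumptions, not EvoAug2 code); it only shows why the model no longer needs wrapping and how toggling changes what a training loop receives.

```python
class ToggleAugmentDataset:
    """Conceptual stand-in for an augmenting dataset wrapper (not EvoAug2 code)."""

    def __init__(self, base_dataset, augment_fns, apply_augmentations=True):
        self.base_dataset = base_dataset
        self.augment_fns = augment_fns
        self.apply_augmentations = apply_augmentations

    def __len__(self):
        return len(self.base_dataset)

    def __getitem__(self, idx):
        x, y = self.base_dataset[idx]
        if self.apply_augmentations:
            for fn in self.augment_fns:
                x = fn(x)  # each augmentation must keep the length of x
        return x, y

    def enable_augmentations(self):
        self.apply_augmentations = True

    def disable_augmentations(self):
        self.apply_augmentations = False


def reverse_complement(seq):
    """Toy augmentation on a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))


base = [("AACG", 1.0), ("TTGA", 0.0)]
ds = ToggleAugmentDataset(base, [reverse_complement])
print(ds[0])  # ('CGTT', 1.0)
ds.disable_augmentations()
print(ds[0])  # ('AACG', 1.0)
```

Because the switch lives on the data side, the same model object can be used unchanged for pretraining and fine-tuning.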
Augmentations
from evoaug import augment
augment_list = [
    augment.RandomDeletion(delete_min=0, delete_max=30),
    augment.RandomTranslocation(shift_min=0, shift_max=20),
    augment.RandomInsertion(insert_min=0, insert_max=20),
    augment.RandomRC(rc_prob=0.0),
    augment.RandomMutation(mut_frac=0.05),
    augment.RandomNoise(noise_mean=0.0, noise_std=0.3),
]
All transforms keep sequence length exactly L and operate on batches shaped (N, A, L).
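One way a deletion can remain length-preserving is to remove a contiguous stretch and then pad the flanks with random nucleotides until the sequence is L long again. The sketch below shows this on plain strings, together with a reverse-complement helper that is length-preserving by construction. The function names and the even split of padding between the flanks are assumptions for illustration; this is not the EvoAug2 implementation, which operates on one-hot batches.

```python
import random

BASES = "ACGT"


def random_pad(k, rng):
    """Return k random nucleotides."""
    return "".join(rng.choice(BASES) for _ in range(k))


def random_deletion(seq, delete_min, delete_max, rng):
    """Delete a random contiguous stretch, then pad both flanks with
    random bases so the output is as long as the input."""
    n_del = rng.randint(delete_min, min(delete_max, len(seq)))
    start = rng.randint(0, len(seq) - n_del)
    kept = seq[:start] + seq[start + n_del:]
    left = n_del // 2
    right = n_del - left
    return random_pad(left, rng) + kept + random_pad(right, rng)


def reverse_complement(seq):
    """Reverse the sequence and swap each base for its complement."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))


rng = random.Random(0)  # seeded only to make the demo repeatable
seq = "ACGTTGCAACGTAGCT"
deleted = random_deletion(seq, 0, 8, rng)
print(len(seq) == len(deleted))    # True
print(reverse_complement("AACG"))  # CGTT
```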
Use case 1: Lightning DataModule over a base dataset
This pattern mirrors example_training.py and is recommended for the two-stage workflow.
import pytorch_lightning as pl
from evoaug.evoaug import RobustLoader
from evoaug import augment
from utils import utils # provides H5Dataset with train/val/test splits
# Define augmentations (DeepSTARR-optimal shown in example_training.py)
augment_list = [
    # augment.RandomDeletion(delete_min=0, delete_max=30),
    augment.RandomTranslocation(shift_min=0, shift_max=20),
    # augment.RandomInsertion(insert_min=0, insert_max=20),
    augment.RandomRC(rc_prob=0.0),
    augment.RandomMutation(mut_frac=0.05),
    augment.RandomNoise(noise_mean=0.0, noise_std=0.3),
]
# Base dataset (returns per-split datasets)
base = utils.H5Dataset(filepath, batch_size=batch_size, lower_case=False, transpose=False)
class AugmentedDataModule(pl.LightningDataModule):
    def __init__(self, base_dataset, augment_list, max_augs_per_seq, hard_aug):
        super().__init__()
        self.base_dataset = base_dataset
        self.augment_list = augment_list
        self.max_augs_per_seq = max_augs_per_seq
        self.hard_aug = hard_aug

    def train_dataloader(self):
        train_ds = self.base_dataset.get_train_dataset()
        return RobustLoader(
            base_dataset=train_ds,
            augment_list=self.augment_list,
            max_augs_per_seq=self.max_augs_per_seq,
            hard_aug=self.hard_aug,
            batch_size=self.base_dataset.batch_size,
            shuffle=True,
        )

    def val_dataloader(self):
        val_ds = self.base_dataset.get_val_dataset()
        loader = RobustLoader(
            base_dataset=val_ds,
            augment_list=self.augment_list,
            max_augs_per_seq=self.max_augs_per_seq,
            hard_aug=self.hard_aug,
            batch_size=self.base_dataset.batch_size,
            shuffle=False,
        )
        loader.disable_augmentations()  # no augs for validation
        return loader

    def test_dataloader(self):
        test_ds = self.base_dataset.get_test_dataset()
        loader = RobustLoader(
            base_dataset=test_ds,
            augment_list=self.augment_list,
            max_augs_per_seq=self.max_augs_per_seq,
            hard_aug=self.hard_aug,
            batch_size=self.base_dataset.batch_size,
            shuffle=False,
        )
        loader.disable_augmentations()  # no augs for test
        return loader

# Stage 1: pretrain with augmentations (e.g., 100 epochs)
data_module = AugmentedDataModule(base, augment_list, max_augs_per_seq=2, hard_aug=True)
trainer = pl.Trainer(max_epochs=100, accelerator='auto', devices='auto')
trainer.fit(model, datamodule=data_module)
# Stage 2: fine-tune on original data (disable augmentations)
class FineTuneDataModule(pl.LightningDataModule):
    def __init__(self, base_dataset):
        super().__init__()
        self.base_dataset = base_dataset

    def train_dataloader(self):
        return self.base_dataset.train_dataloader()

    def val_dataloader(self):
        return self.base_dataset.val_dataloader()

    def test_dataloader(self):
        return self.base_dataset.test_dataloader()

finetune_dm = FineTuneDataModule(base)
trainer_finetune = pl.Trainer(max_epochs=5, accelerator='auto', devices='auto')
trainer_finetune.fit(model_finetune, datamodule=finetune_dm)
Use case 2: Vanilla PyTorch loop with RobustLoader
from evoaug.evoaug import RobustLoader
from evoaug import augment
# Your dataset must return (sequence, target) with sequence shape (A, L)
base_dataset = YourDataset(...)
augment_list = [
    augment.RandomTranslocation(shift_min=0, shift_max=20),
    augment.RandomRC(rc_prob=0.0),
    augment.RandomMutation(mut_frac=0.05),
    augment.RandomNoise(noise_mean=0.0, noise_std=0.3),
]
train_loader = RobustLoader(
    base_dataset=base_dataset,
    augment_list=augment_list,
    max_augs_per_seq=2,
    hard_aug=True,
    batch_size=128,
    shuffle=True,
    num_workers=4,
)
for epoch in range(num_epochs):
    model.train()
    for x, y in train_loader:  # x is (N, A, L)
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        y_hat = model(x)
        loss = criterion(y_hat, y)
        loss.backward()
        optimizer.step()
# Validation/test: either use your original non-augmented loader
# or temporarily disable augmentations on the same loader
# train_loader.disable_augmentations()
# for x, y in val_loader: ...
Optional: checkpointing and plotting
EvoAug2 leaves checkpointing/plotting to user code. They are easy to add, and example_training.py shows complete, ready-to-use helpers.
- Checkpoints (best practice): use pytorch_lightning callbacks
import os
import pytorch_lightning as pl
ckpt_name = f"{expt_name}_aug"
ckpt_cb = pl.callbacks.ModelCheckpoint(
    monitor='val_loss', save_top_k=1,
    dirpath=output_dir, filename=ckpt_name,
)
trainer = pl.Trainer(callbacks=[ckpt_cb], max_epochs=100, accelerator='auto', devices='auto')
# trainer.fit(...)
best_ckpt_path = os.path.join(output_dir, ckpt_name + '.ckpt')
# model = LightningModule.load_from_checkpoint(best_ckpt_path, model=model_arch)
- Skipping redundant runs: a minimal helper
import os

def check_existing_checkpoints(output_dir, expt_name):
    # Map each training stage to its expected checkpoint path
    paths = {
        'augmented': os.path.join(output_dir, f"{expt_name}_aug.ckpt"),
        'finetuned': os.path.join(output_dir, f"{expt_name}_finetune.ckpt"),
        'control': os.path.join(output_dir, f"{expt_name}_standard.ckpt"),
    }
    # Return {stage: (exists, path)} so callers can skip finished stages
    return {k: (os.path.exists(p), p) for k, p in paths.items()}
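As a usage sketch, the helper's result can be turned into a list of stages that still need to run. The stages_to_run function and the temporary directory with a dummy checkpoint file are assumptions added for this demonstration; the helper itself is repeated so the snippet is self-contained.

```python
import os
import tempfile


def check_existing_checkpoints(output_dir, expt_name):
    """Return {stage: (exists, path)} for the three expected checkpoints."""
    paths = {
        'augmented': os.path.join(output_dir, f"{expt_name}_aug.ckpt"),
        'finetuned': os.path.join(output_dir, f"{expt_name}_finetune.ckpt"),
        'control': os.path.join(output_dir, f"{expt_name}_standard.ckpt"),
    }
    return {k: (os.path.exists(p), p) for k, p in paths.items()}


def stages_to_run(output_dir, expt_name):
    """Hypothetical helper: names of stages whose checkpoint is missing."""
    status = check_existing_checkpoints(output_dir, expt_name)
    return [name for name, (exists, _) in status.items() if not exists]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend the augmented pretraining stage has already finished.
    open(os.path.join(tmp, "demo_aug.ckpt"), "w").close()
    todo = stages_to_run(tmp, "demo")

print(todo)  # ['finetuned', 'control']
```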
- Plotting performance: compute metrics (Pearson/Spearman) from predictions and create comparison plots with matplotlib/seaborn. See example_training.py for a comprehensive plot_metrics_comparison(...) implementation.
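For reference, the two metrics named above can be written in a few lines of dependency-free Python. This is a sketch for understanding what is being compared, with illustrative function names and made-up toy values; in practice scipy.stats.pearsonr and scipy.stats.spearmanr (SciPy is already a dependency) are the more convenient choice.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


y_true = [0.1, 0.4, 0.35, 0.8]  # toy targets, not real data
y_pred = [0.2, 0.5, 0.3, 0.9]   # toy predictions with the same ordering
print(round(spearman(y_true, y_pred), 6))  # 1.0
```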
API overview
- AugmentedGenomicDataset(base_dataset, augment_list, max_augs_per_seq=0, hard_aug=True, apply_augmentations=True)
  - Wraps any dataset and applies augmentations on-the-fly.
  - enable_augmentations() / disable_augmentations() to toggle.
- RobustLoader(base_dataset, augment_list, max_augs_per_seq, hard_aug, batch_size, shuffle, num_workers, ...)
  - Inherits from torch.utils.data.DataLoader.
  - enable_augmentations() / disable_augmentations() on the underlying dataset.
  - set_augmentations(augment_list, max_augs_per_seq, hard_aug) to update settings without recreating the loader.
All augmentations preserve sequence length L for stable model shapes across training/validation.
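The interaction between max_augs_per_seq and hard_aug is not spelled out here. In the original EvoAug, hard_aug=True applies exactly max_augs_per_seq augmentations to each sequence, while hard_aug=False draws a random count between 1 and that maximum. Assuming EvoAug2 keeps that meaning, the selection step could be sketched as follows; choose_augmentations is a hypothetical helper working on augmentation names, not a library function.

```python
import random


def choose_augmentations(augment_names, max_augs_per_seq, hard_aug, rng):
    """Pick which augmentations to apply to one sequence (illustrative only)."""
    # The cap cannot exceed the number of available augmentations.
    max_augs = min(max_augs_per_seq, len(augment_names))
    if max_augs == 0:
        return []
    # hard_aug fixes the count; otherwise draw a count in [1, max_augs].
    n = max_augs if hard_aug else rng.randint(1, max_augs)
    # Sample without replacement so no augmentation is applied twice.
    return rng.sample(augment_names, n)


rng = random.Random(0)  # seeded only to make the demo repeatable
names = ["deletion", "translocation", "mutation", "noise"]
print(len(choose_augmentations(names, 2, True, rng)))            # 2
print(len(choose_augmentations(names, 2, False, rng)) in (1, 2))  # True
```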
Two-stage workflow (recommended)
- Pretrain with EvoAug2 augmentations using RobustLoader (e.g., 100 epochs).
- Fine-tune the same architecture on original data with augmentations disabled (e.g., 5 epochs, lower LR).
- Optionally, train a control model on original data only for baseline comparison.
This mirrors the EvoAug methodology and typically improves robustness and generalization.
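The workflow can be written down as an explicit per-epoch plan, which is handy when driving a hand-written training loop. The sketch below is only a planning aid: the function name and the two learning-rate values are assumptions chosen for illustration, while the epoch counts follow the examples given above.

```python
def two_stage_schedule(pretrain_epochs=100, finetune_epochs=5,
                       pretrain_lr=1e-3, finetune_lr=1e-4):
    """Build a per-epoch plan: augmentations on, then off with a lower LR."""
    plan = []
    for _ in range(pretrain_epochs):
        plan.append({"stage": "pretrain", "augment": True, "lr": pretrain_lr})
    for _ in range(finetune_epochs):
        plan.append({"stage": "finetune", "augment": False, "lr": finetune_lr})
    return plan


plan = two_stage_schedule()
print(len(plan))  # 105
print(plan[0])    # {'stage': 'pretrain', 'augment': True, 'lr': 0.001}
print(plan[-1])   # {'stage': 'finetune', 'augment': False, 'lr': 0.0001}
```

In a loop over this plan, the augment flag would decide whether to call enable_augmentations() or disable_augmentations() on the loader before each epoch.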
Reference
- Paper: "EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations" (Genome Biology, 2023).
@article{lee2023evoaug,
title={EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations},
author={Lee, Nicholas Keone and Tang, Ziqi and Toneyan, Shushan and Koo, Peter K},
journal={Genome Biology},
volume={24},
number={1},
pages={105},
year={2023},
publisher={Springer}
}