Evolution-Inspired Data Augmentation for Genomic Sequences - DataLoader Version

EvoAug2

EvoAug2 is a PyTorch package to pretrain sequence-based deep learning models for regulatory genomics with evolution-inspired data augmentations, followed by fine-tuning on the original, unperturbed data. The new version replaces the prior model-wrapper (RobustModel) with a loader-first design (RobustLoader) that applies augmentations on-the-fly within a drop-in DataLoader.

All augmentations are length-preserving: inputs with shape (N, A, L) always return outputs with the exact same shape.
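
To make "length-preserving" concrete, here is a minimal NumPy sketch. It one-hot encodes a DNA string into an (A, L) array, deletes a segment, and pads the flanks with random nucleotides so the output keeps length L. The helper names (one_hot, delete_and_pad), the A/C/G/T channel order, and the random-flank padding are assumptions for illustration, not EvoAug2's implementation.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as an (A, L) one-hot array."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    x = np.zeros((len(ALPHABET), len(seq)), dtype=np.float32)
    x[idx, np.arange(len(seq))] = 1.0
    return x

def delete_and_pad(x, start, length, rng):
    """Remove `length` positions starting at `start`, then pad both flanks
    with random one-hot columns so the result has the original length."""
    A, L = x.shape
    kept = np.concatenate([x[:, :start], x[:, start + length:]], axis=1)
    pad = np.zeros((A, length), dtype=x.dtype)
    pad[rng.integers(0, A, size=length), np.arange(length)] = 1.0
    left = length // 2  # split the padding between the two flanks
    return np.concatenate([pad[:, :left], kept, pad[:, left:]], axis=1)

rng = np.random.default_rng(0)
x = one_hot("ACGTACGTACGT")
x_aug = delete_and_pad(x, start=3, length=4, rng=rng)
print(x.shape, x_aug.shape)  # both shapes are identical
```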

📚 Read the Documentation

For questions, email: koo@cshl.edu

Install

pip install evoaug2

Installation Options

Option 1: Install from PyPI (Recommended)

# Install the latest stable release
pip install evoaug2

# Install with specific version
pip install evoaug2==2.0.2

# Install with optional dependencies for examples
# (quotes keep shells such as zsh from expanding the brackets)
pip install "evoaug2[examples]"

# Install with all optional dependencies
pip install "evoaug2[full]"

Option 2: Install from Source (Development)

# Clone the repository
git clone https://github.com/aduranu/evoaug.git
cd evoaug

# Install in development mode
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"

Option 3: Install with Conda/Mamba

# Create a new environment (recommended)
conda create -n evoaug2 python=3.8
conda activate evoaug2

# Install PyTorch first (choose appropriate version)
conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia

# Install EvoAug2
pip install evoaug2

Dependencies

torch >= 1.9.0
pytorch-lightning >= 1.5.0
numpy >= 1.20.0
scipy >= 1.7.0
h5py >= 3.1.0
scikit-learn >= 1.0.0

Note: The examples use the pytorch_lightning package (import pytorch_lightning as pl). If you use the newer lightning.pytorch package, adapt the Trainer import and arguments accordingly.
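
One way to support both Lightning packages is to detect which one is installed before importing it. The sketch below uses only the standard library; pick_lightning_module is a hypothetical helper, and preferring pytorch_lightning is an assumption made because the examples here import it.

```python
import importlib
import importlib.util

def pick_lightning_module():
    """Return the name of an installed Lightning module, or None if neither is found."""
    # Prefer the legacy package, since the examples here import it.
    if importlib.util.find_spec("pytorch_lightning") is not None:
        return "pytorch_lightning"
    # The unified `lightning` distribution exposes the same API under lightning.pytorch.
    if importlib.util.find_spec("lightning") is not None:
        return "lightning.pytorch"
    return None

module_name = pick_lightning_module()
# Bind whichever module was found to the usual `pl` alias.
pl = importlib.import_module(module_name) if module_name else None
print("Using:", module_name)
```

If pl is None, neither package is installed and the Lightning example cannot run; the vanilla PyTorch loop does not need it.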

Documentation

📚 Full documentation is available at evoaug2.readthedocs.io

The documentation includes:

User Guide: Installation, configuration, and usage examples
API Reference: Complete API documentation for all classes and functions
Examples: Detailed examples with PyTorch Lightning and vanilla PyTorch
Advanced Topics: Architecture details and customization options

Quick Start

# Install the package first (shell command, not Python):
#   pip install evoaug2

# Import and use
from evoaug import evoaug, augment
from evoaug_utils import utils

# Create augmentations
augment_list = [
    augment.RandomDeletion(delete_min=0, delete_max=20),
    augment.RandomRC(rc_prob=0.5),
    augment.RandomMutation(mut_frac=0.05),
]

# Create a RobustLoader
loader = evoaug.RobustLoader(
    base_dataset=your_dataset,
    augment_list=augment_list,
    max_augs_per_seq=2,
    hard_aug=True,
    batch_size=32
)

# Use in training
for x, y in loader:
    # x has shape (N, A, L) with augmentations applied
    # Your training code here
    pass
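
The quick start passes max_augs_per_seq and hard_aug without describing them. In the original EvoAug package, hard_aug=True applies exactly max_augs_per_seq augmentations to each sequence, while hard_aug=False draws the count at random between 1 and that maximum. The sketch below assumes EvoAug2 keeps that behavior; choose_augmentations is a hypothetical helper, not part of the EvoAug2 API.

```python
import random

def choose_augmentations(augment_names, max_augs_per_seq, hard_aug, rng):
    """Pick which augmentations to apply to one sequence (hypothetical helper)."""
    max_augs = min(max_augs_per_seq, len(augment_names))
    # hard_aug=True: always use the maximum; otherwise draw a count in [1, max_augs].
    n = max_augs if hard_aug else rng.randint(1, max_augs)
    # Sample without replacement so each augmentation is used at most once.
    return rng.sample(augment_names, n)

rng = random.Random(0)
names = ["RandomDeletion", "RandomRC", "RandomMutation"]
hard = [choose_augmentations(names, 2, True, rng) for _ in range(5)]
soft = [choose_augmentations(names, 2, False, rng) for _ in range(5)]
print(hard)
print(soft)
```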

Use Cases

EvoAug2 provides two main usage patterns, both demonstrated in the included example scripts:

Use Case 1: PyTorch Lightning DataModule (Recommended)

The example_lightning_module.py script demonstrates the complete two-stage training workflow:

from evoaug.evoaug import RobustLoader
from evoaug import augment
import pytorch_lightning as pl

# Define augmentations
augment_list = [
    augment.RandomTranslocation(shift_min=0, shift_max=20),
    augment.RandomRC(rc_prob=0.0),
    augment.RandomMutation(mut_frac=0.05),
    augment.RandomNoise(noise_mean=0.0, noise_std=0.3),
]

# Create Lightning DataModule with augmentations
class AugmentedDataModule(pl.LightningDataModule):
    def __init__(self, base_dataset, augment_list, max_augs_per_seq, hard_aug):
        super().__init__()
        self.base_dataset = base_dataset
        self.augment_list = augment_list
        self.max_augs_per_seq = max_augs_per_seq
        self.hard_aug = hard_aug
        
    def train_dataloader(self):
        # Training with augmentations
        train_dataset = self.base_dataset.get_train_dataset()
        return RobustLoader(
            base_dataset=train_dataset,
            augment_list=self.augment_list,
            max_augs_per_seq=self.max_augs_per_seq,
            hard_aug=self.hard_aug,
            batch_size=self.base_dataset.batch_size,
            shuffle=True
        )
    
    def val_dataloader(self):
        # Validation without augmentations
        val_dataset = self.base_dataset.get_val_dataset()
        loader = RobustLoader(
            base_dataset=val_dataset,
            augment_list=self.augment_list,
            max_augs_per_seq=self.max_augs_per_seq,
            hard_aug=self.hard_aug,
            batch_size=self.base_dataset.batch_size,
            shuffle=False
        )
        loader.disable_augmentations()  # No augs for validation
        return loader

# Two-stage training workflow (model and base_dataset are placeholders defined elsewhere)
# Stage 1: Train with augmentations
data_module = AugmentedDataModule(base_dataset, augment_list, max_augs_per_seq=2, hard_aug=True)
trainer = pl.Trainer(max_epochs=100, accelerator='auto', devices='auto')
trainer.fit(model, datamodule=data_module)

# Stage 2: Fine-tune on original data
# (model_finetune is a placeholder for the pretrained model, e.g. set up with a lower LR)
class FineTuneDataModule(pl.LightningDataModule):
    def __init__(self, base_dataset):
        super().__init__()
        self.base_dataset = base_dataset
    def train_dataloader(self):
        return self.base_dataset.train_dataloader()
    def val_dataloader(self):
        return self.base_dataset.val_dataloader()

finetune_dm = FineTuneDataModule(base_dataset)
trainer_finetune = pl.Trainer(max_epochs=5, accelerator='auto', devices='auto')
trainer_finetune.fit(model_finetune, datamodule=finetune_dm)

Key Features:

Automatic checkpoint management and resuming
Comprehensive performance comparison plots
Two-stage training: augmentations → fine-tuning
Control model training for baseline comparison

Use Case 2: Vanilla PyTorch Training Loop

The example_vanilla_pytorch.py script shows direct usage without Lightning:

from evoaug.evoaug import RobustLoader
from evoaug import augment
import torch
import torch.nn as nn

# Create augmentations
augment_list = [
    augment.RandomTranslocation(shift_min=0, shift_max=20),
    augment.RandomRC(rc_prob=0.0),
    augment.RandomMutation(mut_frac=0.05),
    augment.RandomNoise(noise_mean=0.0, noise_std=0.3),
]

# Create RobustLoader
train_loader = RobustLoader(
    base_dataset=base_dataset,
    augment_list=augment_list,
    max_augs_per_seq=2,
    hard_aug=True,
    batch_size=128,
    shuffle=True,
    num_workers=4,
)

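# Training loop (Model, base_dataset, and num_epochs are placeholders to define for your task)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Model(...).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)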

for epoch in range(num_epochs):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        y_hat = model(x)
        loss = criterion(y_hat, y)
        loss.backward()
        optimizer.step()

Key Features:

Minimal dependencies (no Lightning required)
Simple CNN architecture with global average pooling
Direct control over training loop
Easy to modify and extend

Troubleshooting

Common Issues

Import Error: No module named 'evoaug'

# Make sure you installed the correct package name
pip install evoaug2  # NOT evoaug
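
The distribution is published as evoaug2 but imported as evoaug, which is easy to mix up. The standard-library sketch below reports whether the distribution is installed and whether the module can be imported. describe_install is a hypothetical helper for illustration.

```python
import importlib.util
from importlib import metadata

def describe_install(dist_name="evoaug2", module_name="evoaug"):
    """Report whether the distribution is installed and the module is importable."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # distribution not installed in this environment
    importable = importlib.util.find_spec(module_name) is not None
    return {
        "distribution": dist_name,
        "version": version,
        "module": module_name,
        "importable": importable,
    }

info = describe_install()
print(info)
```

A version of None together with importable False suggests the package was installed into a different environment than the one running Python.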

CUDA/GPU Issues

# Install PyTorch with CUDA support first
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# Then install EvoAug2
pip install evoaug2

Version Conflicts

# Create a clean environment
conda create -n evoaug2 python=3.8
conda activate evoaug2
pip install evoaug2

Memory Issues with Large Datasets

# Reduce batch size or use gradient accumulation
loader = evoaug.RobustLoader(
    base_dataset=dataset,
    augment_list=augment_list,
    batch_size=16,  # Reduce from 32
    num_workers=2   # Reduce workers if needed
)
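
Gradient accumulation, mentioned in the comment above, can be sketched in plain PyTorch as follows. A standard DataLoader over random tensors stands in for the RobustLoader, and the toy model, accum_steps value, and data shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy stand-ins: 64 random "sequences" of shape (A=4, L=50) with scalar targets.
x = torch.randn(64, 4, 50)
y = torch.randn(64, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=16)  # small batches

model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 50, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

accum_steps = 2  # effective batch size = 16 * 2 = 32
optimizer_steps = 0

model.train()
optimizer.zero_grad()
for step, (xb, yb) in enumerate(loader, start=1):
    # Scale the loss so the accumulated gradient matches one larger batch.
    loss = criterion(model(xb), yb) / accum_steps
    loss.backward()  # gradients add up across backward calls
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print("optimizer steps:", optimizer_steps)
```

Memory use follows the small batch size, while each optimizer update sees gradients from accum_steps batches.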

Getting Help

GitHub Issues: Report bugs at https://github.com/aduranu/evoaug/issues
Email: koo@cshl.edu
Documentation: See example scripts for complete usage examples

Package Structure

evoaug2/
├── evoaug/                 # Core augmentation package
│   ├── __init__.py         # Package exports
│   ├── augment.py          # Augmentation implementations
│   └── evoaug.py           # RobustLoader and dataset classes
├── evoaug_utils/           # Utility functions
│   ├── __init__.py         # Utility exports
│   ├── model_zoo.py        # Model architectures
│   └── utils.py            # H5Dataset and evaluation tools
├── example_lightning_module.py  # Complete Lightning training example
├── example_vanilla_pytorch.py   # Simple PyTorch training example
├── setup.py                 # Package configuration
├── pyproject.toml          # Modern Python packaging
├── requirements.txt         # Core dependencies
└── README.md               # This file

What changed (RobustModel → RobustLoader)

The training wrapper is no longer required. Instead of wrapping a model in RobustModel, EvoAug2 provides a RobustLoader that augments data during loading.
Works with any PyTorch model and any dataset that returns (sequence, target) pairs, with each sequence shaped (A, L).
Augmentations can be toggled per-loader: loader.enable_augmentations() / loader.disable_augmentations().
Fine-tuning stage is implemented by disabling augmentations on the same dataset/loader.
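
A base dataset that satisfies the (sequence, target) contract above might look like the following sketch. OneHotSequenceDataset is a hypothetical toy class filled with random data; it is only iterated with a standard DataLoader here, and passing it to RobustLoader is untested.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class OneHotSequenceDataset(Dataset):
    """Toy dataset yielding (sequence, target) pairs with sequence shaped (A, L)."""

    def __init__(self, num_seqs=8, alphabet_size=4, length=20, seed=0):
        g = torch.Generator().manual_seed(seed)
        idx = torch.randint(0, alphabet_size, (num_seqs, length), generator=g)
        # one_hot returns (N, L, A); permute to (N, A, L).
        self.x = torch.nn.functional.one_hot(idx, alphabet_size).permute(0, 2, 1).float()
        self.y = torch.rand(num_seqs, 1, generator=g)

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, i):
        return self.x[i], self.y[i]

dataset = OneHotSequenceDataset()
seq, target = dataset[0]
xb, yb = next(iter(DataLoader(dataset, batch_size=4)))
print(seq.shape, xb.shape)  # one item is (A, L); a batch is (N, A, L)
```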

Quick migration:

Before: wrap model with evoaug.RobustModel(...) and pass a normal DataLoader.
Now: create a RobustLoader(base_dataset, augment_list, ...) and pass the loader to your Trainer or training loop.
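
The loader-first idea can be illustrated without EvoAug2 at all. The sketch below wraps an iterable of batches and applies an augmentation only while a flag is set. ToggleAugmentLoader and add_noise are hypothetical stand-ins that borrow the enable/disable method names described above; they do not reproduce RobustLoader's internals.

```python
import numpy as np

class ToggleAugmentLoader:
    """Wrap an iterable of (x, y) batches; apply `augment_fn` to x only while enabled."""

    def __init__(self, batches, augment_fn):
        self.batches = batches
        self.augment_fn = augment_fn
        self.enabled = True

    def enable_augmentations(self):
        self.enabled = True

    def disable_augmentations(self):
        self.enabled = False

    def __iter__(self):
        for x, y in self.batches:
            yield (self.augment_fn(x) if self.enabled else x), y

def add_noise(x, std=0.3, rng=np.random.default_rng(0)):
    # Gaussian noise leaves the (N, A, L) shape unchanged.
    return x + rng.normal(0.0, std, size=x.shape)

batches = [(np.zeros((2, 4, 10)), np.zeros((2, 1))) for _ in range(3)]
loader = ToggleAugmentLoader(batches, add_noise)

pretrain_x = [x for x, _ in loader]  # augmentations on
loader.disable_augmentations()
finetune_x = [x for x, _ in loader]  # original data passes through untouched
print(len(pretrain_x), len(finetune_x))
```

Because the model never sees the wrapper, the same training loop serves both stages; only the loader's flag changes.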

Augmentations

from evoaug import augment

augment_list = [
    augment.RandomDeletion(delete_min=0, delete_max=30),
    augment.RandomTranslocation(shift_min=0, shift_max=20),
    augment.RandomInsertion(insert_min=0, insert_max=20),
    augment.RandomRC(rc_prob=0.0),
    augment.RandomMutation(mut_frac=0.05),
    augment.RandomNoise(noise_mean=0.0, noise_std=0.3),
]

All transforms keep sequence length exactly L and operate on batches shaped (N, A, L).
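
As a rough picture of how two of these transforms can act on a one-hot batch shaped (N, A, L), here is a NumPy sketch of reverse-complement and random mutation. The A/C/G/T channel order and both function bodies are assumptions for illustration; EvoAug2's own implementations may differ.

```python
import numpy as np

def reverse_complement(x):
    """Reverse-complement a one-hot batch (N, A, L), assuming A/C/G/T channel order.
    Flipping the channel axis swaps A<->T and C<->G; flipping the last axis reverses."""
    return x[:, ::-1, ::-1].copy()

def random_mutation(x, mut_frac, rng):
    """Overwrite a fraction `mut_frac` of positions per sequence with random bases.
    A drawn base may equal the original, so the realized mutation rate is lower."""
    N, A, L = x.shape
    out = x.copy()
    num_mut = int(round(mut_frac * L))
    for n in range(N):
        seq = out[n]  # (A, L) view into `out`
        pos = rng.choice(L, size=num_mut, replace=False)
        new_base = rng.integers(0, A, size=num_mut)
        seq[:, pos] = 0.0
        seq[new_base, pos] = 1.0
    return out

rng = np.random.default_rng(0)
idx = rng.integers(0, 4, size=(2, 20))
x = np.eye(4, dtype=np.float32)[idx].transpose(0, 2, 1)  # (N, A, L)
x_rc = reverse_complement(x)
x_mut = random_mutation(x, mut_frac=0.05, rng=rng)
print(x.shape, x_rc.shape, x_mut.shape)
```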

Two-stage workflow (recommended)

Pretrain with EvoAug2 augmentations using RobustLoader (e.g., 100 epochs).
Fine-tune the same architecture on original data with augmentations disabled (e.g., 5 epochs, lower LR).
Optionally, train a control model on original data only for baseline comparison.

This mirrors the EvoAug methodology and typically improves robustness and generalization.
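
The two stages can be sketched end to end with a toy model in plain PyTorch. Gaussian noise stands in for the augmentations, and the data, model, learning rates, and epoch counts are all assumptions chosen so the example runs quickly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 4, 50)           # toy data standing in for (N, A, L) sequences
y = x.mean(dim=(1, 2)).unsqueeze(1)  # toy regression target

model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 50, 1))
criterion = nn.MSELoss()

def run_stage(model, inputs_fn, lr, epochs):
    """Train with full-batch updates; `inputs_fn` returns the (possibly perturbed) inputs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(inputs_fn()), y)
        loss.backward()
        optimizer.step()
    return loss.item()

# Stage 1: pretrain on perturbed inputs (noise stands in for augmentations).
stage1_loss = run_stage(model, lambda: x + 0.3 * torch.randn_like(x), lr=1e-3, epochs=20)
# Stage 2: fine-tune the same weights on the original inputs with a lower learning rate.
stage2_loss = run_stage(model, lambda: x, lr=1e-4, epochs=5)
print(f"stage 1 loss {stage1_loss:.4f}, stage 2 loss {stage2_loss:.4f}")
```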

Reference

Paper: "EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations" (Genome Biology, 2023).

@article{lee2023evoaug,
  title={EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations},
  author={Lee, Nicholas Keone and Tang, Ziqi and Toneyan, Shushan and Koo, Peter K},
  journal={Genome Biology},
  volume={24},
  number={1},
  pages={105},
  year={2023},
  publisher={Springer}
}

Release history

2.0.4 (this version): Aug 25, 2025
2.0.3: Aug 19, 2025
2.0.2: Aug 19, 2025
2.0.1: Aug 19, 2025
2.0.0: Aug 19, 2025

Download files

Source Distribution

evoaug2-2.0.4.tar.gz (28.0 kB), uploaded Aug 25, 2025, Source

Built Distribution

evoaug2-2.0.4-py3-none-any.whl (24.0 kB), uploaded Aug 25, 2025, Python 3

File details

Details for the file evoaug2-2.0.4.tar.gz.

File metadata

Download URL: evoaug2-2.0.4.tar.gz
Upload date: Aug 25, 2025
Size: 28.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for evoaug2-2.0.4.tar.gz
Algorithm	Hash digest
SHA256	`8fb4c76d2c4fd3d5e3442a1d1451a2a6826b5f07cf5c2bc82bbde28e997f1dac`
MD5	`71bb44811ef825f03ea7b0d4d13842f9`
BLAKE2b-256	`3913309fb078a1a99d213bee2da5e65406d1faf21366f3b7ce7e21bfc05d775c`
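
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. The sketch below streams a file through hashlib; sha256_of is a hypothetical helper, and the demo hashes a temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a temporary file; for a real check, pass the path of the downloaded
# evoaug2-2.0.4.tar.gz and compare the result with the listed SHA256 value.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
demo_path = tmp.name
demo_digest = sha256_of(demo_path)
os.remove(demo_path)
print(demo_digest == hashlib.sha256(b"example bytes").hexdigest())
```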

File details

Details for the file evoaug2-2.0.4-py3-none-any.whl.

File metadata

Download URL: evoaug2-2.0.4-py3-none-any.whl
Upload date: Aug 25, 2025
Size: 24.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for evoaug2-2.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`80fcca2e95cb9e6dce08e377982426e0bff044e4bc0248e0fb885023fe0703a7`
MD5	`09d1e34253d7d2e889ec8fd145240189`
BLAKE2b-256	`f53160febb3f7c6dc7fae039ad91db8c0064ff7c628582adff833feb6a4541fc`
