Data distillation methods for genomic sequence-to-function models
distillseq
distillseq provides data distillation and dataset condensation methods designed specifically for genomic deep learning models. Use it to reduce the size of your training dataset while aiming to maintain model performance.
Features
- 🧬 Genomic-focused: Designed for sequence-to-function models (Enformer, etc.)
- 🎯 Multiple methods: Gradient matching, k-mer diversity, gLM diversity, model confidence diversity, and random sampling
- 🧪 Teacher distillation: Self-distillation with soft labels from teacher models
- ⚡ Efficient: Memory-optimized implementations for large genomic datasets
- 🔧 Flexible: Works with any PyTorch dataset and model
- 📊 Tracking: Optional Weights & Biases integration
Installation
pip install distillseq
For additional features:
# Install with genomic language model support
# (quotes keep shells such as zsh from expanding the brackets)
pip install "distillseq[glm]"
# Install with W&B tracking
pip install "distillseq[wandb]"
# Install all optional dependencies
pip install "distillseq[all]"
Quick Start
from distillseq import GradientMatching
import torch
# Your model and dataset (placeholders for your own classes)
model = YourGenomicModel()
dataset = YourGenomicDataset()
# Distill to 10% of original size
distiller = GradientMatching(
    model=model,
    dataset=dataset,
    ratio=0.1,
    device='cuda'
)
# Get distilled dataset
distilled_dataset = distiller.distill()
Methods
1. Gradient Matching (Data Condensation)
Creates synthetic sequences whose training gradients match the gradient distributions of the full dataset.
from distillseq import GradientMatching
distiller = GradientMatching(
    model=model,
    dataset=full_dataset,
    ratio=0.1,
    iterations=1000,
    batch_size=1024
)
synthetic_dataset = distiller.distill()
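To make the idea of matching gradients concrete, here is a minimal sketch of a gradient-matching objective in plain Python. It is not distillseq's implementation: the linear model, the mean squared error loss, the cosine distance, and the names `mse_gradient` and `cosine_distance` are all assumptions chosen for illustration. A real condensation run would repeatedly update the synthetic inputs to lower this distance.

```python
import math

def mse_gradient(weights, inputs, targets):
    """Gradient of the mean squared error for a toy linear model y = w . x."""
    n = len(inputs)
    grad = [0.0] * len(weights)
    for x, y in zip(inputs, targets):
        residual = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * residual * xj / n
    return grad

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither gradient is the zero vector."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / norm

# Hypothetical toy data: four "real" samples and two one-sample synthetic sets.
weights = [0.5, -0.25]
real_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
real_y = [1.0, 0.0, 1.0, 2.0]

real_grad = mse_gradient(weights, real_x, real_y)
aligned_grad = mse_gradient(weights, [[1.0, 0.6]], [1.0])
misaligned_grad = mse_gradient(weights, [[0.0, 1.0]], [1.0])

# A lower distance means the synthetic set pulls the model in the same
# direction as the full dataset does.
aligned_distance = cosine_distance(real_grad, aligned_grad)
misaligned_distance = cosine_distance(real_grad, misaligned_grad)
print(aligned_distance, misaligned_distance)
```

In this toy case a single well-chosen synthetic sample reproduces the gradient direction of all four real samples, which is the property that condensation optimizes for.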
2. K-mer Diversity Sampling
Maximizes sequence diversity using Jensen-Shannon Divergence on k-mer distributions.
from distillseq import KmerDiversity
distiller = KmerDiversity(
    dataset=full_dataset,
    ratio=0.1,
    kmer_length=6,
    n_cores=20
)
diverse_indices = distiller.distill()
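The quantity behind this method can be sketched in a few lines of standard-library Python. The helper names `kmer_distribution` and `jensen_shannon_divergence` are hypothetical and this is not the package's code; it only shows how two sequences are compared through their k-mer frequency distributions.

```python
import math
from collections import Counter

def kmer_distribution(sequence, k):
    """Relative frequency of every overlapping k-mer in the sequence."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}

def jensen_shannon_divergence(p, q):
    """JSD with base-2 logarithms, so the result lies between 0 and 1."""
    divergence = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return divergence

repeat = kmer_distribution("ACGTACGTACGT", k=2)
homopolymer = kmer_distribution("AAAAAAAAAAAA", k=2)

same = jensen_shannon_divergence(repeat, repeat)
different = jensen_shannon_divergence(repeat, homopolymer)
print(same, different)
```

Identical distributions give a divergence of 0, and sequences that share no k-mers reach the maximum of 1, so a selection that favors high pairwise divergence favors compositionally distinct sequences.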
3. gLM Diversity Sampling
Maximizes diversity using genomic language model (DNABERT-S) embeddings.
from distillseq import GLMDiversity
distiller = GLMDiversity(
    dataset=full_dataset,
    ratio=0.1,
    model_name="zhihan1996/DNABERT-S",
    n_clusters=10
)
diverse_indices = distiller.distill()
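One plausible way to turn clustered embeddings into a diverse subset is to draw from the clusters in turn. The sketch below assumes that cluster labels have already been computed from the embeddings (for example with k-means) and uses a hypothetical helper, `cluster_balanced_sample`. How distillseq itself uses `n_clusters` is not specified here, so treat this as an illustration only.

```python
import random
from collections import Counter, defaultdict

def cluster_balanced_sample(cluster_labels, n_select, seed=0):
    """Pick indices round-robin across clusters so that no cluster dominates."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    # Shuffle inside each cluster so the choice within a cluster is random.
    pools = []
    for label in sorted(members):
        pool = members[label]
        rng.shuffle(pool)
        pools.append(pool)

    selected = []
    while len(selected) < n_select and any(pools):
        for pool in pools:
            if pool and len(selected) < n_select:
                selected.append(pool.pop())
    return selected

# Hypothetical labels: cluster 0 is large, clusters 1 and 2 are small.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
picked = cluster_balanced_sample(labels, n_select=6)
per_cluster = Counter(labels[i] for i in picked)
print(picked, per_cluster)
```

Sampling uniformly at random would mostly return members of the large cluster, whereas the round-robin draw takes two sequences from each cluster.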
4. Model Confidence Diversity
Performs stratified sampling across levels of epistemic uncertainty, estimated with a model ensemble or MC dropout.
from distillseq import ModelConfidenceDiversity
distiller = ModelConfidenceDiversity(
    dataset=full_dataset,
    model=trained_model,
    ratio=0.1,
    mc_dropout=10,  # Use MC dropout for uncertainty
    n_bins=100
)
diverse_indices = distiller.distill()
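The two steps described above, estimating uncertainty and then sampling evenly across uncertainty levels, can be sketched as follows. The sketch assumes that uncertainty is the variance of predictions across stochastic forward passes and that bins hold equal numbers of samples; both choices, and the helper names, are illustrative assumptions rather than the package's documented behavior.

```python
import random
import statistics

def mc_uncertainty(passes):
    """Per-sample variance across passes (one list of predictions per pass)."""
    return [statistics.pvariance(sample_preds) for sample_preds in zip(*passes)]

def stratified_by_uncertainty(uncertainty, n_select, n_bins, seed=0):
    """Draw up to n_select indices, spread evenly over equal-frequency bins."""
    rng = random.Random(seed)
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    n = len(order)
    selected = []
    for b in range(n_bins):
        bin_members = order[b * n // n_bins:(b + 1) * n // n_bins]
        # Quotas are spread so that they sum to n_select over all bins.
        quota = (b + 1) * n_select // n_bins - b * n_select // n_bins
        selected.extend(rng.sample(bin_members, min(quota, len(bin_members))))
    return selected

# Hypothetical predictions from three stochastic passes over eight samples;
# the spread between passes grows with the sample index.
passes = [
    [0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50],
    [0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57],
    [0.50, 0.49, 0.48, 0.47, 0.46, 0.45, 0.44, 0.43],
]
uncertainty = mc_uncertainty(passes)
chosen = stratified_by_uncertainty(uncertainty, n_select=4, n_bins=4)
print(chosen)
```

With eight samples and four bins, each bin holds two samples and contributes one index, so the subset covers confident and uncertain predictions alike.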
5. Random Sampling
Baseline method for comparison.
from distillseq import RandomSampling
distiller = RandomSampling(
    dataset=full_dataset,
    ratio=0.1,
    seed=42
)
random_indices = distiller.distill()
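For reference, a seeded random baseline is easy to reproduce with the standard library. The helper `random_subset` below is hypothetical and only illustrates why a fixed seed matters when comparing methods: the same seed always returns the same indices.

```python
import random

def random_subset(n_total, ratio, seed):
    """Sample round(n_total * ratio) distinct indices, reproducibly."""
    n_select = max(1, round(n_total * ratio))
    return sorted(random.Random(seed).sample(range(n_total), n_select))

first = random_subset(1000, 0.1, seed=42)
second = random_subset(1000, 0.1, seed=42)
print(len(first), first == second)
```

Using a dedicated `random.Random(seed)` instance, rather than seeding the global generator, keeps the baseline independent of any other random calls in the program.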
Teacher Distillation (Self-Distillation)
Apply teacher model predictions to create soft labels after distillation has selected the samples. Because the teacher runs only on the selected samples, this is much more efficient than pre-computing predictions for the entire dataset.
from torch.utils.data import DataLoader
from distillseq import apply_teacher_predictions, KmerDiversity
# Step 1: Distill to select important samples first
distiller = KmerDiversity(dataset=original_dataset, ratio=0.1)
indices = distiller.distill()
# Step 2: Apply teacher predictions ONLY to selected samples
teacher_dataset = apply_teacher_predictions(
    dataset=original_dataset,
    indices=indices,
    teacher_models=trained_model,  # Or [model1, model2, model3] for ensemble
    device='cuda'
)
# Step 3: Train on distilled data with soft labels
train_loader = DataLoader(teacher_dataset, batch_size=32)
Key advantages:
- Efficient: Only compute predictions for selected samples (10-20x speedup typical)
- Ensemble support: Automatically averages predictions from multiple teachers
- MC dropout: Incorporate epistemic uncertainty
- Compatible: Works with all 5 distillation methods
See Teacher Distillation Tutorial for comprehensive examples.
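The select-then-label order and the ensemble averaging can be illustrated with a small stand-alone sketch. The teachers here are toy Python functions rather than trained models, and `soft_labels_for_subset` is a hypothetical helper, not the `apply_teacher_predictions` function itself.

```python
def soft_labels_for_subset(sequences, indices, teachers):
    """Average the teachers' predictions, evaluated only on selected indices."""
    labels = {}
    for index in indices:
        predictions = [teacher(sequences[index]) for teacher in teachers]
        labels[index] = sum(predictions) / len(predictions)
    return labels

calls = []  # records which sequences the first teacher actually saw

def gc_teacher(sequence):
    """Toy teacher: predicts the GC fraction of the sequence."""
    calls.append(sequence)
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def starts_with_g_teacher(sequence):
    """Toy teacher: predicts 1.0 when the sequence starts with G."""
    return 1.0 if sequence.startswith("G") else 0.0

sequences = ["GGCC", "ATAT", "GCAT", "AAAA", "CCCC"]
selected = [0, 2]  # indices returned by any of the selection methods
soft_labels = soft_labels_for_subset(
    sequences, selected, [gc_teacher, starts_with_g_teacher]
)
print(soft_labels, len(calls))
```

Only two of the five sequences are passed to the teachers, so the labeling cost scales with the size of the distilled subset rather than with the full dataset.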
Documentation
Tutorials
- Basic Usage Tutorial - Getting started with distillation methods
- Teacher Distillation Tutorial - Self-distillation with soft labels
- Basic Usage Notebook - Interactive Jupyter notebook
Guides
- Teacher Distillation Guide - Comprehensive reference
- Quick Start Guide
- Installation Guide
Full documentation is available at https://distillseq.readthedocs.io
Citation
If you use distillseq in your research, please cite:
Citation details will be added when available.
License
MIT License - see LICENSE file for details.