Data distillation methods for genomic sequence-to-function models

distillseq

distillseq provides data distillation and dataset condensation methods designed for genomic deep learning models. Use it to reduce the size of your training dataset while aiming to maintain model performance.

Features

  • 🧬 Genomic-focused: Designed for sequence-to-function models (Enformer, etc.)
  • 🎯 Multiple methods: Gradient matching, k-mer diversity, gLM diversity, model confidence diversity, and random sampling
  • 🧪 Teacher distillation: Self-distillation with soft labels from teacher models
  • Efficient: Memory-optimized implementations for large genomic datasets
  • 🔧 Flexible: Works with any PyTorch dataset and model
  • 📊 Tracking: Optional Weights & Biases integration

Installation

pip install distillseq

For additional features:

# Install with genomic language model support
pip install distillseq[glm]

# Install with W&B tracking
pip install distillseq[wandb]

# Install all optional dependencies
pip install distillseq[all]

Quick Start

from distillseq import GradientMatching
import torch

# Your model and dataset
model = YourGenomicModel()
dataset = YourGenomicDataset()

# Distill to 10% of original size
distiller = GradientMatching(
    model=model,
    dataset=dataset,
    ratio=0.1,
    device='cuda'
)

# Get distilled dataset
distilled_dataset = distiller.distill()

Methods

1. Gradient Matching (Data Condensation)

Creates synthetic sequences whose training gradients match the gradients produced by the full dataset.

from distillseq import GradientMatching

distiller = GradientMatching(
    model=model,
    dataset=full_dataset,
    ratio=0.1,
    iterations=1000,
    batch_size=1024
)
synthetic_dataset = distiller.distill()
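
To make the idea of gradient matching concrete, here is a minimal sketch of the kind of objective it minimizes. It is not distillseq's implementation: it assumes a toy linear model with mean squared error, uses one minus cosine similarity as the distance between gradients, and every name in it (mse_gradient, gradient_matching_loss, the toy data) is hypothetical.

```python
import math

def mse_gradient(weights, inputs, targets):
    """Gradient of the mean squared error for a linear model y = w . x."""
    grad = [0.0] * len(weights)
    for x, y in zip(inputs, targets):
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y
        for j, xi in enumerate(x):
            grad[j] += 2.0 * err * xi / len(inputs)
    return grad

def gradient_matching_loss(g_real, g_syn):
    """One minus the cosine similarity of two gradient vectors (assumed distance)."""
    dot = sum(a * b for a, b in zip(g_real, g_syn))
    norm = math.sqrt(sum(a * a for a in g_real)) * math.sqrt(sum(b * b for b in g_syn))
    return 1.0 - dot / norm if norm > 0 else 1.0

# Toy data: four "real" samples and a single synthetic sample.
weights = [0.5, -0.25]
real_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
real_y = [1.0, 0.0, 1.0, 2.0]
syn_x = [[1.5, 0.5]]
syn_y = [1.5]

g_real = mse_gradient(weights, real_x, real_y)
g_syn = mse_gradient(weights, syn_x, syn_y)
loss = gradient_matching_loss(g_real, g_syn)
print(f"gradient matching loss: {loss:.4f}")
```

In a condensation loop, the synthetic inputs would be updated to lower this loss, so that training on the small synthetic set moves the model in roughly the same direction as training on the full dataset.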

2. K-mer Diversity Sampling

Maximizes sequence diversity using Jensen-Shannon Divergence on k-mer distributions.

from distillseq import KmerDiversity

distiller = KmerDiversity(
    dataset=full_dataset,
    ratio=0.1,
    kmer_length=6,
    n_cores=20
)
diverse_indices = distiller.distill()
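
The distance measure named above can be sketched in a few lines. This is an illustration rather than the library's code: the helper names kmer_distribution and jensen_shannon_divergence are hypothetical, and the sketch assumes base-2 logarithms so that the divergence falls between 0 and 1. How distillseq turns these distances into a selection is not shown here.

```python
import math
from collections import Counter

def kmer_distribution(sequence, k):
    """Relative frequency of every overlapping k-mer in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits between two k-mer distributions."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2
    m = {key: 0.5 * (p.get(key, 0.0) + q.get(key, 0.0)) for key in keys}

    def kl(a):
        return sum(pa * math.log2(pa / m[key]) for key, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Two sequences that share no 2-mers, and one compared with itself.
dist_a = kmer_distribution("ACGTACGTAC", k=2)
dist_b = kmer_distribution("TTTTTTTTTT", k=2)
jsd_different = jensen_shannon_divergence(dist_a, dist_b)
jsd_same = jensen_shannon_divergence(dist_a, dist_a)
print(jsd_different, jsd_same)
```

Sequences with similar k-mer composition score near 0 and sequences with disjoint composition score near 1, so preferring samples with high pairwise divergence favors a compositionally diverse subset.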

3. gLM Diversity Sampling

Maximizes diversity using genomic language model (DNABERT-S) embeddings.

from distillseq import GLMDiversity

distiller = GLMDiversity(
    dataset=full_dataset,
    ratio=0.1,
    model_name="zhihan1996/DNABERT-S",
    n_clusters=10
)
diverse_indices = distiller.distill()
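
One plausible way to use clusters for diversity is to draw the same fraction from every cluster, so that small clusters are not crowded out. The sketch below assumes cluster labels have already been computed from the embeddings (for example with k-means) and that sampling is balanced per cluster; both are assumptions, and sample_per_cluster is a hypothetical helper, not part of distillseq.

```python
import random
from collections import defaultdict

def sample_per_cluster(cluster_labels, ratio, seed=0):
    """Pick roughly `ratio` of the members of each cluster (at least one)."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)
    selected = []
    for label in sorted(members):
        n_keep = max(1, round(ratio * len(members[label])))
        selected.extend(rng.sample(members[label], n_keep))
    return sorted(selected)

# Toy labels: three clusters of very different sizes.
labels = [0] * 10 + [1] * 20 + [2] * 70
picked = sample_per_cluster(labels, ratio=0.1)
per_cluster = [sum(1 for i in picked if labels[i] == c) for c in (0, 1, 2)]
print(len(picked), per_cluster)
```

Keeping at least one sample per cluster guarantees that every region of the embedding space identified by the clustering is represented in the subset.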

4. Model Confidence Diversity

Performs stratified sampling across epistemic uncertainty levels, estimated with a model ensemble or MC dropout.

from distillseq import ModelConfidenceDiversity

distiller = ModelConfidenceDiversity(
    dataset=full_dataset,
    model=trained_model,
    ratio=0.1,
    mc_dropout=10,  # Use MC dropout for uncertainty
    n_bins=100
)
diverse_indices = distiller.distill()
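
The two steps described above, estimating uncertainty and then sampling across its range, can be sketched as follows. This is a simplified illustration under stated assumptions: uncertainty is taken as the variance of predictions across stochastic forward passes, bins are equal-frequency over the uncertainty ranking, and mc_uncertainty and stratified_sample are hypothetical names.

```python
import math
import random
import statistics

def mc_uncertainty(passes):
    """Per-sample variance across passes; passes[p][i] is sample i on pass p."""
    n_samples = len(passes[0])
    return [statistics.pvariance([run[i] for run in passes]) for i in range(n_samples)]

def stratified_sample(uncertainty, ratio, n_bins, seed=0):
    """Sort samples by uncertainty, split into bins, and sample from each bin."""
    rng = random.Random(seed)
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    bin_size = math.ceil(len(order) / n_bins)
    selected = []
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        n_keep = max(1, round(ratio * len(members)))
        selected.extend(rng.sample(members, n_keep))
    return sorted(selected)

# Three stochastic passes over two samples: the first is stable, the second is not.
toy_uncertainty = mc_uncertainty([[0.1, 0.5], [0.1, 0.7], [0.1, 0.9]])

# 100 samples whose uncertainty happens to increase with their index.
uncertainty = [i / 100 for i in range(100)]
chosen = stratified_sample(uncertainty, ratio=0.1, n_bins=10)
print(toy_uncertainty, chosen)
```

Sampling from every bin keeps both confident and uncertain examples in the subset, instead of concentrating on one end of the uncertainty range.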

5. Random Sampling

Baseline method for comparison.

from distillseq import RandomSampling

distiller = RandomSampling(
    dataset=full_dataset,
    ratio=0.1,
    seed=42
)
random_indices = distiller.distill()
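
For reference, a seeded random baseline takes only a few lines of standard-library Python. The helper random_subset below is hypothetical and is shown only to illustrate why fixing the seed matters: the same seed always yields the same subset, which keeps comparisons against the other methods reproducible.

```python
import random

def random_subset(n_items, ratio, seed):
    """Return sorted indices of a uniformly random subset of the given ratio."""
    rng = random.Random(seed)
    n_keep = max(1, int(n_items * ratio))
    return sorted(rng.sample(range(n_items), n_keep))

first = random_subset(1000, ratio=0.1, seed=42)
second = random_subset(1000, ratio=0.1, seed=42)
print(len(first), first == second)
```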

Teacher Distillation (Self-Distillation)

Apply teacher model predictions to create soft labels after distillation has selected the samples. Because the teacher runs only on the selected subset, this avoids computing predictions for the entire dataset.

from torch.utils.data import DataLoader

from distillseq import apply_teacher_predictions, KmerDiversity

# Step 1: Distill to select important samples first
distiller = KmerDiversity(dataset=original_dataset, ratio=0.1)
indices = distiller.distill()

# Step 2: Apply teacher predictions ONLY to selected samples
teacher_dataset = apply_teacher_predictions(
    dataset=original_dataset,
    indices=indices,
    teacher_models=trained_model,  # Or [model1, model2, model3] for ensemble
    device='cuda'
)

# Step 3: Train on distilled data with soft labels
train_loader = DataLoader(teacher_dataset, batch_size=32)

Key advantages:

  • Efficient: Only compute predictions for selected samples (10-20x speedup typical)
  • Ensemble support: Automatically averages predictions from multiple teachers
  • MC dropout: Incorporate epistemic uncertainty
  • Compatible: Works with all 5 distillation methods
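
The efficiency and ensemble points above can be illustrated with a small standalone sketch. It does not use distillseq; teachers are modeled as plain Python callables, and soft_labels_for_subset and make_teacher are hypothetical names introduced only for this example.

```python
def soft_labels_for_subset(inputs, indices, teachers):
    """Average the teachers' predictions, but only for the selected indices."""
    labels = {}
    for idx in indices:
        preds = [teacher(inputs[idx]) for teacher in teachers]
        labels[idx] = sum(preds) / len(preds)
    return labels

# Count forward passes to show how much work the subset saves.
calls = {"n": 0}

def make_teacher(scale, bias):
    def teacher(x):
        calls["n"] += 1
        return scale * x + bias
    return teacher

inputs = list(range(100))          # stand-in for 100 sequences
selected = [3, 10, 42]             # indices chosen by a distillation method
teachers = [make_teacher(2.0, 0.0), make_teacher(2.0, 1.0)]

soft = soft_labels_for_subset(inputs, selected, teachers)
print(soft, calls["n"])
```

Here the two teachers are called 6 times in total (3 samples times 2 teachers) rather than the 200 calls needed to label all 100 inputs, which is where the savings come from.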

See Teacher Distillation Tutorial for comprehensive examples.

Documentation

Tutorials

Guides

Full documentation is available at https://distillseq.readthedocs.io

Citation

If you use distillseq in your research, please cite:

A citation will be added when available.

License

MIT License - see LICENSE file for details.

Download files

Source Distribution

distillseq-0.1.0.tar.gz (24.2 kB)

Uploaded Source

Built Distribution

distillseq-0.1.0-py3-none-any.whl (29.3 kB)

Uploaded Python 3

File details

Details for the file distillseq-0.1.0.tar.gz.

File metadata

  • Download URL: distillseq-0.1.0.tar.gz
  • Upload date:
  • Size: 24.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for distillseq-0.1.0.tar.gz
Algorithm Hash digest
SHA256 663bdb4cd53553e8605cb4f87bbd90f16fcbdf964b5a0c659f8779e742975bf6
MD5 689cdfda77f6178ab14d3b5ed7c72242
BLAKE2b-256 5e4868d983ad43cfa64cfc367bd4808e57f8cccfd29d101df4ff33d0c2370b9a

File details

Details for the file distillseq-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: distillseq-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 29.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for distillseq-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 935db7bad30d699b373cd9fc30762eadac4304e6b36de87be25bbddeab1c293b
MD5 c78e71640a0f2580d72c3b669bac7eba
BLAKE2b-256 86f82b46849f7469f8fa149cc50035b54f8a4adb159f73dfdb150049264c349b
