distillseq

Data distillation methods for genomic sequence-to-function models

distillseq provides data distillation and dataset condensation methods designed for genomic deep learning models. The goal is to reduce the size of your training dataset while maintaining model performance.

Features

🧬 Genomic-focused: Designed for sequence-to-function models (Enformer, etc.)
🎯 Multiple methods: Gradient matching, k-mer diversity, gLM diversity, model confidence diversity, and random sampling
🧪 Teacher distillation: Self-distillation with soft labels from teacher models
⚡ Efficient: Memory-optimized implementations for large genomic datasets
🔧 Flexible: Works with any PyTorch dataset and model
📊 Tracking: Optional Weights & Biases integration

Installation

pip install distillseq

For additional features:

# Install with genomic language model support
# (quotes keep shells such as zsh from expanding the brackets)
pip install "distillseq[glm]"

# Install with W&B tracking
pip install "distillseq[wandb]"

# Install all optional dependencies
pip install "distillseq[all]"

Quick Start

from distillseq import GradientMatching
import torch

# Your model and dataset
model = YourGenomicModel()
dataset = YourGenomicDataset()

# Distill to 10% of original size
distiller = GradientMatching(
    model=model,
    dataset=dataset,
    ratio=0.1,
    device='cuda'
)

# Get distilled dataset
distilled_dataset = distiller.distill()

Methods

1. Gradient Matching (Data Condensation)

Creates synthetic sequences that match the gradient distributions of the full dataset.

from distillseq import GradientMatching

distiller = GradientMatching(
    model=model,
    dataset=full_dataset,
    ratio=0.1,
    iterations=1000,
    batch_size=1024
)
synthetic_dataset = distiller.distill()
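
To make the idea of gradient matching concrete, here is a toy sketch in pure Python. It is not distillseq's implementation: the 1-D linear model, the finite-difference optimizer, and the names `mse_gradient`, `matching_loss`, and `condense` are all assumptions chosen for illustration. A small synthetic set is adjusted until the loss gradients it produces resemble those of the full dataset at several parameter settings.

```python
import random


def mse_gradient(points, w, b):
    """Gradient of mean squared error for y = w*x + b with respect to (w, b)."""
    n = len(points)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
    return grad_w, grad_b


def matching_loss(synthetic, real, thetas):
    """Squared distance between synthetic and real gradients, summed over parameter settings."""
    total = 0.0
    for w, b in thetas:
        real_w, real_b = mse_gradient(real, w, b)
        syn_w, syn_b = mse_gradient(synthetic, w, b)
        total += (real_w - syn_w) ** 2 + (real_b - syn_b) ** 2
    return total


def condense(real, n_synthetic, iterations=100, lr=0.05, seed=0):
    """Hypothetical helper: optimize synthetic (x, y) points to match real gradients."""
    rng = random.Random(seed)
    thetas = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(8)]
    # Start from a random real subset, flattened to [x0, y0, x1, y1, ...].
    flat = [v for point in rng.sample(real, n_synthetic) for v in point]

    def loss(values):
        return matching_loss(list(zip(values[0::2], values[1::2])), real, thetas)

    history = [loss(flat)]
    eps = 1e-6
    for _ in range(iterations):
        base = history[-1]
        # Forward-difference estimate of d(loss)/d(value) for each coordinate.
        grad = [(loss(flat[:i] + [flat[i] + eps] + flat[i + 1:]) - base) / eps
                for i in range(len(flat))]
        step = lr
        while step > 1e-10:
            candidate = [v - step * g for v, g in zip(flat, grad)]
            if loss(candidate) < base:
                flat = candidate
                break
            step /= 2  # backtrack when the step overshoots
        history.append(loss(flat))
    return list(zip(flat[0::2], flat[1::2])), history


real_data = [(i / 10, 2 * (i / 10) + 1) for i in range(-10, 10)]
synthetic, history = condense(real_data, n_synthetic=3)
```

In a real sequence-to-function setting the gradients would come from automatic differentiation of the network's loss rather than from a closed-form expression, but the objective has the same shape: a distance between two gradient vectors, minimized with respect to the synthetic data.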

2. K-mer Diversity Sampling

Maximizes sequence diversity using Jensen-Shannon Divergence on k-mer distributions.

from distillseq import KmerDiversity

distiller = KmerDiversity(
    dataset=full_dataset,
    ratio=0.1,
    kmer_length=6,
    n_cores=20
)
diverse_indices = distiller.distill()
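
As a rough illustration of the underlying measure, the sketch below computes k-mer frequency distributions, the Jensen-Shannon divergence between them, and a greedy farthest-point selection. The selection strategy and the function names are assumptions made for this example; the package may choose samples differently.

```python
import math
from collections import Counter


def kmer_distribution(sequence, k):
    """Relative frequency of every overlapping k-mer in a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    divergence = 0.0
    for kmer in set(p) | set(q):
        p_i, q_i = p.get(kmer, 0.0), q.get(kmer, 0.0)
        m_i = 0.5 * (p_i + q_i)
        if p_i > 0:
            divergence += 0.5 * p_i * math.log2(p_i / m_i)
        if q_i > 0:
            divergence += 0.5 * q_i * math.log2(q_i / m_i)
    return divergence


def greedy_diverse_indices(sequences, n_select, k=3):
    """Hypothetical selector: repeatedly add the sequence farthest from those already chosen."""
    dists = [kmer_distribution(s, k) for s in sequences]
    selected = [0]
    # Divergence from each sequence to its nearest selected sequence.
    nearest = [js_divergence(d, dists[0]) for d in dists]
    while len(selected) < min(n_select, len(sequences)):
        candidates = [i for i in range(len(sequences)) if i not in selected]
        chosen = max(candidates, key=lambda i: nearest[i])
        selected.append(chosen)
        nearest = [min(n, js_divergence(d, dists[chosen])) for n, d in zip(nearest, dists)]
    return selected


sequences = ["AAAAAAAA", "AAAAAAAT", "CGCGCGCG", "AAAAAAAA"]
picked = greedy_diverse_indices(sequences, n_select=2)
```

Here the second pick is the CG-repeat sequence, because it shares no 3-mers with the first sequence and therefore sits at the maximum divergence of 1 bit.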

3. gLM Diversity Sampling

Maximizes diversity using genomic language model (DNABERT-S) embeddings.

from distillseq import GLMDiversity

distiller = GLMDiversity(
    dataset=full_dataset,
    ratio=0.1,
    model_name="zhihan1996/DNABERT-S",
    n_clusters=10
)
diverse_indices = distiller.distill()
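
One way to turn clustered embeddings into a diverse subset is to draw from the clusters in turn, so that small clusters are not crowded out by large ones. The sketch below assumes that embeddings have already been computed and assigned to clusters (for example by k-means); `sample_across_clusters` is a hypothetical helper, not part of the distillseq API, and the package's own allocation rule may differ.

```python
import random
from collections import defaultdict


def sample_across_clusters(labels, n_select, seed=0):
    """Pick indices round-robin from each cluster until n_select are chosen."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    # Shuffle each cluster so the draw within a cluster is random.
    pools = []
    for label in sorted(members):
        pool = members[label][:]
        rng.shuffle(pool)
        pools.append(pool)

    selected = []
    while len(selected) < n_select and any(pools):
        for pool in pools:
            if pool and len(selected) < n_select:
                selected.append(pool.pop())
    return sorted(selected)


# Ten samples: eight in cluster 0, two in cluster 1.
cluster_labels = [0] * 8 + [1] * 2
picked = sample_across_clusters(cluster_labels, n_select=4)
```

With four picks, both members of the small cluster are included even though it holds only 20% of the data, which is the behavior a diversity-oriented sampler is after.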

4. Model Confidence Diversity

Stratified sampling across epistemic uncertainty levels using model ensemble or MC dropout.

from distillseq import ModelConfidenceDiversity

distiller = ModelConfidenceDiversity(
    dataset=full_dataset,
    model=trained_model,
    ratio=0.1,
    mc_dropout=10,  # Use MC dropout for uncertainty
    n_bins=100
)
diverse_indices = distiller.distill()
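
The stratification step can be sketched as follows, under stated assumptions: uncertainty is taken to be the standard deviation of predictions across ensemble members or stochastic forward passes, bins are equal-width, and samples are drawn from the bins in turn. The function names are hypothetical and the package may define uncertainty or binning differently.

```python
import random
import statistics


def epistemic_uncertainty(passes):
    """passes[p][i] is the prediction for sample i from pass (or ensemble member) p."""
    return [statistics.pstdev(column) for column in zip(*passes)]


def stratified_by_uncertainty(uncertainty, n_select, n_bins=4, seed=0):
    """Bin samples by uncertainty, then draw round-robin from the bins."""
    rng = random.Random(seed)
    low, high = min(uncertainty), max(uncertainty)
    width = (high - low) / n_bins or 1.0  # avoid zero width when all values are equal
    bins = [[] for _ in range(n_bins)]
    for index, value in enumerate(uncertainty):
        bins[min(int((value - low) / width), n_bins - 1)].append(index)
    for members in bins:
        rng.shuffle(members)

    selected = []
    while len(selected) < n_select and any(bins):
        for members in bins:
            if members and len(selected) < n_select:
                selected.append(members.pop())
    return sorted(selected)


# Two passes over four samples; disagreement grows from sample 0 to sample 3.
passes = [[1.0, 1.0, 1.0, 1.0],
          [1.0, 1.2, 2.0, 3.0]]
uncertainty = epistemic_uncertainty(passes)
picked = stratified_by_uncertainty(uncertainty, n_select=2, n_bins=2)
```

With two bins and two picks, one low-uncertainty and one high-uncertainty sample are chosen, so the subset covers both confident and uncertain regions of the data.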

5. Random Sampling

Baseline method for comparison.

from distillseq import RandomSampling

distiller = RandomSampling(
    dataset=full_dataset,
    ratio=0.1,
    seed=42
)
random_indices = distiller.distill()
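
For reference, a seeded random baseline amounts to very little code. The sketch below uses only the standard library; `random_subset_indices` is a hypothetical name, and rounding the subset size to the nearest integer is an assumption.

```python
import random


def random_subset_indices(n_total, ratio, seed):
    """Draw a reproducible random subset of indices without replacement."""
    n_select = max(1, round(n_total * ratio))
    return sorted(random.Random(seed).sample(range(n_total), n_select))


baseline = random_subset_indices(n_total=1000, ratio=0.1, seed=42)
```

Fixing the seed makes the baseline repeatable, which matters when comparing it against the other methods across several runs.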

Teacher Distillation (Self-Distillation)

Teacher model predictions are applied as soft labels after distillation has selected the samples. Because the teacher runs only on the selected subset, this is much more efficient than pre-computing predictions for the entire dataset.

from torch.utils.data import DataLoader

from distillseq import apply_teacher_predictions, KmerDiversity

# Step 1: Distill to select important samples first
distiller = KmerDiversity(dataset=original_dataset, ratio=0.1)
indices = distiller.distill()

# Step 2: Apply teacher predictions ONLY to selected samples
teacher_dataset = apply_teacher_predictions(
    dataset=original_dataset,
    indices=indices,
    teacher_models=trained_model,  # Or [model1, model2, model3] for ensemble
    device='cuda'
)

# Step 3: Train on distilled data with soft labels
train_loader = DataLoader(teacher_dataset, batch_size=32)

Key advantages:

Efficient: Only compute predictions for selected samples (10-20x speedup typical)
Ensemble support: Automatically averages predictions from multiple teachers
MC dropout: Incorporate epistemic uncertainty
Compatible: Works with all 5 distillation methods
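
The efficiency and ensemble points above can be illustrated with a small standalone sketch. It does not use `apply_teacher_predictions`; the helper `soft_labels_for_selected` and the toy teachers are hypothetical, and teachers are modeled as plain callables that return a number.

```python
def soft_labels_for_selected(inputs, indices, teachers):
    """Average teacher predictions, evaluating only the selected indices."""
    labels = {}
    for i in indices:
        predictions = [teacher(inputs[i]) for teacher in teachers]
        labels[i] = sum(predictions) / len(predictions)
    return labels


call_log = []  # records every teacher evaluation


def make_teacher(scale):
    """Toy teacher that multiplies its input by a fixed scale."""
    def teacher(x):
        call_log.append(x)
        return scale * x
    return teacher


inputs = [float(i) for i in range(100)]
selected = [3, 10, 42]
soft = soft_labels_for_selected(inputs, selected, [make_teacher(2.0), make_teacher(4.0)])
```

Only six teacher evaluations take place (three samples times two teachers) instead of the 200 needed to label all 100 inputs, and each soft label is the mean of the two teachers' outputs.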

See Teacher Distillation Tutorial for comprehensive examples.

Documentation

Tutorials

Basic Usage Tutorial - Getting started with distillation methods
Teacher Distillation Tutorial - Self-distillation with soft labels
Basic Usage Notebook - Interactive Jupyter notebook

Guides

Teacher Distillation Guide - Comprehensive reference
Quick Start Guide
Installation Guide

Full documentation is available at https://distillseq.readthedocs.io

Citation

If you use distillseq in your research, please cite:

TODO: citation details will be added when available.

License

MIT License - see LICENSE file for details.

This version: 0.1.0 (Nov 3, 2025)

Download files

Source Distribution

distillseq-0.1.0.tar.gz (24.2 kB view details)

Uploaded Nov 3, 2025 Source

Built Distribution

distillseq-0.1.0-py3-none-any.whl (29.3 kB view details)

Uploaded Nov 3, 2025 Python 3

File details

Details for the file distillseq-0.1.0.tar.gz.

File metadata

Download URL: distillseq-0.1.0.tar.gz
Upload date: Nov 3, 2025
Size: 24.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for distillseq-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`663bdb4cd53553e8605cb4f87bbd90f16fcbdf964b5a0c659f8779e742975bf6`
MD5	`689cdfda77f6178ab14d3b5ed7c72242`
BLAKE2b-256	`5e4868d983ad43cfa64cfc367bd4808e57f8cccfd29d101df4ff33d0c2370b9a`

File details

Details for the file distillseq-0.1.0-py3-none-any.whl.

File metadata

Download URL: distillseq-0.1.0-py3-none-any.whl
Upload date: Nov 3, 2025
Size: 29.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for distillseq-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`935db7bad30d699b373cd9fc30762eadac4304e6b36de87be25bbddeab1c293b`
MD5	`c78e71640a0f2580d72c3b669bac7eba`
BLAKE2b-256	`86f82b46849f7469f8fa149cc50035b54f8a4adb159f73dfdb150049264c349b`
