GPU-Accelerated MCMC Sampling for Protein Sequences with Codon-Level Mutations

Genie 2.0

Genie 2.0 is a high-performance tool for generating protein sequences using Direct Coupling Analysis (DCA) models combined with biologically realistic codon substitution dynamics. It implements efficient MCMC sampling on GPUs with two variants:

  • Genie: DNA codon-aware evolution with Metropolis-Gibbs sampling
  • Genie-AA: Amino acid-only evolution with standard Gibbs sampling

Features

Core Capabilities

  • Codon-Aware Sampling: Biologically realistic single-nucleotide mutations at DNA level
  • Hybrid MCMC: Combined Metropolis-Hastings and Gibbs sampling for better mixing
  • Reference-Based: Optional convergence tracking against real sequence data
  • GPU-Accelerated: Full CUDA support with PyTorch JIT compilation (roughly 2-3x faster than eager execution)
  • Flexible Input: Start from existing sequences or random initialization
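
As an illustration of reference-based convergence tracking, the sketch below computes a Pearson correlation between statistics of a reference set and of the current chains. Which statistics Genie actually compares is not stated here; single-site frequencies are an assumption chosen for brevity, and `site_frequencies` and `pearson` are hypothetical helper names, not part of the package.

```python
import math

def site_frequencies(seqs, alphabet):
    """Flattened single-site frequencies f_i(a) for equal-length sequences."""
    n, length = len(seqs), len(seqs[0])
    return [sum(s[i] == a for s in seqs) / n
            for i in range(length) for a in alphabet]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

reference = ["AC", "AC", "AG"]  # toy stand-in for real sequence data
sampled = ["AC", "AG", "AG"]    # toy stand-in for the current chains
score = pearson(site_frequencies(reference, "ACG"),
                site_frequencies(sampled, "ACG"))
```

A score approaching 1 indicates that the sampled statistics match the reference; in practice one would evaluate it at each checkpoint.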

Technical Highlights

  • Fully vectorized GPU kernels with zero CPU loops
  • Pre-computed codon mutation networks for O(1) neighbor lookups
  • Batched random number generation for improved GPU efficiency
  • Real-time Pearson correlation tracking for convergence monitoring
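
To make the idea of a pre-computed codon mutation network concrete, here is a minimal sketch of a neighbor table for single-nucleotide mutations. It is not Genie's implementation: the names `CODONS`, `CODON_INDEX`, and `build_neighbor_table` are hypothetical, and the table keeps all 64 codons (whether Genie excludes stop codons at this stage is not specified here).

```python
from itertools import product

BASES = "ACGT"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # all 64 codons
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}

def build_neighbor_table():
    """Row i lists the indices of the 9 codons one nucleotide away from codon i."""
    table = []
    for codon in CODONS:
        neighbors = []
        for pos in range(3):
            for base in BASES:
                if base != codon[pos]:
                    mutated = codon[:pos] + base + codon[pos + 1:]
                    neighbors.append(CODON_INDEX[mutated])
        table.append(neighbors)
    return table

# Built once up front; afterwards a neighbor lookup is a single index operation.
NEIGHBORS = build_neighbor_table()
atg_neighbors = [CODONS[j] for j in NEIGHBORS[CODON_INDEX["ATG"]]]
```

Because every codon has exactly 9 single-nucleotide neighbors, the table is a dense 64 x 9 array, which is the kind of layout that maps naturally onto a GPU tensor.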

Installation

From PyPI (Recommended)

pip install genie-dca

From Source

git clone https://github.com/spqb/Genie.py.git
cd Genie.py
pip install .

This installs two sampling command-line tools:

  • genie - Codon-aware evolution
  • genie-aa - Amino acid evolution

Quick Start

Codon-Aware Evolution

genie -p params.dat -n 1000 --num_iterations 50000 -o output_folder

Amino Acid Evolution

genie-aa -p params.dat -n 1000 --num_iterations 50000 -o output_folder

Usage

Genie (Codon-Aware Evolution)

# Generate sequences from scratch
genie -p params.dat -n 1000 --num_iterations 50000 -o results/

# Start from existing sequences
genie -c init_sequences.fasta -p params.dat --num_iterations 50000 -o results/

Genie-AA (Amino Acid Evolution)

# Generate sequences from scratch
genie-aa -p params.dat -n 1000 --num_iterations 50000 -o results/

# Start from existing sequences
genie-aa -c init_sequences.fasta -p params.dat --num_iterations 50000 -o results/

Reconstruction Tools

# Reconstruct final sequences from mutation log
reconstruct_chains results/

# Reconstruct sequences at specific timesteps
reconstruct_at_timesteps results/ --timesteps "0,100,500,1000"

Python API

from Genie import reconstruct_at_timesteps, reconstruct_chains_from_log
from adabmDCA.fasta import get_tokens

# Reconstruct sequences at specific timesteps
sequences = reconstruct_at_timesteps(
    initial_chains_file="results/initial_chains.fasta",
    mutation_log_file="results/mutation_log.csv",
    timesteps=[0, 100, 500, 1000],
    alphabet="protein"
)
# Returns: torch.Tensor of shape (len(timesteps), n_chains, L)

# Reconstruct and validate final sequences
tokens = get_tokens(alphabet="protein")
reconstructed_seqs, headers = reconstruct_chains_from_log(
    initial_chains_file="results/initial_chains.fasta",
    mutation_log_file="results/mutation_log.csv",
    tokens=tokens
)

Command-Line Arguments

Required Arguments

Argument          Short   Description
--path_params     -p      DCA model parameters file (.dat)
--num_iterations  (none)  Number of MCMC iterations

Optional Arguments

Argument       Short   Default        Description
--output       -o      DCA_evolution  Output directory
--num_chains   -n      None           Number of sequences (required if not using -c)
--path_chains  -c      None           Initial sequences (FASTA format)
--seq_index    (none)  None           Replicate a single sequence from the -c file
--save_steps   (none)  100            Checkpoint interval or comma-separated list (e.g., "100,500,1000")
--device       (none)  auto           Device: 'cuda' or 'cpu'
--dtype        (none)  float32        Data type: float32 or float64
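
The two accepted forms of --save_steps can be read as follows. This is a sketch of one plausible interpretation, not the tool's actual parser: `parse_save_steps` is a hypothetical helper, and treating a single number as a repeating interval (rather than one checkpoint) is an assumption based on the description above.

```python
def parse_save_steps(value: str, num_iterations: int) -> list[int]:
    """Expand a --save_steps string into an explicit list of checkpoint iterations."""
    parts = [p.strip() for p in value.split(",") if p.strip()]
    if len(parts) == 1:
        # Single value: assumed to be a checkpoint interval.
        interval = int(parts[0])
        if interval <= 0:
            raise ValueError("checkpoint interval must be positive")
        return list(range(interval, num_iterations + 1, interval))
    # Several values: an explicit list; drop duplicates and out-of-range steps.
    steps = sorted({int(p) for p in parts})
    return [s for s in steps if 0 <= s <= num_iterations]

every_100 = parse_save_steps("100", 500)
explicit = parse_save_steps("100,500,1000", 1000)
```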

Genie-Specific Arguments

Argument        Default  Description
--p_metropolis  0.5      Metropolis vs Gibbs ratio (0.0-1.0)

Genie-AA Specific Arguments

Argument    Default  Description
--alphabet  protein  Alphabet type: 'protein', 'rna', 'dna', or custom

Reconstruction Tool Arguments

reconstruct_chains: Takes output folder as positional argument

reconstruct_at_timesteps:

  • folder - Output folder (positional)
  • --timesteps - Comma-separated list (e.g., "0,100,500,1000")

Output Files

All files are saved in the output directory specified by -o.

Generated Files

File                  Description
initial_chains.fasta  Starting sequences (before evolution)
final_chains.fasta    Final sequences (after all iterations)
mutation_log.csv      Log of all mutations at checkpoints

Mutation Log Format

File: mutation_log.csv

CSV file tracking mutations at checkpoints:

Column     Description
iteration  Checkpoint iteration number
chain_id   Sequence identifier
position   Position in sequence (0-indexed)
new_aa     New amino acid at this position

Example:

iteration,chain_id,position,new_aa
100,seq_0,15,A
100,seq_0,42,G
100,seq_1,23,L
200,seq_0,15,V
...
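
To show how this log relates to the sequences, the sketch below replays the example rows on top of toy starting sequences using only the standard library. It is an illustration of the format, not a replacement for the bundled reconstruction tools; the `replay` function and the all-"M" starting sequences are hypothetical.

```python
import csv
import io

LOG_TEXT = """iteration,chain_id,position,new_aa
100,seq_0,15,A
100,seq_0,42,G
100,seq_1,23,L
200,seq_0,15,V
"""

def replay(initial, log_file, up_to):
    """Apply every logged mutation with iteration <= up_to to the initial chains."""
    chains = {name: list(seq) for name, seq in initial.items()}
    rows = sorted(csv.DictReader(log_file), key=lambda r: int(r["iteration"]))
    for row in rows:
        if int(row["iteration"]) > up_to:
            break
        # Positions are 0-indexed, as documented in the column table.
        chains[row["chain_id"]][int(row["position"])] = row["new_aa"]
    return {name: "".join(seq) for name, seq in chains.items()}

initial = {"seq_0": "M" * 50, "seq_1": "M" * 50}  # toy starting sequences
at_100 = replay(initial, io.StringIO(LOG_TEXT), up_to=100)
at_200 = replay(initial, io.StringIO(LOG_TEXT), up_to=200)
```

Later rows overwrite earlier ones at the same position, which is why position 15 of seq_0 reads A at iteration 100 and V at iteration 200.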

Console Output

Real-time progress showing:

  • Iteration number and speed (iter/sec)
  • Elapsed time
  • Compilation status (first iteration)

Algorithm Overview

Genie (Codon Evolution)

  1. Initialization: Load DCA model, build codon mutation network
  2. Sequence Translation: Convert amino acids to codons
  3. MCMC Sampling: Hybrid Metropolis-Gibbs with codon mutations
  4. Convergence Tracking: Optional Pearson correlation monitoring
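
The Metropolis half of the hybrid scheme can be sketched as below: propose a single-nucleotide change, translate, and accept with probability min(1, exp(-ΔE)). This is a toy CPU version under stated assumptions, not Genie's GPU kernel: `toy_energy` stands in for the DCA energy, rejecting proposals that create stop codons is an assumption, and --p_metropolis is assumed to be the probability of choosing such a move instead of a Gibbs update.

```python
import math
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(codons):
    """Translate a list of codons into an amino acid string."""
    return "".join(CODON_TO_AA[c] for c in codons)

def metropolis_codon_move(codons, energy, rng):
    """Propose one single-nucleotide mutation; accept with prob min(1, exp(-dE))."""
    site, pos = rng.randrange(len(codons)), rng.randrange(3)
    old = codons[site]
    base = rng.choice([b for b in BASES if b != old[pos]])
    new = old[:pos] + base + old[pos + 1:]
    if CODON_TO_AA[new] == "*":
        return False  # assumption: stop codons are never accepted
    proposal = codons[:site] + [new] + codons[site + 1:]
    delta = energy(translate(proposal)) - energy(translate(codons))
    if delta <= 0 or rng.random() < math.exp(-delta):
        codons[site] = new
        return True
    return False

def toy_energy(protein):
    """Hypothetical stand-in for a DCA energy: penalize every non-lysine residue."""
    return 50.0 * sum(aa != "K" for aa in protein)

rng = random.Random(0)
chain = ["ATG", "GCT", "GAA", "CTG"]  # translates to MAEL
start_energy = toy_energy(translate(chain))
for _ in range(2000):
    metropolis_codon_move(chain, toy_energy, rng)
final_energy = toy_energy(translate(chain))
```

Synonymous mutations have ΔE = 0 and are always accepted, so the chain can drift through codon space even when the amino acid sequence is unchanged.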

Genie-AA (Amino Acid Only)

  1. Initialization: Load DCA model
  2. Gibbs Sampling: Standard position-wise sampling
  3. Convergence Tracking: Optional correlation monitoring
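
Position-wise Gibbs sampling on a Potts-style DCA model resamples one site from its conditional distribution given all the others, P(a) ∝ exp(h_i(a) + Σ_j J_ij(a, a_j)). The sketch below is a plain-Python illustration of that update, not the vectorized implementation; the nested-list layout `couplings[i][j][a][b]` and the function name `gibbs_update` are assumptions.

```python
import math
import random

def gibbs_update(seq, site, fields, couplings, rng):
    """Resample seq[site] from its conditional distribution given all other sites."""
    q = len(fields[site])
    logits = []
    for a in range(q):
        value = fields[site][a]
        for j, b in enumerate(seq):
            if j != site:
                value += couplings[site][j][a][b]
        logits.append(value)
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in logits]
    seq[site] = rng.choices(range(q), weights=weights)[0]

# Toy model: 4 sites, 3 states, a strong field favoring state 2, no couplings.
length, q = 4, 3
fields = [[0.0, 0.0, 50.0] for _ in range(length)]
couplings = [[[[0.0] * q for _ in range(q)] for _ in range(length)]
             for _ in range(length)]
rng = random.Random(0)
seq = [0] * length
for site in range(length):  # one full sweep over all positions
    gibbs_update(seq, site, fields, couplings, rng)
```

Unlike a Metropolis move, a Gibbs update is never rejected: the new state is drawn directly from the conditional distribution.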

Performance

Hardware: NVIDIA RTX 4090, 1000 sequences, sequence length L=100

Mode                 Iterations/sec  Speedup (vs. Genie eager)
Genie (compiled)     ~45-50          2.5x
Genie (eager)        ~18-20          1.0x
Genie-AA (compiled)  ~120-140        6.5x

Note: First iteration includes ~10-30s JIT compilation overhead
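
These figures can be turned into a rough wall-clock estimate. The helper below is a hypothetical back-of-the-envelope calculation; it assumes throughput stays constant over the run and only holds for the benchmark setup listed above.

```python
def estimate_runtime_seconds(num_iterations, iters_per_sec, compile_overhead=0.0):
    """Rough wall-clock estimate: one-off JIT overhead plus steady-state sampling."""
    return compile_overhead + num_iterations / iters_per_sec

# 50,000 iterations of compiled Genie at the two ends of the quoted ranges.
fast = estimate_runtime_seconds(50_000, 50, compile_overhead=10)
slow = estimate_runtime_seconds(50_000, 45, compile_overhead=30)
```

Under these assumptions a 50,000-iteration run takes roughly 17 to 19 minutes, so the compilation overhead is negligible for long runs but can dominate very short ones.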


Requirements

torch>=2.0.0
numpy>=1.20.0
pandas>=1.3.0
adabmDCA>=1.0.0

Hardware:

  • Minimum: CPU with 4GB RAM
  • Recommended: NVIDIA GPU (8GB+ VRAM) with CUDA 11.7+

Examples

Basic Evolution

# Generate 1000 sequences with codon awareness
genie -p example_data/pf76/params.dat -n 1000 --num_iterations 50000 -o results/pf76

# Generate amino acid sequences only
genie-aa -p example_data/pf76/params.dat -n 1000 --num_iterations 50000 -o results/pf76_aa

Custom Checkpoints

# Save mutations at specific iterations
genie -p params.dat -n 1000 --num_iterations 10000 --save_steps "100,500,1000,5000,10000" -o results/

# Reconstruct sequences at those timesteps
reconstruct_at_timesteps results/ --timesteps "0,100,500,1000,5000,10000"

Citation

This software is based on the following article:

@article{doi:10.1073/pnas.2406807121,
author = {Leonardo Di Bari and Matteo Bisardi and Sabrina Cotogno and Martin Weigt and Francesco Zamponi},
title = {Emergent time scales of epistasis in protein evolution},
journal = {Proceedings of the National Academy of Sciences},
volume = {121},
number = {40},
pages = {e2406807121},
year = {2024},
doi = {10.1073/pnas.2406807121},
URL = {https://www.pnas.org/doi/abs/10.1073/pnas.2406807121},
eprint = {https://www.pnas.org/doi/pdf/10.1073/pnas.2406807121},
}

A Julia version of Genie is also available: Genie.jl


License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Built on the adabmDCA library
  • PyTorch team for excellent GPU optimization tools
