GPU-Accelerated MCMC Sampling for Protein Sequences with Codon-Level Mutations
Genie 2.0
Genie 2.0 is a high-performance tool for generating protein sequences using Direct Coupling Analysis (DCA) models combined with biologically realistic codon substitution dynamics. It implements efficient MCMC sampling on GPUs with two variants:
- Genie: DNA codon-aware evolution with Metropolis-Gibbs sampling
- Genie-AA: Amino acid-only evolution with standard Gibbs sampling
Table of Contents
- Features
- Installation
- Quick Start
- Usage
- Command-Line Arguments
- Output Files
- Algorithm Overview
- Performance
- Requirements
- Examples
- Citation
- License
Features
Core Capabilities
- Codon-Aware Sampling: Biologically realistic single-nucleotide mutations at DNA level
- Hybrid MCMC: Combined Metropolis-Hastings and Gibbs sampling for better mixing
- Reference-Based: Optional convergence tracking against real sequence data
- GPU-Accelerated: Full CUDA support with PyTorch JIT compilation (2-3x speedup)
- Flexible Input: Start from existing sequences or random initialization
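As an illustration of reference-based convergence tracking, the sketch below computes a Pearson correlation between the single-site frequencies of sampled sequences and those of a reference set. It is a minimal pure-Python example, not Genie's implementation: the helper names `site_frequencies` and `pearson`, the toy sequences, and the choice of single-site frequencies as the compared statistic are all assumptions made for illustration.

```python
import math
from collections import Counter


def site_frequencies(seqs, alphabet):
    """Return a flat list of single-site frequencies f_i(a), position by position."""
    n, length = len(seqs), len(seqs[0])
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.extend(counts.get(a, 0) / n for a in alphabet)
    return freqs


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy data (hypothetical): a 3-letter alphabet and two tiny sequence sets.
alphabet = "ACD"
reference = ["ACD", "ACA", "CCD", "ACD"]
sampled = ["ACD", "CCA", "CCD", "ACA"]

f_ref = site_frequencies(reference, alphabet)
f_smp = site_frequencies(sampled, alphabet)
r = pearson(f_smp, f_ref)
print(f"Pearson correlation of site frequencies: {r:.3f}")
```

A value close to 1 would suggest that the sampled statistics match the reference. In practice, pairwise statistics may be a more stringent check than single-site frequencies.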
Technical Highlights
- Fully vectorized GPU kernels with zero CPU loops
- Pre-computed codon mutation networks for O(1) neighbor lookups
- Batched random number generation for improved GPU efficiency
- Real-time Pearson correlation tracking for convergence monitoring
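The idea of a pre-computed codon mutation network can be sketched as a table that lists, for every codon, the codons reachable by one nucleotide change. The example below is a hedged illustration rather than Genie's actual data structure: it uses the standard genetic code, the function name `build_neighbor_table` is hypothetical, and dropping stop codons from the neighbor lists is an assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))
CODON_INDEX = {c: i for i, c in enumerate(CODONS)}


def build_neighbor_table(exclude_stop=True):
    """For every codon index, list the indices of codons one nucleotide away."""
    table = []
    for codon in CODONS:
        neighbors = []
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                # Assumption: stop codons are never valid mutation targets.
                if exclude_stop and CODON_TO_AA[mutant] == "*":
                    continue
                neighbors.append(CODON_INDEX[mutant])
        table.append(neighbors)
    return table


# Built once up front; afterwards each lookup is a single list index.
NEIGHBORS = build_neighbor_table()
atg_neighbors = [CODONS[j] for j in NEIGHBORS[CODON_INDEX["ATG"]]]
print("ATG ->", atg_neighbors)
```

On a GPU the same table would typically be stored as a padded integer tensor so that neighbors for a whole batch of chains can be gathered in one indexing operation.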
Installation
Prerequisites
- Python 3.8+
- PyTorch 2.0+ with CUDA support (recommended) or CPU
- adabmDCA library
Install from Source
git clone https://github.com/yourusername/Genie.py.git
cd Genie.py
pip install .
This installs two command-line tools:
- genie: Codon-aware evolution
- genie-aa: Amino acid evolution
Quick Start
Codon-Aware Evolution
genie -p params.dat -n 1000 --num_iterations 50000 -o output_folder
Amino Acid Evolution
genie-aa -p params.dat -n 1000 --num_iterations 50000 -o output_folder
Usage
Genie (Codon-Aware Evolution)
# Generate sequences from scratch
genie -p params.dat -n 1000 --num_iterations 50000 -o results/
# Start from existing sequences
genie -c init_sequences.fasta -p params.dat --num_iterations 50000 -o results/
Genie-AA (Amino Acid Evolution)
# Generate sequences from scratch
genie-aa -p params.dat -n 1000 --num_iterations 50000 -o results/
# Start from existing sequences
genie-aa -c init_sequences.fasta -p params.dat --num_iterations 50000 -o results/
Reconstruction Tools
# Reconstruct final sequences from mutation log
reconstruct_chains results/
# Reconstruct sequences at specific timesteps
reconstruct_at_timesteps results/ --timesteps "0,100,500,1000"
Python API
from Genie import reconstruct_at_timesteps, reconstruct_chains_from_log
from adabmDCA.fasta import get_tokens
# Reconstruct sequences at specific timesteps
sequences = reconstruct_at_timesteps(
    initial_chains_file="results/initial_chains.fasta",
    mutation_log_file="results/mutation_log.csv",
    timesteps=[0, 100, 500, 1000],
    alphabet="protein"
)
# Returns: torch.Tensor of shape (len(timesteps), n_chains, L)
# Reconstruct and validate final sequences
tokens = get_tokens(alphabet="protein")
reconstructed_seqs, headers = reconstruct_chains_from_log(
    initial_chains_file="results/initial_chains.fasta",
    mutation_log_file="results/mutation_log.csv",
    tokens=tokens
)
Command-Line Arguments
Required Arguments
| Argument | Short | Description |
|---|---|---|
| --path_params | -p | DCA model parameters file (.dat) |
| --num_iterations | | Number of MCMC iterations |
Optional Arguments
| Argument | Short | Default | Description |
|---|---|---|---|
| --output | -o | DCA_evolution | Output directory |
| --num_chains | -n | None | Number of sequences (required if not using -c) |
| --path_chains | -c | None | Initial sequences (FASTA format) |
| --seq_index | | None | Replicate single sequence from -c file |
| --save_steps | | 100 | Checkpoint interval or comma-separated list (e.g., "100,500,1000") |
| --device | | auto | Device: 'cuda' or 'cpu' |
| --dtype | | float32 | Data type: float32 or float64 |
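Since --save_steps accepts either a single interval or a comma-separated list, one way to picture the two forms is as a function that expands both into an explicit list of checkpoint iterations. The sketch below is illustrative only: `parse_save_steps` is a hypothetical helper, not part of the Genie CLI, and it assumes that a value without a comma is an interval and that checkpoints beyond the last iteration are ignored.

```python
def parse_save_steps(value, num_iterations):
    """Expand a --save_steps style value into a sorted list of checkpoint iterations."""
    text = str(value).strip()
    if "," in text:
        # Explicit list such as "100,500,1000".
        steps = {int(tok) for tok in text.split(",") if tok.strip()}
    else:
        # Single number: treated as a regular interval (assumption).
        interval = int(text)
        if interval <= 0:
            raise ValueError("checkpoint interval must be positive")
        steps = set(range(interval, num_iterations + 1, interval))
    # Keep only checkpoints that fall inside the run.
    return sorted(s for s in steps if 0 < s <= num_iterations)


print(parse_save_steps("100", 300))
print(parse_save_steps("100,500,1000", 600))
```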
Genie-Specific Arguments
| Argument | Default | Description |
|---|---|---|
| --p_metropolis | 0.5 | Metropolis vs Gibbs ratio (0.0-1.0) |
Genie-AA Specific Arguments
| Argument | Default | Description |
|---|---|---|
| --alphabet | protein | Alphabet type: 'protein', 'rna', 'dna', or custom |
Reconstruction Tool Arguments
reconstruct_chains: Takes output folder as positional argument
reconstruct_at_timesteps:
- folder: Output folder (positional)
- --timesteps: Comma-separated list (e.g., "0,100,500,1000")
Output Files
All files are saved in the output directory specified by -o.
Generated Files
| File | Description |
|---|---|
| initial_chains.fasta | Starting sequences (before evolution) |
| final_chains.fasta | Final sequences (after all iterations) |
| mutation_log.csv | Log of all mutations at checkpoints |
Mutation Log Format
File: mutation_log.csv
CSV file tracking mutations at checkpoints:
| Column | Description |
|---|---|
| iteration | Checkpoint iteration number |
| chain_id | Sequence identifier |
| position | Position in sequence (0-indexed) |
| new_aa | New amino acid at this position |
Example:
iteration,chain_id,position,new_aa
100,seq_0,15,A
100,seq_0,42,G
100,seq_1,23,L
200,seq_0,15,V
...
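To make the log format concrete, the sketch below replays rows of this shape onto a set of starting sequences using only the Python standard library. It is not the package's reconstruction code (use reconstruct_chains or reconstruct_at_timesteps for real runs); the function `replay_log` and the poly-M starting chains are hypothetical, and it assumes that a later row for the same chain and position overrides an earlier one.

```python
import csv
import io

# The example rows shown above, embedded as text for a self-contained demo.
LOG_TEXT = """iteration,chain_id,position,new_aa
100,seq_0,15,A
100,seq_0,42,G
100,seq_1,23,L
200,seq_0,15,V
"""


def replay_log(initial, log_file, up_to):
    """Apply logged mutations with iteration <= up_to to copies of the initial chains."""
    chains = {cid: list(seq) for cid, seq in initial.items()}
    # Stable sort by iteration so later checkpoints override earlier ones.
    rows = sorted(csv.DictReader(log_file), key=lambda r: int(r["iteration"]))
    for row in rows:
        if int(row["iteration"]) <= up_to:
            chains[row["chain_id"]][int(row["position"])] = row["new_aa"]
    return {cid: "".join(seq) for cid, seq in chains.items()}


# Hypothetical starting chains of length 50 (placeholders, not real data).
initial = {"seq_0": "M" * 50, "seq_1": "M" * 50}
at_100 = replay_log(initial, io.StringIO(LOG_TEXT), up_to=100)
at_200 = replay_log(initial, io.StringIO(LOG_TEXT), up_to=200)
print(at_100["seq_0"][15], at_200["seq_0"][15])
```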
Console Output
Real-time progress showing:
- Iteration number and speed (iter/sec)
- Elapsed time
- Compilation status (first iteration)
Algorithm Overview
Genie (Codon Evolution)
- Initialization: Load DCA model, build codon mutation network
- Sequence Translation: Convert amino acids to codons
- MCMC Sampling: Hybrid Metropolis-Gibbs with codon mutations
- Convergence Tracking: Optional Pearson correlation monitoring
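The Metropolis part of the codon-level sampling can be sketched as follows: propose a single-nucleotide change, translate the result, and accept it with probability min(1, exp(-ΔE)). This is a simplified single-chain illustration under stated assumptions, not Genie's GPU kernel: `toy_energy` is a hypothetical stand-in for the DCA energy, proposals that create a stop codon are simply rejected, and the Gibbs half of the hybrid scheme is omitted.

```python
import math
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(codons):
    """Translate a list of codons into an amino acid string."""
    return "".join(CODON_TO_AA[c] for c in codons)


def toy_energy(protein):
    """Hypothetical stand-in for a DCA energy: hydrophobic residues lower the energy."""
    return -sum(1.0 for aa in protein if aa in "AVLI")


def metropolis_codon_step(codons, energy, rng):
    """Propose one single-nucleotide change and accept or reject it."""
    site = rng.randrange(len(codons))
    pos = rng.randrange(3)
    old = codons[site]
    base = rng.choice([b for b in BASES if b != old[pos]])
    new = old[:pos] + base + old[pos + 1:]
    if CODON_TO_AA[new] == "*":
        return codons, False  # assumption: stop codons are forbidden states
    proposal = codons[:site] + [new] + codons[site + 1:]
    delta = energy(translate(proposal)) - energy(translate(codons))
    # Symmetric proposal, so the plain Metropolis acceptance rule applies.
    if delta <= 0 or rng.random() < math.exp(-delta):
        return proposal, True
    return codons, False


rng = random.Random(0)
start = ["ATG", "GGT", "TCA", "AAA", "TGG"]
chain = list(start)
accepted = 0
for _ in range(500):
    chain, ok = metropolis_codon_step(chain, toy_energy, rng)
    accepted += ok
print(translate(start), "->", translate(chain), f"({accepted} accepted moves)")
```

Drawing the proposal uniformly from all nine single-nucleotide variants and rejecting stop codons afterwards keeps the proposal symmetric, which is why no Hastings correction appears in the acceptance test.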
Genie-AA (Amino Acid Only)
- Initialization: Load DCA model
- Gibbs Sampling: Standard position-wise sampling
- Convergence Tracking: Optional correlation monitoring
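Position-wise Gibbs sampling on a DCA (Potts) model resamples each site from its conditional distribution given the rest of the sequence, with P(a_i = a | rest) proportional to exp(h_i(a) + sum over j != i of J_ij(a, a_j)). The sketch below shows this on a tiny random model in pure Python. The model size, parameter values, and function names are hypothetical, and the real tool runs batched on the GPU rather than looping over sites like this.

```python
import math
import random


def gibbs_conditional(seq, i, fields, couplings, q):
    """Conditional distribution of site i given all other sites."""
    logits = []
    for a in range(q):
        s = fields[i][a]
        for j, b in enumerate(seq):
            if j != i:
                s += couplings[i][j][a][b]
        logits.append(s)
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]


def gibbs_sweep(seq, fields, couplings, q, rng):
    """Resample every position once, in order, from its conditional distribution."""
    seq = list(seq)
    for i in range(len(seq)):
        probs = gibbs_conditional(seq, i, fields, couplings, q)
        seq[i] = rng.choices(range(q), weights=probs)[0]
    return seq


# Toy model (hypothetical parameters): L = 4 sites, q = 3 states.
rng = random.Random(0)
L, q = 4, 3
fields = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(L)]
couplings = [[[[0.0] * q for _ in range(q)] for _ in range(L)] for _ in range(L)]
for i in range(L):
    for j in range(i + 1, L):
        for a in range(q):
            for b in range(q):
                # Enforce the symmetry J_ij(a, b) = J_ji(b, a).
                couplings[i][j][a][b] = couplings[j][i][b][a] = rng.gauss(0, 0.5)

seq = [0] * L
for _ in range(100):
    seq = gibbs_sweep(seq, fields, couplings, q, rng)
print("sample after 100 sweeps:", seq)
```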
Performance
Hardware: NVIDIA RTX 4090, 1000 sequences, L=100
| Mode | Iterations/sec | Speedup (vs. Genie eager) |
|---|---|---|
| Genie (compiled) | ~45-50 | 2.5x |
| Genie (eager) | ~18-20 | 1.0x |
| Genie-AA (compiled) | ~120-140 | 6.5x |
Note: First iteration includes ~10-30s JIT compilation overhead
Requirements
torch>=2.0.0
numpy>=1.20.0
pandas>=1.3.0
adabmDCA>=1.0.0
Hardware:
- Minimum: CPU with 4GB RAM
- Recommended: NVIDIA GPU (8GB+ VRAM) with CUDA 11.7+
Examples
Basic Evolution
# Generate 1000 sequences with codon awareness
genie -p example_data/pf76/params.dat -n 1000 --num_iterations 50000 -o results/pf76
# Generate amino acid sequences only
genie-aa -p example_data/pf76/params.dat -n 1000 --num_iterations 50000 -o results/pf76_aa
Custom Checkpoints
# Save mutations at specific iterations
genie -p params.dat -n 1000 --num_iterations 10000 --save_steps "100,500,1000,5000,10000" -o results/
# Reconstruct sequences at those timesteps
reconstruct_at_timesteps results/ --timesteps "0,100,500,1000,5000,10000"
Citation
This software is based on the following article:
@article{doi:10.1073/pnas.2406807121,
author = {Leonardo Di Bari and Matteo Bisardi and Sabrina Cotogno and Martin Weigt and Francesco Zamponi},
title = {Emergent time scales of epistasis in protein evolution},
journal = {Proceedings of the National Academy of Sciences},
volume = {121},
number = {40},
pages = {e2406807121},
year = {2024},
doi = {10.1073/pnas.2406807121},
URL = {https://www.pnas.org/doi/abs/10.1073/pnas.2406807121},
eprint = {https://www.pnas.org/doi/pdf/10.1073/pnas.2406807121},
}
A Julia version of Genie is also available: Genie.jl
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built on the adabmDCA library
- PyTorch team for excellent GPU optimization tools