GPU-accelerated CONTRAfold partition function + Boltzmann sampling with hard constraints
Project description
Contrafold_GPU
CONTRAfold on the GPU — ~460× faster than a single CPU core and ~38× a full 32-thread CPU at folding 10,000 × 200 nt RNAs on an RTX 5090 (vs this package's own CPU reference; see Benchmark.md), reproducing the original CONTRAfold's output.
GPU-accelerated CONTRAfold RNA secondary-structure model: partition function, posterior base-pair probabilities, and Boltzmann structure sampling — with optional per-base hard constraints (e.g. force DMS-reactive positions unpaired).
It is a faithful reimplementation of CONTRAfold's CRF scoring + inside recurrence (complementary model) that reproduces the original CONTRAfold binary's output, running on the GPU via Numba CUDA.
Why
This started from a practical need: I had to fold large batches of RNA with CONTRAfold, but it is a CPU-only C++ tool and was simply too slow for that volume. So I reimplemented its model on the GPU. Folding millions of short windows (e.g. for DMS-MaPseq / SHAPE structure-probing pipelines) now runs the same model on the GPU, batching thousands of sequences at once.
- Accurate: the GPU engine reproduces CONTRAfold's float-precision arithmetic
— the same 8-segment polynomial
Fast_LogExpPlusOneused by theRealT=floatbinary — so log-partition matches the binary to ~5e-4 (≈1e-4 relative). The small residual is float32 summation order on the GPU, not the algorithm; bit-identical output isn't achievable on a parallel device (non-associative float adds + FMA). The bundled CPU engine instead uses exactlog1p/exp(double) as a higher-precision reference, mirroring aRealT=doubleCONTRAfold build. Boltzmann sample base-pair frequencies match CONTRAfold's posterior probabilities within sampling noise. - Fast: ~6,000–19,000 folds/s for 128-nt windows on an RTX 5090 (≈2–6× a 32-core CPU running the CONTRAfold binary; ≈65–200× single-core).
- Constraints: any position can be forced unpaired (hard constraint), e.g. to encode DMS/SHAPE reactivity.
Install
pip install gpu-contrafold
Requires an NVIDIA GPU + CUDA toolkit (tested CUDA 12.8, sm_120); numpy and numba
are pulled in automatically. The trained parameters
(gpu_contrafold/data/contrafold.params.complementary) are bundled. The CPU engine
works without a GPU. Install from source with pip install ..
Command-line tool
pip install exposes a gpu-contrafold command (or python -m gpu_contrafold)
— pass the sequence or file directly:
gpu-contrafold GGGGAAAACCCC # one sequence -> MEA structure (dot-bracket)
gpu-contrafold GGGGAAAACCCC --gamma 4 # MEA with a custom tradeoff (default 6)
gpu-contrafold GGGGAAAACCCC --viterbi # max-probability (Viterbi/MAP) structure
gpu-contrafold GGGGAAAACCCC --sample 10 # 10 Boltzmann samples
gpu-contrafold GGGGAAAACCCC --logz # partition function (logZ) instead
gpu-contrafold seqs.jsonl -o out.jsonl # batch: JSONL in -> JSONL out
gpu-contrafold seqs.fasta -o out.jsonl --sample 100
The default is the maximum-expected-accuracy (MEA) structure via posterior
decoding — exactly what contrafold predict returns by default (verified: exact
dot-bracket match over 400 random sequences up to 250 nt, and across --gamma
0.5–16). --gamma G sets the sensitivity/specificity tradeoff (default 6, as in
CONTRAfold). --viterbi gives the maximum-probability (Viterbi/MAP) structure
instead — identical to contrafold --viterbi. --sample N draws N Boltzmann
samples; --logz gives the partition function.
The argument is a literal RNA sequence if it is not an existing file; otherwise it is
read as JSONL (lines starting with {), FASTA (>), or one sequence per line.
JSONL input — one object per line:
{"id": "rna1", "seq": "GGGGAAAACCCC"}
{"id": "rna2", "seq": "GGGGAAAACCCC", "constrain": [0,0,0,1,0,0,0,0,0,0,0,0]}
seq(required); non-ACGU is treated asN(cannot pair).id(optional), echoed to output; defaults to the 0-based index.constrain(optional): per-base 0/1 hard-constraint mask, length== len(seq),1forces that position unpaired (e.g. a DMS-reactive base). A list[0,0,1,...]or string"001..."; omit for none.
Output: a single literal sequence prints structures to stdout (one per line); file
input (or any -o) writes JSONL — {"id","structure"} by default, {"id","samples":[...]}
with --sample N, or {"id","logZ"} with --logz — input order preserved. Flags:
--gamma G, --viterbi, --sample N, --logz, --chunk, --seed, --threads.
Usage
from gpu_contrafold import load, logZ_batch, sample_batch, mfe, mea, bpp
P = load() # bundled CONTRAfold complementary params
# Maximum-probability (Viterbi/MAP) structure, dot-bracket (matches contrafold --viterbi)
db = mfe("GGGGAAAACCCC", P) # -> "((((....))))"
# Maximum-expected-accuracy structure (matches contrafold's default `predict`)
db = mea("GGGGAAAACCCC", P, gamma=6) # -> "((((....))))"
# Exact base-pair probability matrix (inside+outside; matches contrafold --posteriors)
prob = bpp("GGGGAAAACCCC", P) # n x n upper-triangular numpy array
# Log partition function (one per sequence), batched on GPU
z = logZ_batch(["GGGGAAAACCCC", "GCGCGCAAAAGCGCGC"], P)
# Boltzmann-sample structures (dot-bracket), N per sequence
structs = sample_batch(["GGGGAAAACCCC"], P, n_samples=10) # -> [[db, db, ...]]
# Hard constraints: force positions unpaired (1 = forced unpaired)
import numpy as np
mask = np.zeros(12, np.int8); mask[3] = 1 # position 3 cannot pair
structs = sample_batch(["GGGGAAAACCCC"], P, n_samples=10, forced_list=[mask])
# Bulk pipeline primitive: fold many (seq, mask) tasks, 1 sample each, chunked
from gpu_contrafold import fold_tasks_gpu
dbs = fold_tasks_gpu(seqs, masks, P) # list of dot-bracket strings
CPU reference (matches the binary, no GPU needed):
from gpu_contrafold import cpu
P = cpu.load()
print(cpu.logZ("GGGGAAAACCCC", P))
Benchmark
Full results are in Benchmark.md. Reproduce:
# deterministic random sequences (reproducible from the seed)
python benchmarks/generate_sequences.py -n 10000 -l 200 -o benchmarks/data/sim_10k_200nt.fa
# batch-size sweep, length scaling, constraint overhead, sampling, GPU-vs-CPU
python benchmarks/benchmark.py benchmarks/data/sim_10k_200nt.fa --repeat 3
# add --skip-compare for a fast GPU-only run (no slow single-core CPU pass)
Headline on an RTX 5090 (10k × 200 nt): logZ peaks at ~9,800 folds/s (~460× a
single CPU core, ~38× the full 32-thread CPU); hard constraints are effectively free;
Boltzmann sampling runs at ~5,600 structures/s (100 seqs × 100 samples). GPU memory
scales as ~3 × (L+2)² × 4 B per sequence in a batch (≈0.5 MB/seq at 200 nt,
≈3 MB/seq at 500 nt), so use a smaller batch for longer sequences or smaller GPUs.
Model
Complementary CONTRAfold model, all scoring terms enabled: base_pair,
helix_stacking, terminal_mismatch, helix_closing, dangle_left/right,
hairpin_length, bulge_length, internal_length, internal_symmetric,
internal_asymmetry, internal_explicit (≤4×4), internal_1x1, bulge_0x1,
affine multi-branch and external loops. Recurrence is CONTRAfold's unambiguous
FC/FM/FM1/FM2/F5 (simple-FC variant), 1-based s[1..L], natural-log
(log_base = 1.0). Non-ACGU characters (N) map to a 5th code that cannot pair
and carries zero scores — exactly as the CONTRAfold binary handles them.
Validation
tests/test_validation.py checks:
- CPU (exact, double) vs GPU (CONTRAfold float arithmetic)
logZagree to ~1e-3, and - GPU
logZmatches the original CONTRAfold binary (--partition) to ~5e-4 if available, and - GPU sample base-pair frequencies == CONTRAfold posteriors (
--posteriors), and - MFE structure ==
contrafold --viterbi(exact dot-bracket), and - MEA structure ==
contrafold predictdefault decoding (exact dot-bracket, all gammas), and - exact
bppmatrix ==contrafold --posteriors.
Verified against CONTRAfold 2.02 (max |Δ logZ| 5e-4 over 20 random sequences;
sample BPP within 0.006 of --posteriors; MEA dot-bracket exact over 400 random
sequences up to 250 nt and across --gamma 0.5–16). The MEA/posterior path is
computed in CONTRAfold's RealT=float arithmetic (Fast_LogPlusEquals/Fast_Exp
polynomials) so the discrete decode reproduces the binary exactly.
To validate against the original binary, build CONTRAfold and point to it:
export GPU_CONTRAFOLD_BIN=/path/to/contrafold
python tests/test_validation.py
Acknowledgments
All of the science here — the CRF scoring model, the inside/outside recurrences, the posterior decoding, and the trained parameters — is the work of the original CONTRAfold authors. This project is purely an engineering effort (a GPU port + faithful re-implementation); full credit for the method belongs to them:
- Chuong B. Do, Daniel A. Woods, and Serafim Batzoglou (Batzoglou Lab, Stanford University)
- Original software & parameters: CONTRAfold — http://contra.stanford.edu/contrafold/
- Source code referenced during this re-implementation: csfoo/contrafold-se (a maintained mirror of the original CONTRAfold/CONTRAfold-SE source)
Do CB, Woods DA, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics. 2006;22(14):e90–e98. doi:10.1093/bioinformatics/btl246
License
This GPU reimplementation is released under the MIT license (see LICENSE).
It incorporates the CONTRAfold model and bundles its trained parameter file, which
derive from the BSD-licensed CONTRAfold source (see NOTICE.md); those components
remain under their original terms. Please retain that attribution.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file gpu_contrafold-0.1.0.tar.gz.
File metadata
- Download URL: gpu_contrafold-0.1.0.tar.gz
- Upload date:
- Size: 35.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
45d28ff5925cab954806366efcd0670a2484aa19aeddb3e6df0596a83e70f2ce
|
|
| MD5 |
b6325ea5f7f445894b66b5d9a3901461
|
|
| BLAKE2b-256 |
836ca616334c53507a3c82cb045a0cf31146c1104e59ed026b948cfc777a716f
|
File details
Details for the file gpu_contrafold-0.1.0-py3-none-any.whl.
File metadata
- Download URL: gpu_contrafold-0.1.0-py3-none-any.whl
- Upload date:
- Size: 32.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e99f022f76e330333a7f70fac2b3705f40376ea66cdac033ee7c25fb51f14fe1
|
|
| MD5 |
3695f821e3bca278bdc0b6893ddbf3fb
|
|
| BLAKE2b-256 |
b9931623a483f4d8217a7a9ed096a68704cc070411b47396eb83ba2c90b6fc1f
|