GPU-accelerated CONTRAfold partition function + Boltzmann sampling with hard constraints

These details have not been verified by PyPI

Project links

Project description

Contrafold_GPU

CONTRAfold on the GPU — ~460× faster than a single CPU core and ~38× a full 32-thread CPU at folding 10,000 × 200 nt RNAs on an RTX 5090 (vs this package's own CPU reference; see Benchmark.md), reproducing the original CONTRAfold's output.

GPU-accelerated CONTRAfold RNA secondary-structure model: partition function, posterior base-pair probabilities, and Boltzmann structure sampling — with optional per-base hard constraints (e.g. force DMS-reactive positions unpaired).

It is a faithful reimplementation of CONTRAfold's CRF scoring + inside recurrence (complementary model) that reproduces the original CONTRAfold binary's output, running on the GPU via Numba CUDA.

Why

This started from a practical need: I had to fold large batches of RNA with CONTRAfold, but it is a CPU-only C++ tool and was simply too slow for that volume. So I reimplemented its model on the GPU. Folding millions of short windows (e.g. for DMS-MaPseq / SHAPE structure-probing pipelines) now runs the same model on the GPU, batching thousands of sequences at once.

Accurate: the GPU engine reproduces CONTRAfold's float-precision arithmetic — the same 8-segment polynomial Fast_LogExpPlusOne used by the RealT=float binary — so log-partition matches the binary to ~5e-4 (≈1e-4 relative). The small residual is float32 summation order on the GPU, not the algorithm; bit-identical output isn't achievable on a parallel device (non-associative float adds + FMA). The bundled CPU engine instead uses exact log1p/exp (double) as a higher-precision reference, mirroring a RealT=double CONTRAfold build. Boltzmann sample base-pair frequencies match CONTRAfold's posterior probabilities within sampling noise.
Fast: ~6,000–19,000 folds/s for 128-nt windows on an RTX 5090 (≈2–6× a 32-core CPU running the CONTRAfold binary; ≈65–200× single-core).
Constraints: any position can be forced unpaired (hard constraint), e.g. to encode DMS/SHAPE reactivity.

Install

pip install gpu-contrafold

Requires an NVIDIA GPU + CUDA toolkit (tested CUDA 12.8, sm_120); numpy and numba are pulled in automatically. The trained parameters (gpu_contrafold/data/contrafold.params.complementary) are bundled. The CPU engine works without a GPU. Install from source with pip install ..

Command-line tool

pip install exposes a gpu-contrafold command (or python -m gpu_contrafold) — pass the sequence or file directly:

gpu-contrafold GGGGAAAACCCC                 # one sequence -> MEA structure (dot-bracket)
gpu-contrafold GGGGAAAACCCC --gamma 4       # MEA with a custom tradeoff (default 6)
gpu-contrafold GGGGAAAACCCC --viterbi       # max-probability (Viterbi/MAP) structure
gpu-contrafold GGGGAAAACCCC --sample 10     # 10 Boltzmann samples
gpu-contrafold GGGGAAAACCCC --logz          # partition function (logZ) instead
gpu-contrafold seqs.jsonl  -o out.jsonl     # batch: JSONL in -> JSONL out
gpu-contrafold seqs.fasta  -o out.jsonl --sample 100

The default is the maximum-expected-accuracy (MEA) structure via posterior decoding — exactly what contrafold predict returns by default (verified: exact dot-bracket match over 400 random sequences up to 250 nt, and across --gamma 0.5–16). --gamma G sets the sensitivity/specificity tradeoff (default 6, as in CONTRAfold). --viterbi gives the maximum-probability (Viterbi/MAP) structure instead — identical to contrafold --viterbi. --sample N draws N Boltzmann samples; --logz gives the partition function.

The argument is a literal RNA sequence if it is not an existing file; otherwise it is read as JSONL (lines starting with {), FASTA (>), or one sequence per line.

JSONL input — one object per line:

{"id": "rna1", "seq": "GGGGAAAACCCC"}
{"id": "rna2", "seq": "GGGGAAAACCCC", "constrain": [0,0,0,1,0,0,0,0,0,0,0,0]}

seq (required); non-ACGU is treated as N (cannot pair).
id (optional), echoed to output; defaults to the 0-based index.
constrain (optional): per-base 0/1 hard-constraint mask, length == len(seq), 1 forces that position unpaired (e.g. a DMS-reactive base). A list [0,0,1,...] or string "001..."; omit for none.

Output: a single literal sequence prints structures to stdout (one per line); file input (or any -o) writes JSONL — {"id","structure"} by default, {"id","samples":[...]} with --sample N, or {"id","logZ"} with --logz — input order preserved. Flags: --gamma G, --viterbi, --sample N, --logz, --chunk, --seed, --threads.

Usage

from gpu_contrafold import load, logZ_batch, sample_batch, mfe, mea, bpp

P = load()                                  # bundled CONTRAfold complementary params

# Maximum-probability (Viterbi/MAP) structure, dot-bracket (matches contrafold --viterbi)
db = mfe("GGGGAAAACCCC", P)                  # -> "((((....))))"

# Maximum-expected-accuracy structure (matches contrafold's default `predict`)
db = mea("GGGGAAAACCCC", P, gamma=6)         # -> "((((....))))"

# Exact base-pair probability matrix (inside+outside; matches contrafold --posteriors)
prob = bpp("GGGGAAAACCCC", P)                # n x n upper-triangular numpy array

# Log partition function (one per sequence), batched on GPU
z = logZ_batch(["GGGGAAAACCCC", "GCGCGCAAAAGCGCGC"], P)

# Boltzmann-sample structures (dot-bracket), N per sequence
structs = sample_batch(["GGGGAAAACCCC"], P, n_samples=10)   # -> [[db, db, ...]]

# Hard constraints: force positions unpaired (1 = forced unpaired)
import numpy as np
mask = np.zeros(12, np.int8); mask[3] = 1                   # position 3 cannot pair
structs = sample_batch(["GGGGAAAACCCC"], P, n_samples=10, forced_list=[mask])

# Bulk pipeline primitive: fold many (seq, mask) tasks, 1 sample each, chunked
from gpu_contrafold import fold_tasks_gpu
dbs = fold_tasks_gpu(seqs, masks, P)        # list of dot-bracket strings

CPU reference (matches the binary, no GPU needed):

from gpu_contrafold import cpu
P = cpu.load()
print(cpu.logZ("GGGGAAAACCCC", P))

Benchmark

Full results are in Benchmark.md. Reproduce:

# deterministic random sequences (reproducible from the seed)
python benchmarks/generate_sequences.py -n 10000 -l 200 -o benchmarks/data/sim_10k_200nt.fa

# batch-size sweep, length scaling, constraint overhead, sampling, GPU-vs-CPU
python benchmarks/benchmark.py benchmarks/data/sim_10k_200nt.fa --repeat 3
# add --skip-compare for a fast GPU-only run (no slow single-core CPU pass)

Headline on an RTX 5090 (10k × 200 nt): logZ peaks at ~9,800 folds/s (~460× a single CPU core, ~38× the full 32-thread CPU); hard constraints are effectively free; Boltzmann sampling runs at ~5,600 structures/s (100 seqs × 100 samples). GPU memory scales as ~3 × (L+2)² × 4 B per sequence in a batch (≈0.5 MB/seq at 200 nt, ≈3 MB/seq at 500 nt), so use a smaller batch for longer sequences or smaller GPUs.

Model

Complementary CONTRAfold model, all scoring terms enabled: base_pair, helix_stacking, terminal_mismatch, helix_closing, dangle_left/right, hairpin_length, bulge_length, internal_length, internal_symmetric, internal_asymmetry, internal_explicit (≤4×4), internal_1x1, bulge_0x1, affine multi-branch and external loops. Recurrence is CONTRAfold's unambiguous FC/FM/FM1/FM2/F5 (simple-FC variant), 1-based s[1..L], natural-log (log_base = 1.0). Non-ACGU characters (N) map to a 5th code that cannot pair and carries zero scores — exactly as the CONTRAfold binary handles them.

Validation

tests/test_validation.py checks:

CPU (exact, double) vs GPU (CONTRAfold float arithmetic) logZ agree to ~1e-3, and
GPU logZ matches the original CONTRAfold binary (--partition) to ~5e-4 if available, and
GPU sample base-pair frequencies == CONTRAfold posteriors (--posteriors), and
MFE structure == contrafold --viterbi (exact dot-bracket), and
MEA structure == contrafold predict default decoding (exact dot-bracket, all gammas), and
exact bpp matrix == contrafold --posteriors.

Verified against CONTRAfold 2.02 (max |Δ logZ| 5e-4 over 20 random sequences; sample BPP within 0.006 of --posteriors; MEA dot-bracket exact over 400 random sequences up to 250 nt and across --gamma 0.5–16). The MEA/posterior path is computed in CONTRAfold's RealT=float arithmetic (Fast_LogPlusEquals/Fast_Exp polynomials) so the discrete decode reproduces the binary exactly.

To validate against the original binary, build CONTRAfold and point to it:

export GPU_CONTRAFOLD_BIN=/path/to/contrafold
python tests/test_validation.py

Acknowledgments

All of the science here — the CRF scoring model, the inside/outside recurrences, the posterior decoding, and the trained parameters — is the work of the original CONTRAfold authors. This project is purely an engineering effort (a GPU port + faithful re-implementation); full credit for the method belongs to them:

Chuong B. Do, Daniel A. Woods, and Serafim Batzoglou (Batzoglou Lab, Stanford University)
Original software & parameters: CONTRAfold — http://contra.stanford.edu/contrafold/
Source code referenced during this re-implementation: csfoo/contrafold-se (a maintained mirror of the original CONTRAfold/CONTRAfold-SE source)

Do CB, Woods DA, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics. 2006;22(14):e90–e98. doi:10.1093/bioinformatics/btl246

License

This GPU reimplementation is released under the MIT license (see LICENSE). It incorporates the CONTRAfold model and bundles its trained parameter file, which derive from the BSD-licensed CONTRAfold source (see NOTICE.md); those components remain under their original terms. Please retain that attribution.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

May 29, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gpu_contrafold-0.1.0.tar.gz (35.6 kB view details)

Uploaded May 29, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

gpu_contrafold-0.1.0-py3-none-any.whl (32.5 kB view details)

Uploaded May 29, 2026 Python 3

File details

Details for the file gpu_contrafold-0.1.0.tar.gz.

File metadata

Download URL: gpu_contrafold-0.1.0.tar.gz
Upload date: May 29, 2026
Size: 35.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for gpu_contrafold-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`45d28ff5925cab954806366efcd0670a2484aa19aeddb3e6df0596a83e70f2ce`
MD5	`b6325ea5f7f445894b66b5d9a3901461`
BLAKE2b-256	`836ca616334c53507a3c82cb045a0cf31146c1104e59ed026b948cfc777a716f`

See more details on using hashes here.

File details

Details for the file gpu_contrafold-0.1.0-py3-none-any.whl.

File metadata

Download URL: gpu_contrafold-0.1.0-py3-none-any.whl
Upload date: May 29, 2026
Size: 32.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for gpu_contrafold-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e99f022f76e330333a7f70fac2b3705f40376ea66cdac033ee7c25fb51f14fe1`
MD5	`3695f821e3bca278bdc0b6893ddbf3fb`
BLAKE2b-256	`b9931623a483f4d8217a7a9ed096a68704cc070411b47396eb83ba2c90b6fc1f`

See more details on using hashes here.

gpu-contrafold 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Contrafold_GPU

Why

Install

Command-line tool

Usage

Benchmark

Model

Validation

Acknowledgments

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes