Predicting protein-protein interactions and structures from multiple sequence alignments.
🍐 yunta
Predicting pairwise protein-protein interactions and structures from multiple sequence alignments. Now with interspecies (host-pathogen) interactions and automatic chunking of large sequences!
- Installation
- Credit
- Command-line usage
- Python API
- Scaling up
- Issues, problems, suggestions
- Further help
yunta provides several methods for evaluating protein-protein interactions. In order of increasing computational cost:
- GPU-accelerated direct coupling analysis (DCA) in PyTorch
- RoseTTAFold-2track via the rf2t-micro package
- AlphaFold2 for protein-protein structure prediction
yunta has streamlined installation, a command-line interface, a Python API, and resilience to GPU out-of-memory errors through chunking of long sequences and CPU fallback. It takes as input unpaired multiple-sequence alignments in A3M format (as generated by tools like hhblits), and outputs a matrix of inter-residue contacts.
Rough timings for a pair of ~200 amino-acid proteins (S. cerevisiae DHFR and WW domain-containing protein) on CPU:
- DCA: 5 seconds
- RoseTTAFold-2track: 10 seconds
- AlphaFold2: 1 hour
Note that times increase quadratically with total protein length.
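As a rough illustration of the quadratic scaling noted above, the sketch below extrapolates a runtime from one measured pair. The function name and the purely quadratic model are assumptions for illustration; real timings also depend on MSA depth and hardware.

```python
def estimate_runtime(base_seconds, base_total_length, new_total_length):
    """Extrapolate runtime assuming time grows with the square of total length."""
    scale = new_total_length / base_total_length
    return base_seconds * scale ** 2


# A pair of ~200-residue proteins (400 residues in total) takes ~5 s with DCA
# on CPU; doubling the total length would then cost roughly four times as much.
print(estimate_runtime(5, 400, 800))
```

This kind of estimate is only useful for planning batch sizes, not as a performance guarantee.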
Installation
pip install yunta
To enable AlphaFold2 with CUDA 12 (recommended for GPU):
pip install yunta[af_cuda12]
For a local CUDA 12 installation:
pip install yunta[af_cuda12_local]
For CUDA 11:
pip install yunta[af_cuda11]
For AlphaFold2 without a specific CUDA version (CPU or custom JAX install):
pip install yunta[af]
To enable RoseTTAFold-2track:
pip install yunta[rf2t]
Using the embedded models requires the RoseTTAFold-2track and AlphaFold2 weights, which are downloaded automatically on first use. By downloading them you agree that the trained RoseTTAFold weights are available for non-commercial use only, under the terms of the Rosetta-DL Software license, and that AlphaFold2's pretrained parameters fall under the CC BY 4.0 license.
Environment variables
| Variable | Default | Description |
|---|---|---|
| YUNTA_CACHE | ~/.cache/yunta | Directory for the organism interaction lookup table cache. |
| YUNTA_USE_CACHE | False | Set to True to load a pre-built cache from disk rather than rebuilding. |
| YUNTA_TEST | 0 | Set to 1 to build and hold the interaction lookup table in memory only (no disk write). |
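For scripts that wrap yunta, it can help to resolve these variables in one place. The sketch below reads the documented names with the documented defaults; the function name and the exact string comparisons ("True", "1") are assumptions, and yunta's own parsing may be more permissive.

```python
import os
from pathlib import Path


def resolve_yunta_settings(env=None):
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    cache_dir = os.path.expanduser(env.get("YUNTA_CACHE", "~/.cache/yunta"))
    return {
        "cache_dir": Path(cache_dir),
        # Assumption: only the literal string "True" switches the cache on.
        "use_cache": env.get("YUNTA_USE_CACHE", "False") == "True",
        # Assumption: only the literal string "1" enables in-memory test mode.
        "test_mode": env.get("YUNTA_TEST", "0") == "1",
    }


settings = resolve_yunta_settings({"YUNTA_USE_CACHE": "True"})
print(settings["use_cache"], settings["test_mode"])
```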
Credit
yunta is a fork of SpeedPPI, which is itself inspired by FoldDock. This approach used AlphaFold2 to evaluate 65,484 protein-protein interactions from the human proteome in the paper "Towards a structurally resolved human protein interaction network".
The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab:
- Cong et al., Protein interaction networks revealed by proteome coevolution. Science, 2019
- Humphreys et al., Computed structures of core eukaryotic protein complexes. Science, 2021
- Humphreys et al., Protein interactions in human pathogens revealed through deep learning. Nature Microbiology, 2024
yunta puts these algorithms in one place with easy installation, a command-line interface, and a Python API. It also enables interspecies co-evolutionary analysis using a built-in host-pathogen interaction mapping.
Command-line usage
$ yunta --help
usage: yunta [-h] {dca-single,dca-many,rf2t-single,af2-single,af2-many} ...
Screening protein-protein interactions using DCA, RosettaFold-2track, and AlphaFold2.
options:
-h, --help show this help message and exit
Sub-commands:
{dca-single,dca-many,rf2t-single,af2-single,af2-many}
Use these commands to specify the tool you want to use.
dca-single Calculate DCA for one protein-protein interaction.
dca-many Calculate DCA between two sets of proteins, or all pairs in one set of proteins.
rf2t-single Calculate RF-2track contacts between one protein and a series of others.
af2-single Model one protein-protein interaction.
af2-many Model all interactions between two sets of proteins, or all pairs in one set of proteins.
Generating multiple-sequence alignments
All algorithms depend on pre-computed multiple-sequence alignments (MSAs) between a protein of interest and as many homologs as possible. You can generate MSAs using hhblits with pre-clustered databases like UniClust:
hhblits -e 0.01 -v 3 -d /path/to/UniClust-database -i input.fasta -oa3m output-msa.a3m -o /dev/null -cov 60 -n 3 -realign -realign_max 10000
This typically takes 1–40 min depending on query complexity. See the hhsuite documentation for details.
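When many queries need alignments, the hhblits call above can be scripted. The following is a minimal sketch that reuses exactly the flags shown; the helper names and the output naming scheme (query stem plus .a3m) are assumptions, not part of yunta.

```python
import subprocess
from pathlib import Path


def hhblits_command(fasta, database, out_dir="."):
    """Build the hhblits call shown above for one query FASTA file."""
    fasta = Path(fasta)
    out_a3m = Path(out_dir) / (fasta.stem + ".a3m")
    return [
        "hhblits", "-e", "0.01", "-v", "3",
        "-d", str(database),
        "-i", str(fasta),
        "-oa3m", str(out_a3m),
        "-o", "/dev/null",
        "-cov", "60", "-n", "3",
        "-realign", "-realign_max", "10000",
    ]


def build_msas(fasta_files, database, out_dir="."):
    """Run hhblits once per query; raises CalledProcessError on failure."""
    for fasta in fasta_files:
        subprocess.run(hhblits_command(fasta, database, out_dir), check=True)
```

Keeping command construction separate from execution makes it easy to inspect or log the command before running it.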
Calculating contact maps
Given two MSAs, yunta calculates a contact map using DCA, RF2t, or AlphaFold2, and produces a summary table for each pair.
Using DCA or RF2t produces a table like this:
$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc
| ID | uniprot_id_1 | uniprot_id_2 | seq_len | chain_a_len | chain_b_len | msa1_depth | msa2_depth | msa_depth | n_eff | DCA:apc | DCA:mean | DCA:median | DCA:maximum | DCA:minimum | DCA:var | DCA:sigma1 | DCA:focality | DCA:top_A | DCA:top_B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O13297-D6VTK4 | O13297 | D6VTK4 | 980 | 549 | 431 | 14246 | 1546 | 670 | 2 | False | 0.0183 | 0.0147 | 0.0743 | 2.28e-06 | ... | ... | ... | ... | ... |
Method-specific columns are prefixed with DCA: or RF2t:. Common columns across all methods:
- sigma1: leading singular value of the inter-chain contact submatrix
- focality: ratio of the first to the second singular value; higher values indicate a more concentrated interaction signal
- top_A, top_B: indices of the top-scoring residues in each chain (from the leading singular vectors of the SVD)
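To make these summary statistics concrete, here is a dependency-free sketch that derives them from a small inter-chain matrix using power iteration with deflation. It is an illustration under stated assumptions, not yunta's implementation: the function names, the fixed start vector, and ranking residues by absolute loading are all choices made for this example.

```python
import math


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def leading_triplet(m, n_iter=200):
    """Power iteration for the leading singular triplet (sigma, u, v) of m."""
    mt = [list(col) for col in zip(*m)]
    # Fixed start vector keeps the example deterministic; a random start is
    # more robust if the leading vector happens to be orthogonal to all-ones.
    v = [1.0] * len(m[0])
    for _ in range(n_iter):
        w = _matvec(mt, _matvec(m, v))  # one step with M^T M
        n = _norm(w)
        if n == 0.0:
            return 0.0, [0.0] * len(m), [0.0] * len(m[0])
        v = [x / n for x in w]
    mv = _matvec(m, v)
    sigma = _norm(mv)
    u = [x / sigma for x in mv] if sigma > 0 else [0.0] * len(m)
    return sigma, u, v


def summarise_interchain(m, top_k=3):
    """Compute sigma1, focality and top residue indices for matrix m."""
    s1, u, v = leading_triplet(m)
    # Deflate: remove the leading rank-1 component, then find the next value.
    rest = [[m[i][j] - s1 * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]
    s2, _, _ = leading_triplet(rest)
    return {
        "sigma1": s1,
        "focality": s1 / s2 if s2 > 0 else math.inf,
        "top_A": sorted(range(len(u)), key=lambda i: -abs(u[i]))[:top_k],
        "top_B": sorted(range(len(v)), key=lambda j: -abs(v[j]))[:top_k],
    }


# Rows index residues of chain A, columns index residues of chain B.
toy = [[0, 0, 0, 0], [0, 3, 0, 0], [0, 0, 1, 0]]
print(summarise_interchain(toy, top_k=1))
```

In practice a library SVD routine would replace the power iteration; the hand-rolled version is used here only to keep the example self-contained.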
If you also give --plot, contact maps for the full complex and inter-chain contacts only are saved as PNG, alongside CSV files of the raw matrices:
$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc --plot test/outputs/DYR_YEAST-CAPZA_YEAST
Predicting protein complex structures
yunta can feed MSAs into AlphaFold2 to predict binary protein complex structures:
$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv
This writes a summary TSV with AF2:-prefixed metrics — n_contacts, mean_interface_plddt, pdockq, seed — in addition to the standard contact map statistics. PDB structure files are written to the current working directory, named by protein pair ID.
Using --plot generates contact map plots as with the other commands:
$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv --plot test/outputs/af2-single-plot
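The summary TSV can be post-processed with the standard library, for example to rank pairs by one metric. The sketch below assumes the pdockq column is named AF2:pdockq, following the prefix convention described above; check the header of your own output before relying on that name. The toy table is made up for illustration.

```python
import csv
import io


def rank_pairs(handle, column, top_n=10):
    """Return the top_n rows of a summary TSV, sorted by a numeric column."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows[:top_n]


# Toy table for illustration only; these values are not yunta output.
example = io.StringIO(
    "ID\tAF2:pdockq\n"
    "A-B\t0.10\n"
    "A-C\t0.45\n"
    "B-C\t0.30\n"
)
for row in rank_pairs(example, "AF2:pdockq", top_n=2):
    print(row["ID"], row["AF2:pdockq"])
```

With a real file, pass an open file handle instead of the StringIO object.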
Command-line tools
*-single commands run one protein against one or more others:
$ yunta dca-single --help
usage: yunta dca-single [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
[--output [OUTPUT]] [--plot PLOT] [--apc] [msa1]
positional arguments:
msa1 MSA file. Default: STDIN.
options:
-h, --help show this help message and exit
--msa2 [MSA2 ...], -2 [MSA2 ...]
Second MSA file(s). Default: if not provided, all pairwise from msa1.
--list-file, -l Treat inputs as plain-text list of MSA files, rather than MSA filenames.
Default: treat as MSA filenames.
--interspecies, -i MSAs are from different species; enables built-in host-pathogen interaction
map. Default: assume same species.
--strict-match, -S For interspecies mode, require query MSA sequences to be from known
interacting species. Default: relax this constraint for query sequences.
--output [OUTPUT], -o [OUTPUT]
Output filename. Default: STDOUT.
--plot PLOT, -p PLOT Directory for saving plots. Default: don't plot.
--apc, -a Apply average product correction (APC) to DCA scores. Default: off.
If one MSA is provided (no -2), homodimeric interactions are probed. Use --list-file to pass a single plain-text file containing one MSA path per line.
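A list file of the kind accepted by --list-file can be generated with a few lines of Python. This sketch assumes the MSAs sit in one directory and end in .a3m; the function name and glob pattern are illustrative.

```python
from pathlib import Path


def write_msa_list(msa_dir, list_path, pattern="*.a3m"):
    """Write one MSA path per line and return the paths that were written."""
    paths = sorted(str(p) for p in Path(msa_dir).glob(pattern))
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths
```

The resulting file could then be passed as the positional input together with --list-file.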
*-many commands run all pairwise combinations across two sets, or all pairs within one set if --msa2 is omitted:
$ yunta af2-many --help
usage: yunta af2-many [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
[--output [OUTPUT]] [--params PARAMS] [--recycles RECYCLES] [--plot PLOT]
[msa1 ...]
positional arguments:
msa1 MSA file(s).
options:
-h, --help show this help message and exit
--msa2 [MSA2 ...], -2 [MSA2 ...]
Second MSA file(s). Default: if not provided, all pairwise from msa1.
--list-file, -l Treat inputs as plain-text list of MSA files, rather than MSA filenames.
--interspecies, -i MSAs are from different species; enables built-in host-pathogen interaction map.
--strict-match, -S For interspecies mode, require query MSA sequences to be from known
interacting species.
--output [OUTPUT], -o [OUTPUT]
Output filename. Default: STDOUT.
--params PARAMS, -w PARAMS
Path to AlphaFold2 params file (.npz). Downloaded automatically if absent.
--recycles RECYCLES, -x RECYCLES
Maximum number of recycles through the model. Default: 10.
--plot PLOT, -p PLOT Directory for saving plots. Default: don't plot.
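Before launching a *-many run it is worth counting how many pairs it implies. The sketch below enumerates them with itertools; whether yunta also evaluates self-pairs (homodimers) in the one-set case is not stated here, so this example leaves them out as an assumption.

```python
from itertools import combinations, product


def enumerate_pairs(msa1, msa2=None):
    """List the (first, second) MSA pairs a *-many run would need to cover."""
    if msa2:
        # Two sets: every member of the first against every member of the second.
        return list(product(msa1, msa2))
    # Assumption: self-pairs (homodimers) are left out of the one-set case.
    return list(combinations(msa1, 2))


proteome = [f"protein-{i}.a3m" for i in range(100)]
print(len(enumerate_pairs(proteome)))
```

Because the pair count grows quadratically with set size, this quickly shows when a screen is better suited to a cluster.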
Interspecies (host-pathogen) analysis
Use --interspecies / -i when the two MSAs come from organisms that interact as host and pathogen. yunta uses a built-in host-pathogen interaction (HPI) map to pair aligned sequences across species rather than requiring exact species identity:
$ yunta dca-single test/inputs/crypto/Q5CPK5_CRYPI.a3m \
-2 test/inputs/human/EZRI_HUMAN.a3m \
--interspecies --apc \
-o test/outputs/dca-single-interspecies.tsv
By default (without --strict-match), the HPI constraint is relaxed for the query sequences themselves — useful when screening an uncharacterised query against a known host or pathogen proteome. Add --strict-match to require that query sequences come from species in the HPI map.
Python API
Load and inspect an MSA:
from yunta.structs.msa import MSA
msa = MSA.from_file("my-msa-file.a3m")
print(msa) # MSA(name=P07807) of sequence length 549, with 14246 sequences.
print(msa.neff()) # effective sequence count
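For readers unfamiliar with effective sequence counts, the sketch below shows one common definition: each sequence is down-weighted by the number of sequences (itself included) that share at least a given fraction of identical positions. The 80% threshold and the function name are assumptions; msa.neff() may use a different definition or threshold.

```python
def effective_sequence_count(sequences, identity_threshold=0.8):
    """Sum of 1 / (number of sequences at or above the identity threshold)."""

    def identity(a, b):
        # Assumes aligned rows of equal length.
        matches = sum(x == y for x, y in zip(a, b))
        return matches / len(a)

    n_eff = 0.0
    for seq in sequences:
        # Each sequence counts itself, so the neighbour count is at least 1.
        neighbours = sum(
            identity(seq, other) >= identity_threshold for other in sequences
        )
        n_eff += 1.0 / neighbours
    return n_eff


# Two identical rows share one unit of weight; the third row counts fully.
print(effective_sequence_count(["AAAA", "AAAA", "CCCC"]))  # 2.0
```

This all-against-all loop is quadratic in the number of sequences, so it is only practical for small alignments.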
Pair two MSAs and run DCA:
from yunta.structs.msa import MSA, PairedMSA
from yunta.interactions.dca.dca_torch import calculate_dca
msa1 = MSA.from_file("protein-a.a3m")
msa2 = MSA.from_file("protein-b.a3m")
paired = PairedMSA.from_msa(msa1, msa2)
contact_matrix = calculate_dca(paired, apc=True)
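The apc option refers to average product correction. As background, the standard formulation subtracts from each score the product of its row and column means divided by the overall mean. The sketch below implements that textbook formula on nested lists; it is not taken from yunta's source, which may treat details such as the diagonal differently.

```python
def apply_apc(matrix):
    """Subtract (row mean * column mean) / overall mean from every entry."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    # The overall mean must be non-zero for the correction to be defined.
    overall = sum(row_means) / n_rows
    return [
        [matrix[i][j] - row_means[i] * col_means[j] / overall
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


print(apply_apc([[1.0, 2.0], [3.0, 4.0]]))
```

The correction removes background signal shared across a whole row or column, so that remaining high scores are more likely to reflect specific residue pairs.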
For interspecies pairing, pass interaction_map="builtin":
paired = PairedMSA.from_msa(msa1, msa2, interaction_map="builtin")
Or supply a custom dict mapping species IDs to lists of interacting species IDs:
paired = PairedMSA.from_msa(
    msa1, msa2,
    interaction_map={"NCBI:562": ["NCBI:10710"], "NCBI:10710": ["NCBI:562"]},
)
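The example dict above lists each interaction in both directions. When starting from a list of species pairs, a small helper can build such a two-way dict; the helper name is hypothetical, and whether yunta strictly requires both directions is an assumption based on that example.

```python
from collections import defaultdict


def symmetric_interaction_map(pairs):
    """Turn (species_a, species_b) pairs into a two-way lookup dict."""
    lookup = defaultdict(set)
    for a, b in pairs:
        lookup[a].add(b)
        lookup[b].add(a)
    # Sorted lists give a stable, reproducible mapping.
    return {species: sorted(partners) for species, partners in lookup.items()}


interaction_map = symmetric_interaction_map([("NCBI:562", "NCBI:10710")])
print(interaction_map)
```

The result has the same shape as the dict passed to interaction_map in the example above.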
Run the full screening pipeline programmatically:
from yunta.screening import dca_one_vs_many, rf2track_one_vs_many
outputs = dca_one_vs_many(
    msa_file1="query.a3m",
    msa_file2=["target1.a3m", "target2.a3m"],
    apc=True,
    interaction_map="builtin",  # omit for same-species
)
for result_matrix, interaction_matrix, metrics in outputs:
    print(metrics.ID, metrics.focality)
Each element of outputs is a 3-tuple (full_contact_matrix, inter-chain_contact_matrix, metrics_dataclass). Metrics dataclasses (DCAMetrics, RF2TMetrics, AF2Metrics) can be written directly to TSV:
metrics.write("results.tsv")
(More documentation coming soon!)
Scaling up
While the *-many commands handle batches of PPIs, for large-scale screening across an HPC cluster our nf-ggi Nextflow pipeline is more efficient and can also generate MSAs for you.
Issues, problems, suggestions
Add to the issue tracker.
Further help