Predicting protein-protein interactions and structures from multiple sequence alignments.

🍐 yunta

Predicting pairwise protein-protein interactions and structures from multiple sequence alignments. Now with interspecies (host-pathogen) interactions and automatic chunking of large sequences!

yunta provides several implementations of protein-protein interaction evaluation. In order of increasing computational cost:

  • GPU-accelerated direct coupling analysis (DCA) in PyTorch
  • RoseTTAFold-2track via the rf2t-micro package
  • AlphaFold2 for protein-protein structure prediction

yunta has streamlined installation, a command-line interface, a Python API, and resilience to GPU out-of-memory errors through chunking of long sequences and CPU fallback. It takes unpaired multiple-sequence alignments in A3M format (as generated by tools like hhblits) as input and outputs a matrix of inter-residue contacts.

Rough timings for a pair of ~200 amino-acid proteins (S. cerevisiae DHFR and WW domain-containing protein) on CPU:

  • DCA: 5 seconds
  • RoseTTAFold-2track: 10 seconds
  • AlphaFold2: 1 hour

Note that times increase quadratically with total protein length.
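
As a rough illustration of the quadratic scaling, the sketch below extrapolates a runtime from a reference measurement. The function name is hypothetical, and the reference total length of 400 residues is an assumption read off the "pair of ~200 amino-acid proteins" above; treat the result as an order-of-magnitude guide, not a benchmark.

```python
def extrapolate_runtime(ref_seconds, ref_total_length, new_total_length):
    """Scale a reference runtime assuming cost grows with the square of total length."""
    ratio = new_total_length / ref_total_length
    return ref_seconds * ratio ** 2


# Assumed reference: DCA at ~5 seconds for ~400 residues in total (two ~200-residue chains).
estimate = extrapolate_runtime(5, 400, 800)
print(f"Estimated DCA time for 800 residues in total: {estimate:.0f} s")
```

Doubling the total length therefore roughly quadruples the expected runtime.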

Installation

pip install yunta

To enable AlphaFold2 with CUDA 12 (recommended for GPU):

pip install yunta[af_cuda12]

For a local CUDA 12 installation:

pip install yunta[af_cuda12_local]

For CUDA 11:

pip install yunta[af_cuda11]

For AlphaFold2 without a specific CUDA version (CPU or custom JAX install):

pip install yunta[af]

To enable RoseTTAFold-2track:

pip install yunta[rf2t]

Using the embedded models requires the RoseTTAFold-2track and AlphaFold2 weights, which are downloaded automatically on first use. By downloading them, you agree that the trained weights for RoseTTAFold are available for non-commercial use only, under the terms of the Rosetta-DL Software license, and that AlphaFold2's pretrained parameters fall under the CC BY 4.0 license.

Environment variables

Variable          Default          Description
YUNTA_CACHE       ~/.cache/yunta   Directory for the organism interaction lookup table cache.
YUNTA_USE_CACHE   False            Set to True to load a pre-built cache from disk rather than rebuilding.
YUNTA_TEST        0                Set to 1 to build and hold the interaction lookup table in memory only (no disk write).
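
To make the defaults above concrete, here is a minimal sketch of how these three variables could be read from the environment. The helper name and the exact string parsing (case-insensitive "true", literal "1") are assumptions for illustration; this is not yunta's own code, which may interpret the values differently.

```python
import os
from pathlib import Path


def read_yunta_settings(env=None):
    """Resolve the documented variables and defaults from a mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # Documented default: ~/.cache/yunta
        "cache_dir": Path(env.get("YUNTA_CACHE", "~/.cache/yunta")).expanduser(),
        # Documented default: False (assumed here to be parsed case-insensitively)
        "use_cache": env.get("YUNTA_USE_CACHE", "False").strip().lower() == "true",
        # Documented default: 0 (assumed here that only "1" enables it)
        "test_mode": env.get("YUNTA_TEST", "0").strip() == "1",
    }


print(read_yunta_settings({"YUNTA_USE_CACHE": "True"}))
```

When setting these from Python rather than the shell, it is safest to assign to os.environ before importing yunta, in case the values are read at import time.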

Credit

yunta is a fork of SpeedPPI, which is itself inspired by FoldDock. This method used AlphaFold2 to evaluate 65,484 protein-protein interactions from the human proteome in "Towards a structurally resolved human protein interaction network".

The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab.

yunta puts these algorithms in one place with easy installation, a command-line interface, and a Python API. It also enables interspecies co-evolutionary analysis using a built-in host-pathogen interaction mapping.

Command-line usage

$ yunta --help
usage: yunta [-h] {dca-single,dca-many,rf2t-single,af2-single,af2-many} ...

Screening protein-protein interactions using DCA, RosettaFold-2track, and AlphaFold2.

options:
  -h, --help            show this help message and exit

Sub-commands:
  {dca-single,dca-many,rf2t-single,af2-single,af2-many}
                        Use these commands to specify the tool you want to use.
    dca-single          Calculate DCA for one protein-protein interaction.
    dca-many            Calculate DCA between two sets of proteins, or all pairs in one set of proteins.
    rf2t-single         Calculate RF-2track contacts between one protein and a series of others.
    af2-single          Model one protein-protein interaction.
    af2-many            Model all interactions between two sets of proteins, or all pairs in one set of proteins.

Generating multiple-sequence alignments

All algorithms depend on pre-computed multiple-sequence alignments (MSAs) between a protein of interest and as many homologs as possible. You can generate MSAs using hhblits with pre-clustered databases like UniClust:

hhblits -e 0.01 -v 3 -d /path/to/UniClust-database -i input.fasta -oa3m output-msa.a3m -o /dev/null -cov 60 -n 3 -realign -realign_max 10000

This typically takes 1–40 min depending on query complexity. See the hhsuite documentation for details.
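
For readers unfamiliar with A3M, the sketch below shows one way to read such a file with the standard library only. It relies on the A3M convention that lowercase letters mark insertions relative to the query, so removing them leaves every sequence at the query's length. The function name is hypothetical and this is not how yunta itself loads MSAs (see the Python API section for that).

```python
def read_a3m(lines):
    """Parse A3M text into (header, aligned_sequence) pairs with insertions removed."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    # Lowercase residues are insertions relative to the query; gaps ("-") are kept.
    return [(h, "".join(c for c in s if not c.islower())) for h, s in records]


example = ">query\nACDE\n>homolog1\nA-DaE\n"
for name, seq in read_a3m(example.splitlines()):
    print(name, seq)
```

The number of records returned corresponds to the MSA depth, and all aligned sequences share the query's length.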

Calculating contact maps

Given two MSAs, yunta calculates a contact map using DCA, RF2t, or AlphaFold2, and produces a summary table for each pair.

Using DCA or RF2t produces a table like this:

$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc
ID uniprot_id_1 uniprot_id_2 seq_len chain_a_len chain_b_len msa1_depth msa2_depth msa_depth n_eff DCA:apc DCA:mean DCA:median DCA:maximum DCA:minimum DCA:var DCA:sigma1 DCA:focality DCA:top_A DCA:top_B
O13297-D6VTK4 O13297 D6VTK4 980 549 431 14246 1546 670 2 False 0.0183 0.0147 0.0743 2.28e-06 ... ... ... ... ...

Method-specific columns are prefixed with DCA: or RF2t:. Common columns across all methods:

  • sigma1: leading singular value of the inter-chain contact submatrix
  • focality: ratio of first to second singular value; higher values indicate a more concentrated interaction signal
  • top_A, top_B: indices of the top-scoring residues in each chain (from the leading pair of singular vectors)
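
The three statistics above can be sketched in pure Python as follows. This is an illustration of the definitions only, using power iteration with deflation in place of a library SVD; the function names are hypothetical, and yunta's own implementation (and how many top residues it reports) may differ.

```python
import math


def _matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def _norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def leading_singular_triplet(matrix, n_iter=200):
    """Power iteration for the largest singular value and its left/right vectors."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    transposed = [list(col) for col in zip(*matrix)]
    v = [1.0 + 0.01 * j for j in range(n_cols)]  # deterministic start vector
    for _ in range(n_iter):
        v = _matvec(transposed, _matvec(matrix, v))
        norm = _norm(v)
        if norm == 0.0:  # all-zero matrix
            return 0.0, [0.0] * n_rows, [0.0] * n_cols
        v = [x / norm for x in v]
    u = _matvec(matrix, v)
    sigma = _norm(u)
    if sigma == 0.0:
        return 0.0, [0.0] * n_rows, v
    return sigma, [x / sigma for x in u], v


def interface_summary(matrix, top_k=1):
    """Compute sigma1, focality, and top residue indices for an inter-chain matrix."""
    sigma1, u, v = leading_singular_triplet(matrix)
    # Deflate: remove the leading rank-1 component, then find the next singular value.
    residual = [
        [matrix[i][j] - sigma1 * u[i] * v[j] for j in range(len(v))]
        for i in range(len(u))
    ]
    sigma2, _, _ = leading_singular_triplet(residual)
    return {
        "sigma1": sigma1,
        "focality": sigma1 / sigma2 if sigma2 > 0 else math.inf,
        "top_A": sorted(range(len(u)), key=lambda i: -abs(u[i]))[:top_k],
        "top_B": sorted(range(len(v)), key=lambda j: -abs(v[j]))[:top_k],
    }


# Toy 3x4 inter-chain matrix (made-up values): one strong contact, one weak contact.
toy = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(interface_summary(toy))
```

In the toy matrix, the strong contact between residue 1 of chain A and residue 2 of chain B dominates the leading singular vectors, which is why those indices are reported as the top residues.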

If you also give --plot, contact maps for the full complex and inter-chain contacts only are saved as PNG, alongside CSV files of the raw matrices:

$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc --plot test/outputs/DYR_YEAST-CAPZA_YEAST

Predicting protein complex structures

yunta can feed MSAs into AlphaFold2 to predict binary protein complex structures:

$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv

This writes a summary TSV with AF2:-prefixed metrics — n_contacts, mean_interface_plddt, pdockq, seed — in addition to the standard contact map statistics. PDB structure files are written to the current working directory, named by protein pair ID.

Using --plot generates contact map plots as with the other commands:

$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single.tsv --plot test/outputs/af2-single-plot

Command-line tools

*-single commands run one protein against one or more others:

$ yunta dca-single --help
usage: yunta dca-single [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
                        [--output [OUTPUT]] [--plot PLOT] [--apc] [msa1]

positional arguments:
  msa1                  MSA file. Default: STDIN.

options:
  -h, --help            show this help message and exit
  --msa2 [MSA2 ...], -2 [MSA2 ...]
                        Second MSA file(s). Default: if not provided, all pairwise from msa1.
  --list-file, -l       Treat inputs as plain-text list of MSA files, rather than MSA filenames.
                        Default: treat as MSA filenames.
  --interspecies, -i    MSAs are from different species; enables built-in host-pathogen interaction
                        map. Default: assume same species.
  --strict-match, -S    For interspecies mode, require query MSA sequences to be from known
                        interacting species. Default: relax this constraint for query sequences.
  --output [OUTPUT], -o [OUTPUT]
                        Output filename. Default: STDOUT.
  --plot PLOT, -p PLOT  Directory for saving plots. Default: don't plot.
  --apc, -a             Apply average product correction (APC) to DCA scores. Default: off.

If one MSA is provided (no -2), homodimeric interactions are probed. Use --list-file to pass a single plain-text file containing one MSA path per line.
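
The --apc flag shown in the help text applies average product correction. As a minimal sketch of the commonly used formula, each score has the product of its row mean and column mean, divided by the overall mean, subtracted from it. The function name is hypothetical, and details such as whether the diagonal is excluded from the means are assumptions that may differ from yunta's implementation.

```python
def apply_apc(scores):
    """Return scores[i][j] - row_mean[i] * col_mean[j] / overall_mean for a 2D list."""
    n_rows, n_cols = len(scores), len(scores[0])
    row_means = [sum(row) / n_cols for row in scores]
    col_means = [sum(scores[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    if overall == 0:
        return [list(row) for row in scores]  # nothing to correct
    return [
        [scores[i][j] - row_means[i] * col_means[j] / overall for j in range(n_cols)]
        for i in range(n_rows)
    ]


# A purely rank-1 "background" matrix (made-up values) is removed entirely by APC.
background = [[1.0, 2.0], [2.0, 4.0]]
print(apply_apc(background))
```

The correction is designed to suppress background coupling that depends only on the row and column, leaving pair-specific signal.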

*-many commands run all pairwise combinations between two sets of proteins, or all pairs within one set when -2 is omitted:

$ yunta af2-many --help
usage: yunta af2-many [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--strict-match]
                      [--output [OUTPUT]] [--params PARAMS] [--recycles RECYCLES] [--plot PLOT]
                      [msa1 ...]

positional arguments:
  msa1                  MSA file(s).

options:
  -h, --help            show this help message and exit
  --msa2 [MSA2 ...], -2 [MSA2 ...]
                        Second MSA file(s). Default: if not provided, all pairwise from msa1.
  --list-file, -l       Treat inputs as plain-text list of MSA files, rather than MSA filenames.
  --interspecies, -i    MSAs are from different species; enables built-in host-pathogen interaction map.
  --strict-match, -S    For interspecies mode, require query MSA sequences to be from known
                        interacting species.
  --output [OUTPUT], -o [OUTPUT]
                        Output filename. Default: STDOUT.
  --params PARAMS, -w PARAMS
                        Path to AlphaFold2 params file (.npz). Downloaded automatically if absent.
  --recycles RECYCLES, -x RECYCLES
                        Maximum number of recycles through the model. Default: 10.
  --plot PLOT, -p PLOT  Directory for saving plots. Default: don't plot.

Interspecies (host-pathogen) analysis

Use --interspecies / -i when the two MSAs come from organisms that interact as host and pathogen. yunta uses a built-in host-pathogen interaction (HPI) map to pair aligned sequences across species rather than requiring exact species identity:

$ yunta dca-single test/inputs/crypto/Q5CPK5_CRYPI.a3m \
    -2 test/inputs/human/EZRI_HUMAN.a3m \
    --interspecies --apc \
    -o test/outputs/dca-single-interspecies.tsv

By default (without --strict-match), the HPI constraint is relaxed for the query sequences themselves — useful when screening an uncharacterised query against a known host or pathogen proteome. Add --strict-match to require that query sequences come from species in the HPI map.

Python API

Load and inspect an MSA:

from yunta.structs.msa import MSA

msa = MSA.from_file("my-msa-file.a3m")
print(msa)         # MSA(name=P07807) of sequence length 549, with 14246 sequences.
print(msa.neff())  # effective sequence count
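
For intuition about what an effective sequence count measures, the sketch below uses a common definition: each sequence is down-weighted by the number of sequences (itself included) that are at least 80% identical to it, and the weights are summed. The function name, the 80% threshold, and the identity measure are assumptions; msa.neff() may use a different definition.

```python
def effective_sequence_count(sequences, identity_threshold=0.8):
    """Sum of 1 / (number of sequences within the identity threshold) over all sequences."""

    def identity(a, b):
        # Fraction of aligned columns with identical characters (equal lengths assumed).
        return sum(x == y for x, y in zip(a, b)) / len(a)

    total = 0.0
    for seq in sequences:
        n_similar = sum(identity(seq, other) >= identity_threshold for other in sequences)
        total += 1.0 / n_similar  # n_similar >= 1 because each sequence matches itself
    return total


# Two identical sequences count as one effective sequence; the third is distinct.
print(effective_sequence_count(["AAAA", "AAAA", "CCCC"]))
```

A deep alignment made of near-duplicates therefore has a much lower effective count than its raw depth suggests.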

Pair two MSAs and run DCA:

from yunta.structs.msa import MSA, PairedMSA
from yunta.interactions.dca.dca_torch import calculate_dca

msa1 = MSA.from_file("protein-a.a3m")
msa2 = MSA.from_file("protein-b.a3m")
paired = PairedMSA.from_msa(msa1, msa2)
contact_matrix = calculate_dca(paired, apc=True)

For interspecies pairing, pass interaction_map="builtin":

paired = PairedMSA.from_msa(msa1, msa2, interaction_map="builtin")

Or supply a custom dict mapping species IDs to lists of interacting species IDs:

paired = PairedMSA.from_msa(
    msa1, msa2,
    interaction_map={"NCBI:562": ["NCBI:10710"], "NCBI:10710": ["NCBI:562"]},
)
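
To show what such a mapping expresses, here is a simplified sketch of pairing alignment rows by species. Without a map, rows are paired only when their species IDs are identical; with a map in the format above, a row is paired with the first unused row whose species is listed as an interactor. The function and its greedy strategy are hypothetical and are not yunta's pairing algorithm.

```python
def pair_rows(species_a, species_b, interaction_map=None):
    """Greedily pair row indices of two MSAs by species ID or by an interaction map."""
    used_b = set()
    pairs = []
    for i, sp_a in enumerate(species_a):
        if interaction_map is None:
            allowed = {sp_a}  # same-species pairing
        else:
            allowed = set(interaction_map.get(sp_a, ()))  # interacting species only
        for j, sp_b in enumerate(species_b):
            if j not in used_b and sp_b in allowed:
                pairs.append((i, j))
                used_b.add(j)
                break
    return pairs


rows_a = ["NCBI:562", "NCBI:9606"]
rows_b = ["NCBI:10710", "NCBI:562"]
hpi = {"NCBI:562": ["NCBI:10710"], "NCBI:10710": ["NCBI:562"]}
print(pair_rows(rows_a, rows_b))        # same-species pairing
print(pair_rows(rows_a, rows_b, hpi))   # pairing through the interaction map
```

Each row of the second MSA is used at most once, so the paired alignment never duplicates a sequence.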

Run the full screening pipeline programmatically:

from yunta.screening import dca_one_vs_many, rf2track_one_vs_many

outputs = dca_one_vs_many(
    msa_file1="query.a3m",
    msa_file2=["target1.a3m", "target2.a3m"],
    apc=True,
    interaction_map="builtin",  # omit for same-species
)
for result_matrix, interaction_matrix, metrics in outputs:
    print(metrics.ID, metrics.focality)

Each element of outputs is a 3-tuple (full_contact_matrix, inter_chain_contact_matrix, metrics_dataclass). Metrics dataclasses (DCAMetrics, RF2TMetrics, AF2Metrics) can be written directly to TSV:

metrics.write("results.tsv")
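
Once results are in a TSV file, they can be ranked with the standard library. The sketch below assumes a tab-separated file with a header row containing the ID and DCA:focality columns shown in the dca-single output earlier; the helper name and the example rows are made up for illustration.

```python
import csv
import io


def rank_by_column(tsv_handle, column="DCA:focality"):
    """Read a tab-separated results table and sort rows by a numeric column, highest first."""
    rows = list(csv.DictReader(tsv_handle, delimiter="\t"))
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)


# Made-up example rows; in practice, pass open("results.tsv", newline="").
example = io.StringIO("ID\tDCA:focality\nA-B\t1.5\nA-C\t3.0\n")
for row in rank_by_column(example):
    print(row["ID"], row["DCA:focality"])
```

Ranking by focality puts pairs with the most concentrated interaction signal first, which is a reasonable starting point for choosing candidates for the more expensive methods.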

(More documentation coming soon!)

... if you want to scale up

While the *-many commands handle batches of PPIs, for large-scale screening across an HPC cluster our nf-ggi Nextflow pipeline is more efficient and can also generate MSAs for you.

Issues, problems, suggestions

Add to the issue tracker.
