DiffPaSS – Differentiable Pairing using Soft Scores

Overview

DiffPaSS is a family of high-performance and scalable PyTorch modules for finding optimal one-to-one pairings between two collections of biological sequences, and for performing general graph alignment.

Pairing multiple-sequence alignments (MSAs)

A typical example of the problem DiffPaSS is designed to solve is the following: given two multiple sequence alignments (MSAs) A and B, containing interacting biological sequences, find the optimal one-to-one pairing between the sequences in A and B.

Figure: Pairing problem for two multiple sequence alignments, where pairings are restricted to be within the same species.

To find an optimal pairing, we can maximize the average mutual information between columns of the two paired MSAs (InformationPairing), or we can maximize the similarity between distance-based (MirrortreePairing) or orthology (BestHitsPairing) networks constructed from the two MSAs.
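To make the mutual-information objective concrete, the sketch below estimates the mutual information between one column of A and one column of B under a given pairing, using plain empirical frequencies. The function name `column_mutual_information` and the toy columns are illustrative assumptions and are not part of the DiffPaSS API; DiffPaSS optimizes a differentiable ("soft") extension of scores of this kind, averaged over column pairs.

```python
from collections import Counter
from math import log

def column_mutual_information(col_a, col_b):
    """Empirical mutual information (in nats) between two paired columns."""
    n = len(col_a)
    assert n == len(col_b), "paired columns must have the same length"
    counts_a = Counter(col_a)
    counts_b = Counter(col_b)
    counts_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in counts_ab.items():
        # Adds p(a, b) * log(p(a, b) / (p(a) * p(b))) for each observed symbol pair
        mi += (c / n) * log(c * n / (counts_a[a] * counts_b[b]))
    return mi

# Toy columns: row i of A is paired with row i of B
col_a = list("AACC")
col_b_good = list("GGTT")      # co-varies perfectly with col_a
col_b_shuffled = list("GTGT")  # scrambled pairing: independent of col_a

mi_good = column_mutual_information(col_a, col_b_good)
mi_shuffled = column_mutual_information(col_a, col_b_shuffled)
```

In this toy case the correct pairing gives a mutual information of log 2, while the scrambled pairing gives zero, which is why maximizing this quantity over pairings can recover co-varying partners.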

Graph alignment and pairing unaligned sequence collections

DiffPaSS can be used for general graph alignment problems (GraphAlignment), where the goal is to find the one-to-one pairing between the nodes of two weighted graphs that maximizes the similarity between the two graphs. The user can specify the (dis-)similarity measure to be optimized, as an arbitrary differentiable function of the adjacency matrices of the two graphs.

Using this capability, DiffPaSS can be used for finding the optimal one-to-one pairing between two unaligned collections of sequences, if weighted graphs are built in advance from the two collections (for example, using the pairwise Levenshtein distance). This is useful when alignments are not available or reliable.
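As an illustration of building such a graph in advance, the following sketch computes a pairwise Levenshtein distance matrix for a small collection of unaligned sequences in plain Python. The helper names `levenshtein` and `distance_matrix` and the toy sequences are assumptions made for this example; how the resulting matrix is passed to GraphAlignment is not shown here and should be checked against the documentation.

```python
def levenshtein(s, t):
    """Edit distance between s and t with unit-cost insert, delete, substitute."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution (free if symbols match)
            ))
        prev = curr
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric matrix of pairwise edit distances, usable as graph edge weights."""
    n = len(seqs)
    return [[levenshtein(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]

# Toy collection of unaligned sequences
seqs_a = ["MKTAY", "MKTY", "MRTAYW"]
dist_a = distance_matrix(seqs_a)
```

For large collections, a dedicated edit-distance library would likely be faster than this quadratic pure-Python loop.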

Can I pair two collections with a different number of sequences?

DiffPaSS optimizes and returns permutation matrices. Hence, its inputs are required to have the same number of sequences. However, DiffPaSS can be used to pair two collections (e.g. MSAs) containing a different number of sequences, by padding the smaller collection with dummy sequences. For multiple sequence alignments, a simple choice is to add dummy sequences consisting entirely of gap symbols. For general graphs, dummy nodes, connected to the other nodes with arbitrary edge weights, can be added to the smaller graph.
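For the general graph case, padding can be sketched as follows with NumPy. The function `pad_adjacency` and the choice of a constant `fill_value` of 0.0 for the dummy edges are assumptions for illustration only; as noted above, the edge weights of dummy nodes are arbitrary.

```python
import numpy as np

def pad_adjacency(adj, target_size, fill_value=0.0):
    """Return a (target_size x target_size) copy of adj with dummy nodes appended."""
    n = adj.shape[0]
    if target_size < n:
        raise ValueError("target_size must be at least the current number of nodes")
    # Dummy rows and columns are filled with fill_value (hypothetical choice)
    padded = np.full((target_size, target_size), fill_value, dtype=float)
    padded[:n, :n] = adj
    return padded

# A 2-node graph padded to match a 3-node graph
adj_small = np.array([[0.0, 1.0], [1.0, 0.0]])
adj_padded = pad_adjacency(adj_small, target_size=3)
```

After pairing, any node matched to a dummy node can simply be treated as unpaired.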

How DiffPaSS works: soft scores, differentiable optimization, bootstrap

Check our paper for details of the DiffPaSS and DiffPaSS-IPA algorithms. Briefly, the main ingredients are as follows:

  1. “Soft” scores that differentiably extend information-theoretic scores between two paired multiple sequence alignments (MSAs), or scores based on sequence similarity or graph similarity measures.

  2. The (truncated) Sinkhorn operator for smoothly parametrizing “soft permutations”, and the matching operator for parametrizing real permutations [Mena et al., 2018].

  3. A novel and efficient bootstrap technique, motivated by mathematical results and heuristic insights into this smooth optimization process. See the animation below for an illustration.

  4. A notion of “robust pairs” that can be used to identify pairs that are consistently found throughout a DiffPaSS bootstrap. These pairs can be used as ground truths in another DiffPaSS run, giving rise to the DiffPaSS-Iterative Pairing Algorithm (DiffPaSS-IPA).
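The second ingredient can be sketched in a few lines. The code below is a minimal NumPy version of a truncated Sinkhorn operator in log space, written for illustration under stated assumptions: the names `log_sinkhorn` and `_logsumexp`, the temperature value and the toy score matrix are hypothetical, and DiffPaSS's own PyTorch implementation may differ in its details.

```python
import numpy as np

def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis, keeping dimensions."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def log_sinkhorn(log_alpha, n_iter=20):
    """Alternate row and column normalization n_iter times (truncated Sinkhorn)."""
    for _ in range(n_iter):
        log_alpha = log_alpha - _logsumexp(log_alpha, axis=1)  # rows sum to 1
        log_alpha = log_alpha - _logsumexp(log_alpha, axis=0)  # columns sum to 1
    return np.exp(log_alpha)

# Toy parameter matrix: a diagonal preference plus arbitrary per-row offsets
scores = 2.0 * np.eye(3) + np.array([[0.0], [1.0], [-1.0]])
temperature = 0.5  # lower values push the output closer to a hard permutation
soft_perm = log_sinkhorn(scores / temperature)
```

The output is a doubly stochastic "soft permutation" that depends smoothly on `scores`, so gradients can flow through it. As the temperature tends to zero, the result approaches the hard permutation given by the matching operator, which maximizes the total score over all permutations.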

Animation: The DiffPaSS bootstrap technique and robust pairs.
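One simple way to picture robust pairs is sketched below: given the permutations found at successive bootstrap iterations, keep only the pairs that appear in every one of them. Treating "consistently found" as "found in every iteration" is an assumption made for this example, as are the function name `robust_pairs` and the list-based encoding of permutations; see the paper for the precise criterion.

```python
def robust_pairs(perms):
    """Return the (i, j) pairs that occur in every permutation of the list.

    Each permutation is a list where perm[i] = j means that item i of the
    first collection is paired with item j of the second (assumed encoding).
    """
    common = set(enumerate(perms[0]))
    for perm in perms[1:]:
        common &= set(enumerate(perm))
    return sorted(common)

# Toy permutations from three hypothetical bootstrap iterations
bootstrap_perms = [
    [0, 1, 2, 3],
    [0, 2, 1, 3],
    [0, 1, 2, 3],
]
pairs = robust_pairs(bootstrap_perms)
```

Here items 0 and 3 are paired identically in all three iterations, so they are the pairs that would be fixed in a subsequent run.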

Installation

From PyPI

DiffPaSS requires Python 3.7 or later. It is available on PyPI and can be installed with pip:

python -m pip install diffpass

From source

Clone this repository on your local machine by running

git clone git@github.com:Bitbol-Lab/DiffPaSS.git

and move inside the root folder. We recommend creating and activating a dedicated conda or virtualenv Python virtual environment. Then, make an editable install of the package:

python -m pip install -e .

Quickstart

Input data preprocessing

First, parse each multiple sequence alignment (MSA), stored in FASTA format, into a list of (header, sequence) tuples using read_msa.

from diffpass.msa_parsing import read_msa

# Parse the MSAs into lists of (header, sequence) tuples
msa_data_A = read_msa("path/to/msa_A.fasta")
msa_data_B = read_msa("path/to/msa_B.fasta")

We assume that the MSAs contain species information in the headers, which will be used to restrict the pairings to be within the same species (more generally, “groups”). We need a simple function to extract the species information from the headers. For instance, if the headers are in the format >sequence_id|species_name|..., we can use:

def species_name_func(header):
    return header.split("|")[1]

This function will be used to group the sequences by species:

from diffpass.data_utils import create_groupwise_seq_records

msa_data_A_species_by_species = create_groupwise_seq_records(msa_data_A, species_name_func)
msa_data_B_species_by_species = create_groupwise_seq_records(msa_data_B, species_name_func)
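Conceptually, this grouping step amounts to bucketing the (header, sequence) records by the value returned for each header. The standalone sketch below illustrates the idea on toy data; `group_records_by_species` is a hypothetical helper written for this example, not the implementation of create_groupwise_seq_records, and the dictionary return type is an assumption.

```python
from collections import defaultdict

def species_name_func(header):
    return header.split("|")[1]

def group_records_by_species(records, species_name_func):
    """Bucket (header, sequence) records by species, preserving input order."""
    groups = defaultdict(list)
    for header, sequence in records:
        groups[species_name_func(header)].append((header, sequence))
    return dict(groups)

# Toy records with headers of the form sequence_id|species_name|...
toy_msa = [
    ("seq1|E_coli|x", "MK-A"),
    ("seq2|B_subtilis|x", "MR-A"),
    ("seq3|E_coli|x", "MKTA"),
]
toy_groups = group_records_by_species(toy_msa, species_name_func)
```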

If one of the MSAs contains sequences from species not present in the other MSA, we can remove these species from both MSAs:

from diffpass.data_utils import remove_groups_not_in_both

msa_data_A_species_by_species, msa_data_B_species_by_species = remove_groups_not_in_both(
    msa_data_A_species_by_species, msa_data_B_species_by_species
)

If some species have different numbers of sequences in the two MSAs, we can add dummy sequences to whichever MSA has fewer sequences for that species, so that the counts match. For example, we can add dummy sequences consisting entirely of gap symbols:

from diffpass.data_utils import pad_msas_with_dummy_sequences

msa_data_A_species_by_species_padded, msa_data_B_species_by_species_padded = pad_msas_with_dummy_sequences(
    msa_data_A_species_by_species, msa_data_B_species_by_species
)

species = list(msa_data_A_species_by_species_padded.keys())
species_sizes = list(map(len, msa_data_A_species_by_species_padded.values()))

Next, one-hot encode the MSAs using the one_hot_encode_msa function.

import torch

from diffpass.data_utils import one_hot_encode_msa

device = "cuda" if torch.cuda.is_available() else "cpu"

# Unpack the padded MSAs into a list of records
msa_data_A_for_pairing = [record for records_this_species in msa_data_A_species_by_species_padded.values() for record in records_this_species]
msa_data_B_for_pairing = [record for records_this_species in msa_data_B_species_by_species_padded.values() for record in records_this_species]

# One-hot encode the MSAs and load them to a device
msa_A_oh = one_hot_encode_msa(msa_data_A_for_pairing, device=device)
msa_B_oh = one_hot_encode_msa(msa_data_B_for_pairing, device=device)
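For readers unfamiliar with one-hot encoding, the sketch below shows the general idea on toy aligned sequences with NumPy. The alphabet, its ordering, the output shape (number of sequences, alignment length, alphabet size) and the function name `one_hot_encode` are all assumptions for illustration; the tensors produced by one_hot_encode_msa may be laid out differently.

```python
import numpy as np

# Assumed alphabet: gap symbol followed by the 20 standard amino acids
ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"

def one_hot_encode(sequences, alphabet=ALPHABET):
    """Encode equal-length sequences as an array with a single 1 per position."""
    index = {symbol: k for k, symbol in enumerate(alphabet)}
    encoded = np.zeros(
        (len(sequences), len(sequences[0]), len(alphabet)), dtype=np.float32
    )
    for i, seq in enumerate(sequences):
        for j, symbol in enumerate(seq):
            encoded[i, j, index[symbol]] = 1.0
    return encoded

toy_oh = one_hot_encode(["MK-A", "MKTA"])
```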

Pairing optimization

Finally, we can instantiate an InformationPairing object and optimize the mutual information between the paired MSAs using the DiffPaSS bootstrap algorithm. The results are stored in a DiffPaSSResults container. The lists of (hard) losses and permutations found can be accessed as attributes of the container.

from diffpass.train import InformationPairing

information_pairing = InformationPairing(group_sizes=species_sizes).to(device)
bootstrap_results = information_pairing.fit_bootstrap(msa_A_oh, msa_B_oh)

print(f"Final hard loss: {bootstrap_results.hard_losses[-1].item()}")
print(f"Final hard permutations (one permutation per species): {bootstrap_results.hard_perms[-1]}")
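Once a permutation for one species is available, it can be turned into explicit sequence pairs. The sketch below assumes the convention that `perm[i] = j` pairs the i-th sequence of A with the j-th sequence of B within that species; this convention, the helper name `paired_headers` and the toy headers are assumptions, so the direction of the mapping should be checked against the documentation before relying on it.

```python
def paired_headers(headers_a, headers_b, perm):
    """Pair headers_a[i] with headers_b[perm[i]] (assumed convention)."""
    return [(headers_a[i], headers_b[j]) for i, j in enumerate(perm)]

# Toy headers for a single species and a hypothetical permutation
headers_a = ["A1|E_coli", "A2|E_coli", "A3|E_coli"]
headers_b = ["B1|E_coli", "B2|E_coli", "B3|E_coli"]
toy_perm = [2, 0, 1]

toy_pairs = paired_headers(headers_a, headers_b, toy_perm)
```

If dummy sequences were added during padding, pairs involving a dummy sequence should be discarded at this stage.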

For more details and examples, including the DiffPaSS-IPA variant, see the tutorials.

Tutorials

See the mutual_information_msa_pairing.ipynb notebook for an example of paired MSA optimization in the case of well-known prokaryotic datasets, for which ground truth pairings are given by genome proximity.

Documentation

The full documentation is available at https://Bitbol-Lab.github.io/DiffPaSS/.

Citation

To cite this work, please refer to the following publication:

@inproceedings{
  lupo2024diffpass,
  title={DiffPa{SS} {\textendash} Differentiable and scalable pairing of biological sequences using soft scores},
  author={Umberto Lupo and Damiano Sgarbossa and Martina Milighetti and Anne-Florence Bitbol},
  booktitle={ICLR 2024 Workshop on Generative and Experimental Perspectives for Biomolecular Design},
  year={2024},
  url={https://openreview.net/forum?id=n5hO5seROB}
}

nbdev

Project developed using nbdev.
