Reverse Complement Equivariant Layers

Equi-RC : Equivariant layers for RC-complement symmetry in DNA sequence data

This repository implements, in Keras and PyTorch, the layers described in "Reverse-Complement Equivariant Networks for DNA Sequences" (see paper).

Setup and notes

First, install Keras or PyTorch. Then install our package with pip by running:

pip install equirc

This package includes both versions (PyTorch and Keras) of the code. To use the PyTorch layers, for instance, specify the corresponding module in the import:

from equirc.pytorch_rclayers import RegToRegConv

Summary of the paper

The DNA molecule contains two strands, each composed of a chain of nucleotides A, C, T and G. These strands are complementary: As and Cs are always paired with Ts and Gs, respectively. Therefore, to represent a DNA molecule, one can arbitrarily choose a strand and then encode its chain, for instance with a one-hot encoding. Sequencing, the experimental way to access this information, can yield either of the two possible representations. For ML tasks on the whole DNA (both strands), such as predicting the binding of a molecule to the double helix, one wants the prediction to be stable across the two possible representations of the input.
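As a small illustration of the two representations described above, the following sketch one-hot encodes a sequence and its reverse complement in plain Python. The channel order A, C, G, T and the helper names are assumptions made for this example, not part of the package.

```python
# Channel order is an assumption made for this sketch: A, C, G, T.
ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the same molecule as read on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def one_hot(seq):
    """Encode a sequence as a list of 4-dimensional one-hot columns."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


seq = "AACGT"
rc = reverse_complement(seq)
print(rc)            # ACGTT
print(one_hot(seq))  # encoding of the chosen strand
print(one_hot(rc))   # encoding of the other strand, same molecule
```

Both encodings describe the same double-stranded molecule, so a model of the whole molecule should treat them consistently.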

To find all possible networks satisfying this condition, we frame the problem in terms of group theory. Indeed, given a sequence s, applying the reverse complement operation twice gives back s: RC(RC(s)) = s. Therefore, this operation is a representation of the group Z2. To also have stability under translation, we extend this group to the product Z x Z2. Then, we leverage the framework of equivariance:

Equivariance

Let X1 and X2 be two spaces with group actions π1 and π2, respectively. A function Φ going from X1 to X2 is said to be equivariant if Φ o π1 = π2 o Φ. This means that applying the group action π1 to an input x and then going through Φ yields the same result as first going through Φ and then applying π2. Equivariant functions can be composed, making them suitable for the design of deep learning layers.
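The definition can be checked numerically on a toy case. The sketch below is written in plain Python and is not the package's implementation: it takes both π1 and π2 to be "reverse the positions and reverse the channels", uses a zero-padded convolution as Φ, and compares a free kernel with a kernel whose weights are tied by a flip along all three axes. All function names are hypothetical, and a check on a single input illustrates the property but does not prove it.

```python
def rc_action(x):
    """Reverse the positions and reverse the channels of each column."""
    return [col[::-1] for col in reversed(x)]


def conv1d(x, w):
    """Zero-padded 1D convolution; x is a list of columns, w is indexed w[k][o][c]."""
    pad = len(w) // 2  # with an odd kernel size the output stays aligned with the input
    n_out, n_in = len(w[0]), len(w[0][0])
    out = []
    for i in range(len(x)):
        col = [0] * n_out
        for k, w_k in enumerate(w):
            j = i + k - pad
            if 0 <= j < len(x):
                for o in range(n_out):
                    col[o] += sum(w_k[o][c] * x[j][c] for c in range(n_in))
        out.append(col)
    return out


def tie_weights(w):
    """Add to w its copy flipped along the kernel, output and input axes."""
    return [
        [[w_k[o][c] + w_f[-1 - o][-1 - c] for c in range(len(w_k[o]))]
         for o in range(len(w_k))]
        for w_k, w_f in zip(w, reversed(w))
    ]


def is_equivariant(phi, act_in, act_out, x):
    """Check phi(act_in(x)) == act_out(phi(x)) on one input x."""
    return phi(act_in(x)) == act_out(phi(x))


# One-hot columns of "AACGT", assuming the channel order A, C, G, T.
x = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

# Kernel of size 3 mapping 4 input channels to 2 output channels.
w_free = [
    [[0, 0, 0, 0], [0, 0, 0, 0]],
    [[1, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 1, 0, 0]],
]
w_tied = tie_weights(w_free)

print(is_equivariant(lambda v: conv1d(v, w_free), rc_action, rc_action, x))  # False
print(is_equivariant(lambda v: conv1d(v, w_tied), rc_action, rc_action, x))  # True
```

Integer weights are used so that both sides can be compared exactly; with floating-point weights a tolerance would be needed.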

We know what the input group action π0 on one-hot encoded DNA matrices is: we permute the channels A and C into T and G and we reverse the sequence order. We also need to equip each intermediate space of our network with a group action. This is similar to choosing the number of feature maps at each layer, except that beyond choosing the dimension, we also choose how the group acts on each column (the features at each position). A mathematical tool known as the induced representation then extends this action on the columns to an action on the whole matrix.

Once equipped with these representations, our paper finds all possible equivariant linear Φ, and all non-linear pointwise Φ when the input and output representations are the same. We then show that previous methods, such as RCPS, are special cases of this general setting. We also implement equivariant k-merisation and equivariant BatchNorm.

Finally, we empirically investigate the efficiency of our networks. We show that having access to a larger functional space yields better performance, but we do not find that a specific equivariant parametric function behaves consistently better than the others. This advocates for tuning these new hyperparameters on the validation set to achieve the best results.

Practical design of equivariant networks

All the possible representations are described in the paper, but the practical ones to use are mostly of two types :

  • 'Irrep', with a and b two integers : a dimensions are unaffected by the group action and b dimensions see their signs flipped.
  • 'Reg' are regular layers : upon group action the column is reversed.

The a_n arguments (such as a_in) give the number of dimensions of type +1, and the b_n arguments the number of dimensions of type -1. The reg_in and reg_out arguments should be understood as the number of cycles and thus correspond to half the total dimension. For instance, in the input, we have 4 nucleotides and reg=2. For technical reasons (the formalism of continuous convolution as opposed to matrix multiplication), one needs to use only odd kernel sizes to ensure strict equivariance.
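The two types of action can be sketched in a few lines of plain Python. This is only an illustration of the descriptions above: the function names are hypothetical, and placing the a unaffected dimensions before the b sign-flipped ones is an assumption of this example, not a statement about how the package lays out its channels.

```python
def irrep_action(column, a, b):
    """Irrep(a, b): keep the first a entries, flip the sign of the last b."""
    assert len(column) == a + b
    return column[:a] + [-v for v in column[a:]]


def reg_action(column):
    """Reg: reverse the column, which has 2 * reg entries."""
    return column[::-1]


def induced_action(feature_map, column_action):
    """Act on a whole feature map: reverse the positions, then act on each column."""
    return [column_action(col) for col in reversed(feature_map)]


# Input layer: 4 nucleotide channels, i.e. reg=2 (assumed order A, C, G, T).
dna = [[1, 0, 0, 0], [0, 1, 0, 0]]  # "AC"
print(induced_action(dna, reg_action))  # [[0, 0, 1, 0], [0, 0, 0, 1]], i.e. "GT"

# Hidden layer with a=2 and b=1.
hidden = [[1, 2, 3], [4, 5, 6]]
print(induced_action(hidden, lambda col: irrep_action(col, a=2, b=1)))
# [[4, 5, -6], [1, 2, -3]]
```

In both cases, applying the action twice returns the original feature map, as expected for an action of Z2.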

The layer names and parameters should be quite explicit: for instance, IrrepToRegConv is a linear layer going from a space structured with the Irrep action to one structured with the Reg action. This layer takes as parameters a_in and b_in, which are the a and b dimensions of its input, as well as reg_out, which corresponds to its output group action. The other layers (BatchNorm, Kmers) also follow this nomenclature. For the non-linearities, you can directly use the ones native to your framework, following the rules of theorem 3. A practical implementation is to use any non-linearity in spaces with a 'Reg' structure, and odd non-linearities such as tanh in spaces with an 'Irrep' structure. We found that networks that balance a and b perform better.
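The practical rule for non-linearities can be illustrated with a small check in plain Python. This sketch does not reproduce theorem 3; it only tests, on one column, whether a pointwise function commutes with the column reversal of a 'Reg' space or with the sign flip of the b-type dimensions of an 'Irrep' space. The helper names are hypothetical.

```python
import math


def relu(v):
    return max(0.0, v)


def pointwise(f, column):
    """Apply f independently to every entry of the column."""
    return [f(v) for v in column]


def flip_signs(column):
    """Action on dimensions of type -1."""
    return [-v for v in column]


def reverse(column):
    """Action on a 'Reg' column."""
    return column[::-1]


def commutes(f, action, column):
    """Check that applying f then the action equals the action then f."""
    left = pointwise(f, action(column))
    right = action(pointwise(f, column))
    return all(math.isclose(l, r, abs_tol=1e-12) for l, r in zip(left, right))


column = [0.5, -1.0, 2.0, -0.25]
print(commutes(relu, reverse, column))          # True: a permutation commutes with any pointwise map
print(commutes(math.tanh, flip_signs, column))  # True: tanh is odd
print(commutes(relu, flip_signs, column))       # False: ReLU is not odd
```

The last line shows why a non-odd function such as ReLU breaks equivariance on sign-flipped dimensions.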

Examples

Keras

The class used for the Binary Prediction task is implemented as an example. One can refer to this implementation and, to test it, simply run:

python keras_example.py

PyTorch

The equivalent class is also written in PyTorch and can be run with:

python pytorch_example.py

Acknowledgements

We want to thank Hannah Zhou, Avanti Shrikumar and Anshul Kundaje for valuable discussions. We also thank Guillaume Bouvier and the CBIO members for their advice.

Contact

Please feel free to reach out by sending an email to vincent.mallet96@gmail.com or by opening an issue.

Cite

If you want to cite this tool, please use:

@article{mallet2021reverse,
  title={Reverse-Complement Equivariant Networks for DNA Sequences},
  author={Mallet, Vincent and Vert, Jean-Philippe},
  journal={bioRxiv},
  year={2021},
  publisher={Cold Spring Harbor Laboratory}
}
