Reverse Complement Equivariant Layers
Equi-RC : Equivariant layers for RC-complement symmetry in DNA sequence data
This repository implements the layers described in "Reverse-Complement Equivariant Networks for DNA Sequences" (see paper) in Keras and PyTorch.
Setup and notes
First, install Keras or PyTorch. Then install our package with pip by running:
pip install equirc
This package includes both versions of the code (PyTorch and Keras). To use the PyTorch layers, for instance, import them from the corresponding module:
from equirc.pytorch_rclayers import RegToRegConv
Summary of the paper
The DNA molecule contains two strands, each composed of a chain of nucleotides A, C, T and G. These strands are complementary: A always pairs with T, and C always pairs with G. Therefore, to represent a DNA molecule, one can arbitrarily choose a strand and encode its chain, for instance with one-hot encoding. Sequencing, the experimental way to access this information, yields either of the two possible representations. For ML tasks that concern the whole DNA molecule (both strands), such as predicting the binding of a molecule to the double helix, one wants a prediction that is stable across the two possible representations of the input.
To find all possible networks satisfying this condition, we frame the problem in terms of group theory. Given a sequence s, applying the reverse complement operation twice returns s: RC(RC(s)) = s. Therefore, this operation defines a representation of the group Z2. To also obtain stability under translation, we extend this group to the product Z x Z2. We then leverage the framework of equivariance:
Let X1 and X2 be two spaces with group actions π1 and π2, respectively. A function Φ going from X1 to X2 is said to be equivariant if Φ o π1 = π2 o Φ. This means that applying the group action π1 to an input x and then going through Φ yields the same result as first going through Φ and then applying π2. Equivariant functions can be composed, which makes them suitable for the design of deep learning layers.
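The definition above can be checked numerically. The following is a minimal NumPy sketch, not the package's implementation: the helper names (`rc_action`, `tied_conv`, `is_equivariant`) are hypothetical, and the channel order A, C, G, T is an assumption. It builds a two-channel convolution whose second filter is the reversed copy of the first, and verifies that Φ o π1 = π2 o Φ on a random input.

```python
import numpy as np


def rc_action(x):
    """Reverse both the channel axis and the position axis.

    Assumes channels are ordered A, C, G, T, so that reversing the
    channel axis swaps A with T and C with G.
    """
    return x[::-1, ::-1]


def tied_conv(x, w):
    """Hypothetical weight-tied convolution (illustration only).

    Output channel 0 uses the filter w, output channel 1 uses the
    filter reversed along both axes. No padding is applied.
    """
    k = w.shape[1]
    n_out = x.shape[1] - k + 1
    w_rc = w[::-1, ::-1]
    y = np.empty((2, n_out))
    for i in range(n_out):
        window = x[:, i:i + k]
        y[0, i] = np.sum(window * w)
        y[1, i] = np.sum(window * w_rc)
    return y


def is_equivariant(phi, act_in, act_out, x):
    """Check phi(act_in(x)) == act_out(phi(x)) up to float tolerance."""
    return np.allclose(phi(act_in(x)), act_out(phi(x)))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 12))   # 4 channels, 12 positions
w = rng.normal(size=(4, 3))    # one filter with an odd kernel size of 3

print(is_equivariant(lambda t: tied_conv(t, w), rc_action, rc_action, x))  # True
```

Here the output action swaps the two output channels and reverses the positions, which is the same double reversal as on the input.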
We know the input group action π0 on one-hot encoded DNA matrices: we swap channel A with T and channel C with G, and we reverse the sequence order. We also need to equip each intermediate space of our network with a group action. This is similar to choosing the number of feature maps at each layer, except that beyond choosing the dimension, we also choose how the group affects each column (the features at each point). A mathematical tool known as the induced representation then extends this action on the columns to an action on the whole matrix.
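As an illustration of the input action π0, here is a small sketch in plain NumPy. The channel order A, C, G, T and the helper names are assumptions made for this example. With that order, complementing a base amounts to reversing the channel axis, so π0 is a reversal along both axes.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order for this sketch
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse the sequence and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def one_hot(seq):
    """Encode a DNA string as a (4, length) one-hot matrix."""
    x = np.zeros((4, len(seq)))
    for pos, base in enumerate(seq):
        x[ALPHABET.index(base), pos] = 1.0
    return x


def rc_one_hot(x):
    """Input action: swap A<->T and C<->G, and reverse the positions."""
    return x[::-1, ::-1]


seq = "AACGT"
x = one_hot(seq)

print(reverse_complement(seq))  # ACGTT
# Acting on the matrix matches encoding the reverse-complemented string.
print(np.array_equal(rc_one_hot(x), one_hot(reverse_complement(seq))))  # True
# Applying the action twice returns the original matrix.
print(np.array_equal(rc_one_hot(rc_one_hot(x)), x))  # True
```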
Once equipped with these representations, our paper finds all possible equivariant linear Φ and all nonlinear pointwise Φ when the input and output representations are the same. We then show that previous methods, such as RCPS, are special cases of this general setting. We also implement equivariant k-merisation and equivariant BatchNorm.
Finally, we empirically investigate the efficiency of our networks. We show that having access to a larger functional space yields better performance, but we do not find that any specific equivariant parametric function behaves consistently better than the others. This advocates for tuning these new hyperparameters on the validation set to achieve the best results.
Practical design of equivariant networks
All the possible representations are described in the paper, but the practical ones to use are mostly of two types :
- 'Irrep', with a and b two integers : a dimensions are unaffected by the group action and b dimensions see their signs flipped.
- 'Reg' are regular layers : upon group action the column is reversed.
The a_n are the numbers of dimensions of type +1, and the b_n the numbers of dimensions of type -1. The reg_in and reg_out arguments should be understood as the number of cycles and thus correspond to half the total dimension. For instance, in the input, we have 4 nucleotides and reg=2. For technical reasons (the formalism of continuous convolution as opposed to matrix multiplication), one needs to use only odd kernel sizes to ensure strict equivariance.
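The two kinds of group action can be written down in a few lines. This is a sketch under assumptions, not the package's code: the function names are hypothetical, feature maps are assumed to have shape (channels, length), and the action on the whole matrix is taken to reverse the positions in addition to acting on each column.

```python
import numpy as np


def irrep_action(features, a, b):
    """'Irrep' action: the first a channels are kept, the last b are negated.

    The position axis is reversed as well (assumed induced action).
    """
    signs = np.concatenate([np.ones(a), -np.ones(b)])
    return signs[:, None] * features[:, ::-1]


def reg_action(features):
    """'Reg' action: each column is reversed, and so are the positions."""
    return features[::-1, ::-1]


def reg_channels(reg):
    """A 'Reg' space with `reg` cycles has 2 * reg channels."""
    return 2 * reg


print(reg_channels(2))  # 4, the one-hot input with 4 nucleotides

rng = np.random.default_rng(0)
irrep_features = rng.normal(size=(3 + 2, 7))          # a=3, b=2
reg_features = rng.normal(size=(reg_channels(4), 7))  # reg=4, 8 channels

# Both actions are involutions: applying them twice gives the input back.
print(np.allclose(irrep_action(irrep_action(irrep_features, 3, 2), 3, 2),
                  irrep_features))  # True
print(np.allclose(reg_action(reg_action(reg_features)), reg_features))  # True
```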
The layer names and parameters should be fairly explicit. For instance, IrrepToRegConv is a linear layer going from a space structured with the Irrep action to one structured with a Reg action. This layer takes as parameters a_in and b_in, which are the a and b dimensions of its input, as well as reg_out, which describes its output group action. The other layers (BatchNorm, Kmers) also follow this nomenclature. For the nonlinearities, you can directly use the ones native to your framework, following the rules of Theorem 3 of the paper. A practical approach is to use any nonlinearity in spaces with a 'Reg' structure, and odd nonlinearities such as tanh in spaces with an 'Irrep' structure. We found that networks that balance a and b perform better.
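The rule of thumb for nonlinearities can be verified on a single feature column. The sketch below uses plain NumPy with hypothetical helper names and does not call the package. A pointwise function is compatible with an action when applying it before or after the action gives the same column.

```python
import numpy as np


def relu(col):
    return np.maximum(col, 0.0)


def irrep_flip(col, a=2, b=2):
    """'Irrep' action on one column: keep a entries, negate b entries."""
    signs = np.concatenate([np.ones(a), -np.ones(b)])
    return signs * col


def reg_reverse(col):
    """'Reg' action on one column: reverse the entries."""
    return col[::-1]


def commutes(f, action, col):
    """True if applying f then the action equals the action then f."""
    return np.allclose(f(action(col)), action(f(col)))


col = np.array([0.5, -1.0, 2.0, -0.3])

print(commutes(np.tanh, irrep_flip, col))  # True: tanh is odd
print(commutes(relu, irrep_flip, col))     # False: ReLU is not odd
print(commutes(relu, reg_reverse, col))    # True: pointwise maps commute with permutations
```

Any pointwise function commutes with the 'Reg' action because that action only permutes the entries, whereas the sign flips of the 'Irrep' action require f(-x) = -f(x).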
Examples
Keras
The class used for the Binary Prediction task is implemented as an example. One can refer to this implementation and, to test it, simply run:
python keras_example.py
PyTorch
The equivalent class is also written in PyTorch, and can be run with:
python pytorch_example.py
Acknowledgements
We want to thank Hannah Zhou, Avanti Shrikumar and Anshul Kundaje for valuable discussions. We also thank Guillaume Bouvier and the CBIO members for advice.
Contact
Please feel free to reach out by sending an email to vincent.mallet96@gmail.com or opening issues.
Cite
If you want to cite this tool, please use :
@article{mallet2021reverse,
title={Reverse-Complement Equivariant Networks for DNA Sequences},
author={Mallet, Vincent and Vert, Jean-Philippe},
journal={bioRxiv},
year={2021},
publisher={Cold Spring Harbor Laboratory}
}