An end-to-end method to predict RNA secondary structure based on deep learning

sincFold

This is the repository for sincFold, a new RNA secondary structure prediction tool based on deep learning.

Abstract

sincFold is a fast and accurate RNA secondary structure prediction method. It is an end-to-end approach that predicts the contact matrix using only the nucleotide sequence as input. The model is based on a residual neural network that can learn short- and long-range interactions. Extensive experiments were conducted on several benchmark datasets, comparing sincFold against classical methods and recent deep learning models. In these experiments, sincFold achieves the best performance in comparison with state-of-the-art methods.

A summary of results can be seen in this notebook.

Folding RNA sequences

We have a web demo (mirror) running with the latest version. This server accepts one sequence at a time. We provide a model pre-trained on validated RNA datasets. To run the model locally, follow the instructions below.

Install

This is a Python package. It is recommended to use virtualenv or conda to create a new environment. To install the package, run:

pip install sincfold

Alternatively, you can clone the repository with:

git clone https://github.com/sinc-lab/sincFold
cd sincFold/

and install with:

pip install .

On Windows, you will probably need to add the Python scripts folder to the PATH.
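To check whether the command is reachable after installation, a minimal sketch using only the Python standard library is shown here. It assumes the installed command is named sincFold, as in the usage examples of this README. The helper names find_sincfold_command and scripts_folder are hypothetical, and the folder reported may differ for installs made with pip's --user option.

```python
import shutil
import sysconfig


def find_sincfold_command(name="sincFold"):
    """Return the full path of the command if it is on PATH, else None."""
    # On Windows, shutil.which also tries the extensions listed in PATHEXT.
    return shutil.which(name)


def scripts_folder():
    """Folder where this interpreter's default install scheme places scripts."""
    return sysconfig.get_path("scripts")


if __name__ == "__main__":
    path = find_sincfold_command()
    if path is None:
        print("sincFold not found on PATH; consider adding:", scripts_folder())
    else:
        print("sincFold found at", path)
```

If the command is not found, adding the printed folder to PATH and opening a new terminal is usually enough.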

Predicting sequences

To predict the secondary structure of a sequence using the pretrained weights:

sincFold pred AACCGGGUCAGGUCCGGAAGGAAGCAGCCCUAA

This will display the predicted dot-bracket in the console.
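The dot-bracket string can be converted into base pairs or a contact matrix for further analysis. The following is a minimal sketch, not part of the sincFold package: the function names are hypothetical, and only round brackets are handled (any other symbol is treated as unpaired, so extended notations for pseudoknots are not covered).

```python
def dot_bracket_to_pairs(structure):
    """Return a sorted list of 0-based (i, j) base pairs from a dot-bracket string."""
    stack = []
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
        # any other symbol (e.g. '.') is treated as unpaired in this sketch
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def pairs_to_contact_matrix(pairs, length):
    """Build a symmetric 0/1 contact matrix as a list of lists."""
    matrix = [[0] * length for _ in range(length)]
    for i, j in pairs:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


if __name__ == "__main__":
    example = "((..))."  # toy structure, not a sincFold prediction
    pairs = dot_bracket_to_pairs(example)
    print(pairs)  # [(0, 5), (1, 4)]
    for row in pairs_to_contact_matrix(pairs, len(example)):
        print(row)
```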

sincFold also accepts files with multiple sequences in .csv and .fasta format as input, and can write its output in .csv or .ct format.

echo -e ">seq1\\nAACCGGGUCAGGUCCGGAAGGAAGCAGCCCUAA" > sample.fasta
echo -e ">seq2\\nGUAGUCGUGGCCGAGUGGUUAAGGCGAUGGACUAGAAAUCCAUUGGGGUCUCCCCGCGCAGGUUCGAAUCCUGCCGACUACGCCA" >> sample.fasta

sincFold pred sample.fasta -o pred_ct_files/
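On systems where echo -e is not available, the same input file can be written from Python. This is a minimal sketch using only the standard library; write_fasta and read_fasta are hypothetical helpers, not sincFold functions, and the sincFold command itself is not called here.

```python
from pathlib import Path


def write_fasta(records, path):
    """Write (name, sequence) pairs to a FASTA file, one sequence per line."""
    lines = []
    for name, sequence in records:
        lines.append(f">{name}")
        lines.append(sequence)
    Path(path).write_text("\n".join(lines) + "\n")


def read_fasta(path):
    """Read a FASTA file back into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    records = [
        ("seq1", "AACCGGGUCAGGUCCGGAAGGAAGCAGCCCUAA"),
        ("seq2", "GUAGUCGUGGCCGAGUGGUUAAGGCGAUGGACUAGAAAUCCAUUGGGGUCUCCCCGCGCAGGUUCGAAUCCUGCCGACUACGCCA"),
    ]
    write_fasta(records, "sample.fasta")
    print(read_fasta("sample.fasta"))
```

The resulting sample.fasta can then be passed to sincFold pred as shown above.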

We also provide this notebook to run the sincFold functions.

Training and testing models

A new model can be trained using the train option. For example, download this training set:

wget "https://raw.githubusercontent.com/sinc-lab/sincFold/main/sample/train.csv"

and then run sincFold with:

sincFold -d cuda train train.csv -n 10 -o output_path

The option "-d cuda" requires a GPU (otherwise remove it), and -n limits the maximum number of epochs to get a quick result. The output log and trained model will be saved in the directory output_path.

Then, a different test set can be evaluated with the test option. You can download this sample file from:

wget "https://raw.githubusercontent.com/sinc-lab/sincFold/main/sample/test.csv"

and test the model with:

sincFold test test.csv -w output_path/weights.pmt

The model path (-w) is optional; if omitted, the pretrained weights are used.
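This README does not state which metrics the test option reports. As an illustration only, the sketch below assumes a base-pair F1 score, a common choice for comparing a predicted structure against a reference. The function name and the convention for two empty structures are assumptions, not sincFold behavior.

```python
def base_pair_f1(reference_pairs, predicted_pairs):
    """F1 score between two collections of (i, j) base pairs."""
    reference = set(reference_pairs)
    predicted = set(predicted_pairs)
    if not reference and not predicted:
        # Convention chosen for this sketch: two empty structures agree perfectly.
        return 1.0
    true_positives = len(reference & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = [(0, 5), (1, 4)]   # toy reference pairs
    predicted = [(0, 5)]           # toy prediction recovering one of two pairs
    print(round(base_pair_f1(reference, predicted), 3))  # 0.667
```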

Reproducible research

You can prepare the train, validation, and test partitions using the following code (in this case, for ArchiveII and data partition fold 0). The "data/" folder can be found in this repository.

import os
import pandas as pd

out_path = "working_path/"
os.makedirs(out_path, exist_ok=True)  # do not fail if the folder already exists

# read dataset and predefined partitions (the files are available in this repository)
dataset = pd.read_csv("data/ArchiveII.csv", index_col="id")
partitions = pd.read_csv("data/ArchiveII_splits.csv")

# select fold 0 and write one CSV file per partition
fold = partitions[partitions.fold_number == 0]
for name in ("train", "valid", "test"):
    ids = fold[fold.partition == name].id
    dataset.loc[ids].to_csv(out_path + f"{name}.csv")

Then call the training and testing commands:

sincFold -d cuda train working_path/train.csv --valid-file working_path/valid.csv -o working_path/output/

sincFold -d cuda test working_path/test.csv -w working_path/output/weights.pmt

Using a GPU for training is recommended (with the option '-d cuda'). The complete process may take several hours using a GPU.
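Before launching a run that may take hours, it can be worth confirming that the partition files written earlier do not share any sequence. The sketch below uses only the standard library and assumes each CSV has an "id" column, as produced by the partition code in this section. The helper names are hypothetical.

```python
import csv


def read_ids(csv_path, id_column="id"):
    """Return the set of values found in the id column of a CSV file."""
    with open(csv_path, newline="") as handle:
        return {row[id_column] for row in csv.DictReader(handle)}


def find_overlaps(train_path, valid_path, test_path):
    """Return a dict of non-empty id overlaps between the three partitions."""
    train = read_ids(train_path)
    valid = read_ids(valid_path)
    test = read_ids(test_path)
    overlaps = {
        "train/valid": train & valid,
        "train/test": train & test,
        "valid/test": valid & test,
    }
    return {name: ids for name, ids in overlaps.items() if ids}


if __name__ == "__main__":
    found = find_overlaps(
        "working_path/train.csv",
        "working_path/valid.csv",
        "working_path/test.csv",
    )
    print("overlapping ids:", found if found else "none")
```

An empty result means the three files are disjoint by id.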

@article{sincFold2023,
  title={sincFold: end-to-end learning of short- and long-range interactions for RNA folding},
  author={Leandro A. Bugnon and Leandro Di Persia and Matias Gerard and Jonathan Raad and 
  Santiago Prochetto and Emilio Fenoy and Uciel Chorostecki and Federico Ariel and 
  Georgina Stegmayer and Diego H. Milone},
  journal={under review, bioRxiv},
  url={https://www.biorxiv.org/content/10.1101/2023.10.10.561771v2},
  year={2023}
}

Download files

Source Distribution

sincfold-0.16.2.tar.gz (2.2 MB)

Uploaded Source

Built Distribution

sincfold-0.16.2-py3-none-any.whl (1.8 MB)

Uploaded Python 3

File details

Details for the file sincfold-0.16.2.tar.gz.

File metadata

  • Download URL: sincfold-0.16.2.tar.gz
  • Size: 2.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.8

File hashes

Hashes for sincfold-0.16.2.tar.gz
Algorithm Hash digest
SHA256 b86c9e8a3af3bfebf6e0f65d93094601e6e32f736671eff5fd8aa12e130e4391
MD5 5de145013af3efd2fc50089cac3d4675
BLAKE2b-256 e432ce6591bf069a25887d7446ad7d06da5818c5c8e59b56cbf20a64dfa68857

See more details on using hashes here.

File details

Details for the file sincfold-0.16.2-py3-none-any.whl.

File metadata

  • Download URL: sincfold-0.16.2-py3-none-any.whl
  • Size: 1.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.8

File hashes

Hashes for sincfold-0.16.2-py3-none-any.whl
Algorithm Hash digest
SHA256 41a67b422aed1b1ac0a1c65fe86dfd9fe8300f686b237a8f36075db0af7314a4
MD5 e972bc73360a3c3f3035d09ecb188a6c
BLAKE2b-256 720f52e9d6af9b57c77092ffbc0ca8c68bb998aeb7728f7f1f43cf6c5a43c040
