Skip to main content

A small example package

Project description

Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings

Paper Paper

Visual Description

visual overview

Installation

pip install GSAE

NOTE: You will also need to install PyTorch Geometric. Instructions for doing so can be found here

Training the models

To train the GSAE:

python train_gsae.py

To train the GAE or GVAE models:

python train_gnn.py

The scattering inversion network if found ins

gsae/models/inv_scattering.py

Data Processing Workflow

1. Loading data from RNAfold


From RNAfold, we get a file like the following

rnafold_output.txt

which inside looks like

GGCGUUUUCGCCUUUUGGCGAUUUUUAUCGCC -14.20  10.00
(.((...(((((....))))).......)).)  -5.50
(.(((..(((((....))))).....).)).)  -4.20
((.....(((((....))))).........))  -5.90
((.((..(((((....))))).....))..))  -5.60
((.(...(((((....))))).......).))  -6.40
((.(.(.(((((....))))).....).).))  -4.20
((.((..(((((....))))).....).).))  -5.10
(((....(((((....))))).(...)..)))  -4.30
(((....(((((....)))))(....)..)))  -5.30

RNAfold output files used in the paper are included in

data/raw_data/

We can use rnafold2arrays.py in gsae/data_processing to convert this text file to

  • a csv file containing adjacency matrices for each fold (adjmats_<datestamp>.csv)
  • a csv file containing the energy scalar for each structure (energies_<datestamp>.csv)
  • a text file with the rna sequence (sequence_<datestamp>.txt)

rnafold2arrays.py usage:

usage: rnafold2arrays.py [-h] --data DATA --outname OUTNAME

optional arguments:
-h, --help         show this help message and exit
--data DATA        RNAfold txt file output to be converted
--outname OUTNAME  base name for the outputs

sample usage:

> python rnafold2arrays.py --data seq4_rnafold_out.txt --outname seq4

Which will produce the following files

seq4_adjmat_2020-03-04-03.csv
seq4_energies_2020_03-04-03.csv
seq4_sequence_2020-03-04-03.txt

If you would like to skip this step, you can also download the processed files from this box link. The file is named processed_data.tar.gz

2. Converting adjacency data to scattering coefficients


Once we have the adjacency matrices of the structures we're interested in, we can convert them using scattering transforms to a new, more informative representation

Here we will use diracs centered at each node (i.e. the identity matrix) as our graph signals.

To convert them, we will use adj2scatcoeffs.py

usage: adj2scatcoeffs.py [-h] --data DATA --outname OUTNAME [--pcs PCS]

optional arguments:
-h, --help         show this help message and exit
--data DATA        file (npy or csv) with adjacency matrices
--outname OUTNAME  base name for output
--pcs PCS          how many principle components to use (if 0, then use raw scattering coefficients)

sample usage:

> python adj2scatcoeffs.py --data seq4_adjmat_2020-03-04-03.csv --outname seq_4

If you would like to skip this step, you can also download the processed files from this box link. The file is named scattering_coeffs.tar.gz

3. Create Splits


Now that we've generated all the data our model will use, we can now create the train/test splits

To convert them, we will use create_splits.py

usage: create_splits.py [-h] --adjs ADJS --coeffs COEFFS --energies ENERGIES --outname OUTNAME

optional arguments:
-h, --help           show this help message and exit
--adjs ADJS          file with adjacency matrices
--coeffs COEFFS      file with scattering coeffs
--energies ENERGIES  file with energy values
--outname OUTNAME    base name for output

The output set of files can be then stored in a directory which we will later refer to as ROOT_DIR for the reason mentioned below

If you would like to skip this step, you can also download the processed files from this box link. The file is named final_splits.tar.gz

IMPORTANT: Data loading for models

In order to ensure that the training scripts in the model files function correctly, the ROOT_DIR variable at the top of load_splits.py to where the train/test split is located

Data

Data for the 4 sequences used in the paper are located in data/

└── raw_data
    ├── hiv_tar
    │   ├── hiv_tar_sequence.txt
    │   ├── hivtar_100k_subp_n_052020.txt
    ├── hob_seq3
    │   ├── seq3_100k_subp_n_052020.txt
    │   └── seq3_sequence.txt
    ├── hob_seq4
    │   ├── seq4_100k_subp_n_052020.txt
    │   └── seq4_sequence.txt
    └── tebown
        ├── teb_100k_subp_n_052020.txt
        └── tebown_sequence.txt

Citation

@inproceedings{castro2020uncovering,
  title={Uncovering the Folding Landscape of RNA Secondary Structure Using Deep Graph Embeddings},
  author={Castro, Egbert and Benz, Andrew and Tong, Alexander and Wolf, Guy and Krishnaswamy, Smita},
  booktitle={2020 IEEE International Conference on Big Data (Big Data)},
  pages={4519--4528},
  year={2020},
  organization={IEEE}
}

Project details


Release history Release notifications | RSS feed

This version

0.2

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

GSAE-0.2.tar.gz (16.8 kB view hashes)

Uploaded Source

Built Distribution

GSAE-0.2-py3-none-any.whl (21.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page