Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings

Paper

Visual Description

[Figure: visual overview]

Installation

pip install GSAE

NOTE: You will also need to install PyTorch Geometric. See the PyTorch Geometric installation instructions for details.

Training the models

To train the GSAE:

python train_gsae.py

To train the GAE or GVAE models:

python train_gnn.py

The scattering inversion network is found in

gsae/models/inv_scattering.py

Data Processing Workflow

1. Loading data from RNAfold


From RNAfold, we get a file like the following

rnafold_output.txt

which inside looks like

GGCGUUUUCGCCUUUUGGCGAUUUUUAUCGCC -14.20  10.00
(.((...(((((....))))).......)).)  -5.50
(.(((..(((((....))))).....).)).)  -4.20
((.....(((((....))))).........))  -5.90
((.((..(((((....))))).....))..))  -5.60
((.(...(((((....))))).......).))  -6.40
((.(.(.(((((....))))).....).).))  -4.20
((.((..(((((....))))).....).).))  -5.10
(((....(((((....))))).(...)..)))  -4.30
(((....(((((....)))))(....)..)))  -5.30
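
As an illustration of the layout above, here is a minimal parsing sketch. It assumes the first non-empty line holds the sequence followed by numeric fields (ignored here) and that every later line holds a dot-bracket structure followed by its energy. The function name parse_rnafold_output is hypothetical and is not part of the GSAE package.

```python
from typing import List, Tuple


def parse_rnafold_output(text: str) -> Tuple[str, List[str], List[float]]:
    """Split RNAfold-style text into the sequence, structures, and energies."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # First line: the sequence, followed by numeric fields this sketch ignores.
    sequence = lines[0].split()[0]
    structures, energies = [], []
    for ln in lines[1:]:
        fields = ln.split()
        structures.append(fields[0])        # dot-bracket string
        energies.append(float(fields[1]))   # energy of that structure
    return sequence, structures, energies


sample = """GGCGUUUUCGCCUUUUGGCGAUUUUUAUCGCC -14.20  10.00
(.((...(((((....))))).......)).)  -5.50
(.(((..(((((....))))).....).)).)  -4.20
"""
sequence, structures, energies = parse_rnafold_output(sample)
print(sequence, len(structures), energies)
```

Each structure string has the same length as the sequence, so position i in a structure refers to nucleotide i.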

RNAfold output files used in the paper are included in

data/raw_data/

We can use rnafold2arrays.py in gsae/data_processing to convert this text file to

  • a csv file containing the adjacency matrix of each fold (<outname>_adjmat_<datestamp>.csv)
  • a csv file containing the energy scalar of each structure (<outname>_energies_<datestamp>.csv)
  • a text file with the RNA sequence (<outname>_sequence_<datestamp>.txt)
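
To make the adjacency-matrix representation concrete, the sketch below converts one dot-bracket string into a symmetric matrix. It is not the code in rnafold2arrays.py. Connecting consecutive nucleotides along the backbone is an assumption of this sketch (controlled by the hypothetical backbone flag), and the function name is illustrative only.

```python
import numpy as np


def dotbracket_to_adjacency(structure: str, backbone: bool = True) -> np.ndarray:
    """Build a symmetric 0/1 adjacency matrix from a dot-bracket string."""
    n = len(structure)
    adj = np.zeros((n, n), dtype=np.int8)
    if backbone:
        # Assumption: link each nucleotide to its sequence neighbour.
        idx = np.arange(n - 1)
        adj[idx, idx + 1] = 1
        adj[idx + 1, idx] = 1
    stack = []  # positions of '(' still waiting for a partner
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            adj[i, j] = adj[j, i] = 1  # base-pair edge
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return adj


adj = dotbracket_to_adjacency("((.....(((((....))))).........))")
print(adj.shape)
```

A stack is enough here because the structures shown contain only one bracket type, so each closing bracket pairs with the most recent unmatched opening bracket.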

rnafold2arrays.py usage:

usage: rnafold2arrays.py [-h] --data DATA --outname OUTNAME

optional arguments:
-h, --help         show this help message and exit
--data DATA        RNAfold txt file output to be converted
--outname OUTNAME  base name for the outputs

sample usage:

> python rnafold2arrays.py --data seq4_rnafold_out.txt --outname seq4

This produces the following files:

seq4_adjmat_2020-03-04-03.csv
seq4_energies_2020-03-04-03.csv
seq4_sequence_2020-03-04-03.txt

If you would like to skip this step, you can also download the processed files from this box link. The file is named processed_data.tar.gz

2. Converting adjacency data to scattering coefficients


Once we have the adjacency matrices of the structures we're interested in, we can use scattering transforms to convert them to a new, more informative representation.

Here we use Diracs centered at each node (i.e. the identity matrix) as our graph signals.
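
The sketch below shows one common formulation of a graph scattering transform applied to Dirac signals. It is not the code in adj2scatcoeffs.py, and the details are assumptions: a lazy random walk operator, dyadic diffusion wavelets, three scales, and pooling by first and second moments. All function names are hypothetical.

```python
import numpy as np


def lazy_random_walk(adj):
    """P = 0.5 * (I + A D^-1), with columns summing to one."""
    deg = adj.sum(axis=0).astype(float)
    deg[deg == 0] = 1.0  # guard against isolated nodes
    return 0.5 * (np.eye(adj.shape[0]) + adj / deg)


def diffusion_wavelets(adj, n_scales=3):
    """Psi_0 = I - P and Psi_j = P^(2^(j-1)) - P^(2^j) for j >= 1."""
    p = lazy_random_walk(adj)
    wavelets = [np.eye(adj.shape[0]) - p]
    p_pow = p
    for _ in range(1, n_scales):
        p_next = p_pow @ p_pow
        wavelets.append(p_pow - p_next)
        p_pow = p_next
    return wavelets


def scattering_coefficients(adj, n_scales=3, moments=(1, 2)):
    """Zeroth-, first-, and second-order scattering moments of Dirac signals."""
    signals = np.eye(adj.shape[0])  # column k is the Dirac at node k
    wavelets = diffusion_wavelets(adj, n_scales)

    def pooled(x):
        # Sum the q-th power of each signal over the nodes.
        return np.concatenate([(np.abs(x) ** q).sum(axis=0) for q in moments])

    feats = [pooled(signals)]
    first = [np.abs(w @ signals) for w in wavelets]
    feats += [pooled(u) for u in first]
    for j, u in enumerate(first):
        for w in wavelets[j + 1:]:  # second order: strictly coarser scales
            feats.append(pooled(np.abs(w @ u)))
    return np.concatenate(feats)


# Path graph on four nodes as a toy input.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
coeffs = scattering_coefficients(adj)
print(coeffs.shape)
```

Using the identity matrix as the signal means every node contributes its own set of coefficients, so the output length grows with the number of nodes, scales, and moments.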

To convert them, we will use adj2scatcoeffs.py

usage: adj2scatcoeffs.py [-h] --data DATA --outname OUTNAME [--pcs PCS]

optional arguments:
-h, --help         show this help message and exit
--data DATA        file (npy or csv) with adjacency matrices
--outname OUTNAME  base name for output
--pcs PCS          how many principal components to use (if 0, then use raw scattering coefficients)

sample usage:

> python adj2scatcoeffs.py --data seq4_adjmat_2020-03-04-03.csv --outname seq_4
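
The effect of the --pcs option can be pictured with the sketch below, which projects a coefficient matrix onto its leading principal components and returns the input unchanged when the count is 0. This is an illustration based on a plain SVD, not necessarily how adj2scatcoeffs.py performs the reduction. The function name is hypothetical and the input is random placeholder data.

```python
import numpy as np


def reduce_with_pca(coeffs, n_components):
    """Project rows of coeffs onto the top principal components."""
    if n_components == 0:
        return coeffs  # keep the raw scattering coefficients
    centered = coeffs - coeffs.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
coeffs = rng.normal(size=(100, 20))  # placeholder: 100 structures, 20 coefficients
reduced = reduce_with_pca(coeffs, 5)
print(reduced.shape)
```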

If you would like to skip this step, you can also download the processed files from this box link. The file is named scattering_coeffs.tar.gz

3. Create Splits


Now that we've generated all the data our model will use, we can create the train/test splits.

To create them, we will use create_splits.py

usage: create_splits.py [-h] --adjs ADJS --coeffs COEFFS --energies ENERGIES --outname OUTNAME

optional arguments:
-h, --help           show this help message and exit
--adjs ADJS          file with adjacency matrices
--coeffs COEFFS      file with scattering coeffs
--energies ENERGIES  file with energy values
--outname OUTNAME    base name for output

The resulting files can then be stored in a single directory, referred to below as ROOT_DIR.
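
The key point of a split over three aligned files is that the same indices must be applied to the adjacency matrices, the scattering coefficients, and the energies. The sketch below shows this with placeholder arrays. The 80/20 ratio, the seed, and the function name are assumptions of this sketch, not values taken from create_splits.py.

```python
import numpy as np


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return shuffled (train_idx, test_idx) covering range(n_samples)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return perm[n_test:], perm[:n_test]


# Placeholder arrays standing in for the three processed files.
n = 10
adjs = np.zeros((n, 4, 4))
coeffs = np.arange(n * 3, dtype=float).reshape(n, 3)
energies = np.arange(n, dtype=float)

train_idx, test_idx = split_indices(n)
# Index all three arrays with the same indices so rows stay aligned.
train = (adjs[train_idx], coeffs[train_idx], energies[train_idx])
test = (adjs[test_idx], coeffs[test_idx], energies[test_idx])
print(len(train_idx), len(test_idx))
```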

If you would like to skip this step, you can also download the processed files from this box link. The file is named final_splits.tar.gz

IMPORTANT: Data loading for models

To ensure that the training scripts function correctly, set the ROOT_DIR variable at the top of load_splits.py to the directory where the train/test splits are located.
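
A quick sanity check before training can catch a wrong path early. The sketch below is a standalone helper, not part of load_splits.py. The path shown is a hypothetical placeholder, and whether load_splits.py expects a string or a Path object is not stated here, so adapt the type as needed.

```python
from pathlib import Path

# Hypothetical location; replace with the directory holding your split files.
ROOT_DIR = Path("data/final_splits")


def check_root_dir(root):
    """Return the sorted file names under root, or raise if it is unusable."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"ROOT_DIR does not exist: {root}")
    names = sorted(p.name for p in root.iterdir() if p.is_file())
    if not names:
        raise FileNotFoundError(f"ROOT_DIR contains no files: {root}")
    return names
```

Calling check_root_dir(ROOT_DIR) before launching a training script lists the files that will be visible to the data loader.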

Data

Data for the 4 sequences used in the paper are located in data/

└── raw_data
    ├── hiv_tar
    │   ├── hiv_tar_sequence.txt
    │   └── hivtar_100k_subp_n_052020.txt
    ├── hob_seq3
    │   ├── seq3_100k_subp_n_052020.txt
    │   └── seq3_sequence.txt
    ├── hob_seq4
    │   ├── seq4_100k_subp_n_052020.txt
    │   └── seq4_sequence.txt
    └── tebown
        ├── teb_100k_subp_n_052020.txt
        └── tebown_sequence.txt

Citation

@inproceedings{castro2020uncovering,
  title={Uncovering the Folding Landscape of RNA Secondary Structure Using Deep Graph Embeddings},
  author={Castro, Egbert and Benz, Andrew and Tong, Alexander and Wolf, Guy and Krishnaswamy, Smita},
  booktitle={2020 IEEE International Conference on Big Data (Big Data)},
  pages={4519--4528},
  year={2020},
  organization={IEEE}
}
