A small example package
Visual Description
Installation
pip install GSAE
NOTE: You will also need to install PyTorch Geometric. Instructions for doing so can be found in the PyTorch Geometric installation documentation.
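Because PyTorch Geometric is installed separately, it can help to confirm that both dependencies are importable before training. The sketch below is only an illustration: the helper name missing_dependencies is hypothetical and not part of GSAE, and it assumes the usual import names torch and torch_geometric.

```python
import importlib.util
from typing import Iterable, List


def missing_dependencies(modules: Iterable[str] = ("torch", "torch_geometric")) -> List[str]:
    """Return the top-level module names that cannot be found.

    Hypothetical helper for illustration; it only checks that each
    module is discoverable, not that its version is compatible.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_dependencies()
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All dependencies found")
```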
Training the models
To train the GSAE:
python train_gsae.py
To train the GAE or GVAE models:
python train_gnn.py
The scattering inversion network is found in
gsae/models/inv_scattering.py
Data Processing Workflow
1. Loading data from RNAfold
From RNAfold, we get a file like the following
rnafold_output.txt
which inside looks like
GGCGUUUUCGCCUUUUGGCGAUUUUUAUCGCC -14.20 10.00
(.((...(((((....))))).......)).) -5.50
(.(((..(((((....))))).....).)).) -4.20
((.....(((((....))))).........)) -5.90
((.((..(((((....))))).....))..)) -5.60
((.(...(((((....))))).......).)) -6.40
((.(.(.(((((....))))).....).).)) -4.20
((.((..(((((....))))).....).).)) -5.10
(((....(((((....))))).(...)..))) -4.30
(((....(((((....)))))(....)..))) -5.30
RNAfold output files used in the paper are included in
data/raw_data/
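To make the file layout concrete, here is a minimal sketch of reading it in Python. It is not the code in rnafold2arrays.py; the function name parse_rnafold_output is hypothetical, and it assumes the layout shown above: a header line that starts with the sequence, followed by one dot-bracket structure and one energy per line.

```python
from typing import List, Tuple


def parse_rnafold_output(text: str) -> Tuple[str, List[str], List[float]]:
    """Split RNAfold-style text into (sequence, structures, energies).

    Assumption: the first non-empty line starts with the sequence and
    every later line is "<dot-bracket> <energy>".
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty RNAfold output")
    sequence = lines[0][0]  # any extra header fields are ignored here
    structures, energies = [], []
    for fields in lines[1:]:
        structure, energy = fields[0], float(fields[1])
        if len(structure) != len(sequence):
            raise ValueError("structure length does not match sequence length")
        structures.append(structure)
        energies.append(energy)
    return sequence, structures, energies


example = """GGCGUUUUCGCCUUUUGGCGAUUUUUAUCGCC -14.20 10.00
(.((...(((((....))))).......)).) -5.50
(.(((..(((((....))))).....).)).) -4.20
"""
sequence, structures, energies = parse_rnafold_output(example)
print(len(sequence), len(structures), energies)
```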
We can use rnafold2arrays.py in gsae/data_processing to convert this text file to:
- a csv file containing adjacency matrices for each fold (adjmats_<datestamp>.csv)
- a csv file containing the energy scalar for each structure (energies_<datestamp>.csv)
- a text file with the rna sequence (sequence_<datestamp>.txt)
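As an illustration of what an adjacency matrix for a fold can look like, the sketch below converts one dot-bracket string into a symmetric 0/1 matrix. It is an assumption-laden example, not the conversion in rnafold2arrays.py: the function name is hypothetical, and whether the actual script includes backbone edges between neighbouring nucleotides is not stated here, so that choice is exposed as a flag.

```python
import numpy as np


def dot_bracket_to_adjacency(structure: str, backbone: bool = True) -> np.ndarray:
    """Build a symmetric adjacency matrix from a dot-bracket string.

    Base pairs come from matching parentheses. With backbone=True
    (an assumption), consecutive positions are also connected.
    """
    n = len(structure)
    adj = np.zeros((n, n), dtype=np.int8)
    if backbone and n > 1:
        idx = np.arange(n - 1)
        adj[idx, idx + 1] = 1
        adj[idx + 1, idx] = 1
    stack = []  # positions of currently unmatched "("
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            j = stack.pop()
            adj[i, j] = adj[j, i] = 1
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return adj


adj = dot_bracket_to_adjacency("((..))")
print(adj)
```

Each row of such a matrix could then be flattened and written to a csv file, one structure per row, though the exact on-disk layout used by the package is not specified here.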
rnafold2arrays.py usage:
usage: rnafold2arrays.py [-h] --data DATA --outname OUTNAME
optional arguments:
  -h, --help         show this help message and exit
  --data DATA        RNAfold txt file output to be converted
  --outname OUTNAME  base name for the outputs
sample usage:
> python rnafold2arrays.py --data seq4_rnafold_out.txt --outname seq4
This will produce the following files:
seq4_adjmat_2020-03-04-03.csv
seq4_energies_2020-03-04-03.csv
seq4_sequence_2020-03-04-03.txt
If you would like to skip this step, you can also download the processed files from this box link. The file is named processed_data.tar.gz
2. Converting adjacency data to scattering coefficients
Once we have the adjacency matrices of the structures we're interested in, we can use scattering transforms to convert them to a new, more informative representation.
Here we use Diracs centered at each node (i.e. the identity matrix) as our graph signals.
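For intuition, here is a minimal NumPy sketch of one common form of graph scattering applied to Dirac signals. It is not the implementation in adj2scatcoeffs.py. The lazy random walk operator, the dyadic diffusion wavelets, the number of scales, the two scattering orders, and the unnormalized moments are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def lazy_random_walk(adj: np.ndarray) -> np.ndarray:
    """Return P = (I + A D^-1) / 2 for a symmetric adjacency matrix A."""
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=0)
    degree[degree == 0] = 1.0  # isolated nodes: avoid division by zero
    # adj / degree scales column j by 1 / degree[j], i.e. A D^-1
    return 0.5 * (np.eye(len(adj)) + adj / degree)


def diffusion_wavelets(adj: np.ndarray, n_scales: int = 4) -> list:
    """Psi_0 = I - P and Psi_j = P^(2^(j-1)) - P^(2^j) for j >= 1."""
    p = lazy_random_walk(adj)
    powers = [np.eye(len(p)), p]  # I, P, then P^2, P^4, ...
    for _ in range(n_scales - 1):
        powers.append(powers[-1] @ powers[-1])
    return [powers[j] - powers[j + 1] for j in range(n_scales)]


def scattering_coefficients(adj, n_scales: int = 4, moments=(1, 2, 3, 4)) -> np.ndarray:
    """Zeroth-, first-, and second-order scattering moments of Dirac signals."""
    wavelets = diffusion_wavelets(adj, n_scales)
    signals = np.eye(len(adj))  # one Dirac per node, stored as columns
    first = [np.abs(w @ signals) for w in wavelets]
    second = [
        np.abs(wavelets[j2] @ first[j1])
        for j1 in range(n_scales)
        for j2 in range(j1 + 1, n_scales)
    ]
    features = []
    for layer in [signals] + first + second:
        for q in moments:
            features.append((layer ** q).sum(axis=0))  # q-th moment over nodes
    return np.concatenate(features)


# Small example graph: a 6-node chain closed by one extra edge.
n = 6
example_adj = np.zeros((n, n))
for i in range(n - 1):
    example_adj[i, i + 1] = example_adj[i + 1, i] = 1.0
example_adj[0, n - 1] = example_adj[n - 1, 0] = 1.0
coeffs = scattering_coefficients(example_adj)
print(coeffs.shape)
```

The result is one fixed-length vector per structure, so structures of the same sequence can be compared directly.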
To convert them, we will use adj2scatcoeffs.py
usage: adj2scatcoeffs.py [-h] --data DATA --outname OUTNAME [--pcs PCS]
optional arguments:
  -h, --help         show this help message and exit
  --data DATA        file (npy or csv) with adjacency matrices
  --outname OUTNAME  base name for output
  --pcs PCS          how many principal components to use (if 0, then use raw scattering coefficients)
sample usage:
> python adj2scatcoeffs.py --data seq4_adjmat_2020-03-04-03.csv --outname seq_4
If you would like to skip this step, you can also download the processed files from this box link. The file is named scattering_coeffs.tar.gz
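The --pcs option described above keeps a given number of principal components, or the raw coefficients when set to 0. A minimal sketch of that behaviour with plain NumPy follows. The function name reduce_with_pca is hypothetical, and whether the actual script standardizes features or uses a different PCA routine is not stated here, so treat this as one possible reading.

```python
import numpy as np


def reduce_with_pca(coeffs: np.ndarray, n_components: int) -> np.ndarray:
    """Project rows of coeffs onto their top principal components.

    n_components == 0 returns the raw coefficients unchanged, mirroring
    the documented meaning of --pcs 0. Features are centered but not
    rescaled (an assumption).
    """
    coeffs = np.asarray(coeffs, dtype=float)
    if n_components < 0:
        raise ValueError("n_components must be non-negative")
    if n_components == 0:
        return coeffs
    centered = coeffs - coeffs.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
fake_coeffs = rng.normal(size=(20, 8))  # stand-in for real scattering coefficients
reduced = reduce_with_pca(fake_coeffs, 3)
print(reduced.shape)
```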
3. Create Splits
Now that we've generated all the data our model will use, we can create the train/test splits.
To create them, we will use create_splits.py
usage: create_splits.py [-h] --adjs ADJS --coeffs COEFFS --energies ENERGIES --outname OUTNAME
optional arguments:
  -h, --help           show this help message and exit
  --adjs ADJS          file with adjacency matrices
  --coeffs COEFFS      file with scattering coeffs
  --energies ENERGIES  file with energy values
  --outname OUTNAME    base name for output
The output files can then be stored in a directory, which we will refer to below as ROOT_DIR.
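The key requirement when splitting is that adjacency matrices, scattering coefficients, and energies stay aligned row by row. The sketch below shows one way to do that with a single shared permutation. It is not the logic of create_splits.py: the function name, the 80/20 ratio, and the fixed seed are assumptions for illustration.

```python
import numpy as np


def split_aligned(adjs, coeffs, energies, test_fraction: float = 0.2, seed: int = 0) -> dict:
    """Shuffle three aligned arrays with one permutation and split them.

    test_fraction and seed are illustrative defaults, not values taken
    from the package.
    """
    adjs, coeffs, energies = (np.asarray(a) for a in (adjs, coeffs, energies))
    n = len(energies)
    if not (len(adjs) == len(coeffs) == n):
        raise ValueError("inputs must have the same number of rows")
    order = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return {
        "train": (adjs[train_idx], coeffs[train_idx], energies[train_idx]),
        "test": (adjs[test_idx], coeffs[test_idx], energies[test_idx]),
    }


# Toy data in which every row encodes its own index, so alignment is easy to check.
toy_energies = np.arange(10, dtype=float)
toy_coeffs = toy_energies[:, None] * 2.0
toy_adjs = toy_energies[:, None] * 3.0
splits = split_aligned(toy_adjs, toy_coeffs, toy_energies)
print(len(splits["train"][2]), len(splits["test"][2]))
```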
If you would like to skip this step, you can also download the processed files from this box link. The file is named final_splits.tar.gz
IMPORTANT: Data loading for models
To ensure that the training scripts in the model files function correctly, set the ROOT_DIR variable at the top of load_splits.py to the directory where the train/test splits are located.
Data
Data for the 4 sequences used in the paper are located in data/
└── raw_data
    ├── hiv_tar
    │   ├── hiv_tar_sequence.txt
    │   └── hivtar_100k_subp_n_052020.txt
    ├── hob_seq3
    │   ├── seq3_100k_subp_n_052020.txt
    │   └── seq3_sequence.txt
    ├── hob_seq4
    │   ├── seq4_100k_subp_n_052020.txt
    │   └── seq4_sequence.txt
    └── tebown
        ├── teb_100k_subp_n_052020.txt
        └── tebown_sequence.txt
Citation
@inproceedings{castro2020uncovering,
title={Uncovering the Folding Landscape of RNA Secondary Structure Using Deep Graph Embeddings},
author={Castro, Egbert and Benz, Andrew and Tong, Alexander and Wolf, Guy and Krishnaswamy, Smita},
booktitle={2020 IEEE International Conference on Big Data (Big Data)},
pages={4519--4528},
year={2020},
organization={IEEE}
}