A library to manipulate data for our DMS prediction models.
Project description
Download your RNA data from HuggingFace with rouskinhf!
A wrapper around HuggingFace to load data for eFold. You can:
- pull datasets from Rouskinlab's HuggingFace
- create datasets from local files
Installation
To download data
pip install rouskinhf
To push data to HuggingFace (optional)
- get an access token from the Rouskinlab HuggingFace page
- add this token to your environment
export HUGGINGFACE_TOKEN="hf_yourtokenhere"
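Before pushing, it can help to confirm that the token is visible to Python. The helper below is a minimal sketch and not part of rouskinhf: the function name get_hf_token is hypothetical, and it only assumes the HUGGINGFACE_TOKEN variable name used in the export line above.

```python
import os


def get_hf_token(env_var="HUGGINGFACE_TOKEN"):
    """Return the token stored in the environment, or raise if it is missing.

    Hypothetical helper for illustration; rouskinhf may read the token differently.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; export it before pushing data to HuggingFace"
        )
    return token
```

Calling get_hf_token() in a fresh Python session raises a RuntimeError when the export was done in a different shell or not at all.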
To predict structures from rouskinhf (optional)
You'll need to install D. Mathews' RNAstructure Fold (also available on the Rouskinlab GitHub).
Check your RNAstructure Fold installation in a terminal:
Fold --version
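The same check can be scripted, for example at the start of a pipeline. This is a sketch using only the Python standard library: the function name fold_version is hypothetical, and it assumes the executable is called Fold and accepts --version, as in the command above.

```python
import shutil
import subprocess


def fold_version(executable="Fold"):
    """Return the output of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr, so fall back to it.
    return (result.stdout or result.stderr).strip()
```

A return value of None means Fold is not on the PATH, so structure prediction from rouskinhf will not be available.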
How to use
Download a dataset
import rouskinhf

rouskinhf.get_dataset(
    name='bpRNA-1m',        # the name of a dataset from huggingface/rouskinlab
    force_download=False,   # use a local copy of the data if it exists
)
Convert other formats to the rouskinhf format
import rouskinhf

rouskinhf.convert(
    format='ct',              # can be ct, seismic, bpseq, fasta or json (rouskinhf output data structure)
    file_or_folder='path/to/my/ct/folder',
    predict_structure=False,  # add structure from RNAstructure
    filter=True,              # removes duplicates, non-regular characters and low AUROC
    min_AUROC=0.8,
)
Note: Sequences containing bases other than A, C, G, T, U, N, a, c, g, t, u, n are not supported. They will be filtered out.
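To preview which sequences would survive the alphabet check, a small stand-alone filter can be useful. The code below is an illustrative sketch, not rouskinhf's implementation: the names ALLOWED_BASES, is_supported and filter_records are hypothetical, duplicates are assumed to mean exact string matches with the first occurrence kept, and the AUROC filter is not reproduced.

```python
# Alphabet taken from the note above.
ALLOWED_BASES = frozenset("ACGTUNacgtun")


def is_supported(sequence):
    """True if the sequence is non-empty and uses only the allowed bases."""
    return len(sequence) > 0 and set(sequence) <= ALLOWED_BASES


def filter_records(records):
    """Drop records with unsupported bases or an already-seen sequence.

    `records` maps a reference name to a dict with a "sequence" key.
    Assumption: exact-match duplicates, first occurrence wins.
    """
    seen = set()
    kept = {}
    for name, record in records.items():
        sequence = record["sequence"]
        if not is_supported(sequence) or sequence in seen:
            continue
        seen.add(sequence)
        kept[name] = record
    return kept
```

For instance, a record whose sequence contains X or R is dropped, and so is a second record that repeats an earlier sequence exactly.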
Rouskinhf structure format
# rouskinhf_output_file.json
{
    "reference_name": {
        "sequence": "CACGCUAUG",
        "structure": [(0,8), (1,7)], # base pair representation
        # whatever other info you need
    }
}
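The base pair representation can be turned into dot-bracket notation for a quick visual check. This is a minimal sketch under two assumptions: indices are 0-based (consistent with the pair (0,8) on the 9-base example above), and the structure has no pseudoknots, since a single bracket type cannot encode them. The function name pairs_to_dot_bracket is hypothetical and not part of rouskinhf.

```python
def pairs_to_dot_bracket(length, pairs):
    """Convert 0-based base pairs into a dot-bracket string of the given length."""
    chars = ["."] * length
    for i, j in pairs:
        if not (0 <= i < j < length):
            raise ValueError(f"invalid pair ({i}, {j}) for length {length}")
        if chars[i] != "." or chars[j] != ".":
            raise ValueError(f"position already paired in ({i}, {j})")
        chars[i] = "("
        chars[j] = ")"
    return "".join(chars)


sequence = "CACGCUAUG"
structure = [(0, 8), (1, 7)]
print(pairs_to_dot_bracket(len(sequence), structure))  # ((.....))
```

The range and reuse checks also catch pairs that point outside the sequence or reuse a base, which is a cheap sanity test on converted files.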
Download files
Source Distribution
rouskinhf-0.4.8.tar.gz (17.3 kB)
Built Distribution
rouskinhf-0.4.8-py3-none-any.whl (19.3 kB)
Hashes for rouskinhf-0.4.8-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 04576cb878ca6c049c22258083381d34a46eb7a583aeaefe0454cd3460a4c3ad
MD5 | 92f2a5e63b8872cd511db0fd9d344661
BLAKE2b-256 | 2e7f68789fba2f5fea1ef590f004f48be92ec856758b73241c92735eaaf2ddb9