A library to manipulate data for our DMS prediction models.
Download your RNA data from huggingface with rouskinhf!
A repo to manipulate the data for our RNA structure prediction model. The data is stored on HuggingFace and pulled locally for training models.
Installation
Get a HuggingFace token
Go to HuggingFace and create an account. Then go to your profile and copy your token (huggingface.co/settings/tokens).
Create an environment file
Open a terminal and type:
nano env
Paste the following content and change the values to your own:
HUGGINGFACE_TOKEN="your token here" # you must change this to your HuggingFace token
DATA_FOLDER="data/datafolders" # where datafolders are stored by default; change it to store them somewhere else
DATA_FOLDER_TESTING="data/input_files_for_testing" # Don't touch this
RNASTRUCTURE_PATH="/Users/ymdt/src/RNAstructure/exe" # Change this to the path of your RNAstructure executable
RNASTRUCTURE_TEMP_FOLDER="temp" # You can change this to the path of your RNAstructure temp folder
Then save the file and exit nano.
Source the environment
source env
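Before installing, it can help to confirm that the variables are actually visible from Python. The snippet below is a minimal sketch, not part of rouskinhf: the helper name missing_env_vars is hypothetical, and the variable names are copied from the env file above.

```python
import os

# Variable names copied from the env file described above.
REQUIRED_VARS = (
    "HUGGINGFACE_TOKEN",
    "DATA_FOLDER",
    "DATA_FOLDER_TESTING",
    "RNASTRUCTURE_PATH",
    "RNASTRUCTURE_TEMP_FOLDER",
)


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `environ`.

    Hypothetical helper for illustration; it is not provided by rouskinhf.
    """
    if environ is None:
        environ = os.environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```

If variables are reported missing right after sourcing the file, one possible cause is that plain assignments in a sourced file are not exported to child processes; running set -a before source env (and set +a afterwards) is one way to export them.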
Install the package with pip
pip install rouskinhf
Tutorials
Authenticate your machine to HuggingFace
See the tutorial.
Download a datafolder from HuggingFace
See the tutorial.
Create a datafolder from local files and push it to HuggingFace
See the tutorial.
About
Import data with import_dataset
This repo provides a function, import_dataset, which allows you to pull a dataset from HuggingFace and store it locally. If the data is already stored locally, it is loaded from the local folder instead. Two types of data are available: the DMS signal and the structure, the latter given as tuples of paired bases. The function has the following signature:
def import_dataset(name:str, data:str, force_download:bool=False)->np.ndarray:
    """Finds the dataset with the given name for the given type of data.

    Parameters
    ----------
    name : str
        Name of the dataset to find.
    data : str
        Name of the type of data to find the dataset for (structure or DMS).
    force_download : bool
        Whether to force download the dataset from HuggingFace Hub. Defaults to False.

    Returns
    -------
    ndarray
        The dataset with the given name for the given type of data.

    Example
    -------
    >>> import_dataset(name='for_testing', data='structure', hf_token=os.environ['HUGGINGFACE_TOKEN']).shape
    (2,)
    >>> import_dataset(name='for_testing', data='DMS', hf_token=os.environ['HUGGINGFACE_TOKEN']).shape
    (2,)
    >>> import_dataset(name='for_testing', data='structure', hf_token=os.environ['HUGGINGFACE_TOKEN'], force_download=True).shape
    (2,)
    >>> import_dataset(name='for_testing', data='DMS', hf_token=os.environ['HUGGINGFACE_TOKEN'], force_download=True).shape
    (2,)
    """
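The "load locally if present, otherwise download" behaviour described above follows a common caching pattern. The sketch below illustrates that pattern only; it is not the rouskinhf implementation. The names load_or_fetch and fetch are hypothetical, and fetch stands in for whatever call actually downloads the data from HuggingFace.

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(
    local_path: Path,
    fetch: Callable[[], bytes],
    force_download: bool = False,
) -> bytes:
    """Return the bytes cached at `local_path`, calling `fetch` only when needed.

    `fetch` is a hypothetical zero-argument downloader supplied by the caller.
    """
    # Reuse the local copy unless the caller explicitly asks for a fresh download.
    if local_path.exists() and not force_download:
        return local_path.read_bytes()

    # Download, then store the result so the next call can skip the network.
    payload = fetch()
    local_path.parent.mkdir(parents=True, exist_ok=True)
    local_path.write_bytes(payload)
    return payload
```

With this shape, force_download=True always refreshes the local copy, which matches the role the force_download parameter is documented to play.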
FYI, the datafolder object
The datafolder object is a wrapper around your local folder and the HuggingFace API that keeps a consistent data structure across your datasets. It provides methods to create datasets from various input formats, store the data and metadata in a systematic way, and push to or pull from HuggingFace.
On HuggingFace, the datafolder stores the data under the following structure:
HUGGINGFACE DATAFOLDER
- [datafolder name]
    - source
        - whichever file(s) you used to create the dataset (fasta, set of CTs, etc.).
    - data.json # the data in a human-readable format.
    - info.json # the metadata of the dataset. This file indicates how we got the DMS signal and the structures (directly from the source or from a prediction).
    - README.md # the metadata of the dataset in a human-readable format.
Locally, we have the same structure with the addition of .npy files which contain the data in a machine readable format. Each .npy file contains a numpy array of the data, and the name of the file is the name of the corresponding key in the data.json file. The source file won’t be downloaded by default. Hence, the local structure is:
LOCAL DATAFOLDER
- [datafolder name]
    ...
    - README.md # the metadata of the dataset in a human readable format
    - references.npy
    - sequences.npy
    - base_pairs.npy
    - dms.npy
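The naming rule above (one .npy file per key in data.json) can be sketched as follows. This is an illustration under a loud assumption: it treats data.json as a JSON object whose top-level keys are the array names, which the text does not confirm. The helper name expected_npy_files is hypothetical and is not part of rouskinhf.

```python
import json
from pathlib import Path


def expected_npy_files(datafolder: Path) -> list:
    """List the .npy paths implied by the top-level keys of data.json.

    Assumption: data.json is a JSON object keyed by array name
    (for example "sequences" or "dms"). The real layout may differ.
    """
    with open(datafolder / "data.json") as handle:
        data = json.load(handle)
    # One .npy file per key, named after the key.
    return [datafolder / f"{key}.npy" for key in data]
```

Comparing this list against the files actually present is one quick way to spot a partially downloaded datafolder.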