A library to manipulate data for our DMS prediction models.

Download your RNA data from HuggingFace with rouskinhf!

A repo to manipulate the data for our RNA structure prediction model. This repo allows you to:

- pull datasets from Rouskinlab's HuggingFace
- create datasets from local files and push them to HuggingFace, from the following formats:
    - .fasta
    - .ct
    - .json (DREEM output format)
    - .json (Rouskinlab's HuggingFace format)

Important notes

Sequences containing characters other than A, C, G, T, U, N, a, c, g, t, u, n are not supported. These sequences are filtered out of the data.
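To illustrate the rule above, here is a minimal sketch of such a filter. It is not the code rouskinhf runs internally; the names `ALLOWED_BASES`, `is_supported` and `filter_sequences`, and the reference-to-sequence dictionary layout, are assumptions made for this example.

```python
# Hypothetical helper names; only the allowed alphabet comes from the note above.
ALLOWED_BASES = frozenset("ACGTUNacgtun")


def is_supported(sequence: str) -> bool:
    """Return True if every character of the sequence is an allowed base."""
    return set(sequence) <= ALLOWED_BASES


def filter_sequences(records: dict[str, str]) -> dict[str, str]:
    """Keep only the records whose sequence uses allowed bases."""
    return {ref: seq for ref, seq in records.items() if is_supported(seq)}


records = {"ref1": "ACGUacgu", "ref2": "ACGX", "ref3": "NNNT"}
kept = filter_sequences(records)
print(kept)  # ref2 is dropped because of the unsupported character X
```

Checking your input with a helper like this before creating a datafolder makes it easier to see which sequences would be dropped.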

Dependencies

RNAstructure (also available on Rouskinlab GitHub).

Push a new release to PyPI

Edit version to vx.y.z in pyproject.toml. Then run in a terminal: `git add . && git commit -m 'vx.y.z' && git push`.
Create and push a git tag vx.y.z by running in a terminal: `git tag 'vx.y.z' && git push --tags`.
Create a release for the tag vx.y.z on GitHub Releases.
Make sure that the GitHub Action "Publish distributions 📦 to PyPI" passed on GitHub Actions.
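Before tagging, it can help to check that the tag you are about to push matches the version in pyproject.toml. The sketch below is one possible way to do that and is not part of rouskinhf. The helper names and the layout of the example pyproject text are assumptions, and it accepts the version written with or without the leading "v".

```python
import re

# A release tag of the form vx.y.z, as used in the steps above.
TAG_PATTERN = re.compile(r"v\d+\.\d+\.\d+")
# Assumes the version is written on its own line as: version = "..."
VERSION_LINE = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def read_version(pyproject_text: str) -> str:
    """Extract the version string from the text of a pyproject.toml file."""
    match = VERSION_LINE.search(pyproject_text)
    if match is None:
        raise ValueError("no version field found")
    return match.group(1)


def expected_tag(version: str) -> str:
    """Normalise '0.2.3' or 'v0.2.3' to the tag form 'v0.2.3'."""
    tag = version if version.startswith("v") else f"v{version}"
    if TAG_PATTERN.fullmatch(tag) is None:
        raise ValueError(f"{version!r} is not of the form x.y.z")
    return tag


# Hypothetical pyproject.toml content, for illustration only.
example = '[project]\nname = "rouskinhf"\nversion = "0.2.3"\n'
print(expected_tag(read_version(example)))  # v0.2.3
```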

Installation

Get a HuggingFace token

Go to HuggingFace and create an account. Then go to your profile and copy your token (huggingface.co/settings/tokens).

Create an environment file

Open a terminal and type:

nano env

Copy and paste the following content, and change the values to your own:

export HUGGINGFACE_TOKEN="your token here"  # you must change this to your HuggingFace token
export DATA_FOLDER="data/datafolders" # where the datafolders are stored by default; change it if you want to store them somewhere else
export DATA_FOLDER_TESTING="data/input_files_for_testing" # do not change this
export RNASTRUCTURE_PATH="/Users/ymdt/src/RNAstructure/exe" # change this to the path of your RNAstructure executable
export RNASTRUCTURE_TEMP_FOLDER="temp" # you can change this to the path of your RNAstructure temp folder

Then save the file and exit nano.

Source the environment

source env

Install the package with pip

pip install rouskinhf

Tutorials

Authenticate your machine with HuggingFace

See the tutorial.

Download a datafolder from HuggingFace

See the tutorial.

Create a datafolder from local files and push it to HuggingFace

See the tutorial.

About

Sourcing the environment and keeping your environment variable secret

The variables defined in the env file are required by rouskinhf. Make sure that, before you use rouskinhf, you run in a terminal:

source env

or, in a Jupyter notebook:

!pip install python-dotenv
%load_ext dotenv
%dotenv env

The point of using environment variables is to keep your HuggingFace token private. Make sure to add your env file to your .gitignore, so that your HuggingFace token doesn't get pushed to any public repository.
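A quick way to confirm that the environment was sourced correctly is to check the variables from Python. The sketch below is not part of rouskinhf. The helper names are assumptions, the variable names are the ones from the env file described earlier, and the token value is never printed.

```python
import os
from collections.abc import Mapping

# Variable names taken from the env file described in the installation steps.
REQUIRED_VARIABLES = (
    "HUGGINGFACE_TOKEN",
    "DATA_FOLDER",
    "DATA_FOLDER_TESTING",
    "RNASTRUCTURE_PATH",
    "RNASTRUCTURE_TEMP_FOLDER",
)


def missing_variables(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


def describe(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Summarise each variable without revealing the token."""
    summary = {}
    for name in REQUIRED_VARIABLES:
        value = environ.get(name)
        if not value:
            summary[name] = "<missing>"
        elif name == "HUGGINGFACE_TOKEN":
            summary[name] = "<set, hidden>"
        else:
            summary[name] = value
    return summary


# Lists whatever is missing in the current process; empty once env is sourced.
print(missing_variables())
```

If the printed list is not empty, source the env file again in the same shell (or reload it in the notebook) before importing rouskinhf.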

Import data with `import_dataset`

This repo provides a function, import_dataset, which allows you to pull a dataset from HuggingFace and store it locally. If the data is already stored locally, it is loaded from the local folder. The available data types are the DMS signal and the structure, the latter given as tuples of paired bases. The function has the following signature:

import numpy as np

def import_dataset(name: str, data: str, force_download: bool = False) -> np.ndarray:
    """Finds the dataset with the given name for the given type of data.

    Parameters
    ----------

    name : str
        Name of the dataset to find.
    data : str
        Name of the type of data to find the dataset for (structure or DMS).
    force_download : bool
        Whether to force download the dataset from HuggingFace Hub. Defaults to False.

    Returns
    -------

    ndarray
        The dataset with the given name for the given type of data.

    Example
    -------

    >>> import_dataset(name='for_testing', data='structure').keys()
    dict_keys(['references', 'sequences', 'structure'])
    >>> import_dataset(name='for_testing', data='DMS').keys()
    dict_keys(['references', 'sequences', 'DMS'])
    >>> import_dataset(name='for_testing', data='structure', force_download=True).keys()
    dict_keys(['references', 'sequences', 'structure'])
    >>> import_dataset(name='for_testing', data='DMS', force_download=True).keys()
    dict_keys(['references', 'sequences', 'DMS'])
    """
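
Since the structure is given as tuples of paired bases, a common next step is to turn those pairs into dot-bracket notation for inspection. The sketch below shows one way to do this. It is not part of rouskinhf, the function name is hypothetical, and it assumes 0-based indices and a structure without pseudoknots (crossing pairs cannot be shown with a single bracket type).

```python
def pairs_to_dot_bracket(pairs: list[tuple[int, int]], length: int) -> str:
    """Convert paired-base tuples to a dot-bracket string.

    Assumes 0-based indices; crossing pairs (pseudoknots) are not detected.
    """
    symbols = ["."] * length
    for i, j in pairs:
        if not (0 <= i < length and 0 <= j < length) or i == j:
            raise ValueError(f"invalid pair ({i}, {j}) for length {length}")
        left, right = min(i, j), max(i, j)
        if symbols[left] != "." or symbols[right] != ".":
            raise ValueError(f"base in pair ({i}, {j}) is already paired")
        symbols[left], symbols[right] = "(", ")"
    return "".join(symbols)


# Made-up hairpin for illustration, not data from a real datafolder.
sequence = "GGGAAACCC"
pairs = [(0, 8), (1, 7), (2, 6)]
print(pairs_to_dot_bracket(pairs, len(sequence)))  # (((...)))
```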

FYI, the datafolder object

The datafolder object is a wrapper around your local folder and the HuggingFace API, to keep a consistent data structure across your datasets. It contains multiple methods to create datasets from various input formats, store the data and metadata in a systematic way, and push to / pull from HuggingFace.

On HuggingFace, the datafolder stores the data under the following structure:

HUGGINGFACE DATAFOLDER
- [datafolder name]
    - source
        - whichever file(s) you used to create the dataset (fasta, set of CTs, etc.).
    - data.json # the data under a human readable format.
    - info.json # the metadata of the dataset. This file indicates how we got the DMS signal and the structures (directly from the source or from a prediction).
    - README.md # the metadata of the dataset in a human readable format.

Locally, we have the same structure with the addition of .npy files which contain the data in a machine readable format. Each .npy file contains a numpy array of the data, and the name of the file is the name of the corresponding key in the data.json file. The source file won’t be downloaded by default. Hence, the local structure is:

LOCAL DATAFOLDER
- [datafolder name]
    ...
    - README.md # the metadata of the dataset in a human readable format
    - references.npy
    - sequences.npy
    - base_pairs.npy
    - dms.npy
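
The naming rule described above (one .npy file per key) can be sketched as follows. This is an illustration and not rouskinhf's own code. The helper names are hypothetical, and the key names in the example are copied from the file listing above; check them against your own data.json.

```python
import tempfile
from pathlib import Path


def npy_paths(datafolder: Path, keys: list[str]) -> dict[str, Path]:
    """Map each key to the .npy file expected for it in the datafolder."""
    return {key: datafolder / f"{key}.npy" for key in keys}


def missing_arrays(datafolder: Path, keys: list[str]) -> list[str]:
    """Return the keys whose .npy file is not present locally."""
    return [key for key, path in npy_paths(datafolder, keys).items() if not path.is_file()]


# Key names copied from the local listing; an assumption for this example.
KEYS = ["references", "sequences", "base_pairs", "dms"]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "my_datafolder"
    folder.mkdir()
    # Simulate a partially populated datafolder with two empty files.
    for present in ("references", "sequences"):
        (folder / f"{present}.npy").touch()
    print(missing_arrays(folder, KEYS))  # ['base_pairs', 'dms']
```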

Release history

- 0.4.8 (May 14, 2024)
- 0.4.7 (Dec 22, 2023)
- 0.4.6 (Dec 21, 2023)
- 0.4.5 (Dec 18, 2023)
- 0.4.4 (Dec 12, 2023)
- 0.4.3 (Dec 12, 2023)
- 0.4.2 (Dec 11, 2023)
- 0.4.1 (Dec 11, 2023)
- 0.4.0 (Dec 11, 2023)
- 0.3.5 (Nov 22, 2023)
- 0.3.3 (Nov 21, 2023)
- 0.3.2 (Nov 21, 2023)
- 0.3.1 (Nov 19, 2023)
- 0.3.0 (Nov 11, 2023)
- 0.2.11 (Nov 9, 2023)
- 0.2.8 (Sep 20, 2023)
- 0.2.7 (Sep 19, 2023)
- 0.2.6 (Aug 2, 2023)
- 0.2.5 (Jul 28, 2023)
- 0.2.4 (Jul 27, 2023)
- 0.2.3 (Jul 26, 2023), this version
- 0.2.1 (Jul 25, 2023)
- 0.1.4 (Jul 24, 2023)
- 0.1.3 (Jul 24, 2023)
- 0.1.2 (Jul 21, 2023)
- 0.1.1 (Jul 20, 2023)
- 0.1.0 (Jul 20, 2023)
- 0.0.6 (Jul 19, 2023)
- 0.0.4 (Jul 19, 2023)
- 0.0.3 (Jul 19, 2023)
- 0.0.2 (Jul 19, 2023)

Download files

Source Distribution

rouskinhf-0.2.3.tar.gz (21.0 kB), uploaded Jul 26, 2023, Source

Built Distribution

rouskinhf-0.2.3-py3-none-any.whl (22.1 kB), uploaded Jul 26, 2023, Python 3

Hashes for rouskinhf-0.2.3.tar.gz

Algorithm	Hash digest
SHA256	`74b54225f0ed5f2f1be7b982212ea4ddce790ba22d614103fd236931d0de9d29`
MD5	`b0d4abbb05837055c672ddec1f04a68b`
BLAKE2b-256	`ba1f0d337fabd21df45f521c6fbb35221d86d9fe0bb1d1484a0151937b9c51ed`

Hashes for rouskinhf-0.2.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`77552554bde35221c520fd624d9b20d36987470bd610682e61f0b985f981e66c`
MD5	`67c42668644b5eb11f314a435d650bff`
BLAKE2b-256	`242ec130aa06b551563afd38ed4df112f16072b071285093c9e86815730f0963`
