Transformers for Transcripts

transcript_transformer

Deep learning utility functions for processing and annotating transcript genome data.

transcript_transformer was developed alongside TIS Transformer (paper, repository) and RIBO-former (paper, paper repository, tool repository). It uses the Performer architecture to allow the annotation and processing of transcripts at single-nucleotide resolution. The package uses h5py for data loading and pytorch-lightning as a high-level interface for training and evaluating deep learning models. transcript_transformer is designed to be highly modular, but it has not been tested for every combination of arguments and can therefore return errors. For a more targeted and streamlined explanation of how to apply TIS Transformer or RIBO-former, please refer to their repositories.

🔗 Installation

pytorch needs to be separately installed by the user.

Next, the package can be installed by running

pip install transcript-transformer

📖 User guide

The library provides a command-line tool, transcript_transformer, with four main functions: data, pretrain, train and predict.

Data loading

Data are organised by transcript and by information type. Each type of information is stored in its own h5py.dataset, and all entries belonging to a single transcript share the same index across these datasets. Variable-length arrays are used to store the sequences and annotations of all transcripts in a single dataset. Sequences are stored as integer arrays using the mapping {A:0, T:1, C:2, G:3, N:4}. An example data.h5 has the following structure:

data.h5                                     (h5py.file)
    transcript                              (h5py.group)
    ├── tis                                 (h5py.dataset, dtype=vlen(int))
    ├── contig                              (h5py.dataset, dtype=str)
    ├── id                                  (h5py.dataset, dtype=str)
    ├── seq                                 (h5py.dataset, dtype=vlen(int))
    ├── ribo                                (h5py.group)
    │   ├── SRR0000001                      (h5py.group)
    │   │   ├── 5                           (h5py.group)
    │   │   │   ├── data                    (h5py.dataset, dtype=vlen(int))
    │   │   │   ├── indices                 (h5py.dataset, dtype=vlen(int))
    │   │   │   ├── indptr                  (h5py.dataset, dtype=vlen(int))
    │   │   │   ├── shape                   (h5py.dataset, dtype=vlen(int))
    │   ├── ...
    │   ....
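As a minimal sketch of the integer encoding described above, the snippet below converts a nucleotide string into the integers stored under seq. The function name encode_sequence and the choice to map unrecognised characters to N are assumptions made for illustration; the package's own parser may behave differently.

```python
# Mapping taken from the data loading description above.
NUCLEOTIDE_TO_INT = {"A": 0, "T": 1, "C": 2, "G": 3, "N": 4}


def encode_sequence(seq):
    """Encode a nucleotide string as a list of integers.

    Assumption: characters outside the mapping are encoded as N (4).
    """
    unknown = NUCLEOTIDE_TO_INT["N"]
    return [NUCLEOTIDE_TO_INT.get(base, unknown) for base in seq.upper()]


print(encode_sequence("ATGCN"))  # [0, 1, 3, 2, 4]
```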

Ribosome profiling data is stored as reads mapped to each transcript position, further separated by read length. As ribosome profiling data is often sparse, scipy.sparse is used to save the data within the h5 format. This saves space and allows matrix objects to be stored as separate arrays. Saving and loading of the data is handled by the h5max package.
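The data, indices, indptr and shape datasets in the structure above correspond to the components of a compressed sparse matrix. The pure-Python sketch below shows how a small matrix decomposes into those arrays and back, assuming the compressed sparse row (CSR) layout. The toy matrix and the helper names to_csr and from_csr are illustrative assumptions; the package itself relies on scipy.sparse and h5max rather than code like this.

```python
def to_csr(dense):
    """Decompose a dense 2D list into CSR components."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # non-zero values, row by row
                indices.append(col)     # column of each non-zero value
        # indptr marks where each row ends inside data/indices.
        indptr.append(len(data))
    n_cols = len(dense[0]) if dense else 0
    shape = (len(dense), n_cols)
    return data, indices, indptr, shape


def from_csr(data, indices, indptr, shape):
    """Rebuild the dense 2D list from CSR components."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        for k in range(indptr[row], indptr[row + 1]):
            dense[row][indices[k]] = data[k]
    return dense


# Toy read-count matrix: mostly zeros, as is common for ribosome profiling.
counts = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
data, indices, indptr, shape = to_csr(counts)
print(data, indices, indptr, shape)
# [3, 1, 2] [2, 0, 3] [0, 1, 1, 3] (3, 4)
```

Only the three non-zero entries are kept, which is where the space saving comes from when most positions have no mapped reads.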

data

transcript_transformer data processes the transcriptome of a given assembly to make it readily available for data loading. Dictionary files (.yml/.json) specify which data are applied to the models. After processing, these dictionary files can still be altered to define what data is used for a specific run. As such, all available data for a given assembly can be stored in a single database. New ribosome profiling experiments can be added to an existing database by updating the config file and running transcript_transformer data again.

Data can be parsed by running

transcript_transformer data template.yml

where template.yml is:

gtf_path : path/to/gtf_file.gtf
fa_path : path/to/fa_file.fa
########################################################
## add entries when using ribosome profiling data.
## format: 'id : ribosome profiling paths'
## leave empty for sequence input models (TIS transformer)
## DO NOT change id after data is parsed to h5 file
########################################################
ribo_paths :
  SRR000001 : ribo/SRR000001.sam
  SRR000002 : ribo/SRR000002.sam
  SRR000003 : ribo/SRR000003.sam
########################################################
## Data is parsed and stored in a hdf5 format file.
########################################################
h5_path : my_experiment.h5
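Because dictionary files may also be given as .json, the same settings can be assembled programmatically. The sketch below builds a dictionary with the keys of template.yml and serialises it with the standard library. The helper build_config is hypothetical, the paths are the placeholders from the template above, and whether an empty mapping is treated the same as an empty ribo_paths entry in YAML is an assumption.

```python
import json


def build_config(gtf_path, fa_path, h5_path, ribo_paths=None):
    """Collect the template.yml keys in a plain dictionary."""
    return {
        "gtf_path": gtf_path,
        "fa_path": fa_path,
        # Left empty for sequence input models (TIS transformer).
        "ribo_paths": dict(ribo_paths or {}),
        "h5_path": h5_path,
    }


config = build_config(
    gtf_path="path/to/gtf_file.gtf",
    fa_path="path/to/fa_file.fa",
    h5_path="my_experiment.h5",
    ribo_paths={
        "SRR000001": "ribo/SRR000001.sam",
        "SRR000002": "ribo/SRR000002.sam",
    },
)

# Serialise to JSON text; write this to a file to use it as a config.
config_text = json.dumps(config, indent=2)
print(config_text)
```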

When applying a model, specify whether sequence information is used (e.g. for TIS transformer).

########################################################
## For models using transcript sequence data.
## This setting is used by the TIS-transformer.
## Set to false when training on ribo-seq data
########################################################
seq : false

Several other options exist that specify how ribosome profiling data is loaded. More information on each option is present in the yaml file.

pretrain

In line with transformers trained for natural language processing objectives, models can first be trained with a self-supervised learning objective. Using a masked language modelling approach, models are tasked with predicting the classes of masked input tokens. In this way, a model learns the 'semantics' of transcript sequences. The approach is similar to the one described by Zaheer et al.

Example

transcript_transformer pretrain input_data.yml --val 1 13 --test 2 14 --max_epochs 70 --accelerator gpu --devices 1
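To illustrate the masked language modelling objective, the sketch below hides a random subset of integer-encoded nucleotides and records the original values as prediction targets. It is a simplified illustration, not the package's implementation: the mask token id, the ignore label, the mask rate and the helper name mask_tokens are all assumptions, and real schemes often also replace some tokens with random ones.

```python
import random

MASK_ID = 5          # hypothetical id for the mask token
IGNORE_LABEL = -100  # hypothetical label for positions excluded from the loss


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Return (masked_input, labels) for one masked language modelling step."""
    rng = random.Random(seed)
    masked_input, labels = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            masked_input.append(MASK_ID)   # hide the nucleotide from the model
            labels.append(token)           # the model must recover it
        else:
            masked_input.append(token)     # visible context
            labels.append(IGNORE_LABEL)    # not scored
    return masked_input, labels


tokens = [0, 1, 3, 2, 0, 0, 1, 3, 3, 2]  # integer-encoded nucleotides
masked_input, labels = mask_tokens(tokens, mask_rate=0.3, seed=0)
print(masked_input)
print(labels)
```

The loss is then computed only at the masked positions, so the model has to infer each hidden nucleotide from its surrounding context.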

train

The package supports training the model architectures listed under transcript_transformer/models.py. The function expects a configuration file containing the input data info (see Data loading). Use the --transfer_checkpoint flag to start training from a pre-trained model.

Example

transcript_transformer train input_data.yml --val 1 13 --test 2 14 --max_epochs 70 --transfer_checkpoint lightning_logs/mlm_model/version_0/ --name experiment_1 --accelerator gpu --devices 1

predict

The predict function returns probabilities for all nucleotide positions on the transcript, which can be saved in the .npy or .h5 format. In addition to reading from .h5 files, the function accepts a single RNA sequence or a path to a .fa file as input. Note that .fa and .npy formats are only supported for models that use transcript nucleotide information alone.

Example

transcript_transformer predict AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACGGT RNA --output_type npy models/example_model.ckpt
transcript_transformer predict data/example_data.fa fa --output_type npy models/example_model.ckpt

Output data

The model returns predictions for every nucleotide on the transcripts. For each transcript, the array lists the transcript label and model outputs. The tool can write predictions in either the npy or h5 format.

>>> import numpy as np
>>> results = np.load('results.npy', allow_pickle=True)
>>> results[0]
array(['>ENST00000410304',
       array([2.3891837e-09, 7.0824785e-07, 8.3791534e-09, 4.3269135e-09,
              4.9220684e-08, 1.5315813e-10, 7.0196869e-08, 2.4103475e-10,
              4.5873511e-10, 1.4299616e-10, 6.1071654e-09, 1.9664975e-08,
              2.9255699e-07, 4.7719610e-08, 7.7600065e-10, 9.2305236e-10,
              3.3297397e-07, 3.5771163e-07, 4.1942007e-05, 4.5123262e-08,
              1.0270607e-11, 1.1841109e-09, 7.9038587e-10, 6.5511790e-10,
              6.0892291e-13, 1.6157842e-11, 6.9130129e-10, 4.5778301e-11,
              2.1682500e-03, 2.3315516e-09, 2.2578116e-11], dtype=float32)],
      dtype=object)
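A common next step is to pick out the positions with the highest predicted probability. The helper below is a minimal sketch that works on any sequence of per-nucleotide scores, such as the second element of an entry in results. The name top_sites and the threshold value are assumptions chosen for illustration, not settings recommended by the package.

```python
def top_sites(probabilities, threshold=1e-3):
    """Return (index, probability) pairs at or above threshold, highest first.

    Indices are 0-based positions in the output array.
    """
    hits = [
        (index, float(prob))
        for index, prob in enumerate(probabilities)
        if prob >= threshold
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Last three values of the example output above.
example = [2.1682500e-03, 2.3315516e-09, 2.2578116e-11]
print(top_sites(example))  # [(0, 0.00216825)]
```

With results loaded as shown above, the same call can be applied per transcript, for example top_sites(results[0][1]).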

Other function flags

Various other flags control the properties of the dataloader, model architecture and training procedure. They can be listed with:

transcript_transformer data -h
transcript_transformer pretrain -h
transcript_transformer train -h
transcript_transformer predict -h

✔️ Package features

creation of h5 file from genome assemblies and ribosome profiling datasets
bucket sampling
pre-training functionality
data loading for sequence and ribosome data
custom target labels
function hooks for custom data loading and pre-processing
model architectures
application of trained networks
post-processing
test scripts

🖊️ Citation

@article {10.1093/nargab/lqad021,
    author = {Clauwaert, Jim and McVey, Zahra and Gupta, Ramneek and Menschaert, Gerben},
    title = "{TIS Transformer: remapping the human proteome using deep learning}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {5},
    number = {1},
    year = {2023},
    month = {03},
    issn = {2631-9268},
    doi = {10.1093/nargab/lqad021},
    url = {https://doi.org/10.1093/nargab/lqad021},
    note = {lqad021},
    eprint = {https://academic.oup.com/nargab/article-pdf/5/1/lqad021/49418780/lqad021\_supplemental\_file.pdf},
}
