Library for determining whether an RNA splicing predictor is using frame alignment information
Project description
Frame alignment checks
A set of tools for checking whether splicing prediction models are using frame alignment information. The checks are run on a set of "long canonical internal coding exons" (see below for the definition).
Data sourcing
Long canonical internal coding exons
These are the exons that meet all of the following conditions:
- appear in the SAM validation set (first half of the SpliceAI test set) of canonical exons in certain chromosomes
- have exactly one Ensembl annotation whose transcript is the same as the canonical transcript
- start and end in a coding region
- have a length of at least 100 nt
This set is built into the package and does not need to be provided by the user.
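The conditions above can be read as a single predicate over an exon record. The following is a minimal sketch only: `ExonRecord`, its field names, and `is_long_canonical_internal_coding_exon` are hypothetical and are not part of the package's API, which ships the precomputed set instead.

```python
from dataclasses import dataclass

MIN_EXON_LENGTH = 100  # "length at least 100nt" from the list above


@dataclass(frozen=True)
class ExonRecord:
    """Hypothetical record; the field names are illustrative assumptions."""

    in_sam_validation_set: bool
    # Number of Ensembl annotations whose transcript is the canonical transcript.
    matching_ensembl_annotations: int
    starts_in_coding_region: bool
    ends_in_coding_region: bool
    start: int  # 0-based, inclusive (assumed convention)
    end: int  # 0-based, exclusive (assumed convention)

    @property
    def length(self) -> int:
        return self.end - self.start


def is_long_canonical_internal_coding_exon(exon: ExonRecord) -> bool:
    """Return True only if the exon satisfies every listed condition."""
    return (
        exon.in_sam_validation_set
        and exon.matching_ensembl_annotations == 1
        and exon.starts_in_coding_region
        and exon.ends_in_coding_region
        and exon.length >= MIN_EXON_LENGTH
    )
```

All conditions are combined with a logical AND, so an exon failing any single one is dropped.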
The sources used to construct this are the SpliceAI dataset [1] via the SAM repository's implementation (which is itself based on the SpliceAI implementation) [2] and the Ensembl database [3].
Relevant validation genes
These are sourced, like the long canonical internal coding exons, from the SAM validation set. Only genes relevant to the long canonical internal coding exons are pulled.
Minigenes
The minigenes are sourced from the hg19 canonical transcript, defined in the same way as SpliceAI's canonical transcript.
Saturation Mutagenesis test benchmark
This is sourced from [this link](https://genome.cshlp.org/content/suppl/2017/12/14/gr.219683.116.DC1/Supplemental_Table_S2.xlsx) and is cached in this package in case the link goes down.
Phase handedness counts
This is a count of how many times each donor 9mer appears in each phase. This is sourced from the SpliceAI training set, via SAM.
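As a rough illustration of what such a table looks like, the sketch below tallies 9mers by phase. The function name, the input shape (pairs of 9mer and phase), and the encoding of phase as an integer 0, 1, or 2 are assumptions made for this example; the package's actual data layout may differ.

```python
from collections import defaultdict


def count_ninemers_by_phase(donors):
    """Tally how often each donor 9mer occurs in each phase.

    `donors` is an iterable of (ninemer, phase) pairs, with phase in {0, 1, 2}.
    Returns a dict mapping each 9mer to a list [count_0, count_1, count_2].
    """
    counts = defaultdict(lambda: [0, 0, 0])
    for ninemer, phase in donors:
        if len(ninemer) != 9:
            raise ValueError(f"expected a 9mer, got {ninemer!r}")
        if phase not in (0, 1, 2):
            raise ValueError(f"phase must be 0, 1, or 2, got {phase!r}")
        counts[ninemer][phase] += 1
    return dict(counts)


# Toy input, not real data.
example = [("CAGGTAAGT", 0), ("CAGGTAAGT", 0), ("CAGGTAAGT", 2), ("AAGGTGAGT", 1)]
example_counts = count_ninemers_by_phase(example)
```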
Non-stop donor windows
This is a collection of donors from the SpliceAI test set (again via SAM), specifically ones where swapping the donor 9mer with an arbitrary sequence would not introduce a stop codon in the exon. We exclude cases where the flanking exon ends with a sequence that is a prefix of a stop codon: T, TA, or TG.
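The exclusion rule can be sketched as a check on the incomplete codon at the end of the flanking exon. This is an illustration under stated assumptions, not the package's implementation: the function name is hypothetical, and `trailing_bases` (the number of exon-final bases that belong to an unfinished codon) is assumed to be known from the reading frame.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")
# Proper prefixes of the stop codons: {"T", "TA", "TG"}.
STOP_PREFIXES = {codon[:k] for codon in STOP_CODONS for k in (1, 2)}


def ends_with_stop_prefix(flanking_exon: str, trailing_bases: int) -> bool:
    """Return True if the exon's trailing partial codon could start a stop codon.

    `trailing_bases` is 0, 1, or 2: how many bases at the end of
    `flanking_exon` belong to a codon that is not yet complete.
    """
    if trailing_bases not in (0, 1, 2):
        raise ValueError("trailing_bases must be 0, 1, or 2")
    if trailing_bases == 0:
        # The exon ends on a codon boundary, so there is no partial codon.
        return False
    if len(flanking_exon) < trailing_bases:
        raise ValueError("flanking_exon is shorter than trailing_bases")
    return flanking_exon[-trailing_bases:].upper() in STOP_PREFIXES
```

A donor would be kept in the collection only when this check returns False for its flanking exon.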
Acceptor and donor LSSI models
These are models trained on the SpliceAI training set, via SAM. They are copied directly from https://github.com/kavigupta/sam/tree/main/spliceai/Canonical/splicepoint-models. We only use these in tests, and they are not required for the package to run.
[1]: Jaganathan, Kishore, et al. "Predicting splicing from primary sequence with deep learning." Cell 176.3 (2019): 535-548.
[2]: https://github.com/kavigupta/sam
[3]: https://useast.ensembl.org/index.html
Download files
Source Distribution
File details
Details for the file frame_alignment_checks-0.0.53.tar.gz.
File metadata
- Download URL: frame_alignment_checks-0.0.53.tar.gz
- Upload date:
- Size: 30.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b0b10a00945047704450d520df9e3c2725805bff71ede62ffbb4105bba3d2a2 |
| MD5 | 140481696fb9de5b65b8ce25be521d8e |
| BLAKE2b-256 | 2c9bce9f6ddddd58addca2b3c69a586c0f773f53960c939ade59f7671cfa87da |
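A downloaded archive can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a minimal sketch; the helper names are hypothetical, and the local path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib

# SHA256 digest as listed for frame_alignment_checks-0.0.53.tar.gz.
EXPECTED_SHA256 = "5b0b10a00945047704450d520df9e3c2725805bff71ede62ffbb4105bba3d2a2"


def sha256_of_chunks(chunks) -> str:
    """Return the hex SHA256 digest of an iterable of bytes objects."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so it is never fully loaded into memory."""
    with open(path, "rb") as handle:
        return sha256_of_chunks(iter(lambda: handle.read(chunk_size), b""))


# Usage (assumes the archive is in the current directory):
# assert sha256_of_file("frame_alignment_checks-0.0.53.tar.gz") == EXPECTED_SHA256
```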