
Library for determining whether an RNA splicing predictor is using frame alignment information


Frame alignment checks

A set of tools for checking whether splicing prediction models use frame alignment information. The checks run on a set of "long canonical internal coding exons" (defined below).

Data sourcing

Long canonical internal coding exons

An exon is included if it meets all of the following conditions:

  • appears in the SAM validation set (the first half of the SpliceAI test set) of canonical exons in certain chromosomes
  • has exactly one Ensembl annotation whose transcript is the same as the canonical transcript
  • starts and ends in a coding region
  • is at least 100 nt long

This set is built into the package and does not need to be provided by the user.
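
The conditions above can be sketched as a simple predicate. This is an illustration only, not the package's implementation: the `Exon` record, its field names, and the half-open coordinate convention are all assumptions made for the example.

```python
from dataclasses import dataclass, replace

MIN_LENGTH = 100  # minimum exon length in nucleotides, from the list above


@dataclass(frozen=True)
class Exon:
    """Hypothetical exon record; the field names are assumptions for this sketch."""

    start: int  # first exon position (inclusive)
    end: int  # one past the last exon position (exclusive)
    cds_start: int  # start of the transcript's coding region (inclusive)
    cds_end: int  # end of the transcript's coding region (exclusive)
    n_canonical_annotations: int  # annotations matching the canonical transcript
    in_validation_set: bool  # whether the exon is in the validation set


def is_long_canonical_internal_coding_exon(exon: Exon) -> bool:
    """Return True if the exon satisfies all four conditions."""
    return (
        exon.in_validation_set
        and exon.n_canonical_annotations == 1
        and exon.cds_start <= exon.start  # starts in the coding region
        and exon.end <= exon.cds_end  # ends in the coding region
        and exon.end - exon.start >= MIN_LENGTH
    )


example = Exon(
    start=200,
    end=350,
    cds_start=100,
    cds_end=1000,
    n_canonical_annotations=1,
    in_validation_set=True,
)
too_short = replace(example, end=250)  # only 50 nt long
```

Here `example` passes all four checks, while `too_short` fails only the length check.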

The sources used to construct this are the SpliceAI dataset [1] via the SAM repository's implementation (which is itself based on the SpliceAI implementation) [2] and the Ensembl database [3].

Relevant validation genes

These are sourced, like the long canonical internal coding exons, from the SAM validation set. Only genes relevant to the long canonical internal coding exons are pulled.

Minigenes

The minigenes are sourced from the hg19 canonical transcript, defined in the same way as SpliceAI's canonical transcript.

Saturation Mutagenesis test benchmark

This is sourced from [this link](https://genome.cshlp.org/content/suppl/2017/12/14/gr.219683.116.DC1/Supplemental_Table_S2.xlsx) and is cached in this package in case the link goes down.

Phase handedness counts

This is a count of how many times each donor 9mer appears in each phase. This is sourced from the SpliceAI training set, via SAM.
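
As a rough illustration of what such a table looks like, the sketch below tallies (9mer, phase) pairs into per-9mer counts. The input format, the function name `count_donor_phases`, and the encoding of phase as an integer 0, 1, or 2 are assumptions for the example; the package's own representation may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def count_donor_phases(donors: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Count how often each donor 9mer occurs in each phase.

    `donors` yields (ninemer, phase) pairs, with phase assumed to be 0, 1, or 2.
    Returns a mapping from 9mer to a list [count_phase0, count_phase1, count_phase2].
    """
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0, 0])
    for ninemer, phase in donors:
        if len(ninemer) != 9:
            raise ValueError(f"expected a 9mer, got {ninemer!r}")
        if phase not in (0, 1, 2):
            raise ValueError(f"phase must be 0, 1, or 2, got {phase!r}")
        counts[ninemer][phase] += 1
    return dict(counts)


counts = count_donor_phases(
    [("CAGGTAAGT", 0), ("CAGGTAAGT", 0), ("CAGGTAAGT", 2), ("AAGGTGAGT", 1)]
)
# counts["CAGGTAAGT"] is [2, 0, 1] and counts["AAGGTGAGT"] is [0, 1, 0]
```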

Non-stop donor windows

This is a collection of donors from the SpliceAI test set (again via SAM), specifically ones where swapping the donor 9mer with an arbitrary sequence would not introduce a stop codon in the exon. We exclude cases where the flanking exon ends with a sequence that is a prefix of a stop codon: T, TA, or TG.
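
The exclusion rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: it assumes the flanking exon sequence is supplied already in frame (its first base is the first base of a codon), and the function names are hypothetical.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")

# Every 1- and 2-base prefix of a stop codon: {"T", "TA", "TG"}
STOP_PREFIXES = {codon[:k] for codon in STOP_CODONS for k in (1, 2)}


def trailing_partial_codon(in_frame_seq: str) -> str:
    """Return the 0, 1, or 2 bases left over after the last complete codon."""
    remainder = len(in_frame_seq) % 3
    return in_frame_seq[len(in_frame_seq) - remainder:] if remainder else ""


def is_non_stop_window(in_frame_seq: str) -> bool:
    """True if the flanking exon does not end in a prefix of a stop codon."""
    return trailing_partial_codon(in_frame_seq.upper()) not in STOP_PREFIXES


kept = is_non_stop_window("ATGGCTTC")  # trailing "TC" is not a stop prefix
dropped = is_non_stop_window("ATGGCTTA")  # trailing "TA" could complete TAA or TAG
```

A sequence that ends on a codon boundary has an empty trailing partial codon, so it is always kept under this rule.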

[1]: Jaganathan, Kishore, et al. "Predicting splicing from primary sequence with deep learning." Cell 176.3 (2019): 535-548.

[2]: https://github.com/kavigupta/sam

[3]: https://useast.ensembl.org/index.html


