Library for determining whether an RNA splicing predictor is using frame alignment information

Frame alignment checks

A set of tools for checking whether splicing prediction models use frame alignment information. The checks run on a set of "long canonical internal coding exons" (defined below).

Data sourcing

Long canonical internal coding exons

An exon is included in this set if it meets all of the following conditions:

  • it appears in the SAM validation set (first half of the SpliceAI test set) of canonical exons in certain chromosomes
  • it has exactly one Ensembl annotation whose transcript is the same as the canonical transcript
  • it starts and ends in a coding region
  • its length is at least 100 nt

This set is built into the package and does not need to be provided by the user.
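The conditions above can be sketched as a simple predicate. This is only an illustration: the `Exon` record and its field names are hypothetical and are not the package's actual data model; only the four conditions and the 100 nt threshold come from the description above.

```python
from dataclasses import dataclass

MIN_LENGTH_NT = 100  # length threshold stated in the list above


@dataclass(frozen=True)
class Exon:
    # Hypothetical record; the field names are illustrative assumptions,
    # not the schema used by the package.
    in_validation_set: bool          # appears in the SAM validation set
    num_canonical_annotations: int   # Ensembl annotations matching the canonical transcript
    starts_in_coding: bool           # first base lies in a coding region
    ends_in_coding: bool             # last base lies in a coding region
    start: int                       # 0-based, inclusive
    end: int                         # 0-based, exclusive


def is_long_canonical_internal_coding_exon(exon: Exon) -> bool:
    """Return True only if the exon meets every listed condition."""
    return (
        exon.in_validation_set
        and exon.num_canonical_annotations == 1
        and exon.starts_in_coding
        and exon.ends_in_coding
        and exon.end - exon.start >= MIN_LENGTH_NT
    )
```

Because the set ships with the package, users would not normally need to run a filter like this themselves; the sketch only makes the selection criteria concrete.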

The sources used to construct this are the SpliceAI dataset [1] via the SAM repository's implementation (which is itself based on the SpliceAI implementation) [2] and the Ensembl database [3].

Relevant validation genes

These are sourced, like the long canonical internal coding exons, from the SAM validation set. Only genes relevant to the long canonical internal coding exons are pulled.

Minigenes

The minigenes are sourced from the hg19 canonical transcript, defined in the same way as SpliceAI's canonical transcript.

Saturation Mutagenesis test benchmark

This is sourced from https://genome.cshlp.org/content/suppl/2017/12/14/gr.219683.116.DC1/Supplemental_Table_S2.xlsx and is cached in this package in case the link goes down.

Phase handedness counts

This is a count of how many times each donor 9mer appears in each phase. This is sourced from the SpliceAI training set, via SAM.
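As a rough illustration of what such a table looks like, the sketch below tallies donor 9mers by phase. The function name and the input format (an iterable of `(nine_mer, phase)` pairs) are assumptions made for this example, and the phase convention (which codon position maps to 0, 1, or 2) is not specified here; the package's cached counts may be stored differently.

```python
from collections import Counter, defaultdict


def count_donor_9mers_by_phase(donors):
    """Tally how often each donor 9mer occurs in each phase.

    `donors` is an iterable of (nine_mer, phase) pairs, with phase in {0, 1, 2}.
    Returns a mapping: nine_mer -> Counter({phase: count}).
    """
    counts = defaultdict(Counter)
    for nine_mer, phase in donors:
        if len(nine_mer) != 9:
            raise ValueError(f"expected a 9mer, got {nine_mer!r}")
        if phase not in (0, 1, 2):
            raise ValueError(f"phase must be 0, 1, or 2, got {phase!r}")
        counts[nine_mer][phase] += 1
    return counts
```

A `Counter` returns 0 for phases that were never observed, so looking up a missing phase for a known 9mer does not raise an error.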

Non-stop donor windows

This is a collection of donors from the SpliceAI test set (again via SAM), restricted to those where swapping the donor 9mer with an arbitrary sequence would not introduce a stop codon in the exon. Specifically, we exclude cases where the flanking exon ends with a sequence that is a prefix of a stop codon: T, TA, or TG.
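The exclusion rule can be sketched as follows. The function names, and the `leading_bases` parameter used to express the reading frame, are assumptions made for this example rather than the package's API; how the package delimits the "flanking exon" relative to the 9mer is also not specified here.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")

# Proper prefixes of the stop codons: {"T", "TA", "TG"}
STOP_PREFIXES = {stop[:k] for stop in STOP_CODONS for k in (1, 2)}


def trailing_partial_codon(flank: str, leading_bases: int = 0) -> str:
    """Return the incomplete codon at the 3' end of `flank`.

    `leading_bases` (an assumption of this sketch) is the number of bases at
    the start of `flank` that belong to a codon begun upstream.
    """
    n = (len(flank) - leading_bases) % 3
    return flank[len(flank) - n:]


def is_non_stop_window(flank: str, leading_bases: int = 0) -> bool:
    """True if the flank does not end in a prefix of a stop codon."""
    return trailing_partial_codon(flank, leading_bases) not in STOP_PREFIXES
```

A flank that ends exactly on a codon boundary has an empty trailing partial codon, so this check keeps it.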

Acceptor and donor LSSI models

These are models trained on the SpliceAI training set, via SAM. They are copied directly from https://github.com/kavigupta/sam/tree/main/spliceai/Canonical/splicepoint-models. We use them only in tests; they are not required for the package to run.

[1]: Jaganathan, Kishore, et al. "Predicting splicing from primary sequence with deep learning." Cell 176.3 (2019): 535-548.
[2]: https://github.com/kavigupta/sam
[3]: https://useast.ensembl.org/index.html
