Skip to main content

Find and fix missed small exons.

Project description

MisSER-project

An annotation based method to find and fix small exons missed alignment defects in Nanopore long reads.

The example plot above show the misaligned exon in the middle is find by compaired the reference annotation BED and realigned to the correct position. When we use [minimap2](https://github.com/lh3/minimap2) to align ONT cDNA reads, it is easy to miss the exon with small size, because of the difficulty to find an exact match anchor in these region.

INSTALLATION

pip install MisSER

Upgrade to a newer version using: pip install MisSER --upgrade

The package is written for python3

INPUT

MisSER requires four essential arguments.

  • BAM file
  • genome reference fasta (with index fai in the same directory)
  • transcript annotation in BED12 format
  • the target file for the output exon-missed regions information

OUTPUT

  • regions which have missed small exons
  • realigned BAM file which are fixed

USAGE

MisSER [-h] [-v] [-c N] [-s N] [-d float] [-f N] [--strandSpecific]
              [--allTranscripts] [--fixFlankLen] [--debugMode] [--setTag]
              [-o file | --onlyRegion]
              inBam genomeFasta annotBed outRegion

positional arguments:
  inBam                 Input original bam file.
  genomeFasta           Reference genome fasta file, with fai index under the same directory.
  annotBed              Annotated transcripts file in BED12 format.
  outRegion             Output Region file, regions contain missed small exons.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Print version and exit.
  -c, --coreNum N       The number of cpu cores we used. (default: 1)
  -s, --exonSizeThd N   The threshold of exons size, ignore exons with size > exonSizeThd. (default: 80)
  -d, --deltaRatioThd float
                        The threshold of absolute delta ratio, ignore abs(delta ratio) > deltaRatioThd.
                        (default: 0.5)
  -f, --flankLen N      The extended length on the both sides of realign region. (default: 20)
  --strandSpecific      Only compare reads consistent with annotated strand. (default: False)
  --allTranscripts      Return all possible missed exons among overlapping transcripts. (default: False)
  --fixFlankLen         Flank length will not extend to cover the adjacent indels. (default: False)
  --debugMode           Won't stop when meet an error in one read. (default: False)
  --setTag              Set fr tags on the realigned reads (default: False)
  -o, --outBam file     Output realigned bam file. (default: None)
  --onlyRegion          Only return the Region file without realign process. (default: False)

NOTES

Misaligned exons are absent from annotated position and connected to the neighbour exon as extra protruding parts with high error rate. MisSER will first select all introns on reads which overlap with annotated exons and set borders on the suspected misaligned regions. Then MisSER tries to realign the read sequence in region and compare the alignment score before and after realignment. If alignment score improves, the region will be assigned as misaligned region.
  • Delta length As the picture show above, Delta length is defined as the extra bases between reads flank exons and reference flank exons. Intuitively, it is the bases count in the extra protruding parts. We find the reads introns if the intron overlaps with the annotated exon, and then check whether the Delta length close to the exon length. It is easy to think that Delta length should be close to the Exon length if they come from the annotated exon.
  • Flank length In order to define the range of the realign region. we set the region start = min(annoted splice site, read splice site) - Flank length and the region end = max(annoted splice site, read splice site) + Flank length
  • Delta ratio Delta ratio = (Delta length - Exon length) / Exon length. We choose Exon length <= --exonSizeThd and Delta ratio <= --deltaRatioThd to filter out the false positives and record the regions as suspected exon-missed region. The regions information is output in outRegion.
  • --allTranscripts One Read intron may be overlapped with exons on multiple transcripts. As default, we only return the annotated transcript with minimum Delta raio as the correct reference. You can set --allTranscripts to keep all possible transcripts reference. In the realign process, we only keep one realign result among different transcript, the one with highest realignment score will be kept.

We use parasail to do the global pairwise alignment between read sequence and annotated exon sequence in the realign region. We only keep the realign result if realigned score > original score. ATTENTION: the original score does not refer to the AS field in BAM if provided. We calculate the realigned score and origial score based on the realignment and original alignment in the realign region, and the score is equal to the matched bases count - edit distance. As different alignment tools may have different score system, we do not change the AS of NM field in BAM if provided.

  • --onlyRegion Although we default to return the realigned result in -o, --outBam file, you can set the --onlyRegion to skip the realign process (although the realign process is not the bottleneck at present).
  • --strandSpecific This argument is used if the original reads is stranded RNA-seq. We will only try the strand of mapping read to find the overlapped exons.

EXAMPLE USAGE

git clone https://github.com/zhenLiuExplr/fixalign-project
cd examples
MisSER ex.bam ex.fa ex_annotation.bed ex_realign_region -o realn.bam

Be careful that chromosome in annotBed, genomeFasta and inBam should have same naming style (All in UCSC style like "chr1" or in Ensembl style like "1"). Inconsistent naming style will lead to failed judgement.

ACKNOWLEDGMENTS/CONTRIBUTORS

  • Zhen Liu for building and maintance
  • Wu Wei and Chenchen Zhu for advising

CONTRIBUTING

Welcome for all suggestions, bug reports, feature request and contributions. You can leave an issue or open a pull request.

CITATION

If you use this tool, please consider citing our publication

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

MisSER-0.0.6.tar.gz (18.2 kB view details)

Uploaded Source

Built Distribution

MisSER-0.0.6-py3-none-any.whl (20.8 kB view details)

Uploaded Python 3

File details

Details for the file MisSER-0.0.6.tar.gz.

File metadata

  • Download URL: MisSER-0.0.6.tar.gz
  • Upload date:
  • Size: 18.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.6.9

File hashes

Hashes for MisSER-0.0.6.tar.gz
Algorithm Hash digest
SHA256 f44c4effe2c212a492d922bfb20df87f78f3671d2c247385e10075a3d03deb0e
MD5 1e4dccc0550fbb0b14b9d58083cf3854
BLAKE2b-256 a694c9b38b7410917b9fbc01c15720c688461948d195d1a5ec25826a1c667a67

See more details on using hashes here.

File details

Details for the file MisSER-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: MisSER-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 20.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.6.9

File hashes

Hashes for MisSER-0.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 1b8d09f115c855a108994e492ebc6e0570de6f3ec7c6cf2287052a073ef95f8a
MD5 60595a070da4ce9c913f0d1fe1f08bec
BLAKE2b-256 f1f79c2e9583145c95d79db363ab09006b2b469928fedb971aa4e42f04eee0d8

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page