isoCirc: computational pipeline to identify high-confidence BSJs and full-length circRNA isoforms from isoCirc long-read data

Latest Release PyPI License Published in Nat. Commun. DOI GitHub Issues

What is isoCirc?

isoCirc is a long-read sequencing strategy coupled with an integrated computational pipeline to characterize full-length circRNA isoforms using rolling circle amplification (RCA) followed by long-read sequencing.

Installation

Dependencies

isoCirc depends on two open-source software packages: bedtools (>= v2.27.0) and minimap2 (>= 2.11). Please ensure that both are installed before running isoCirc.
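
Before installing, it can help to confirm that both tools are reachable. The following is a minimal sketch, not part of isoCirc: the helper name `find_missing_tools` is hypothetical, and it only checks that the executables are on PATH, not that their versions meet the minimums listed above.

```python
import shutil


def find_missing_tools(tools=("bedtools", "minimap2")):
    """Return the names of executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing dependencies: " + ", ".join(missing))
else:
    print("bedtools and minimap2 were found on PATH")
```

If a tool is installed outside PATH, its location can instead be passed to isocirc with --bedtools or --minimap2.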

Install isoCirc with pip

isoCirc is written in Python. Use pip to install it:

pip install isocirc            # first time installation
pip install isocirc --upgrade  # update to the latest version

Install isoCirc from source

Alternatively, you can install isoCirc from source:

git clone https://github.com/Xinglab/isoCirc.git
cd isoCirc/isoCirc_pipeline && pip install .

Getting started with the toy example in test_data

cd isoCirc/test_data
isocirc -t 1 read_toy.fa chr16_toy.fa chr16_toy.gtf chr16_circRNA_toy.bed output
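
When isoCirc is driven from a script, the same command can be assembled in Python. This is a sketch under assumptions: `build_isocirc_command` and `run_isocirc` are hypothetical helper names, and only the -t option and the five positional arguments documented in the usage text are used. Multiple circRNA annotation files are joined with ',' as the usage text describes.

```python
import subprocess


def build_isocirc_command(long_reads, ref_fa, anno_gtf, circ_annos, out_dir, threads=8):
    """Assemble the isocirc argument list in the documented positional order."""
    return [
        "isocirc", "-t", str(threads),
        long_reads, ref_fa, anno_gtf,
        ",".join(circ_annos),  # several annotation files become one comma-separated argument
        out_dir,
    ]


def run_isocirc(cmd):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


cmd = build_isocirc_command(
    "read_toy.fa", "chr16_toy.fa", "chr16_toy.gtf",
    ["chr16_circRNA_toy.bed"], "output", threads=1,
)
print(" ".join(cmd))
# run_isocirc(cmd)  # uncomment to execute; requires isocirc to be installed
```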

Detailed arguments:

usage: isocirc [-h] [-v] [-t THREADS] [--bedtools BEDTOOLS]
               [--minimap2 MINIMAP2] [--short-read short.fa/fq] [--lordec LORDEC]
               [--kmer KMER] [--solid SOLID] [--trf TRF] [--match MATCH]
               [--mismatch MISMATCH] [--indel INDEL] [--match-frac MATCH_FRAC]
               [--indel-frac INDEL_FRAC] [--min-score MIN_SCORE]
               [--max-period MAX_PERIOD] [--min-len MIN_LEN]
               [--min-copy MIN_COPY] [--min-frac MIN_FRAC]
               [--high-max-ratio HIGH_MAX_RATIO]
               [--high-min-ratio HIGH_MIN_RATIO]
               [--high-iden-ratio HIGH_IDEN_RATIO]
               [--high-repeat-ratio HIGH_REPEAT_RATIO]
               [--low-repeat-ratio LOW_REPEAT_RATIO]
               [--cano-motif {GT/AG,all}] [--bsj-xid BSJ_XID]
               [--key-bsj-xid KEY_BSJ_XID] [--min-circ-dis MIN_CIRC_DIS]
               [--rescue-low] [--fsj-xid FSJ_XID] [--key-fsj-xid KEY_FSJ_XID]
               [--Alu ALU] [--flank-len FLANK_LEN] [--all-repeat ALL_REPEAT]
               long.fa/fq ref.fa anno.gtf circRNA.bed/gtf out_dir

isocirc: circular RNA profiling and analysis using long-read sequencing

positional arguments:
  long.fa/fq            Long-read sequencing data generated with isoCirc
  ref.fa                Reference genome sequence file
  anno.gtf              Gene annotation file in GTF format
  circRNA.bed/gtf       circRNA database annotation file in BED or GTF format
                        Use ',' to separate multiple circRNA annotation files
  out_dir               Output directory for final result and temporary files

optional arguments:
  -h, --help            Show this help message and exit
  -v, --version         Show program's version number and exit

General options:
  -t THREADS, --threads THREADS
                        Number of threads to use (default: 8)
  --bedtools BEDTOOLS   Path to bedtools (default: bedtools)
  --minimap2 MINIMAP2   Path to minimap2 (default: minimap2)

Hybrid error-correction with short-read data (LoRDEC):
  --short-read short.fa/fq
                        Short-read data for error correction 
                        Use ',' to connect multiple or paired-end short-read data
                        (default: )
  --lordec LORDEC       Path to lordec-correct (default: lordec-correct)
  --kmer KMER           k-mer size (default: 21)
  --solid SOLID         Solid k-mer abundance threshold (default: 3)

Consensus calling with Tandem Repeats Finder (TRF):
  --trf TRF             Path to TRF program (default: trf409.legacylinux64)
  --match MATCH         Match score (default: 2)
  --mismatch MISMATCH   Mismatch penalty (default: 7)
  --indel INDEL         Indel penalty (default: 7)
  --match-frac MATCH_FRAC
                        Match probability (default: 80)
  --indel-frac INDEL_FRAC
                        Indel probability (default: 10)
  --min-score MIN_SCORE
                        Minimum alignment score to report (default: 100)
  --max-period MAX_PERIOD
                        Maximum period size to report (default: 2000)

Filtering and mapping of consensus sequences (minimap2):
  --min-len MIN_LEN     Minimum consensus length to keep (default: 30)
  --min-copy MIN_COPY   Minimum copy number of consensus to keep 
                        (default: 2.0)
  --min-frac MIN_FRAC   Minimum fraction of original long read to keep
                        (default: 0.0)
  --high-max-ratio HIGH_MAX_RATIO
                        Maximum mappedLen / consLen ratio for high-quality
                        alignment (default: 1.1)
  --high-min-ratio HIGH_MIN_RATIO
                        Minimum mappedLen / consLen ratio for high-quality
                        alignment (default: 0.9)
  --high-iden-ratio HIGH_IDEN_RATIO
                        Minimum identicalBases / consLen ratio for high-quality
                        alignment (default: 0.75)
  --high-repeat-ratio HIGH_REPEAT_RATIO
                        Maximum mappedLen / consLen ratio for high-quality
                        self-tandem consensus (default: 0.6)
  --low-repeat-ratio LOW_REPEAT_RATIO
                        Minimum mappedLen / consLen ratio for low-quality
                        self-tandem alignment (default: 1.9)

Identifying high-confidence BSJs and full-length circRNAs:
  --cano-motif {GT/AG,all}
                        Canonical back-splice motif (GT/AG or all three
                        motifs: GT/AG, GC/AG, AT/AC) (default: GT/AG)
  --bsj-xid BSJ_XID     Maximum allowed mis/ins/del for 20-bp exonic sequence
                        flanking BSJ (10-bp each side) (default: 1)
  --key-bsj-xid KEY_BSJ_XID
                        Maximum allowed mis/ins/del for 4-bp exonic sequence
                        flanking BSJ (2-bp each side) (default: 0)
  --min-circ-dis MIN_CIRC_DIS
                        Minimum distance between genomic coordinates of
                        two back-splice sites (default: 150)
  --rescue-low          Use high-mapping quality reads to rescue low-mapping
                        quality reads (default: False)
  --fsj-xid FSJ_XID     Maximum allowed mis/ins/del for 20-bp exonic sequence
                        flanking FSJ (10-bp each side) (default: 1)
  --key-fsj-xid KEY_FSJ_XID
                        Maximum allowed mis/ins/del for 4-bp exonic sequence
                        flanking FSJ (2-bp each side) (default: 0)

Other options:
  --Alu ALU             Alu repetitive element annotation in BED format
                        (default: )
  --flank-len FLANK_LEN
                        Length of upstream and downstream flanking sequence to
                        search for Alu (default: 500)
  --all-repeat ALL_REPEAT
                        All repetitive element annotation in BED format
                        (default: )
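
The three --high-*-ratio thresholds above can be read as one test on a consensus alignment. The sketch below illustrates that reading only: the function name is hypothetical, and combining the three conditions with a logical AND is an assumption, not a description of isoCirc's internal code. The defaults are the documented ones.

```python
def is_high_quality_alignment(mapped_len, identical_bases, cons_len,
                              high_min_ratio=0.9, high_max_ratio=1.1,
                              high_iden_ratio=0.75):
    """Check mappedLen / consLen and identicalBases / consLen against the thresholds."""
    if cons_len <= 0:
        return False
    mapped_ratio = mapped_len / cons_len
    iden_ratio = identical_bases / cons_len
    # Assumption: all three conditions must hold at the same time.
    return (high_min_ratio <= mapped_ratio <= high_max_ratio
            and iden_ratio >= high_iden_ratio)


print(is_high_quality_alignment(mapped_len=100, identical_bases=90, cons_len=100))
print(is_high_quality_alignment(mapped_len=130, identical_bases=120, cons_len=100))
```

The first call passes all three thresholds; the second fails because the mapped length is 1.3 times the consensus length, above the default maximum of 1.1.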

Input and output

Input files

isoCirc takes long reads as input (FASTA or FASTQ); each read contains multiple copies of a circRNA sequence.

It also requires a reference genome sequence, a gene annotation in GTF format, and a circRNA annotation in BED or GTF format.

Output files

isoCirc outputs three result files in a user-specified directory:

| No. | File name | Explanation |
|-----|-----------|-------------|
| 1 | isocirc.out | detailed information of each circRNA isoform with a high-confidence BSJ, in tabular format |
| 2 | isocirc.bed | bed12 format file of isocirc.out |
| 3 | isocirc_stats.out | basic summary statistics for isocirc.out |

1. isocirc.out

isocirc.out is a 35-column tabular file; each line represents one unique circRNA isoform that has a high-confidence BSJ:

| No. | Column name | Explanation |
|-----|-------------|-------------|
| 1 | isoformID | assigned isoform ID |
| 2 | chrom | chromosome ID |
| 3 | startCoor0based | start coordinate of circRNA, 0-based |
| 4 | endCoor | end coordinate of circRNA |
| 5 | geneStrand | gene strand (+/-) |
| 6 | geneID | gene ID |
| 7 | geneName | gene name |
| 8 | blockCount | number of blocks |
| 9 | blockSize | size of each block, separated by , |
| 10 | blockStarts | relative start coordinates of each block, separated by , (refer to the bed12 format for further details) |
| 11 | refMapLen | total length of circRNA |
| 12 | blockType | category of each block. E: exon, I: intron, N: intergenic |
| 13 | blockAnno | detailed annotation for each block, in the format "TransID:E1(100)+I(50)+E2(30)", where TransID is the ID of the corresponding transcript; E1 and E2 are the 1st and 2nd exons of the transcript; multiple blocks are separated by , and multiple transcripts of one block are separated by ; |
| 14 | isKnownSS | True if SS is catalogued in gene annotation, False if not, separated by , |
| 15 | isKnownFSJ | True if FSJ is catalogued in gene annotation, False if not, separated by , |
| 16 | canoFSJMotif | strand and canonical motifs of FSJs, e.g., -GT/AG, NA if FSJ is not canonical, separated by , |
| 17 | isHighFSJ | True if alignment around FSJ is high-quality, False if not, separated by , |
| 18 | isKnownExon | True if block is a catalogued exon in gene annotation, False if not, separated by , |
| 19 | isKnownBSJ | True if BSJ exists in circRNA annotation, False if not; multiple circRNA annotations are separated by , |
| 20 | isCanoBSJ | True if BSJ has canonical motif (GT/AG), False if not |
| 21 | canoBSJMotif | strand and canonical motif of BSJ, e.g., -GT/AG, NA if BSJ is not canonical |
| 22 | isFullLength | True if isoform is considered a full-length isoform, False if not |
| 23 | BSJCate | category of BSJs: FSM/NIC/NNC, see explanation below |
| 24 | FSJCate | category of FSJs: FSM/NIC/NNC |
| 25 | CDS | number of bases that are mapped to CDS region |
| 26 | UTR | number of bases that are mapped to UTR region |
| 27 | lincRNA | number of bases that are mapped to lincRNA region |
| 28 | antisense | number of bases that are mapped to antisense region |
| 29 | rRNA | number of bases that are mapped to rRNA region |
| 30 | Alu | number of bases that are mapped to Alu element; NA if Alu annotation is not provided |
| 31 | allRepeat | number of bases that are mapped to all repeat elements; NA if repeat annotation is not provided |
| 32 | upFlankAlu | flanking Alu element in upstream; NA if none or Alu annotation is not provided |
| 33 | downFlankAlu | flanking Alu element in downstream; NA if none or Alu annotation is not provided |
| 34 | readCount | number of reads that come from this circRNA isoform |
| 35 | readIDs | ID of reads that come from this circRNA isoform, separated by , |
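
The 35 columns can be loaded into dictionaries for downstream filtering. The sketch below rests on assumptions not stated above: that fields are tab-separated and that any header line starts with '#' or with the first column name. The helper `read_isocirc_out` is hypothetical, and the example row is synthetic, made up purely for illustration.

```python
import io

ISOCIRC_OUT_COLUMNS = [
    "isoformID", "chrom", "startCoor0based", "endCoor", "geneStrand", "geneID",
    "geneName", "blockCount", "blockSize", "blockStarts", "refMapLen", "blockType",
    "blockAnno", "isKnownSS", "isKnownFSJ", "canoFSJMotif", "isHighFSJ",
    "isKnownExon", "isKnownBSJ", "isCanoBSJ", "canoBSJMotif", "isFullLength",
    "BSJCate", "FSJCate", "CDS", "UTR", "lincRNA", "antisense", "rRNA", "Alu",
    "allRepeat", "upFlankAlu", "downFlankAlu", "readCount", "readIDs",
]


def read_isocirc_out(handle):
    """Yield one dict per isoform line, skipping blank and header-like lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#") or line.startswith("isoformID"):
            continue
        fields = line.split("\t")  # assumption: tab-separated columns
        if len(fields) != len(ISOCIRC_OUT_COLUMNS):
            raise ValueError(f"expected 35 columns, found {len(fields)}")
        yield dict(zip(ISOCIRC_OUT_COLUMNS, fields))


# Synthetic row, for illustration only (not real isoCirc output).
example_row = "\t".join([
    "isoform1", "chr16", "100", "400", "+", "geneA", "GENEA", "2", "100,50",
    "0,250", "150", "E,E", "NA", "True,True", "True", "+GT/AG", "True",
    "True,True", "False", "True", "+GT/AG", "True", "NIC", "FSM", "150", "0",
    "0", "0", "0", "NA", "NA", "NA", "NA", "3", "read1,read2,read3",
])
rows = list(read_isocirc_out(io.StringIO(example_row + "\n")))
full_length = [r for r in rows if r["isFullLength"] == "True"]
print(len(full_length), int(full_length[0]["readCount"]))
```

For a real run, `open("output/isocirc.out")` would replace the `io.StringIO` handle.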

2. isocirc.bed

isocirc.bed is a bed12 format file; each line represents one unique circRNA isoform from isocirc.out:

| No. | Column name | Explanation |
|-----|-------------|-------------|
| 1 | chrom | chromosome ID |
| 2 | startCoor0based | start coordinate of circRNA, 0-based |
| 3 | endCoor | end coordinate of circRNA |
| 4 | isoformName | name of the circRNA isoform |
| 5 | score | indicates how dark the peak will be displayed in the browser (0-1000), set as 0 by isoCirc |
| 6 | strand | +/- to denote strand |
| 7 | thickStart | the starting position at which the feature is drawn thickly, set as 0 by isoCirc |
| 8 | thickEnd | the ending position at which the feature is drawn thickly, set as 0 by isoCirc |
| 9 | itemRgb | an RGB value of the form R,G,B (e.g. 255,0,0), set as 0 by isoCirc |
| 10 | blockCount | number of blocks |
| 11 | blockSize | size of each block, separated by , |
| 12 | blockStarts | relative start coordinates of each block, separated by , (refer to the bed12 format for further details) |
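
Because blockStarts are relative to the start coordinate, absolute block coordinates must be computed. A minimal sketch following the standard bed12 convention (0-based, half-open intervals) is shown below; the function name `bed12_blocks` is hypothetical and the example line is synthetic.

```python
def bed12_blocks(line):
    """Return (chrom, start, end) for every block of one bed12 line."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    chrom_start = int(fields[1])
    # Trailing commas are allowed in bed12 lists, so empty items are dropped.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    return [
        (chrom, chrom_start + rel, chrom_start + rel + size)
        for rel, size in zip(starts, sizes)
    ]


# Synthetic line: two blocks of 100 bp and 50 bp within chr16:100-400.
example = "chr16\t100\t400\tisoform1\t0\t+\t0\t0\t0\t2\t100,50\t0,250"
print(bed12_blocks(example))
# [('chr16', 100, 200), ('chr16', 350, 400)]
```

The end of the last block equals the endCoor column, which is a quick sanity check on any bed12 record.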

3. isocirc_stats.out

isocirc_stats.out contains 27 basic stats numbers for isocirc.out:

| No. | Name | Explanation |
|-----|------|-------------|
| 1 | Total reads | Number of raw reads in sample |
| 2 | Total reads with cons | Number of reads with consensus sequence called |
| 3 | Total mappable reads with cons | Number of reads with consensus sequence called, mappable to genome |
| 4 | Total reads with candidate BSJ | Number of reads with consensus sequence called, mappable to genome, and with BSJs ("candidate BSJs") |
| 5 | Total candidate BSJs | Number of candidate BSJs (circRNA species) |
| 6 | Total known candidate BSJs | Number of candidate BSJs reported in existing circRNA BSJ database (circBase / MiOncoCirc) |
| 7 | Total reads with high BSJs | Number of reads with consensus sequence called, mappable to genome, and with high-confidence BSJs (based on additional inspection of alignment around BSJs) |
| 8 | Total high BSJs | Number of high-confidence BSJs |
| 9 | Total known high BSJs | Number of high-confidence BSJs that are known |
| 10 | Total isoforms with high BSJs | Number of circRNA isoforms with high-confidence BSJs |
| 11 | Total isoforms with high BSJs high FSJs | Number of circRNA isoforms with high-confidence BSJs, and all FSJs are high-confidence (canonical, high-quality alignment around internal splice sites) |
| 12 | Total isoforms with high BSJ known SSs | Number of circRNA isoforms with high-confidence BSJs, and all SS are known (based on existing transcript GTF annotations for splice sites in linear RNA) |
| 13 | Total isoforms with high BSJs high FSJs known SSs | Number of circRNA isoforms with high-confidence BSJs, all FSJs are high-confidence, and all SS are known |
| 14 | Total full-length isoforms | Number of circRNA isoforms with high-confidence BSJs, and all FSJs are high-confidence or all SS are known |
| 15 | Total reads for full-length isoforms | Number of reads for circRNA isoforms with high-confidence BSJs, and all FSJs are high-confidence or all SS are known |
| 16 | Total full-length isoforms with FSM BSJ | Number of full-length circRNA isoforms with FSM BSJs |
| 17 | Total reads for full-length isoforms with FSM BSJ | Number of reads for full-length circRNA isoforms with FSM BSJs |
| 18 | Total full-length isoforms with NIC BSJ | Number of full-length circRNA isoforms with NIC BSJs |
| 19 | Total reads for full-length isoforms with NIC BSJ | Number of reads for full-length circRNA isoforms with NIC BSJs |
| 20 | Total full-length isoforms with NNC BSJ | Number of full-length circRNA isoforms with NNC BSJs |
| 21 | Total reads for full-length isoforms with NNC BSJ | Number of reads for full-length circRNA isoforms with NNC BSJs |
| 22 | Total full-length isoforms with FSM FSJ | Number of full-length circRNA isoforms with FSM FSJs |
| 23 | Total reads for full-length isoforms with FSM FSJ | Number of reads for full-length circRNA isoforms with FSM FSJs |
| 24 | Total full-length isoforms with NIC FSJ | Number of full-length circRNA isoforms with NIC FSJs |
| 25 | Total reads for full-length isoforms with NIC FSJ | Number of reads for full-length circRNA isoforms with NIC FSJs |
| 26 | Total full-length isoforms with NNC FSJ | Number of full-length circRNA isoforms with NNC FSJs |
| 27 | Total reads for full-length isoforms with NNC FSJ | Number of reads for full-length circRNA isoforms with NNC FSJs |

  • BSJ: Back-Splice Junction
  • FSJ: Forward-Splice Junction
  • FSS: Forward-Splice Site
  • SS: Splice Site
  • cons: consensus sequence
  • cano: canonical
  • high: high-confidence (canonical and high-quality alignment around FSJ/BSJ)
  • FSM: Full Splice Match
  • NIC: Novel In Catalog
  • NNC: Novel Not in Catalog
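
The full-length definition used in the statistics (a high-confidence BSJ, plus either all FSJs high-confidence or all SS known) can be expressed over the comma-separated isHighFSJ and isKnownSS columns of isocirc.out. This is a sketch of that stated rule, not isoCirc's own code: the function names are hypothetical, and skipping empty or NA entries is an assumption.

```python
def parse_flags(field):
    """Turn a comma-separated True/False column into a list of booleans."""
    # Assumption: empty items and 'NA' carry no information and are skipped.
    return [item == "True" for item in field.split(",") if item not in ("", "NA")]


def is_full_length(is_high_fsj, is_known_ss):
    """Apply the rule to an isoform that already has a high-confidence BSJ."""
    all_fsj_high = all(parse_flags(is_high_fsj))
    all_ss_known = all(parse_flags(is_known_ss))
    return all_fsj_high or all_ss_known


print(is_full_length("True,True", "True,False,True,True"))  # every FSJ is high-confidence
print(is_full_length("True,False", "True,False"))           # neither condition holds
```

In practice the isFullLength column already records this decision; the sketch only shows how the two conditions combine.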

Circular alignment of isoCirc long read

With the result file generated by isocirc, we can visualize the circular alignment of full-length isoCirc reads. Let's use the toy example in test_data again:

isocircPlot ./read_toy.fa ./chr16_toy.fa ./chr16_toy.gtf ./output/isocirc.bed ./isocircPlot_toy.list SJ ./output

A .png file will be generated in the output folder, showing how the isoCirc long read aligns to the reference genome multiple times.

Contact

Yan Gao gaoy286@mail.sysu.edu.cn

Yi Xing yi.xing@pennmedicine.upenn.edu

github issues
