Skip to main content

DiviSSR is a DNA tandem repeat identification tool

Project description

DiviSSR

Build License

DiviSSR is a DNA tandem repeat identification tool. Tandem repeats are important genomic sequences which have functional and evolutionary significance.

DiviSSR is scripted in C++.

Working principle

DiviSSR captures a mathematical property exhibited by binary number representations of DNA tandem repeats. In the binary number or the 2-bit format of DNA sequences, nucleotides are represented by 4 different combinations of 2 binary digits (bits). The binary number representations of DNA tandem repeats conform to a unique division rule. DiviSSR scans the genome in windows, converts each window sequence to a binary number and checks if the number qualifies the division rule. The detailed explanation of the algorithm is below and an overview of the method is depicted in the figure below.

A) 2-bit representation for each nucleotide. B) Example repeat sequence of ACT motif with 2 complete units and a partial suffix of 2 bases. C) Binary representation of the repeat sequence generates a number denoted by S. A divisor (D) is built based on the length of the repeat, the motif and the partial suffix. The division of S with D yields a quotient which is equal to binary representation of the motif and the remainder equals binary representation of partial suffix.

Installation

$ pip install divissr

Usage

The help message and available options can be accessed using

$ divissr -h 

which gives the following output

usage: divissr [-h] -i <FILE> [-o <FILE>] [-m <INT>] [-M <INT>] [-l <INT>] [-a] [-g <FILE>]
                 [--anno-format <STR>] [--gene-key <STR>] [--up-promoter <INT>]
                 [--down-promoter <INT>]

Required arguments:
  -i, --input	<FILE>	Input sequence file. Fasta format.

Optional arguments:
  -o, --output			<FILE>	Output file name. Default: Input file name + _divissr.tsv
  -m, --min-motif-size	<INT>	Minimum size of a repeat motif in bp. Default: 1
  -M, --max-motif-size	<INT>	Maximum size of a repeat motif in bp. Default: 6
  -l, --min-length		<INT>	Cutoff repeat length. Default: 2*M.
 								Should at least be twice of maximum motif size.
  -a, --analyse					Generate a summary HTML report.

Compound repeat arguments:
  --compound            Report compound repeats. The output of compound repeats is a separate file with the suffix ".compound".
  -d <INT>, --comp-dist <INT>
                        Maximum distance between individual repeats of compound repeat. Use negative to denote overlap. Default: 0

Annotation arguments:
  -g, --annotate		<FILE>	Genomic feature file to annotate repeats w.r.t genes.
  								Both GFF and GTF can be processed.
  --anno-format			<STR>   Format of genomic feature file. 
  								Valid inputs: GFF, GTF. Default: GFF
  --gene-key			<STR>   Attribute used as unique identifier for gene name.
  								The default identifier is "gene". 
  --up-promoter			<INT>   Upstream distance(bp) from TSS to be considered as
  								promoter region. Default: 1000
  --down-promoter		<INT>   Downstream distance(bp) from TSS to be considered as
  								promoter region. Default: 1000

-i or --input

Expects: STRING (to be used as filename)
Default: None
This is the only required argument for the program. The input file must be a valid FASTA file.

-o or --output

Expects: STRING (to be used as filename)
Default: Input Filename + _divissr.tsv (see below)
If this option is not provided, the default output filename will be the same as the input filename, with its extension replaced with '_divissr.tsv'. For example, if the input filename is my_seq.fa, the default output filename will be my_seq.fa_divissr.tsv. If the input filename does not have any extension, _divissr.tsv will be appended to the filename. Please note that even in the case of no identified SSRs, the output file is still created (therefore overwriting any previous file of the same name) but with no content in the file.

Output for fasta

The output is a tab-delimited file, with one SSR record per line. The output columns follow the BED format. The details of the columns are given below:

S.No Column Description
1 Chromosome Chromosome or Sequence Name as specified by the first word in the FASTA header
2 Repeat Start 0-based start position of SSR in the Chromosome
3 Repeat Stop End position of SSR in the Chromosome
4 Repeat Class Class of repeat as grouped by their cyclical variations
5 Repeat Length Total length of identified repeat in nt
6 Repeat Strand Strand of SSR based on their cyclical variation
7 Motif Number Number of times the base motif is repeated
8 Actual Repeat Starting sequence of the SSR irrespective of Repeat class and strand

An example output showing some of the largest repeats from Drosophila melanogaster is given below

X       22012826  22014795  ACTGGG  1969    -       328     TCCCAG
2RHet   591337    591966    AATACT  629     -       104     ATTAGT
4       1042143   1042690   AAATAT  547     +       91      AAATAT
2RHet   598244    598789    AATACT  545     -       90      AGTATT
XHet    122       663       AGAT    541     +       135     GATA
X       22422335  22422827  AGAT    492     +       123     GATA
3R      975265    975710    AAAT    445     -       111     TTAT
X       15442288  15442724  ACAGAT  436     +       72      ACAGAT
2L      22086818  22087152  AATACT  334     -       55      TATTAG
YHet    137144    137466    AAGAC   322     -       64      CTTGT

-m or --min-motif-size

Expects: INTEGER
Default: 1
Minimum length of motifs to be considered. By default, divissr ignores redundant motifs. For example, a stretch of 12 A's is considered a monomer repeat of 12 A's rather than a dimer repeat of 6 AA's.

-M or --max-motif-size

Expects: INTEGER
Default: 6
Maximum length of motifs to be considered. Setting a large value of -M has a non-trivial effect on both the runtime and memory usage of divissr.

-l or --min-length

Expects: INTEGER
Default: 2 * M
Minimum length cut-off to be considered when finding an SSR. The same cut-off will apply for SSRs of all motif lengths, even if the motif length is not a divisor of this value. In such cases, SSRs that end with a partial motif are also picked if they pass the length cut-off. This value should be at least twice of the maximum motif size.

-a or --analyze

Expects: None
Default: False
In addition to the default tab-separated output, DiviSSR can also generate a fully interactive HTML report for easy downstream analysis of the repeat data. The filename will be the same prefix as that of the main output. For example, if the input filename was my_seq.fa, the analysis report will be my_seq_divissr.html. An example HTML report, generated from the repeat data of Homo sapiens (build hg19), can be accessed here (Right click -> Save As).

--compound

Expects: None
Default: False
This is flag which when set to true reports all compound repeats. Compound repeats are repeats which are either overlapping or separated by a small gap. Compound repeats are reported in a separate file with the extension ".compound". The maximum distance between two individual repeats of a compound repeat can be set using the option -d. The compoud repeat has four columns with first four denoting the sequence, start, end and length of the repeat. The fifth column denotes the repeat classes of individual repeats of the compound repeat and the number of different cyclical variations that have occured as compound repeat. The sixth column is the strand of the individual repeats. The seventh column is denotes the actual motifs of individual repeats, repeat length and the distance between the individual repeats. Below is an example output reporting compound repeats.

chr1	94808	94847	(AAAAG)1(A)1	39	-|-	(CTTTT)20|D0|(T)23
chr1	113871	113896	(AGAGGG)1(AAAGAG)1	25	-|-	(CCCTCT)13|D0|(TCTCTT)12
chr1	130955	130977	(AAAAAC)1(AAAAC)1	22	+|+	(CAAAAA)12|D0|(AAAAC)13
chr1	147433	147475	(AAC)1(AAAAAC)1(AAC)1	42	+|+|+	(CAA)21|D0|(AACAAA)17|D-2|(AAC)14
chr1	156944	156967	(AAAAAT)2	23	+|+	(TAAAAA)12|D0|(AAAAAT)14
chr1	174902	174957	(A)1(AG)1(AAAG)1	55	+|+|+	(A)31|D0|(AG)13|D-2|(AGAA)15
chr1	180444	180472	(ACCCTG)1(AACCCT)1	28	+|+	(CCCTGA)16|D0|(TAACCC)12
chr1	180504	180537	(ACCCTC)1(AACCCT)1	33	+|+	(CCTCAC)20|D0|(CCCTAA)15
chr1	180552	180590	(AACCCT)2(AACCC)1	38	+|+|+	(AACCCT)14|D0|(AACCCT)14|D-2|(AACCC)12
chr1	180748	180912	(AACCCT)4	164	+|+|+|+	(CCCTAA)55|D0|(ACCCTA)44|D-2|(AACCCT)21|D-7|(AACCCT)44

-d or --comp-dist

Expects: INTEGER
Default: 0
Maximum distance between two individual repeats of a compound repeat. To report repeats overlapping by X bp use negative numbers i.e., -X. The default is 0bp.

-g or --annotate

Expects: FILE
Default: None
Input a genomic feature file to annotate the repeats in the genomic context. DiviSSR accepts both GFF and GTF format genomic feature files. Each repeat is annotated w.r.t the closest gene and classified either as Genic, Exonic, Intronic and Intergenic according to the position of the repeat. Besides this, the repeat is also checked if it falls in the promoter region of the gene. Annotation adds 7 columns to the default divissr output which already consist 8 columns.

S.No Column Description
9 Gene name Name of the closest gene
10 Gene Start Start position of gene in the Chromosome
11 Gene Stop End position of gene in the Chromosome
12 Strand The strand orientation of the gene
13 Genomic annotation Annotation of the repeat w.r.t to the gene. Possible annotations are {Genic, Exonic, Intronic, Intergenic}
14 Promoter annotation If repeat falls in the promoter region of the closest gene. The default promoter region is 1Kb upstream and downstream of TSS.
15 Distance from TSS Distance of the repeat from the TSS of the gene.

--anno-format

Expects: STRING
Default: GFF
Option to specify the format of the input genomic feature file. Accepted inputs are GFF or GTF. More details about the GFF and GTF formats can be found here.

--gene-key

Expects: STRING
Default: gene
The attribute key used for the name of the gene in the GFF/GTF file. In the below example GFF file, we have the location of a gene and it's mRNA and exon locations. The last column of the file specifies attributes associated with each feature, like ID, Parent, gene etc. DiviSSR uses on of the attribute to identify the gene and also it's exons. In th below example the key "gene" can be used to identify gene and the exons of the gene as they have the same gene name. Please check your GFF/GTF file for a robust attribute key which can identify all genes and their corresponding exons. We are actively working on better annotation where we can identify genes and their exons based on the ID and Parent.

# Sample GFF
NC_004354.4	RefSeq	gene	124370	126714	.	-	.	ID=gene1;Name=CG17636;gbkey=Gene;gene=CG17636;gene_biotype=protein_coding;gene_synonym=DmelCG17636,EG:23E12.1;
NC_004354.4	RefSeq	mRNA	124370	126714	.	-	.	ID=rna1;Parent=gene1;Name=NM_001103384.3;gbkey=mRNA;gene=CG17636;transcript_id=NM_001103384.3
NC_004354.4	RefSeq	exon	126626	126714	.	-	.	ID=id13;Parent=rna1;gbkey=mRNA;gene=CG17636;transcript_id=NM_001103384.3
NC_004354.4	RefSeq	exon	125495	126259	.	-	.	ID=id14;Parent=rna1;gbkey=mRNA;gene=CG17636;transcript_id=NM_001103384.3

--up-promoter

Expects: INT
Default: 1000
Upstream distance(bp) from the TSS of the gene to be considered as promoter region. Default 1000.

--down-promoter

Expects: INT
Default: 1000
Downstream distance(bp) from the TSS of the gene to be considered as promoter region. Default 1000.

Change log

v 0.1.1

v 0.1.2

  • Solved compound repeat error for gzipped fasta input
  1. Handling error of non ACGTN nucleotides.

Contact

For queries or suggestions, please contact:

Akshay Kumar Avvaru - avvaru@ccmb.res.in
Divya Tej Sowpati - tej@ccmb.res.in

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

divissr-0.1.2.tar.gz (1.4 MB view hashes)

Uploaded Source

Built Distribution

divissr-0.1.2-py3-none-any.whl (1.4 MB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page