
Compute Genomic & Transcriptomic segments

Project description

This program is designed to work with Python 2.7, Python 3.2+ and PyPy. It will install the following libraries:

In addition, you can install the following library to display a progress bar and an estimate of the computation time:

Quick install

pip install GTsegments


usage: [-h] [--genome_type {gbk,tsv,seq}] [--graph_type {gexf,list}]
              [-min INT] [-max INT] [-d THRESHOLD] [--no_filter] [-o FILE]
              [-no_dom] [-m | --no_gene_list | --sgs_like_headers] [-q]
              COEXP_GRAPH [GENOME [GENOME ...]]

Compute the list of GTsegments from a genome and a coexpression network.

example: -min 2 -max 50 -d 0.6 coexp_graph.gexf genome.gbk

positional arguments:
  COEXP_GRAPH           Coexpression graph
  GENOME                genome file(s) containing genomic organization of

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           Quiet mode: display only critical errors

File type:
  --genome_type {gbk,tsv,seq}
                        Type of the genome file(s) (default: gbk)
  --graph_type {gexf,list}
                        Type of the coexpression graph file (default: gexf)

GTsegments size:
  -min INT, --min_size INT
                        Minimum length of a GTsegment (default: 2)
  -max INT, --max_size INT
                        Maximum length of a GTsegment (default: maximum

Density option:
                        Select GTsegments with a genomic density ≥ THRESHOLD
                        in ]0,1] (default: 0.6)
  --no_filter           Do not apply density filtering

Output options:
  -o FILE, --output FILE
                        Output file name
  -no_dom, --no_domination
                        Keep all the GTsegments instead of the dominant ones
  -m, --matrix          Output the density matrix instead of the listing of
  --no_gene_list        Do not add the gene list column in the listing of
  --sgs_like_headers    Produce a listing of GTsegments with headers from
                        listing of SGS


The program requires two types of data: an unweighted coexpression network and one or more genome files describing the genomic organisation of one or several organisms. Missing genes and unmatched genes in the coexpression graph are allowed.

Coexpression graph

The program accepts coexpression graphs either in the .gexf file format or as a text file listing nodes and edges.

The .gexf format

When using the .gexf format (option --graph_type gexf), the string in the label field of each node is taken as the id of the gene associated with that node.

The listing format

The listing format (option --graph_type list) is quite simple. It is a list of nodes (optional) and edges describing the coexpression network. Only one node or edge is allowed per line. Nodes are gene ids, and edges are pairs of nodes separated by a blank character (tabulation, space, etc.).

Comments are introduced by a # at the beginning of a line. A # anywhere else is not treated as a comment.
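
The rules above can be sketched as a small parser. This is an illustrative reading of the format, not the program's own code: the function name parse_listing is hypothetical, edges are assumed to be undirected, and skipping empty lines is an assumption.

```python
def parse_listing(lines):
    """Parse a coexpression listing into a set of nodes and a set of edges."""
    nodes, edges = set(), set()
    for raw in lines:
        line = raw.rstrip("\n")
        # Only a '#' at the very beginning of the line starts a comment.
        if line.startswith("#"):
            continue
        fields = line.split()  # any run of blank characters separates gene ids
        if not fields:
            continue  # assumption: empty lines are skipped
        if len(fields) == 1:
            nodes.add(fields[0])  # a lone gene id declares a node
        elif len(fields) == 2:
            a, b = fields
            nodes.update((a, b))
            edges.add(frozenset((a, b)))  # assumption: undirected edge
        else:
            raise ValueError("expected one node or one edge per line: %r" % line)
    return nodes, edges


# Hypothetical input mixing a comment, two edges and a lone node.
example = "# a comment\n2\t4\n4 5\n7\n"
nodes, edges = parse_listing(example.splitlines())
```

Storing each edge as a frozenset makes "2 4" and "4 2" the same edge, which fits an unweighted coexpression network.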


The following file graph.txt is a list of nodes and edges. It is used as example data later in this documentation.

# a line that begins with # is a comment (but a # elsewhere is not considered a comment)
# nodes are not mandatory but can exist in the graph file
2       4
4       5
7       8
6       9
6       10
9       10
12      16
14      15
14      16
14      18
11      17
17      23
25      1
# The node 26 does not exist in the genome (commented out) and will be ignored
25      26


The program accepts genome files in the GenBank file format (--genome_type gbk), files listing the genomic information (--genome_type tsv), or plain text files that each give the sequence of genes of one chromosome (--genome_type seq).

The GenBank format

The program can use GenBank files as input under the following restrictions:

  1. considered genes are only CDS features, and

  2. each CDS must have a field locus_tag which will be the gene id.

The .tsv format

As an alternative to GenBank files, which are not always easy to manipulate, you can use a .tsv file to describe one or several genomes. The first line of the .tsv file must contain the column names (i.e. the header), and each following line must describe one gene.

The header must contain at least the following column names:

chromosome_id       gene_id left_end_position       right_end_position


  • chromosome_id is the id of the chromosome in which the gene exists,

  • gene_id is the id of the gene,

  • left_end_position is the left end position of the gene (in number of nucleotides) when reading the main strand,

  • right_end_position is the right end position of the gene (in number of nucleotides) when reading the main strand.
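
As an illustration of the .tsv layout described above, the following sketch reads such a file and returns the gene order of each chromosome. The function name read_genome_tsv and the sample data are hypothetical, and ordering genes by left_end_position is an assumption about how positions translate into a gene order.

```python
import csv
import io
from collections import defaultdict

REQUIRED = ("chromosome_id", "gene_id", "left_end_position", "right_end_position")


def read_genome_tsv(handle):
    """Return {chromosome_id: [gene_id, ...]} with genes sorted by position."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    chromosomes = defaultdict(list)
    for row in reader:
        chromosomes[row["chromosome_id"]].append(
            (int(row["left_end_position"]), int(row["right_end_position"]), row["gene_id"])
        )
    # Assumption: the gene order on a chromosome follows the left end position.
    return {chrom: [gene for _, _, gene in sorted(genes)]
            for chrom, genes in chromosomes.items()}


# Hypothetical data: two chromosomes, genes deliberately out of order.
sample = (
    "chromosome_id\tgene_id\tleft_end_position\tright_end_position\n"
    "chr1\tgeneB\t500\t900\n"
    "chr1\tgeneA\t10\t400\n"
    "chr2\tgeneC\t1\t300\n"
)
genome = read_genome_tsv(io.StringIO(sample))
```

Extra columns are ignored by this sketch, which matches the requirement that the header contains at least the four names listed.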

The sequence format

The sequence format is simply a text file with one gene id per line, where the genes are sorted by ascending position in the chromosome. If there are multiple chromosomes, one file per chromosome is required.
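
A minimal sketch of reading a sequence file follows. The helper names read_sequence_lines and gene_positions are hypothetical. Treating lines that begin with # as comments follows the seq.txt example in this documentation; skipping empty lines is an assumption.

```python
def read_sequence_lines(lines):
    """Return the ordered list of gene ids found in a sequence file."""
    genes = []
    for raw in lines:
        if raw.startswith("#"):
            continue  # comment line, e.g. a commented-out gene
        gene = raw.strip()
        if gene:  # assumption: empty lines are skipped
            genes.append(gene)
    return genes


def gene_positions(genes):
    """Map each gene id to its 1-based position (the i-th gene has index i)."""
    return {gene: i for i, gene in enumerate(genes, start=1)}


# Hypothetical file content with one commented-out gene.
content = ["# a comment", "g1", "g2", "#g3", "g4"]
genes = read_sequence_lines(content)
positions = gene_positions(genes)
```

A commented-out gene does not occupy a position, so g4 is the third gene of this hypothetical chromosome.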


In the rest of this documentation, we will use the following seq.txt file as example genome data.

# a line that begins with # is a comment (but a # elsewhere is not considered a comment)
#26 <- this gene will be ignored because of the comment


Default output

By default, the program outputs .tsv formatted text. It can be written to a file with the > output_file redirection or the -o/--output option.

The first line of the output contains the name of each column and is called the header. The following lines are the data, where each line is a GTsegment. Each GTsegment is unique and appears once in the listing.

The names of the columns in header are the following:

chromosome  start   end     length  active_genes    density list_of_active_genes
  • chromosome contains the id of the chromosome in which the GTsegment appears. When the input genome files are sequence files (i.e. --genome_type seq), the chromosome id is the filename.

  • start contains the position of the first gene (i.e. the starting gene) of the GTsegment. The position of a gene is its index in the chromosome (i.e. the i-th gene has the index i).

  • end contains the position of the last gene (i.e. the ending gene) of a GTsegment.

  • length contains the length of the GTsegment, which is the number of genes in the GTsegment (end - start + 1, modulo the number of genes in the chromosome).

  • active_genes column contains the number of genes of the GTsegment that are coexpressed with the starting and ending genes.

  • density column contains the genomic density of a GTsegment, which is the ratio between active_genes and length.

  • list_of_active_genes column contains the listing of active genes of the GTsegment (i.e. genes in the GTsegment that are coexpressed with the starting and ending genes). This column can be disabled with the --no_gene_list option, which can be useful when querying large GTsegments (see the parameter -max/--max_size).
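
The length and density columns can be illustrated with a few lines of arithmetic. This is a sketch rather than the program's implementation: the function names are hypothetical, the chromosome size of 25 genes is an assumption made for the example, and the length is written as (end - start) mod n + 1, a variant of the formula above that gives the same result but returns n instead of 0 for a segment covering the whole chromosome.

```python
def segment_length(start, end, n_genes):
    """Number of genes from start to end (1-based), wrapping around the chromosome."""
    return (end - start) % n_genes + 1


def density(active_genes, length):
    """Genomic density: ratio between active genes and segment length."""
    return active_genes / float(length)


def passes_filter(active_genes, length, threshold=0.6):
    """Density filtering with the default threshold of the -d option."""
    return density(active_genes, length) >= threshold


N_GENES = 25  # hypothetical chromosome size, assumed for this example

plain = segment_length(2, 5, N_GENES)     # genes 2, 3, 4, 5
wrapped = segment_length(25, 1, N_GENES)  # genes 25 and 1, across the origin
d = density(3, plain)                     # 3 active genes out of 4
```

With these values, a segment from gene 2 to gene 5 has length 4 and density 0.75, and a segment from gene 25 to gene 1 has length 2 because it wraps around the end of the chromosome.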


The following command … graph.txt seq.txt --graph_type list --genome_type seq

will produce the following output:

chromosome  start   end     length  active_genes    density list_of_active_genes
seq.txt     2       5       4       3       0.75    2 4 5
seq.txt     4       5       2       2       1.0     4 5
seq.txt     6       10      5       3       0.6     6 9 10
seq.txt     7       8       2       2       1.0     7 8
seq.txt     9       10      2       2       1.0     9 10
seq.txt     12      16      5       4       0.8     12 14 15 16
seq.txt     12      18      7       5       0.714285714286  12 14 15 16 18
seq.txt     14      16      3       3       1.0     14 15 16
seq.txt     14      18      5       4       0.8     14 15 16 18
seq.txt     25      1       2       2       1.0     25 1

SGS like output

The option --sgs_like_headers produces a listing of GTsegments that is compatible with the outputs produced by sgs-utils.

Matrix output

When the --matrix option is chosen, the output is not the previous listing but a concatenation of density matrices in .csv format, where cells are separated by ;. The row and column indexes are the positions of the genes on the chromosomes (i.e. the i-th gene of a chromosome has the index i in the rows and columns of the corresponding matrix). Because density matrices are square and appear in the same order as the chromosomes given as input, distinct matrices can be separated from one another.
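
Because each matrix is square, the concatenated output can be split by reading as many rows as the first row of a matrix has cells. The sketch below shows one way to do this. It assumes that the output contains only numeric cells, with no header row or index column, which this documentation does not confirm. The function name split_matrices and the sample values are hypothetical.

```python
def split_matrices(lines, sep=";"):
    """Split concatenated square matrices into a list of lists of float rows."""
    rows = [line.strip().split(sep) for line in lines if line.strip()]
    matrices = []
    i = 0
    while i < len(rows):
        size = len(rows[i])  # a square matrix has as many rows as columns
        block = rows[i:i + size]
        if len(block) != size or any(len(r) != size for r in block):
            raise ValueError("matrix starting at row %d is not square" % i)
        matrices.append([[float(cell) for cell in r] for r in block])
        i += size
    return matrices


# Hypothetical output: a 2x2 matrix followed by a 3x3 matrix.
text = "1;0.5\n0.5;1\n1;0;0\n0;1;0\n0;0;1\n"
mats = split_matrices(text.splitlines())
sizes = [len(m) for m in mats]
```

The matrices come back in input order, so the k-th matrix corresponds to the k-th chromosome given on the command line.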


The example with the matrix option… graph.txt seq.txt --graph_type list --genome_type seq --matrix

will produce the following output:



This work was supported by grants Fondap 15090007, Basal program PFB-03 CMM, IntegrativeBioChile INRIA Assoc. Team and CIRIC-INRIA Chile (line Natural Resources).
