Depict microbial diversity via a partitioned pangenome graph

PPanGGOLiN : Depicting microbial species diversity via a Partitioned PanGenome Graph Of Linked Neighbors
=========================================================================================

.. image:: images/logo.png

This tool compiles the genomic content of a species (A), also called a pangenome. It relies on a graph approach to model pangenomes, in which nodes represent families of homologous genes (B and C, not included in the pipeline) and edges represent chromosomal neighborhood information. The approach thus takes both graph topology (D.a) and gene occurrences (D.b) into account to classify gene families into three partitions (*persistent genome*, *shell genome* and *cloud genome*), yielding what we call Partitioned Pangenome Graphs (F). More precisely, the method relies on an Expectation/Maximization algorithm based on a Bernoulli Mixture Model (E.a) coupled with a Markov Random Field (E.b).

Partitions:

1) Persistent genome: equivalent to a relaxed core genome (genes conserved in all but a few genomes);
2) Shell genome: genes with intermediate frequencies, corresponding to moderately conserved genes potentially associated with environmental adaptation capabilities;
3) Cloud genome: genes found at a very low frequency.

.. image:: images/workflow.png

A minimum of 5 genomes is generally required to perform a pangenomics analysis using the traditional *core genome*/*accessory genome* paradigm. With the statistical approach presented here, we advise using at least 15 genomes that show genomic variations (and not only genetic ones) to obtain robust results.

Installation
============================

PPanGGOLiN can be easily installed via:

.. code:: bash

    pip install ppanggolin

GCC (>=3.0) is required, as well as Python 3 and the following modules: "networkx" (>=2.00), "ordered-set", "numpy", "scipy", "tqdm" and "python-highcharts".

Optionally, to draw illustrative plots, R is required together with the following packages: "ggplot2", "ggrepel" (latest version), "reshape2", "minpack.lm" and "data.table".

Quick usage
============================

The minimal command is:

.. code:: bash

    ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE

Input formats
---------------------------
The tool requires two files.

1. A file ORGANISMS_FILE summarizing the information about the organisms.
This is a tab-delimited file structured as follows:

1. The first column is the organism name. It must be unique and cannot contain reserved words (see the section on reserved words).
2. The second column is the path to the associated gff3 file (relative or absolute). Genome sequences are not required in the gff files. Only CoDing Sequence (CDS) features are taken into account; each one must carry an "ID" attribute (mandatory) and may carry "Name" and "product" attributes (both optional).
3. (optional) Further columns are the IDs of the contigs in the gff file that are both circular and perfectly assembled. In this case, it is mandatory to provide the size of these contigs in the gff file, either by adding a "region" feature with the correct ID attribute or by using a '##sequence-region' pragma (as Prokka does).

Example of ORGANISMS_FILE:
::

Escherichia_coli_042__E._coli_1 gff3/ESCO.1017.00091.gff ESCO.1017.00091.0001 ESCO.1017.00091.0002
Escherichia_coli_1303__E._coli_1 gff3/ESCO.1017.00171.gff ESCO.1017.00171.0001 ESCO.1017.00171.0002 ESCO.1017.00171.0003 ESCO.1017.00171.0004
Escherichia_coli_536__E._coli_1 gff3/ESCO.1017.00005.gff ESCO.1017.00005.0001
Escherichia_coli_55989__E._coli_1 gff3/ESCO.1017.00015.gff ESCO.1017.00015.0001
Escherichia_coli_ABU_83972__E._coli_1 gff3/ESCO.1017.00092.gff ESCO.1017.00092.0001 ESCO.1017.00092.0002
Escherichia_coli_ACN001__E._coli_1 gff3/ESCO.1017.00061.gff ESCO.1017.00061.0001
Escherichia_coli_APEC_IMT5155__E._coli_1 gff3/ESCO.1017.00152.gff ESCO.1017.00152.0001 ESCO.1017.00152.0002 ESCO.1017.00152.0003
Escherichia_coli_APEC_O1__E._coli_1 gff3/ESCO.1017.00137.gff ESCO.1017.00137.0001 ESCO.1017.00137.0002 ESCO.1017.00137.0003
Escherichia_coli_APEC_O78__E._coli_1 gff3/ESCO.1017.00024.gff ESCO.1017.00024.0001
Escherichia_coli_ATCC_25922__E._coli_1 gff3/ESCO.1017.00151.gff ESCO.1017.00151.0001 ESCO.1017.00151.0002
...
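
The column layout described above can be checked before running the tool. The following is a minimal sketch and not part of PPanGGOLiN: the function name ``parse_organisms_file`` is hypothetical, and it only assumes the tab-delimited layout given in this section (name, gff3 path, then optional circular contig IDs).

```python
def parse_organisms_file(text):
    """Map each organism name to (gff_path, circular_contig_ids).

    Hypothetical helper: it only checks the layout described in this README
    (unique name, mandatory gff3 path, optional circular contig IDs).
    """
    organisms = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected a name and a gff3 path")
        name, gff_path, *circular_contigs = fields
        if name in organisms:
            raise ValueError(f"line {line_number}: duplicate organism name {name!r}")
        organisms[name] = (gff_path, circular_contigs)
    return organisms


# Two rows taken from the example above, written with explicit tabs.
sample = "\n".join([
    "Escherichia_coli_042__E._coli_1\tgff3/ESCO.1017.00091.gff"
    "\tESCO.1017.00091.0001\tESCO.1017.00091.0002",
    "Escherichia_coli_536__E._coli_1\tgff3/ESCO.1017.00005.gff"
    "\tESCO.1017.00005.0001",
])
organisms = parse_organisms_file(sample)
print(organisms["Escherichia_coli_536__E._coli_1"])
```

A check against the reserved words would be a natural extension of this helper.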

Example of one of the associated gff files (obtained using Prokka):
::

##gff-version 3
##sequence-region ESCO.1017.00091.0001 1 5241977
##sequence-region ESCO.1017.00091.0002 1 113346
ESCO.1017.00091.0001 Prodigal:2.6 CDS 336 2798 . + . ID=ESCO.1017.00091.b0001_00001;Name=thrA;gene=thrA;inference=similar to AA sequence:UniProtKB:P00561;locus_tag=ESCO.1017.00091.b0001_00001;product=Bifunctional aspartokinase/homoserine dehydrogenase 1
ESCO.1017.00091.0001 Prodigal:2.6 CDS 2800 3732 . + . ID=ESCO.1017.00091.i0001_00002;eC_number=2.7.1.39;Name=thrB;gene=thrB;inference=similar to AA sequence:UniProtKB:P00547;locus_tag=ESCO.1017.00091.i0001_00002;product=Homoserine kinase
ESCO.1017.00091.0001 Prodigal:2.6 CDS 3733 5019 . + . ID=ESCO.1017.00091.i0001_00003;eC_number=4.2.3.1;Name=thrC;gene=thrC;inference=similar to AA sequence:UniProtKB:P00934;locus_tag=ESCO.1017.00091.i0001_00003;product=Threonine synthase
ESCO.1017.00091.0001 Prodigal:2.6 CDS 5233 5529 . + . ID=ESCO.1017.00091.i0001_00004;locus_tag=ESCO.1017.00091.i0001_00004;product=hypothetical protein
ESCO.1017.00091.0001 Prodigal:2.6 CDS 5687 6289 . - . ID=ESCO.1017.00091.i0001_00005;locus_tag=ESCO.1017.00091.i0001_00005;product=hypothetical protein
ESCO.1017.00091.0001 Prodigal:2.6 CDS 6514 6687 . - . ID=ESCO.1017.00091.i0001_00006;locus_tag=ESCO.1017.00091.i0001_00006;product=hypothetical protein
ESCO.1017.00091.0001 Prodigal:2.6 CDS 7118 7894 . - . ID=ESCO.1017.00091.i0001_00007;locus_tag=ESCO.1017.00091.i0001_00007;product=hypothetical protein
...
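
To make explicit which parts of such a gff file matter, here is a small sketch that extracts the contig sizes from the '##sequence-region' pragmas and the "ID", "Name" and "product" attributes of the CDS features. It is an illustration under the assumptions stated in this section, not the parser used by PPanGGOLiN, and ``parse_gff3`` is a hypothetical name.

```python
def parse_gff3(text):
    """Return (contig_sizes, cds_features) from the text of a gff3 file.

    Illustrative sketch only: it reads '##sequence-region' pragmas and CDS
    features, and ignores every other line.
    """
    contig_sizes = {}
    cds_features = []
    for line in text.splitlines():
        if line.startswith("##sequence-region"):
            _pragma, contig_id, start, end = line.split()
            contig_sizes[contig_id] = int(end) - int(start) + 1
        elif line.startswith("#") or not line.strip():
            continue  # other pragmas, comments and blank lines
        else:
            fields = line.split("\t")
            if len(fields) != 9 or fields[2] != "CDS":
                continue  # only CDS features are taken into account
            attributes = dict(
                pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
            )
            if "ID" not in attributes:
                raise ValueError(f"CDS without an ID attribute on contig {fields[0]}")
            cds_features.append({
                "contig": fields[0],
                "start": int(fields[3]),
                "end": int(fields[4]),
                "strand": fields[6],
                "id": attributes["ID"],
                "name": attributes.get("Name"),        # optional
                "product": attributes.get("product"),  # optional
            })
    return contig_sizes, cds_features


# Lines adapted from the example above, written with explicit tabs.
gff_text = "\n".join([
    "##gff-version 3",
    "##sequence-region ESCO.1017.00091.0001 1 5241977",
    "ESCO.1017.00091.0001\tProdigal:2.6\tCDS\t336\t2798\t.\t+\t.\t"
    "ID=ESCO.1017.00091.b0001_00001;Name=thrA;gene=thrA;"
    "product=Bifunctional aspartokinase/homoserine dehydrogenase 1",
    "ESCO.1017.00091.0001\tProdigal:2.6\tCDS\t5233\t5529\t.\t+\t.\t"
    "ID=ESCO.1017.00091.i0001_00004;product=hypothetical protein",
])
contig_sizes, cds_features = parse_gff3(gff_text)
print(contig_sizes, [cds["id"] for cds in cds_features])
```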

2. A file FAMILIES_FILE providing the gene families, formatted as follows.
This is a tab-delimited file.

1. The first column is the gene family name (sometimes the name of the median gene).
2. The following columns are the IDs of the genes belonging to this family (a gene cannot belong to multiple families).

Example of a families file:
::

1 ESCO.1017.00001.i0001_00047 ESCO.1017.00002.i0001_00053 ESCO.1017.00003.i0001_00052 ESCO.1017.00004.i0001_00047 ESCO.1017.00005.i0001_00048 ESCO.1017.00006.i0001_00053 ESCO.1017.00007.i0001_00052 ESCO.1017.00008.i0001_03750 ESCO.1017.00009.i0001_00047 ESCO.1017.00010.i0001_00047 ESCO.1017.00011.i0001_00052 ESCO.1017.00012.i0001_03643 ESCO.1017.00013.i0001_03593 ESCO.1017.00014.i0001_00050 ESCO.1017.00015.i0001_00048 ESCO.1017.00016.i0001_00047 ESCO.1017.00017.i0001_00053 ESCO.1017.00018.i0001_00038 ESCO.1017.00019.i0001_00051 ESCO.1017.00020.i0001_00051 ESCO.1017.00021.i0001_00048 ESCO.1017.00022.i0001_00047 ESCO.1017.00023.i0001_00049 ESCO.1017.00024.i0001_00735 ESCO.1017.00025.i0001_00040 ESCO.1017.00026.i0001_00048 ESCO.1017.00027.i0001_00047 ESCO.1017.00028.i0001_01224 ESCO.1017.00029.i0001_03729 ESCO.1017.00030.i0001_03859 ESCO.1017.00031.i0001_00620 ESCO.1017.00032.i0001_00627 ESCO.1017.00033.i0001_00637 ESCO.1017.00034.i0001_00050 ESCO.1017.00035.i0001_00047 ESCO.1017.00036.i0001_00047 ESCO.1017.00037.i0001_00047 ESCO.1017.00038.i0001_00047 ESCO.1017.00039.i0001_03494 ESCO.1017.00040.i0001_00279 ESCO.1017.00041.i0001_00052 ESCO.1017.00042.i0001_00052 ESCO.1017.00043.i0001_00047 ESCO.1017.00044.i0001_00047 ESCO.1017.00045.i0001_00765 ESCO.1017.00046.i0001_00756 ESCO.1017.00047.i0001_00764 ESCO.1017.00048.i0001_00765 ESCO.1017.00049.i0001_00822 ESCO.1017.00050.i0001_00763 ESCO.1017.00051.i0001_00766 ESCO.1017.00052.i0001_00822 ESCO.1017.00053.i0001_00047 ESCO.1017.00054.i0001_00051 ESCO.1017.00055.i0001_00047 ESCO.1017.00056.i0001_00047 ESCO.1017.00057.i0001_00047 ESCO.1017.00058.i0001_00047 ESCO.1017.00059.i0001_00047 ESCO.1017.00060.i0001_00052 ESCO.1017.00061.i0001_00052 ESCO.1017.00062.i0001_00047 ESCO.1017.00063.i0001_00047 ESCO.1017.00064.i0001_00047 ESCO.1017.00065.i0001_00051 ESCO.1017.00066.i0001_04368 ESCO.1017.00067.i0001_04371 ESCO.1017.00068.i0001_04369 ESCO.1017.00069.i0001_04242 ESCO.1017.00070.i0001_03265 ESCO.1017.00071.i0001_00052 
ESCO.1017.00072.i0001_02745 ESCO.1017.00073.i0001_00772 ESCO.1017.00074.i0001_00774 ESCO.1017.00075.i0001_00622 ESCO.1017.00076.i0001_05069 ESCO.1017.00077.i0001_00052 ESCO.1017.00078.i0001_03627 ESCO.1017.00079.i0001_00767 ESCO.1017.00080.i0001_04013 ESCO.1017.00081.i0001_03408 ESCO.1017.00082.i0001_04825 ESCO.1017.00083.i0001_00047 ESCO.1017.00084.i0001_04180 ESCO.1017.00085.i0001_00053 ESCO.1017.00086.i0001_00050 ESCO.1017.00087.i0001_00051 ESCO.1017.00088.i0001_00050 ESCO.1017.00089.i0001_00053 ESCO.1017.00090.i0001_00051 ESCO.1017.00091.i0001_00055 ESCO.1017.00092.i0001_00051 ESCO.1017.00093.i0001_00050 ESCO.1017.00094.i0001_00048 ESCO.1017.00095.i0001_00052 ESCO.1017.00096.i0001_00047 ESCO.1017.00097.i0001_00768 ESCO.1017.00098.i0001_00774 ESCO.1017.00099.i0001_00053 ESCO.1017.00100.i0001_00054 ESCO.1017.00101.i0001_02441 ESCO.1017.00102.i0001_01197 ESCO.1017.00103.i0001_03712 ESCO.1017.00104.i0001_03915 ESCO.1017.00105.i0001_04058 ESCO.1017.00106.i0001_00052 ESCO.1017.00107.i0001_03883 ESCO.1017.00108.i0001_00047 ESCO.1017.00109.i0001_00047 ESCO.1017.00110.i0001_00052 ESCO.1017.00111.i0001_00052 ESCO.1017.00112.i0001_03779 ESCO.1017.00113.i0001_03530 ESCO.1017.00114.i0001_04415 ESCO.1017.00115.i0001_02640 ESCO.1017.00116.i0001_02854 ESCO.1017.00117.i0001_04675 ESCO.1017.00118.i0001_00052 ESCO.1017.00119.i0001_00051 ESCO.1017.00120.i0001_00053 ESCO.1017.00121.i0001_00048 ESCO.1017.00122.i0001_00053 ESCO.1017.00123.i0001_02649 ESCO.1017.00124.i0001_00084 ESCO.1017.00125.i0001_00708 ESCO.1017.00126.i0001_04565 ESCO.1017.00127.i0001_04548 ESCO.1017.00128.i0001_04614 ESCO.1017.00129.i0001_04564 ESCO.1017.00130.i0001_04555 ESCO.1017.00131.i0001_04613 ESCO.1017.00132.i0001_04544 ESCO.1017.00133.i0001_04600 ESCO.1017.00134.i0001_04596 ESCO.1017.00135.i0001_05121 ESCO.1017.00136.i0001_00052 ESCO.1017.00137.i0001_00050 ESCO.1017.00138.i0001_00053 ESCO.1017.00139.i0001_00049 ESCO.1017.00140.i0001_03887 ESCO.1017.00141.i0001_00048 ESCO.1017.00142.i0001_00048 
ESCO.1017.00143.i0001_00051 ESCO.1017.00144.i0001_00052 ESCO.1017.00145.i0001_04318 ESCO.1017.00146.i0001_00052 ESCO.1017.00147.i0001_00055 ESCO.1017.00148.i0001_00055 ESCO.1017.00149.i0001_00052 ESCO.1017.00150.i0001_00052 ESCO.1017.00151.i0001_02558 ESCO.1017.00152.i0001_02857 ESCO.1017.00153.i0001_00050 ESCO.1017.00154.i0001_02854 ESCO.1017.00155.i0001_00052 ESCO.1017.00156.i0001_00564 ESCO.1017.00157.i0001_00052 ESCO.1017.00158.i0001_00053 ESCO.1017.00159.i0001_00053 ESCO.1017.00160.i0001_04406 ESCO.1017.00161.i0001_00052 ESCO.1017.00162.i0001_03910 ESCO.1017.00163.i0001_03179 ESCO.1017.00164.i0001_01542 ESCO.1017.00165.i0001_00048 ESCO.1017.00166.i0001_00052 ESCO.1017.00167.i0001_04244 ESCO.1017.00168.i0001_04266 ESCO.1017.00169.i0001_00054 ESCO.1017.00170.i0001_00050 ESCO.1017.00171.i0001_00047 ESCO.1017.00172.i0001_00048 ESCO.1017.00173.i0001_03823 ESCO.1017.00174.i0001_01302 ESCO.1017.00176.i0001_00052 ESCO.1017.00177.i0001_03204 ESCO.1017.00178.i0001_01987 ESCO.1017.00179.i0001_00051 ESCO.1017.00180.i0001_00049 ESCO.1017.00181.i0001_00051 ESCO.1017.00182.i0001_00055 ESCO.1017.00183.i0001_03498 ESCO.1017.00184.i0001_00054 ESCO.1017.00185.i0001_03853 ESCO.1017.00186.i0001_00049 ESCO.1017.00187.i0001_00049 ESCO.1017.00188.i0001_00051 ESCO.1017.00189.i0001_04109 ESCO.1017.00190.i0001_00053 ESCO.1017.00191.i0001_03546 ESCO.1017.00192.i0001_01381 ESCO.1017.00193.i0001_00049 ESCO.1017.00194.i0001_00048 ESCO.1017.00195.i0001_00052 ESCO.1017.00196.i0001_00052 ESCO.1017.00197.i0001_00052 ESCO.1017.00198.i0001_00049 ESCO.1017.00199.i0001_00904 ESCO.1017.00200.i0001_03596 ESCO.1017.00201.i0001_00844 ESCO.1017.00202.i0001_00050 ESCO.1017.00203.i0002_04611
2 ESCO.1017.00001.i0001_00054 ESCO.1017.00004.i0001_00054 ESCO.1017.00009.i0001_00054 ESCO.1017.00010.i0001_00054 ESCO.1017.00012.i0001_03636 ESCO.1017.00022.i0001_00054 ESCO.1017.00025.i0001_00047 ESCO.1017.00027.i0001_00054 ESCO.1017.00035.i0001_00054 ESCO.1017.00036.i0001_00054 ESCO.1017.00037.i0001_00054 ESCO.1017.00038.i0001_00054 ESCO.1017.00039.i0001_03487 ESCO.1017.00043.i0001_00054 ESCO.1017.00044.i0001_00054 ESCO.1017.00045.i0001_00772 ESCO.1017.00046.i0001_00763 ESCO.1017.00047.i0001_00771 ESCO.1017.00048.i0001_00772 ESCO.1017.00049.i0001_00829 ESCO.1017.00050.i0001_00770 ESCO.1017.00051.i0001_00773 ESCO.1017.00052.i0001_00829 ESCO.1017.00053.i0001_00054 ESCO.1017.00055.i0001_00054 ESCO.1017.00056.i0001_00054 ESCO.1017.00057.i0001_00054 ESCO.1017.00058.i0001_00054 ESCO.1017.00059.i0001_00054 ESCO.1017.00062.i0001_00054 ESCO.1017.00063.i0001_00054 ESCO.1017.00064.i0001_00054 ESCO.1017.00065.i0001_00058 ESCO.1017.00066.i0001_04361 ESCO.1017.00067.i0001_04364 ESCO.1017.00068.i0001_04362 ESCO.1017.00072.i0001_02752 ESCO.1017.00075.i0001_00615 ESCO.1017.00078.i0001_03620 ESCO.1017.00083.i0001_00054 ESCO.1017.00102.i0001_01204 ESCO.1017.00108.i0001_00054 ESCO.1017.00109.i0001_00054
3 ESCO.1017.00001.i0001_00075 ESCO.1017.00002.i0001_00083 ESCO.1017.00003.i0001_00078 ESCO.1017.00004.i0001_00075 ESCO.1017.00005.i0001_00076 ESCO.1017.00006.i0001_00079 ESCO.1017.00007.i0001_00078 ESCO.1017.00008.i0001_03724 ESCO.1017.00010.i0001_00075 ESCO.1017.00011.i0001_00078 ESCO.1017.00012.i0001_03614 ESCO.1017.00013.i0001_03567 ESCO.1017.00014.i0001_00077 ESCO.1017.00015.i0001_00074 ESCO.1017.00016.i0001_00073 ESCO.1017.00017.i0001_00083 ESCO.1017.00018.i0001_00068 ESCO.1017.00019.i0001_00079 ESCO.1017.00020.i0001_00079 ESCO.1017.00021.i0001_00074 ESCO.1017.00022.i0001_00076 ESCO.1017.00023.i0001_00076 ESCO.1017.00024.i0001_00761 ESCO.1017.00025.i0001_00068 ESCO.1017.00026.i0001_00074 ESCO.1017.00027.i0001_00075 ESCO.1017.00028.i0001_01198 ESCO.1017.00029.i0001_03703 ESCO.1017.00030.i0001_03833 ESCO.1017.00031.i0001_00647 ESCO.1017.00032.i0001_00654 ESCO.1017.00033.i0001_00665 ESCO.1017.00034.i0001_00078 ESCO.1017.00035.i0001_00075 ESCO.1017.00036.i0001_00073 ESCO.1017.00037.i0001_00075 ESCO.1017.00038.i0001_00075 ESCO.1017.00039.i0001_03466 ESCO.1017.00040.i0001_00308 ESCO.1017.00041.i0001_00078 ESCO.1017.00042.i0001_00078 ESCO.1017.00043.i0001_00075 ESCO.1017.00044.i0001_00075 ESCO.1017.00045.i0001_00793 ESCO.1017.00046.i0001_00784 ESCO.1017.00047.i0001_00792 ESCO.1017.00048.i0001_00793 ESCO.1017.00049.i0001_00850 ESCO.1017.00050.i0001_00791 ESCO.1017.00051.i0001_00794 ESCO.1017.00052.i0001_00850 ESCO.1017.00053.i0001_00076 ESCO.1017.00054.i0001_00078 ESCO.1017.00055.i0001_00075 ESCO.1017.00056.i0001_00075 ESCO.1017.00057.i0001_00075 ESCO.1017.00058.i0001_00076 ESCO.1017.00059.i0001_00076 ESCO.1017.00060.i0001_00078 ESCO.1017.00061.i0001_00079 ESCO.1017.00062.i0001_00076 ESCO.1017.00063.i0001_00076 ESCO.1017.00064.i0001_00076 ESCO.1017.00065.i0001_00079 ESCO.1017.00066.i0001_04340 ESCO.1017.00067.i0001_04343 ESCO.1017.00068.i0001_04341 ESCO.1017.00069.i0001_04268 ESCO.1017.00070.i0001_03235 ESCO.1017.00071.i0001_00078 ESCO.1017.00072.i0001_02773 
ESCO.1017.00073.i0001_00798 ESCO.1017.00074.i0001_00800 ESCO.1017.00075.i0001_00596 ESCO.1017.00076.i0001_05042 ESCO.1017.00077.i0001_00079 ESCO.1017.00078.i0001_03598 ESCO.1017.00079.i0001_00793 ESCO.1017.00080.i0001_03986 ESCO.1017.00081.i0001_03435 ESCO.1017.00082.i0001_04799 ESCO.1017.00083.i0001_00076 ESCO.1017.00084.i0001_04153 ESCO.1017.00085.i0001_00081 ESCO.1017.00086.i0001_00080 ESCO.1017.00087.i0001_00077 ESCO.1017.00088.i0001_00077 ESCO.1017.00089.i0001_00080 ESCO.1017.00090.i0001_00078 ESCO.1017.00091.i0001_00083 ESCO.1017.00092.i0001_00078 ESCO.1017.00093.i0001_00077 ESCO.1017.00094.i0001_00074 ESCO.1017.00095.i0001_00079 ESCO.1017.00096.i0001_00074 ESCO.1017.00097.i0001_00794 ESCO.1017.00098.i0001_00800 ESCO.1017.00099.i0001_00080 ESCO.1017.00100.i0001_00081 ESCO.1017.00101.i0001_02415 ESCO.1017.00102.i0001_01225 ESCO.1017.00103.i0001_03685 ESCO.1017.00104.i0001_03888 ESCO.1017.00105.i0001_04088 ESCO.1017.00106.i0001_00082 ESCO.1017.00107.i0001_03856 ESCO.1017.00110.i0001_00082 ESCO.1017.00111.i0001_00082 ESCO.1017.00112.i0001_03806 ESCO.1017.00113.i0001_03557 ESCO.1017.00114.i0001_04385 ESCO.1017.00115.i0001_02666 ESCO.1017.00116.i0001_02881 ESCO.1017.00117.i0001_04648 ESCO.1017.00118.i0001_00079 ESCO.1017.00119.i0001_00078 ESCO.1017.00120.i0001_00079 ESCO.1017.00121.i0001_00074 ESCO.1017.00122.i0001_00079 ESCO.1017.00123.i0001_02622 ESCO.1017.00124.i0001_00114 ESCO.1017.00125.i0001_00735 ESCO.1017.00126.i0001_04538 ESCO.1017.00127.i0001_04521 ESCO.1017.00128.i0001_04587 ESCO.1017.00129.i0001_04537 ESCO.1017.00130.i0001_04528 ESCO.1017.00131.i0001_04586 ESCO.1017.00132.i0001_04517 ESCO.1017.00133.i0001_04573 ESCO.1017.00134.i0001_04569 ESCO.1017.00135.i0001_05094 ESCO.1017.00136.i0001_00079 ESCO.1017.00137.i0001_00078 ESCO.1017.00138.i0001_00080 ESCO.1017.00139.i0001_00079 ESCO.1017.00140.i0001_03861 ESCO.1017.00141.i0001_00074 ESCO.1017.00142.i0001_00074 ESCO.1017.00143.i0001_00078 ESCO.1017.00144.i0001_00082 ESCO.1017.00145.i0001_04292 
ESCO.1017.00146.i0001_00081 ESCO.1017.00147.i0001_00083 ESCO.1017.00148.i0001_00083 ESCO.1017.00149.i0001_00081 ESCO.1017.00150.i0001_00079 ESCO.1017.00151.i0001_02586 ESCO.1017.00152.i0001_02885 ESCO.1017.00153.i0001_00077 ESCO.1017.00154.i0001_02880 ESCO.1017.00155.i0001_00079 ESCO.1017.00156.i0001_00590 ESCO.1017.00157.i0001_00082 ESCO.1017.00158.i0001_00085 ESCO.1017.00159.i0001_00083 ESCO.1017.00160.i0001_04436 ESCO.1017.00161.i0001_00079 ESCO.1017.00162.i0001_03884 ESCO.1017.00163.i0001_03206 ESCO.1017.00164.i0001_01572 ESCO.1017.00165.i0001_00075 ESCO.1017.00166.i0001_00079 ESCO.1017.00167.i0001_04218 ESCO.1017.00168.i0001_04240 ESCO.1017.00169.i0001_00080 ESCO.1017.00170.i0001_00076 ESCO.1017.00171.i0001_00074 ESCO.1017.00172.i0001_00074 ESCO.1017.00173.i0001_03796 ESCO.1017.00174.i0001_01277 ESCO.1017.00175.i0001_03868 ESCO.1017.00176.i0001_00082 ESCO.1017.00177.i0001_03230 ESCO.1017.00178.i0001_01960 ESCO.1017.00179.i0001_00079 ESCO.1017.00180.i0001_00075 ESCO.1017.00181.i0001_00078 ESCO.1017.00182.i0001_00083 ESCO.1017.00183.i0001_03528 ESCO.1017.00184.i0001_00080 ESCO.1017.00185.i0001_03827 ESCO.1017.00186.i0001_00075 ESCO.1017.00187.i0001_00075 ESCO.1017.00188.i0001_00078 ESCO.1017.00189.i0001_04082 ESCO.1017.00190.i0001_00083 ESCO.1017.00191.i0001_03573 ESCO.1017.00192.i0001_01355 ESCO.1017.00193.i0001_00076 ESCO.1017.00194.i0001_00074 ESCO.1017.00195.i0001_00082 ESCO.1017.00196.i0001_00085 ESCO.1017.00197.i0001_00078 ESCO.1017.00198.i0001_00076 ESCO.1017.00199.i0001_00874 ESCO.1017.00200.i0001_03570 ESCO.1017.00201.i0001_00870 ESCO.1017.00202.i0001_00077 ESCO.1017.00203.i0002_04638
4 ESCO.1017.00001.i0001_00079 ESCO.1017.00002.i0001_00087 ESCO.1017.00003.i0001_00082 ESCO.1017.00004.i0001_00079 ESCO.1017.00005.i0001_00080 ESCO.1017.00006.i0001_00083 ESCO.1017.00007.i0001_00082 ESCO.1017.00008.i0001_03720 ESCO.1017.00009.i0001_00060 ESCO.1017.00010.i0001_00079 ESCO.1017.00011.i0001_00082 ESCO.1017.00012.i0001_03610 ESCO.1017.00013.i0001_03563 ESCO.1017.00014.i0001_00081 ESCO.1017.00015.i0001_00078 ESCO.1017.00016.i0001_00077 ESCO.1017.00017.i0001_00087 ESCO.1017.00018.i0001_00072 ESCO.1017.00019.i0001_00083 ESCO.1017.00020.i0001_00083 ESCO.1017.00021.i0001_00078 ESCO.1017.00022.i0001_00080 ESCO.1017.00023.i0001_00080 ESCO.1017.00024.i0001_00765 ESCO.1017.00025.i0001_00072 ESCO.1017.00026.i0001_00078 ESCO.1017.00027.i0001_00079 ESCO.1017.00028.i0001_01194 ESCO.1017.00029.i0001_03699 ESCO.1017.00030.i0001_03829 ESCO.1017.00031.i0001_00652 ESCO.1017.00032.i0001_00659 ESCO.1017.00033.i0001_00670 ESCO.1017.00034.i0001_00082 ESCO.1017.00035.i0001_00079 ESCO.1017.00036.i0001_00077 ESCO.1017.00037.i0001_00079 ESCO.1017.00038.i0001_00079 ESCO.1017.00039.i0001_03462 ESCO.1017.00040.i0001_00312 ESCO.1017.00041.i0001_00082 ESCO.1017.00042.i0001_00082 ESCO.1017.00043.i0001_00079 ESCO.1017.00044.i0001_00079 ESCO.1017.00045.i0001_00797 ESCO.1017.00046.i0001_00788 ESCO.1017.00047.i0001_00796 ESCO.1017.00048.i0001_00797 ESCO.1017.00049.i0001_00854 ESCO.1017.00050.i0001_00795 ESCO.1017.00051.i0001_00798 ESCO.1017.00052.i0001_00854 ESCO.1017.00053.i0001_00080 ESCO.1017.00054.i0001_00082 ESCO.1017.00055.i0001_00079 ESCO.1017.00056.i0001_00079 ESCO.1017.00057.i0001_00079 ESCO.1017.00058.i0001_00080 ESCO.1017.00059.i0001_00080 ESCO.1017.00060.i0001_00082 ESCO.1017.00061.i0001_00083 ESCO.1017.00062.i0001_00080 ESCO.1017.00063.i0001_00080 ESCO.1017.00064.i0001_00080 ESCO.1017.00065.i0001_00083 ESCO.1017.00066.i0001_04336 ESCO.1017.00067.i0001_04339 ESCO.1017.00068.i0001_04337 ESCO.1017.00069.i0001_04272 ESCO.1017.00070.i0001_03231 ESCO.1017.00071.i0001_00082 
ESCO.1017.00072.i0001_02777 ESCO.1017.00073.i0001_00802 ESCO.1017.00074.i0001_00804 ESCO.1017.00075.i0001_00592 ESCO.1017.00076.i0001_05038 ESCO.1017.00077.i0001_00083 ESCO.1017.00078.i0001_03594 ESCO.1017.00079.i0001_00797 ESCO.1017.00080.i0001_03982 ESCO.1017.00081.i0001_03439 ESCO.1017.00082.i0001_04795 ESCO.1017.00083.i0001_00080 ESCO.1017.00084.i0001_04149 ESCO.1017.00085.i0001_00085 ESCO.1017.00086.i0001_00084 ESCO.1017.00087.i0001_00081 ESCO.1017.00088.i0001_00081 ESCO.1017.00089.i0001_00084 ESCO.1017.00090.i0001_00082 ESCO.1017.00091.i0001_00087 ESCO.1017.00092.i0001_00082 ESCO.1017.00093.i0001_00081 ESCO.1017.00094.i0001_00078 ESCO.1017.00095.i0001_00083 ESCO.1017.00096.i0001_00078 ESCO.1017.00097.i0001_00798 ESCO.1017.00098.i0001_00804 ESCO.1017.00099.i0001_00084 ESCO.1017.00100.i0001_00085 ESCO.1017.00101.i0001_02411 ESCO.1017.00102.i0001_01229 ESCO.1017.00103.i0001_03681 ESCO.1017.00104.i0001_03884 ESCO.1017.00105.i0001_04092 ESCO.1017.00106.i0001_00086 ESCO.1017.00107.i0001_03852 ESCO.1017.00108.i0001_00060 ESCO.1017.00109.i0001_00060 ESCO.1017.00110.i0001_00086 ESCO.1017.00111.i0001_00087 ESCO.1017.00112.i0001_03810 ESCO.1017.00113.i0001_03561 ESCO.1017.00114.i0001_04381 ESCO.1017.00115.i0001_02670 ESCO.1017.00116.i0001_02885 ESCO.1017.00117.i0001_04644 ESCO.1017.00118.i0001_00083 ESCO.1017.00119.i0001_00082 ESCO.1017.00120.i0001_00083 ESCO.1017.00121.i0001_00078 ESCO.1017.00122.i0001_00083 ESCO.1017.00123.i0001_02618 ESCO.1017.00124.i0001_00118 ESCO.1017.00125.i0001_00739 ESCO.1017.00126.i0001_04534 ESCO.1017.00127.i0001_04517 ESCO.1017.00128.i0001_04583 ESCO.1017.00129.i0001_04533 ESCO.1017.00130.i0001_04524 ESCO.1017.00131.i0001_04582 ESCO.1017.00132.i0001_04513 ESCO.1017.00133.i0001_04569 ESCO.1017.00134.i0001_04565 ESCO.1017.00135.i0001_05090 ESCO.1017.00136.i0001_00083 ESCO.1017.00137.i0001_00082 ESCO.1017.00138.i0001_00084 ESCO.1017.00139.i0001_00083 ESCO.1017.00140.i0001_03857 ESCO.1017.00141.i0001_00078 ESCO.1017.00142.i0001_00078 
ESCO.1017.00143.i0001_00082 ESCO.1017.00144.i0001_00086 ESCO.1017.00145.i0001_04288 ESCO.1017.00146.i0001_00085 ESCO.1017.00147.i0001_00087 ESCO.1017.00148.i0001_00087 ESCO.1017.00149.i0001_00085 ESCO.1017.00150.i0001_00084 ESCO.1017.00151.i0001_02590 ESCO.1017.00152.i0001_02889 ESCO.1017.00153.i0001_00081 ESCO.1017.00154.i0001_02884 ESCO.1017.00155.i0001_00083 ESCO.1017.00156.i0001_00594 ESCO.1017.00157.i0001_00086 ESCO.1017.00158.i0001_00089 ESCO.1017.00159.i0001_00087 ESCO.1017.00160.i0001_04441 ESCO.1017.00161.i0001_00083 ESCO.1017.00162.i0001_03880 ESCO.1017.00163.i0001_03210 ESCO.1017.00164.i0001_01576 ESCO.1017.00165.i0001_00079 ESCO.1017.00166.i0001_00083 ESCO.1017.00167.i0001_04214 ESCO.1017.00168.i0001_04236 ESCO.1017.00169.i0001_00084 ESCO.1017.00170.i0001_00080 ESCO.1017.00171.i0001_00078 ESCO.1017.00172.i0001_00078 ESCO.1017.00173.i0001_03792 ESCO.1017.00174.i0001_01273 ESCO.1017.00175.i0001_03864 ESCO.1017.00176.i0001_00086 ESCO.1017.00177.i0001_03234 ESCO.1017.00178.i0001_01956 ESCO.1017.00179.i0001_00083 ESCO.1017.00180.i0001_00079 ESCO.1017.00181.i0001_00082 ESCO.1017.00182.i0001_00087 ESCO.1017.00183.i0001_03532 ESCO.1017.00184.i0001_00084 ESCO.1017.00185.i0001_03823 ESCO.1017.00186.i0001_00079 ESCO.1017.00187.i0001_00079 ESCO.1017.00188.i0001_00082 ESCO.1017.00189.i0001_04078 ESCO.1017.00190.i0001_00087 ESCO.1017.00191.i0001_03577 ESCO.1017.00192.i0001_01351 ESCO.1017.00193.i0001_00080 ESCO.1017.00194.i0001_00078 ESCO.1017.00195.i0001_00086 ESCO.1017.00196.i0001_00089 ESCO.1017.00197.i0001_00082 ESCO.1017.00198.i0001_00080 ESCO.1017.00199.i0001_00870 ESCO.1017.00200.i0001_03566 ESCO.1017.00201.i0001_00874 ESCO.1017.00202.i0001_00081 ESCO.1017.00203.i0002_04642
5 ESCO.1017.00001.i0001_00080 ESCO.1017.00002.i0001_00088 ESCO.1017.00003.i0001_00083 ESCO.1017.00004.i0001_00080 ESCO.1017.00005.i0001_00081 ESCO.1017.00006.i0001_00084 ESCO.1017.00007.i0001_00083 ESCO.1017.00008.i0001_03719 ESCO.1017.00009.i0001_00061 ESCO.1017.00010.i0001_00080 ESCO.1017.00011.i0001_00083 ESCO.1017.00012.i0001_03609 ESCO.1017.00013.i0001_03562 ESCO.1017.00014.i0001_00082 ESCO.1017.00015.i0001_00079 ESCO.1017.00016.i0001_00078 ESCO.1017.00017.i0001_00088 ESCO.1017.00018.i0001_00073 ESCO.1017.00019.i0001_00084 ESCO.1017.00020.i0001_00084 ESCO.1017.00021.i0001_00079 ESCO.1017.00022.i0001_00081 ESCO.1017.00023.i0001_00081 ESCO.1017.00024.i0001_00766 ESCO.1017.00025.i0001_00073 ESCO.1017.00026.i0001_00079 ESCO.1017.00027.i0001_00080 ESCO.1017.00028.i0001_01193 ESCO.1017.00029.i0001_03698 ESCO.1017.00030.i0001_03828 ESCO.1017.00031.i0001_00653 ESCO.1017.00032.i0001_00660 ESCO.1017.00033.i0001_00671 ESCO.1017.00034.i0001_00083 ESCO.1017.00035.i0001_00080 ESCO.1017.00036.i0001_00078 ESCO.1017.00037.i0001_00080 ESCO.1017.00038.i0001_00080 ESCO.1017.00039.i0001_03461 ESCO.1017.00040.i0001_00313 ESCO.1017.00041.i0001_00083 ESCO.1017.00042.i0001_00083 ESCO.1017.00043.i0001_00080 ESCO.1017.00044.i0001_00080 ESCO.1017.00045.i0001_00798 ESCO.1017.00046.i0001_00789 ESCO.1017.00047.i0001_00797 ESCO.1017.00048.i0001_00798 ESCO.1017.00049.i0001_00855 ESCO.1017.00050.i0001_00796 ESCO.1017.00051.i0001_00799 ESCO.1017.00052.i0001_00855 ESCO.1017.00053.i0001_00081 ESCO.1017.00054.i0001_00083 ESCO.1017.00055.i0001_00080 ESCO.1017.00056.i0001_00080 ESCO.1017.00057.i0001_00080 ESCO.1017.00058.i0001_00081 ESCO.1017.00059.i0001_00081 ESCO.1017.00060.i0001_00083 ESCO.1017.00061.i0001_00084 ESCO.1017.00062.i0001_00081 ESCO.1017.00063.i0001_00081 ESCO.1017.00064.i0001_00081 ESCO.1017.00065.i0001_00084 ESCO.1017.00066.i0001_04335 ESCO.1017.00067.i0001_04338 ESCO.1017.00068.i0001_04336 ESCO.1017.00069.i0001_04273 ESCO.1017.00070.i0001_03230 ESCO.1017.00071.i0001_00083 
ESCO.1017.00072.i0001_02778 ESCO.1017.00073.i0001_00803 ESCO.1017.00074.i0001_00805 ESCO.1017.00075.i0001_00591 ESCO.1017.00076.i0001_05037 ESCO.1017.00077.i0001_00084 ESCO.1017.00078.i0001_03593 ESCO.1017.00079.i0001_00798 ESCO.1017.00080.i0001_03981 ESCO.1017.00081.i0001_03440 ESCO.1017.00082.i0001_04794 ESCO.1017.00083.i0001_00081 ESCO.1017.00084.i0001_04148 ESCO.1017.00085.i0001_00086 ESCO.1017.00086.i0001_00085 ESCO.1017.00087.i0001_00082 ESCO.1017.00088.i0001_00082 ESCO.1017.00089.i0001_00085 ESCO.1017.00090.i0001_00083 ESCO.1017.00091.i0001_00088 ESCO.1017.00092.i0001_00083 ESCO.1017.00093.i0001_00082 ESCO.1017.00094.i0001_00079 ESCO.1017.00095.i0001_00084 ESCO.1017.00096.i0001_00079 ESCO.1017.00097.i0001_00799 ESCO.1017.00098.i0001_00805 ESCO.1017.00099.i0001_00085 ESCO.1017.00100.i0001_00086 ESCO.1017.00101.i0001_02410 ESCO.1017.00102.i0001_01230 ESCO.1017.00103.i0001_03680 ESCO.1017.00104.i0001_03883 ESCO.1017.00105.i0001_04093 ESCO.1017.00106.i0001_00087 ESCO.1017.00107.i0001_03851 ESCO.1017.00108.i0001_00061 ESCO.1017.00109.i0001_00061 ESCO.1017.00110.i0001_00087 ESCO.1017.00111.i0001_00088 ESCO.1017.00112.i0001_03811 ESCO.1017.00113.i0001_03562 ESCO.1017.00114.i0001_04380 ESCO.1017.00115.i0001_02671 ESCO.1017.00116.i0001_02886 ESCO.1017.00117.i0001_04643 ESCO.1017.00118.i0001_00084 ESCO.1017.00119.i0001_00083 ESCO.1017.00120.i0001_00084 ESCO.1017.00121.i0001_00079 ESCO.1017.00122.i0001_00084 ESCO.1017.00123.i0001_02617 ESCO.1017.00124.i0001_00119 ESCO.1017.00125.i0001_00740 ESCO.1017.00126.i0001_04533 ESCO.1017.00127.i0001_04516 ESCO.1017.00128.i0001_04582 ESCO.1017.00129.i0001_04532 ESCO.1017.00130.i0001_04523 ESCO.1017.00131.i0001_04581 ESCO.1017.00132.i0001_04512 ESCO.1017.00133.i0001_04568 ESCO.1017.00134.i0001_04564 ESCO.1017.00135.i0001_05089 ESCO.1017.00136.i0001_00084 ESCO.1017.00137.i0001_00083 ESCO.1017.00138.i0001_00085 ESCO.1017.00139.i0001_00084 ESCO.1017.00140.i0001_03856 ESCO.1017.00141.i0001_00079 ESCO.1017.00142.i0001_00079 
ESCO.1017.00143.i0001_00083 ESCO.1017.00144.i0001_00087 ESCO.1017.00145.i0001_04287 ESCO.1017.00146.i0001_00086 ESCO.1017.00147.i0001_00088 ESCO.1017.00148.i0001_00088 ESCO.1017.00149.i0001_00086 ESCO.1017.00150.i0001_00085 ESCO.1017.00151.i0001_02591 ESCO.1017.00152.i0001_02890 ESCO.1017.00153.i0001_00082 ESCO.1017.00154.i0001_02885 ESCO.1017.00155.i0001_00084 ESCO.1017.00156.i0001_00595 ESCO.1017.00157.i0001_00087 ESCO.1017.00159.i0001_00088 ESCO.1017.00160.i0001_04442 ESCO.1017.00161.i0001_00084 ESCO.1017.00162.i0001_03879 ESCO.1017.00163.i0001_03211 ESCO.1017.00164.i0001_01577 ESCO.1017.00165.i0001_00080 ESCO.1017.00166.i0001_00084 ESCO.1017.00167.i0001_04213 ESCO.1017.00168.i0001_04235 ESCO.1017.00169.i0001_00085 ESCO.1017.00170.i0001_00081 ESCO.1017.00171.i0001_00079 ESCO.1017.00172.i0001_00079 ESCO.1017.00173.i0001_03791 ESCO.1017.00174.i0001_01272 ESCO.1017.00175.i0001_03863 ESCO.1017.00176.i0001_00087 ESCO.1017.00177.i0001_03235 ESCO.1017.00178.i0001_01955 ESCO.1017.00179.i0001_00084 ESCO.1017.00180.i0001_00080 ESCO.1017.00181.i0001_00083 ESCO.1017.00182.i0001_00088 ESCO.1017.00183.i0001_03533 ESCO.1017.00184.i0001_00085 ESCO.1017.00185.i0001_03822 ESCO.1017.00186.i0001_00080 ESCO.1017.00187.i0001_00080 ESCO.1017.00188.i0001_00083 ESCO.1017.00189.i0001_04077 ESCO.1017.00190.i0001_00088 ESCO.1017.00191.i0001_03578 ESCO.1017.00192.i0001_01350 ESCO.1017.00193.i0001_00081 ESCO.1017.00194.i0001_00079 ESCO.1017.00195.i0001_00087 ESCO.1017.00196.i0001_00090 ESCO.1017.00197.i0001_00083 ESCO.1017.00198.i0001_00081 ESCO.1017.00199.i0001_00869 ESCO.1017.00200.i0001_03565 ESCO.1017.00201.i0001_00875 ESCO.1017.00202.i0001_00082 ESCO.1017.00203.i0002_04643
...

Note that the assignment of genes to a gene family can be spread over several lines.
Indeed, the following form is a more verbose equivalent of the previous one:
::

1 ESCO.1017.00001.i0001_00047
1 ESCO.1017.00002.i0001_00053
1 ESCO.1017.00003.i0001_00052
1 ESCO.1017.00004.i0001_00047
1 ESCO.1017.00005.i0001_00048
1 ESCO.1017.00006.i0001_00053
...
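
The equivalence of the two forms can be illustrated with a short sketch. The function ``parse_families_file`` below is hypothetical and only assumes the tab-delimited layout described above; it accepts both the compact and the verbose form and rejects a gene assigned to two different families.

```python
def parse_families_file(text):
    """Return a {gene_id: family_name} mapping from FAMILIES_FILE content.

    Hypothetical helper: a family may be spread over several lines, but a
    gene may belong to one family only.
    """
    gene_to_family = {}
    for line in text.splitlines():
        fields = [field for field in line.split("\t") if field]
        if not fields:
            continue  # ignore blank lines
        family, *gene_ids = fields
        for gene_id in gene_ids:
            previous = gene_to_family.setdefault(gene_id, family)
            if previous != family:
                raise ValueError(
                    f"gene {gene_id!r} is in families {previous!r} and {family!r}"
                )
    return gene_to_family


compact = "1\tESCO.1017.00001.i0001_00047\tESCO.1017.00002.i0001_00053\n"
verbose = (
    "1\tESCO.1017.00001.i0001_00047\n"
    "1\tESCO.1017.00002.i0001_00053\n"
)
# Both forms describe the same assignment of genes to family "1".
print(parse_families_file(compact) == parse_families_file(verbose))
```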

The tsv format is the one returned by MMseqs2 (https://github.com/soedinglab/MMseqs2), whose output can be used directly as PPanGGOLiN input (in MMseqs2, the gene family name in the first column is the name of the median gene of the family).
Every gene ID found in the gff files must be associated with a gene family, including singletons, unless the flag --infere-singletons is used. In that case, singletons are detected automatically from the gff files (the family ID is then the gene ID).
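
The rule about singletons can be sketched as follows. This is an illustration of the behaviour described above, not PPanGGOLiN's code: the function ``complete_with_singletons`` and its ``infer_singletons`` parameter are hypothetical names that mirror the command-line flag.

```python
def complete_with_singletons(gene_to_family, gff_gene_ids, infer_singletons=False):
    """Give every gene found in the gff files a family.

    Without infer_singletons, a gene missing from the families file is an
    error. With it, each missing gene becomes its own family, and the
    family ID is the gene ID.
    """
    completed = dict(gene_to_family)
    missing = [gene_id for gene_id in gff_gene_ids if gene_id not in completed]
    if missing and not infer_singletons:
        raise ValueError(f"{len(missing)} gene(s) without a family, e.g. {missing[0]!r}")
    for gene_id in missing:
        completed[gene_id] = gene_id
    return completed


families = {"ESCO.1017.00001.i0001_00047": "1"}
gff_genes = ["ESCO.1017.00001.i0001_00047", "ESCO.1017.00001.i0001_00054"]
completed = complete_with_singletons(families, gff_genes, infer_singletons=True)
print(completed["ESCO.1017.00001.i0001_00054"])
```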

Reserved words
---------------------------
To prevent bugs, the following words are forbidden as identifiers: "id", "label", "name", "weight", "partition", "partition_exact", "length", "length_min", "length_max", "length_avg", "length_med", "product", "nb_genes", "subpartition_shell", "viz". Moreover, the characters "|" and "," must not appear in any identifier.
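
Identifiers can be screened against these rules before running the tool. The sketch below is a hypothetical helper, not part of PPanGGOLiN, and it assumes that reserved words are compared exactly (case-sensitively), which this section does not specify.

```python
RESERVED_WORDS = frozenset({
    "id", "label", "name", "weight", "partition", "partition_exact",
    "length", "length_min", "length_max", "length_avg", "length_med",
    "product", "nb_genes", "subpartition_shell", "viz",
})
FORBIDDEN_CHARACTERS = ("|", ",")


def identifier_problems(identifier):
    """Return a list of reasons why an identifier is not allowed (empty if fine).

    Assumption: the comparison with reserved words is exact and case-sensitive.
    """
    problems = []
    if identifier in RESERVED_WORDS:
        problems.append(f"{identifier!r} is a reserved word")
    for character in FORBIDDEN_CHARACTERS:
        if character in identifier:
            problems.append(f"{identifier!r} contains {character!r}")
    return problems


for candidate in ("Escherichia_coli_536__E._coli_1", "partition", "geneA|geneB"):
    print(candidate, identifier_problems(candidate))
```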

Output
---------------------------
The software generates several output files:

1. *graph.gexf* (and *graph_light.gexf*, which has the same topology without gene and organism details). GEXF files can be opened using Gephi (https://gephi.org/). See the video below (in the section on Gephi tuning) to obtain an appealing layout of the graph.

.. image:: images/gephi.gif

2. *matrix.csv* and *matrix.Rtab* correspond to the gene presence/absence matrix, formatted as in Roary (https://sanger-pathogens.github.io/Roary/) except that the second column contains the partition instead of an alternative gene ID. When an organism has several genes in a single gene family, the identifiers of these genes are merged with a "|" separator.

3. A file generate_plots.R able to generate some figures to visualize some metrics about the pangenome. This file can be executed using the following command :

.. code:: bash

Rscript OUTPUT_DIR/generate_plots.R

*The script may emit messages such as "Removed X rows containing non-finite values"; these can be ignored.*

4. A folder *figures* containing the different plots (the script generate_plots.R is executed if flag '-p' is provided):

* tile plot: a figure providing an overview of the presence (green)/absence (grey) matrix.

.. image:: images/tile_plot.png

* U-shaped plot (PDF and HTML): a figure providing an overview of the gene frequency distribution

.. image:: images/u_plot.png

* optional: evolution curve (if the flag '-e' is provided): a figure providing an overview of the evolution of the pangenome metrics when more and more organisms are added to the pangenome (see the *Evolution* section to obtain more details).

* optional: projection plots (if the option '-pr NUM' is provided): a figure showing the projection of the pangenome against one organism in order to visualize persistent, shell and cloud regions on this genome (see the *Projection* section to obtain more details).

5. A folder *partitions* containing one file per partition. Each file lists the names of the gene families assigned to that partition.

6. A folder *NEM_results* containing the temporary data of the computation (removed if flag '-df' is provided)

7. optional: a folder *evolutions* containing the temporary data of all the resamplings and the file (stat_evol.txt) summarizing this evolution (if flag '-e' is provided)

8. optional: a folder *projections* containing one tabulated file per selected organism, providing information about the projection of the graph against that organism (if argument '-pr' followed by the line number in the ORGANISMS_FILE is provided)

Options
============================

Remove gene families having a high number of gene copies
------------------------------------------------------

To minimize the impact of genomic hubs in the graph, caused by gene families scattered all along the genomes (such as transposases), an option allows filtering out gene families whose number of genes exceeds a threshold in at least one organism.

For example, this command:

.. code:: bash

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -r 10

will remove gene families having more than 10 repeated genes in at least one of the organisms. Empirically, an -r value of 10 discards only a few gene families (about a dozen).
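
The filtering rule can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: the function and both input mappings (gene to family, gene to organism) are hypothetical.

.. code:: python

    from collections import Counter

    def families_to_remove(gene_to_family, gene_to_organism, threshold):
        """Return the families having more than `threshold` genes
        in at least one organism."""
        copies = Counter()
        for gene, family in gene_to_family.items():
            copies[(family, gene_to_organism[gene])] += 1
        return {family for (family, _), count in copies.items() if count > threshold}

    # Family "tnp" has 3 copies in org1, family "dnaA" has 1 copy per organism.
    gene_to_family = {"g1": "tnp", "g2": "tnp", "g3": "tnp", "g4": "dnaA", "g5": "dnaA"}
    gene_to_organism = {"g1": "org1", "g2": "org1", "g3": "org1", "g4": "org1", "g5": "org2"}
    assert families_to_remove(gene_to_family, gene_to_organism, 2) == {"tnp"}
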

Partitioning parameters
---------------------------

The partitioning method can be customized via 3 parameters:

1. Partitioning by chunks (-ck VALUE option): When more than 500 organisms are processed, it is advised to partition the pangenome by chunks, because the method seems to saturate with a large number of dimensions. Chunks are samples of organisms that are partitioned simultaneously. We advise using chunks of 500 organisms in order to obtain representative ones. The tool then partitions the pangenome using multiple chunks, so that every gene family is partitioned at least (total number of organisms)/(chunk size) times. Moreover, each gene family must be assigned mainly to one specific partition (>50% of cases); otherwise the partitioning continues until this criterion is met.

This feature can be used with the following command:

.. code:: bash

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -ck 300

2. Smoothing strength (-b VALUE option): This option specifies the strength of the smoothing (:math:`\beta`) of the partitions based on the graph topology (using a Markov Random Field). :math:`\beta = 0` means no smoothing, whereas :math:`\beta = 1` means strong smoothing (values higher than 1 are allowed but strongly discouraged). :math:`\beta = 0.5` is generally a good tradeoff.

This feature can be used with the following command:

.. code:: bash

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -b 1

3. Free dispersion around centroid vectors (-fd flag): This flag lets the dispersion vector around the centroid vector of the Bernoulli Mixture Model vary freely from one organism to another. By default, dispersions are constrained to be the same for all organisms within each partition; that is to say, all organisms have the same impact on the partitioning.

This feature can be used with the following command:

.. code:: bash

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -fd
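
The stopping rule described for partitioning by chunks (-ck) can be sketched as below. This is only a reading of that description, not the tool's code: the function name is hypothetical, and rounding the required number of rounds up to the next integer is an assumption.

.. code:: python

    import math
    from collections import Counter

    def is_settled(votes, n_organisms, chunk_size):
        """Decide whether a gene family has been partitioned often enough.

        `votes` lists the partition assigned to the family in each chunk
        where it was processed.
        """
        # Assumption: (total number of organisms)/(chunk size) is rounded up.
        required = math.ceil(n_organisms / chunk_size)
        if not votes or len(votes) < required:
            return False
        _, top_count = Counter(votes).most_common(1)[0]
        # The dominant partition must win strictly more than 50% of the cases.
        return top_count / len(votes) > 0.5

    assert is_settled(["persistent", "persistent", "shell"], 1000, 500)
    assert not is_settled(["persistent", "shell"], 1000, 500)
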

Evolution curve (-e option)
------------------------------------------------------

Unlike a pangenome in which gene families are partitioned into core genome and accessory genome based on an occurrence threshold, this approach estimates the best partitioning via a statistical approach. This processing requires calculation steps and is therefore not instantaneous. Performing many resamplings can thus require heavy calculations, which is why it is not done by default. Nevertheless, it is possible to perform these resamplings using the -e flag. Use this flag with caution.

We also offer the possibility to customize the resampling using 5 parameters provided to the -ep option: RESAMPLING_RATIO, MINIMUN_SAMPLING, MAXIMUN_SAMPLING, STEP and LIMIT (see the figure below for an idea of the effect of the parameters). The STEP parameter skips some combinations of organisms by a fixed step to reduce the amount of computation, and the LIMIT parameter specifies the maximum sample size. For example, to compute all the combinations (strongly discouraged!), RESAMPLING_RATIO must be equal to 1, MINIMUN_SAMPLING to 1, MAXIMUN_SAMPLING to Inf, STEP to 1 and LIMIT to Inf.

.. image:: images/resampling.png


.. code:: bash

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -e -ep 0.01 10 50 1 100

will generate 1% of all possible resamplings, with a minimum of 10 and a maximum of 50 combinations for each size of the set of organisms. The size of the combinations is increased by a step of 1, up to samples of at most 100 organisms.
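
To get a feel for the amount of computation implied by the -ep values, the sketch below counts the samples drawn for each sample size. It is one possible reading of the five parameters, not the tool's code: the function name, the starting size of 1, the rounding up, and the cap at the number of existing combinations are all assumptions.

.. code:: python

    import math
    from fractions import Fraction

    def samples_per_size(n_organisms, ratio, minimum, maximum, step, limit):
        """Return {sample size: number of samples} for a resampling plan."""
        plan = {}
        largest = int(min(n_organisms, limit))  # limit may be float("inf")
        for size in range(1, largest + 1, step):
            total = math.comb(n_organisms, size)  # all combinations of this size
            # Exact arithmetic avoids float overflow for large pangenomes.
            wanted = math.ceil(Fraction(str(ratio)) * total)
            wanted = max(minimum, min(wanted, maximum))
            plan[size] = min(wanted, total)  # cannot draw more than exist
        return plan

    plan = samples_per_size(20, 0.01, 10, 50, 1, 100)
    assert plan[1] == 10   # 1% of 20 is below the minimum of 10
    assert plan[10] == 50  # 1% of 184756 is capped at the maximum of 50
    assert plan[20] == 1   # only one combination of all 20 organisms exists
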

The curves represent the evolution of the size of the partitions as more and more organisms are added to the pangenome. The solid lines connect the medians (crosses) of the resampling distribution, while the shaded areas represent the interquartile ranges. Finally, a regression curve fitting Heaps' law (:math:`F = \kappa N^{\gamma}`) is drawn.
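
For readers who want to reproduce such a regression from their own numbers, Heaps' law becomes a straight line after taking logarithms: :math:`\log F = \log \kappa + \gamma \log N`. The sketch below fits it by ordinary least squares in log-log space. This is a generic technique and not necessarily the fitting procedure used by the tool; the function name is hypothetical.

.. code:: python

    import math

    def fit_heaps_law(n_values, f_values):
        """Estimate (kappa, gamma) of F = kappa * N**gamma by log-log regression."""
        xs = [math.log(n) for n in n_values]
        ys = [math.log(f) for f in f_values]
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        gamma = sxy / sxx                             # slope of the line
        kappa = math.exp(mean_y - gamma * mean_x)     # exp(intercept)
        return kappa, gamma

    # Synthetic data generated from known parameters (not real results).
    sizes = list(range(1, 51))
    counts = [1000 * n ** 0.3 for n in sizes]
    kappa, gamma = fit_heaps_law(sizes, counts)
    assert math.isclose(kappa, 1000, rel_tol=1e-6)
    assert math.isclose(gamma, 0.3, rel_tol=1e-6)
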


.. image:: images/evolution.png

Projection (-pr option)
---------------------------

It is possible to project the pangenome against one organism in order to visualize persistent, shell and cloud regions on this genome. Moreover, the number of neighbors of each gene family in the pangenome is projected to identify hotspots of recombination. To use this feature, provide the '-pr' option followed by the positions of the organisms to process (positions in the ORGANISMS_FILE), or 0 to process all organisms.

.. code:: bash

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -pr 1 7 9

will project the information about the pangenome (degrees of nodes and partitions) against the organisms 1, 7 and 9.

The internal layer reports the contigs, the grey intermediate layer reports the homologous genes, and the third layer shows the partition of the gene families of the organism. The hairy external layer shows the number of neighboring families belonging to each partition of the pangenome. The black line gives the location of the origin of replication if the dnaA gene is found.

.. image:: images/projection.png


Metadata (-mt option)
---------------------------
It is possible to add metainformation to the pangenome graph. This information must be associated with each organism via a METADATA_FILE. During the construction of the graph, the metainformation about the organisms is used to label the covered edges.

METADATA_FILE is a tab-delimited file. The first line contains the names of the attributes and the following lines contain the associated information for each organism (in the same order as in the ORGANISMS_FILE).
::

phylogroup assembly
D complete
A complete
B2 complete
B1 complete
B2 complete
C complete
B2 complete
B2 complete
C complete
B2 complete
A complete
A complete
A complete
A complete
A complete
A complete
A complete
A complete
A complete
...

.. code:: bash

ppanggolin --organisms ORGANISMS_FILE --gene_families FAMILIES_FILE --output_directory OUTPUT_DIR -mt METADATA_FILE

will add the labels "phylogroup" and "assembly" to each edge of the partitioned pangenome graph. When an edge encompasses several organisms having different values for the same label, the values are sorted and merged (separated by a '|').
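
The merging of values on an edge can be illustrated with the short sketch below. The function name is hypothetical, and collapsing identical values before sorting is an assumption drawn from the wording "different values".

.. code:: python

    def merge_edge_labels(attribute_names, organism_rows):
        """Merge the metadata of the organisms covering one edge.

        `organism_rows` holds one list of values per organism, in the
        same column order as `attribute_names`.
        """
        merged = {}
        for index, name in enumerate(attribute_names):
            # Assumption: identical values are kept once, then sorted.
            values = sorted({row[index] for row in organism_rows})
            merged[name] = "|".join(values)
        return merged

    # First three organisms of the METADATA_FILE example above.
    rows = [["D", "complete"], ["A", "complete"], ["B2", "complete"]]
    labels = merge_edge_labels(["phylogroup", "assembly"], rows)
    assert labels == {"phylogroup": "A|B2|D", "assembly": "complete"}
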


Frequently Asked Questions
============================
