An integrative pipeline for family-level super-pangenome analysis across coding and noncoding sequences.

panCG pipeline

Fig. 1. Overview of the panCG pipeline including panCNS, pangene, and CG modules.

(a) Workflow of the panCNS module. (1) Multiple genome alignment: Input multiple genomes are used for reference-free multiple genome alignment via Progressive Cactus. (2) CNS identification: Each genome is individually designated as the reference genome to generate whole-genome alignments. PhastCons is then employed to identify conserved sequences; conserved sequences overlapping with CDSs are filtered out, yielding CNS regions for each genome. (3) Homologous group identification: Homologous CNS groups are identified based on the aforementioned multiple genome alignments and pairwise CNS comparisons. (4) Synteny cluster construction: Undirected CNS networks are constructed based on syntenic relationships of CNSs between species; these are connected networks. Rectangular nodes represent CNSs, with different colors indicating different species. Edges represent homologous relationships: green for CMR, gray for synteny, and red for best-hit relationships. (5) Index assignment: Members of each synteny network are assigned a unique index. (6) CNS retrieval: For CNSs missing from the index, their CMR PhastCons scores are evaluated. Those with scores exceeding the threshold and no overlap with CDSs are added to the CNS index and labeled “recall-CNS”; CNSs with scores above the threshold but overlapping with CDS are labeled “recall-CDS”; and those with scores below the threshold are designated as “recall-nonCE”. (7) Index retrieval and reassignment: CNSs retrieved in the previous step are incorporated into the index, and best-hit information is used to reassign indices to singleton CNSs. Finally, each CNS has a unique index and a reference-free panCNS is obtained.
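The three labels used in the CNS retrieval step (6) can be sketched as a small decision function. This is only an illustration of the rule described above, not panCG's own code: the function name, the example threshold of 0.5, and the treatment of a score exactly equal to the threshold are assumptions.

```python
def classify_recall(score, overlaps_cds, threshold):
    """Label a region that is missing from the CNS index.

    score        -- PhastCons score of the region (float)
    overlaps_cds -- True if the region overlaps an annotated CDS
    threshold    -- score cutoff (hypothetical value chosen by the caller)
    """
    # Assumption: a score equal to the threshold is treated as "not exceeding" it.
    if score <= threshold:
        return "recall-nonCE"
    if overlaps_cds:
        return "recall-CDS"
    return "recall-CNS"


if __name__ == "__main__":
    for score, cds in [(0.9, False), (0.9, True), (0.2, False)]:
        print(score, cds, classify_recall(score, cds, threshold=0.5))
```

Only regions labeled "recall-CNS" would be added back to the CNS index in step (7).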

(b) Workflow of the pangene module. (1) Ortholog group identification: OrthoFinder is used to identify homolog groups. Circles represent genes, with colors distinguishing homologous genes from different species. Genes in gray ovals belong to the same gene group. (2) CPM clustering: Synteny networks are constructed for genes in each group, and genes are further clustered using the clique percolation method (CPM). Nodes represent genes, and gray edges represent syntenic relationships between genes; the set of genes enclosed by the dashed line denotes a gene cluster identified via CPM. (3) Network expansion: For genes lacking synteny, best-hit information is used to extend the gene synteny network. Red edges indicate homologous best-hit gene pairs. (4) Index assignment: A unique gene index is assigned to genes in each cluster. (5) Tree based reassignment: For gene indices containing paralogous genes, the phylogenetic relationships between genes are considered to further refine index assignments. Finally, each gene has a unique index, and a reference-free pangene is obtained.
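To make the CPM clustering step (2) concrete, here is a minimal, brute-force sketch of the clique percolation method on a toy synteny network. It is not panCG's implementation: the gene names are invented, and the clique size k = 3 is an assumption, since the description above does not state which k is used. Two k-cliques belong to the same community when they share k - 1 nodes.

```python
from itertools import combinations


def k_clique_communities(edges, k=3):
    """Return CPM communities (sets of nodes) for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Enumerate every k-clique by brute force (fine for small toy networks).
    cliques = [
        frozenset(nodes)
        for nodes in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(nodes, 2))
    ]

    # Union-find over cliques: merge two cliques that share k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=sorted)


# Toy network: syntenic gene pairs (hypothetical gene IDs).
toy_edges = [
    ("A1", "B1"), ("B1", "C1"), ("A1", "C1"),  # triangle A1-B1-C1
    ("B1", "D1"), ("C1", "D1"),                # triangle B1-C1-D1 shares an edge with it
    ("D1", "E1"),                              # E1 hangs off by a single edge
    ("X1", "Y1"), ("Y1", "Z1"), ("X1", "Z1"),  # separate triangle
]

if __name__ == "__main__":
    for community in k_clique_communities(toy_edges, k=3):
        print(sorted(community))
```

In this toy network E1 is connected by a single edge, so it belongs to no 3-clique and stays outside every community. Genes in that situation correspond to those lacking sufficient synteny support, which the network expansion step (3) tries to place using best-hit information.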

(c) Workflow of the CG module. (1) CNS-gene colocalization analysis: CNSs located in the upstream and downstream regions of each gene are extracted to form a CNS set. Based on CNS-gene colocalization patterns across species, we define Conserved Gene and Noncoding sequence Modules (CGNMs) as sets of co-localized CNSs and genes within the same index that are conserved in at least two species. Closely spaced CGNMs are further grouped into Conserved Gene and Noncoding Blocks (CGNBs). (2) Synteny network construction: CNS and gene sets corresponding to each gene index are independently derived from the panCNS and pangene modules, and used to construct CNS synteny networks and gene synteny networks, respectively. (3) Gene-CNS network construction: A unified network for panCNSs and pangenes is generated by merging CNS and gene synteny networks, which captures both collinearity and potential regulatory relationships among all genes and CNSs.
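The first part of the colocalization analysis, collecting the CNSs that flank a gene, can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: the function name, the tuple layouts, the strand-aware definition of upstream and downstream, and the 2000 bp flank are all hypothetical. Coordinates are 0-based, half-open intervals as in BED files.

```python
def flanking_cns(gene, cns_records, flank=2000):
    """Collect CNS IDs lying within `flank` bp upstream or downstream of a gene.

    gene        -- (chrom, start, end, strand)
    cns_records -- iterable of (chrom, start, end, cns_id)
    flank       -- flank size in bp (hypothetical default)
    """
    chrom, start, end, strand = gene
    left = (max(0, start - flank), start)   # region before the gene on the + axis
    right = (end, end + flank)              # region after the gene on the + axis
    # On the minus strand, upstream lies to the right of the gene.
    upstream, downstream = (left, right) if strand == "+" else (right, left)

    def hits(region):
        lo, hi = region
        return [cid for c, s, e, cid in cns_records
                if c == chrom and s < hi and e > lo]

    return {"upstream": hits(upstream), "downstream": hits(downstream)}


if __name__ == "__main__":
    gene = ("chr1", 10000, 12000, "+")
    cns = [
        ("chr1", 9000, 9100, "cns1"),
        ("chr1", 12500, 12600, "cns2"),
        ("chr1", 20000, 20100, "cns3"),   # too far away
        ("chr2", 9000, 9100, "cns4"),     # different chromosome
    ]
    print(flanking_cns(gene, cns))
```

Comparing such per-gene CNS sets between species, for genes that share a gene index, is the kind of comparison on which the CGNM definition above rests.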

Dependencies

  1. halLiftover in cactus

  2. phast

  3. JCVI

  4. UCSC: mafFilter, mafSplit, wigToBigWig

wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/mafFilter
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/mafSplit
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
  5. orthofinder

  6. blast

  7. diamond

Install

Make sure the above dependencies are installed and added to PATH.
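A quick way to check the PATH requirement before installing is a small script like the one below. It is a sketch, not part of panCG: the executable names in the list are assumptions about how each dependency is usually invoked (for example `phastCons` for phast and `blastp` for blast), so adjust them to your installation. JCVI is a Python package rather than a single executable, so it is left out of this check.

```python
import shutil

# Assumed executable names for the dependencies listed above.
REQUIRED_TOOLS = [
    "halLiftover",
    "phastCons",
    "mafFilter",
    "mafSplit",
    "wigToBigWig",
    "orthofinder",
    "blastp",
    "diamond",
]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The UCSC binaries downloaded with wget also need to be made executable (`chmod +x`) before they can be found this way.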

pip install panCG
panCG -h

Usage

usage: panCG [-h] [--version]  ...

    an integrative pipeline for family-level super-pangenome analysis across coding and noncoding sequences.

optional arguments:
  -h, --help     show this help message and exit
  --version      show program's version number and exit

Commands:

    callCns      Identification of CNS
    pangene      build gene index
    pancns       build CNS index
    GenePavAsso  Associating gene-PAVs with phenotypes between species
    GLSS         Identification of Gene lineage-specific Synteny networks
    CLSS         Identification of CNS lineage-specific Synteny networks
    CnsGeneLink  According to the relative position relationship between CNS and gene and the maximum number of species supported by CNS index and gene index, CNS index and gene index are linked.
    CnsSyntenyNet
                 Used to construct SyntenyNet for filtered pan-CNS

Input file format requirements

  1. Chromosome IDs in the genome files may contain only letters, digits, and "_". Special characters such as ":", "-", and "," are not allowed.
  2. The GFF annotation file should ideally contain only gene, mRNA, exon, CDS, and UTR features. Each gene feature must have an ID attribute, and every other feature must have a Parent attribute.
  3. The gene BED file must be a standard 6-column BED file: <chrID> <start> <end> <geneID> <score/0> <chain>, where <chain> is the strand.
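The chromosome ID and BED rules above are easy to check before starting a run. The following is a minimal sketch of such a check, not a tool shipped with panCG. The function names are hypothetical, and two details are assumptions: columns are tab-separated, and the strand is either "+" or "-".

```python
import re

# Only letters, digits, and "_" are allowed in chromosome IDs.
CHROM_ID_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_chrom_id(chrom_id):
    """True if the ID consists solely of letters, digits, and underscores."""
    return CHROM_ID_RE.fullmatch(chrom_id) is not None


def check_bed6_line(line):
    """Return a list of problems found in one gene BED line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return [f"expected 6 tab-separated columns, found {len(fields)}"]
    chrom, start, end, gene_id, score, strand = fields
    problems = []
    if not is_valid_chrom_id(chrom):
        problems.append(f"chromosome ID {chrom!r} contains disallowed characters")
    if not (start.isdecimal() and end.isdecimal()):
        problems.append("start and end must be non-negative integers")
    elif int(start) >= int(end):
        problems.append("start must be smaller than end")
    if strand not in {"+", "-"}:  # assumption: genes always carry a strand
        problems.append(f"unexpected strand {strand!r}")
    return problems


if __name__ == "__main__":
    for example in ["chr1\t100\t200\tgeneA\t0\t+", "chr1:A\t100\t200\tgeneA\t0\t+"]:
        print(repr(example), "->", check_bed6_line(example) or "OK")
```

Running `check_bed6_line` over every line of a BED file and stopping at the first non-empty result is usually enough to catch formatting problems early.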

Output

CNS calling

Directory                  File suffix           Description
{Workdir}/03-phastCons/    {species}.all.bw      PhastCons conservation scores (bigWig)
{Workdir}/03-phastCons/    {species}.CNSs.bed    CNS file of {species}

panCNS

Directory               File suffix           Description
{Workdir}/Ref_{ref}_    .panGene.final.csv    The output panCNS file; each line represents one index

pangene

Directory                       File suffix     Description
{Workdir}/Ref_{ref}_IndexDir    .panGene.csv    The resulting pangene file

The Group column gives the homology group identified by OrthoFinder.

Group column           Description
OGxxxxxxx.x            A gene index obtained by subdividing the homology group
OGxxxxxxx.x.Un         The .Un suffix marks a set of genes that still stand alone in a single species after CPM
OGxxxxxxx.x.tree_x     A gene index further subdivided according to the phylogenetic relationships of its genes
OGxxxxxxx.x.tree_Un    The .tree_Un suffix marks a gene set that could not be classified using phylogenetic relationships
UnMapOGXXXXXXX.x       The UnMap prefix marks genes that OrthoFinder did not assign to any group
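When post-processing the pangene table, it can help to sort Group labels into the categories above. The sketch below does this with a regular expression. It is not part of panCG, and it assumes that every "x" in the patterns stands for one or more digits; the function name and the category names it returns are hypothetical.

```python
import re

# Assumption: each "x" placeholder in the documented patterns is a run of digits.
GROUP_RE = re.compile(
    r"(?P<unmap>UnMap)?"
    r"(?P<og>OG\d+)\.(?P<sub>\d+)"
    r"(?:\.(?P<suffix>Un|tree_(?:\d+|Un)))?"
)


def describe_group(label):
    """Map a Group label to a short category name."""
    match = GROUP_RE.fullmatch(label)
    if match is None:
        return "unrecognized"
    if match.group("unmap"):
        return "unclustered-by-orthofinder"
    suffix = match.group("suffix")
    if suffix is None:
        return "cpm-index"
    if suffix == "Un":
        return "cpm-single-species"
    if suffix == "tree_Un":
        return "tree-unclassified"
    return "tree-index"


if __name__ == "__main__":
    for label in ["OG0000012.1", "OG0000012.1.Un", "OG0000012.1.tree_2",
                  "OG0000012.1.tree_Un", "UnMapOG0000012.1"]:
        print(label, "->", describe_group(label))
```

Labels that do not fit any documented pattern come back as "unrecognized", so unexpected values are not silently placed in a category.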

Quick start

Example data for testing can be downloaded from figshare.

cactus

nohup /usr/bin/time -v cactus jobstore species.22way.info.txt Citrus.7ways.test_data.hal \
   --realTimeLogging True \
   --workDir /home/xxx/cactus_dir \
   --maxCores 16 --maxMemory 100G --maxDisk 200G > Citrus.7ways.cactus.log 2>&1 &
   
nohup /usr/bin/time -v cactus-hal2maf jobstore Citrus.7ways.test_data.hal C_sinensis.7ways.maf \
    --refGenome C_sinensis \
    --chunkSize 10000000 \
    --noAncestors \
    --dupeMode single \
    --workDir /home/xxx/cactus_dir > C_sinensis.hal2maf.single.log 2>&1 &

call CNS

for i in C_sinensis C_limon ponkan C_australasica C_glauca F_hindsii A_buxifolia
do
    /usr/bin/time -v panCG callCns \
        -c /home/ltan/Tmp/01-PanCNSGene_test_data/panCG/Example/CNScalling.config.yaml \
        -w /home/ltan/Tmp/01-PanCNSGene_test_data/01-callcns/${i} \
        -r ${i} > ${i}.callCns.log 2>&1
done

pangene

nohup /usr/bin/time -v panCG pangene \
    -c /home/ltan/Tmp/01-PanCNSGene_test_data/panCG/Example/panCG.config.yaml \
    -w /home/ltan/Tmp/01-PanCNSGene_test_data/02-pangene \
    -r C_sinensis > pangene.log 2>&1 &

panCNS

nohup /usr/bin/time -v panCG pancns \
    -c /home/ltan/Tmp/01-PanCNSGene_test_data/panCG/Example/panCG.config.yaml \
    -w /home/ltan/Tmp/01-PanCNSGene_test_data/03-pancns \
    -r C_sinensis \
    -W /home/ltan/Tmp/01-PanCNSGene_test_data/02-pangene \
    > pancns.log 2>&1 &

Citation
