
Python package for probe-based gene cluster finding in large microbial genome database


pyGCAP: a (py)thon (G)ene (C)luster (A)nnotation & (P)rofiling

A Python Package for Probe-based Gene Cluster Finding in Large Microbial Genome Database


Introduction

Bacterial gene clusters provide insights into metabolism and evolution, and facilitate biotechnological applications. We developed pyGCAP, a Python package for probe-based gene cluster discovery. The pipeline uses sequence search and analysis tools (e.g. BLAST and MMseqs2) and public databases (e.g. UniProt and NCBI) to predict potential gene clusters from user-provided probe genes. We tested the pipeline with the division and cell wall (dcw) gene cluster, which is crucial for cell division and peptidoglycan biosynthesis.

To evaluate pyGCAP, we used 17 major dcw genes defined by Megrian et al. [1] as a probe set to search for gene clusters in 716 Lactobacillales genomes. The results were integrated to provide detailed information on gene content, gene order, and types of clusters. While PGCfinder examined the completeness of the gene clusters, it could also suggest novel taxa-specific accessory genes related to dcw clusters in Lactobacillales genomes. The package will be freely available on the Python Package Index, Bioconda, and GitHub.

[1] Megrian, D., et al. Ancient origin and constrained evolution of the division and cell wall gene cluster in Bacteria. Nat Microbiol 7, 2114–2127 (2022).


Pipeline-flow

flowchart


Prerequisites

  1. Python >= 3.6

  2. conda environment

    • blast (bioconda blast package)

      conda install bioconda::blast
      conda install bioconda/label/cf201901::blast
      
    • datasets & dataformat from NCBI (conda-forge ncbi-datasets-cli package)

      conda install conda-forge::ncbi-datasets-cli
      
    • MMseqs2 (MMseqs2 github)

      conda install -c conda-forge -c bioconda mmseqs2
      
    • If you want to create a new conda environment for pygcap, follow the steps below:

      conda create -n pygcap "python>=3.6" pip
      conda activate pygcap
      pip install pygcap
      conda install bioconda::blast
      conda install -c conda-forge ncbi-datasets-cli
      conda install -c conda-forge -c bioconda mmseqs2
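
Before running the pipeline, it can help to confirm that the external tools are visible on the PATH of the active environment. The following is a minimal sketch, not part of pygcap itself: the helper name `missing_tools` is hypothetical, and the executable names are assumed to be the ones normally installed by the conda packages listed above (`blastp` and `makeblastdb` from blast, `mmseqs` from MMseqs2, `datasets` and `dataformat` from ncbi-datasets-cli).

```python
import shutil

# Executables assumed to be installed by the conda packages listed above.
REQUIRED_TOOLS = ["blastp", "makeblastdb", "mmseqs", "datasets", "dataformat"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All external tools were found.")
```

If any tool is reported as missing, re-check that the conda environment is activated before installing it again.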
      

Usage

  • pygcap on PyPI

    # pip install pygcap
    pygcap [TAXON] [PROBE_FILE]
    
  • input argument description

    ### usage example
    pygcap Facklamia pygcap/data/probe_sample.tsv
    pygcap 66831 pygcap/data/probe_sample.tsv
    
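    1. taxon: either a taxon name or a taxid is accepted

    2. path to the probe file (see the sample file pygcap/data/probe_sample.tsv), a tab-separated table with the following columns: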

      • Probe Name (user defined)
      • Prediction (user defined)
      • Accession (UniProt entry)
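
A probe file can be sanity-checked before a long run. The sketch below assumes that the file is tab-separated with exactly the three columns listed above and that its first row is a header; neither assumption is confirmed here, so compare against the bundled sample file. The function name `read_probes` and the placeholder values are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = 3  # Probe Name, Prediction, Accession


def read_probes(handle, has_header=True):
    """Parse a tab-separated probe table into a list of dicts.

    Assumes three columns per row; raises ValueError otherwise.
    """
    reader = csv.reader(handle, delimiter="\t")
    # Ignore completely empty lines.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if has_header and rows:
        rows = rows[1:]
    probes = []
    for number, row in enumerate(rows, start=1):
        if len(row) != EXPECTED_COLUMNS:
            raise ValueError(
                f"probe row {number}: expected {EXPECTED_COLUMNS} columns, got {len(row)}"
            )
        name, prediction, accession = (cell.strip() for cell in row)
        if not accession:
            raise ValueError(f"probe row {number}: empty accession")
        probes.append(
            {"name": name, "prediction": prediction, "accession": accession}
        )
    return probes


if __name__ == "__main__":
    # Placeholder content only; real files use UniProt entries as accessions.
    sample = (
        "Probe Name\tPrediction\tAccession\n"
        "probe_A\texample annotation\tACCESSION_PLACEHOLDER\n"
    )
    print(read_probes(io.StringIO(sample)))
```

To check a real file, open it with `open(path, newline="")` and pass the handle to `read_probes`.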

Options

  1. --working_dir or -w (default: .): Specify the working directory path.

    pygcap [TAXON] [PROBE_FILE] --working_dir or -w [PATH_OF_WORKING_DIRECTORY]
    
  2. --thread or -t (default: 50): Number of threads to use when running MMseqs2 and blastp. The number of threads can be adjusted automatically based on the CPU environment. It must be an integer greater than 0.

    pygcap [TAXON] [PROBE_FILE] --thread or -t [NUMBER_OF_THREAD]
    
  3. --identity or -i (default: 0.5): Protein sequence identity threshold used by MMseqs2. It must be a value between 0 and 1.

    pygcap [TAXON] [PROBE_FILE] --identity or -i [PROTEIN_IDENTITY]
    
  4. --max_target_seqs or -m (default: 500): Maximum number of aligned sequences to retain in the overall BLASTP results. It must be an integer greater than 0.

    pygcap [TAXON] [PROBE_FILE] --max_target_seqs or -m [MAX_TARGET]
    
  5. --skip or -s (default: none): Specify steps to skip during the process. Multiple steps can be skipped by using this option multiple times.

    pygcap [TAXON] [PROBE_FILE] --skip or -s [ARG]
    
    • all: Skip all the processes listed below.
    • ncbi: Skip downloading genome data from NCBI.
    • mmseqs2: Skip running MMseqs2.
    • parsing: Skip parsing genome data.
    • uniprot: Skip downloading probe data from UniProt.
    • blastdb: Skip running makeblastdb.
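
When pygcap is launched from a script, the constraints listed above can be checked before the command is started. This is a rough sketch built only from the option descriptions in this section: the helper `build_pygcap_command` is hypothetical and not part of the package, and treating the identity range as inclusive of 0 and 1 is an assumption.

```python
# Skip values taken from the list above.
SKIP_CHOICES = {"all", "ncbi", "mmseqs2", "parsing", "uniprot", "blastdb"}


def _require_positive_int(name, value):
    """Reject anything that is not an integer greater than 0."""
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"{name} must be an integer greater than 0")


def build_pygcap_command(taxon, probe_file, working_dir=".", thread=50,
                         identity=0.5, max_target_seqs=500, skip=()):
    """Return an argument list for the pygcap CLI after validating the options."""
    _require_positive_int("thread", thread)
    _require_positive_int("max_target_seqs", max_target_seqs)
    # Assumption: the bounds 0 and 1 themselves are accepted.
    if not 0 <= identity <= 1:
        raise ValueError("identity must be between 0 and 1")
    unknown = [step for step in skip if step not in SKIP_CHOICES]
    if unknown:
        raise ValueError(f"unknown skip step(s): {', '.join(unknown)}")

    command = [
        "pygcap", str(taxon), str(probe_file),
        "--working_dir", str(working_dir),
        "--thread", str(thread),
        "--identity", str(identity),
        "--max_target_seqs", str(max_target_seqs),
    ]
    # --skip is repeated once per step, as described above.
    for step in skip:
        command += ["--skip", step]
    return command


if __name__ == "__main__":
    print(build_pygcap_command("Facklamia", "pygcap/data/probe_sample.tsv",
                               thread=8, skip=["ncbi", "uniprot"]))
```

The returned list can be passed to `subprocess.run(command, check=True)`; building a list rather than a shell string avoids quoting problems with paths that contain spaces.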

(WIP) Output

  • A directory named after the input TAXON, with the following structure, will be created in your working directory.

    📦 [TAXON_NAME]
    ├─ data
    │  ├─ assembly_report.tsv
    │  ├─ metadata_target.tsv
    │  └─ ...
    ├─ input
    │  ├─ [GENUS_01]
    │  ├─ [GENUS_02]
    │  └─ ...
    ├─ output
    │  ├─ genus
    │  ├─ img
    │  └─ tsv
    └─ seqlib
       ├─ blast_output.tsv
       ├─ seqlib.tsv
       └─ ...
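
After a run, the layout above can be checked programmatically. The sketch below only looks for the paths that are named explicitly in the tree; since this section is still a work in progress, the actual layout may differ, and the helper `missing_outputs` is hypothetical.

```python
from pathlib import Path

# Paths named explicitly in the tree above, relative to the [TAXON_NAME] directory.
EXPECTED_PATHS = [
    "data/assembly_report.tsv",
    "data/metadata_target.tsv",
    "input",
    "output/genus",
    "output/img",
    "output/tsv",
    "seqlib/blast_output.tsv",
    "seqlib/seqlib.tsv",
]


def missing_outputs(taxon_dir, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under taxon_dir."""
    root = Path(taxon_dir)
    return [relative for relative in expected if not (root / relative).exists()]


if __name__ == "__main__":
    # "Facklamia" is the taxon used in the usage example above.
    for relative in missing_outputs("Facklamia"):
        print("missing:", relative)
```

An empty result means every listed file and directory was found; it does not say anything about the entries elided as "..." in the tree.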
    

(WIP) Example

  • Profiling dcw genes from pan-genomes of Lactobacillales (LAB)
