Skip to main content

Package & software for analysis of nucleotide redundancy within CDS-based pan-genome analyses

Project description

panqc logo

License: MIT Static Badge

A pan-genome quality control toolkit for evaluating nucleotide redundancy in pan-genome analyses.

Table of Contents

Motivation

PanQC_NRC_Diagram

The panqc Nucleotide Redundancy Correction (NRC) pipeline adjusts for redundancy at the DNA level within pan-genome estimates in two steps. In step one, all genes predicted to be absent at the Amino Acid (AA) level are compared to their corresponding assembly at the nucleotide level. In cases where the nucleotide sequence is found with high coverage and sequence identity (Query Coverage & Sequence Identity > 90%), the gene is marked as “present at the DNA level”. Next, all genes are clustered and merged using a k-mer based metric of nucleotide similarity. Cases where two or more genes are divergent at the AA level but highly similar at the nucleotide level will be merged into a single “nucleotide similarity gene cluster”. After applying this method the pan-genome gene presence matrix is readjusted according to these results.

Installation

pip

pip install panqc

Install locally

Currently, panqc can be installed by cloning this repository and installing with pip.

git clone git@github.com:maxgmarin/panqc.git

cd panqc

pip install . 

conda

🚧 Check back soon 🚧

Basic usage

panqc nrc -a InputAsmPaths.tsv -r pan_genome_reference.fa -m gene_presence_absence.csv -o NRC_results/

The above command will output an adjusted gene presence absence matrix along with additional statistics to the specified output directory (NRC_results/).

Alternatively, if you would like to use a gene_presence_absence.Rtab file instead of a CSV matrix of gene presence, use add the --is-rtab flag.

panqc nrc -a InputAsmPaths.tsv -r pan_genome_reference.fa -m gene_presence_absence.Rtab --is-rtab -o NRC_results/

Analyzing included test data set

If you wish to run an panqc nrc on an artifical (and abridged) test data set, you can run the following commands:

cd tests/data

# Define path to the 3 needed input files:

# 1) Gene presence absence matrix (As output by Panaroo or Roary)
PG_Matrix_CSV="TestSet1.gene_presence_absence.csv"

# 2) Pan-genome nucleotide reference (As output by Panaroo or Roary)
PG_Ref_FA="TestSet1.pan_genome_reference.fa.gz"

# 3) SampleID + Path for all assemblies used in analysis
Asm_TSV="TestSet1.InputAsmPaths.tsv"

time panqc nrc -a ${Asm_TSV} -r ${PG_Ref_FA} -m ${PG_Matrix_CSV} -o test_results/

NOTE: Make sure that your current working directory (CWD) is tests/data within the repository. The TestSet1.InputAsmPaths.tsv file describes the path to each genome assembly relative to your CWD.

Full usage

panqc has 2 sub-commands:

  • nrc - Run the full panqc Nucleotide Redundancy Correction pipeline on a pan-genome analyses.
  • utils - Run utlity scripts and sub-pipelines of the full panqc NRC pipeline

panqc nrc

Run the complete panqc Nucleotide Redundancy Correction (NRC) pipeline

$ panqc nrc --help

usage: panqc nrc [-h] -a PathToAsms.tsv -r pan_genome_reference.fasta -m gene_presence_absence.csv -o RESULTS_DIR [-p PREFIX] [-c MIN_QUERY_COV] [-i MIN_SEQ_ID] [-k KMER_SIZE] [-t MIN_KSIM]

optional arguments:
  -h, --help            show this help message and exit

  -a, --asms PathToAsms.tsv
                        Table with SampleID & Paths to each input assemblies. (TSV)

  -r, --pg-ref pan_genome_reference.fasta
                        Input pan-genome nucleotide reference. Typically output as `pan_genome_reference.fasta` by Panaroo/Roary (FASTA)

  -m, --gene_matrix gene_presence_absence.csv
                        Input pan-genome gene presence/absence matrix. By default is assumed to be a `gene_presence_absence.csv` output by Panaroo/Roary (CSV) If the user provides the --is-rtab flag, the input is assumed to be an .Rtab (TSV)file.

  -o, --results_dir RESULTS_DIR
                        Output directory for analysis results.

  -p, --prefix PREFIX
                        Prefix to append to output files

  -c, --min-query-cov MIN_QUERY_COV
                        Minimum query coverage (ranging from 0 to 1) to classify a gene as present within an assembly (Default: 0.9)

  -i, --min-seq-id MIN_SEQ_ID
                        Minimum sequence identity (ranging from 0 to 1) to classify a gene as present within an assembly (Default: 0.9)

  -k, --kmer_size KMER_SIZE
                        k-mer size (bp) to use for generating profile of unique k-mers for each sequence (Default: 31))

  -t, --min-ksim MIN_KSIM
                        Minimum k-mer similarity (maximum jaccard containment of k-mers between pair of sequences) to cluster sequences into the same "nucleotide similarity cluster" (Default: 0.8))
  --is-rtab             Flag indicating that the input gene matrix is a tab-delimited .Rtab file

panqc utils

Within utils there are 3 sub-commands that run specific components of the panqc NRC pipeline:

  • utils asmseqcheck - Perform alignment of all genes classified as absent to their respective assemblies.
  • utils ava - Perform all vs all comparison of k-mer profiles of input sequences.
  • utils nscluster - Perform nucleotide similarity clustering and readjust pan-genome estimates.
$ panqc utils --help

usage: panqc utils [-h] {asmseqcheck,ava,nscluster} ...

positional arguments:
  {asmseqcheck,ava,nscluster}
                        Please select one of the utilility pipelines of the panqc toolkit.
    asmseqcheck
    ava
    nscluster

optional arguments:
  -h, --help            show this help message and exit

🚧 Check back soon for full usage for each of the utility sub-pipelines of the panqc toolkit 🚧

Contributing and Issues

🚧 Check back soon 🚧

Citing

🚧 Check back soon 🚧

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

panqc-0.0.4.tar.gz (199.2 kB view details)

Uploaded Source

Built Distribution

panqc-0.0.4-py3-none-any.whl (20.5 kB view details)

Uploaded Python 3

File details

Details for the file panqc-0.0.4.tar.gz.

File metadata

  • Download URL: panqc-0.0.4.tar.gz
  • Upload date:
  • Size: 199.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for panqc-0.0.4.tar.gz
Algorithm Hash digest
SHA256 9fb6821301e124f252fc0fc3682e3f75f4bea649817157164ddaca8850b21b51
MD5 b2b5f2694e6b22cfef63d08d0435f53c
BLAKE2b-256 4b1ac9f53704469ac66ea9215b7635647cfe0afdfcb583a1e6a0d39989255c41

See more details on using hashes here.

File details

Details for the file panqc-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: panqc-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 20.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for panqc-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 20a96d97469bf08d0b638718e2c6240a10dcda12922a4d6680549fb71d44afc1
MD5 d09f198535fc6d74409ca9cfda94a59a
BLAKE2b-256 e01a3b2af7d7675490b450f7148c442e1df3a306727603a09b6b941398b56767

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page