panqc

Package & software for analysis of nucleotide redundancy within CDS-based pan-genome analyses

A pan-genome quality control toolkit for evaluating nucleotide redundancy in pan-genome analyses.

Motivation
Installation
Basic usage
- Analyzing included test data set
Full usage
- nrc
- utils
Contributing and Issues
Citing

Motivation

PanQC_NRC_Diagram

The panqc Nucleotide Redundancy Correction (NRC) pipeline adjusts pan-genome estimates for redundancy at the DNA level in two steps. In step one, every gene predicted to be absent at the amino acid (AA) level is compared to its corresponding assembly at the nucleotide level. If the nucleotide sequence is found with high query coverage and sequence identity (both > 90%), the gene is marked as “present at the DNA level”. In step two, all genes are clustered and merged using a k-mer based metric of nucleotide similarity: two or more genes that are divergent at the AA level but highly similar at the nucleotide level are merged into a single “nucleotide similarity gene cluster”. Finally, the pan-genome gene presence matrix is readjusted according to these results.
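
As a rough illustration of the step-one decision rule described above, the sketch below marks a gene as present at the DNA level when an alignment clears both thresholds. `AlignmentHit` and `is_present_at_dna_level` are hypothetical names chosen for this example and are not part of the panqc API. The 0.9 thresholds reflect the 90% cutoff mentioned above, and treating a value exactly at the threshold as passing (`>=`) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AlignmentHit:
    """Hypothetical summary of one gene-to-assembly nucleotide alignment."""
    gene_id: str
    query_cov: float  # fraction of the gene sequence covered (0 to 1)
    seq_id: float     # fraction of identical bases in the alignment (0 to 1)


def is_present_at_dna_level(hit, min_query_cov=0.9, min_seq_id=0.9):
    # Both thresholds must be met for the gene to count as present.
    return hit.query_cov >= min_query_cov and hit.seq_id >= min_seq_id


hits = [
    AlignmentHit("geneA", query_cov=0.98, seq_id=0.99),  # well covered, near identical
    AlignmentHit("geneB", query_cov=0.55, seq_id=0.97),  # only a partial match
]
calls = {h.gene_id: is_present_at_dna_level(h) for h in hits}
print(calls)
```

Here `geneA` would be reclassified as present, while `geneB` stays absent because only about half of its sequence is covered.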

Installation

`pip`

pip install panqc

Install locally

Alternatively, panqc can be installed by cloning this repository and installing it with pip.

git clone git@github.com:maxgmarin/panqc.git

cd panqc

pip install .

`conda`

🚧 Check back soon 🚧

Basic usage

panqc nrc -a InputAsmPaths.tsv -r pan_genome_reference.fa -m gene_presence_absence.csv -o NRC_results/

The above command writes an adjusted gene presence/absence matrix, along with additional statistics, to the specified output directory (NRC_results/).

Alternatively, to use a gene_presence_absence.Rtab file instead of a CSV gene presence matrix, add the --is-rtab flag.

panqc nrc -a InputAsmPaths.tsv -r pan_genome_reference.fa -m gene_presence_absence.Rtab --is-rtab -o NRC_results/
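
For readers unfamiliar with the .Rtab layout, here is a minimal sketch of reading one in Python. It assumes the usual Roary/Panaroo layout: a tab-delimited table whose first column is the gene name, followed by one 0/1 column per sample. `read_rtab` is a hypothetical helper written for this example, not a panqc function, and the example matrix is made up.

```python
import csv
import io


def read_rtab(handle):
    """Parse a tab-delimited .Rtab matrix into {gene: {sample: 0 or 1}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column holds the gene name
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = dict(zip(samples, map(int, row[1:])))
    return matrix


# Made-up two-gene, two-sample matrix for illustration only.
example = "Gene\tSampleA\tSampleB\ngroup_1\t1\t0\ngroup_2\t1\t1\n"
matrix = read_rtab(io.StringIO(example))
print(matrix["group_1"])
```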

Analyzing included test data set

To run panqc nrc on an artificial (and abridged) test data set, run the following commands:

cd tests/data

# Define path to the 3 needed input files:

# 1) Gene presence absence matrix (As output by Panaroo or Roary)
PG_Matrix_CSV="TestSet1.gene_presence_absence.csv"

# 2) Pan-genome nucleotide reference (As output by Panaroo or Roary)
PG_Ref_FA="TestSet1.pan_genome_reference.fa.gz"

# 3) SampleID + Path for all assemblies used in analysis
Asm_TSV="TestSet1.InputAsmPaths.tsv"

time panqc nrc -a ${Asm_TSV} -r ${PG_Ref_FA} -m ${PG_Matrix_CSV} -o test_results/

NOTE: Make sure that your current working directory (CWD) is tests/data within the repository. The TestSet1.InputAsmPaths.tsv file describes the path to each genome assembly relative to your CWD.
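
Because the assembly paths are resolved relative to the CWD, a quick pre-flight check can catch a wrong working directory before the pipeline starts. The sketch below is a hypothetical helper, not part of panqc. It assumes the TSV has two tab-separated columns (sample ID, then path) and a header row; set `has_header=False` if your file has none.

```python
import csv
from pathlib import Path


def missing_assemblies(tsv_path, has_header=True):
    """Return (sample_id, path) pairs whose path does not exist from the CWD."""
    missing = []
    with open(tsv_path, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        if has_header:
            next(rows, None)  # skip the assumed header row
        for row in rows:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            sample_id, asm_path = row[0], row[1]
            if not Path(asm_path).exists():
                missing.append((sample_id, asm_path))
    return missing
```

Calling `missing_assemblies("TestSet1.InputAsmPaths.tsv")` from within tests/data should return an empty list if every listed assembly can be found.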

Full usage

panqc has 2 sub-commands:

nrc - Run the full panqc Nucleotide Redundancy Correction pipeline on a pan-genome analysis.
utils - Run utility scripts and sub-pipelines of the full panqc NRC pipeline.

`panqc nrc`

Run the complete panqc Nucleotide Redundancy Correction (NRC) pipeline

$ panqc nrc --help

usage: panqc nrc [-h] -a PathToAsms.tsv -r pan_genome_reference.fasta -m gene_presence_absence.csv -o RESULTS_DIR [-p PREFIX] [-c MIN_QUERY_COV] [-i MIN_SEQ_ID] [-k KMER_SIZE] [-t MIN_KSIM]

optional arguments:
  -h, --help            show this help message and exit

  -a, --asms PathToAsms.tsv
                        Table with SampleID & Paths to each input assemblies. (TSV)

  -r, --pg-ref pan_genome_reference.fasta
                        Input pan-genome nucleotide reference. Typically output as `pan_genome_reference.fasta` by Panaroo/Roary (FASTA)

  -m, --gene_matrix gene_presence_absence.csv
                        Input pan-genome gene presence/absence matrix. By default is assumed to be a `gene_presence_absence.csv` output by Panaroo/Roary (CSV) If the user provides the --is-rtab flag, the input is assumed to be an .Rtab (TSV)file.

  -o, --results_dir RESULTS_DIR
                        Output directory for analysis results.

  -p, --prefix PREFIX
                        Prefix to append to output files

  -c, --min-query-cov MIN_QUERY_COV
                        Minimum query coverage (ranging from 0 to 1) to classify a gene as present within an assembly (Default: 0.9)

  -i, --min-seq-id MIN_SEQ_ID
                        Minimum sequence identity (ranging from 0 to 1) to classify a gene as present within an assembly (Default: 0.9)

  -k, --kmer_size KMER_SIZE
                        k-mer size (bp) to use for generating profile of unique k-mers for each sequence (Default: 31))

  -t, --min-ksim MIN_KSIM
                        Minimum k-mer similarity (maximum jaccard containment of k-mers between pair of sequences) to cluster sequences into the same "nucleotide similarity cluster" (Default: 0.8))
  --is-rtab             Flag indicating that the input gene matrix is a tab-delimited .Rtab file
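
To make the `-k` and `-t` options more concrete, the following sketch computes the "maximum Jaccard containment" between two sequences: the number of shared unique k-mers divided by the size of the smaller k-mer set. This is an illustration of the metric as described in the help text, not panqc's actual implementation. The function names are hypothetical, and the sketch only considers the forward strand (no reverse-complement or canonical k-mer handling), which is a simplifying assumption.

```python
def kmer_set(seq, k=31):
    """Return the set of unique k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def max_containment(seq_a, seq_b, k=31):
    """Shared unique k-mers divided by the size of the smaller k-mer set."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a or not b:
        return 0.0  # a sequence shorter than k has no k-mers
    return len(a & b) / min(len(a), len(b))


# Toy example with k=3 so the k-mers are easy to check by hand:
# "ACGTACGT" -> {ACG, CGT, GTA, TAC}; "ACGTTT" -> {ACG, CGT, GTT, TTT}
print(max_containment("ACGTACGT", "ACGTTT", k=3))  # 2 shared / 4 = 0.5
```

Using the smaller set as the denominator means a short gene fully contained in a longer one scores 1.0, which a plain Jaccard index would not capture.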

`panqc utils`

Within utils there are 3 sub-commands that run specific components of the panqc NRC pipeline:

utils asmseqcheck - Perform alignment of all genes classified as absent to their respective assemblies.
utils ava - Perform all vs all comparison of k-mer profiles of input sequences.
utils nscluster - Perform nucleotide similarity clustering and readjust pan-genome estimates.
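
The clustering and readjustment idea behind these sub-commands can be sketched as follows. This is not panqc's code: it assumes that clusters are the connected components of a graph linking gene pairs whose k-mer similarity meets the threshold (single linkage), and that a merged cluster counts as present in a sample if any member gene is. All function names, the `~` separator for merged cluster names, and the example values are hypothetical.

```python
def cluster_genes(genes, pair_sims, min_ksim=0.8):
    """Group genes into connected components of the similarity graph."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving keeps trees shallow
            g = parent[g]
        return g

    for (g1, g2), sim in pair_sims.items():
        if sim >= min_ksim:
            parent[find(g1)] = find(g2)  # union the two components

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


def merge_presence(presence, clusters):
    """A merged cluster is present in a sample if any member gene is."""
    n_samples = len(next(iter(presence.values())))
    return {
        "~".join(members): [
            int(any(presence[g][i] for g in members)) for i in range(n_samples)
        ]
        for members in clusters
    }


# Made-up similarities and presence calls across three samples.
genes = ["geneA", "geneB", "geneC"]
pair_sims = {("geneA", "geneB"): 0.95, ("geneA", "geneC"): 0.10,
             ("geneB", "geneC"): 0.12}
presence = {"geneA": [1, 0, 1], "geneB": [0, 1, 0], "geneC": [1, 1, 0]}

clusters = cluster_genes(genes, pair_sims)
merged = merge_presence(presence, clusters)
print(clusters)
print(merged)
```

In this toy case `geneA` and `geneB` look like two accessory genes before merging, but the merged cluster is present in all three samples.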

$ panqc utils --help

usage: panqc utils [-h] {asmseqcheck,ava,nscluster} ...

positional arguments:
  {asmseqcheck,ava,nscluster}
                        Please select one of the utilility pipelines of the panqc toolkit.
    asmseqcheck
    ava
    nscluster

optional arguments:
  -h, --help            show this help message and exit

🚧 Check back soon for full usage for each of the utility sub-pipelines of the panqc toolkit 🚧

Contributing and Issues

🚧 Check back soon 🚧

Citing

🚧 Check back soon 🚧

Release history

- 0.0.4 (Mar 26, 2024), this version
- 0.0.3 (Feb 14, 2024)
- 0.0.2 (Feb 14, 2024)

Download files

Source Distribution

panqc-0.0.4.tar.gz (199.2 kB), uploaded Mar 26, 2024

Built Distribution

panqc-0.0.4-py3-none-any.whl (20.5 kB), uploaded Mar 26, 2024, Python 3

Hashes for panqc-0.0.4.tar.gz

Algorithm	Hash digest
SHA256	`9fb6821301e124f252fc0fc3682e3f75f4bea649817157164ddaca8850b21b51`
MD5	`b2b5f2694e6b22cfef63d08d0435f53c`
BLAKE2b-256	`4b1ac9f53704469ac66ea9215b7635647cfe0afdfcb583a1e6a0d39989255c41`

Hashes for panqc-0.0.4-py3-none-any.whl

Algorithm	Hash digest
SHA256	`20a96d97469bf08d0b638718e2c6240a10dcda12922a4d6680549fb71d44afc1`
MD5	`d09f198535fc6d74409ca9cfda94a59a`
BLAKE2b-256	`e01a3b2af7d7675490b450f7148c442e1df3a306727603a09b6b941398b56767`