Skip to main content

Compute gene-cluster specific k-mers over a pangenome

Project description

panfeed

Anaconda-Server Badge

panfeed is a k-mer streaming tool that works one gene cluster at a time. Starting from a list of annotated genome assemblies in GFF3 format and a gene presence absence matrix (as produced by roary, panaroo and ggCaller), panfeed generates a table with unique k-mer presence/absence patterns, which can be used for genome-wide associations (GWAS) using tools such as pyseer. Mapping of associated patterns to gene clusters and base resolution mapping of k-mers can be then achieved with the other two outputs of panfeed. Advantages of this approach over the generation of k-mers from a global de Bruijn graph include a lower chance of encountering artifacts due to repetitive regions and easier interpretations and visualization of results.

Citation

Sommer, H., Djamalova, D., & Galardini, M. (2023). Reduced ambiguity and improved interpretability of bacterial genome-wide associations using gene-cluster-centric k-mers. Microbial Genomics, 9(11), 001129. 10.1099/mgen.0.001129

Installation

panfeed can be installed using pip:

python3 -m pip install panfeed

Or through conda (or mamba to speed things up):

conda create -n panfeed -c bioconda panfeed

Alternatively, we provide a conda recipe to create an environment named panfeed . Download the environment file and then run:

conda env create -f environment.yml
conda activate panfeed

Quick start guide

We reccommend a two-pass approach when using panfeed; the first pass generates the presence/absence matrix for all k-mers across all gene clusters. After the association analysis is completed, the panfeed-get-clusters command can be used to list the gene clusters with k-mers passing the desired significance threshold, and a second pass of panfeed can be run on those gene clusters alone to generate a base-level mapping of all k-mers across samples for fine-mapping and visualization purposes. The main advantage of the two-pass approach is a significant reduction in storage requirements for the k-mer metadata file, and a slightly shorter computation time.

To run the first pass, prepare a folder with all GFF3 annotated assemblies files (including the nucleotide sequences at the end of each file), with file name in the format SAMPLE.gff. Each sample name should have a matching column in the gene clusters presence/absence file, which must follow the same format as those generated by panaroo (i.e. gene_presence_absence.csv). Then run the following command, which will include 100 bases upstream and downstream of each gene cluster:

panfeed -g gffs -p gene_presence_absence.csv -o panfeed1 --upstream 100 --downstream 100 --compress --cores 4

This will create three files in the panfeed1 directory:

  • kmers.tsv.gz: k-mers metadata file (empty for this pass)
  • kmers_to_hashes.stv.gz: file to match gene clusters, k-mer sequences and the hash for the respective presence/absence pattern
  • hashes_to_patterns.tsv.gz: binary presence/absence matrix for all unique k-mer patterns (rows) across samples (columns)

The hashes_to_patterns.tsv.gz file can be used to run a GWAS analysis with a tool such as pyseer, which will produce an output table (e.g. pyseer.tsv) with association statistics for each pattern passing the basic filtering thresholds. This file can then be used to retrieve the gene clusters that encode k-mers passing the desired significance threshold:

panfeed-get-clusters -a pyseer.tsv -p panfeed1/kmers_to_hashes.stv.gz -t 1E-7 > gene_clusters.txt

The second pass of panfeed can be then run focusing on the "interesting" gene clusters and generating k-mers positional information across all samples:

ls gffs/ | sed 's/.gff//g' > samples.txt
panfeed -g gffs -p gene_presence_absence.csv -o panfeed2 --targets samples.txt --genes gene_clusters.txt --upstream 100 --downstream 100 --compress --cores 4

This time the kmers.tsv.gz file will contain absolute and relative (to start codon) positional information for each k-mer in all samples. This file can be then merged with the association results so that the association statistics are paired with each k-mer and their position across samples:

panfeed-get-kmers -a pyseer.tsv -p kmers_to_hashes.tsv.gz -k kmers.tsv.gz | gzip > annotated_kmers.tsv.gz

Association results can be then visualized for each gene cluster with the following command, which requires the phenotype file used for the association with pyseer (data.tsv in the example command below):

panfeed-plot -k annotated_kmers.tsv.gz -p data.tsv

This command will generate three figures for each gene cluster:

  • significance_CLUSTER.png: k-mers colors are proportional to their significance level, think one Manhatten plot for each sample
  • sequence_CLUSTER.png: k-mers are colored based on their nucleotide sequence, effectively generating a pseudo-alignment
  • hybrid_CLUSTER.png: a combination of the two previous figures; k-mers color is based on their sequence, opacity is proportional to their significance level

Additionally, a file called sequence_legend.png is created to indicate which color is associated to which nucleotide.

Additional information

If your GFF files do not contain the nucleotide sequences, you can provide them to panfeed as a separate argument, using -f fastas. The fastas folder should contain one file per sample with the name format SAMPLE.fasta or SAMPLE.fna.

If you want the k-mers presence/absence patterns to encode differently the information on whether a gene cluster is missing from a sample, use the --consider-missing argument. By default a missing gene cluster is encoded as 0, same as a missing k-mer.

The visualization command has many arguments to fully customize the resulting plots; among them:

  • --phenotype-column PHENOTYPE, will sort the plots by the provided phenotype value (descending order)
  • --start -50 --stop 100 --sample 0.1, will restrict the plot to 10% of samples and to the -50 to +100 region relative to the start codon
  • adding --nucleotides to the above command will add the nucleotide letters to each plot

Working with a very large dataset

Note: this is a new functionality introduced in v1.6.0

If you are working with more than a few thousand input files, it is poor practice to have all the inputs in a single directory (e.g. for performance reasons). Following what other bioinformatic tools do to solve this issue, the --gff and --fasta arguments can also be provided as "files-of-files", where the path to each input file is written in each line.

Prerequisites:

The following packages and version have been used to develop and test panfeed

  • pyfaidx (0.6.3.1)
  • numpy (1.20.3)
  • pandas (1.3.2)
  • matplotlib (3.5.2)
  • seaborn (0.11.2)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

panfeed-1.7.2.tar.gz (17.2 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

panfeed-1.7.2-py3-none-any.whl (31.4 kB view details)

Uploaded Python 3

File details

Details for the file panfeed-1.7.2.tar.gz.

File metadata

  • Download URL: panfeed-1.7.2.tar.gz
  • Upload date:
  • Size: 17.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for panfeed-1.7.2.tar.gz
Algorithm Hash digest
SHA256 74732e9b7b7d6dffe19dd7a4f7f6dd58c2ab4391214e8bb86eae3bdf5166f4ef
MD5 52543d789750adb6d96c9d25dd2eb0dc
BLAKE2b-256 d5217f682bf78ad26fec07b3b1c3308c375104900e647cb2b25d8484362136cd

See more details on using hashes here.

File details

Details for the file panfeed-1.7.2-py3-none-any.whl.

File metadata

  • Download URL: panfeed-1.7.2-py3-none-any.whl
  • Upload date:
  • Size: 31.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for panfeed-1.7.2-py3-none-any.whl
Algorithm Hash digest
SHA256 db73c67457d1ffbb50a86922a6050e97288ca7982f27c995c80c5fac8fa61252
MD5 ceebb152d250194f7a66f0c20b953ef7
BLAKE2b-256 16e6350f9e14e17b0997c688968da5ad2d12138b0c20c46f0e1087616da50282

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page