Skip to main content

Compute gene-cluster specific k-mers over a pangenome

Project description

panfeed

Anaconda-Server Badge

panfeed is a k-mer streaming tool that works one gene cluster at a time. Starting from a list of annotated genome assemblies in GFF3 format and a gene presence absence matrix (as produced by roary, panaroo and ggCaller), panfeed generates a table with unique k-mer presence/absence patterns, which can be used for genome-wide associations (GWAS) using tools such as pyseer. Mapping of associated patterns to gene clusters and base resolution mapping of k-mers can be then achieved with the other two outputs of panfeed. Advantages of this approach over the generation of k-mers from a global de Bruijn graph include a lower chance of encountering artifacts due to repetitive regions and easier interpretations and visualization of results.

Citation

Sommer, H., Djamalova, D., & Galardini, M. (2023). Reduced ambiguity and improved interpretability of bacterial genome-wide associations using gene-cluster-centric k-mers. Microbial Genomics, 9(11), 001129. 10.1099/mgen.0.001129

Installation

panfeed can be installed using pip:

python3 -m pip install panfeed

Or through conda (or mamba to speed things up):

conda create -n panfeed -c bioconda panfeed

Alternatively, we provide a conda recipe to create an environment named panfeed . Download the environment file and then run:

conda env create -f environment.yml
conda activate panfeed

Quick start guide

We reccommend a two-pass approach when using panfeed; the first pass generates the presence/absence matrix for all k-mers across all gene clusters. After the association analysis is completed, the panfeed-get-clusters command can be used to list the gene clusters with k-mers passing the desired significance threshold, and a second pass of panfeed can be run on those gene clusters alone to generate a base-level mapping of all k-mers across samples for fine-mapping and visualization purposes. The main advantage of the two-pass approach is a significant reduction in storage requirements for the k-mer metadata file, and a slightly shorter computation time.

To run the first pass, prepare a folder with all GFF3 annotated assemblies files (including the nucleotide sequences at the end of each file), with file name in the format SAMPLE.gff. Each sample name should have a matching column in the gene clusters presence/absence file, which must follow the same format as those generated by panaroo (i.e. gene_presence_absence.csv). Then run the following command, which will include 100 bases upstream and downstream of each gene cluster:

panfeed -g gffs -p gene_presence_absence.csv -o panfeed1 --upstream 100 --downstream 100 --compress --cores 4

This will create three files in the panfeed1 directory:

  • kmers.tsv.gz: k-mers metadata file (empty for this pass)
  • kmers_to_hashes.stv.gz: file to match gene clusters, k-mer sequences and the hash for the respective presence/absence pattern
  • hashes_to_patterns.tsv.gz: binary presence/absence matrix for all unique k-mer patterns (rows) across samples (columns)

The hashes_to_patterns.tsv.gz file can be used to run a GWAS analysis with a tool such as pyseer, which will produce an output table (e.g. pyseer.tsv) with association statistics for each pattern passing the basic filtering thresholds. This file can then be used to retrieve the gene clusters that encode k-mers passing the desired significance threshold:

panfeed-get-clusters -a pyseer.tsv -p panfeed1/kmers_to_hashes.stv.gz -t 1E-7 > gene_clusters.txt

The second pass of panfeed can be then run focusing on the "interesting" gene clusters and generating k-mers positional information across all samples:

ls gffs/ | sed 's/.gff//g' > samples.txt
panfeed -g gffs -p gene_presence_absence.csv -o panfeed2 --targets samples.txt --genes gene_clusters.txt --upstream 100 --downstream 100 --compress --cores 4

This time the kmers.tsv.gz file will contain absolute and relative (to start codon) positional information for each k-mer in all samples. This file can be then merged with the association results so that the association statistics are paired with each k-mer and their position across samples:

panfeed-get-kmers -a pyseer.tsv -p kmers_to_hashes.tsv.gz -k kmers.tsv.gz | gzip > annotated_kmers.tsv.gz

Association results can be then visualized for each gene cluster with the following command, which requires the phenotype file used for the association with pyseer (data.tsv in the example command below):

panfeed-plot -k annotated_kmers.tsv.gz -p data.tsv

This command will generate three figures for each gene cluster:

  • significance_CLUSTER.png: k-mers colors are proportional to their significance level, think one Manhatten plot for each sample
  • sequence_CLUSTER.png: k-mers are colored based on their nucleotide sequence, effectively generating a pseudo-alignment
  • hybrid_CLUSTER.png: a combination of the two previous figures; k-mers color is based on their sequence, opacity is proportional to their significance level

Additionally, a file called sequence_legend.png is created to indicate which color is associated to which nucleotide.

Additional information

If your GFF files do not contain the nucleotide sequences, you can provide them to panfeed as a separate argument, using -f fastas. The fastas folder should contain one file per sample with the name format SAMPLE.fasta or SAMPLE.fna.

If you want the k-mers presence/absence patterns to encode differently the information on whether a gene cluster is missing from a sample, use the --consider-missing argument. By default a missing gene cluster is encoded as 0, same as a missing k-mer.

The visualization command has many arguments to fully customize the resulting plots; among them:

  • --phenotype-column PHENOTYPE, will sort the plots by the provided phenotype value (descending order)
  • --start -50 --stop 100 --sample 0.1, will restrict the plot to 10% of samples and to the -50 to +100 region relative to the start codon
  • adding --nucleotides to the above command will add the nucleotide letters to each plot

Working with a very large dataset

Note: this is a new functionality introduced in v1.6.0

If you are working with more than a few thousand input files, it is poor practice to have all the inputs in a single directory (e.g. for performance reasons). Following what other bioinformatic tools do to solve this issue, the --gff and --fasta arguments can also be provided as "files-of-files", where the path to each input file is written in each line.

Prerequisites:

The following packages and version have been used to develop and test panfeed

  • pyfaidx (0.6.3.1)
  • numpy (1.20.3)
  • pandas (1.3.2)
  • matplotlib (3.5.2)
  • seaborn (0.11.2)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

panfeed-1.6.0.tar.gz (17.2 MB view hashes)

Uploaded Source

Built Distribution

panfeed-1.6.0-py3-none-any.whl (30.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page