
Generate a PanGenome given a set of genomes


Primary contact: Anthony Aylward, aaylward@salk.edu

PanKmer

PanKmer performs k-mer-based, reference-free pangenome analysis. See the quickstart below, or read the documentation.

Installation

With pip

pip install git+https://gitlab.com/salk-tm/pankmer.git

In a conda environment

First create an environment that includes all dependencies:

conda create -c conda-forge -c bioconda -n pankmer python==3.10 biopython==1.79 cython pandas setuptools seaborn urllib3 wheel python-newick pyfaidx gff2bed

If running on macOS, a few additional packages are required:

conda activate pankmer
conda install -c conda-forge clang_osx-64 clangxx_osx-64 gfortran_osx-64

Then install PanKmer with pip:

conda activate pankmer
pip install git+https://gitlab.com/salk-tm/pankmer.git

Check installation

Check that the installation was successful by running:

pankmer --version
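
The same check can be made from inside Python, which is convenient in notebooks or scripts. The sketch below uses only the standard library and assumes the installed distribution is named "pankmer", as the package file names suggest; it is an illustration, not part of PanKmer itself.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution installed by the pip command above is named "pankmer".
version = installed_version("pankmer")
print(version if version else "pankmer is not installed in this environment")
```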

Tutorial

Download example dataset

The download_example subcommand downloads a small example dataset of Chr19 sequences from Spirodela polyrhiza (S. polyrhiza).

pankmer download_example -d .

After running this command the directory PanKmer_example_Sp_Chr19/ will be present in the working directory. It contains FASTA files representing Chr19 from three genomes, and GFF files giving their gene annotations.

ls PanKmer_example_Sp_Chr19/*
PanKmer_example_Sp_Chr19/README.md

PanKmer_example_Sp_Chr19/Sp_Chr19_features:
Sp7498_HiC_Chr19.gff.gz Sp9509_oxford_v3_Chr19.gff3.gz Sp9512_a02_genes_Chr19.gff3.gz

PanKmer_example_Sp_Chr19/Sp_Chr19_genomes:
Sp7498_HiC_Chr19.fasta.gz Sp9509_oxford_v3_Chr19.fasta.gz Sp9512_a02_genome_Chr19.fasta.gz

To get started, navigate to the downloaded directory.

cd PanKmer_example_Sp_Chr19/

Build a k-mer index

The k-mer index is a table tracking presence or absence of k-mers in the set of input genomes. To build an index, use the index subcommand and provide a directory containing the input genomes.

pankmer index -g Sp_Chr19_genomes/ -o Sp_Chr19_index.tar

After completion, the index will be present as a tar file Sp_Chr19_index.tar.

tar -tvf Sp_Chr19_index.tar
Sp_Chr19_index/
Sp_Chr19_index/kmers.b.gz
Sp_Chr19_index/metadata.json
Sp_Chr19_index/scores.b.gz
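
To make the idea of a presence/absence table concrete, here is a minimal Python sketch on toy sequences. It is a conceptual illustration only: the sequences and the function names are hypothetical, reverse complements are ignored, k is tiny, and nothing here reflects PanKmer's actual on-disk format or implementation.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_table(genomes, k):
    """Map each k-mer to a tuple of 0/1 flags, one per genome in input order."""
    names = list(genomes)
    kmer_sets = {name: kmers(genomes[name], k) for name in names}
    all_kmers = set().union(*kmer_sets.values())
    return {
        kmer: tuple(int(kmer in kmer_sets[name]) for name in names)
        for kmer in sorted(all_kmers)
    }


# Toy sequences (hypothetical, not taken from the example dataset).
toy_genomes = {"A": "ACGTAC", "B": "CGTACG", "C": "ACGTTT"}
table = presence_table(toy_genomes, k=3)
for kmer, flags in table.items():
    print(kmer, flags)
```

Each row says which genomes contain a given k-mer; for example, a k-mer flagged (1, 1, 0) occurs in genomes A and B but not in C.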

Note

The input genomes argument provided with the -g flag can be a directory, a tar archive, or a comma-separated list of FASTA files.

If the output argument provided with the -o flag ends with .tar, then the index will be written as a tar archive. Otherwise it will be written as a directory.
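
When indexing only a subset of genomes, the comma-separated form of -g can be assembled programmatically. The following is a small sketch under stated assumptions: the helper name index_command and the output name Sp_Chr19_subset_index are hypothetical, and only the -g and -o flags described above are used. The command is built and printed, not executed.

```python
def index_command(fasta_paths, output):
    """Build an argument list for `pankmer index` with a comma-separated -g value."""
    genomes_arg = ",".join(str(path) for path in fasta_paths)
    return ["pankmer", "index", "-g", genomes_arg, "-o", str(output)]


cmd = index_command(
    [
        "Sp_Chr19_genomes/Sp7498_HiC_Chr19.fasta.gz",
        "Sp_Chr19_genomes/Sp9509_oxford_v3_Chr19.fasta.gz",
    ],
    # Hypothetical output name; no .tar suffix, so a directory would be written.
    "Sp_Chr19_subset_index",
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run if you prefer to launch the command from Python.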

Create an adjacency matrix

A useful application of the k-mer index is to generate an adjacency matrix. This is a table of k-mer similarity values for each pair of genomes in the index. We can generate one using the adj-matrix subcommand, which will produce a CSV file containing the matrix.

pankmer adj-matrix -i Sp_Chr19_index.tar -o Sp_Chr19_adj_matrix.csv

Note

The input index argument provided with the -i flag can be a tar archive or a directory.
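
As a conceptual illustration of a pairwise table like this, the sketch below counts shared k-mers for every pair of toy genomes. It assumes, for illustration only, that the similarity value is the number of shared k-mers and that the diagonal holds each genome's own k-mer count; the exact values PanKmer writes to the CSV may be defined differently. The sequences and function names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_matrix(genomes, k):
    """Return a nested dict where matrix[a][b] is the number of k-mers shared by a and b."""
    names = list(genomes)
    kmer_sets = {name: kmers(genomes[name], k) for name in names}
    return {
        a: {b: len(kmer_sets[a] & kmer_sets[b]) for b in names}
        for a in names
    }


# Toy sequences (hypothetical, not taken from the example dataset).
toy_genomes = {"A": "ACGTAC", "B": "CGTACG", "C": "ACGTTT"}
matrix = shared_kmer_matrix(toy_genomes, k=3)

# Print the matrix in a CSV-like layout with genome names as row and column labels.
print("," + ",".join(matrix))
for name, row in matrix.items():
    print(name + "," + ",".join(str(value) for value in row.values()))
```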

Plot a clustered heatmap

To visualize the adjacency matrix, we can plot a clustered heatmap of the adjacency values. In this case we use the Jaccard similarity metric for pairwise comparisons between genomes:

pankmer clustermap -i Sp_Chr19_adj_matrix.csv \
  -o Sp_Chr19_adj_matrix.svg \
  --metric jaccard \
  --width 6.5 \
  --height 6.5

[Figure: example clustered heatmap]
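
For reference, the Jaccard similarity of two k-mer sets is the size of their intersection divided by the size of their union. The sketch below derives it from a table of shared k-mer counts whose diagonal holds each genome's total k-mer count. That layout is an assumption made for this example, and the counts are toy values; it is not a description of how the clustermap subcommand works internally.

```python
def jaccard_from_counts(counts):
    """Convert shared-count entries (diagonal = set sizes) to Jaccard similarities."""
    names = list(counts)
    result = {}
    for a in names:
        result[a] = {}
        for b in names:
            shared = counts[a][b]
            # |A union B| = |A| + |B| - |A intersect B|
            union = counts[a][a] + counts[b][b] - shared
            result[a][b] = shared / union if union else 0.0
    return result


# Toy shared k-mer counts (hypothetical values).
counts = {
    "A": {"A": 4, "B": 4, "C": 2},
    "B": {"A": 4, "B": 4, "C": 2},
    "C": {"A": 2, "B": 2, "C": 4},
}
jaccard = jaccard_from_counts(counts)
for name, row in jaccard.items():
    print(name, [round(value, 3) for value in row.values()])
```

Values range from 0 (no shared k-mers) to 1 (identical k-mer sets), which makes genomes of different sizes comparable on one color scale.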

Generate a gene variability heatmap

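Generate a heatmap showing variability of genes across genomes. The following command uses the --n-features option to limit analysis to the first two genes from each input GFF file. The resulting image shows the level of variability observed across genes from each genome. Note that the command reads the index from the directory Sp_Chr19_index/; if the index was built as a tar archive, as earlier in this tutorial, extract it first (for example with tar -xf Sp_Chr19_index.tar).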

pankmer reg_heatmap -i Sp_Chr19_index/ \
  -r Sp_Chr19_genomes/Sp7498_HiC_Chr19.fasta.gz Sp_Chr19_genomes/Sp9509_oxford_v3_Chr19.fasta.gz Sp_Chr19_genomes/Sp9512_a02_genome_Chr19.fasta.gz \
  -f Sp_Chr19_features/Sp7498_HiC_Chr19.gff.gz Sp_Chr19_features/Sp9509_oxford_v3_Chr19.gff3.gz Sp_Chr19_features/Sp9512_a02_genes_Chr19.gff3.gz \
  -o Sp_Chr19_gene_var.png \
  --n-features 2 \
  --height 3

[Figure: example gene variability heatmap]
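
To illustrate what limiting analysis to the first few genes of a GFF file might look like, here is a minimal sketch that reads a gzip-compressed GFF3 stream and keeps the first n records of type "gene". It assumes that features are selected by the type column in file order, which may not match how --n-features is implemented; the records and the function name are hypothetical.

```python
import gzip
import io


def first_n_genes(gff_lines, n):
    """Return (seqid, start, end) for the first n 'gene' features in GFF lines."""
    genes = []
    for line in gff_lines:
        if len(genes) >= n:
            break
        # Skip directives, comments, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # GFF3 records have nine tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        genes.append((fields[0], int(fields[3]), int(fields[4])))
    return genes


# Hypothetical GFF3 records, compressed in memory to mimic a .gff3.gz file.
toy_gff = (
    "##gff-version 3\n"
    "Chr19\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1\n"
    "Chr19\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n"
    "Chr19\ttoy\tgene\t1500\t2400\t.\t-\t.\tID=g2\n"
    "Chr19\ttoy\tgene\t3000\t3600\t.\t+\t.\tID=g3\n"
)
compressed = gzip.compress(toy_gff.encode())
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    selected = first_n_genes(handle, n=2)
print(selected)
```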
