Generate a PanGenome given a set of genomes

Please submit questions to the Issues page on GitLab

Primary contact: Todd P. Michael, tmicha@salk.edu

PanKmer

k-mer based and reference-free pangenome analysis. See the quickstart below, or read the documentation.

Repository structure

benchmark/ Shell scripts defining benchmarking runs
docs/ Source code and makefiles for Sphinx documentation
- source/ Documentation source code in ReStructuredText format
- Makefile Sphinx infrastructure
- make.bat Sphinx infrastructure
- pankmer-manual.pdf User manual generated from sphinx-build -b latexpdf
example/ Files for use in README examples, can probably remove this
rust/ Cargo config and Rust source code
- src/ Rust code
- Cargo.toml Cargo config
snakemake/ Snakemake workflows, some of these are currently outdated
src/pankmer/ Python source code
test/ Python unit tests (run with pytest)

License

PanKmer is licensed under a Salk Institute BSD license

Installation

In a conda environment

First create an environment that includes all dependencies:

conda create -c conda-forge -c bioconda -n pankmer \
  python=3.10 cython gff2bed more-itertools pybedtools \
  python-newick pyfaidx rust seaborn upsetplot urllib3 \
  tabix dash-bootstrap-components

Then install PanKmer with pip:

conda activate pankmer
pip install pankmer

With pip

PanKmer is built with Rust, so you will need to install Rust first if you have not already done so. Then you can install PanKmer with pip:

pip install pankmer

Check installation

Check that the installation was successful by running:

pankmer --version
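The same check can be made from inside Python, which is handy in a notebook or a pipeline script. The sketch below uses only the standard library `importlib.metadata` module; the helper name `installed_version` is made up for illustration and is not part of PanKmer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="pankmer"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pankmer is not installed in this environment")
    else:
        print(f"pankmer {found}")
```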

Tutorial

Download example dataset

The download-example subcommand will download a small example dataset of Chr19 sequences from S. polyrhiza.

pankmer download-example -d .

After running this command the directory PanKmer_example_Sp_Chr19/ will be present in the working directory. It contains FASTA files representing Chr19 from three genomes, and GFF files giving their gene annotations.

ls PanKmer_example_Sp_Chr19/*

PanKmer_example_Sp_Chr19/README.md

PanKmer_example_Sp_Chr19/Sp_Chr19_features:
Sp9509_oxford_v3_Chr19.gff3.gz Sp9512_a02_genes_Chr19.gff3.gz

PanKmer_example_Sp_Chr19/Sp_Chr19_genomes:
Sp7498_HiC_Chr19.fasta.gz Sp9509_oxford_v3_Chr19.fasta.gz Sp9512_a02_genome_Chr19.fasta.gz

To get started, navigate to the downloaded directory.

cd PanKmer_example_Sp_Chr19/

Build a k-mer index

The k-mer index is a table tracking presence or absence of k-mers in the set of input genomes. To build an index, use the index subcommand and provide a directory containing the input genomes.

pankmer index -g Sp_Chr19_genomes/ -o Sp_Chr19_index.tar

After completion, the index will be present as a tar file Sp_Chr19_index.tar.

tar -tvf Sp_Chr19_index.tar

Sp_Chr19_index/
Sp_Chr19_index/kmers.bgz
Sp_Chr19_index/metadata.json
Sp_Chr19_index/scores.bgz

Note

The input genomes argument provided with the -g flag can be a directory, a tar archive, or a space-separated list of FASTA files.

If the output argument provided with the -o flag ends with .tar, then the index will be written as a tar archive. Otherwise it will be written as a directory.
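To make the idea of a presence/absence table concrete, here is a minimal Python sketch of the concept. It is not PanKmer's implementation or on-disk format: the function names, the toy sequences, and the small value of k are all assumptions chosen for readability, and real indexing details such as strand handling are ignored.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, skipping any that contain N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield kmer


def build_index(genomes, k):
    """Map each k-mer to a tuple of 0/1 flags, one flag per genome."""
    names = sorted(genomes)
    present = defaultdict(set)
    for name in names:
        for kmer in kmers(genomes[name], k):
            present[kmer].add(name)
    table = {
        kmer: tuple(int(name in members) for name in names)
        for kmer, members in present.items()
    }
    return names, table


# Toy sequences standing in for FASTA records (hypothetical, not real data).
toy = {"genomeA": "ACGTAC", "genomeB": "CGTACG", "genomeC": "TTTTTT"}
names, index = build_index(toy, k=3)
for kmer in sorted(index):
    print(kmer, index[kmer])
```

Each row of the table answers one question: which of the input genomes contain this k-mer.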

Create an adjacency matrix

A useful application of the k-mer index is to generate an adjacency matrix. This is a table of k-mer similarity values for each pair of genomes in the index. We can generate one using the adj-matrix subcommand, which will produce a CSV or TSV file containing the matrix.

pankmer adj-matrix -i Sp_Chr19_index.tar -o Sp_Chr19_adj_matrix.csv
pankmer adj-matrix -i Sp_Chr19_index.tar -o Sp_Chr19_adj_matrix.tsv

Note

The input index argument provided with the -i flag can be a tar archive or a directory.
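As a rough illustration of what a pairwise table looks like, the following Python sketch counts shared k-mers for every pair of toy sequences and writes the result as delimited text. The choice of shared k-mer count as the similarity value, the function names, and the toy data are assumptions made for this example; the exact values written by pankmer adj-matrix may be defined differently.

```python
import csv
import io


def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_matrix(genomes, k):
    """Return genome names and a square matrix of shared k-mer counts."""
    names = sorted(genomes)
    sets = {name: kmer_set(genomes[name], k) for name in names}
    matrix = [[len(sets[a] & sets[b]) for b in names] for a in names]
    return names, matrix


def to_delimited(names, matrix, delimiter=","):
    """Serialize the matrix with a header row; use "\t" for TSV output."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow([""] + names)
    for name, row in zip(names, matrix):
        writer.writerow([name] + row)
    return buf.getvalue()


# Toy sequences (hypothetical, not real data).
toy = {"genomeA": "ACGTAC", "genomeB": "CGTACG", "genomeC": "ACGTTT"}
names, matrix = shared_kmer_matrix(toy, k=3)
print(to_delimited(names, matrix))
```

In this sketch the diagonal holds each genome's own k-mer count, and the matrix is symmetric because set intersection is symmetric.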

Plot a clustered heatmap

To visualize the adjacency matrix, we can plot a clustered heatmap of the adjacency values. In this case we use the Jaccard similarity metric for pairwise comparisons between genomes:

pankmer clustermap -i Sp_Chr19_adj_matrix.csv \
  -o Sp_Chr19_adj_matrix.svg \
  --metric jaccard \
  --width 6.5 \
  --height 6.5

[Figure: example clustered heatmap]
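For readers unfamiliar with the Jaccard metric, it is the size of the intersection of two sets divided by the size of their union. The Python sketch below derives it from a matrix of shared k-mer counts, assuming the diagonal holds each genome's total k-mer count and the off-diagonal entries hold pairwise shared counts. That layout is an assumption for illustration and may not match the file that pankmer adj-matrix writes.

```python
def jaccard_from_counts(counts):
    """Convert shared-count matrix to Jaccard similarities.

    counts[i][i] is the number of k-mers in genome i and counts[i][j]
    is the number shared by genomes i and j (assumed layout).
    """
    n = len(counts)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # |A union B| = |A| + |B| - |A intersect B|
            union = counts[i][i] + counts[j][j] - counts[i][j]
            result[i][j] = counts[i][j] / union if union else 0.0
    return result


# Toy counts for three genomes (hypothetical values).
counts = [[4, 4, 2], [4, 4, 2], [2, 2, 4]]
jaccard = jaccard_from_counts(counts)
for row in jaccard:
    print(["%.3f" % value for value in row])
```

A value of 1.0 means two genomes have identical k-mer sets, and 0.0 means they share none.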

Generate a gene variability heatmap

Generate a heatmap showing variability of genes across genomes. The following command uses the --n-features option to limit analysis to the first two genes from each input GFF3 file. The resulting image shows the level of variability observed across genes from each genome.

pankmer anchor-heatmap -i Sp_Chr19_index.tar \
  -a Sp_Chr19_genomes/Sp9509_oxford_v3_Chr19.fasta.gz Sp_Chr19_genomes/Sp9512_a02_genome_Chr19.fasta.gz \
  -f Sp_Chr19_features/Sp9509_oxford_v3_Chr19.gff3.gz Sp_Chr19_features/Sp9512_a02_genes_Chr19.gff3.gz \
  -o Sp_Chr19_gene_var.png \
  --n-features 2 \
  --height 3

[Figure: example gene variability heatmap]
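One way to think about per-gene variability is to ask, for each k-mer inside a gene, how many of the other genomes also contain it. The Python sketch below computes that as a single conservation score per gene. It is only an illustration of the idea under stated assumptions: the scoring formula, the function name `gene_conservation`, the 0-based half-open coordinates, and the toy sequences are all hypothetical, and the values plotted by pankmer anchor-heatmap may be computed differently.

```python
def gene_conservation(anchor_seq, gene_start, gene_end, other_genomes, k):
    """Mean fraction of other genomes that contain each k-mer of the gene.

    gene_start and gene_end are 0-based, half-open coordinates on the
    anchor sequence (an assumption; GFF3 itself is 1-based, inclusive).
    """
    other_sets = [
        {g[i:i + k] for i in range(len(g) - k + 1)} for g in other_genomes
    ]
    gene = anchor_seq[gene_start:gene_end]
    gene_kmers = [gene[i:i + k] for i in range(len(gene) - k + 1)]
    if not gene_kmers or not other_sets:
        return 0.0
    hits = sum(kmer in s for kmer in gene_kmers for s in other_sets)
    return hits / (len(gene_kmers) * len(other_sets))


# Toy anchor genome with one "gene" spanning the first six bases.
anchor = "ACGTACGGTT"
others = ["CGTACG", "ACGTTT"]
score = gene_conservation(anchor, 0, 6, others, k=3)
print(f"conservation: {score:.2f}")
```

A score near 1 indicates a gene whose k-mers appear in nearly every other genome, while a low score indicates a variable gene.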

Pangenome datasets

The pankmer download-example subcommand can be used to download genomes from several publicly available pangenome datasets. See the help text:

pankmer download-example --help

usage: pankmer download-example [-h] [-d <dir/>] [-s {Spolyrhiza,Slycopersicum,Zmays,Hsapiens,Bsubtilis,Athaliana}] [-n <int>]

options:
  -h, --help            show this help message and exit
  -d <dir/>, --dir <dir/>
                        destination directory for example data
  -s {Spolyrhiza,Slycopersicum,Zmays,Hsapiens,Bsubtilis,Athaliana}, --species {Spolyrhiza,Slycopersicum,Zmays,Hsapiens,Bsubtilis,Athaliana}
                        download publicly available genomes. Species: max_samples. Spolyrhiza: 3, Slycopersicum: 46, Zmays: 54, Hsapiens: 94,    Bsubtilis: 164, Athaliana: 1135
  -n <int>, --n-samples <int>
                        number of samples to download, must be less than species max [1]

The -s/--species option selects the species, and the -n/--n-samples option selects the number of samples to download. The maximum number of samples for each species is:

Species	Max samples
S. polyrhiza	3
S. lycopersicum	46
Z. mays	54
H. sapiens	94
B. subtilis	164
A. thaliana	1135
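When scripting downloads, it can help to validate the sample count against this table before invoking the CLI. The Python sketch below builds the argument list using only the -d, -s, and -n flags shown in the help text. The helper name `download_command` is hypothetical, and treating the per-species maximum as an inclusive upper bound is an assumption, since the help text only says the value "must be less than species max".

```python
import shlex

# Per-species limits copied from the table above.
MAX_SAMPLES = {
    "Spolyrhiza": 3,
    "Slycopersicum": 46,
    "Zmays": 54,
    "Hsapiens": 94,
    "Bsubtilis": 164,
    "Athaliana": 1135,
}


def download_command(species, n_samples, dest="."):
    """Return the argv list for a validated download-example call."""
    if species not in MAX_SAMPLES:
        raise ValueError(f"unknown species: {species}")
    limit = MAX_SAMPLES[species]
    # Assumption: the maximum itself is allowed.
    if not 1 <= n_samples <= limit:
        raise ValueError(f"n_samples must be between 1 and {limit}")
    return [
        "pankmer", "download-example",
        "-d", dest,
        "-s", species,
        "-n", str(n_samples),
    ]


# Print the command instead of running it; pass the list to
# subprocess.run(..., check=True) to execute it for real.
print(shlex.join(download_command("Zmays", 5, "maize_genomes/")))
```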

Each pangenome dataset is described below.

S. lycopersicum

46 Solanum lycopersicum genomes from the SolOmics database. See also: Nature article.

Z. mays

54 Zea mays genomes from the downloads page of MaizeGDB.

H. sapiens

94 Homo sapiens haplotypes from Year 1 of the Human Pangenome Reference Consortium/Human Pangenome Project. Download details are available at the HPRC/HPP GitHub repository. See also: Nature article.

B. subtilis

164 B. subtilis genomes from NCBI.

A. thaliana

1135 A. thaliana pseudo-genomes from the data center of 1001 Genomes.

S. polyrhiza

A collection of three Spirodela polyrhiza clones: Sp7498, Sp9509, and Sp9512. Sequences for Sp7498 and Sp9509 were sourced from the following references, available at http://spirodelagenome.org:

Sp9509_oxford_v3
NCBI: GCA_900492545.1
CoGe: id51364
This genome was generated with Oxford Nanopore and polished with Illumina, scaffolded against the previous Illumina-based genome Sp9509v3 and validated with BioNano optical maps and multi-color FISH (mcFISH).

Hoang PNT, Michael TP, Gilbert S, Chu P, Motley TS, Appenroth KJ, Schubert I, Lam E. Generating a high-confidence reference genome map of the Greater Duckweed by integration of cytogenomic, optical mapping and Oxford Nanopore technologies. Plant J. 2018 Jul 28.

Sp7498_HiC
CoGe: 55877
This assembly was generated using Oxford Nanopore long reads and Illumina-based HiC scaffolding.

Harkess A, McGlaughlin F, Bilkey N, Elliott K, Emenecker R, Mattoon E, Miller K, Vierstra R, Meyers BC, Michael TP. High contiguity Spirodela polyrhiza genomes reveal conserved chromosomal structure. Submitted.

Sp9512 sequence was sourced from research data for the following in-progress publication:

Pasaribu B, Acosta K, Aylward A, Abramson BW, Colt K, Hartwick NT, Liang Y, Shanklin J, Michael TP, Lam E. Genomics of turions from the Greater Duckweed reveal pathways for tissue dormancy and reemergence strategy of an aquatic plant.

Sp9512 can be downloaded from Michael lab AWS storage.
