
Reduced dimension embeddings for pathogen sequences


pathogen-embed


Create reduced dimension embeddings for pathogen sequences

pathogen-embed is open-source software for creating reduced dimension embeddings (PCA, MDS, t-SNE, and UMAP) of viral populations. For more details, read Nanduri et al. and see the corresponding GitHub repository.

Installation

With pip

pip install pathogen-embed

With Conda

conda install -c conda-forge -c bioconda pathogen-embed

Quickstart

The following commands show how to apply the pathogen-embed tools to a small set of seasonal influenza A/H3N2 hemagglutinin (HA) sequences. To start, calculate a matrix of pairwise distances between all sequences in the dataset.

pathogen-distance \
  --alignment tests/data/h3n2_ha_alignment.fasta \
  --output distances.csv
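
To make the idea of a pairwise distance matrix concrete, here is a minimal sketch that counts mismatched positions (Hamming distance) between aligned sequences. It is an illustration only: the toy sequences and the function names are hypothetical, and pathogen-distance's actual metric may differ, for example in how it treats indels.

```python
from itertools import combinations


def hamming_distance(seq_a, seq_b):
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def pairwise_distances(alignment):
    """Build a symmetric distance matrix (dict of dicts) from a name -> sequence mapping."""
    names = list(alignment)
    matrix = {name: {other: 0 for other in names} for name in names}
    for first, second in combinations(names, 2):
        distance = hamming_distance(alignment[first], alignment[second])
        matrix[first][second] = distance
        matrix[second][first] = distance
    return matrix


# Hypothetical toy alignment, not taken from the test data.
toy_alignment = {
    "strain_a": "ACGTACGT",
    "strain_b": "ACGTACGA",
    "strain_c": "TCGTACGA",
}
distances = pairwise_distances(toy_alignment)
```

The resulting matrix is symmetric with zeros on the diagonal, which is the general shape of input that the embedding step works from.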

(Optional) For faster distance calculations without indel support, use snp-dists. In the following command, the -c flag switches the default tab-delimited output of snp-dists to the comma-delimited format expected by pathogen-embed, and -b leaves the top-left corner cell of the matrix blank.

snp-dists -c -b tests/data/h3n2_ha_alignment.fasta > distances.csv

Create a t-SNE embedding from the distance matrix. Note that the perplexity is the number of nearest neighbors to consider in the embedding calculations, so this value has to be less than or equal to the total number of samples in the input (N=50, here).

pathogen-embed \
  --alignment tests/data/h3n2_ha_alignment.fasta \
  --distance-matrix distances.csv \
  --output-dataframe tsne.csv \
  --output-figure tsne.pdf \
  --output-pairwise-distance-figure tsne_pairwise_distances.pdf \
  t-sne \
    --perplexity 45.0
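
The perplexity constraint described above can be checked before running the embedding. The following is a minimal sketch, assuming the sample count equals the number of FASTA records in the alignment. The helper names are hypothetical and are not part of pathogen-embed.

```python
def count_fasta_records(fasta_text):
    """Count sequences in FASTA-formatted text by counting header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def check_perplexity(perplexity, n_samples):
    """Return True when the perplexity is positive and no larger than the sample count."""
    return 0 < perplexity <= n_samples


# Hypothetical toy alignment with three records.
toy_fasta = ">strain_a\nACGT\n>strain_b\nACGA\n>strain_c\nTCGA\n"
n_samples = count_fasta_records(toy_fasta)

ok_for_quickstart = check_perplexity(45.0, 50)   # the quickstart values
too_large = check_perplexity(45.0, n_samples)    # 45 exceeds 3 samples
```

A check like this is mainly useful when subsampling a dataset, because a perplexity that worked for the full dataset may exceed the size of the subsample.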

The following figure shows the resulting embedding.

Example t-SNE embedding of seasonal influenza A/H3N2 hemagglutinin sequences

The following figure shows the distribution of pairwise Euclidean distances by corresponding pairwise genetic distance. The equation in the figure title shows how genetic distances (x in the equation) scale to Euclidean distances (y) in the embedding.

Distribution of Euclidean distances between pairs of genomes in a t-SNE embedding by the pairwise genetic distance between the same genomes
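
As a rough illustration of how such an equation could be derived, the sketch below fits a straight line y = slope * x + intercept to pairs of genetic and Euclidean distances by ordinary least squares. This assumes the relationship is linear. The data points are hypothetical, and the fitting procedure pathogen-embed actually uses may differ.

```python
def fit_line(x_values, y_values):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical pairwise distances: genetic (x) and Euclidean (y).
genetic = [0, 5, 10, 20]
euclidean = [1.0, 3.5, 6.0, 11.0]
slope, intercept = fit_line(genetic, euclidean)
```

With these made-up points the fit recovers y = 0.5x + 1.0 exactly. Real embeddings scatter around the fitted line, as the distribution figure suggests.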

Find clusters in the embedding.

pathogen-cluster \
  --embedding tsne.csv \
  --label-attribute tsne_label \
  --output-dataframe tsne_with_clusters.csv \
  --output-figure tsne_with_clusters.pdf

The following image shows the t-SNE embedding colored by cluster. Note that the underlying clustering algorithm, HDBSCAN, leaves a sample unassigned when no reliable cluster exists for it. These unassigned samples receive a cluster label of "-1".

Example t-SNE embedding of seasonal influenza A/H3N2 hemagglutinin sequences colored by the cluster label assigned by pathogen-cluster
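
When working with the clustered output downstream, it can help to separate unassigned samples from clustered ones. The sketch below does this for a small in-memory CSV. The column names ("strain", "tsne_x", "tsne_y") and the assumption that labels are stored as integers in a "tsne_label" column are guesses for illustration, so check them against your own tsne_with_clusters.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a clustered embedding; real column names may differ.
example_csv = """strain,tsne_x,tsne_y,tsne_label
strain_a,1.0,2.0,0
strain_b,1.1,2.1,0
strain_c,-3.0,4.0,1
strain_d,9.0,-9.0,-1
"""


def summarize_clusters(csv_text, label_column="tsne_label"):
    """Count samples per cluster label and list samples labeled -1 (unassigned)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(int(row[label_column]) for row in rows)
    unassigned = [row["strain"] for row in rows if int(row[label_column]) == -1]
    return counts, unassigned


counts, unassigned = summarize_clusters(example_csv)
```

Treating "-1" as a real cluster in later analyses would group unrelated samples together, so it is usually safer to handle these samples separately.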

If you know the minimum genetic distance you want to require between clusters, you can use the equation from the pairwise distance figure above to determine the corresponding minimum Euclidean distance to pass to pathogen-cluster's --distance-threshold argument.
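
As a worked example of that conversion, the sketch below plugs a genetic distance into an equation of the form y = slope * x + intercept. The linear form, the coefficient values, and the function name are assumptions for illustration, so read the actual coefficients from your own figure title.

```python
def genetic_to_euclidean(genetic_distance, slope, intercept):
    """Convert a genetic distance (x) to a Euclidean distance (y) with y = slope * x + intercept."""
    return slope * genetic_distance + intercept


# Hypothetical coefficients; take the real ones from the pairwise distance figure.
slope, intercept = 0.5, 1.0

# Require at least 10 genetic differences between clusters.
threshold = genetic_to_euclidean(10, slope, intercept)
argument = f"--distance-threshold {threshold}"
```

Here a minimum genetic distance of 10 corresponds to a Euclidean threshold of 6.0 under the assumed equation.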

Example: Identify reassortment groups from multiple gene alignments

To identify potential reassortment groups for viruses with segmented genomes, you can calculate one distance matrix per gene and pass multiple distance matrices to pathogen-embed. Internally, pathogen-embed sums the given distance matrices into a single matrix to use for an embedding. The clusters in the resulting embedding represent genetic diversity in each gene individually and potential reassortment between genes. The following example shows how to apply this approach to alignments for seasonal influenza A/H3N2 HA and NA.

Calculate a separate distance matrix per gene alignment for HA and NA. Note that alignments must have the same sequence names in the same order. If they do not, sort your alignments by sequence name with a tool like seqkit first (e.g., seqkit sort -n alignment.fasta > alignment.sorted.fasta).

pathogen-distance \
  --alignment tests/data/h3n2_ha_alignment.sorted.fasta \
  --output ha_distances.csv

pathogen-distance \
  --alignment tests/data/h3n2_na_alignment.sorted.fasta \
  --output na_distances.csv
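
The ordering requirement noted above can be verified before computing distances. The following is a minimal sketch that compares sequence names across FASTA inputs. The helper names and toy records are hypothetical, and the name is assumed to be the first whitespace-delimited token of each header line.

```python
def fasta_names(fasta_text):
    """Return sequence names (first token of each header line) in file order."""
    return [
        (line[1:].split() or [""])[0]
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]


def same_names_in_same_order(*fasta_texts):
    """Return True when every FASTA input lists identical names in identical order."""
    name_lists = [fasta_names(text) for text in fasta_texts]
    return all(names == name_lists[0] for names in name_lists)


# Hypothetical toy alignments.
ha_text = ">strain_a\nACGT\n>strain_b\nACGA\n"
na_sorted = ">strain_a\nTTGA\n>strain_b\nTTGC\n"
na_unsorted = ">strain_b\nTTGC\n>strain_a\nTTGA\n"

sorted_ok = same_names_in_same_order(ha_text, na_sorted)
unsorted_ok = same_names_in_same_order(ha_text, na_unsorted)
```

To apply this to real files, read each alignment with open(path).read() and pass the resulting strings to same_names_in_same_order.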

Create a t-SNE embedding using the HA/NA alignments and distance matrices. The t-SNE embedding is initialized with a PCA embedding of the alignments.

pathogen-embed \
  --alignment tests/data/h3n2_ha_alignment.sorted.fasta tests/data/h3n2_na_alignment.sorted.fasta \
  --distance-matrix ha_distances.csv na_distances.csv \
  --output-dataframe tsne.csv \
  --output-figure tsne.pdf \
  --output-pairwise-distance-figure tsne_pairwise_distances.pdf \
  t-sne \
    --perplexity 45.0
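
To illustrate the summation that pathogen-embed is described as applying to multiple distance matrices, the sketch below adds two small matrices element by element. The matrices and the function name are hypothetical, and this is not the package's own implementation.

```python
def sum_distance_matrices(matrices):
    """Add square distance matrices element-wise; all must share the same size."""
    size = len(matrices[0])
    for matrix in matrices:
        if len(matrix) != size or any(len(row) != size for row in matrix):
            raise ValueError("all distance matrices must be square and the same size")
    return [
        [sum(matrix[i][j] for matrix in matrices) for j in range(size)]
        for i in range(size)
    ]


# Hypothetical distances for three strains, listed in the same order in both matrices.
ha = [[0, 2, 5], [2, 0, 4], [5, 4, 0]]
na = [[0, 1, 1], [1, 0, 3], [1, 3, 0]]
combined = sum_distance_matrices([ha, na])
```

Element-wise addition only makes sense when row i refers to the same strain in every matrix, which is why the alignments must list the same sequence names in the same order.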

Finally, find clusters in the embedding. These clusters represent both the within-gene and between-gene diversity of the given HA and NA sequences.

pathogen-cluster \
  --embedding tsne.csv \
  --label-attribute tsne_label \
  --output-dataframe tsne_with_clusters.csv \
  --output-figure tsne_with_clusters.pdf

Compare the resulting embedding and clusters to the embedding above from only HA sequences, to get a sense of how including the NA sequences affects the results.

Example t-SNE embedding of seasonal influenza A/H3N2 hemagglutinin and neuraminidase sequences colored by the cluster label assigned by pathogen-cluster

Build documentation

Build the documentation.

make -C docs html

Clean the docs.

make -C docs clean

Releasing a new version on PyPI

  1. Update CHANGES.md to reflect changes since the last release and set the correct version number for the new release at the top of the file.
  2. Update the version number in setup.py.
  3. Create a new GitHub release, setting the same version number as above as the release tag and, optionally, generating release notes automatically. When you select "Publish release", GitHub Actions will run the publish workflow which uses the pypi-publish action to create a new release on PyPI for you.
  4. Create a pull request to bump the version of the corresponding Bioconda package.

Run tests during development

Clone this repository locally.

git clone https://github.com/blab/pathogen-embed.git
cd pathogen-embed

Install an editable version of the package into a custom Conda environment.

conda create -n pathogen-embed python=3.11
conda activate pathogen-embed
python3 -m pip install -e '.[dev]'

Run tests with cram.

cram --shell=/bin/bash tests
