USUM: Plotting sequence similarity using USEARCH & UMAP

USUM uses USEARCH and UMAP (or t-SNE) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.

Installation

  1. Install the USEARCH dependency manually: https://drive5.com/usearch/download.html
    (consider supporting the author by buying the 64-bit license)

  2. Install usum using pip:

pip install usum

Usage

Use usum to plot input protein or DNA sequences in FASTA format.

To show all available options, run usum --help.

Minimal example

usum example.fa --maxdist 0.2 --termdist 0.3 --output example

Multiple input files with labels

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output example

This will produce a PNG plot:

[Image: UMAP static example]

An interactive Bokeh HTML plot is also created:

[Image: UMAP Bokeh example]

Using t-SNE instead of UMAP

You can also produce a t-SNE plot using the --tsne flag.

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example

This will produce a PNG plot:

[Image: t-SNE static example]

Plotting random subset

You can use --limit to extract and plot a random subset of the input sequences.

# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output example

You can control randomness and reproducibility using the --seed option.
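The idea behind a seeded random subset can be illustrated with a minimal sketch. This is not usum's actual implementation: the helper names read_fasta and sample_records are hypothetical, and the sketch only assumes that --limit draws records without replacement and that --seed fixes the random generator.

```python
import random
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_records(records, limit, seed=None):
    """Return at most `limit` records, drawn at random without replacement."""
    records = list(records)
    if limit is None or limit >= len(records):
        return records
    # A private generator makes the draw repeatable for a given seed
    rng = random.Random(seed)
    return rng.sample(records, limit)


# Hypothetical in-memory input standing in for a FASTA file
fasta = StringIO(">a\nACGT\n>b\nACGA\nTT\n>c\nGGGG\n>d\nTTTT\n")
records = list(read_fasta(fasta))
subset = sample_records(records, limit=2, seed=42)
print(len(subset), "of", len(records), "records selected")
```

Calling sample_records again with the same seed returns the same subset, which is the property a fixed seed is meant to provide.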

Plotting options

See usum --help for all plotting options.

See the UMAP API Guide for more information about the UMAP options.

  • Use --limit to plot a random subset of records
  • Use --width and --height to control the plot size in pixels
  • Use --resume to reuse the previous distance matrix from the output folder
  • Use --tsne to produce a t-SNE embedding instead of UMAP (can be combined with --resume)
  • Use --umap-spread to control how close together the embedded points are in the UMAP embedding
  • Use --umap-min-dist to control the minimum distance between points in the UMAP embedding
  • Use --neighbors to control the number of neighbors in the UMAP graph
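As a rough illustration of how the UMAP options relate to each other, the sketch below builds keyword arguments for umap-learn's UMAP class. The mapping of --neighbors, --umap-min-dist and --umap-spread onto n_neighbors, min_dist and spread is an assumption, the helper name umap_kwargs is hypothetical, and the defaults shown are umap-learn's, which may differ from usum's.

```python
def umap_kwargs(neighbors=15, min_dist=0.1, spread=1.0):
    """Build keyword arguments for umap.UMAP from usum-style option values.

    The flag-to-parameter mapping is assumed, not taken from usum's source.
    """
    # umap-learn rejects neighborhoods smaller than 2 points
    if neighbors < 2:
        raise ValueError("neighbors must be at least 2")
    if min_dist < 0:
        raise ValueError("min_dist must not be negative")
    # umap-learn also requires min_dist to be no larger than spread
    if min_dist > spread:
        raise ValueError("min_dist must not exceed spread")
    return {
        "n_neighbors": neighbors,
        "min_dist": min_dist,
        "spread": spread,
        # The input is a distance matrix rather than raw feature vectors
        "metric": "precomputed",
    }


kwargs = umap_kwargs(neighbors=30, min_dist=0.5, spread=2.0)
print(kwargs)
```

With umap-learn installed, the resulting dictionary could be passed as umap.UMAP(**kwargs). Checking the values up front surfaces an invalid combination before the distance calculation is run.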

Reusing previous results

When changing just the plot options, you can use --resume to reuse previous results from the output folder.

Warning: This will reuse the previous distance matrix, so changes to limits or USEARCH arguments won't take effect.

# Reuse results from the example output directory
usum --resume --output example --width 600 --height 600 --theme fire

Programmatic use

from usum import usum

# Show help
help(usum)

# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)

How it works

  • A sparse distance matrix is calculated using the USEARCH calc_distmx command.
  • The distances are based on percent identity, so the method is agnostic to sequence type (DNA or protein).
  • The distance matrix is embedded with UMAP as a precomputed metric.
  • The embedding is plotted using umap.plot.
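The first steps above can be sketched in plain Python. This is an illustration under stated assumptions, not usum's code: it assumes the sparse output is a list of tab-separated "label1, label2, distance" lines, that distance is one minus fractional identity, and that pairs absent from the sparse output are treated as maximally distant (the fill value). The function names are hypothetical.

```python
def identity_to_distance(percent_identity):
    """Convert percent identity (0-100) to a distance in [0, 1] (assumed definition)."""
    return 1.0 - percent_identity / 100.0


def pairs_to_matrix(lines, fill=1.0):
    """Build a symmetric distance matrix from 'label1<TAB>label2<TAB>distance' lines."""
    index = {}  # label -> row/column position, in order of first appearance
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        a, b, dist = line.split("\t")
        for label in (a, b):
            index.setdefault(label, len(index))
        pairs.append((index[a], index[b], float(dist)))

    n = len(index)
    # Assumption: pairs missing from the sparse output get the fill distance
    matrix = [[fill] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 0.0  # every sequence is at distance 0 from itself
    for i, j, dist in pairs:
        matrix[i][j] = matrix[j][i] = dist  # distances are symmetric
    return list(index), matrix


# Hypothetical sparse output: s1 and s3 were never reported as a close pair
lines = ["s1\ts2\t0.05", "s2\ts3\t0.12"]
labels, matrix = pairs_to_matrix(lines)
for label, row in zip(labels, matrix):
    print(label, row)
```

A square matrix like this is the kind of input that umap-learn accepts when its metric parameter is set to "precomputed". Because only distances enter the embedding, the same steps apply to DNA and protein input alike.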
