
USUM: Plotting sequence similarity using USEARCH & UMAP


USUM uses USEARCH and UMAP (or t-SNE) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.


Installation

  1. Install the USEARCH dependency manually: https://drive5.com/usearch/download.html
    (consider supporting the author by buying the 64-bit license)

  2. Install usum using pip:

pip install usum
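
After both steps, a quick sanity check can confirm that the pieces are discoverable. The snippet below is only a sketch: it assumes the USEARCH executable is named `usearch` and is expected on your `PATH`, which may not match your download (the file is often versioned). The helper name `check_usum_setup` is hypothetical and not part of usum.

```python
import importlib.util
import shutil


def check_usum_setup(binary="usearch"):
    """Report whether a USEARCH binary and the usum package can be found.

    `binary` is an assumed executable name; change it if your USEARCH
    download uses a different (for example, versioned) file name.
    """
    return {
        # Full path to the executable, or None if it is not on PATH
        "usearch_path": shutil.which(binary),
        # True if Python can locate an installed package called "usum"
        "usum_installed": importlib.util.find_spec("usum") is not None,
    }
```

Calling `check_usum_setup()` returns a small dictionary. A `usearch_path` of `None` suggests the binary is missing from `PATH` or has a different name.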

Usage

Use usum to plot input protein or DNA sequences in FASTA format.

Show all available options using usum --help

Minimal example

usum example.fa --maxdist 0.2 --termdist 0.3 --output example

Multiple input files with labels

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output example

This will produce a PNG plot:

[Image: UMAP static example]

An interactive Bokeh HTML plot is also created:

[Image: UMAP Bokeh example]

Using t-SNE instead of UMAP

You can also produce a t-SNE plot using the --tsne flag.

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example

This will produce a PNG plot:

[Image: t-SNE static example]

Plotting random subset

You can use --limit to extract and plot a random subset of the input sequences.

# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output example

You can control randomness and reproducibility using the --seed option.
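
To illustrate what seeded subsampling means in practice, here is a minimal standalone sketch. It is not usum's actual implementation: `read_fasta` and `random_subset` are hypothetical helpers written for this example, and the sampling strategy (uniform, without replacement) is an assumption.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def random_subset(records, limit, seed=None):
    """Return at most `limit` records, sampled without replacement."""
    if limit is None or limit >= len(records):
        return list(records)
    rng = random.Random(seed)  # private generator; global state is untouched
    return rng.sample(records, limit)


fasta = ">a\nACGT\n>b\nACGA\n>c\nTTGT\n>d\nAC\nGG\n"
records = read_fasta(fasta)
subset = random_subset(records, limit=2, seed=42)
```

With a fixed seed, repeated calls select the same records, which is the behavior a seed option is meant to provide. With `seed=None`, the selection can differ between runs.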

Plotting options

See usum --help for all plotting options.

See UMAP API Guide for more info about the UMAP options.

  • Use --limit to plot a random subset of records
  • Use --width and --height to control plot size in pixels
  • Use --resume to reuse previous distance matrix from the output folder
  • Use --tsne to produce a t-SNE embedding instead of UMAP (you can use this with --resume)
  • Use --umap-spread to control how close together the embedded points are in the UMAP embedding
  • Use --umap-min-dist to control the minimum distance between points in the UMAP embedding
  • Use --neighbors to control the number of neighbors in the UMAP graph

Reusing previous results

When changing just the plot options, you can use --resume to reuse previous results from the output folder.

Warning: This will reuse the previous distance matrix, so changes to limits or USEARCH args won't take effect.

# Reuse results from the "example" output directory
usum --resume --output example --width 600 --height 600 --theme fire

Programmatic use

from usum import usum

# Show help
help(usum)

# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)

How it works

  • A sparse distance matrix is calculated using the USEARCH calc_distmx command.
  • The distances are based on % identity, so the method is agnostic to sequence type (DNA or protein).
  • The distance matrix is embedded as a precomputed metric using UMAP.
  • The embedding is plotted using umap.plot.
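
The idea behind the first three steps can be sketched in plain Python. This is a toy model, not what USEARCH or usum actually does. It assumes pre-aligned sequences of equal length (USEARCH performs its own alignment and search), and the helper names and the `missing=1.0` fill value for pairs beyond the cutoff are assumptions made for this example.

```python
def identity_distance(a, b):
    """Distance = 1 - fraction of identical positions (equal-length inputs)."""
    if len(a) != len(b):
        raise ValueError("toy example expects pre-aligned, equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / len(a)


def sparse_distances(seqs, maxdist):
    """Keep only pairs whose distance is at most `maxdist`."""
    pairs = {}
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            d = identity_distance(seqs[i], seqs[j])
            if d <= maxdist:
                pairs[(i, j)] = d
    return pairs


def to_dense(pairs, n, missing=1.0):
    """Expand sparse pairs into a symmetric n x n matrix with a zero diagonal."""
    matrix = [[0.0 if i == j else missing for j in range(n)] for i in range(n)]
    for (i, j), d in pairs.items():
        matrix[i][j] = matrix[j][i] = d
    return matrix


# DNA and protein strings go through the same code path, because only
# per-position identity is compared.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "MKVLAAGIVG", "MKVLAAGIVA"]
pairs = sparse_distances(seqs, maxdist=0.2)
matrix = to_dense(pairs, len(seqs))
```

Only the two near-identical pairs survive the `maxdist` cutoff here, which is what makes the matrix sparse. With umap-learn installed, a distance matrix like this can be embedded via `umap.UMAP(metric="precomputed").fit_transform(matrix)`. How usum itself treats pairs missing from the sparse matrix is not described above.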
