USUM: Plotting sequence similarity using USEARCH & UMAP
USUM uses USEARCH and UMAP (or t-SNE) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.
Installation
- Install the USEARCH dependency manually: https://drive5.com/usearch/download.html (consider supporting the author by buying the 64-bit license)
- Install usum using pip:
pip install usum
Usage
Use usum to plot input protein or DNA sequences in FASTA format.
Show all available options using usum --help
Minimal example
usum example.fa --maxdist 0.2 --termdist 0.3 --output example
Multiple input files with labels
usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output example
This will produce a PNG plot. An interactive Bokeh HTML plot is also created.
Using t-SNE instead of UMAP
You can also produce a t-SNE plot using the --tsne flag.
usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example
This will produce a PNG plot.
Plotting random subset
You can use --limit to extract and plot a random subset of the input sequences.
# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output example
You can control randomness and reproducibility using the --seed option.
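To illustrate what a seeded random subset means in practice, here is a minimal Python sketch of reproducible subsampling of FASTA records. It is not USUM's own implementation: the helper names `read_fasta` and `subsample` are hypothetical, and the only assumption carried over from the options above is that a fixed seed yields the same subset on every run.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsample(records, limit, seed=None):
    """Return at most `limit` records, drawn at random without replacement."""
    if limit is None or limit >= len(records):
        return list(records)
    # A private generator keeps the result reproducible for a given seed
    # without touching the global random state
    rng = random.Random(seed)
    return rng.sample(records, limit)


example = ">a\nACGT\n>b\nACGA\nTT\n>c\nGGGG\n>d\nCCCC\n"
records = read_fasta(example)
subset = subsample(records, limit=2, seed=0)
```

Calling `subsample` twice with the same seed returns the same records, which is the property that makes a plot of a random subset repeatable.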
Plotting options
See usum --help for all plotting options.
See UMAP API Guide for more info about the UMAP options.
- Use --limit to plot a random subset of records
- Use --width and --height to control plot size in pixels
- Use --resume to reuse previous distance matrix from the output folder
- Use --tsne to produce a t-SNE embedding instead of UMAP (you can use this with --resume)
- Use --umap-spread to control how close together the embedded points are in the UMAP embedding
- Use --umap-min-dist to control minimum distance between points in UMAP embedding
- Use --neighbors to control number of neighbors in UMAP graph
Reusing previous results
When changing just the plot options, you can use --resume to reuse previous results from the output folder.
Warning: This will reuse the previous distance matrix, so changes to limits or USEARCH args won't take effect.
# Reuse the previous result from the example output directory
usum --resume --output example --width 600 --height 600 --theme fire
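When scripting a first run followed by several re-plots, it can help to assemble the command line in Python. The sketch below only builds argument lists from flags shown in this document. The helper `build_usum_command` is hypothetical and not part of usum, and nothing is executed here.

```python
import shlex


def build_usum_command(inputs=(), output="example", labels=None,
                       maxdist=None, termdist=None, limit=None, seed=None,
                       width=None, height=None, tsne=False, resume=False):
    """Assemble a usum argument list from flags shown in this README."""
    cmd = ["usum", *inputs]
    if labels:
        cmd += ["--labels", *labels]
    valued = [("--maxdist", maxdist), ("--termdist", termdist),
              ("--limit", limit), ("--seed", seed),
              ("--width", width), ("--height", height)]
    for flag, value in valued:
        # Only emit flags that were explicitly set
        if value is not None:
            cmd += [flag, str(value)]
    if tsne:
        cmd.append("--tsne")
    if resume:
        cmd.append("--resume")
    cmd += ["--output", output]
    return cmd


# First run: computes the distance matrix and the plot
first_run = build_usum_command(["first.fa", "second.fa"],
                               labels=["First", "Second"],
                               maxdist=0.2, termdist=0.3)
# Second run: reuses the distance matrix and changes only the plot size
replot = build_usum_command(resume=True, width=600, height=600)

print(shlex.join(first_run))
print(shlex.join(replot))
# To actually run a command, pass the list to subprocess.run(cmd, check=True);
# this requires usum and USEARCH to be installed.
```

Keeping the two invocations as separate lists makes it clear that the resumed run carries no input files or USEARCH thresholds, which matches the warning above that such changes are ignored when resuming.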
Programmatic use
from usum import usum
# Show help
help(usum)
# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)
How it works
- A sparse distance matrix is calculated using the USEARCH calc_distmx command.
- The distances are based on % identity, so the method is agnostic to sequence type (DNA or protein).
- The distance matrix is embedded as a precomputed metric using UMAP.
- The embedding is plotted using umap.plot.
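The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not USUM's actual code: it assumes the sparse distance matrix is available as three tab-separated columns (label, label, distance), that distance is one minus fractional identity, and that pairs absent from the sparse output are treated as maximally distant. The helper names `identity_to_distance` and `pairs_to_matrix` are hypothetical.

```python
def identity_to_distance(percent_identity):
    """Convert % identity to a distance in [0, 1] (assumed: 1 - identity)."""
    return 1.0 - percent_identity / 100.0


def pairs_to_matrix(lines, missing=1.0):
    """Build a symmetric dense matrix from 'label<TAB>label<TAB>distance' lines.

    Pairs that do not appear in the sparse input get the `missing` distance.
    """
    labels, index, pairs = [], {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        a, b, dist = line.split("\t")
        for label in (a, b):
            if label not in index:
                index[label] = len(labels)
                labels.append(label)
        pairs.append((index[a], index[b], float(dist)))
    n = len(labels)
    # Zero on the diagonal, `missing` everywhere else until a pair says otherwise
    matrix = [[0.0 if i == j else missing for j in range(n)] for i in range(n)]
    for i, j, dist in pairs:
        matrix[i][j] = dist
        matrix[j][i] = dist
    return labels, matrix


example_lines = ["seq1\tseq2\t0.05", "seq2\tseq3\t0.15"]
labels, matrix = pairs_to_matrix(example_lines)

# With umap-learn installed, a matrix like this (converted to a NumPy array)
# could be embedded with umap.UMAP(metric="precomputed").fit_transform(...).
# That step is not executed here.
```

Because only distances enter the embedding, nothing in this sketch depends on whether the labels refer to DNA or protein sequences. Sequences with no recorded pair would not appear in the matrix at all, so a real pipeline would need to account for them separately.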