USUM: Plotting sequence similarity using USEARCH & UMAP
USUM uses USEARCH and UMAP (or t-SNE) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.
Installation
- Install the USEARCH dependency manually: https://drive5.com/usearch/download.html (consider supporting the author by buying the 64-bit license)
- Install usum using pip:
pip install usum
Usage
Use usum to plot input protein or DNA sequences in FASTA format.
Show all available options using usum --help
Minimal example
usum example.fa --maxdist 0.2 --termdist 0.3 --output example
Multiple input files with labels
usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output example
This will produce a PNG plot and an interactive Bokeh HTML plot.
Using t-SNE instead of UMAP
You can also produce a t-SNE plot using the --tsne flag.
usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example
This will produce a PNG plot of the t-SNE embedding.
Plotting random subset
You can use --limit to extract and plot a random subset of the input sequences.
# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output example
You can control randomness and reproducibility using the --seed option.
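The idea behind --limit and --seed can be illustrated with a minimal sketch of seeded subsampling. This is not usum's actual implementation, which is not shown in this README; the helper names read_fasta and sample_records are hypothetical, and only the Python standard library is used.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_records(records, limit, seed=None):
    """Return at most `limit` records chosen at random.

    Using the same seed on the same input gives the same subset.
    """
    records = list(records)
    if limit is None or limit >= len(records):
        return records
    rng = random.Random(seed)  # private generator, global state untouched
    return rng.sample(records, limit)


# Hypothetical in-memory FASTA content, for illustration only.
fasta_lines = [">seq1", "ACGT", ">seq2", "ACGA", ">seq3", "TTGA", ">seq4", "TTGC"]

subset = sample_records(read_fasta(fasta_lines), limit=2, seed=42)
for header, sequence in subset:
    print(header, sequence)
```

A fixed seed makes the subset repeatable, which matters when comparing plots produced with different embedding options on the same sample.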
Plotting options
See usum --help for all plotting options.
See the UMAP API Guide for more info about the UMAP options.
- Use --limit to plot a random subset of records
- Use --width and --height to control plot size in pixels
- Use --resume to reuse the previous distance matrix from the output folder
- Use --tsne to produce a t-SNE embedding instead of UMAP (you can use this with --resume)
- Use --umap-spread to control how close together the embedded points are in the UMAP embedding
- Use --umap-min-dist to control the minimum distance between points in the UMAP embedding
- Use --neighbors to control the number of neighbors in the UMAP graph
Reusing previous results
When changing just the plot options, you can use --resume to reuse previous results from the output folder.
Warning: This will reuse the previous distance matrix, so changes to limits or USEARCH args won't take effect.
# Reuse results from the example output directory
usum --resume --output example --width 600 --height 600 --theme fire
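When trying several plot settings on the same distance matrix, it can be convenient to script the --resume calls. The sketch below only builds the command lines from flags shown in this README (--resume, --output, --width, --height, --theme); the helper name resume_command is hypothetical, and nothing is executed unless you uncomment the subprocess call.

```python
import shlex


def resume_command(output, width=None, height=None, theme=None):
    """Build a `usum --resume` argument list from flags shown in this README."""
    cmd = ["usum", "--resume", "--output", output]
    if width is not None:
        cmd += ["--width", str(width)]
    if height is not None:
        cmd += ["--height", str(height)]
    if theme is not None:
        cmd += ["--theme", theme]
    return cmd


for size in (600, 1200):
    cmd = resume_command("example", width=size, height=size, theme="fire")
    print(shlex.join(cmd))
    # To actually run it (requires usum on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Each run targets the same output folder, so copy any plot you want to keep before starting the next run.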
Programmatic use
from usum import usum
# Show help
help(usum)
# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)
How it works
- A sparse distance matrix is calculated using the USEARCH calc_distmx command.
- The distances are based on % identity, so the method is agnostic to sequence type (DNA or protein).
- The distance matrix is embedded as a precomputed metric using UMAP.
- The embedding is plotted using umap.plot.
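The first two steps above can be sketched in plain Python. Two assumptions are made here that this README does not confirm: that distance is defined as 1 minus fractional identity, and that the pairwise distances arrive as tab-separated lines of the form label1, label2, distance. The function names are hypothetical and this is not usum's own code.

```python
def identity_to_distance(percent_identity):
    """Convert % identity (0-100) to a distance in [0, 1] (assumed definition)."""
    return 1.0 - percent_identity / 100.0


def parse_pairs(lines):
    """Parse 'label1<TAB>label2<TAB>distance' lines into a symmetric sparse map.

    Returns (labels, sparse), where labels maps each sequence label to a
    row/column index and sparse maps (i, j) index pairs to distances.
    Pairs that were never reported are simply absent from the map.
    """
    labels = {}
    sparse = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        a, b, dist = line.split("\t")
        i = labels.setdefault(a, len(labels))
        j = labels.setdefault(b, len(labels))
        # Store both directions so lookups do not depend on label order.
        sparse[(i, j)] = sparse[(j, i)] = float(dist)
    return labels, sparse


# Hypothetical pair list: seqA-seqC was too distant to be reported.
pairs = ["seqA\tseqB\t0.05", "seqB\tseqC\t0.15"]
labels, sparse = parse_pairs(pairs)
print(labels)
print(sparse.get((labels["seqA"], labels["seqC"])))
```

Because only identity-derived distances are stored, nothing in this structure depends on whether the sequences are DNA or protein. A map like this could then be converted to a sparse matrix and passed to UMAP with metric="precomputed".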