USUM: Plotting sequence similarity using USEARCH & UMAP
USUM uses USEARCH and UMAP (or t-SNE) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.
Installation
- Install the USEARCH dependency manually: https://drive5.com/usearch/download.html (consider supporting the author by buying the 64-bit license)
- Install usum using PIP:
pip install usum
Usage
Use usum to plot input protein or DNA sequences in FASTA format.
Show all available options using usum --help
Minimal example
usum example.fa --maxdist 0.2 --termdist 0.3 --output example
Multiple input files with labels
usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output example
This will produce a PNG plot and an interactive Bokeh HTML plot.
Using t-SNE instead of UMAP
You can also produce a t-SNE plot using the --tsne flag.
usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example
This will produce a PNG plot of the t-SNE embedding.
Plotting random subset
You can use --limit to extract and plot a random subset of the input sequences.
# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output example
You can control randomness and reproducibility using the --seed option.
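To illustrate what a seeded random subset means in practice, here is a minimal sketch in plain Python. It is not usum's actual implementation: the helper names `read_fasta` and `sample_records` are hypothetical, and the sketch only assumes that a limit selects records uniformly at random and that a fixed seed makes the selection repeatable.

```python
import random

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def sample_records(records, limit, seed=None):
    """Return at most `limit` records, drawn uniformly without replacement.

    Hypothetical helper: the same seed always yields the same subset.
    """
    records = list(records)
    if limit is None or limit >= len(records):
        return records
    rng = random.Random(seed)  # private generator, does not touch global state
    return rng.sample(records, limit)

fasta_text = """>a
ACGT
>b
ACGA
>c
TTGA
>d
ACGG
"""
records = list(read_fasta(fasta_text.splitlines()))
subset = sample_records(records, limit=2, seed=42)
print(subset)
```

Running the last two lines again with the same seed returns the same two records, which is the property that makes a plot reproducible.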
Plotting options
See usum --help for all plotting options.
See UMAP API Guide for more info about the UMAP options.
- Use --limit to plot a random subset of records
- Use --width and --height to control plot size in pixels
- Use --resume to reuse previous distance matrix from the output folder
- Use --tsne to produce a t-SNE embedding instead of UMAP (you can use this with --resume)
- Use --umap-spread to control how close together the embedded points are in the UMAP embedding
- Use --umap-min-dist to control minimum distance between points in UMAP embedding
- Use --neighbors to control number of neighbors in UMAP graph
Reusing previous results
When changing just the plot options, you can use --resume to reuse previous results from the output folder.
Warning: This will reuse the previous distance matrix, so changes to limits or USEARCH args won't take effect.
# Reuse results from the example output directory
usum --resume --output example --width 600 --height 600 --theme fire
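When trying several plot settings against the same output folder, it can be handy to generate the commands from a script. The sketch below only builds and prints command lines from flags documented above (--resume, --output, --width, --height, --theme, --tsne); the function name `resume_command` is hypothetical and nothing is executed.

```python
import shlex

def resume_command(output, width=None, height=None, theme=None, tsne=False):
    """Build an argv list that re-plots from an existing output folder.

    Hypothetical helper: it only assembles flags listed in this README.
    """
    argv = ["usum", "--resume", "--output", output]
    if width is not None:
        argv += ["--width", str(width)]
    if height is not None:
        argv += ["--height", str(height)]
    if theme is not None:
        argv += ["--theme", theme]
    if tsne:
        argv.append("--tsne")
    return argv

# Print one command per plot size; pass the list to subprocess.run to execute.
for size in (600, 1200):
    argv = resume_command("example", width=size, height=size, theme="fire")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Because every command includes --resume, the distance matrix is reused each time and only the plotting step is repeated.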
Programmatic use
from usum import usum
# Show help
help(usum)
# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)
How it works
- A sparse distance matrix is calculated using the USEARCH calc_distmx command.
- The distances are based on % identity, so the method is agnostic to sequence type (DNA or protein).
- The distance matrix is embedded as a precomputed metric using UMAP.
- The embedding is plotted using umap.plot.
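The steps above can be sketched in plain Python to show the data involved. This is an illustration under stated assumptions, not usum's code: it assumes the distance is 1 minus the fraction of identical aligned positions, and that pairs arrive as tab-separated `label1`, `label2`, `distance` rows. The function names and the `far_value` placeholder for pairs missing from the sparse matrix are hypothetical.

```python
def identity_distance(aligned_a, aligned_b):
    """Distance as 1 - fraction of identical columns (assumed definition)."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return 1.0 - matches / len(aligned_a)

def parse_tabbed_pairs(lines):
    """Parse 'label1<TAB>label2<TAB>distance' rows into labels and sparse pairs."""
    index, pairs = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        a, b, dist = line.split("\t")
        i = index.setdefault(a, len(index))
        j = index.setdefault(b, len(index))
        if i != j:
            # Store each unordered pair once, keyed by (smaller, larger) index.
            pairs[(min(i, j), max(i, j))] = float(dist)
    labels = sorted(index, key=index.get)
    return labels, pairs

def to_dense(labels, pairs, far_value=1.0):
    """Expand sparse pairs into a symmetric matrix; absent pairs get far_value."""
    n = len(labels)
    dense = [[0.0 if r == c else far_value for c in range(n)] for r in range(n)]
    for (i, j), dist in pairs.items():
        dense[i][j] = dense[j][i] = dist
    return dense

rows = ["s1\ts2\t0.1", "s2\ts3\t0.25"]
labels, pairs = parse_tabbed_pairs(rows)
dense = to_dense(labels, pairs)
print(labels)
print(dense)
print(identity_distance("ACGT", "ACGA"))
```

A symmetric matrix like `dense` is the kind of input that a UMAP model configured with a precomputed metric accepts in place of feature vectors, which is why the method does not need to know whether the sequences are DNA or protein.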