A package to calculate and visualise approximate cluster identities for a large number of short nucleotide sequences using minimizers.
Approximate Cluster Identities (ACI)
A Python package to visualise the approximate within- and between-cluster identities of a large number of short sequences, where the clusters have been assigned by e.g. mmseqs2, cd-hit or panaroo.
Installation
pip install approximate-cluster-identities
Usage
aci -h
Create visualisations of approximate between and within cluster nucleotide identities for short sequences.
positional arguments:
input_fasta Input FASTA file of all sequences.
input_json Input JSON file with cluster assignments ({<sequence header>: <cluster assignment>}).
optional arguments:
-h, --help show this help message and exit
--clusterGML CLUSTERGML
Output path of GML clustering file to view with Cytoscape or similar.
--distanceTable DISTANCETABLE
Output path of CSV of distances (may take a long time).
--clusterPlot CLUSTERPLOT
Output path of jointplot to visualise between and within cluster identities.
--kmerSize KMERSIZE Kmer size (default: 9).
--windowSize WINDOWSIZE
Minimiser window size (default: 20).
--threshold THRESHOLD
Jaccard similarity threshold (default: 0.9).
--threads THREADS Threads for sketching and jaccard distance calculations (default: 1).
--shorter Assess identity relative to the shorter sequence.
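The input_json argument expects a flat mapping of {<sequence header>: <cluster assignment>}. Many clustering tools report the inverse (one cluster with a list of members), so the following is a minimal sketch of how such output could be inverted and written to JSON. The function name write_cluster_json, the example headers and the cluster labels are hypothetical and not part of this package; whether headers must match the FASTA headers exactly (with or without the description) is an assumption to check against your own data.

```python
import json


def write_cluster_json(clusters, path):
    """Invert {cluster: [headers]} into {header: cluster} and write it as JSON.

    `clusters` is a hypothetical in-memory representation of clustering output;
    it is not a format produced by any specific tool.
    """
    assignments = {}
    for cluster_id, headers in clusters.items():
        for header in headers:
            # Each sequence should belong to exactly one cluster.
            if header in assignments:
                raise ValueError(f"{header} is assigned to more than one cluster")
            assignments[header] = cluster_id
    with open(path, "w") as handle:
        json.dump(assignments, handle, indent=2)
    return assignments


if __name__ == "__main__":
    example = {"cluster_1": ["seq_a", "seq_b"], "cluster_2": ["seq_c"]}
    write_cluster_json(example, "clusters.json")
```

The resulting clusters.json could then be passed as the second positional argument, alongside the FASTA file containing the same sequences.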
Methods
We calculate sequence identities by pairwise calculation of Jaccard distances between sets of minimizers of size --kmerSize. One k-mer is sampled from each position of a window that slides across the sequence, where each window contains a total of --windowSize k-mers. Increasing --windowSize will decrease the number of minimizers per sequence, which decreases the sensitivity of the identity calculations but increases the speed of the programme. This tool is designed to give you an idea of how variable a large number of short sequences are within and between clusters, to help you choose an appropriate sequence clustering tool and its parameters.
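To make the idea concrete, here is a minimal sketch of minimizer sketching followed by a Jaccard comparison. It is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, minimizers are chosen by lexicographic order (real tools often use a hash), reverse complements are ignored, and the shorter_relative function is only one plausible reading of what --shorter might compute.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k=9, w=20):
    """Return the set of minimizers of seq.

    Each window holds w consecutive k-mers and contributes its
    lexicographically smallest k-mer (an assumed ordering).
    """
    ks = kmers(seq.upper(), k)
    if not ks:
        return set()
    if len(ks) <= w:
        # Sequence is shorter than one full window: a single minimizer.
        return {min(ks)}
    return {min(ks[i:i + w]) for i in range(len(ks) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two minimizer sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def shorter_relative(a, b):
    """Shared minimizers relative to the smaller set (assumed reading of --shorter)."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller


if __name__ == "__main__":
    s1 = "ACGTACGTTAGCTAGCTAGGATCCA"
    s2 = "ACGTACGTTAGCTAGCTAGGATCGA"
    for w in (1, 5):
        m1, m2 = minimizers(s1, k=5, w=w), minimizers(s2, k=5, w=w)
        print(w, len(m1), len(m2), jaccard(m1, m2))
```

With a window size of 1 every k-mer is its own minimizer, so the comparison uses the full k-mer set; larger windows keep fewer k-mers per sequence, which is the speed and sensitivity trade-off described above.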
Example output
Example cluster plots for data in test/ using --windowSize 1 and --windowSize 100.
Window size = 1
[Figures: mean, mode, median and range identities]
Window size = 100
[Figures: mean, mode, median and range identities]