Accurate protein sequence clustering via LSH, Smith-Waterman alignment, and Leiden community detection
ClustKIT is a bioinformatics tool for protein sequence clustering. It combines MinHash sketching, locality-sensitive hashing (LSH), banded Smith-Waterman alignment with BLOSUM62 scoring, and Leiden community detection to achieve high clustering accuracy at all identity thresholds, including the challenging low-identity regime (30-50%) where greedy heuristic methods lose sensitivity.
Features
- Accurate at low identity: Smith-Waterman alignment with BLOSUM62 scoring and Leiden graph partitioning produce well-connected clusters, especially at thresholds below 50%
- Fast: Multi-stage filtering (LSH + Jaccard pre-filter + length-ratio filter) eliminates >98% of candidate pairs before alignment
- Scalable: C/OpenMP alignment kernel with configurable thread count; optional GPU acceleration via CuPy
- Flexible clustering: Leiden community detection (default), connected components, or greedy
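The multi-stage filtering idea can be illustrated with a minimal sketch. The function names, the k-mer size, both thresholds, and the filter order below are assumptions chosen for illustration. They are not taken from ClustKIT's source. The point is only that cheap checks can discard a pair before the expensive alignment runs.

```python
def length_ratio(len_a, len_b):
    """Shorter length divided by longer length, in [0, 1]."""
    if len_a == 0 or len_b == 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


def kmer_set(seq, k):
    """All distinct k-mers of a sequence (empty if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def passes_prefilters(seq_a, seq_b, k=3, min_length_ratio=0.5, min_jaccard=0.1):
    """Return True if the pair should go on to alignment.

    The default thresholds are hypothetical, not ClustKIT's defaults.
    """
    # Cheapest test first: compare lengths only.
    if length_ratio(len(seq_a), len(seq_b)) < min_length_ratio:
        return False
    # Then compare shared k-mer content.
    return jaccard(kmer_set(seq_a, k), kmer_set(seq_b, k)) >= min_jaccard
```

Running the length check first means the k-mer sets are only built for pairs that survive it.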
Installation
pip install clustkit
With GPU acceleration (requires CUDA 12.x):
pip install clustkit[gpu]
From source:
git clone https://github.com/JLSteenwyk/ClustKIT.git
cd ClustKIT
pip install -e ".[dev]"
Quick start
# Cluster proteins at 50% identity using 8 threads
clustkit -i proteins.fasta -o output/ -t 0.5 --threads 8
# Use connected components instead of Leiden
clustkit -i proteins.fasta -o output/ -t 0.7 --cluster-method connected --threads 4
# GPU-accelerated alignment
clustkit -i proteins.fasta -o output/ -t 0.3 --device 0 --threads 8
Output files:
- output/clusters.tsv: Cluster assignments (sequence_id, cluster_id, is_representative)
- output/representatives.fasta: Representative sequences
- output/run_info.json: Run parameters and statistics
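For downstream analysis, the cluster assignments can be grouped by cluster with the standard library. This is a minimal sketch under assumptions: the file is tab-separated with the three columns in the order listed above, it may start with a header row, and is_representative is written as something like True/False or 1/0. None of these layout details are confirmed here, so check them against a real run. The function name `read_clusters` and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Group sequence IDs by cluster ID and collect one representative per cluster."""
    members = defaultdict(list)
    representatives = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and an assumed header row.
        if not row or row[0] == "sequence_id":
            continue
        seq_id, cluster_id, is_rep = row[0], row[1], row[2]
        members[cluster_id].append(seq_id)
        # Assumed encodings of a true flag.
        if is_rep.strip().lower() in {"1", "true", "yes"}:
            representatives[cluster_id] = seq_id
    return dict(members), representatives


# Made-up file content, for illustration only.
example = io.StringIO(
    "sequence_id\tcluster_id\tis_representative\n"
    "seqA\t0\tTrue\n"
    "seqB\t0\tFalse\n"
    "seqC\t1\tTrue\n"
)
members, representatives = read_clusters(example)
```

To read a real run, open output/clusters.tsv with `newline=""` and pass the file handle to `read_clusters`.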
Pipeline overview
ClustKIT clusters sequences through six phases:
- Read — Parse FASTA/FASTQ, integer-encode sequences
- Sketch — MinHash bottom-s sketches with adaptive k-mer size
- LSH — Locality-sensitive hashing to find candidate pairs
- Align — Banded Smith-Waterman with BLOSUM62 scoring, affine gap penalties, length-ratio and Jaccard pre-filters
- Graph — Sparse similarity graph construction
- Cluster — Leiden community detection (default), connected components, or greedy
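The Sketch and LSH phases can be illustrated with a small, standard-library-only sketch. It is not ClustKIT's implementation: the hash function (BLAKE2b), the k-mer size, the sketch size, and all function names are assumptions. The candidate search also swaps in the classic multi-hash MinHash signature with banding, because that is the textbook form of LSH. How ClustKIT derives LSH keys from its bottom-s sketches is not described here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmer_hashes(seq, k, seed=0):
    """Deterministic 64-bit hashes of all k-mers in seq, salted by seed."""
    salt = seed.to_bytes(8, "little")
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8, salt=salt).digest(),
            "little",
        )
        for i in range(len(seq) - k + 1)
    }


def bottom_s_sketch(seq, k=3, s=16):
    """Bottom-s sketch: the s smallest k-mer hash values, sorted."""
    return sorted(kmer_hashes(seq, k))[:s]


def estimate_jaccard(sketch_a, sketch_b, s=16):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def minhash_signature(seq, k=3, num_hashes=32):
    """Classic MinHash signature: one minimum per seeded hash (needs len(seq) >= k)."""
    return [min(kmer_hashes(seq, k, seed)) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=8):
    """Return pairs of names whose signatures agree on at least one whole band."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Toy sequences, for illustration only.
seqs = {"a": "MKTAYIAKQRQISFVK", "b": "MKTAYIAKQRQISFVK", "c": "WWWWWWWWWWWWWWWW"}
signatures = {name: minhash_signature(seq) for name, seq in seqs.items()}
candidates = lsh_candidate_pairs(signatures)
```

Only pairs that land in the same bucket for some band become candidates, so most dissimilar pairs are never compared directly.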
CLI reference
clustkit: Cluster sequences by identity threshold
Options
| Option | Description | Default |
|---|---|---|
| -i, --input | Input FASTA/FASTQ file | required |
| -o, --output | Output directory | required |
| -t, --threshold | Identity threshold (0.0-1.0) | 0.9 |
| --threads | Number of CPU threads | 1 |
| --device | cpu, auto, or GPU device ID (e.g., 0) | cpu |
| --cluster-method | leiden (default), connected, or greedy | leiden |
| --alignment | align (SW, accurate) or kmer (fast) | align |
| --clustering-mode | Presets: balanced, accurate, or fast | balanced |
| --sensitivity | LSH sensitivity: low, medium, high | per mode |
| --sketch-size | MinHash sketch size | 128 |
| -k, --kmer-size | K-mer size for sketching | 5 |
| --representative | longest, centroid, or most_connected | longest |
| --format | Output format: tsv or cdhit | tsv |
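When ClustKIT is driven from a Python script, the command line can be assembled from the options above. The helper below is a hypothetical convenience wrapper, not part of ClustKIT. It uses only flags listed in the table and only builds the argument list, so it runs without ClustKIT installed.

```python
import shlex


def build_clustkit_command(input_path, output_dir, threshold=0.9, threads=1,
                           cluster_method="leiden", device="cpu"):
    """Build an argument list for the clustkit CLI (hypothetical helper)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if cluster_method not in {"leiden", "connected", "greedy"}:
        raise ValueError("cluster_method must be leiden, connected, or greedy")
    return [
        "clustkit",
        "-i", str(input_path),
        "-o", str(output_dir),
        "-t", str(threshold),
        "--threads", str(threads),
        "--cluster-method", cluster_method,
        "--device", str(device),
    ]


cmd = build_clustkit_command("proteins.fasta", "output/", threshold=0.5, threads=8)
print(shlex.join(cmd))  # shell-quoted form of the command, for logging
```

With ClustKIT installed, the list can be passed to `subprocess.run(cmd, check=True)`.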
Dependencies
Core (installed automatically):
| Package | Purpose |
|---|---|
| numpy | Array operations |
| numba | JIT-compiled fallback kernels |
| biopython | FASTA/FASTQ parsing |
| scipy | Sparse graph, connected components |
| leidenalg | Leiden community detection |
| python-igraph | Graph representation for Leiden |
| typer + rich | CLI framework |
Optional:
| Package | Install | Purpose |
|---|---|---|
| cupy-cuda12x | pip install clustkit[gpu] | GPU-accelerated Smith-Waterman alignment |
License
MIT License. See LICENSE for details.