Accurate protein sequence clustering via LSH, Smith-Waterman alignment, and Leiden community detection


ClustKIT is a bioinformatics tool for protein sequence clustering. It combines MinHash sketching, locality-sensitive hashing (LSH), banded Smith-Waterman alignment with BLOSUM62 scoring, and Leiden community detection to achieve high clustering accuracy at all identity thresholds, including the challenging low-identity regime (30-50%) where greedy heuristic methods lose sensitivity.
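
To make the alignment step more concrete, here is a minimal sketch of banded local alignment in pure Python. It is not ClustKIT's C/OpenMP kernel: it assumes a toy match/mismatch score in place of BLOSUM62 and a linear gap penalty in place of affine gaps, and the function name and parameters are hypothetical.

```python
def banded_smith_waterman(a: str, b: str, band: int = 8,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score using only cells with |i - j| <= band.

    Assumptions for illustration: toy match/mismatch scoring (not BLOSUM62)
    and a linear gap penalty (not affine).
    """
    prev = [0] * (len(b) + 1)  # previous DP row; out-of-band cells stay at 0
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        lo = max(1, i - band)        # first column inside the band
        hi = min(len(b), i + band)   # last column inside the band
        for j in range(lo, hi + 1):
            score = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                      # restart the local alignment here
                prev[j - 1] + score,    # extend along the diagonal
                prev[j] + gap,          # gap in b
                curr[j - 1] + gap,      # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Restricting the computation to a band around the diagonal reduces the work per pair from O(len(a) * len(b)) to roughly O(len(a) * band), which is the general motivation for banding.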

Features

  • Accurate at low identity: Smith-Waterman alignment with BLOSUM62 scoring and Leiden graph partitioning produce well-connected clusters, especially at thresholds below 50%
  • Fast: Multi-stage filtering (LSH + Jaccard pre-filter + length-ratio filter) eliminates >98% of candidate pairs before alignment
  • Scalable: C/OpenMP alignment kernel with configurable thread count; optional GPU acceleration via CuPy
  • Flexible clustering: Leiden community detection (default), connected components, or greedy
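
The cheap pre-filters mentioned above can be illustrated with a short Python sketch. The thresholds, the fixed k-mer size, and the function names are assumptions chosen for illustration; ClustKIT's actual cut-offs and implementation may differ.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity of two k-mer sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def passes_prefilters(s1: str, s2: str, min_len_ratio: float = 0.5,
                      min_jaccard: float = 0.1, k: int = 3) -> bool:
    """Cheap checks to run before alignment (thresholds are illustrative)."""
    shorter, longer = sorted((len(s1), len(s2)))
    # Length-ratio filter: very different lengths cannot reach high identity.
    if longer == 0 or shorter / longer < min_len_ratio:
        return False
    # Jaccard filter: require some shared k-mer content.
    return jaccard(kmers(s1, k), kmers(s2, k)) >= min_jaccard
```

Only pairs for which `passes_prefilters` returns `True` would go on to the more expensive alignment step.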

Installation

pip install clustkit

With GPU acceleration (requires CUDA 12.x):

pip install clustkit[gpu]

From source:

git clone https://github.com/JLSteenwyk/ClustKIT.git
cd ClustKIT
pip install -e ".[dev]"

Quick start

# Cluster proteins at 50% identity using 8 threads
clustkit -i proteins.fasta -o output/ -t 0.5 --threads 8

# Use connected components instead of Leiden
clustkit -i proteins.fasta -o output/ -t 0.7 --cluster-method connected --threads 4

# GPU-accelerated alignment
clustkit -i proteins.fasta -o output/ -t 0.3 --device 0 --threads 8

Output files:

  • output/clusters.tsv — Cluster assignments (sequence_id, cluster_id, is_representative)
  • output/representatives.fasta — Representative sequences
  • output/run_info.json — Run parameters and statistics
  • output/cluster_size_distribution.png — Size distribution plot (with --plot)
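
For downstream analysis, the cluster assignments can be grouped with a few lines of Python. This sketch assumes that clusters.tsv is tab-separated with the columns in the order listed above and that it may start with a header row; check a real output file before relying on either assumption. The function name is hypothetical.

```python
import csv
from collections import defaultdict
from typing import Iterable


def read_clusters(lines: Iterable[str]) -> dict[str, list[str]]:
    """Group sequence IDs by cluster ID from tab-separated rows."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and an assumed header row.
        if not row or row[0] == "sequence_id":
            continue
        seq_id, cluster_id = row[0], row[1]
        clusters[cluster_id].append(seq_id)
    return dict(clusters)


# Usage with the output path from the list above:
# with open("output/clusters.tsv", newline="") as handle:
#     clusters = read_clusters(handle)
```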

Pipeline overview

ClustKIT clusters sequences through six phases:

  1. Read — Parse FASTA/FASTQ, integer-encode sequences
  2. Sketch — MinHash bottom-s sketches with adaptive k-mer size
  3. LSH — Locality-sensitive hashing to find candidate pairs
  4. Align — Banded Smith-Waterman with BLOSUM62 scoring, affine gap penalties, length-ratio and Jaccard pre-filters
  5. Graph — Sparse similarity graph construction
  6. Cluster — Leiden community detection (default), connected components, or greedy
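
Phases 2 and 3 can be sketched in pure Python as follows. This is an illustration under stated assumptions, not ClustKIT's code: it uses a fixed k-mer size rather than an adaptive one, BLAKE2b as the hash function, and a simple LSH scheme that buckets sequences by their minimum k-mer hash in several seeded tables. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmer_hashes(seq: str, k: int, seed: int = 0) -> set[int]:
    """Hash every overlapping k-mer to a 64-bit integer (seeded via the salt)."""
    salt = seed.to_bytes(8, "little")
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8, salt=salt).digest(),
            "little",
        )
        for i in range(len(seq) - k + 1)
    }


def bottom_s_sketch(seq: str, k: int = 5, s: int = 128) -> list[int]:
    """Keep the s smallest k-mer hashes, sorted ascending."""
    return sorted(kmer_hashes(seq, k))[:s]


def estimate_jaccard(a: list[int], b: list[int], s: int = 128) -> float:
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(a) | set(b))[:s]  # bottom-s sketch of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def lsh_candidates(seqs: dict[str, str], k: int = 5,
                   tables: int = 8) -> set[tuple[str, str]]:
    """Pair sequences that share a minimum k-mer hash in any seeded table."""
    pairs: set[tuple[str, str]] = set()
    for seed in range(1, tables + 1):
        buckets: dict[int, list[str]] = defaultdict(list)
        for name, seq in seqs.items():
            hashes = kmer_hashes(seq, k, seed)
            if hashes:  # sequences shorter than k have no k-mers
                buckets[min(hashes)].append(name)
        for names in buckets.values():
            pairs.update(combinations(sorted(names), 2))
    return pairs
```

For one hash function, two sequences share the same minimum hash with probability equal to the Jaccard similarity of their k-mer sets, so using more tables raises the chance that similar pairs collide at least once. The defaults `k=5` and `s=128` mirror the CLI defaults for `-k` and `--sketch-size`.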

CLI reference

clustkit   Cluster sequences by identity threshold

Options

Option              Description                               Default
-i, --input         Input FASTA/FASTQ file                    required
-o, --output        Output directory                          required
-t, --threshold     Identity threshold (0.0-1.0)              0.9
--threads           Number of CPU threads                     1
--device            cpu, auto, or GPU device ID (e.g., 0)     cpu
--cluster-method    leiden (default), connected, or greedy    leiden
--alignment         align (SW, accurate) or kmer (fast)       align
--clustering-mode   Presets: balanced, accurate, or fast      balanced
--sensitivity       LSH sensitivity: low, medium, high        per mode
--sketch-size       MinHash sketch size                       128
-k, --kmer-size     K-mer size for sketching                  5
--representative    longest, centroid, or most_connected      longest
--format            Output format: tsv or cdhit               tsv
--plot              Generate cluster size distribution plot   off
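
When driving ClustKIT from a Python workflow, one option is to assemble the command line from the flags in the table above and run it with `subprocess`. The helper names below are hypothetical, and the sketch assumes that `clustkit` is on the PATH and that `--plot` is a plain on/off flag.

```python
import subprocess


def build_clustkit_command(input_path: str, output_dir: str, threshold: float = 0.9,
                           threads: int = 1, cluster_method: str = "leiden",
                           plot: bool = False) -> list[str]:
    """Assemble a clustkit argument list from documented options."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if cluster_method not in {"leiden", "connected", "greedy"}:
        raise ValueError(f"unknown cluster method: {cluster_method}")
    cmd = [
        "clustkit",
        "-i", str(input_path),
        "-o", str(output_dir),
        "-t", str(threshold),
        "--threads", str(threads),
        "--cluster-method", cluster_method,
    ]
    if plot:
        cmd.append("--plot")  # assumed to be a flag without a value
    return cmd


def run_clustkit(*args, **kwargs) -> None:
    """Run clustkit and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_clustkit_command(*args, **kwargs), check=True)


# Usage:
# run_clustkit("proteins.fasta", "output/", threshold=0.5, threads=8)
```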

Dependencies

Core (installed automatically):

Package                 Purpose
numpy                   Array operations
numba                   JIT-compiled fallback kernels
biopython               FASTA/FASTQ parsing
scipy                   Sparse graph, connected components
leidenalg               Leiden community detection
python-igraph           Graph representation for Leiden
typer + rich            CLI framework
matplotlib + pypubfigs  Cluster size distribution plot (--plot)

Optional:

Package       Install                    Purpose
cupy-cuda12x  pip install clustkit[gpu]  GPU-accelerated Smith-Waterman alignment

License

MIT License. See LICENSE for details.
