
Accurate protein sequence clustering via LSH, Smith-Waterman alignment, and Leiden community detection

ClustKIT is a bioinformatics tool for protein sequence clustering. It combines MinHash sketching, locality-sensitive hashing (LSH), banded Smith-Waterman alignment with BLOSUM62 scoring, and Leiden community detection. The design targets high clustering accuracy across identity thresholds, including the challenging low-identity regime (30-50%) where greedy heuristic methods lose sensitivity.

Features

  • Accurate at low identity: Smith-Waterman alignment with BLOSUM62 scoring and Leiden graph partitioning produce well-connected clusters, especially at thresholds below 50%
  • Fast: Multi-stage filtering (LSH + Jaccard pre-filter + length-ratio filter) eliminates >98% of candidate pairs before alignment
  • Scalable: C/OpenMP alignment kernel with configurable thread count; optional GPU acceleration via CuPy
  • Flexible clustering: Leiden community detection (default), connected components, or greedy
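The pre-filtering idea behind the "Fast" feature can be illustrated with a small sketch. This is not ClustKIT's implementation: the function names and the cutoffs `min_length_ratio` and `min_jaccard` are hypothetical, and the Jaccard value here is computed exactly on k-mer sets rather than estimated from sketches.

```python
def kmer_set(seq: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_prefilters(a: str, b: str, k: int = 5,
                      min_length_ratio: float = 0.8,
                      min_jaccard: float = 0.1) -> bool:
    """Cheap checks to run before an expensive alignment.

    Both cutoffs are illustrative assumptions, not ClustKIT defaults.
    """
    # Length-ratio filter: very different lengths cannot reach high identity.
    shorter, longer = sorted((len(a), len(b)))
    if longer == 0 or shorter / longer < min_length_ratio:
        return False
    # Jaccard filter: pairs sharing too few k-mers are unlikely to align well.
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return False
    return len(ka & kb) / len(union) >= min_jaccard
```

Only pairs for which `passes_prefilters` returns `True` would go on to alignment, which is where most of the runtime savings come from.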

Installation

pip install clustkit

With GPU acceleration (requires CUDA 12.x):

pip install clustkit[gpu]

From source:

git clone https://github.com/JLSteenwyk/ClustKIT.git
cd ClustKIT
pip install -e ".[dev]"

Quick start

# Cluster proteins at 50% identity using 8 threads
clustkit -i proteins.fasta -o output/ -t 0.5 --threads 8

# Use connected components instead of Leiden
clustkit -i proteins.fasta -o output/ -t 0.7 --cluster-method connected --threads 4

# GPU-accelerated alignment
clustkit -i proteins.fasta -o output/ -t 0.3 --device 0 --threads 8

Output files:

  • output/clusters.tsv — Cluster assignments (sequence_id, cluster_id, is_representative)
  • output/representatives.fasta — Representative sequences
  • output/run_info.json — Run parameters and statistics
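For downstream analysis, the cluster assignments can be grouped by cluster ID in a few lines of Python. The sketch below assumes that `clusters.tsv` is tab-separated with the three columns listed above and that it may start with a header row; the sample rows and the `True`/`False` encoding of `is_representative` are made up for illustration.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["sequence_id", "cluster_id", "is_representative"]


def read_clusters(handle):
    """Group sequence IDs by cluster ID from a tab-separated handle."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row == COLUMNS:  # skip blank lines and an optional header
            continue
        sequence_id, cluster_id = row[0], row[1]
        clusters[cluster_id].append(sequence_id)
    return dict(clusters)


# Hypothetical sample content; real data would come from output/clusters.tsv.
sample = io.StringIO(
    "sequence_id\tcluster_id\tis_representative\n"
    "seqA\t0\tTrue\n"
    "seqB\t0\tFalse\n"
    "seqC\t1\tTrue\n"
)
clusters = read_clusters(sample)
```

To read a real run, pass an open file instead of the `StringIO` object, for example `open("output/clusters.tsv", newline="")`.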

Pipeline overview

ClustKIT clusters sequences through six phases:

  1. Read — Parse FASTA/FASTQ, integer-encode sequences
  2. Sketch — MinHash bottom-s sketches with adaptive k-mer size
  3. LSH — Locality-sensitive hashing to find candidate pairs
  4. Align — Banded Smith-Waterman with BLOSUM62 scoring, affine gap penalties, length-ratio and Jaccard pre-filters
  5. Graph — Sparse similarity graph construction
  6. Cluster — Leiden community detection (default), connected components, or greedy
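A toy version of the sketch, graph, and cluster phases can make the data flow concrete. It is a simplified illustration under several assumptions, not ClustKIT's code: it compares all pairs instead of using LSH, scores pairs by estimated k-mer Jaccard similarity instead of Smith-Waterman identity, and uses connected components (via union-find) rather than Leiden. All function names and the 0.5 edge cutoff are hypothetical.

```python
import hashlib
from itertools import combinations


def kmer_hashes(seq, k=5):
    """Hash every overlapping k-mer to a 64-bit integer."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def bottom_s_sketch(seq, s=128, k=5):
    """Keep the s smallest k-mer hashes (a bottom-s MinHash sketch)."""
    return sorted(kmer_hashes(seq, k))[:s]


def estimate_jaccard(sketch_a, sketch_b, s=128):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def connected_components(n, edges):
    """Union-find over n nodes; returns one component label per node."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]


seqs = ["MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKSHFSRQ", "GGGGSGGGGSGGGGSGGGGSGG"]
sketches = [bottom_s_sketch(x) for x in seqs]
# Graph phase: keep an edge when the estimated similarity passes the cutoff.
edges = [
    (i, j)
    for i, j in combinations(range(len(seqs)), 2)
    if estimate_jaccard(sketches[i], sketches[j]) >= 0.5
]
labels = connected_components(len(seqs), edges)
```

Here the two identical sequences end up with the same label and the third stays in its own cluster. The all-pairs loop is quadratic, which is exactly the cost that the LSH phase is meant to avoid on real datasets.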

CLI reference

clustkit: cluster sequences by identity threshold

Options

| Option | Description | Default |
| --- | --- | --- |
| -i, --input | Input FASTA/FASTQ file | required |
| -o, --output | Output directory | required |
| -t, --threshold | Identity threshold (0.0-1.0) | 0.9 |
| --threads | Number of CPU threads | 1 |
| --device | cpu, auto, or GPU device ID (e.g., 0) | cpu |
| --cluster-method | leiden, connected, or greedy | leiden |
| --alignment | align (SW, accurate) or kmer (fast) | align |
| --clustering-mode | Presets: balanced, accurate, or fast | balanced |
| --sensitivity | LSH sensitivity: low, medium, high | per mode |
| --sketch-size | MinHash sketch size | 128 |
| -k, --kmer-size | K-mer size for sketching | 5 |
| --representative | longest, centroid, or most_connected | longest |
| --format | Output format: tsv or cdhit | tsv |
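When ClustKIT is driven from a Python workflow, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of ClustKIT; it uses only flags documented in the options table and checks two of the documented value ranges before building the argument list.

```python
import shlex


def build_clustkit_command(input_path, output_dir, threshold=0.9,
                           threads=1, cluster_method="leiden"):
    """Assemble a clustkit argument list from documented options."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if cluster_method not in {"leiden", "connected", "greedy"}:
        raise ValueError(f"unknown cluster method: {cluster_method}")
    return [
        "clustkit",
        "-i", str(input_path),
        "-o", str(output_dir),
        "-t", str(threshold),
        "--threads", str(threads),
        "--cluster-method", cluster_method,
    ]


cmd = build_clustkit_command("proteins.fasta", "output/", threshold=0.5, threads=8)
print(shlex.join(cmd))  # shell-safe string, handy for logging
# To execute it: subprocess.run(cmd, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting problems with file names that contain spaces.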

Dependencies

Core (installed automatically):

| Package | Purpose |
| --- | --- |
| numpy | Array operations |
| numba | JIT-compiled fallback kernels |
| biopython | FASTA/FASTQ parsing |
| scipy | Sparse graph, connected components |
| leidenalg | Leiden community detection |
| python-igraph | Graph representation for Leiden |
| typer + rich | CLI framework |

Optional:

| Package | Install | Purpose |
| --- | --- | --- |
| cupy-cuda12x | pip install clustkit[gpu] | GPU-accelerated Smith-Waterman alignment |
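Because the GPU dependency is optional, a calling script may want to check whether CuPy is importable before requesting a GPU device. The sketch below is an assumption about how a caller could do this, not a description of how ClustKIT resolves `--device auto` internally; `pick_device` is a hypothetical helper, and an importable CuPy package does not guarantee a working GPU.

```python
import importlib.util


def pick_device(requested="auto"):
    """Resolve a --device value; 'auto' falls back to 'cpu' without CuPy."""
    if requested != "auto":
        return requested  # explicit choices such as "cpu" or "0" pass through
    has_cupy = importlib.util.find_spec("cupy") is not None
    # "0" means the first GPU, matching the device ID example in the options.
    return "0" if has_cupy else "cpu"


device = pick_device("auto")
```

The returned string can be passed as the value of `--device` when launching ClustKIT.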

License

MIT License. See LICENSE for details.
