clustkit

Accurate protein sequence clustering via LSH, Smith-Waterman alignment, and Leiden community detection

ClustKIT is a bioinformatics tool for protein sequence clustering. It combines MinHash sketching, locality-sensitive hashing (LSH), banded Smith-Waterman alignment with BLOSUM62 scoring, and Leiden community detection. The design targets accurate clustering across identity thresholds, including the challenging low-identity regime (30-50%) where greedy heuristic methods lose sensitivity.

Features

Accurate at low identity: Smith-Waterman alignment with BLOSUM62 scoring and Leiden graph partitioning produce well-connected clusters, especially at thresholds below 50%
Fast: Multi-stage filtering (LSH + Jaccard pre-filter + length-ratio filter) eliminates >98% of candidate pairs before alignment
Scalable: C/OpenMP alignment kernel with configurable thread count; optional GPU acceleration via CuPy
Flexible clustering: Leiden community detection (default), connected components, or greedy
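
To illustrate the idea behind multi-stage filtering, here is a minimal Python sketch of a length-ratio check followed by a k-mer Jaccard check. It is not ClustKIT's implementation: the function names and the cutoff values (`min_len_ratio`, `min_jaccard`) are hypothetical, and the real filters operate on sketches inside the compiled kernel.

```python
def kmer_set(seq, k=5):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_prefilters(a, b, min_len_ratio=0.5, min_jaccard=0.1, k=5):
    """Cheap checks to run before an expensive alignment (hypothetical cutoffs)."""
    # Length-ratio filter: skip pairs whose lengths differ too much.
    shorter, longer = sorted((len(a), len(b)))
    if longer == 0 or shorter / longer < min_len_ratio:
        return False
    # Jaccard filter: require a minimum fraction of shared k-mers.
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return False
    return len(ka & kb) / len(union) >= min_jaccard
```

Only pairs for which `passes_prefilters` returns `True` would go on to alignment; ordering the cheapest test first keeps the cost of rejected pairs low.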

Installation

pip install clustkit

With GPU acceleration (requires CUDA 12.x):

pip install "clustkit[gpu]"

From source:

git clone https://github.com/JLSteenwyk/ClustKIT.git
cd ClustKIT
pip install -e ".[dev]"

Quick start

# Cluster proteins at 50% identity using 8 threads
clustkit -i proteins.fasta -o output/ -t 0.5 --threads 8

# Use connected components instead of Leiden
clustkit -i proteins.fasta -o output/ -t 0.7 --cluster-method connected --threads 4

# GPU-accelerated alignment
clustkit -i proteins.fasta -o output/ -t 0.3 --device 0 --threads 8

Output files:

output/clusters.tsv — Cluster assignments (sequence_id, cluster_id, is_representative)
output/representatives.fasta — Representative sequences
output/run_info.json — Run parameters and statistics
output/cluster_size_distribution.png — Size distribution plot (with --plot)
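
The cluster assignments can be post-processed with a few lines of Python. The sketch below assumes that `clusters.tsv` is tab-separated with the three columns listed above, that a header row (if present) starts with `sequence_id`, and that `is_representative` is written as `1`/`0` or `True`/`False`. None of these details are confirmed here, so check them against a real output file. The function name `load_clusters` is illustrative.

```python
import csv
from collections import defaultdict


def load_clusters(lines):
    """Group sequence IDs by cluster ID from tab-separated rows."""
    members = defaultdict(list)
    representatives = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and an optional header row.
        if len(row) < 3 or row[0] == "sequence_id":
            continue
        seq_id, cluster_id, is_rep = row[0], row[1], row[2]
        members[cluster_id].append(seq_id)
        # Assumption: the flag is encoded as 1/0 or True/False.
        if is_rep.strip().lower() in {"1", "true", "yes"}:
            representatives[cluster_id] = seq_id
    return dict(members), representatives


# Inline sample rows standing in for the contents of output/clusters.tsv.
sample = [
    "sequence_id\tcluster_id\tis_representative",
    "seqA\t0\t1",
    "seqB\t0\t0",
    "seqC\t1\t1",
]
members, representatives = load_clusters(sample)
print(len(members), "clusters")
```

For a real run, pass an open file handle instead of the sample list, for example `open("output/clusters.tsv", newline="")`.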

Pipeline overview

ClustKIT clusters sequences through six phases:

Read — Parse FASTA/FASTQ, integer-encode sequences
Sketch — MinHash bottom-s sketches with adaptive k-mer size
LSH — Locality-sensitive hashing to find candidate pairs
Align — Banded Smith-Waterman with BLOSUM62 scoring, affine gap penalties, length-ratio and Jaccard pre-filters
Graph — Sparse similarity graph construction
Cluster — Leiden community detection (default), connected components, or greedy
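
As an illustration of the Sketch phase, the following is a minimal bottom-s MinHash sketch in Python with a Jaccard estimate computed from two sketches. It is a simplified sketch, not ClustKIT's code: it uses a fixed k-mer size rather than an adaptive one, hashes k-mers with BLAKE2b as an arbitrary stable hash, and all function names are illustrative. The defaults `s=128` and `k=5` mirror the documented `--sketch-size` and `--kmer-size` defaults.

```python
import hashlib


def kmer_hashes(seq, k=5):
    """Hash every k-mer to a 64-bit integer using a stable hash."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def bottom_s_sketch(seq, s=128, k=5):
    """Keep the s smallest k-mer hashes, sorted ascending."""
    return sorted(kmer_hashes(seq, k))[:s]


def estimate_jaccard(sketch_a, sketch_b, s=128):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    # The s smallest hashes of the merged sketches are the bottom-s
    # sketch of the union of the two k-mer sets.
    union_bottom = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not union_bottom:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    # Fraction of the union sample that occurs in both sequences.
    return sum(1 for h in union_bottom if h in shared) / len(union_bottom)
```

Because each sketch has at most `s` entries, comparing two sequences costs the same regardless of their length, which is what makes sketching useful ahead of the LSH and alignment phases.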

CLI reference

clustkit: Cluster sequences by identity threshold

Options

Option	Description	Default
`-i, --input`	Input FASTA/FASTQ file	required
`-o, --output`	Output directory	required
`-t, --threshold`	Identity threshold (0.0-1.0)	0.9
`--threads`	Number of CPU threads	1
`--device`	`cpu`, `auto`, or GPU device ID (e.g., `0`)	`cpu`
`--cluster-method`	`leiden`, `connected`, or `greedy`	`leiden`
`--alignment`	`align` (SW, accurate) or `kmer` (fast)	`align`
`--clustering-mode`	Presets: `balanced`, `accurate`, or `fast`	`balanced`
`--sensitivity`	LSH sensitivity: `low`, `medium`, `high`	set by `--clustering-mode`
`--sketch-size`	MinHash sketch size	128
`-k, --kmer-size`	K-mer size for sketching	5
`--representative`	`longest`, `centroid`, or `most_connected`	`longest`
`--format`	Output format: `tsv` or `cdhit`	`tsv`
`--plot`	Generate cluster size distribution plot	off
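
When ClustKIT is called from a Python workflow, the options above can be assembled into an argument list and passed to `subprocess`. This is a minimal sketch: the helper names `build_clustkit_command` and `run_clustkit` are hypothetical, only flags from the table are used, `--plot` is assumed to be a valueless switch, and it is assumed that `clustkit` exits with a non-zero status on failure.

```python
import subprocess


def build_clustkit_command(input_path, output_dir, threshold=0.9, threads=1,
                           device="cpu", cluster_method="leiden", plot=False):
    """Assemble a clustkit argument list from the documented options."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if cluster_method not in {"leiden", "connected", "greedy"}:
        raise ValueError(f"unknown cluster method: {cluster_method}")
    cmd = [
        "clustkit",
        "-i", str(input_path),
        "-o", str(output_dir),
        "-t", str(threshold),
        "--threads", str(threads),
        "--device", str(device),
        "--cluster-method", cluster_method,
    ]
    if plot:
        cmd.append("--plot")  # assumed to be a flag that takes no value
    return cmd


def run_clustkit(**kwargs):
    """Run clustkit; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_clustkit_command(**kwargs), check=True)
```

For example, `run_clustkit(input_path="proteins.fasta", output_dir="output/", threshold=0.5, threads=8)` corresponds to the first Quick start command, with `--device` and `--cluster-method` spelled out at their defaults.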

Dependencies

Core (installed automatically):

Package	Purpose
numpy	Array operations
numba	JIT-compiled fallback kernels
biopython	FASTA/FASTQ parsing
scipy	Sparse graph, connected components
leidenalg	Leiden community detection
python-igraph	Graph representation for Leiden
typer + rich	CLI framework
matplotlib + pypubfigs	Cluster size distribution plot (`--plot`)

Optional:

Package	Install	Purpose
cupy-cuda12x	`pip install "clustkit[gpu]"`	GPU-accelerated Smith-Waterman alignment

License

MIT License. See LICENSE for details.

Release history

0.1.2 (Apr 10, 2026; this version)
0.1.1 (Apr 10, 2026)
0.1.0 (Apr 8, 2026)

File details

Details for the file clustkit-0.1.2.tar.gz.

File metadata

Download URL: clustkit-0.1.2.tar.gz
Upload date: Apr 10, 2026
Size: 77.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.13

File hashes

Hashes for clustkit-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`33c8afe6471b3fa2b3195bfc80878365edfb59f104e903c6b6df3e87ae656d72`
MD5	`584c2a079ba0f3d6949f9a3210ac52bc`
BLAKE2b-256	`cd3e45ca940a3ed672d6344d9164e8a22359f41b5cc8d51cd1d948cf417d7ac5`
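
A downloaded archive can be checked against the published SHA256 digest with Python's standard `hashlib`. The helper names below are illustrative, and the file path in the usage note is an assumption about where the archive was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

To verify the source distribution, call `matches_published_hash("clustkit-0.1.2.tar.gz", "33c8afe6471b3fa2b3195bfc80878365edfb59f104e903c6b6df3e87ae656d72")` and treat a `False` result as a corrupted or tampered download.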

File details

Details for the file clustkit-0.1.2-py2.py3-none-any.whl.

File metadata

Download URL: clustkit-0.1.2-py2.py3-none-any.whl
Upload date: Apr 10, 2026
Size: 74.9 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.13

File hashes

Hashes for clustkit-0.1.2-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`c68b8b3978f3fdbb2287e30f92d35dbcb4b9ea07224fddbcda50a0ea06ce12ef`
MD5	`f3cdd0aaafaaab05008c38a8ca9424f8`
BLAKE2b-256	`0c588ea549bc676d29f0e86f1a8f1556ca97b03844eb62ea713e71b97fc9c648`
