clustrX: Highly Robust and Sensitive Protein Clustering Using Similarity Networks and Leiden Community Detection

clustrX: Highly Robust and Sensitive Protein Clustering

clustrX is a high-performance framework designed to transform sequence similarity search results into biologically coherent protein families. By modeling homology as a weighted mathematical network and applying the Leiden community detection algorithm, clustrX provides a sensitive and robust solution for clustering sequences, especially in complex scenarios involving remote homology and short peptides.

🚀 Key Features

- Leiden Community Detection: Rather than relying on simple pairwise links, clustrX identifies densely connected communities, which ensures high internal cohesion and prevents artificial merging of families (e.g., through domain bridges).
- Agnostic Input: Works with results from BLAST, Diamond, MMseqs2, and HMMER, or from other tools via the custom input option.
- Dynamic Coverage Filter: The recommended way to handle sequences of varying lengths and obtain reliable, biologically sound results.
- Ultra-Fast Performance: Powered by Polars (Rust-based) for data processing and igraph (C-based) for network analysis.
- Integrated Workflow: From similarity hits to Multiple Sequence Alignments (MSAs) in a single command.

📦 Installation

You can install clustrX using two main methods. Note the difference in dependency management:

Option A: Via Conda (Recommended)

This is the easiest way as it automatically installs all external dependencies, including MAFFT for alignments.

conda install -c bioconda clustrx

Option B: Via Pip (Using a Virtual Environment)

To avoid conflicts with other packages and ensure the clustrx command is correctly recognized by your system (avoiding PATH issues), we highly recommend using a virtual environment:

Create a new environment:
```
python -m venv clustrx_env
```
Activate it:
- Windows: clustrx_env\Scripts\activate
- Linux/macOS: source clustrx_env/bin/activate
Install:
```
pip install clustrX
```

Tip: If the clustrx command is not recognized after installation (common on Windows), the installation directory is probably not in your system's PATH. You can either add it manually or run the module directly, which does not depend on PATH: python -m clustrx [arguments]

Note: If you use Pip, remember that you must install MAFFT manually on your system if you plan to use the --mafft option.
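To check whether both executables are visible from the current environment, a small Python sketch like the one below may help. The helper name `find_tools` is made up for this example and is not part of clustrX; it only relies on the standard library.

```python
import shutil


def find_tools(names=("clustrx", "mafft")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report which of the two executables can be found from this environment.
for name, path in find_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If clustrx is reported as missing inside an activated virtual environment, the python -m clustrx form from the tip above is the simplest workaround.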

⚙️ Input Formats & Requirements

clustrX is designed to be a post-processing layer. It requires two main inputs:

Similarity Hits: A tabular file (BLAST-like or HMMER).
Sequences: A FASTA file containing the sequences referenced in the hits.

Using BLAST

clustrX works natively with the default tabular output of BLAST (-outfmt 6).

blastp -query sequences.fasta -db database -out hits.tsv -outfmt 6
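For reference, the default -outfmt 6 table has twelve tab-separated columns in a fixed order. The sketch below shows how such a file could be read in plain Python; it illustrates the format only and is not clustrX's own parser (which, per the feature list, is built on Polars). The function name `read_blast6` and the example hit are invented for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def read_blast6(handle):
    """Yield one dict per hit from a tab-separated outfmt 6 stream."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        for name in INT_COLUMNS:
            hit[name] = int(hit[name])
        for name in FLOAT_COLUMNS:
            hit[name] = float(hit[name])
        yield hit


# Made-up example hit, used only to exercise the parser.
example = "seqA\tseqB\t45.5\t120\t60\t2\t1\t120\t5\t124\t1e-30\t110.0\n"
hits = list(read_blast6(io.StringIO(example)))
print(hits[0]["qseqid"], hits[0]["sseqid"], hits[0]["bitscore"])
```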

Using Diamond or MMseqs2

If you use these tools, you must ensure the output is in BLAST tabular format (outfmt 6):

Diamond:

diamond blastp -q query.fasta -d db.dmnd -o hits.tsv --outfmt 6

MMseqs2:

mmseqs easy-search query.fasta target.fasta hits.tsv tmp --format-mode 0

Using HMMER

HMMER outputs require specific flags depending on the filtering level you need:

domtblout (Recommended): Use the --domtblout flag in hmmsearch or phmmer. This format provides alignment coordinates, which are required for using the Dynamic Coverage filter.
```
hmmsearch --domtblout hits.domtblout profile.hmm database.fasta
```
tblout: Use the --tblout flag. Note that this format lacks coordinate information; therefore, Dynamic Coverage cannot be applied (only E-value and Bitscore filters will be used).
```
hmmsearch --tblout hits.tblout profile.hmm database.fasta
```
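To illustrate why coordinates matter, the sketch below parses one domtblout data line and derives alignment coverage on both the profile (query) and the sequence (target). The column positions follow the HMMER domain table layout (22 whitespace-separated fields followed by a free-text description). The function names and the example line are invented, and the exact coverage definition used by clustrX is an assumption here; the paper is the authoritative source.

```python
def parse_domtblout_line(line):
    """Parse one data line of HMMER --domtblout output into a dict."""
    # The first 22 fields are whitespace-separated; the remainder is the description.
    fields = line.split(None, 22)
    return {
        "target": fields[0],
        "tlen": int(fields[2]),
        "query": fields[3],
        "qlen": int(fields[5]),
        "i_evalue": float(fields[12]),   # independent E-value of this domain
        "dom_score": float(fields[13]),  # bit score of this domain
        "hmm_from": int(fields[15]),
        "hmm_to": int(fields[16]),
        "ali_from": int(fields[17]),
        "ali_to": int(fields[18]),
    }


def coverages(hit):
    """Return (query coverage, target coverage) as fractions of full length."""
    # Assumption: coverage is the aligned span divided by the full length.
    query_cov = (hit["hmm_to"] - hit["hmm_from"] + 1) / hit["qlen"]
    target_cov = (hit["ali_to"] - hit["ali_from"] + 1) / hit["tlen"]
    return query_cov, target_cov


# Made-up example line, used only to exercise the parser.
line = ("seq1 - 200 PF00001 - 100 1e-40 140.2 0.1 1 1 2e-42 1e-40 139.8 0.1 "
        "1 100 11 110 9 112 0.97 example description")
hit = parse_domtblout_line(line)
print(coverages(hit))
```

A tblout line carries no from/to columns, so neither ratio can be computed from it, which is why only the E-value and Bitscore filters apply in that case.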

🧬 The Power of Dynamic Coverage

We strongly recommend using the Dynamic Coverage mode (--coverage dynamic) for most scientific applications. For more details, see the paper.

Standard clustering methods often use fixed thresholds that fail to resolve relationships between sequences of very different sizes. Our dynamic filter uses a hyperbolic decay function (calibrated with a 50-residue scale factor) that:

Increases stringency for short peptides (up to 0.8 coverage) to filter out statistical noise.
Gradually relaxes for larger proteins (down to 0.4 coverage) to maximize sensitivity in detecting remote homology.
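The exact formula is given in the paper and is not reproduced here. As a rough illustration only, one hyperbolic decay consistent with the numbers above (0.8 ceiling, 0.4 floor, 50-residue scale) could look like the sketch below. The function name, the functional form, and the choice of which sequence length to feed in are all assumptions, not the clustrX implementation.

```python
def dynamic_coverage_threshold(length, max_cov=0.8, min_cov=0.4, scale=50.0):
    """Hypothetical length-dependent coverage cutoff.

    Starts at max_cov for very short sequences and decays hyperbolically
    towards min_cov as the length grows. The form is an assumption.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    return min_cov + (max_cov - min_cov) * scale / (scale + length)


# Short peptides get a strict cutoff, long proteins a relaxed one.
for length in (10, 50, 300, 1000):
    print(length, round(dynamic_coverage_threshold(length), 3))
```

Under this assumed form the cutoff sits halfway between the two limits at 50 residues, which is one way to read "50-residue scale factor".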

🛠️ Workflow & Usage

The clustrX pipeline follows a clear 3-step logic:

Filter: Hits are filtered based on E-value, Bitscore, and (recommended) Dynamic Coverage.
Cluster: A similarity network is built with edges weighted by Bitscore, then partitioned using the Leiden algorithm.
Output: Results are exported. Note: Fasta generation and alignments are optional.
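The first two steps can be sketched in plain Python as follows. This is a simplified illustration, not the clustrX code: the function name `build_edges`, the default thresholds, and the rule for merging reciprocal hits (keep the higher Bitscore) are assumptions made for the example.

```python
def build_edges(hits, max_evalue=1e-5, min_bitscore=0.0):
    """Filter hits and collapse them into undirected, Bitscore-weighted edges.

    hits: iterable of (query, target, evalue, bitscore) tuples.
    Returns a dict mapping a sorted (node, node) pair to its edge weight.
    """
    edges = {}
    for query, target, evalue, bitscore in hits:
        if query == target:
            continue  # self-hits carry no clustering information
        if evalue > max_evalue or bitscore < min_bitscore:
            continue  # hit fails the E-value or Bitscore filter
        pair = tuple(sorted((query, target)))
        # Assumption: reciprocal hits are merged by keeping the higher Bitscore.
        edges[pair] = max(bitscore, edges.get(pair, 0.0))
    return edges


# Made-up hits: one self-hit, one reciprocal pair, one weak hit, one good hit.
hits = [
    ("A", "A", 1e-80, 300.0),
    ("A", "B", 1e-40, 150.0),
    ("B", "A", 1e-38, 145.0),
    ("B", "C", 1e-2, 30.0),
    ("C", "D", 1e-20, 90.0),
]
edges = build_edges(hits)
print(edges)
```

A weighted edge list of this kind is what a Leiden implementation takes as input, for example `Graph.community_leiden` in python-igraph with the weights passed through its `weights` argument.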

Example: Recommended Scientific Run

clustrx -i hits.tsv -f sequences.fasta --coverage dynamic --write-fasta --mafft --outdir results_full

--write-fasta: (Optional) Creates a FASTA file for each generated cluster.
--mafft: (Optional) Automatically performs Multiple Sequence Alignment for each cluster.

💡 Use Cases

Protein Family Discovery: Organizing large proteomes into evolutionarily related groups.
Short Peptide Classification: Specifically tuned for the discovery of short peptides such as Antimicrobial Peptides (AMPs), toxins, and signaling peptides.
Remote Homology Exploration: Identifying relationships in the "twilight zone" (identity < 30%) where traditional greedy methods fragment families.
Domain-Aware Clustering: Using HMMER domtblout inputs to cluster sequences based on specific functional domains.

📝 Citation

If you use clustrX in your research, please cite:

Benítez-Prián, M. & San Mauro, D. (2026). clustrX: Highly Robust and Sensitive Protein Clustering Using Similarity Networks and Leiden Community Detection.

👤 Authors

Mario Benítez-Prián & Diego San Mauro

Contact: mario.benitezprian@gmail.com | GitHub

Project details

Maintainer: itssmarioo

Release history

- 1.0.1 (May 8, 2026)
- 1.0.0 (May 8, 2026, this version)

Download files

Source Distribution

clustrx-1.0.0.tar.gz (16.8 kB), uploaded May 8, 2026, Source

Built Distribution

clustrx-1.0.0-py3-none-any.whl (13.1 kB), uploaded May 8, 2026, Python 3

File details

Details for the file clustrx-1.0.0.tar.gz.

File metadata

Download URL: clustrx-1.0.0.tar.gz
Upload date: May 8, 2026
Size: 16.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clustrx-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`a192c0f437ea5f08e4113b595f16e2eff2b8ee3db44f3529e62cc9eb3a283ca4`
MD5	`bbd56695a3de1fc5581608b9f651ec76`
BLAKE2b-256	`68ff2d2ab06f483185c923c41e97b95cf50a6d408ef0813aeeb6935020402498`

Provenance

The following attestation bundles were made for clustrx-1.0.0.tar.gz:

Publisher: publish.yml on mario-benitez-prian/clustrX

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: clustrx-1.0.0.tar.gz
- Subject digest: a192c0f437ea5f08e4113b595f16e2eff2b8ee3db44f3529e62cc9eb3a283ca4
- Sigstore transparency entry: 1474463573
- Sigstore integration time: May 8, 2026
Source repository:
- Permalink: mario-benitez-prian/clustrX@ef6f292c021241316d7f5accbbf4956dbcb49553
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/mario-benitez-prian
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ef6f292c021241316d7f5accbbf4956dbcb49553
- Trigger Event: push

File details

Details for the file clustrx-1.0.0-py3-none-any.whl.

File metadata

Download URL: clustrx-1.0.0-py3-none-any.whl
Upload date: May 8, 2026
Size: 13.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clustrx-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6c171440993c469c2a0aea4fb7e2b22d4d77fd803c9fb173d4909c5105e5092b`
MD5	`52436cfe404bede03eb69f99351a9cfc`
BLAKE2b-256	`a49b47ddc842d5bae75c56f90dc957b91dcd43dc0dd690bb991db32d1af589a9`

Provenance

The following attestation bundles were made for clustrx-1.0.0-py3-none-any.whl:

Publisher: publish.yml on mario-benitez-prian/clustrX

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: clustrx-1.0.0-py3-none-any.whl
- Subject digest: 6c171440993c469c2a0aea4fb7e2b22d4d77fd803c9fb173d4909c5105e5092b
- Sigstore transparency entry: 1474464154
- Sigstore integration time: May 8, 2026
Source repository:
- Permalink: mario-benitez-prian/clustrX@ef6f292c021241316d7f5accbbf4956dbcb49553
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/mario-benitez-prian
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ef6f292c021241316d7f5accbbf4956dbcb49553
- Trigger Event: push
