Python tools for protein sequence clustering and dataset splitting

protclust

A Python library for working with protein sequence data, providing:

Clustering capabilities via MMseqs2
Machine learning dataset creation with cluster-aware splits
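
To illustrate what a cluster-aware split means, the following is a minimal standard-library sketch, not protclust's own algorithm. The function name `group_split` and the greedy fill strategy are assumptions made for illustration; the point is only that all members of a cluster end up on the same side of the split.

```python
import random
from collections import defaultdict


def group_split(cluster_ids, test_size=0.2, random_state=None):
    """Split row indices so that every cluster lands entirely in train or test."""
    # Collect the row indices that belong to each cluster label.
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    # Sort first so the shuffle is reproducible for a given random_state.
    clusters = sorted(members)
    random.Random(random_state).shuffle(clusters)

    target = test_size * len(cluster_ids)
    train, test = [], []
    for cluster in clusters:
        # Fill the test set with whole clusters until the target size is reached.
        if len(test) < target:
            test.extend(members[cluster])
        else:
            train.extend(members[cluster])
    return sorted(train), sorted(test)


# Hypothetical cluster labels, one per sequence.
clusters = ["seq1", "seq1", "seq3", "seq3", "seq5", "seq5", "seq5", "seq8"]
train_idx, test_idx = group_split(clusters, test_size=0.25, random_state=0)
train_clusters = {clusters[i] for i in train_idx}
test_clusters = {clusters[i] for i in test_idx}
print(train_clusters & test_clusters)  # set(): no cluster is shared
```

Because whole clusters are moved at once, the achieved test fraction can overshoot the requested one, which is why a split of this kind is usually paired with a tolerance on the split sizes.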

Requirements

This library requires MMseqs2, which must be installed and accessible via the command line. MMseqs2 can be installed using one of the following methods:

Installation Options for MMseqs2

Homebrew:
```
brew install mmseqs2
```

Conda:
```
conda install -c conda-forge -c bioconda mmseqs2
```

Docker:
```
docker pull ghcr.io/soedinglab/mmseqs2
```

Static Build (AVX2, SSE4.1, or SSE2):
```
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar xvfz mmseqs-linux-avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
```

MMseqs2 must be accessible via the mmseqs command in your system's PATH. If the library cannot detect MMseqs2, it will raise an error.
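
Before running any clustering, it can help to confirm that the `mmseqs` command is actually visible from Python. The sketch below uses only the standard library; the helper name `find_mmseqs` and its error message are hypothetical and do not reflect how protclust performs its own check.

```python
import shutil


def find_mmseqs(executable: str = "mmseqs") -> str:
    """Return the full path to the MMseqs2 binary, or raise if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        raise RuntimeError(
            f"'{executable}' was not found on PATH; install MMseqs2 first."
        )
    return path


if __name__ == "__main__":
    try:
        print("MMseqs2 found at:", find_mmseqs())
    except RuntimeError as exc:
        print(exc)
```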

Installation

You can install protclust using pip:

```
pip install protclust
```

To install from source, clone the repository and run the following from its root directory:

```
pip install -e .
```

For development, also install the testing and linting dependencies:

```
pip install pytest pytest-cov pre-commit ruff
```

Features

Sequence Clustering and Dataset Creation

```
import pandas as pd
from protclust import clean, cluster, split, set_verbosity

# Enable detailed logging (optional)
set_verbosity(verbose=True)

# Example data
df = pd.DataFrame({
    "id": ["seq1", "seq2", "seq3", "seq4"],
    "sequence": ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY", "MNPQRSTVWY"]
})

# Clean data
clean_df = clean(df, sequence_col="sequence")

# Cluster sequences
clustered_df = cluster(clean_df, sequence_col="sequence", id_col="id")

# Split data into train and test sets
train_df, test_df = split(clustered_df, group_col="cluster_representative", test_size=0.3)

print("Train set:\n", train_df)
print("Test set:\n", test_df)

# MILP-based splitting with property balancing
from protclust import milp_split
train_df, test_df = milp_split(
    clustered_df,
    group_col="cluster_representative",
    test_size=0.3,
    balance_cols=["molecular_weight", "hydrophobicity"]
)
```
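
After a split, a quick sanity check is to confirm that no cluster appears in both sets and to see how close the test fraction came to the request. The helper below is a hypothetical sketch that works on plain lists of cluster labels; `summarize_split` is not part of protclust.

```python
def summarize_split(train_groups, test_groups):
    """Return the clusters present in both sets and the achieved test fraction."""
    train_groups, test_groups = list(train_groups), list(test_groups)
    leaked = set(train_groups) & set(test_groups)
    total = len(train_groups) + len(test_groups)
    test_fraction = len(test_groups) / total if total else 0.0
    return leaked, test_fraction


# Stand-in labels; with the example above one would pass
# train_df["cluster_representative"] and test_df["cluster_representative"].
leaked, fraction = summarize_split(["seq1", "seq1"], ["seq3", "seq3"])
print(leaked, fraction)  # set() 0.5
```

An empty `leaked` set indicates that the split kept every cluster on one side only.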

Parameters

Common parameters for the clustering and splitting functions:

df: Pandas DataFrame containing sequence data
sequence_col: Column name containing sequences
id_col: Column name containing unique identifiers
min_seq_id: Minimum sequence identity threshold (0.0-1.0, default 0.3)
coverage: Minimum alignment coverage (0.0-1.0, default 0.5)
cov_mode: Coverage mode (0-3, default 0)
cluster_mode: Clustering algorithm (0: Set-Cover, 1: Connected component, 2: Greedy by length, default 0)
cluster_steps: Number of cascaded clustering steps for large datasets (default 1)
test_size: Desired fraction of data in test set (default 0.2)
random_state: Random seed for reproducibility
tolerance: Acceptable deviation from desired split sizes (default 0.05)
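
Several of the clustering parameters share their names with MMseqs2 command-line options. As a rough sketch of that correspondence, the hypothetical helper below assembles an `mmseqs easy-cluster` argument list without running it. Whether protclust calls `easy-cluster` or passes exactly these flags internally is an assumption; the flags themselves (`--min-seq-id`, `-c`, `--cov-mode`, `--cluster-mode`) are standard MMseqs2 options.

```python
def build_mmseqs_command(
    fasta_path,
    output_prefix,
    tmp_dir,
    min_seq_id=0.3,
    coverage=0.5,
    cov_mode=0,
    cluster_mode=0,
):
    """Assemble an `mmseqs easy-cluster` argument list from the listed parameters."""
    # Both thresholds are fractions, so reject values outside 0.0-1.0 early.
    if not 0.0 <= min_seq_id <= 1.0:
        raise ValueError("min_seq_id must be between 0.0 and 1.0")
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0.0 and 1.0")
    return [
        "mmseqs", "easy-cluster", str(fasta_path), str(output_prefix), str(tmp_dir),
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--cov-mode", str(cov_mode),
        "--cluster-mode", str(cluster_mode),
    ]


print(" ".join(build_mmseqs_command("input.fasta", "clusters", "tmp")))
# mmseqs easy-cluster input.fasta clusters tmp --min-seq-id 0.3 -c 0.5 --cov-mode 0 --cluster-mode 0
```

The returned list could be handed to `subprocess.run` once MMseqs2 is installed; the file names used here are placeholders.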

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Run tests (pytest tests/)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use protclust in your research, please cite:

```
@software{protclust,
  author = {Michael Scutari},
  title = {protclust: Protein Sequence Clustering and ML Dataset Creation},
  url = {https://github.com/michaelscutari/protclust},
  version = {0.2.0},
  year = {2025},
}
```

Release history

0.2.0 (this version): Sep 9, 2025
0.1.5.post1: Mar 21, 2025
0.1.5: Mar 21, 2025
0.1.4.post1: Mar 20, 2025
0.1.4: Mar 20, 2025
0.1.3: Mar 19, 2025
0.1.2: Mar 19, 2025
0.1.1: Mar 19, 2025
0.1.0: Mar 17, 2025

Download files

Source Distribution

protclust-0.2.0.tar.gz (37.0 kB)
Upload date: Sep 9, 2025
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

Hashes for protclust-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`230e305b7336015a9db9c795d4602334571bc5cff6c92504e70e72f383a13379`
MD5	`c8f1a552627a86a953e1d634a582e43a`
BLAKE2b-256	`14f4244e75efcc50983b9ad27c86d5f03a9d7ff72344dc8c7e741dd3ffa41f97`

Built Distribution

protclust-0.2.0-py3-none-any.whl (14.2 kB)
Upload date: Sep 9, 2025
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

Hashes for protclust-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5d0a91ccad77a5dfaf2258c40448abf4b8dc0cf14946b616bb8356de89fad36f`
MD5	`1ce106cee8421ad9b12c691df64e32a5`
BLAKE2b-256	`b80d2e89ec6371d9251f1f2a03dfff633abbdf2301d6a3b3939328a46d2e2517`
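
The published SHA256 digests can be compared against a locally downloaded file. The following is a small standard-library sketch; the helper names are hypothetical, and the file is assumed to sit in the current directory under its published name.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True when the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    archive = "protclust-0.2.0.tar.gz"  # assumed local download location
    published = "230e305b7336015a9db9c795d4602334571bc5cff6c92504e70e72f383a13379"
    if os.path.exists(archive):
        print("hash matches:", matches_published_hash(archive, published))
    else:
        print(f"{archive} not found; download it first")
```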
