Python tools for protein sequence clustering and dataset splitting
protclust
A Python library for working with protein sequence data, providing:
- Clustering capabilities via MMseqs2
- Machine learning dataset creation with cluster-aware splits
Requirements
This library requires MMseqs2, which must be installed and accessible via the command line. MMseqs2 can be installed using one of the following methods:
Installation Options for MMseqs2
- Homebrew:
  brew install mmseqs2
- Conda:
  conda install -c conda-forge -c bioconda mmseqs2
- Docker:
  docker pull ghcr.io/soedinglab/mmseqs2
- Static Build (AVX2, SSE4.1, or SSE2):
  wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
  tar xvfz mmseqs-linux-avx2.tar.gz
  export PATH=$(pwd)/mmseqs/bin/:$PATH
MMseqs2 must be accessible via the mmseqs command in your system's PATH. If the library cannot detect MMseqs2, it will raise an error.
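To confirm that the mmseqs command is visible before running any clustering, a small pre-flight check can help. The sketch below uses only the Python standard library; the helper name find_mmseqs is hypothetical and is not part of protclust, whose own detection logic may differ.

```python
import shutil


def find_mmseqs(binary: str = "mmseqs") -> str:
    """Return the full path to the MMseqs2 executable, or raise if it is missing.

    Hypothetical helper for illustration; not part of the protclust API.
    """
    path = shutil.which(binary)  # searches the directories listed in PATH
    if path is None:
        raise RuntimeError(
            f"'{binary}' was not found on PATH; install MMseqs2 or adjust PATH."
        )
    return path
```

Calling find_mmseqs() returns the resolved path when MMseqs2 is installed and raises a RuntimeError otherwise, which makes a missing installation obvious before any data is processed.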
Installation
You can install protclust using pip:
pip install protclust
Or if installing from source, clone the repository and run:
pip install -e .
For development purposes, also install the testing dependencies:
pip install pytest pytest-cov pre-commit ruff
Features
Sequence Clustering and Dataset Creation
import pandas as pd
from protclust import clean, cluster, split, set_verbosity
# Enable detailed logging (optional)
set_verbosity(verbose=True)
# Example data
df = pd.DataFrame({
    "id": ["seq1", "seq2", "seq3", "seq4"],
    "sequence": ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY", "MNPQRSTVWY"],
})
# Clean data
clean_df = clean(df, sequence_col="sequence")
# Cluster sequences
clustered_df = cluster(clean_df, sequence_col="sequence", id_col="id")
# Split data into train and test sets
train_df, test_df = split(clustered_df, group_col="cluster_representative", test_size=0.3)
print("Train set:\n", train_df)
print("Test set:\n", test_df)
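# MILP-based splitting with property balancing
# Note: this presumably requires clustered_df to already contain the numeric
# columns named in balance_cols; the example DataFrame above does not define them.
from protclust import milp_split
train_df, test_df = milp_split(
    clustered_df,
    group_col="cluster_representative",
    test_size=0.3,
    balance_cols=["molecular_weight", "hydrophobicity"],
)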
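The idea behind a cluster-aware split is that every sequence in a cluster lands on the same side, so near-identical sequences never appear in both train and test. The following is a minimal sketch of that idea in plain Python. It is not protclust's implementation: the function group_split, the greedy assignment rule, and the fifth row (seq5) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def group_split(rows, group_key, test_size=0.3, seed=0):
    """Assign whole groups to the test set without exceeding the target size.

    Illustrative only; protclust's split() may use a different strategy.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    keys = sorted(groups)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(keys)

    target = test_size * len(rows)
    train, test = [], []
    for key in keys:
        # A group goes to test only if it fits under the target; otherwise train.
        if len(test) + len(groups[key]) <= target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return train, test


# Made-up rows: two clusters of two sequences and one singleton.
rows = [
    {"id": "seq1", "cluster_representative": "seq1"},
    {"id": "seq2", "cluster_representative": "seq1"},
    {"id": "seq3", "cluster_representative": "seq3"},
    {"id": "seq4", "cluster_representative": "seq3"},
    {"id": "seq5", "cluster_representative": "seq5"},
]
train, test = group_split(rows, "cluster_representative", test_size=0.4)

train_groups = {r["cluster_representative"] for r in train}
test_groups = {r["cluster_representative"] for r in test}
print("shared clusters:", train_groups & test_groups)
```

Because groups are moved as units, the achieved test fraction can differ from the requested one, which is why a tolerance on split size is useful.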
Parameters
Common parameters for clustering functions:
- df: Pandas DataFrame containing sequence data
- sequence_col: Column name containing sequences
- id_col: Column name containing unique identifiers
- min_seq_id: Minimum sequence identity threshold (0.0-1.0, default 0.3)
- coverage: Minimum alignment coverage (0.0-1.0, default 0.5)
- cov_mode: Coverage mode (0-3, default 0)
- cluster_mode: Clustering algorithm (0: Set-Cover, 1: Connected component, 2: Greedy by length, default 0)
- cluster_steps: Number of cascaded clustering steps for large datasets (default 1)
- test_size: Desired fraction of data in test set (default 0.2)
- random_state: Random seed for reproducibility
- tolerance: Acceptable deviation from desired split sizes (default 0.05)
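Several of these parameters correspond to MMseqs2 command-line options. The sketch below shows how min_seq_id, coverage, cov_mode, and cluster_mode could be translated into an mmseqs easy-cluster invocation. It is an assumption for illustration: the helper mmseqs_cluster_command is hypothetical, the file names are placeholders, and protclust may call a different MMseqs2 workflow or pass additional options internally.

```python
def mmseqs_cluster_command(fasta, out_prefix, tmp_dir,
                           min_seq_id=0.3, coverage=0.5,
                           cov_mode=0, cluster_mode=0):
    """Build an argument list for 'mmseqs easy-cluster' from README-style parameters.

    Hypothetical helper; not part of the protclust API.
    """
    if not 0.0 <= min_seq_id <= 1.0:
        raise ValueError("min_seq_id must be between 0.0 and 1.0")
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0.0 and 1.0")
    if cov_mode not in (0, 1, 2, 3):
        raise ValueError("cov_mode must be 0, 1, 2, or 3")

    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),   # minimum sequence identity
        "-c", str(coverage),               # minimum alignment coverage
        "--cov-mode", str(cov_mode),       # how coverage is measured
        "--cluster-mode", str(cluster_mode),
    ]


# Placeholder file names, defaults as listed above.
cmd = mmseqs_cluster_command("input.fasta", "clusters", "tmp")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run once MMseqs2 is installed; building the list separately keeps the parameter validation testable without running the tool.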
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Run tests (pytest tests/)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use protclust in your research, please cite:
@software{protclust,
author = {Michael Scutari},
title = {protclust: Protein Sequence Clustering and ML Dataset Creation},
url = {https://github.com/michaelscutari/protclust},
version = {0.2.0},
year = {2025},
}