Python tools for protein sequence clustering and dataset splitting

protclust

A Python library for working with protein sequence data, providing:

  • Clustering capabilities via MMseqs2
  • Machine learning dataset creation with cluster-aware splits

Requirements

This library requires MMseqs2, which must be installed and accessible via the command line. MMseqs2 can be installed using one of the following methods:

Installation Options for MMseqs2

  • Homebrew:

    brew install mmseqs2
    
  • Conda:

    conda install -c conda-forge -c bioconda mmseqs2
    
  • Docker:

    docker pull ghcr.io/soedinglab/mmseqs2
    
  • Static build for Linux (the AVX2 build is shown; SSE4.1 and SSE2 builds are also offered):

    wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
    tar xvfz mmseqs-linux-avx2.tar.gz
    export PATH=$(pwd)/mmseqs/bin/:$PATH
    

The mmseqs executable must be on your system's PATH. If protclust cannot find it, it raises an error.
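To confirm the requirement above before calling protclust, you can check PATH yourself. The following is a minimal sketch using only the Python standard library; the helper name mmseqs_available is hypothetical and is not part of protclust, and protclust's own detection logic may differ.

```python
import shutil


def mmseqs_available(binary: str = "mmseqs") -> bool:
    """Return True if an executable with this name is found on PATH.

    Hypothetical helper for illustration; not part of protclust.
    """
    return shutil.which(binary) is not None


if mmseqs_available():
    print("MMseqs2 found at:", shutil.which("mmseqs"))
else:
    print("MMseqs2 not found; install it with one of the methods above.")
```

If the check fails after a static-build install, make sure the export PATH line was run in the same shell session, or add it to your shell profile.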

Installation

You can install protclust using pip:

pip install protclust

To install from source, clone the repository and run the following from its root directory:

pip install -e .

For development, also install the testing and linting tools:

pip install pytest pytest-cov pre-commit ruff

Features

Sequence Clustering and Dataset Creation

import pandas as pd
from protclust import clean, cluster, split, set_verbosity

# Enable detailed logging (optional)
set_verbosity(verbose=True)

# Example data
df = pd.DataFrame({
    "id": ["seq1", "seq2", "seq3", "seq4"],
    "sequence": ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY", "MNPQRSTVWY"]
})

# Clean data
clean_df = clean(df, sequence_col="sequence")

# Cluster sequences
clustered_df = cluster(clean_df, sequence_col="sequence", id_col="id")

# Split data into train and test sets
train_df, test_df = split(clustered_df, group_col="cluster_representative", test_size=0.3)

print("Train set:\n", train_df)
print("Test set:\n", test_df)

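# MILP-based splitting with property balancing.
# The columns listed in balance_cols are not part of the example data above;
# add them to clustered_df before running this call.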
from protclust import milp_split
train_df, test_df = milp_split(
    clustered_df,
    group_col="cluster_representative",
    test_size=0.3,
    balance_cols=["molecular_weight", "hydrophobicity"]
)
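The idea behind a cluster-aware split is that every member of a cluster ends up on the same side, so near-identical sequences never leak from train into test. The sketch below illustrates that idea in plain Python under stated assumptions: group_split is a hypothetical helper, the cluster labels are made up for the example, and this greedy assignment is not protclust's actual algorithm.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_size=0.3, random_state=0):
    """Split records so that each group lands entirely in train or in test.

    Hypothetical illustration; not protclust's implementation.
    """
    # Collect the records belonging to each group (cluster).
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    # Shuffle group names reproducibly.
    names = sorted(groups)
    random.Random(random_state).shuffle(names)

    target = test_size * len(records)
    train, test = [], []
    for name in names:
        members = groups[name]
        # Move a whole group to test only while test stays at or under target.
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return train, test


# Made-up cluster labels: identical sequences share a representative.
records = [
    {"id": "seq1", "cluster_representative": "seq1"},
    {"id": "seq2", "cluster_representative": "seq1"},
    {"id": "seq3", "cluster_representative": "seq3"},
    {"id": "seq4", "cluster_representative": "seq3"},
]
train, test = group_split(records, "cluster_representative", test_size=0.5)
train_groups = {r["cluster_representative"] for r in train}
test_groups = {r["cluster_representative"] for r in test}
print("Shared clusters:", train_groups & test_groups)
```

Because whole clusters move together, the achieved test fraction can differ from test_size. With only two clusters of two sequences each, a request for 0.3 cannot be met closely, which is the kind of deviation a tolerance setting is meant to bound.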

Parameters

Common parameters for the clustering and splitting functions:

  • df: Pandas DataFrame containing sequence data
  • sequence_col: Column name containing sequences
  • id_col: Column name containing unique identifiers
  • min_seq_id: Minimum sequence identity threshold (0.0-1.0, default 0.3)
  • coverage: Minimum alignment coverage (0.0-1.0, default 0.5)
  • cov_mode: Coverage mode (0-3, default 0)
  • cluster_mode: Clustering algorithm (0: Set-Cover, 1: Connected component, 2: Greedy by length, default 0)
  • cluster_steps: Number of cascaded clustering steps for large datasets (default 1)
  • test_size: Desired fraction of data in test set (default 0.2)
  • random_state: Random seed for reproducibility
  • tolerance: Acceptable deviation from desired split sizes (default 0.05)
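Several of the clustering parameters above share names with MMseqs2 command-line options. As a rough sketch, assuming they map one-to-one onto the flags of mmseqs easy-cluster (this README does not say how protclust invokes MMseqs2), an equivalent command line could be assembled as follows. The function build_mmseqs_args and the file names are hypothetical; cluster_steps is left out of the sketch.

```python
def build_mmseqs_args(
    fasta,
    out_prefix,
    tmp_dir,
    min_seq_id=0.3,
    coverage=0.5,
    cov_mode=0,
    cluster_mode=0,
):
    """Build an argument list for `mmseqs easy-cluster`.

    Hypothetical helper; the mapping from protclust parameters to
    MMseqs2 flags is an assumption made for illustration.
    """
    # Both thresholds are fractions, so reject values outside 0.0-1.0.
    for name, value in (("min_seq_id", min_seq_id), ("coverage", coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")

    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--cov-mode", str(cov_mode),
        "--cluster-mode", str(cluster_mode),
    ]


args = build_mmseqs_args("sequences.fasta", "clusters", "tmp")
print(" ".join(args))
```

The list form can be passed directly to subprocess.run without a shell, which avoids quoting problems with file paths.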

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Run tests (pytest tests/)
  4. Commit your changes (git commit -m 'Add some amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use protclust in your research, please cite:

@software{protclust,
  author = {Michael Scutari},
  title = {protclust: Protein Sequence Clustering and ML Dataset Creation},
  url = {https://github.com/michaelscutari/protclust},
  version = {0.2.0},
  year = {2025},
}
