Cluster Multiple-Sequence Alignments (MSA) using DBSCAN

AFCluster-api

A Python API version of AF_Cluster by Wayment-Steele et al. (2023). Their original GitHub repository includes scripts to perform MSA clustering but offers no functional API that makes it easy to integrate the workflow into custom settings. This project adapts and refactors their original ClusterMSA.py script into an API format.

Installation

The AFCluster-api package can be installed via pip using

pip install afcluster

Usage

The core of the API is the AFCluster object, which bundles the complete functionality of the package, including:

  • performing DBSCAN with a fixed epsilon value
  • performing gridsearch for a suitable epsilon value
  • writing a3m output files for identified clusters
  • writing a cluster metadata table file (csv)
  • plotting PCA or t-SNE for the clustering results
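To make the first two items concrete, here is a minimal pure-Python sketch of DBSCAN with a fixed epsilon, followed by a simple scan over candidate epsilon values. It is an illustration only and not the package's implementation: the Hamming distance between aligned sequences, `min_samples=3`, the helper names `dbscan` and `scan_eps`, and the rule "pick the epsilon that yields the most clusters" are all assumptions made for this example.

```python
from collections import deque

NOISE = -1  # label for sequences that belong to no cluster


def hamming(a, b):
    """Number of mismatching columns between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def dbscan(seqs, eps, min_samples=3):
    """Label each sequence with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(seqs)
    # A neighbourhood includes the point itself, as in the usual DBSCAN definition.
    neighbours = [
        [j for j in range(n) if hamming(seqs[i], seqs[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster_id
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing the cluster
        cluster_id += 1
    return labels


def scan_eps(seqs, eps_values, min_samples=3):
    """Return the eps that yields the most clusters (first one wins on ties)."""
    def n_clusters(eps):
        return len(set(dbscan(seqs, eps, min_samples)) - {NOISE})
    return max(eps_values, key=n_clusters)


# Hypothetical toy alignment: two tight groups and one outlier.
toy = [
    "AAAAAAAA", "AAAAAAAC", "AAAAAACA",
    "GGGGGGGG", "GGGGGGGT", "GGGGGGTG",
    "ACGTACGT",
]
labels = dbscan(toy, eps=2)
best_eps = scan_eps(toy, [0, 2, 8])
```

On the toy data a very small epsilon leaves every sequence as noise and a very large one merges everything into a single cluster, which is why an intermediate value has to be searched for.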

The AFCluster class accepts sequence inputs as either

  • a list of strings
  • a pandas dataframe with a "sequence" column
  • a pandas series of strings

In any case, the first element is interpreted as the query sequence!
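If the alignment is stored in an a3m file, a plain list of strings with the query first can also be built by hand. The sketch below is a standalone illustration and not the package's `read_a3m`: the helper names `parse_a3m` and `strip_insertions` are hypothetical, and dropping lowercase insertion characters is an assumption about how the sequences should be prepared for clustering.

```python
def parse_a3m(text):
    """Return (header, sequence) pairs from a3m-formatted text, in file order."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def strip_insertions(seq):
    """Drop lowercase insertion states so every sequence has the query's length."""
    return "".join(c for c in seq if not c.islower())


# Hypothetical miniature alignment; the first record is the query.
example = """>101
MQIFVK
>hit_1
MQ-FVK
>hit_2
MQIaFVK
"""

records = parse_a3m(example)
sequences = [strip_insertions(seq) for _, seq in records]
query = sequences[0]
```

The resulting `sequences` list has the shape of the first input option above, with the query sequence as its first element.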

For example:

>>> from afcluster import AFCluster, read_a3m

# load an MSA into a pandas dataframe
>>> msa = read_a3m("tests/test.a3m")
>>> print(msa.head())
                                              header                                           sequence
0                                               >101  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...
1  >UniRef100_A0A964YKG2\t118\t0.907\t2.876E-28\t...  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...
2  >UniRef100_A0A177EKP9\t117\t0.894\t3.948E-28\t...  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...
3  >UniRef100_UPI002231809B\t117\t0.907\t3.948E-2...  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...
4  >UniRef100_A0A6C0JTL4\t117\t0.894\t5.421E-28\t...  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...


# now determine an epsilon value for clustering
# (using multiprocessing for speed)
>>> clusterer = AFCluster()
>>> eps = clusterer.gridsearch_eps(msa)
>>> print(f"determined {eps=}")
determined eps=8.0

# now we can cluster (the clusterer remembers the determined epsilon value)
# we also determine the consensus sequence for each cluster and compute
# Levenshtein distances (reported as 1 - d, so identical sequences score 1.0) to the query and consensus sequences
>>> out_df = clusterer.cluster(msa, consensus_sequence=True, levenshtein=True)
>>> print(out_df.head())
                                            sequence  cluster_id  ... levenshtein_query levenshtein_consensus
0  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...          -1  ...          1.000000              0.697368
1  MQVFIKTLTGKTITLDVEPSDTIESVKQKIQDKEGIPPDQQRLIFA...           0  ...          0.907895              0.907895
2  MQIFVKTLTGKTITLDVENSDTVDTVKTKIQDKEGIPPDQQRLIFA...           0  ...          0.894737              0.894737
3  MQIFVKTLTGKTVTLDVDPSDTIENVKAMIQDKEGIPPDQQRLIFA...           0  ...          0.907895              0.907895
4  MQIFVKTLTGKTITLEVDPSNTIETVKQMIQDKEGIPPDQQRLIYA...           0  ...          0.894737              0.894737

# now generate a PCA plot
>>> ax = clusterer.pca() # ax is a matplotlib.Axes object

# and write a3m output files for each cluster to an output directory
>>> clusterer.write_a3m("clustered_results")
>>> clusterer.write_cluster_table("clustered_results/clusters.csv")
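The consensus and Levenshtein columns in the output above can be reproduced in spirit with a few lines of plain Python. This is a sketch under assumptions rather than the package's code: the function names are made up for illustration, the consensus is taken as the most common character per alignment column, and the distance d is assumed to be normalised by the longer sequence length before computing 1 - d.

```python
from collections import Counter


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """1 - d, with d the edit distance divided by the longer length (assumed)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def consensus(seqs):
    """Most common character in each column of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


# Hypothetical cluster of three short aligned sequences.
cluster = ["MQIFVK", "MQVFVK", "MQIFIK"]
cluster_consensus = consensus(cluster)
scores = [similarity(seq, cluster_consensus) for seq in cluster]
```

With this convention a sequence identical to the reference scores 1.0, which matches the value shown for the query sequence in the levenshtein_query column.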

The PCA plot generated with clusterer.pca(). Unlike the original implementation, we highlight the 10 (default) biggest clusters rather than simply the first 10.
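The difference between "first" and "biggest" clusters can be sketched as follows. The helper `largest_clusters` is hypothetical and only illustrates the selection step; excluding the noise label -1 from the ranking is an assumption.

```python
from collections import Counter


def largest_clusters(labels, n=10, noise_label=-1):
    """Return the ids of the n most populated clusters, largest first."""
    sizes = Counter(label for label in labels if label != noise_label)
    return [label for label, _ in sizes.most_common(n)]


# Hypothetical cluster labels: cluster 3 has four members, cluster 0 three,
# cluster 2 two, cluster 1 one, plus one noise sequence.
labels = [-1, 0, 0, 0, 1, 2, 2, 3, 3, 3, 3]

first_two = sorted(set(labels) - {-1})[:2]   # selection by cluster id
biggest_two = largest_clusters(labels, n=2)  # selection by cluster size
```

Here selecting by id would highlight clusters 0 and 1, while selecting by size highlights clusters 3 and 0.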

Comparison to the Original

Since we refactored the code, we also benchmarked the performance of our implementation against Wayment-Steele et al.'s original version. For clustering plus PCA, our refactored code base runs in roughly half the time the original implementation required (9.7 s versus 17.26 s). A workflow involving a gridsearch for a suitable epsilon value ran about 3.5 times faster than the original ClusterMSA.py script (29.65 s versus 102.37 s).

method    task                                time (s)
ours      cluster + pca                            9.7
original  cluster + pca                          17.26
ours      search eps + cluster + pca             29.65
original  search eps + cluster + pca            102.37
ours      cluster + pca + tsne                   56.44
original  cluster + pca + tsne                  129.55
ours      search eps + cluster + pca + tsne      98.22
original  search eps + cluster + pca + tsne     153.63

Computation times are averages over 5 repeats each. Performance measures were computed using the scripts in the support/performance directory on an M4 MacBook Pro (2024), with an MSA of 19K sequences.
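The speedup factors implied by the table can be recomputed directly from the reported averages. The snippet only restates the numbers above; the dictionary layout and variable names are chosen for illustration.

```python
# (ours, original) average run times in seconds, copied from the table above.
timings = {
    "cluster + pca": (9.7, 17.26),
    "search eps + cluster + pca": (29.65, 102.37),
    "cluster + pca + tsne": (56.44, 129.55),
    "search eps + cluster + pca + tsne": (98.22, 153.63),
}

# Speedup = original time divided by our time.
speedups = {task: original / ours for task, (ours, original) in timings.items()}

for task, factor in speedups.items():
    print(f"{task}: {factor:.2f}x faster")
```

The factor is largest for the workflow with an epsilon gridsearch and smallest when t-SNE is included.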
