Cluster Multiple-Sequence Alignments (MSA) using DBSCAN

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

AFCluster-api

A Python API version of AF_Cluster by Wayment-Steele et al. (2023). Their original GitHub repository includes scripts to perform MSA clustering but no functional API that makes it easy to integrate the workflow into custom settings. This project adapts and refactors their original ClusterMSA.py script into an API format.

Installation

AFCluster-api can be installed via pip using

pip install afcluster

Usage

The core of the API is the AFCluster object, which brings the complete functionality of the package under one roof, including:

- performing DBSCAN with a fixed epsilon value
- performing a gridsearch for a suitable epsilon value
- writing a3m output files for identified clusters
- writing a cluster metadata table file (csv)
- plotting PCA or t-SNE for the clustering results
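
To make the first item in the list above concrete, here is a toy, pure-Python illustration of DBSCAN with a fixed epsilon. It is not the package's implementation: the Hamming distance, the `min_samples=3` default, and the function names `hamming` and `dbscan` are assumptions chosen only for this sketch.

```python
from collections import deque


def hamming(a: str, b: str) -> int:
    """Count differing positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def dbscan(seqs: list[str], eps: float, min_samples: int = 3) -> list[int]:
    """Return one cluster label per sequence; -1 marks noise."""
    n = len(seqs)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if hamming(seqs[i], seqs[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    visited = [False] * n
    cluster_id = 0
    for i in range(n):
        # Only unvisited core points can start a new cluster.
        if visited[i] or len(neighbours[i]) < min_samples:
            continue
        visited[i] = True
        labels[i] = cluster_id
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster_id  # core or border point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # expand only through core points
        cluster_id += 1
    return labels


toy_msa = ["AAAA", "AAAT", "AATA", "CCCC", "CCCG", "CCGC", "GTGT"]
toy_labels = dbscan(toy_msa, eps=1)
```

With `eps=1` the two groups of near-identical toy sequences form separate clusters and the outlier is labelled `-1`, the same noise label that appears in the `cluster_id` column of the package's output.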

The AFCluster class accepts sequence inputs as any of the following:

- a list of strings
- a pandas dataframe with a "sequence" column
- a pandas series of strings

In every case the first element is interpreted as the query sequence!

For example:

>>> from afcluster import AFCluster, read_a3m

# load an MSA into a pandas dataframe
>>> msa = read_a3m("tests/test.a3m")
>>> print(msa.head())
                                              header                                           sequence
0                                               >101  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...
1  >UniRef100_A0A964YKG2\t118\t0.907\t2.876E-28\t...  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...
2  >UniRef100_A0A177EKP9\t117\t0.894\t3.948E-28\t...  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...
3  >UniRef100_UPI002231809B\t117\t0.907\t3.948E-2...  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...
4  >UniRef100_A0A6C0JTL4\t117\t0.894\t5.421E-28\t...  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...


# now determine an epsilon value for clustering
# (using multiprocessing for speed)
>>> clusterer = AFCluster()
>>> eps = clusterer.gridsearch_eps(msa)
>>> print(f"determined {eps=}")
determined eps=8.0

# now we can cluster (the clusterer remembers the determined epsilon value)
# we also determine the consensus sequence for each cluster and compute
# levenshtein distances to the query and consensus sequences
# (reported as 1 - d, so identical sequences score 1.0 !!!)
>>> out_df = clusterer.cluster(msa, consensus_sequence=True, levenshtein=True)
>>> print(out_df.head())
                                            sequence  cluster_id  ... levenshtein_query levenshtein_consensus
0  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA...          -1  ...          1.000000              0.697368
1  MQVFIKTLTGKTITLDVEPSDTIESVKQKIQDKEGIPPDQQRLIFA...           0  ...          0.907895              0.907895
2  MQIFVKTLTGKTITLDVENSDTVDTVKTKIQDKEGIPPDQQRLIFA...           0  ...          0.894737              0.894737
3  MQIFVKTLTGKTVTLDVDPSDTIENVKAMIQDKEGIPPDQQRLIFA...           0  ...          0.907895              0.907895
4  MQIFVKTLTGKTITLEVDPSNTIETVKQMIQDKEGIPPDQQRLIYA...           0  ...          0.894737              0.894737

# now generate a PCA plot
>>> ax = clusterer.pca() # ax is a matplotlib Axes object

# and write a3m output files for each cluster to an output directory
>>> clusterer.write_a3m("clustered_results")
>>> clusterer.write_cluster_table("clustered_results/clusters.csv")
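
The levenshtein_query and levenshtein_consensus columns in the example above are reported as 1 - d. The sketch below shows one way such a score could be computed. Normalising the edit distance by the longer sequence length is an assumption of this sketch, not something the package documentation states, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Return 1 - d, where d is the edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are identical
    return 1.0 - levenshtein(a, b) / longest
```

Under this definition a sequence compared with itself scores 1.0, which matches the levenshtein_query value of the query row in the printed table.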

The PCA plot generated with clusterer.pca(). Unlike the original implementation, which highlights the first 10 clusters (the default), this version highlights the 10 largest clusters.
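
As a rough sketch of what picking the biggest clusters can look like, the snippet below ranks cluster labels by size. The helper name `largest_clusters` and the choice to skip the noise label -1 are assumptions made for illustration; the package may select clusters differently.

```python
from collections import Counter


def largest_clusters(labels, n=10, noise_label=-1):
    """Return the ids of the n most populated clusters, largest first."""
    # Assumption: noise points (label -1) are not counted as a cluster.
    counts = Counter(label for label in labels if label != noise_label)
    return [label for label, _ in counts.most_common(n)]


example_labels = [0, 0, 1, 1, 1, -1, -1, -1, -1, 2]
top_two = largest_clusters(example_labels, n=2)
```

Here cluster 1 (three members) ranks ahead of cluster 0 (two members), even though cluster 0 has the lower id and would come first under a "first n clusters" rule.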

Comparison to the Original

Because we refactored the code, we also benchmarked our implementation against Wayment-Steele et al.'s original version. For clustering followed by plotting, our refactored code base runs in roughly half the time the original implementation required. A workflow involving a gridsearch for a suitable epsilon value (search eps + cluster + pca) ran more than three times faster than the original ClusterMSA.py script.

method	task	time (s)
ours	cluster + pca	9.7
original	cluster + pca	17.26
ours	search eps + cluster + pca	29.65
original	search eps + cluster + pca	102.37
ours	cluster + pca + tsne	56.44
original	cluster + pca + tsne	129.55
ours	search eps + cluster + pca + tsne	98.22
original	search eps + cluster + pca + tsne	153.63

Computation times are averages over 5 repeats each. Performance measures were computed using the scripts in the support/performance directory on an M4 MacBook Pro (2024) with an MSA of 19K sequences.
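
The speedup factors implied by the table can be worked out directly from the reported averages. The snippet below only does that arithmetic on the numbers listed above; the variable names are illustrative.

```python
# (ours, original) average runtimes in seconds, copied from the table above
timings = {
    "cluster + pca": (9.7, 17.26),
    "search eps + cluster + pca": (29.65, 102.37),
    "cluster + pca + tsne": (56.44, 129.55),
    "search eps + cluster + pca + tsne": (98.22, 153.63),
}

# Speedup = original runtime divided by our runtime.
speedups = {task: original / ours for task, (ours, original) in timings.items()}

for task, factor in speedups.items():
    print(f"{task}: {factor:.2f}x faster")
```

The ratios range from about 1.6x for the full workflow with t-SNE to about 3.5x for the epsilon search with PCA only.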

Release history

- 0.1.2 (this version): May 2, 2025
- 0.1.1: Apr 29, 2025
- 0.1.0: Apr 29, 2025

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

afcluster-0.1.2-py3-none-any.whl (16.2 kB)

Uploaded May 2, 2025 Python 3

File details

Details for the file afcluster-0.1.2-py3-none-any.whl.

File metadata

Download URL: afcluster-0.1.2-py3-none-any.whl
Upload date: May 2, 2025
Size: 16.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for afcluster-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0dff5ad8412623b864302bff7ea13092dbf3030f59ee0b884f74b967169ff4c8`
MD5	`4f857863525a405e34065a45303f53bd`
BLAKE2b-256	`7fa1a6e35917834ada5057ba0c334166bf11d88378a1c2f9eb1188bc8daf6df2`
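
A downloaded wheel can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch: the helper names are hypothetical, and the file path passed in is wherever the wheel was saved locally.

```python
import hashlib

# SHA256 digest published above for afcluster-0.1.2-py3-none-any.whl
EXPECTED_SHA256 = "0dff5ad8412623b864302bff7ea13092dbf3030f59ee0b884f74b967169ff4c8"


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_hash("afcluster-0.1.2-py3-none-any.whl")` returns True only if the local file is byte-for-byte identical to the uploaded wheel.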
