Skip to main content

Performs clustering of molecular dynamics and Monte Carlo trajectories.

Project description

clusttraj - Solvent-Informed Clustering of Trajectories with Python

clusttraj logo

License: GPL v3 build docs codecov PyPI version DOI


This Python package receives a molecular dynamics or Monte Carlo trajectory (in .pdb, .xyz or any format supported by OpenBabel), finds the minimum RMSD between the structures with label reordering and optimal alignment, and performs agglomerative clustering (a kind of unsupervised machine learning) to classify similar conformations.

What the script does is to calculate the distance (using the minimum RMSD) between each configuration of the trajectory, building a RMSD matrix (stored in the condensed form). Different strategies can be used in order to compute distances that correspond to the expected minimum RMSD, such as atom reordering or stepwise alignments. Notice that calculating the RMSD matrix might take some time depending on how long your trajectories are and how many atoms there are in each configuration. The RMSD matrix can also be read from a file (with the -i option) to avoid recalculating it every time you want to change the linkage method (with-m) or distance of the clustering.

Installation

The following libraries are used by clusttraj:

We also have qmllib as an optional dependency as one of the reordering algorithms.

OpenBabel is a runtime dependency, but it is not installed by default when installing clusttraj with pip. If you use Conda, install OpenBabel from conda-forge before installing clusttraj:

conda install -c conda-forge openbabel
pip install clusttraj

For pip-only environments, clusttraj provides an optional dependency that installs the openbabel-wheel package:

pip install "clusttraj[openbabel]"

Avoid mixing Conda OpenBabel and openbabel-wheel in the same environment. If you see OpenBabel import or linker errors, remove one provider and reinstall OpenBabel from the package manager used by that environment.

You can install clusttraj using pip

pip install clusttraj

If you want to use the qmllib reordering algorithm, you can install it with:

pip install clusttraj[qml]

Citation

If you use clusttraj in your academic work, please cite:

Rafael Bicudo Ribeiro and Henrique Musseli Cezar
"clusttraj: A Solvent-Informed Clustering Tool for Molecular Modeling"
Journal of Chemical Theory and Computation, 21, 6759–6768, 2025.
https://pubs.acs.org/doi/10.1021/acs.jctc.5c00634

Usage

To see all the options run the script with the -h command option:

clusttraj -h

or

python -m clusttraj -h

The mandatory arguments are the path to the file containing the trajectory (in a format that OpenBabel can read with Pybel), and one clustering criterion: the maximum RMSD to join two configurations (-rmsd), the silhouette score option (-ss), or the number of clusters to generate (-nc/--n-clusters).

clusttraj trajectory.xyz -rmsd 1.0

To cut the dendrogram to a fixed number of clusters, use --n-clusters:

clusttraj trajectory.xyz --n-clusters 5

or

clusttraj trajectory.xyz -ss

Additional options are available for specifying the input and output files and selecting how the clustering is done. The possible methods used for the agglomerative clustering are the ones available in the linkage method of SciPy's hierarchical clustering. A list with the possible methods (selected with -m) and the description of each of them can be found here.

The default method for the linkage is average, since it was found to have a good compromise with the number of clusters and the actual similarity. To learn more about how the clustering is performed using this algorithm, see UPGMA.

If the -n option is used, the hydrogens are ignored when performing the Kabsch algorithm to find the superposition and calculating the RMSD. This is useful to avoid clustering identical structures with just a methyl group rotated as different.

The -e or --reorder option, tries to reorder the atoms to increase the overlap and reduce the RMSD. The algorithm can be selected with --reorder-alg, between qml (default), hungarian, brute or distance. For more information about the implementation, see the RMSD package. The reorder option can be used together with the -ns option, that receives an integer with the number of atoms of the solute. When the -ns option is used, the script will first superpose the configurations considering only the solute atoms and then reorder considering only the solvent atoms (the atoms in the input that are after the ns atoms). For solute-solvent systems, the use of -ns is strongly encouraged.

To use an already saved RMSD matrix, specify the file containing the RMSD matrix in the condensed form with the -i option. The options -i and -od are mutually exclusive.

The -p flag specifies that pdf plots of some information will be saved. In this case, the filenames will start with the same name used for the clusters output (specified with the -oc option). When the option is used, the following is saved to disk:

  • A plot with the multidimensional scaling representation of the RMSD matrix, colored with the clustering information
  • The dendrogram
  • The cluster classification evolution, that shows how during the trajectory, the configurations were classificated. This might be useful to analyze the quality of your sampling.

If the -cc option is specified (along with a format supported by OpenBabel) the configurations belonging to the same cluster are superposed and printed to a file. The superpositions are done considering the medoid of the cluster as reference. The medoid is printed as the first structure in the clustered strcuture files. If you did not consider the hydrogens while building the RMSD matrix, remember to use the -n option even if with -i in this case, since the superposition is done considering the flag.

The -mc option (along with a format supported by OpenBabel) saves the superposed medoid configuration for all the clusters in a single file. The first medoid is used as the reference configuration and the remaining medoids are superposed to it.

Threading and parallelization

The -np option specified the number of processes to be used to calculate the RMSD matrix. Since this is the most time consuming task of the clustering, and due to being a embarassingly parallel problem, it was parallelized using a Python multiprocessing pool. The default value for -np is 4.

When using -np make sure you also set the correct number of threads for numpy. If you want to use just the multiprocessing parallelization (recommended) use the following bash commands to set the number of numpy threads to one:

export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1

Output

The logging is done both to stdout and to the file clusttraj.log. The number of clusters that were found, as well as the number of members for each cluster are printed in a table. Below there is an example of how this information is printed:

$ clusttraj trajectory.xyz -rmsd 3.2 -np 4 -p -n -cc xyz
2024-12-12 17:48:19,268 INFO     [distmat.py:34] <get_distmat> Calculating RMSD matrix using 4 threads

2024-12-12 17:48:23,800 INFO     [distmat.py:38] <get_distmat> Saving condensed RMSD matrix to distmat.npy

2024-12-12 17:48:23,801 INFO     [classify.py:97] <classify_structures> Clustering using 'average' method to join the clusters

2024-12-12 17:48:23,803 INFO     [classify.py:105] <classify_structures> Saving clustering classification to clusters.dat

2024-12-12 17:48:23,804 INFO     [main.py:59] <main> Writing superposed configurations per cluster to files clusters_confs_*.xyz

2024-12-12 17:48:26,729 INFO     [main.py:102] <main> A total 100 snapshots were read and 7 cluster(s) was(were) found.
The cluster sizes are:
Cluster	Size
1	3
2	3
3	31
4	30
5	18
6	3
7	12

2024-12-12 17:48:26,729 INFO     [main.py:126] <main> Total wall time: 7.462641 s

In the cluster output file (-oc option, default filename clusters.dat) the classification for each structure in the trajectory is printed. For example, if the first structure of the trajectory belongs to the cluster number 2, the second structure belongs to cluster 1, the third to cluster 2 and so on, the file clusters.dat will start with

$ head clusters.dat
7
4
5
3
4
7
6
7
4
3

The plot of the multidimensional representation (when the -p option is used) have each cluster colored in one color as the following picture: Example MDS

The dendrogram has an horizontal line plotted with it indicating the cutoff used for defining the clusters: Example dendrogram

The evolution of the classification with the trajectory looks like: Example evolution

If you wish to use the RMSD matrix file to other uses, bear in mind that the matrix is stored in the condensed form, i.e., only the superior diagonal matrix is printed (not including the diagonal) in NumPy's .npy format. It means that if you have N structures in your trajectory, your file (specified with -od option, default filename distmat.npy) will have N(N-1)/2 lines, with each line representing a distance.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

clusttraj-1.2.0.tar.gz (65.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

clusttraj-1.2.0-py3-none-any.whl (49.3 kB view details)

Uploaded Python 3

File details

Details for the file clusttraj-1.2.0.tar.gz.

File metadata

  • Download URL: clusttraj-1.2.0.tar.gz
  • Upload date:
  • Size: 65.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clusttraj-1.2.0.tar.gz
Algorithm Hash digest
SHA256 efc3c95efca51bc67aedd3d0883a475fc519a3419cb24fd056b9c6db0ac11850
MD5 532d7dd2871491ede68d9c9e60b4139d
BLAKE2b-256 0f085f8c888f1999405999ce6d97bf38a99305ad3c7dfceac719372e478de85c

See more details on using hashes here.

Provenance

The following attestation bundles were made for clusttraj-1.2.0.tar.gz:

Publisher: release.yml on hmcezar/clusttraj

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file clusttraj-1.2.0-py3-none-any.whl.

File metadata

  • Download URL: clusttraj-1.2.0-py3-none-any.whl
  • Upload date:
  • Size: 49.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clusttraj-1.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a4ca1ed9c5feb111f687618f3156055bf975a3d71567bbb9cee4ec39c5054f28
MD5 e458f7043ef3218cdbaede16cbd79eb2
BLAKE2b-256 c790e596e42472a9815d091088b3c5747460cc2f038ecabedd4c1d1bdb31885c

See more details on using hashes here.

Provenance

The following attestation bundles were made for clusttraj-1.2.0-py3-none-any.whl:

Publisher: release.yml on hmcezar/clusttraj

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page