Performs clustering of molecular dynamics and Monte Carlo trajectories.

Maintainers

hmcezar

Project description

clusttraj - Solvent-Informed Clustering of Trajectories with Python

This Python package receives a molecular dynamics or Monte Carlo trajectory (in .pdb, .xyz or any format supported by OpenBabel), finds the minimum RMSD between the structures with label reordering and optimal alignment, and performs agglomerative clustering (a kind of unsupervised machine learning) to classify similar conformations.

The script calculates the distance (the minimum RMSD) between every pair of configurations in the trajectory, building an RMSD matrix that is stored in the condensed form. Different strategies can be used to compute distances that correspond to the expected minimum RMSD, such as atom reordering or stepwise alignments. Notice that calculating the RMSD matrix might take some time, depending on how long your trajectories are and how many atoms there are in each configuration. The RMSD matrix can also be read from a file (with the -i option) to avoid recalculating it every time you want to change the linkage method (with -m) or the distance used for the clustering.
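The idea of a condensed minimum-RMSD matrix can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used by clusttraj: the function names kabsch_rmsd and condensed_rmsd_matrix are hypothetical, atom reordering is left out, and every frame is assumed to be an (n_atoms, 3) array with atoms in the same order.

```python
import itertools

import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (n_atoms, 3) arrays after optimal superposition."""
    # Remove the translation by centering both structures at the origin.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Kabsch: the SVD of the covariance matrix yields the optimal rotation.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return float(np.sqrt(((P @ R - Q) ** 2).sum() / len(P)))


def condensed_rmsd_matrix(frames):
    """Distances for all pairs i < j, in the condensed order SciPy expects."""
    pairs = itertools.combinations(range(len(frames)), 2)
    return np.array([kabsch_rmsd(frames[i], frames[j]) for i, j in pairs])
```

For N frames the returned array has N(N-1)/2 entries, which is why the cost grows quickly with the trajectory length.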

Installation

The following libraries are used by clusttraj:

We also have qmllib as an optional dependency as one of the reordering algorithms.

OpenBabel is a runtime dependency, but it is not installed by default when installing clusttraj with pip. If you use Conda, install OpenBabel from conda-forge before installing clusttraj:

conda install -c conda-forge openbabel
pip install clusttraj

For pip-only environments, clusttraj provides an optional dependency that installs the openbabel-wheel package:

pip install "clusttraj[openbabel]"

Avoid mixing Conda OpenBabel and openbabel-wheel in the same environment. If you see OpenBabel import or linker errors, remove one provider and reinstall OpenBabel from the package manager used by that environment.

You can install clusttraj using pip:

pip install clusttraj

If you want to use the qmllib reordering algorithm, you can install it with:

pip install "clusttraj[qml]"

Citation

If you use clusttraj in your academic work, please cite:

Rafael Bicudo Ribeiro and Henrique Musseli Cezar
"clusttraj: A Solvent-Informed Clustering Tool for Molecular Modeling"
Journal of Chemical Theory and Computation, 21, 6759–6768, 2025.
https://pubs.acs.org/doi/10.1021/acs.jctc.5c00634

Usage

To see all the options, run the script with the -h option:

clusttraj -h

python -m clusttraj -h

The mandatory arguments are the path to the file containing the trajectory (in a format that OpenBabel can read with Pybel), and one clustering criterion: the maximum RMSD to join two configurations (-rmsd), the silhouette score option (-ss), or the number of clusters to generate (-nc/--n-clusters).

clusttraj trajectory.xyz -rmsd 1.0

To cut the dendrogram to a fixed number of clusters, use --n-clusters:

clusttraj trajectory.xyz --n-clusters 5

clusttraj trajectory.xyz -ss

Additional options are available for specifying the input and output files and selecting how the clustering is done. The possible methods used for the agglomerative clustering are the ones available in the linkage method of SciPy's hierarchical clustering. A list of the possible methods (selected with -m) and a description of each of them can be found in the SciPy documentation for scipy.cluster.hierarchy.linkage.

The default method for the linkage is average, since it was found to give a good compromise between the number of clusters and the actual similarity. To learn more about how the clustering is performed using this algorithm, see UPGMA.
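As a rough illustration of what average linkage does with a condensed distance matrix, the sketch below uses SciPy directly. The toy distances are made up, and the choice of fcluster criteria ("distance" for an RMSD cutoff, "maxclust" for a fixed number of clusters) is an assumption about how such cuts are usually expressed in SciPy, not a statement about clusttraj's internals.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy condensed distance matrix for 4 structures, in the order
# d(0,1), d(0,2), d(0,3), d(1,2), d(1,3), d(2,3).
distmat = np.array([0.5, 3.0, 3.2, 3.1, 3.3, 0.4])

# Average linkage (UPGMA): the distance between two clusters is the mean
# of all pairwise distances between their members.
Z = linkage(distmat, method="average")

# Cut the dendrogram at a maximum distance of 1.0 ...
by_distance = fcluster(Z, t=1.0, criterion="distance")
# ... or ask for at most 2 clusters.
by_count = fcluster(Z, t=2, criterion="maxclust")

print(by_distance, by_count)
```

In this toy case both cuts group structures 0 and 1 together and structures 2 and 3 together, since the two pairs only merge at the average distance 3.15.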

If the -n option is used, the hydrogens are ignored when performing the Kabsch algorithm to find the superposition and calculating the RMSD. This is useful to avoid clustering identical structures with just a methyl group rotated as different.

The -e or --reorder option tries to reorder the atoms to increase the overlap and reduce the RMSD. The algorithm can be selected with --reorder-alg from qml (default), hungarian, brute or distance. For more information about the implementation, see the RMSD package. The reorder option can be used together with the -ns option, which takes an integer with the number of atoms in the solute. When -ns is used, the script first superposes the configurations considering only the solute atoms and then reorders considering only the solvent atoms (the atoms that come after the first ns atoms in the input). For solute-solvent systems, the use of -ns is strongly encouraged.
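To give a feeling for what reordering means, here is a minimal sketch of a Hungarian-style assignment using SciPy's linear_sum_assignment. It is not the implementation from the RMSD package: the function name hungarian_reorder is hypothetical, both structures are assumed to be already superposed and to contain the same number of atoms of each element, and Euclidean distance is used as the assignment cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def hungarian_reorder(ref, coords, labels_ref, labels):
    """Indices that permute `coords` so its atoms line up with `ref`.

    Atoms are only matched to atoms of the same element.
    """
    order = np.empty(len(ref), dtype=int)
    for element in np.unique(labels_ref):
        ref_idx = np.where(labels_ref == element)[0]
        idx = np.where(labels == element)[0]
        # Cost of assigning each candidate atom to each reference atom.
        cost = cdist(ref[ref_idx], coords[idx])
        rows, cols = linear_sum_assignment(cost)
        order[ref_idx[rows]] = idx[cols]
    return order
```

With a solute-solvent system, the same function could be applied only to the slice of atoms after the first ns entries, which mirrors the role of the -ns option described above.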

To use an already saved RMSD matrix, specify the file containing the RMSD matrix in the condensed form with the -i option. The options -i and -od are mutually exclusive.

The -p flag specifies that PDF plots of some information will be saved. In this case, the filenames will start with the same name used for the clusters output (specified with the -oc option). When the option is used, the following is saved to disk:

A plot with the multidimensional scaling representation of the RMSD matrix, colored with the clustering information
The dendrogram
The cluster classification evolution, which shows how the configurations were classified along the trajectory. This might be useful to analyze the quality of your sampling.

If the -cc option is specified (along with a format supported by OpenBabel), the configurations belonging to the same cluster are superposed and written to a file. The superpositions use the medoid of the cluster as the reference. The medoid is written as the first structure in the clustered structure files. If you did not consider the hydrogens while building the RMSD matrix, remember to pass the -n option here as well, even when reading the matrix with -i, since the superposition honors that flag.

The -mc option (along with a format supported by OpenBabel) saves the superposed medoid configuration for all the clusters in a single file. The first medoid is used as the reference configuration and the remaining medoids are superposed to it.
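For readers unfamiliar with the term, the medoid of a cluster is the member with the smallest total distance to the other members. The sketch below picks medoids from a condensed distance matrix and an array of cluster labels; the function name cluster_medoids is hypothetical and the code only illustrates the definition, it is not taken from clusttraj.

```python
import numpy as np
from scipy.spatial.distance import squareform


def cluster_medoids(condensed, labels):
    """Map each cluster label to the index of its medoid structure."""
    square = squareform(condensed)
    medoids = {}
    for cluster in np.unique(labels):
        members = np.where(labels == cluster)[0]
        # Sum of distances from each member to the rest of its cluster.
        within = square[np.ix_(members, members)].sum(axis=1)
        medoids[int(cluster)] = int(members[np.argmin(within)])
    return medoids
```

Unlike a centroid, the medoid is always one of the actual configurations of the trajectory, so it can be written out as a structure.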

Threading and parallelization

The -np option specifies the number of processes used to calculate the RMSD matrix. Since this is the most time-consuming task of the clustering and an embarrassingly parallel problem, it was parallelized using a Python multiprocessing pool. The default value for -np is 4.

When using -np, make sure you also set the correct number of threads for NumPy. If you want to use just the multiprocessing parallelization (recommended), use the following bash commands to set the number of NumPy threads to one:

export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
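If the environment cannot be prepared in the shell, the same limits can be set from a Python driver script. This is only a sketch under one assumption: the variables are set before NumPy (or anything that imports it) is loaded, since the underlying threading libraries typically read them at import time. The name THREAD_VARS is illustrative.

```python
import os

# Same variables as in the bash commands above.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)

# Assumption: nothing has imported NumPy yet in this process.
for name in THREAD_VARS:
    os.environ[name] = "1"

import numpy as np  # noqa: E402  (imported only after the limits are set)
```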

Output

The logging is done both to stdout and to the file clusttraj.log. The number of clusters that were found, as well as the number of members of each cluster, are printed in a table. Below is an example of how this information is printed:

$ clusttraj trajectory.xyz -rmsd 3.2 -np 4 -p -n -cc xyz
2024-12-12 17:48:19,268 INFO     [distmat.py:34] <get_distmat> Calculating RMSD matrix using 4 threads

2024-12-12 17:48:23,800 INFO     [distmat.py:38] <get_distmat> Saving condensed RMSD matrix to distmat.npy

2024-12-12 17:48:23,801 INFO     [classify.py:97] <classify_structures> Clustering using 'average' method to join the clusters

2024-12-12 17:48:23,803 INFO     [classify.py:105] <classify_structures> Saving clustering classification to clusters.dat

2024-12-12 17:48:23,804 INFO     [main.py:59] <main> Writing superposed configurations per cluster to files clusters_confs_*.xyz

2024-12-12 17:48:26,729 INFO     [main.py:102] <main> A total 100 snapshots were read and 7 cluster(s) was(were) found.
The cluster sizes are:
Cluster	Size
1	3
2	3
3	31
4	30
5	18
6	3
7	12

2024-12-12 17:48:26,729 INFO     [main.py:126] <main> Total wall time: 7.462641 s

In the cluster output file (-oc option, default filename clusters.dat), the classification of each structure in the trajectory is printed, one per line. For example, if the first structure of the trajectory belongs to cluster 7, the second to cluster 4, the third to cluster 5 and so on, the file clusters.dat will start with

$ head clusters.dat
7
4
5
3
4
7
6
7
4
3
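The classification file is easy to post-process. The sketch below assumes the layout suggested by the excerpt above (one integer cluster label per line) and uses a temporary stand-in file with those ten labels; the function name read_cluster_labels is hypothetical.

```python
import os
import tempfile
from collections import Counter


def read_cluster_labels(path):
    """Read one integer cluster label per line (assumed file layout)."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]


# Stand-in for clusters.dat holding the ten labels shown above.
sample = ["7", "4", "5", "3", "4", "7", "6", "7", "4", "3"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clusters.dat")
    with open(path, "w") as handle:
        handle.write("\n".join(sample) + "\n")
    labels = read_cluster_labels(path)

# Count how many structures fall in each cluster.
sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print(f"{cluster}\t{size}")
```

The position of a label in the list is the index of the corresponding structure in the trajectory, so the same list can be used to select frames belonging to a given cluster.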

The plot of the multidimensional scaling representation (saved when the -p option is used) shows each cluster in its own color. [Figure: Example MDS]

The dendrogram is plotted with a horizontal line indicating the cutoff used for defining the clusters. [Figure: Example dendrogram]

The evolution of the classification along the trajectory is shown in a third plot. [Figure: Example evolution]

If you wish to use the RMSD matrix file for other purposes, bear in mind that the matrix is stored in the condensed form, i.e., only the upper triangle of the matrix (not including the diagonal) is saved, in NumPy's .npy format. This means that if you have N structures in your trajectory, your file (specified with the -od option, default filename distmat.npy) will hold N(N-1)/2 values, each one representing a distance.
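A condensed matrix saved this way can be loaded with NumPy and expanded to the full symmetric form with SciPy's squareform. The sketch below uses a temporary stand-in file with made-up distances for four structures in place of a real distmat.npy.

```python
import os
import tempfile

import numpy as np
from scipy.spatial.distance import squareform

# Stand-in for distmat.npy: 4 structures give 4*3/2 = 6 distances.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "distmat.npy")
    np.save(path, np.array([0.5, 3.0, 3.2, 3.1, 3.3, 0.4]))
    condensed = np.load(path)

# Recover N from the number of stored values, M = N(N-1)/2.
n_structures = int(round((1 + np.sqrt(1 + 8 * len(condensed))) / 2))

# Expand to the full N x N symmetric matrix with zeros on the diagonal.
square = squareform(condensed)
print(n_structures, square.shape)
```

Entry square[i, j] is then the distance between structures i and j, which is often more convenient for plotting or for selecting neighbors of a given frame.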

Release history

Version	Release date
1.2.0 (this version)	May 15, 2026
1.1.0	May 13, 2026
1.0.0	Apr 22, 2025
0.3.4	Feb 4, 2025
0.3.3	Jan 11, 2025
0.3.2	Jan 11, 2025
0.3.1	Dec 26, 2024
0.3.0	Dec 26, 2024
0.2.1	Dec 20, 2024
0.1.3	Sep 27, 2023
0.1.2	Sep 25, 2023
0.1.1	Sep 25, 2023
0.1.0	Sep 25, 2023
