Lumping tool for state trajectories
Project description
The Most Probable Path Algorithm for Reducing the Number of States in a State Trajectory
This package implements the most probable path (MPP) algorithm, which is used to coarse-grain the number of discrete states of a Markov process. Based on a microstate trajectory, a Markov state model is estimated utilising msmhelper. The transition probabilities and optional other descriptors are used to combine states in such a way that the final macrostates exhibit a given minimum population and metastability.
In the most basic example, a lumping tree is generated by iteratively selecting the least stable state and merging it with the state, to which its transition probability is highest. In a second step, the tree is parsed in reverse order (starting at the root) in order to identify the macrostates. For details, see publication tbd.
Features
- Perform the most probable path (MPP) algorithm on a given microstate trajectory
- Variety of analysis plots
- Multi trajectory support
- Extensions to the basic algorithm
- Similarity by Kullback-Leibler divergence of transition probabilities
- Incorporation Jenson-Shannon divergence of a feature (e.g. contact distances)
- Stochastic lumping
- Three levels of user interface
- as integration in your Python code (use the
MPP.LumpingorMPP.run.Dataobjects) - as Python module (
python -m MPP.run --help) - in your Snakemake workflow (see ´workflow`)
- as integration in your Python code (use the
Installation
The package is available in the Python Package Index (PyPI) and can be installed via pip:
python -m pip install mpp-lumping
Usage
Dependent on your skills and needs, the module can be used at three different entry points:
- Integrate the central
MPP.LumpingorMPP.run.Dataobject in your Python pipeline. - Use the high-level Python interface (
python -m MPP.run ...) via the command line. - In a Snakemake workflow where you only need to provide the configuration of your system and you're ready to go.
Config File
Config files (YAML files) are used to pass the information of where the files are located and some lumping parameters. Below you see a reference config file with all possible parameters. Note that only the following fields are mandatory: source, microstate trajectory, multi feature trajectory, frame length, lagtime, pop_thr, q_min. Please refer to the wiki for a detailed description of the parameters (tbd).
source: data/HP35/input/ # root directory of the other files
microstate trajectory: microstate_trajectory # the microstate trajectory
multi feature trajectory: contact_distances_trajectory # the feature trajectory, each line contains the feature values of the respective feature
cluster file: clusters # Defines the contact clusters, each row contains the contact indices of a contact cluster
contact index file: contacts.ndx # residue indices for contacts.
limits: null # contains the lengths of each trajectory when multiple trajectories are concatenated
topology file: structure.pdb # topology file used with the xtc trajectory
xtc file: trajectory.xtc # the xtc trajectory file
helices: helices # definition of secondary structure elements
contact threshold: null # Threshold for in the feature space below which e.g. a contact is considered to be formed.
frame length: 0.2 # in ns / frame
lagtime: 50 # in frames
pop_thr: 0.005 # population threshold for macrostates
q_min: 0.5 # minimum metastability of macrostates
n timescales: 3 # number of timescales to plot in the implied timescales plot. 3 is the default.
# For stochastic lumping
stochastic:
method: n
param: 2
n: 10
# PyMol rendering
view: view # contains the view information for PyMol
width: 500 # width of the image in px
height: 500 # height of the image in px
Python Module
Running the package requires a configuration file as described above as a first parameter. The following two positional arguments define the similarity between two states. First comes the dynamic similarity (T, KL, none), second the geometric similarity (JS, none). For reference, G-PCCA can be performed by issuing gpcca first and then the number of macrostates to create. Pass ref in order to take the number of macrostates from the T none lumping (the similarity between states corresponds to the transition probability between them).
Provide the target file (where to store the plot) with the option -o. The lumping tree (the result of the first, potentially intense step) is defined by a Z matrix, which can be stored and loaded by the -Z option. If the provided path exists, this file is loaded as Z matrix. For the --rmsd option, the same holds true as it is intense to calculate. With the --rmsd-feature option, you can select if the RMSD should be calculated for the C alpha atoms in the Cartesian coordinate space or the space of your feature, e.g. contact distances. -r allows you to draw N random frames of each macrostate and -p produces desired plots. --get-least-moving-residues saves the indices in the state trajectory of the least moving residues per macrostate to a text file. This allows to find the residues, which participate in the most stable contact distances for each macrostate.
~$ python -m MPP.run --help
usage: Perform MPP on MD simulation data [-h] [-o OUT] [-Z Z] [--rmsd RMSD]
[--rmsd-feature RMSD_FEATURE] [--xtc-stride XTC_STRIDE] [-r N]
[-p PLOT]
[--get-least-moving-residues GET_LEAST_MOVING_RESIDUES]
data_specification d g
This program allows for the analysis of MD data utilizing the most probable path algorithm. It allows
for easy plotting of different quality measures.
positional arguments:
data_specification yaml file containing specification of files and parameters of the simulation
d dij to be used.
g gij to be used.
options:
-h, --help show this help message and exit
-o OUT, --out OUT Where to store the plot
-Z Z Perform MPP and write the Z matrix
--rmsd RMSD Generate and write RMSD to file
--rmsd-feature RMSD_FEATURE
'CA' for C-alpha RMSD or 'feature' for feature RMSD (default: CA)
-r N, --draw-random N
Draw N random frames for each macrostate
-p PLOT, --plot PLOT Generate listed plots. Possible arguments include dendrogram, timescales,
sankey, contacts, macrotraj, ck_test, rmsd, delta_rmsd, state_network,
macro_feature, stochastic_state_similarity, relative_implied_timescales,
transition_matrix, transition_time and macrostate_trajectory. The latter writes
the macrostate trajectory to a txt file.
--get-least-moving-residues GET_LEAST_MOVING_RESIDUES
Write least moving residues for each macrostate to a file
The Snakemake Workflow
Snakemake is a workflow organization tool and used here to provide a high level user interface. In general, you only need to tell snakemake which file you would like to have, e.g.
snakemake --cores 'all' --sdm conda -p data/HP35/results/{t,t_js,kl,kl_js}/dendrogram.pdf --cache
Explanation of some flags:
--coresNumber of cores to utilize. 'all' for all cores.--software-deployment-method, --sdmUse conda to deploy software environment.--snakefile, -sUse a non-local snakefile. Use e.g.-s /data/evaluation/MPP/stochastic_MPP_Felix/tools/MPP/workflow/Snakefile--dry-run, --dryrun, nDo not execute anything, just print out the jobs that would be run.--cacheSo rules may be eligible for caching. Enable it with this option.--force, -fForce recreation of the given file(s).--printshellcmds, -pPrint out the shell commands that will be executed.
More information can be found here: Snakemake Documentation
Note that bash parameter expansion (the use of { and }) is possible to create e.g. several diagrams at once for multiple systems and/or setups.
Data Directory Structure
data/
├── System1
│ ├── input
│ │ ├── clusters
│ │ ├── config.yml
│ │ ├── contact_distances_trajectory
│ │ ├── contacts.ndx
│ │ ├── microstate_trajectory
│ │ ├── README.md
│ │ ├── topology.pdb
│ │ ├── trajectory.xtc
│ │ └── view
│ └── results
│ ├── t
│ │ ├── output_file1
│ │ ├── output_file2
│ │ └── ...
│ ├── kl
│
├── System2
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mpp_lumping-0.9.1.tar.gz.
File metadata
- Download URL: mpp_lumping-0.9.1.tar.gz
- Upload date:
- Size: 447.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f6f21d4bb5ae5c690f456c36fc81391ad1153682ff8fe3dbb6347864b574a27e
|
|
| MD5 |
0c56d186a8bb6705499a191b01f57ccb
|
|
| BLAKE2b-256 |
5a4d1fc4c63b6da3b9dd77c299704a4112509f28b895d3831ef92380c91038b9
|
File details
Details for the file mpp_lumping-0.9.1-py3-none-any.whl.
File metadata
- Download URL: mpp_lumping-0.9.1-py3-none-any.whl
- Upload date:
- Size: 51.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5eca8fc87cda272f8714f3489a8a962260d577b921bfd4c7e3e135e234454981
|
|
| MD5 |
257f8cd1b5badf14717dbb773486d5ca
|
|
| BLAKE2b-256 |
d6c9cf08871af20213a93e039b07e43e7c0c422c3ddf1cd61e785a3024a54feb
|