Skip to main content

Lumping tool for state trajectories

Project description

The Most Probable Path Algorithm for Reducing the Number of States in a State Trajectory

This package implements the most probable path (MPP) algorithm, which is used to coarse-grain the number of discrete states of a Markov process. Based on a microstate trajectory, a Markov state model is estimated utilising msmhelper. The transition probabilities and optional other descriptors are used to combine states in such a way that the final macrostates exhibit a given minimum population and metastability.

In the most basic example, a lumping tree is generated by iteratively selecting the least stable state and merging it with the state, to which its transition probability is highest. In a second step, the tree is parsed in reverse order (starting at the root) in order to identify the macrostates. For details, see publication tbd.

Features

  • Perform the most probable path (MPP) algorithm on a given microstate trajectory
  • Variety of analysis plots
  • Multi trajectory support
  • Extensions to the basic algorithm
    • Similarity by Kullback-Leibler divergence of transition probabilities
    • Incorporation Jenson-Shannon divergence of a feature (e.g. contact distances)
    • Stochastic lumping
  • Three levels of user interface
    • as integration in your Python code (use the MPP.Lumping or MPP.run.Data objects)
    • as Python module (python -m MPP.run --help)
    • in your Snakemake workflow (see ´workflow`)

Installation

The package is available in the Python Package Index (PyPI) and can be installed via pip:

python -m pip install mpp-lumping

Usage

Dependent on your skills and needs, the module can be used at three different entry points:

  • Integrate the central MPP.Lumping or MPP.run.Data object in your Python pipeline.
  • Use the high-level Python interface (python -m MPP.run ...) via the command line.
  • In a Snakemake workflow where you only need to provide the configuration of your system and you're ready to go.

Config File

Config files (YAML files) are used to pass the information of where the files are located and some lumping parameters. Below you see a reference config file with all possible parameters. Note that only the following fields are mandatory: source, microstate trajectory, multi feature trajectory, frame length, lagtime, pop_thr, q_min. Please refer to the wiki for a detailed description of the parameters (tbd).

source: data/HP35/input/ # root directory of the other files

microstate trajectory: microstate_trajectory # the microstate trajectory
multi feature trajectory: contact_distances_trajectory # the feature trajectory, each line contains the feature values of the respective feature
cluster file: clusters # Defines the contact clusters, each row contains the contact indices of a contact cluster
contact index file: contacts.ndx # residue indices for contacts.
limits: null # contains the lengths of each trajectory when multiple trajectories are concatenated

topology file: structure.pdb # topology file used with the xtc trajectory
xtc file: trajectory.xtc # the xtc trajectory file
helices: helices # definition of secondary structure elements

contact threshold: null # Threshold for in the feature space below which e.g. a contact is considered to be formed.
frame length: 0.2 # in ns / frame
lagtime: 50 # in frames
pop_thr: 0.005 # population threshold for macrostates
q_min: 0.5 # minimum metastability of macrostates

n timescales: 3 # number of timescales to plot in the implied timescales plot. 3 is the default.

# For stochastic lumping
stochastic:
  method: n
  param: 2
  n: 10

# PyMol rendering
view: view # contains the view information for PyMol
width: 500 # width of the image in px
height: 500 # height of the image in px

Python Module

Running the package requires a configuration file as described above as a first parameter. The following two positional arguments define the similarity between two states. First comes the dynamic similarity (T, KL, none), second the geometric similarity (JS, none). For reference, G-PCCA can be performed by issuing gpcca first and then the number of macrostates to create. Pass ref in order to take the number of macrostates from the T none lumping (the similarity between states corresponds to the transition probability between them).

Provide the target file (where to store the plot) with the option -o. The lumping tree (the result of the first, potentially intense step) is defined by a Z matrix, which can be stored and loaded by the -Z option. If the provided path exists, this file is loaded as Z matrix. For the --rmsd option, the same holds true as it is intense to calculate. With the --rmsd-feature option, you can select if the RMSD should be calculated for the C alpha atoms in the Cartesian coordinate space or the space of your feature, e.g. contact distances. -r allows you to draw N random frames of each macrostate and -p produces desired plots. --get-least-moving-residues saves the indices in the state trajectory of the least moving residues per macrostate to a text file. This allows to find the residues, which participate in the most stable contact distances for each macrostate.

~$ python -m MPP.run --help
usage: Perform MPP on MD simulation data [-h] [-o OUT] [-Z Z] [--rmsd RMSD]
                                         [--rmsd-feature RMSD_FEATURE] [--xtc-stride XTC_STRIDE] [-r N]
                                         [-p PLOT]
                                         [--get-least-moving-residues GET_LEAST_MOVING_RESIDUES]
                                         data_specification d g

This program allows for the analysis of MD data utilizing the most probable path algorithm. It allows
for easy plotting of different quality measures.

positional arguments:
  data_specification    yaml file containing specification of files and parameters of the simulation
  d                     dij to be used.
  g                     gij to be used.

options:
  -h, --help            show this help message and exit
  -o OUT, --out OUT     Where to store the plot
  -Z Z                  Perform MPP and write the Z matrix
  --rmsd RMSD           Generate and write RMSD to file
  --rmsd-feature RMSD_FEATURE
                        'CA' for C-alpha RMSD or 'feature' for feature RMSD (default: CA)
  -r N, --draw-random N
                        Draw N random frames for each macrostate
  -p PLOT, --plot PLOT  Generate listed plots. Possible arguments include dendrogram, timescales,
                        sankey, contacts, macrotraj, ck_test, rmsd, delta_rmsd, state_network,
                        macro_feature, stochastic_state_similarity, relative_implied_timescales,
                        transition_matrix, transition_time and macrostate_trajectory. The latter writes
                        the macrostate trajectory to a txt file.
  --get-least-moving-residues GET_LEAST_MOVING_RESIDUES
                        Write least moving residues for each macrostate to a file

The Snakemake Workflow

Snakemake is a workflow organization tool and used here to provide a high level user interface. In general, you only need to tell snakemake which file you would like to have, e.g.

snakemake --cores 'all' --sdm conda -p data/HP35/results/{t,t_js,kl,kl_js}/dendrogram.pdf --cache

Explanation of some flags:

  • --cores Number of cores to utilize. 'all' for all cores.
  • --software-deployment-method, --sdm Use conda to deploy software environment.
  • --snakefile, -s Use a non-local snakefile. Use e.g. -s /data/evaluation/MPP/stochastic_MPP_Felix/tools/MPP/workflow/Snakefile
  • --dry-run, --dryrun, n Do not execute anything, just print out the jobs that would be run.
  • --cache So rules may be eligible for caching. Enable it with this option.
  • --force, -f Force recreation of the given file(s).
  • --printshellcmds, -p Print out the shell commands that will be executed.

More information can be found here: Snakemake Documentation

Note that bash parameter expansion (the use of { and }) is possible to create e.g. several diagrams at once for multiple systems and/or setups.

Data Directory Structure

data/
├── System1
│   ├── input
│      ├── clusters
│      ├── config.yml
│      ├── contact_distances_trajectory
│      ├── contacts.ndx
│      ├── microstate_trajectory
│      ├── README.md
│      ├── topology.pdb
│      ├── trajectory.xtc
│      └── view
│   └── results
│       ├── t
│          ├── output_file1
│          ├── output_file2
│          └── ...
│       ├── kl
│
├── System2

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mpp_lumping-0.9.1.tar.gz (447.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mpp_lumping-0.9.1-py3-none-any.whl (51.7 kB view details)

Uploaded Python 3

File details

Details for the file mpp_lumping-0.9.1.tar.gz.

File metadata

  • Download URL: mpp_lumping-0.9.1.tar.gz
  • Upload date:
  • Size: 447.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for mpp_lumping-0.9.1.tar.gz
Algorithm Hash digest
SHA256 f6f21d4bb5ae5c690f456c36fc81391ad1153682ff8fe3dbb6347864b574a27e
MD5 0c56d186a8bb6705499a191b01f57ccb
BLAKE2b-256 5a4d1fc4c63b6da3b9dd77c299704a4112509f28b895d3831ef92380c91038b9

See more details on using hashes here.

File details

Details for the file mpp_lumping-0.9.1-py3-none-any.whl.

File metadata

  • Download URL: mpp_lumping-0.9.1-py3-none-any.whl
  • Upload date:
  • Size: 51.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for mpp_lumping-0.9.1-py3-none-any.whl
Algorithm Hash digest
SHA256 5eca8fc87cda272f8714f3489a8a962260d577b921bfd4c7e3e135e234454981
MD5 257f8cd1b5badf14717dbb773486d5ca
BLAKE2b-256 d6c9cf08871af20213a93e039b07e43e7c0c422c3ddf1cd61e785a3024a54feb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page