Implementation of the LoCoHD metric for quantitative protein structure and substructure comparison
Welcome to LoCoHD!
LoCoHD (Local Composition Hellinger Distance) is a metric for comparing protein structures. It can be used for a single structure-structure comparison, for comparing multiple structures within ensembles, or for comparing structures along an MD simulation trajectory. It is also a general-purpose metric for labelled point clouds with variable point counts. In contrast to RMSD, the TM-score, lDDT, or GDT_TS, it measures differences in local composition rather than Euclidean deviations.
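The Hellinger distance in the name compares two discrete distributions. Below is a minimal sketch of that building block alone, assuming a local composition is summarized as counts per label. The function name and the labels other than "Cent" are made up for illustration, and this is not the full LoCoHD calculation.

```python
import math
from collections import Counter


def hellinger_distance(counts_a, counts_b, categories):
    """Hellinger distance between two count vectors over the same categories."""
    total_a = sum(counts_a.get(c, 0) for c in categories)
    total_b = sum(counts_b.get(c, 0) for c in categories)
    if total_a == 0 or total_b == 0:
        raise ValueError("both environments must contain at least one point")

    # Sum of squared differences between the square roots of the relative frequencies.
    squared_sum = sum(
        (math.sqrt(counts_a.get(c, 0) / total_a) - math.sqrt(counts_b.get(c, 0) / total_b)) ** 2
        for c in categories
    )
    return math.sqrt(squared_sum / 2.0)


categories = ["Cent", "N", "O"]  # illustrative labels, not a real typing scheme
env_a = Counter(["Cent", "Cent", "N", "O"])
env_b = Counter(["Cent", "N", "N", "O"])

print(hellinger_distance(env_a, env_a, categories))  # identical compositions
print(hellinger_distance(env_a, env_b, categories))  # slightly different compositions
```

Identical compositions give 0 and compositions with no label in common give 1, which matches the score range described at the end of this page.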
Where can I read about it?
This work is yet to be published in a scientific journal.
How can I install it?
From PyPI
With pip, it is easy to add LoCoHD to your packages:
pip install loco-hd
Building from source
To build LoCoHD from source, you first need to install Rust on your system. You also need Python 3, pip, and the Maturin package. Both Rust and Maturin can be installed with the following one-liners:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
pip install maturin
Next, clone the repository and enter it:
git clone https://github.com/fazekaszs/loco_hd && cd loco_hd
Run Maturin to install LoCoHD into your active environment:
maturin develop
And you are done!
Running the Rust unit tests
Unit tests can be run with Cargo. Since this is a PyO3 project, an additional flag is needed:
cargo test --no-default-features
How can I use it?
LoCoHD is intended to be used within Python scripts, typically with BioPython as the .pdb file reader. It can also be used with other protein/molecular structure readers, but then the user has to write a parser that converts the contents of the file into the information LoCoHD requires. An example of this can be found here, where the structures come from a molecular dynamics trajectory and parsing is done with MDAnalysis.
For the comparison of two protein structures with LoCoHD the following simple steps are necessary:
1. Loading the structures from pdb files
# These imports are needed by all of the following sections!
from pathlib import Path
from Bio.PDB.PDBParser import PDBParser
from loco_hd import *
structure1 = PDBParser(QUIET=True).get_structure("s1", "path/to/structure1.pdb")
structure2 = PDBParser(QUIET=True).get_structure("s2", "path/to/structure2.pdb")
2. Selecting the primitive typing scheme
In this section, the true protein structures (with "true" atoms) are converted into primitive template structures (lists containing PrimitiveAtomTemplate instances). These serve as intermediates between the Atom class (from BioPython) and the PrimitiveAtom class (from loco-hd).
primitive_assigner = PrimitiveAssigner(Path("path/to/primitive/typing/scheme.json"))
pra_templates1 = primitive_assigner.assign_primitive_structure(structure1)
pra_templates2 = primitive_assigner.assign_primitive_structure(structure2)
3. Selecting the anchor atoms
Here, it is assumed that the two structures contain the same number of anchor atoms and that the anchor atoms are paired in the same order. This is not a requirement, since the anchor atom selection and pairing can be customized freely by choosing the primitive atom index pairs. It is assumed in these examples only to keep things simple.
If all atoms are anchor atoms, we can use:
anchor_pairs = [
    (idx, idx)
    for idx in range(len(pra_templates1))
]
Or, if only primitive atoms with the "Cent" primitive type are anchors:
anchor_pairs = [
    (idx, idx)
    for idx, prat in enumerate(pra_templates1)
    if prat.primitive_type == "Cent"
]
The only important thing is that the indices inside the tuples must be valid within the first and second primitive atom (template) lists.
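The index requirement above can be checked before any scores are computed. The helper below is a hypothetical sketch and not part of loco-hd. It only needs the lengths of the two lists, so small stand-in numbers are used here in place of len(pra_templates1) and len(pra_templates2).

```python
def check_anchor_pairs(anchor_pairs, n_first, n_second):
    """Raise ValueError if any index pair falls outside the two lists."""
    for pos, (idx1, idx2) in enumerate(anchor_pairs):
        if not 0 <= idx1 < n_first:
            raise ValueError(f"pair {pos}: index {idx1} is invalid for the first list")
        if not 0 <= idx2 < n_second:
            raise ValueError(f"pair {pos}: index {idx2} is invalid for the second list")


# Stand-in lengths; in the walkthrough these would be
# len(pra_templates1) and len(pra_templates2).
check_anchor_pairs([(0, 0), (1, 2)], n_first=3, n_second=3)
```

Note that the pairs do not have to be of the form (idx, idx); any combination of valid indices passes the check.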
4. Conversion of PrimitiveAtomTemplate instances to PrimitiveAtom instances
The intermediate templates are only necessary so that we have an opportunity to set the tag field of our PrimitiveAtom instances. This field is used for conditionally setting the "environment" of each anchor atom. For example, it can be used to ban homo-residue contacts, i.e. to exclude a primitive atom from the environment of an anchor atom if the primitive atom comes from the same residue as the anchor. For further explanation, see section 5.
To do the conversion in a clean and effective manner we can define the following function:
def prat_to_pra(prat: PrimitiveAtomTemplate) -> PrimitiveAtom:
    resi_id = prat.atom_source.source_residue
    resname = prat.atom_source.source_residue_name
    source = f"{resi_id[2]}/{resi_id[3][1]}-{resname}"
    return PrimitiveAtom(
        prat.primitive_type,
        source,  # this is the tag field!
        prat.coordinates
    )
After this, a simple map will do:
pra1 = list(map(prat_to_pra, pra_templates1))
pra2 = list(map(prat_to_pra, pra_templates2))
5. Creating the LoCoHD instance
This will create a simple LoCoHD instance that operates with a uniform weight function between 3 and 10 ångströms and doesn't consider the tag field of the primitive atoms (i.e. it accepts any anchor atom - primitive atom contact):
lchd = LoCoHD(primitive_assigner.all_primitive_types)
To explicitly state the weight function use:
w_func = WeightFunction("uniform", [3., 10.])
lchd = LoCoHD(primitive_assigner.all_primitive_types, w_func)
There is a collection of weight functions available.
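To give a feeling for what "uniform between 3 and 10" means, here is a plain-Python sketch. It assumes that the two parameters are the lower and upper bounds of the interval and that the weight behaves like a probability density over distance. Both function names are hypothetical and do not come from loco-hd.

```python
def uniform_density(r, lower=3.0, upper=10.0):
    """Density of a uniform distribution on [lower, upper]; zero outside."""
    if lower <= r <= upper:
        return 1.0 / (upper - lower)
    return 0.0


def uniform_cumulative(r, lower=3.0, upper=10.0):
    """Fraction of the total weight lying at distances up to r."""
    if r <= lower:
        return 0.0
    if r >= upper:
        return 1.0
    return (r - lower) / (upper - lower)


# Distances below 3 and above 10 carry no weight under this assumption.
for r in (2.0, 3.0, 6.5, 10.0, 12.0):
    print(r, uniform_density(r), uniform_cumulative(r))
```

Under these assumptions, every distance inside the interval contributes equally and half of the total weight lies below 6.5.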
Or to explicitly state the tag-pairing rule:
w_func = WeightFunction("uniform", [3., 10.])
tag_pairing_rule = TagPairingRule({"accept_same": False})
lchd = LoCoHD(
    primitive_assigner.all_primitive_types,
    w_func,
    tag_pairing_rule
)
The latter code creates a LoCoHD instance that considers the tag field and disregards primitive atoms in the environment that have the same tag as the anchor atom.
Other tag pairing rules are also available.
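The effect of rejecting same-tag contacts can be pictured with a small stand-alone sketch. TaggedPoint and environment_without_same_tag are hypothetical stand-ins, not loco-hd classes or functions, and the tags follow the chain/residue-number-residue-name pattern produced by prat_to_pra in section 4.

```python
from typing import NamedTuple


class TaggedPoint(NamedTuple):
    """Hypothetical stand-in for a primitive atom: a type and a tag."""
    primitive_type: str
    tag: str


def environment_without_same_tag(anchor, points):
    """Keep only the points whose tag differs from the anchor's tag."""
    return [p for p in points if p.tag != anchor.tag]


anchor = TaggedPoint("Cent", "A/10-ALA")
points = [
    TaggedPoint("Cent", "A/10-ALA"),  # the anchor itself, dropped
    TaggedPoint("N", "A/10-ALA"),     # same residue as the anchor, dropped
    TaggedPoint("O", "A/11-GLY"),     # neighbouring residue, kept
]

print(environment_without_same_tag(anchor, points))
```

With residue-based tags, only atoms from other residues remain in the environment, which is the homo-residue ban mentioned in section 4.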
Finally, the number of parallel threads available to LoCoHD can be set as the last argument:
lchd = LoCoHD(
    primitive_assigner.all_primitive_types,
    w_func,
    tag_pairing_rule,
    4
)
6. Calculation of the LoCoHD scores
The LoCoHD class offers several methods for LoCoHD score calculation:
- from_anchors, calculating a single LoCoHD score from two anchor atom environments,
- from_dmxs, calculating several LoCoHD scores, each belonging to corresponding row-pairs of primitive atom distance matrices,
- from_coords, calculating several LoCoHD scores from the coordinates of primitive atoms (it uses the from_dmxs method under the hood),
- from_primitives, calculating several LoCoHD scores from a list of PrimitiveAtom instances.
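For readers unfamiliar with the term, a distance matrix of a point set can be sketched with the standard library alone. This only illustrates the concept. The exact input layout that from_dmxs expects is not described here, and the distance_matrix helper is a made-up name.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points, as a list of rows."""
    return [[math.dist(a, b) for b in coords] for a in coords]


# Three made-up points chosen so that the distances are easy to verify by hand.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
dmx = distance_matrix(coords)

# Row i lists the distances from point i to every point, itself included.
print(dmx[0])
```

Each row describes how far every point lies from one particular point, which is the kind of information an anchor atom environment is built from.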
Most of the time, the from_primitives method should be used. This is the only method that uses PrimitiveAtom instances, takes tag pairing rules into account, and speeds up calculations through the use of an upper distance cutoff for the environments.
lchd_scores = lchd.from_primitives(
    pra1,
    pra2,
    anchor_pairs,
    10.  # upper distance cutoff at 10 ångströms
)
This gives a list of LoCoHD scores (floats), each describing the environmental difference/distance/dissimilarity between two anchor atom environments. This is a score between 0 and 1, with larger values meaning greater dissimilarity.
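Once the scores are available, they can be summarized with ordinary Python. The sketch below uses made-up numbers in place of real output and assumes that the scores come back in the same order as anchor_pairs.

```python
from statistics import mean

# Made-up scores standing in for the output of from_primitives.
lchd_scores = [0.05, 0.31, 0.12, 0.44]
anchor_pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]

print(f"mean score: {mean(lchd_scores):.3f}")

# Rank the anchor pairs from the most to the least dissimilar environment.
ranked = sorted(zip(anchor_pairs, lchd_scores), key=lambda item: item[1], reverse=True)
for (idx1, idx2), score in ranked[:2]:
    print(f"anchors {idx1}/{idx2}: {score:.2f}")
```

The anchors at the top of the ranking are the places where the two structures differ most in their local environments.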