Molecular Barcode (MolBar): Molecular Identifier for Organic and Inorganic Molecules
MolBar
This package provides an implementation of the Molecular Barcode (MolBar) as a quantum chemistry-inspired molecular identifier to ensure data uniqueness in databases, supporting organic and inorganic molecules while attempting to describe relative and absolute configuration including centric, axial/helical and planar chirality.
- ChemRxiv Paper: https://chemrxiv.org/engage/chemrxiv/article-details/65e3cd80e9ebbb4db9c71da0
- Documentation: https://git.rwth-aachen.de/bannwarthlab/molbar/-/blob/main/README.md?ref_type=heads
- Source code: https://git.rwth-aachen.de/bannwarthlab/molbar
- Bug reports: https://git.rwth-aachen.de/bannwarthlab/molbar/-/issues
- Email contact: van.staalduinen@pc.rwth-aachen.de
It does this by fragmenting a molecule into rigid parts, which are then idealized with a specialized non-physical force field. The molecule is then described by different matrices encoding topology (connectivity), topography (3D positions of atoms after input unification), and absolute configuration (by calculating a chirality index). The final barcode is the concatenated spectra of these matrices.
Current Limitations
So far, the input file must contain 3D coordinates and explicit hydrogens.
Further, it should work well for organic and inorganic molecules with typical 2c2e bonding. It can describe molecules based on their relative and absolute configuration, including centric, axial/helical and planar chirality.
Because the usual starting point is a set of 3D Cartesian coordinates, problems can currently occur when it is not easy to determine which atoms are bonded, especially for metal complexes with η-bonds. Problems can also occur if the geometry around a metal in a complex cannot be classified by one of the standard VSEPR models. If you are not sure, use the -d option when using MolBar as a commandline tool, or write_trj=True when using MolBar as a Python module, to look at the optimized trajectories of each fragment. If something is unclear to you or something unusual happens, I would appreciate it if you reported it by opening an issue or by e-mail (van.staalduinen@pc.rwth-aachen.de).
For rigidity analysis, MolBar only considers double/triple bonds and rings to be rigid. For example, hindered rotation due to bulky substituents is not taken into account, but it can be added manually through the input file as an additional dihedral constraint. This should remain an exception and be used carefully.
Getting started (tested on Linux and macOS, compiling works for Windows only in WSL)
For Linux/macOS
Using a virtual environment is highly recommended because it allows you to create isolated environments with their own dependencies, without interfering with other Python projects or the system Python installation. This ensures that your Python environment remains consistent and reproducible across different machines and over time. To create one, type in the following command in your terminal:
python3 -m venv path/to/venv
To activate the environment, type in:
source path/to/venv/bin/activate
To install MolBar, enter the following command in your terminal:
pip install molbar
For Windows
Since compiling in a standard Windows environment does not work yet, it is highly recommended to use the WSL (Windows Subsystem for Linux) extension. Simply follow this installation guide: https://learn.microsoft.com/en-us/windows/wsl/install. Note that a Fortran compiler needs to be installed manually in the WSL environment. Otherwise, the installation of MolBar will result in an error.
For Python usage, it is highly recommended to use Visual Studio Code (VSC) as it provides specific extensions to code directly in WSL. A more detailed guide can be found here: https://code.visualstudio.com/docs/remote/wsl
MolBar Structure
For l-alanine, the MolBar reads:
MolBar | 1.0.0 | C3NO2H7 | -339 -140 -110 -32 13 20 20 20 160 237 432 528 850 | -209 -8 130 160 354 633 | -154 -117 -67 -40 9 9 9 60 74 156 342 457 922 | -31 0 0 0 11
MolBar is constructed as follows:
Version: 1.0.0
Molecular Formula: C3H7NO2
Topology Spectrum: -339 -140 -110 -32 13 20 20 20 160 237 432 528 850 (Encoding atomic connectivity)
Heavy Atom Topology Spectrum: -209 -8 130 160 354 633 (Encoding atomic connectivity without hydrogen atoms. If two molecules have different topology spectra but identical heavy atom topology spectra, they are tautomeric structures)
Topography Spectrum: -154 -117 -67 -40 9 9 9 60 74 156 342 457 922 (3D arrangement of atoms in Cartesian space, also describes diastereomerism)
Absolute Configuration: -31 0 0 0 11 (Encoding absolute configuration for each fragment)
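Since the fields are separated by "|", a barcode string can be split into named parts for comparison. The following is only a sketch and not part of MolBar: the names ParsedMolBar, parse_molbar and looks_tautomeric are hypothetical, and it assumes the full six-field layout shown above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParsedMolBar:
    """Fields of a MolBar string (hypothetical container, not a MolBar class)."""
    version: str
    formula: str
    topology: List[int]
    heavy_atom_topology: List[int]
    topography: List[int]
    absolute_configuration: List[int]


def parse_molbar(barcode: str) -> ParsedMolBar:
    """Split 'MolBar | version | formula | spectrum | ...' into its fields."""
    parts = [part.strip() for part in barcode.split("|")]
    if len(parts) != 7 or parts[0] != "MolBar":
        raise ValueError("expected 'MolBar' followed by six '|'-separated fields")
    # The last four fields are whitespace-separated integer spectra.
    spectra = [[int(x) for x in part.split()] for part in parts[3:]]
    return ParsedMolBar(parts[1], parts[2], *spectra)


def looks_tautomeric(a: ParsedMolBar, b: ParsedMolBar) -> bool:
    """Different topology spectra but identical heavy atom topology spectra."""
    return (
        a.topology != b.topology
        and a.heavy_atom_topology == b.heavy_atom_topology
    )


alanine = parse_molbar(
    "MolBar | 1.0.0 | C3NO2H7 | "
    "-339 -140 -110 -32 13 20 20 20 160 237 432 528 850 | "
    "-209 -8 130 160 354 633 | "
    "-154 -117 -67 -40 9 9 9 60 74 156 342 457 922 | "
    "-31 0 0 0 11"
)
print(alanine.formula, len(alanine.topology))
```

For plain duplicate detection in a database, comparing the full barcode strings is enough; splitting is mainly useful when only one part, such as the topology, should be compared.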
Python Module Usage
MolBar can be generated by Python function calls:
- for a single molecule with get_molbar_from_coordinates by specifying the Cartesian coordinates as a list,
- for several molecules at once with get_molbars_from_coordinates by giving a list of lists with Cartesian coordinates,
- for a single molecule with get_molbar_from_file by specifying a file path,
- for several molecules at once with get_molbars_from_files by specifying a list of file paths.
1. get_molbar_from_coordinates
from molbar.barcode import get_molbar_from_coordinates
def get_molbar_from_coordinates(coordinates: list, elements: list, return_data=False, timing=False, input_constraint=None, mode="mb") -> Union[str, dict]
Args:
coordinates (list): Molecular geometry provided by atomic Cartesian coordinates with shape (n_atoms, 3).
elements (list): A list of elements in that molecule.
return_data (bool): Whether to return MolBar data.
timing (bool): Whether to print the duration of this calculation.
input_constraint (dict, optional): A dict of extra constraints for the calculation. See below for more information. USED ONLY IN EXCEPTIONAL CASES.
mode (str): Whether to calculate the molecular barcode ("mb") or only the topology part of the molecular barcode ("topo").
Returns:
Union[str, dict]: Either MolBar or the MolBar and MolBar data.
Example of input constraints as a Python dict. Input constraints should be used only in exceptional cases. However, it may be useful to add a dihedral constraint for bonds that are normally considered single bonds but whose rotation is hindered (e.g., 90° binol systems with bulky substituents).
{
    'constraints': {
        'dihedrals': [{'atoms': [1, 2, 3, 4], 'value': 90.0}, ...]  # atoms: list of atoms that define the dihedral, value: ideal dihedral angle in degrees, atom indexing starts with 1.
    }
}
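A minimal sketch of how the pieces could fit together when the geometry is available as XYZ text. The helpers read_xyz_text, dihedral_constraint and barcode_from_xyz_text are hypothetical and not part of MolBar; only get_molbar_from_coordinates and its arguments are taken from the signature above. The water geometry is purely illustrative.

```python
from typing import List, Optional, Tuple


def read_xyz_text(text: str) -> Tuple[List[str], List[List[float]]]:
    """Parse XYZ text (atom count, comment line, 'element x y z' rows)."""
    lines = text.strip("\n").splitlines()
    n_atoms = int(lines[0].split()[0])
    elements, coordinates = [], []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        elements.append(symbol)
        coordinates.append([float(x), float(y), float(z)])
    if len(elements) != n_atoms:
        raise ValueError("fewer atom lines than the declared atom count")
    return elements, coordinates


def dihedral_constraint(atoms: List[int], value: float) -> dict:
    """Build the constraint dict shown above for one dihedral (1-based atoms)."""
    if len(atoms) != 4 or min(atoms) < 1:
        raise ValueError("a dihedral needs four atom indices starting at 1")
    return {
        "constraints": {
            "dihedrals": [{"atoms": list(atoms), "value": float(value)}]
        }
    }


def barcode_from_xyz_text(text: str, input_constraint: Optional[dict] = None):
    """Compute the MolBar for XYZ text; needs the molbar package installed."""
    from molbar.barcode import get_molbar_from_coordinates  # lazy import

    elements, coordinates = read_xyz_text(text)
    return get_molbar_from_coordinates(
        coordinates, elements, input_constraint=input_constraint
    )


# Illustrative geometry only; explicit hydrogens and 3D coordinates are required.
WATER_XYZ = """3
water (illustrative geometry)
O   0.000   0.000   0.117
H   0.000   0.757  -0.469
H   0.000  -0.757  -0.469
"""
elements, coordinates = read_xyz_text(WATER_XYZ)
constraint = dihedral_constraint([1, 2, 3, 4], 90.0)
```

Calling barcode_from_xyz_text(WATER_XYZ) would then return the barcode string. The constraint dict is built here only to show its shape; it should be passed only in the exceptional cases described above.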
2. get_molbars_from_coordinates
NOTE: If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.
from molbar.barcode import get_molbars_from_coordinates
def get_molbars_from_coordinates(list_of_coordinates: list, list_of_elements: list, return_data=False, threads=1, timing=False, input_constraints=None, progress=False, mode="mb") -> Union[list, Union[str, dict]]:
Args:
list_of_coordinates (list): A list of molecular geometries provided by atomic Cartesian coordinates with shape (n_molecules, n_atoms, 3).
list_of_elements (list): A list of element lists for each molecule in the list_of_coordinates with shape (n_molecules, n_atoms).
return_data (bool): Whether to return MolBar data.
threads (int): Number of threads to use for the calculation. If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.
timing (bool): Whether to print the duration of this calculation.
input_constraints (list, optional): A list of constraints for the calculation. Each constraint in that list is a Python dict as shown above for get_molbar_from_coordinates.
progress (bool): Whether to show a progress bar.
mode (str): Whether to calculate the molecular barcode ("mb") or the topology part of the molecular barcode ("topo").
Returns:
Union[list, Union[str, dict]]: Either MolBar or the MolBar and MolBar data.
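As MolBar is meant to ensure data uniqueness, a typical use of the batch function is to group structures that share a barcode. The sketch below assumes that, with return_data=False, the function returns one barcode string per molecule in input order; group_by_barcode and find_duplicates are hypothetical helper names, and the demo uses placeholder strings instead of real barcodes.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_barcode(
    names: Sequence[str], barcodes: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each barcode to the names of all structures that produced it."""
    if len(names) != len(barcodes):
        raise ValueError("names and barcodes must have the same length")
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, barcode in zip(names, barcodes):
        groups[barcode].append(name)
    return dict(groups)


def find_duplicates(names, list_of_coordinates, list_of_elements, threads=1):
    """Return only the barcode groups with more than one member.

    Needs the molbar package; assumes the result order matches the input order.
    """
    from molbar.barcode import get_molbars_from_coordinates  # lazy import

    barcodes = get_molbars_from_coordinates(
        list_of_coordinates, list_of_elements, threads=threads
    )
    groups = group_by_barcode(names, barcodes)
    return {bc: members for bc, members in groups.items() if len(members) > 1}


# Placeholder strings stand in for real barcodes in this demo.
demo = group_by_barcode(
    ["a.xyz", "b.xyz", "c.xyz"],
    ["MolBar | X", "MolBar | Y", "MolBar | X"],
)
print(demo)
```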
3. get_molbar_from_file
from molbar.barcode import get_molbar_from_file
def get_molbar_from_file(file: str, return_data=False, timing=False, input_constraint=None, mode="mb", write_trj=False) -> Union[str, dict]:
Args:
file (str): The path to the file containing the molecule information (either .xyz/.sdf/.mol format).
return_data (bool): Whether to return MolBar data.
timing (bool): Whether to print the duration of this calculation.
input_constraint (dict, optional): A dict of extra constraints for the calculation. See below for more information. USED ONLY IN EXCEPTIONAL CASES.
mode (str): Whether to calculate the molecular barcode ("mb") or only the topology part of the molecular barcode ("topo").
write_trj (bool, optional): Whether to write a trajectory of the unification process. Defaults to False.
Returns:
Union[str, dict]: Either MolBar or the MolBar and MolBar data.
Example of an input file in .yml format. Input constraints should be used only in exceptional cases. However, it may be useful to add a dihedral constraint for bonds that are normally considered single bonds but whose rotation is hindered (e.g., 90° binol systems with bulky substituents).
constraints:
  dihedrals:
    - atoms: [30, 18, 14, 13] # List of atoms involved in the dihedral
      value: 90.0 # Ideal dihedral angle in degrees
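If such constraint files are generated by a script, the text can be written without any YAML library. This is a sketch only: dihedral_constraints_yaml and write_constraint_file are hypothetical helpers, and the nesting (dihedrals under constraints, atoms and value per list item) is an assumption based on the example above.

```python
from pathlib import Path
from typing import Iterable, Sequence, Tuple


def dihedral_constraints_yaml(
    dihedrals: Iterable[Tuple[Sequence[int], float]]
) -> str:
    """Render (atoms, value) pairs as a constraints block in YAML syntax."""
    lines = ["constraints:", "  dihedrals:"]
    for atoms, value in dihedrals:
        if len(atoms) != 4:
            raise ValueError("a dihedral is defined by exactly four atoms")
        atom_list = ", ".join(str(int(a)) for a in atoms)
        lines.append(f"    - atoms: [{atom_list}]")
        lines.append(f"      value: {float(value)}")
    return "\n".join(lines) + "\n"


def write_constraint_file(path, dihedrals) -> Path:
    """Write the rendered constraints to path and return it as a Path."""
    path = Path(path)
    path.write_text(dihedral_constraints_yaml(dihedrals))
    return path


yaml_text = dihedral_constraints_yaml([([30, 18, 14, 13], 90.0)])
print(yaml_text)
```

The resulting file can then be given to MolBar as the constraint input for the corresponding molecule.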
4. get_molbars_from_files
NOTE: If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.
from molbar.barcode import get_molbars_from_files
def get_molbars_from_files(files: list, return_data=False, threads=1, timing=False, input_constraints=None, progress=False, mode="mb", write_trj=False) ->Union[list, Union[str, dict]]:
Args:
files (list): The list of paths to the files containing the molecule information (either .xyz/.sdf/.mol format).
return_data (bool): Whether to return MolBar data.
threads (int): Number of threads to use for the calculation. If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.
timing (bool): Whether to print the duration of this calculation.
input_constraints (list, optional): A list of file paths to the input files for the calculation. Each constraint is specified by a file path to a .yml file, as shown above for get_molbar_from_file.
progress (bool): Whether to show a progress bar.
mode (str): Whether to calculate the molecular barcode ("mb") or the topology part of the molecular barcode ("topo").
write_trj (bool, optional): Whether to write a trajectory of the unification process. Defaults to False.
Returns:
Union[list, Union[str, dict]]: Either MolBar or the MolBar and MolBar data.
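A short sketch of processing every structure file in a directory. collect_structure_files and barcodes_for_directory are hypothetical helpers; it is assumed that get_molbars_from_files returns the barcodes in the same order as the input files. The small demo only exercises the file collection, using empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import List

SUPPORTED_SUFFIXES = {".xyz", ".sdf", ".mol"}  # formats named in the Args above


def collect_structure_files(directory) -> List[str]:
    """Return sorted paths of all .xyz/.sdf/.mol files directly in directory."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


def barcodes_for_directory(directory, threads=1, progress=False) -> dict:
    """Pair each structure file with its MolBar; needs the molbar package."""
    from molbar.barcode import get_molbars_from_files  # lazy import

    files = collect_structure_files(directory)
    barcodes = get_molbars_from_files(files, threads=threads, progress=progress)
    return dict(zip(files, barcodes))


# Demo of the file collection with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.xyz", "a.sdf", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [Path(f).name for f in collect_structure_files(tmp)]
print(found)
```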
Commandline Usage
MolBar can also be used as a commandline tool. Simply type:
molbar coord.xyz
and the MolBar is printed to stdout.
NOTE: If you need to process several molecules at once, it is recommended to pass all molecules to the code at once (e.g. with *.xyz) while specifying the number of threads the code should use:
molbar *.xyz -T N_threads -s
The latter option (-s) is used to store the barcode to .mb files.
Further, the commandline tool provides several options:
usage: molbar [-h] [-r] [-i INP [INP ...]] [-d] [-T THREADS] [-s] [-t] [-p] [-m {mb,topo,opt}] files [files ...]
positional arguments:
  files                 file(s)

options:
  -m {mb,topo,opt}, --mode {mb,topo,opt}
                        The mode to use for the calculations (either "mb" (default, calculates MolBar), "topo" (topology part only)
                        or "opt" (using stand-alone force field idealization, writes ".opt" with final structure))
  -i INP [INP ...], --inp INP [INP ...]
                        Path to input file in .yml format to add further constraints. Example input can be found below.
  -d, --data            Whether to print MolBar data.
                        Writes a "filename/" directory containing a json file with
                        important information that defines MolBar. Writes idealization trajectories of each fragment to same directory.
  -T THREADS, --threads THREADS
                        The number of threads to use for parallel processing of several files. MolBar generation for a single file is not parallelized. Should be used together with -s/--save (e.g. molbar *.xyz -T 8 -s)
  -s, --save            Whether to save the result to a file of type "filename.mb"
  -t, --time            Print out timings.
  -p, --progress        Use a progress bar when several files are handled.
Example of input file constraints in .yml format. Input constraints should be used only in exceptional cases. However, it may be useful to add a dihedral constraint for bonds that are normally considered single bonds but whose rotation is hindered (e.g., 90° binol systems with bulky substituents).
constraints:
  dihedrals:
    - atoms: [30, 18, 14, 13] # List of atoms involved in the dihedral
      value: 90.0 # Ideal dihedral angle in degrees
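When the commandline tool is driven from a script, the argument list can be assembled from the options listed above. This is a sketch: build_molbar_command and run_molbar are hypothetical helpers, only flags documented in the usage text are used, and run_molbar assumes the molbar executable is on the PATH. Note that a list passed to subprocess is not expanded by the shell, so patterns such as *.xyz have to be resolved beforehand (e.g. with glob).

```python
import subprocess
from typing import List, Optional, Sequence

VALID_MODES = ("mb", "topo", "opt")  # choices listed for -m/--mode above


def build_molbar_command(
    files: Sequence[str],
    threads: Optional[int] = None,
    save: bool = False,
    data: bool = False,
    mode: Optional[str] = None,
    inp: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble the argument list for the molbar commandline tool."""
    if not files:
        raise ValueError("at least one input file is required")
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}")
    command = ["molbar", *files]  # files first, as in 'molbar *.xyz -T N -s'
    if threads is not None:
        command += ["-T", str(threads)]
    if save:
        command.append("-s")
    if data:
        command.append("-d")
    if mode is not None:
        command += ["-m", mode]
    if inp:
        command += ["-i", *inp]
    return command


def run_molbar(files: Sequence[str], **options) -> str:
    """Run molbar and return its stdout (requires molbar to be installed)."""
    result = subprocess.run(
        build_molbar_command(files, **options),
        check=True, capture_output=True, text=True,
    )
    return result.stdout


command = build_molbar_command(["a.xyz", "b.xyz"], threads=8, save=True)
print(" ".join(command))
```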
Using the unification force field for the whole molecule
The force field can be used to idealize the structure of a whole molecule, with the coordinates given either by a file or directly in Python:
- as a commandline tool with the molbar coord.xyz -m opt option,
- in Python with idealize_structure_from_file by providing a file path,
- in Python with idealize_structure_from_coordinates by providing Cartesian coordinates as a list.
Commandline tool
molbar coord.xyz -m opt
This writes a coord.opt file that contains the idealized coordinates.
In Python from a file:
from molbar.barcode import idealize_structure_from_file
def idealize_structure_from_file(file: str, return_data=False, timing=False, input_constraint=None, write_trj=False) -> Union[list, str]
Args:
file (str): The path to the input file to be processed.
return_data (bool): Whether to return MolBar data.
timing (bool): Whether to print the duration of this calculation.
input_constraint (str): The path to the input file containing the constraints for the calculation. See below for more information.
write_trj (bool, optional): Whether to write a trajectory of the unification process. Defaults to False.
Returns:
n_atoms (int): Number of atoms in the molecule.
energy (float): Final energy of the molecule after idealization.
coordinates (list): Final coordinates of the molecule after idealization.
elements (list): Elements of the molecule.
data (dict): MolBar data.
This is an example input as a yml file:
bond_order_assignment: False # False if bond order assignment should be skipped; only reasonable in opt mode (standalone force-field optimization).
cycle_detection: True # False if cycle detection should be skipped; only reasonable in opt mode (standalone force-field optimization).
repulsion_charge: 100.0 # Charge used for the Coulomb term in the force field; every atom-atom pair uses the same charge; only reasonable in opt mode (standalone force-field optimization). Defaults to 100.0.
set_edges: True # False if no bonds should be constrained automatically.
set_angles: True # False if no angles should be constrained automatically.
set_dihedrals: True # False if no dihedrals should be constrained automatically.
set_repulsion: True # False if no Coulomb term should be used automatically.
constraints:
  bonds:
    - atoms: [19, 23] # List of atoms involved in the bond
      value: 1.5 # Ideal bond length.
  angles:
    - atoms: [19, 23, 35] # List of atoms involved in the angle
      value: 45.0 # Angle to which the angle between the three atoms is to be constrained
    - atoms: [35, 23, 19] # List of atoms involved in the angle
      value: 45.0 # Angle to which the angle between the three atoms is to be constrained
  dihedrals:
    - atoms: [30, 18, 14, 13] # List of atoms involved in the dihedral
      value: 90.0 # Ideal dihedral angle in degrees
In Python from a list of Cartesian coordinates:
from molbar.barcode import idealize_structure_from_coordinates
def idealize_structure_from_coordinates(coordinates: list, elements: list, return_data=False, timing=False, input_constraint=None) -> Union[list, str]:
Args:
coordinates (list): Cartesian coordinates of the molecule.
elements (list): Elements of the molecule.
return_data (bool, optional): Whether to return MolBar data.
timing (bool, optional): Whether to print the duration of this calculation.
input_constraint (dict, optional): The constraint for the calculation. See documentation for more information.
Returns:
n_atoms (int): Number of atoms in the molecule.
energy (float): Final energy of the molecule after idealization.
coordinates (list): Final coordinates of the molecule after idealization.
elements (list): Elements of the molecule.
data (dict): MolBar data.
This is an example input as a Python dict:
{'bond_order_assignment': True,  # False if bond order assignment should be skipped; only reasonable in opt mode (standalone force-field optimization).
 'cycle_detection': True,  # False if cycle detection should be skipped; only reasonable in opt mode (standalone force-field optimization).
 'set_edges': True,  # False if no bonds should be constrained automatically.
 'set_angles': True,  # False if no angles should be constrained automatically.
 'set_dihedrals': True,  # False if no dihedrals should be constrained automatically.
 'set_repulsion': True,  # False if no Coulomb term should be used automatically.
 'repulsion_charge': 100.0,  # Charge used for the Coulomb term in the force field; every atom-atom pair uses the same charge; only reasonable in opt mode (standalone force-field optimization). Defaults to 100.0.
 'constraints': {'bonds': [{'atoms': [1, 2], 'value': 1.5}, ...],  # atoms: list of atoms that define the bond, value: ideal bond length in angstrom, atom indexing starts with 1.
                 'angles': [{'atoms': [1, 2, 3], 'value': 90.0}, ...],  # atoms: list of atoms that define the angle, value: ideal angle in degrees, atom indexing starts with 1.
                 'dihedrals': [{'atoms': [1, 2, 3, 4], 'value': 180.0}, ...]}  # atoms: list of atoms that define the dihedral, value: ideal dihedral angle in degrees, atom indexing starts with 1.
}
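Such a dict can also be assembled programmatically, with a basic check of the constraint entries. This is a sketch under assumptions: make_idealization_input and idealize are hypothetical helpers, and the switch values in DEFAULT_SWITCHES simply mirror the example dict above rather than confirmed library defaults (only the repulsion charge default of 100.0 is stated in the text).

```python
from typing import Dict, List, Optional

# Values copied from the example dict above (assumed, not confirmed defaults).
DEFAULT_SWITCHES = {
    "bond_order_assignment": True,
    "cycle_detection": True,
    "set_edges": True,
    "set_angles": True,
    "set_dihedrals": True,
    "set_repulsion": True,
    "repulsion_charge": 100.0,
}

# Number of atoms that define each kind of constraint.
EXPECTED_ATOMS = {"bonds": 2, "angles": 3, "dihedrals": 4}


def make_idealization_input(
    constraints: Optional[Dict[str, List[dict]]] = None, **overrides
) -> dict:
    """Merge switch overrides with the defaults and validate the constraints."""
    unknown = set(overrides) - set(DEFAULT_SWITCHES)
    if unknown:
        raise KeyError(f"unknown options: {sorted(unknown)}")
    options = {**DEFAULT_SWITCHES, **overrides}
    checked = {}
    for kind, entries in (constraints or {}).items():
        n_atoms = EXPECTED_ATOMS[kind]  # KeyError for unknown constraint kinds
        for entry in entries:
            atoms = entry["atoms"]
            if len(atoms) != n_atoms or min(atoms) < 1:
                raise ValueError(f"{kind} need {n_atoms} atom indices >= 1")
        checked[kind] = [
            {"atoms": list(e["atoms"]), "value": float(e["value"])}
            for e in entries
        ]
    if checked:
        options["constraints"] = checked
    return options


def idealize(coordinates, elements, options):
    """Run the force-field idealization; needs the molbar package installed."""
    from molbar.barcode import idealize_structure_from_coordinates  # lazy

    return idealize_structure_from_coordinates(
        coordinates, elements, input_constraint=options
    )


options = make_idealization_input(
    {"bonds": [{"atoms": [1, 2], "value": 1.5}]}, set_repulsion=False
)
print(options)
```

The result of idealize is returned unchanged, since the exact container of the documented return values (n_atoms, energy, coordinates, elements, data) is not specified here.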
Acknowledgements
MolBar relies on the following libraries and packages:
Thank you!
License and Disclaimer
MIT License
Copyright (c) 2022 Nils van Staalduinen, Christoph Bannwarth
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.