
Molecular Barcode (MolBar): Molecular Identifier for Organic and Inorganic Molecules


MolBar


This package provides an implementation of the Molecular Barcode (MolBar) as a quantum chemistry-inspired molecular identifier to ensure data uniqueness in databases, supporting organic and inorganic molecules while attempting to describe relative and absolute configuration including centric, axial/helical and planar chirality.

License: MIT PyPI Downloads ChemRxiv Paper

It does this by fragmenting a molecule into rigid parts, which are then idealized with a specialized non-physical force field. The molecule is then described by different matrices encoding topology (connectivity), topography (3D positions of atoms after input unification), and absolute configuration (by calculating a chirality index). The final barcode is the concatenated spectra of these matrices.

Current Limitations

So far, the input file must contain 3D coordinates and explicit hydrogens.

Further, it should work well for organic and inorganic molecules with typical 2c2e bonding. It can describe molecules based on their relative and absolute configuration, including centric, axial/helical and planar chirality.

Because the usual starting point is a set of 3D Cartesian coordinates, problems can currently occur when it is not easy to determine which atoms are bonded, especially for metal complexes with η-bonds. Problems can also occur if the geometry around a metal in a complex cannot be classified by one of the standard VSEPR models. If you are not sure, use the -d option when running MolBar as a commandline tool, or write_trj=True when using MolBar as a Python module, to look at the optimized trajectories of each fragment. If something is unclear or something unusual happens, I would appreciate it if you reported it by posting an issue or by e-mail (van.staalduinen@pc.rwth-aachen.de).

For rigidity analysis, MolBar only considers double/triple bonds and rings to be rigid. For example, hindered rotation caused by bulky substituents is not taken into account, but it can be added manually through the input file as an additional dihedral constraint. This should remain an exception and be used carefully.

Getting started (tested on Linux and macOS, compiling works for Windows only in WSL)

For Linux/macOS

Using a virtual environment is highly recommended because it allows you to create isolated environments with their own dependencies, without interfering with other Python projects or the system Python installation. This ensures that your Python environment remains consistent and reproducible across different machines and over time. To create one, type in the following command in your terminal:

 python3 -m venv path/to/venv

To activate the environment, type in:

 source path/to/venv/bin/activate

To install MolBar, enter the following command in your terminal:

pip install molbar

For Windows

Since compiling in a standard Windows environment does not work yet, it is highly recommended to use the WSL (Windows Subsystem for Linux) extension. Simply follow this installation guide: https://learn.microsoft.com/en-us/windows/wsl/install. Note that a Fortran compiler needs to be installed manually in the WSL environment. Otherwise, the installation of MolBar will result in an error.

For Python usage, it is highly recommended to use Visual Studio Code (VSC) as it provides specific extensions to code directly in WSL. A more detailed guide can be found here: https://code.visualstudio.com/docs/remote/wsl

MolBar Structure

For L-alanine, the MolBar reads:

MolBar | 1.0.0 | C3NO2H7 | -339 -140 -110 -32 13 20 20 20 160 237 432 528 850 | -209 -8 130 160 354 633 | -154 -117 -67 -40 9 9 9 60 74 156 342 457 922 | -31 0 0 0 11

MolBar is constructed as follows:

Version: 1.0.0
Molecular Formula: C3H7NO2 
Topology Spectrum: -339 -140 -110 -32 13 20 20 20 160 237 432 528 850 (Encoding atomic connectivity)
Heavy Atom Topology Spectrum: -209 -8 130 160 354 633 (Encoding atomic connectivity without hydrogen atoms. If the topology spectra of two molecules differ but their heavy atom topology spectra are the same, the two molecules are tautomers)
Topography Spectrum: -154 -117 -67 -40 9 9 9 60 74 156 342 457 922 (3D arrangement of atoms in Cartesian space, also describes diastereomerism)
Absolute Configuration: -31 0 0 0 11 (Encoding absolute configuration for each fragment)
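
The field layout above can be sketched as a small parser. This is only an illustration, not part of the MolBar package: the names ParsedMolBar, parse_molbar and relation are hypothetical, and the code assumes the full "mb" layout with seven '|'-separated fields and integer spectra, as in the L-alanine example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParsedMolBar:
    """Hypothetical container for the fields of one MolBar string."""
    version: str
    formula: str
    topology: List[int]
    heavy_atom_topology: List[int]
    topography: List[int]
    absolute_configuration: List[int]


def parse_molbar(barcode: str) -> ParsedMolBar:
    """Split a full MolBar string into its seven '|'-separated fields."""
    fields = [field.strip() for field in barcode.split("|")]
    if len(fields) != 7 or fields[0] != "MolBar":
        raise ValueError("expected 'MolBar | version | formula | 4 spectra'")
    # Assumption: every spectrum is a whitespace-separated list of integers.
    spectra = [[int(x) for x in field.split()] for field in fields[3:]]
    return ParsedMolBar(fields[1], fields[2], *spectra)


def relation(a: ParsedMolBar, b: ParsedMolBar) -> str:
    """Name the first field in which two barcodes differ."""
    if a.formula != b.formula:
        return "different formula"
    if a.topology != b.topology:
        # Same heavy-atom connectivity but different full connectivity.
        if a.heavy_atom_topology == b.heavy_atom_topology:
            return "tautomers"
        return "different topology"
    if a.topography != b.topography:
        return "different topography"
    if a.absolute_configuration != b.absolute_configuration:
        return "different absolute configuration"
    return "identical"


alanine = ("MolBar | 1.0.0 | C3NO2H7 | "
           "-339 -140 -110 -32 13 20 20 20 160 237 432 528 850 | "
           "-209 -8 130 160 354 633 | "
           "-154 -117 -67 -40 9 9 9 60 74 156 342 457 922 | -31 0 0 0 11")
parsed = parse_molbar(alanine)
print(parsed.formula, len(parsed.topology))
```

For database deduplication, comparing the raw barcode strings is usually enough; splitting the fields is mainly useful to see in which part two molecules differ.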

Python Module Usage

MolBar can be generated by Python function calls:

  1. for a single molecule with get_molbar_from_coordinates by specifying the Cartesian coordinates as a list,
  2. for several molecules at once with get_molbars_from_coordinates by giving a list of lists with Cartesian coordinates,
  3. for a single molecule with get_molbar_from_file by specifying a file path,
  4. for several molecules at once with get_molbars_from_files by specifying a list of file paths.

1. get_molbar_from_coordinates

  from molbar.barcode import get_molbar_from_coordinates

  def get_molbar_from_coordinates(coordinates: list, elements: list, return_data=False, timing=False, input_constraint=None, mode="mb") -> Union[str, dict]

      Args:

          coordinates (list): Molecular geometry provided by atomic Cartesian coordinates with shape (n_atoms, 3).
          elements (list): A list of elements in that molecule.
          return_data (bool): Whether to return MolBar data.
          timing (bool): Whether to print the duration of this calculation.
          input_constraint (dict, optional): A dict of extra constraints for the calculation. See below for more information. USED ONLY IN EXCEPTIONAL CASES.
          mode (str): Whether to calculate the molecular barcode ("mb") or only the topology part of the molecular barcode ("topo").

      Returns:

          Union[str, dict]: Either MolBar or the MolBar and MolBar data.
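
A minimal sketch of how the two list arguments might be prepared from XYZ text. The helpers read_xyz and barcode_from_xyz_text are hypothetical names introduced here, and passing element symbols (rather than atomic numbers) in elements is an assumption. Only the call to get_molbar_from_coordinates uses the signature documented above.

```python
from typing import List, Tuple


def read_xyz(text: str) -> Tuple[List[str], List[List[float]]]:
    """Parse XYZ text: atom count, comment line, then 'El x y z' rows."""
    lines = text.lstrip().splitlines()
    n_atoms = int(lines[0])
    elements: List[str] = []
    coordinates: List[List[float]] = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        elements.append(symbol)
        coordinates.append([float(x), float(y), float(z)])
    if len(elements) != n_atoms:
        raise ValueError("fewer atom rows than announced in the first line")
    return elements, coordinates


def barcode_from_xyz_text(text: str) -> str:
    """Compute the MolBar for one molecule given as XYZ text."""
    # Imported lazily so read_xyz stays usable without MolBar installed.
    from molbar.barcode import get_molbar_from_coordinates

    elements, coordinates = read_xyz(text)
    # coordinates has shape (n_atoms, 3); hydrogens must be explicit.
    return get_molbar_from_coordinates(coordinates, elements)


water = """3
water
O 0.000  0.000  0.117
H 0.000  0.757 -0.471
H 0.000 -0.757 -0.471
"""
elements, coordinates = read_xyz(water)
print(elements, len(coordinates))
```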

Example of input constraints as a Python dict. Input constraints should be used only in exceptional cases. However, it may be useful to add a dihedral constraint for bonds that are normally considered single bonds but whose rotation is hindered (e.g., 90° binol systems with bulky substituents).

{
'constraints': {
                'dihedrals': [{'atoms': [1,2,3,4], 'value':90.0},...]} #atoms: list of atoms that define the dihedral, value is the ideal dihedral angle in degrees, atom indexing starts with 1.
}
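
One way to assemble such a dict without typos is a small builder. This is a sketch only: dihedral_constraint and build_input_constraint are hypothetical helpers, not MolBar functions. The key names and the 1-based indexing follow the example above.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def dihedral_constraint(atoms: Sequence[int], value: float) -> Dict[str, object]:
    """Return one entry for the 'dihedrals' list; atoms are 1-based indices."""
    if len(atoms) != 4 or len(set(atoms)) != 4:
        raise ValueError("a dihedral needs four distinct atoms")
    if min(atoms) < 1:
        raise ValueError("atom indexing starts with 1")
    return {"atoms": list(atoms), "value": float(value)}


def build_input_constraint(
    dihedrals: Iterable[Tuple[Sequence[int], float]]
) -> Dict[str, Dict[str, List[Dict[str, object]]]]:
    """Wrap (atoms, angle in degrees) pairs in the nested dict layout."""
    entries = [dihedral_constraint(atoms, value) for atoms, value in dihedrals]
    return {"constraints": {"dihedrals": entries}}


constraint = build_input_constraint([((1, 2, 3, 4), 90.0)])
print(constraint)
```

The resulting dict is what would be passed as input_constraint to get_molbar_from_coordinates.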

2. get_molbars_from_coordinates

NOTE: If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.

from molbar.barcode import get_molbars_from_coordinates

def get_molbars_from_coordinates(list_of_coordinates: list, list_of_elements: list, return_data=False, threads=1, timing=False, input_constraints=None, progress=False,  mode="mb") -> Union[list, Union[str, dict]]:

    Args:

        list_of_coordinates (list): A list of molecular geometries provided by atomic Cartesian coordinates with shape (n_molecules, n_atoms, 3).
        list_of_elements (list): A list of element lists for each molecule in the list_of_coordinates with shape (n_molecules, n_atoms).
        return_data (bool): Whether to return MolBar data.
        threads (int): Number of threads to use for the calculation. If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.
        timing (bool):  Whether to print the duration of this calculation.
        input_constraints (list, optional): A list of constraints for the calculation. Each constraint in that list is a Python dict as shown above for get_molbar_from_coordinates.
        progress (bool): Whether to show a progress bar.
        mode (str): Whether to calculate the molecular barcode ("mb") or the topology part of the molecular barcode ("topo").

    Returns:

        Union[list, Union[str, dict]]: Either MolBar or the MolBar and MolBar data.
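
A sketch of preparing the two parallel lists for a batch call. The helpers split_molecules and barcodes_for are hypothetical. Only the call to get_molbars_from_coordinates and its threads argument come from the signature above.

```python
from typing import List, Sequence, Tuple

Molecule = Tuple[Sequence[str], Sequence[Sequence[float]]]


def split_molecules(molecules: Sequence[Molecule]):
    """Turn [(elements, coordinates), ...] into two parallel lists."""
    list_of_coordinates: List[List[List[float]]] = []
    list_of_elements: List[List[str]] = []
    for elements, coordinates in molecules:
        if len(elements) != len(coordinates):
            raise ValueError("one element is needed per coordinate row")
        list_of_elements.append(list(elements))
        list_of_coordinates.append([[float(v) for v in row] for row in coordinates])
    return list_of_coordinates, list_of_elements


def barcodes_for(molecules: Sequence[Molecule], threads: int = 1):
    """Compute barcodes for several molecules in one call."""
    # Imported lazily so split_molecules works without MolBar installed.
    from molbar.barcode import get_molbars_from_coordinates

    list_of_coordinates, list_of_elements = split_molecules(molecules)
    return get_molbars_from_coordinates(
        list_of_coordinates, list_of_elements, threads=threads
    )


h2 = (["H", "H"], [[0, 0, 0], [0, 0, 0.74]])
coords, elems = split_molecules([h2, h2])
print(len(coords), elems)
```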

3. get_molbar_from_file

from molbar.barcode import get_molbar_from_file

def get_molbar_from_file(file: str, return_data=False, timing=False, input_constraint=None, mode="mb", write_trj=False) -> Union[str, dict]:

    Args:
        file (str): The path to the file containing the molecule information (either .xyz/.sdf/.mol format).
        return_data (bool): Whether to return MolBar data.
        timing (bool): Whether to print the duration of this calculation.
        input_constraint (dict, optional): A dict of extra constraints for the calculation. See below for more information. USED ONLY IN EXCEPTIONAL CASES.
        mode (str): Whether to calculate the molecular barcode ("mb") or only the topology part of the molecular barcode ("topo").
        write_trj (bool, optional): Whether to write a trajectory of the unification process. Defaults to False.
    
    Returns:

        Union[str, dict]: Either MolBar or the MolBar and MolBar data.

Example of an input file in .yml format. Input constraints should be used only in exceptional cases. However, it may be useful to add a dihedral constraint for bonds that are normally considered single bonds but whose rotation is hindered (e.g., 90° binol systems with bulky substituents).

constraints:
  dihedrals:
    - atoms: [30, 18, 14, 13]  # List of atoms involved in the dihedral
      value:  90.0  # Ideal dihedral angle in degrees
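
If such files are generated from a script, the layout above can be produced with plain string formatting. This is a sketch: dihedral_yaml is a hypothetical helper, and it deliberately avoids a YAML library so that it has no extra dependency.

```python
from typing import Iterable, Sequence, Tuple


def dihedral_yaml(dihedrals: Iterable[Tuple[Sequence[int], float]]) -> str:
    """Render (atoms, angle in degrees) pairs in the .yml layout shown above."""
    lines = ["constraints:", "  dihedrals:"]
    for atoms, value in dihedrals:
        if len(atoms) != 4:
            raise ValueError("a dihedral needs exactly four atoms")
        atom_list = ", ".join(str(int(a)) for a in atoms)
        lines.append(f"    - atoms: [{atom_list}]")
        lines.append(f"      value: {float(value)}")
    return "\n".join(lines) + "\n"


text = dihedral_yaml([([30, 18, 14, 13], 90.0)])
print(text)
```

The text can then be written to a file, for example with pathlib.Path("constraint.yml").write_text(text), where constraint.yml is a placeholder file name.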

4. get_molbars_from_files

NOTE: If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.

from molbar.barcode import get_molbars_from_files

def get_molbars_from_files(files: list, return_data=False, threads=1, timing=False, input_constraints=None, progress=False, mode="mb", write_trj=False) ->Union[list, Union[str, dict]]:

    Args:

        files (list): The list of paths to the files containing the molecule information (either .xyz/.sdf/.mol format).
        return_data (bool): Whether to return MolBar data.
        threads (int): Number of threads to use for the calculation. If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.
        timing (bool):  Whether to print the duration of this calculation.
        input_constraints (list, optional): A list of file paths to the input files for the calculation. Each constraint is specified by a file path to a .yml file, as shown above for get_molbar_from_file.
        progress (bool): Whether to show a progress bar.
        mode (str): Whether to calculate the molecular barcode ("mb") or the topology part of the molecular barcode ("topo").
        write_trj (bool, optional): Whether to write a trajectory of the unification process. Defaults to False.

    Returns:

        Union[list, Union[str, dict]]: Either MolBar or the MolBar and MolBar data.

Commandline Usage

MolBar can also be used as a commandline tool. Simply type:

molbar coord.xyz

and the MolBar is printed to the stdout.

NOTE: If you need to process several molecules at once, it is recommended to pass all molecules to the code at once (e.g. with *.xyz) while specifying the number of threads the code should use:

molbar *.xyz -T N_threads -s

The latter option (-s) is used to store the barcode to .mb files.
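
Once the .mb files exist, duplicates in a data set can be found by grouping files with identical barcodes. The sketch below assumes that each .mb file contains just the barcode string as text, which is not stated above. The function names group_by_barcode and duplicates are hypothetical.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_by_barcode(directory: str) -> Dict[str, List[str]]:
    """Map each barcode string to the names of the .mb files that contain it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(Path(directory).glob("*.mb")):
        # Assumption: the file content is the barcode, possibly with a newline.
        groups[path.read_text().strip()].append(path.name)
    return dict(groups)


def duplicates(groups: Dict[str, List[str]]) -> List[List[str]]:
    """Return only the groups with more than one file."""
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    for names in duplicates(group_by_barcode(".")):
        print("same barcode:", ", ".join(names))
```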

Further, the commandline tool provides several options:

usage: molbar [-h] [-r] [-i INP [INP ...]] [-d] [-T THREADS] [-s] [-t] [-p] [-m {mb,topo,opt}] files [files ...]

positional arguments:
  files                 file(s)

options:
  -m {mb,topo,opt}, --mode {mb,topo,opt}
                      The mode to use for the calculations (either "mb" (default, calculates MolBar), "topo" (topology part only)
                      or "opt" (using stand-alone force field idealization, writes ".opt" with final structure))

  -i INP [INP ...], --inp INP [INP ...]
                        Path to input file in .yml format to add further constraints. Example input can be found below.

  -d, --data           Whether to print MolBar data. 
                        Writes a "filename/" directory containing a json file with
                        important information that defines MolBar. Writes idealization trajectories of each fragment to same directory.

  -T THREADS, --threads THREADS
                        The number of threads to use for parallel processing of several files. MolBar generation for a single file is not parallelized. Should be used together with -s/--save (e.g. molbar *.xyz -T 8 -s)

  -s, --save            Whether to save the result to a file of type "filename.mb"
  -t, --time            Print out timings.

  -p, --progress        Use a progress bar when several files are handled.
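
When the commandline tool is driven from a script, the options above can be assembled into an argument list. This is a sketch with hypothetical helper names (molbar_command, run_molbar). It uses only the flags listed in the usage text and places the files before the options, as in the examples above, so that -i does not consume the file names.

```python
import subprocess
from typing import List, Optional, Sequence


def molbar_command(
    files: Sequence[str],
    threads: Optional[int] = None,
    save: bool = False,
    progress: bool = False,
    timing: bool = False,
    data: bool = False,
    mode: Optional[str] = None,
    inputs: Sequence[str] = (),
) -> List[str]:
    """Build the argument list for one molbar call."""
    if not files:
        raise ValueError("at least one file is required")
    if mode not in (None, "mb", "topo", "opt"):
        raise ValueError("mode must be 'mb', 'topo' or 'opt'")
    command = ["molbar", *files]
    if threads is not None:
        command += ["-T", str(threads)]
    for enabled, flag in ((save, "-s"), (progress, "-p"), (timing, "-t"), (data, "-d")):
        if enabled:
            command.append(flag)
    if mode is not None:
        command += ["-m", mode]
    if inputs:
        command += ["-i", *inputs]
    return command


def run_molbar(command: List[str]) -> str:
    """Run the command and return what it printed to stdout."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


print(molbar_command(["a.xyz", "b.xyz"], threads=8, save=True))
```

Because subprocess.run is called without a shell, patterns such as *.xyz are not expanded. Expand them beforehand, for example with glob.glob("*.xyz").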

Example of input file constraints in .yml format. Input constraints should be used only in exceptional cases. However, it may be useful to add a dihedral constraint for bonds that are normally considered single bonds but whose rotation is hindered (e.g., 90° binol systems with bulky substituents).

constraints:
  dihedrals:
    - atoms: [30, 18, 14, 13]  # List of atoms involved in the dihedral
      value:  90.0  # Ideal dihedral angle in degrees

Using the unification force field for the whole molecule

The force field can also be used to idealize the structure of a whole molecule. The coordinates are given either by a file or, in Python, as a list:

  1. as a commandline tool with the molbar coord.xyz -m opt option
  2. in Python with idealize_structure_from_file by providing a file path
  3. in Python with idealize_structure_from_coordinates by providing Cartesian coordinates as a list

Commandline tool

molbar coord.xyz -m opt

This writes a coord.opt file that contains the idealized coordinates.

In Python from a file:

  from molbar.barcode import idealize_structure_from_file

  def idealize_structure_from_file(file: str, return_data=False, timing=False, input_constraint=None,  write_trj=False) -> Union[list, str]

      Args:

          file (str): The path to the input file to be processed.
          return_data (bool): Whether to return MolBar data.
          timing (bool): Whether to print the duration of this calculation.
          input_constraint (str): The path to the input file containing the constraint for the calculation. See below for more information.
          write_trj (bool, optional): Whether to write a trajectory of the unification process. Defaults to False.
      Returns:
          n_atoms (int): Number of atoms in the molecule.
          energy (float): Final energy of the molecule after idealization.
          coordinates (list): Final coordinates of the molecule after idealization.
          elements (list): Elements of the molecule.
          data (dict): Molbar data.

This is an example input as a yml file:

bond_order_assignment: False  # False if bond order assignment should be skipped, only reasonable in opt mode (standalone force-field optimization).
cycle_detection: True # False if cycle detection should be skipped, only reasonable in opt mode (standalone force-field optimization).
repulsion_charge: 100.0 # Charge used for the Coulomb term in the force field; every atom-atom pair uses the same charge. Only reasonable in opt mode (standalone force-field optimization). Defaults to 100.0
set_edges: True #False if no bonds should be constrained automatically.
set_angles: True #False if no angles should be constrained automatically.
set_dihedrals: True # False if no dihedrals should be constrained automatically.
set_repulsion: True #False if no coulomb term should be used automatically.

constraints:
  bonds:
    - atoms: [19, 23]  # List of atoms involved in the bond
      value: 1.5  # Ideal bond length. 
  angles:
    - atoms: [19, 23, 35]  # List of atoms involved in the angle
      value: 45.0  # Angle to which the angle between the three atoms is to be constrained
    - atoms: [35, 23, 19]  # List of atoms involved in the angle
      value: 45.0  # Angle to which the angle between the three atoms is to be constrained

  dihedrals:
    - atoms: [30, 18, 14, 13]  # List of atoms involved in the dihedral
      value:  90.0  # Ideal dihedral angle in degrees

In Python from a list of Cartesian coordinates:

from molbar.barcode import idealize_structure_from_coordinates

def idealize_structure_from_coordinates(coordinates: list, elements: list, return_data=False, timing=False, input_constraint=None) -> Union[list, str]:

      Args:
          coordinates (list): Cartesian coordinates of the molecule.
          elements (list): Elements of the molecule.
          return_data (bool, optional): Whether to return MolBar data.
          timing (bool, optional): Whether to print the duration of this calculation.
          input_constraint (dict, optional): The constraint for the calculation. See documentation for more information.
          
      Returns:
          n_atoms (int): Number of atoms in the molecule.
          energy (float): Final energy of the molecule after idealization.
          coordinates (list): Final coordinates of the molecule after idealization.
          elements (list): Elements of the molecule.
          data (dict): MolBar data.
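
The idealized geometry can be written back to XYZ text for inspection in a viewer. The sketch below assumes that the documented return values can be unpacked in the listed order, which is not confirmed above. The helper to_xyz is hypothetical and does not depend on MolBar.

```python
from typing import Sequence


def to_xyz(elements: Sequence[str],
           coordinates: Sequence[Sequence[float]],
           comment: str = "") -> str:
    """Format elements and Cartesian coordinates as XYZ text."""
    if len(elements) != len(coordinates):
        raise ValueError("one element is needed per coordinate row")
    lines = [str(len(elements)), comment]
    for symbol, (x, y, z) in zip(elements, coordinates):
        lines.append(f"{symbol:<2} {x:14.8f} {y:14.8f} {z:14.8f}")
    return "\n".join(lines) + "\n"


# Hypothetical usage, assuming the return values unpack in the documented order:
#   from molbar.barcode import idealize_structure_from_coordinates
#   n_atoms, energy, coords, elems, data = idealize_structure_from_coordinates(
#       coordinates, elements, return_data=True)
#   print(to_xyz(elems, coords, comment=f"energy = {energy}"))
print(to_xyz(["O", "H"], [[0, 0, 0], [0, 0, 0.96]], comment="demo"))
```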

This is an example input as a Python dict:

  {'bond_order_assignment': True, #False if bond order assignment should be skipped, only reasonable in opt mode (standalone force-field optimization).
  'cycle_detection': True, #False if cycle detection should be skipped, only reasonable in opt mode (standalone force-field optimization).
  'set_edges': True, #False if no bonds should be constrained automatically.
  'set_angles': True, #False if no angles should be constrained automatically.
  'set_dihedrals': True, #False if no dihedrals should be constrained automatically.
  'set_repulsion': True, #False if no Coulomb term should be used automatically.
  'repulsion_charge': 100.0, #Charge used for the Coulomb term in the force field; every atom-atom pair uses the same charge. Only reasonable in opt mode (standalone force-field optimization). Defaults to 100.0
  'constraints': {'bonds': [{'atoms': [1,2], 'value':1.5},...], #atoms: list of atoms that define the bond, value is the ideal bond length in angstrom, atom indexing starts with 1.
                  'angles': [{'atoms': [1,2,3], 'value':90.0},...], #atoms: list of atoms that define the angle, value is the ideal angle in degrees, atom indexing starts with 1.
                  'dihedrals': [{'atoms': [1,2,3,4], 'value':180.0},...]}  #atoms: list of atoms that define the dihedral, value is the ideal dihedral angle in degrees, atom indexing starts with 1.
  }
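
Since a wrong atom count or a 0-based index is easy to introduce by hand, the 'constraints' part can be checked before it is passed on. This is a sketch with a hypothetical helper, check_constraints. The expected atom counts (2, 3, 4) and the 1-based indexing follow the comments in the example above.

```python
from typing import Dict, List

# Number of atoms that define each constraint type.
EXPECTED_ATOMS = {"bonds": 2, "angles": 3, "dihedrals": 4}


def check_constraints(input_constraint: Dict, n_atoms: int) -> List[str]:
    """Return a list of problems in the 'constraints' part; empty if none."""
    problems: List[str] = []
    for kind, entries in input_constraint.get("constraints", {}).items():
        expected = EXPECTED_ATOMS.get(kind)
        if expected is None:
            problems.append(f"unknown constraint type: {kind}")
            continue
        for i, entry in enumerate(entries):
            atoms = entry.get("atoms", [])
            if len(atoms) != expected:
                problems.append(
                    f"{kind}[{i}]: expected {expected} atoms, got {len(atoms)}"
                )
            # Atom indexing starts with 1, so 0 is always invalid.
            if any(a < 1 or a > n_atoms for a in atoms):
                problems.append(f"{kind}[{i}]: atom index outside 1..{n_atoms}")
            if "value" not in entry:
                problems.append(f"{kind}[{i}]: missing 'value'")
    return problems


example = {
    "set_dihedrals": True,
    "constraints": {
        "bonds": [{"atoms": [1, 2], "value": 1.5}],
        "dihedrals": [{"atoms": [0, 1, 2], "value": 180.0}],
    },
}
for problem in check_constraints(example, n_atoms=5):
    print(problem)
```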

Acknowledgements

MolBar relies on the following libraries and packages:

Thank you!

License and Disclaimer

MIT License

Copyright (c) 2022 Nils van Staalduinen, Christoph Bannwarth

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

