Machine learned location of chemical adsorbates on high-symmetry surface sites.
Project description
ASESurfaceFinder
A utility for determining surface facets and adsorption points of ASE-based systems consisting of molecules on surfaces.
ASE comes with an excellent selection of utilities for working with atomic surfaces, enabling the construction of many common surface facets, the definition of high-symmetry points across these surfaces, and the adsorption of arbitrary molecules to these surface sites. However, computationally determining which of these sites an adsorbed molecule is bound to without prior knowledge is a non-trivial task, due to the symmetric equivalence of many sites and high similarity of many surface facets.
ASESurfaceFinder implements automated tools for training and validating random forest classification models (implemented in scikit-learn) that can identify surface sites based on the local atomic environment of adsorbed atoms. Given unseen adsorbed systems, it then enables these models to be used for prediction of both surface facet and high-symmetry adsorption site, to be used when cataloguing externally-generated adsorbed systems.
Installation
ASESurfaceFinder has been developed for Python 3.10 and newer. It is available on PyPI and conda-forge, using their respective package managers:
# Pip (PyPI)
pip install asesurfacefinder
# Conda (conda-forge)
conda install -c conda-forge asesurfacefinder
Usage
Given a workflow that produces XYZ geometries of molecules adsorbed on periodic solid surfaces in unknown locations, ASESurfaceFinder can be used to quickly categorise the adsorption sites by the surface's available high-symmetry points. For example, assuming the workflow can produce surface/adsorbate systems where the surface is one of the following (a 3x3x3 unit cell is shown):
Example Surface | ASE Construction | Visualisation (3x3x3) |
---|---|---|
Platinum FCC{100} | fcc100('Pt', (3,3,3)) |
|
Silver FCC{110} | fcc110('Ag', (3,3,3)) |
|
Gold FCC{111} | fcc111('Au', (3,3,3)) |
ASE defines the symmetrically equivalent adsorption sites for each of these surface facets (see ASE's documentation on surface construction). These are used as classification targets within ASESurfaceFinder.
To begin, the package's main class is initialised using these surfaces:
from asesurfacefinder import SurfaceFinder
from ase.build import fcc100, fcc110, fcc111
surfaces = [fcc100('Pt', (3,3,3)), fcc110('Ag', (3,3,3)), fcc111('Au', (3,3,3))]
surface_labels = ['Pt_fcc100', 'Ag_fcc110', 'Au_fcc111']
sf = SurfaceFinder(surfaces, labels=surface_labels)
The labels
argument is optional, and can be used for more precisely naming the surface facets; without it, they will be referred to by integers.
[!IMPORTANT] While surfaces outside of those that can be generated by ASE can be trained on, they must contain an
'adsorbate_info'
dictionary internally that specifies their minimal XY unit cell and their high-symmetry adsorption sites.
Model Training
To train a random forest classification model to recognise the adsorption sites on each of these surfaces, the train
method of this class can be called:
sf.train(
samples_per_site=10000,
ads_xy_noise=0.2,
ads_z_bounds=(1.0, 2.75),
n_jobs=4
)
For each surface facet passed into SurfaceFinder
, this samples positions above each adsorption site samples_per_site
times for possible perturbations and adsorption heights. The XY position is sampled from a bivariate normal distribution with its mean at the adsorption site and XY variance specified by ads_xy_noise
. The height (Z position) is sampled from a uniform distribution between ads_z_bounds
.
The local atomic environment around each sampled adsorbate position is then encoded by means of a descriptor from DScribe - a SOAP descriptor with a cutoff of 10 Å is used by default, but this can be modified by passing a descriptor
keyword argument when constructing the SurfaceFinder
. Each descriptor is matched with a label representing the surface facet and adsorption site that it represents in the format {label}_{site}
, where {label}
is one of the labels passed into SurfaceFinder
and {site}
is the ASE high-symmetry site name. The n_jobs
argument enables parallelism over the specified number of processes during descriptor creation and model training.
A random forest classifier is then trained on this data, such that any future points above a surface can be classified into a surface facet and adsorption site prediction.
Sample Visualisation
Possible realisations of the adsorption point sampling can be visualised with ASESurfaceFinder's SamplePlotter
:
from asesurfacefinder import SamplePlotter
sp = SamplePlotter(fcc111('Au', (3,3,3)),
samples_per_site=2000,
ads_xy_noise=0.2,
ads_z_bounds=(1.0, 2.75))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,6), layout='constrained')
ax1.set_axis_off(); ax2.set_axis_off()
sp.plot(ax=ax1, rotation='0x,0y,0z')
sp.plot(ax=ax2, rotation='-75x,15y,0z')
This acts as a wrapper around ase.visualize.plot
's plot_atoms
function, sampling adsorption points within the given bounds and displaying them on the requested surface. SamplePlotter.plot()
passes through keyword arguments to this function, allowing for properties such as the rotation of each plot to be modified:
In this case, the sampled XY position noise may need to be decreased slightly, as the areas covered by the bridge and fcc sites have begun to overlap, which may reduce predictive accuracy.
Model Validation
The training can be validated before performing any predictions on real systems:
sf.validate(
samples_per_site=2000,
surf_mults=[(1,1,1), (3,3,1), (5,5,1)],
ads_xy_noise=0.25,
ads_z_bounds=(1.4, 2.7)
)
This performs the same adsorbate position sampling procedure as was used to generate local atomic environments in the training of the classifier, but different values of the sampling volume can be passed to test the limits of applicability of the current model. In this case, it is likely that there will be a few failed predictions, as the predictor will have to extrapolate to local environments outside the height and XY bounds that it was trained on.
Additionally, the surf_mults
argument can be used to scale up the unit cell of each surface facet before sampling to ensure that adsorbates on larger surface slabs are correctly predicted. In this example, since the provided facets were all 3x3x3 unit cells, this will produce sampled adsorbate positions on 3x3x3, 9x9x3 and 15x15x3 surface slabs.
ASESurfaceFinder will output a summary of these validation results once it has predicted labels for these newly generated positions and compared them with their true values. If the classifier has been trained correctly, this will simply be something like
66000/66000 sites classified correctly (accuracy = 1.0).
However, if this validation fails for any sites, ASESurfaceFinder prints them in a table where h
is the sampled adsorbate height and d
is the distance in Angstroms away from the ideal adsorption site:
63559/66000 sites classified correctly (accuracy = 0.9630151515151515).
True | Predicted
-------------------------------------------------------------------
Au_fcc111_bridge (5, 5, 1) (h = 2.48, d = 0.44) | Au_fcc111_hcp
Au_fcc111_bridge (5, 5, 1) (h = 1.42, d = 0.67) | Au_fcc111_ontop
Au_fcc111_bridge (5, 5, 1) (h = 1.60, d = 0.42) | Au_fcc111_hcp
Au_fcc111_fcc (5, 5, 1) (h = 2.43, d = 0.52) | Au_fcc111_bridge
Au_fcc111_fcc (5, 5, 1) (h = 2.40, d = 0.63) | Au_fcc111_bridge
Au_fcc111_fcc (5, 5, 1) (h = 1.98, d = 0.54) | Au_fcc111_bridge
...
This indicates that either
- the training samples were not varied enough to account for the sample space covered by the validation samples,
- the validation samples covered too wide an area and overlapped between neighbouring sites, causing samples to be mislabelled,
- the training samples accounted for too wide a sample space, causing overlap between neighbouring sites and misclassification.
In this case, we showed above that it is likely that all of these problems are arising - SamplePlotter
showed that there was overlap in the training samples between some sites of Au_fcc111
, and ads_xy_noise
in the validation samples was even greater, creating mislabelled samples. In a production use case, these parameters should be tweaked such that validation returns no errors while sampling as broad a volume above each adsorption site as possible.
Model Prediction
To predict the high-symmetry adsorption sites and surface facets of real surface/adsorbate systems, these can simply be fed into a trained SurfaceFinder
.
Taking an example system of 3 methanol molecules adsorbed to 'fcc' sites of gold FCC{111}, with a free methanol molecule above the surface:
The adsorbates and gas phase molecule can be separated from the surface and classified by their adsorption sites (or lack thereof) with the SurfaceFinder.predict()
method:
from ase.io import read
real_surface_1 = read('examples/4MeO_Au_fcc111.xyz')
slab, molecules, labels = sf.predict(real_surface_1)
This returns three outputs:
slab
: The clean surface geometry as anase.Atoms
, without any adsorbates.molecules
: A list ofase.Atoms
representing the adsorbates and gas phase molecules found on and above the surface respectively.labels
: A list of dictionaries corresponding to the entries inmolecules
containing the atoms of the respective adsorbate that are adsorbed onto the surface and their predicted sites.
For example, given the above system:
for molecule, label in zip(molecules, labels):
print(molecule, label)
Atoms(symbols='OCH3', pbc=False, tags=...) {0: {'site': 'Au_fcc111_fcc', 'bonded_elem': 'O', 'coordination': 3}}
Atoms(symbols='OCH3', pbc=False, tags=...) {0: {'site': 'Au_fcc111_fcc', 'bonded_elem': 'O', 'coordination': 3}}
Atoms(symbols='OCH3', pbc=False, tags=...) {0: {'site': 'Au_fcc111_fcc', 'bonded_elem': 'O', 'coordination': 3}}
Atoms(symbols='CHOH3', pbc=False, tags=...) {}
The first three entries are the adsorbed methanol molecules, correctly classified as being bound to the 'fcc' high-symmetry site on a gold FCC{111} surface. Each molecule's label
dictionary is keyed by the index of the atom in the corresponding molecule
which is bound; in the event that a molecule is bound to the surface by multiple atoms, it will contain en entry for each. In this case, all adsorbates are bound by their respective 0th atoms - their oxygens, indicated by the 'bonded_elem'
key.
Additionally, each label
also provides the coordination of the adsorption site as detected by ASE's connectivity tools. Since the 'fcc' site sits in between three atoms, this is reflected in the 'coordination'
key.
The fourth molecule returned is the gas phase methanol, which is not adsorbed onto the surface. This is reflected in its label
dictionary, which is empty to indicate no atoms are bound.
A more complex example with different adsorbates bound on a range of sites follows:
Running the same prediction workflow reveals carbon monoxide on an 'ontop' site, hydroxide on a 'fcc' site, and methylamine on a 'bridge' site:
real_surface_2 = read('examples/CO+H2O+NHCH3_Au_fcc111.xyz')
slab, molecules, labels = sf.predict(real_surface_2)
for molecule, label in zip(molecules, labels):
print(molecule, label)
Atoms(symbols='CO', pbc=False) {0: {'site': 'Au_fcc111_ontop', 'bonded_elem': 'C', 'coordination': 1}}
Atoms(symbols='OH', pbc=False) {0: {'site': 'Au_fcc111_fcc', 'bonded_elem': 'O', 'coordination': 3}}
Atoms(symbols='NHCH3', pbc=False) {0: {'site': 'Au_fcc111_bridge', 'bonded_elem': 'N', 'coordination': 2}}
A note on adsorbate identification
ASESurfaceFinder does not currently contain a nice, clean way of separating all adsorbates from surfaces. In the first instance, it searches for a set of per-atom 'tags' in the underlying ase.Atoms
object (accessed by atoms.get_tags()
). These are generated by ASE when using its adsorbate-adding functionality, and represent the layer of the surface that each atom is part of - adsorbates are tagged at layer 0, with increasing layer tags indicating increasing depth.
However, it is unlikely that input surface/adsorbate systems will be tagged like this since if they were generated with ASE, you would already know the sites on which the adsorbates sit. Instead, ASESurfaceFinder currently relies on separation by element. Since it knows which elements the surfaces are made of, it will simply mask these elements off to separate the adsorbates from the surface. A warning will be displayed when this is performed.
Since the majority of surface/adsorbate systems are organic molecules on metal surfaces, this approach will work most of the time. However, when dealing with surfaces that contain the same elements as adsorbates (e.g. diamond, silica, etc.), this approach will fail. A more concrete approach for adsorbate separation is being developed, but will likely require analysis of the training surfaces as a whole.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file asesurfacefinder-1.0.2.tar.gz
.
File metadata
- Download URL: asesurfacefinder-1.0.2.tar.gz
- Upload date:
- Size: 19.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 23788c1a3ac01b7198edeb634d68e8f8b7512381bed079b7ad5d2a3ad801a5b6 |
|
MD5 | b09ab75cb61c9e2f826ff1710c94091e |
|
BLAKE2b-256 | f34dae05dbbbe610b9f3cb5d23fa75de6d84d5d98ee2deec4fc15496969e29d7 |
File details
Details for the file asesurfacefinder-1.0.2-py3-none-any.whl
.
File metadata
- Download URL: asesurfacefinder-1.0.2-py3-none-any.whl
- Upload date:
- Size: 15.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 67ce8b9bf5c88555bb5a169cb75c3679e6526f47bf56bf66e1638d8803d80d50 |
|
MD5 | 2ec111d3358b2072e2448ef96bf9b84d |
|
BLAKE2b-256 | 8426b00a1eda3ceca5928c790eaa886e17634d52f936288a68f286c37c7d3176 |