Implements fingerprints (isometry invariants) of crystals based on geometry: average minimum distances (AMD) and point-wise distance distributions (PDD).
Reason this release was yanked:
bad PDD_cdist bug
Project description
average-minimum-distance: isometrically invariant crystal fingerprints
Implements fingerprints (isometry invariants) of crystal structures based on geometry: average minimum distances (AMD) and pointwise distance distributions (PDD).
- PyPI project: https://pypi.org/project/average-minimum-distance
- Documentation: https://average-minimum-distance.readthedocs.io
- Source code: https://github.com/dwiddo/average-minimum-distance
- References (bib entries are listed under "Cite us"):
  - Average minimum distances of periodic point sets - foundational invariants for mapping periodic crystals. MATCH Communications in Mathematical and in Computer Chemistry, 87(3):529-559 (2022). https://doi.org/10.46793/match.87-3.529W
  - Pointwise distance distributions of periodic point sets. arXiv preprint arXiv:2108.04798 (2021). https://arxiv.org/abs/2108.04798
What's amd?
The typical representation of a crystal as a motif and cell is ambiguous, as there are many ways to define the same crystal. This package implements new isometric invariants: average minimum distances (AMD) and pointwise distance distributions (PDD), which always take the same value for any two (isometrically) identical input crystals. They do this in a continuous way, so similar crystals have a small distance between their invariants.
Brief description of AMD and PDD
The pointwise distance distribution (PDD) records the environment of each atom in a unit cell by listing the distances from each atom to neighbouring atoms in order, with some extra steps to ensure independence of cell and motif. A PDD is a collection of lists with attached weights (a matrix). Two PDDs are compared by finding an optimal matching between the two sets of lists while respecting the weights (Earth Mover's distance), and when the crystals are geometrically identical (regardless of choice of motif and cell) there is always a perfect matching resulting in a distance of zero.
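As a rough illustration of the matching idea, here is a minimal sketch for a restricted special case: both PDDs have the same number of rows and every row carries the same weight, so the Earth Mover's distance reduces to the best one-to-one matching of rows. The function names are hypothetical, rows are compared with the L-infinity metric as an assumption, and the brute-force search over permutations is only sensible for a handful of rows. It is not how the package computes distances.

```python
from itertools import permutations


def chebyshev(u, v):
    """L-infinity distance between two equal-length lists of numbers."""
    return max(abs(a - b) for a, b in zip(u, v))


def toy_emd_uniform(rows_a, rows_b):
    """Earth Mover's distance for equally sized, uniformly weighted row sets.

    With uniform weights and equal sizes, an optimal transport plan is a
    one-to-one matching, so trying every permutation finds the optimum.
    """
    if len(rows_a) != len(rows_b):
        raise ValueError('this toy version needs the same number of rows')
    n = len(rows_a)
    best_total = min(
        sum(chebyshev(rows_a[i], rows_b[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n  # each matched pair carries weight 1/n


a = [[1.0, 2.0], [1.5, 2.5]]
b = [[1.5, 2.5], [1.0, 2.0]]  # the same rows as a, listed in another order
c = [[1.0, 2.0], [1.5, 3.0]]  # one row differs slightly from a

print(toy_emd_uniform(a, b))  # 0.0
print(toy_emd_uniform(a, c))  # 0.25
```

Reordering the rows does not change the result, which mirrors the point that the order in which atoms are listed in the motif should not matter.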
The average minimum distance (AMD) averages the PDD over atoms in a unit cell to make a vector, which is also the same for any choice of cell and motif. Since AMDs are just vectors, comparing by AMD is much faster than PDD, though AMD contains less information in theory.
Both AMD and PDD have a parameter k, the number of nearest neighbours to consider for each atom, which is the length of the AMD vector or the number of columns in the PDD (plus an extra column for weights of rows).
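To make the definitions concrete, the sketch below computes toy versions of both invariants for a one-dimensional periodic point set. Everything here is an assumption made for illustration: the function names are hypothetical, the setting is 1D rather than a 3D crystal, motif points are assumed to lie in [0, cell_length), and rows are kept unweighted instead of being collapsed with weights. It is not the package's algorithm.

```python
def toy_pdd_1d(motif, cell_length, k):
    """One row per motif point: sorted distances to its k nearest neighbours."""
    # Periodic copies of a point itself already supply 2k neighbours within
    # k * cell_length, so shifts up to k + 1 cells are more than enough.
    max_shift = k + 1
    rows = []
    for i, p in enumerate(motif):
        dists = sorted(
            abs(q + shift * cell_length - p)
            for j, q in enumerate(motif)
            for shift in range(-max_shift, max_shift + 1)
            if not (i == j and shift == 0)  # skip the point itself
        )
        rows.append(dists[:k])
    return rows


def toy_amd_1d(motif, cell_length, k):
    """Average the toy PDD rows column-wise, giving a vector of length k."""
    rows = toy_pdd_1d(motif, cell_length, k)
    return [sum(column) / len(rows) for column in zip(*rows)]


# The same set of unit-spaced points, described with two different cells.
amd_small_cell = toy_amd_1d([0.0], 1.0, k=4)
amd_double_cell = toy_amd_1d([0.0, 1.0], 2.0, k=4)

print(amd_small_cell)   # [1.0, 1.0, 2.0, 2.0]
print(amd_double_cell)  # [1.0, 1.0, 2.0, 2.0]
```

Both descriptions give the same vector, which is the cell-and-motif independence described above, and the vector has length k.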
Getting started
Use pip to install average-minimum-distance:
pip install average-minimum-distance
Then import average-minimum-distance with `import amd`.
`amd.compare()` compares sets of crystals by AMD or PDD in one line, e.g. by PDD with k = 100:
import amd
df = amd.compare('crystals.cif', by='PDD', k=100)
The distance matrix is returned as a pandas DataFrame, with crystal names labelling the rows and columns. `amd.compare()` can also take two paths and compare the crystals in one file with those in the other, for example:
df = amd.compare('crystals_1.cif', 'crystals_2.cif', by='AMD', k=100)
Either the first or the second argument can be a list of cif paths (or file objects), which are combined in the final distance matrix.
`amd.compare()` reads crystals and calculates their AMDs or PDDs, but discards the invariants once the distance matrix is built. If you expect to reuse them, it may be faster to save them to a file (e.g. with `pickle`); see the sections below on how to read, calculate and compare separately.
If `csd-python-api` is installed, the compare function can also accept one or more CSD refcodes, or file formats other than cif (pass `reader='ccdc'`).
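A minimal sketch of caching invariants with the standard library's `pickle` module follows. The dictionary of plain lists is stand-in data (in practice the values would be the invariants computed for each crystal), and the file name `invariants.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in data: crystal names mapped to already computed invariants.
invariants = {
    'crystal_A': [1.0, 1.5, 2.0],
    'crystal_B': [1.1, 1.4, 2.2],
}

# Hypothetical location; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), 'invariants.pkl')

# Save once after the expensive calculation...
with open(path, 'wb') as f:
    pickle.dump(invariants, f)

# ...and reload in a later session instead of recomputing.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(sorted(loaded))  # ['crystal_A', 'crystal_B']
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.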
Choosing a value of k
The parameter k of the invariants is the number of nearest neighbour atoms considered for each atom in the unit cell, e.g. k = 5 looks at the 5 nearest neighbours of each atom. Two crystals with the same unit molecule will have a small AMD/PDD distance for small enough k. A larger k will mean the environments of atoms in one crystal must line up with those in the other up to a larger radius to have a small AMD/PDD distance. Very large k does not mean better comparisons, as the invariants start to converge to depend only on density.
Reading crystals from a file, calculating the AMDs and PDDs
This code reads a .cif with `amd.CifReader` and computes the AMDs (k = 100):
import amd
reader = amd.CifReader('path/to/file.cif')
amds = [amd.AMD(crystal, 100) for crystal in reader] # calc AMDs
Note: CifReader accepts optional arguments, e.g. for removing hydrogen and handling disorder. See the documentation for details.
To calculate PDDs, just replace `amd.AMD` with `amd.PDD`.
If `csd-python-api` is installed, crystals can be read directly from your local copy of the CSD with `amd.CSDReader`, which accepts a list of refcodes. CifReader can accept file formats other than cif by passing `reader='ccdc'`.
Comparing by AMD or PDD
`amd.AMD_pdist` and `amd.PDD_pdist` each take a list of invariants and compare them pairwise, returning a condensed distance matrix like SciPy's `pdist` function.
import amd
from scipy.spatial.distance import squareform
# read crystals once, then calculate AMDs and PDDs (k=100)
crystals = list(amd.CifReader('path/to/file.cif'))
amds = [amd.AMD(crystal, 100) for crystal in crystals]
pdds = [amd.PDD(crystal, 100) for crystal in crystals]
amd_cdm = amd.AMD_pdist(amds) # compare a list of AMDs pairwise
pdd_cdm = amd.PDD_pdist(pdds) # compare a list of PDDs pairwise
# Use SciPy's squareform for a symmetric 2D distance matrix
amd_dm = squareform(amd_cdm)
Note: if you want both AMDs and PDDs like above, it's faster to compute the PDDs first and use `amd.PDD_to_AMD()` rather than computing both from scratch.
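The reason this conversion is cheap is that an AMD is a weighted average of the PDD's rows. The sketch below shows the idea in plain Python. It assumes each row is laid out as `[weight, d_1, ..., d_k]` with the weight first; that layout and the name `toy_pdd_to_amd` are assumptions made for this illustration, not a description of the package's internals.

```python
def toy_pdd_to_amd(pdd):
    """Weighted column average of a PDD-like matrix.

    Assumes every row is [weight, d_1, ..., d_k]; dividing by the total
    weight means the weights do not have to sum to exactly 1.
    """
    k = len(pdd[0]) - 1
    total_weight = sum(row[0] for row in pdd)
    return [
        sum(row[0] * row[col] for row in pdd) / total_weight
        for col in range(1, k + 1)
    ]


# Two environments, each accounting for half of the atoms in the cell.
pdd = [
    [0.5, 1.0, 2.0],
    [0.5, 3.0, 4.0],
]

print(toy_pdd_to_amd(pdd))  # [2.0, 3.0]
```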
The default metric for comparison is `chebyshev` (L-infinity), though it can be changed to anything accepted by SciPy's `pdist`, e.g. `euclidean`.
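For readers unfamiliar with the condensed format or the Chebyshev metric, here is a small standard-library sketch of both. The helper names `toy_pdist` and `toy_squareform` are hypothetical stand-ins that mimic the pair ordering of a condensed distance matrix (pairs (i, j) with i < j, row by row); they are not the package's or SciPy's implementation.

```python
import math
from itertools import combinations


def chebyshev(u, v):
    """L-infinity distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def toy_pdist(vectors, metric=chebyshev):
    """Condensed distances for all pairs (i, j) with i < j."""
    return [metric(u, v) for u, v in combinations(vectors, 2)]


def toy_squareform(condensed, n):
    """Expand a condensed list into a symmetric n x n matrix."""
    dm = [[0.0] * n for _ in range(n)]
    for (i, j), d in zip(combinations(range(n), 2), condensed):
        dm[i][j] = dm[j][i] = d
    return dm


vectors = [[1.0, 2.0, 3.0], [1.0, 2.5, 3.0], [2.0, 2.0, 5.0]]

cdm = toy_pdist(vectors)
print(cdm)  # [0.5, 2.0, 2.0]
print(toy_squareform(cdm, len(vectors)))

# Swapping the metric changes the values but not the layout.
euclidean_cdm = toy_pdist(vectors, metric=math.dist)
```

A condensed matrix for n items has n(n-1)/2 entries, which is why it is the preferred format for large pairwise comparisons.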
If you have two sets of crystals and want to compare all crystals in one to the other, use `amd.AMD_cdist` or `amd.PDD_cdist`.
import amd
set1 = amd.CifReader('set1.cif')
set2 = amd.CifReader('set2.cif')
amds1 = [amd.AMD(crystal, 100) for crystal in set1]
amds2 = [amd.AMD(crystal, 100) for crystal in set2]
# dm[i][j] = distance(amds1[i], amds2[j])
dm = amd.AMD_cdist(amds1, amds2)
Example: PDD-based dendrogram
This example compares some crystals in a cif by PDD (k = 100) and plots a single linkage dendrogram:
import amd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
crystals = list(amd.CifReader('crystals.cif'))
names = [crystal.name for crystal in crystals]
pdds = [amd.PDD(crystal, 100) for crystal in crystals]
cdm = amd.PDD_pdist(pdds)
Z = hierarchy.linkage(cdm, 'single')
dn = hierarchy.dendrogram(Z, labels=names)
plt.show()
Example: Finding n nearest neighbours in one set from another
This example finds the 10 nearest PDD-neighbours in set 2 for every crystal in set 1.
import numpy as np
import amd
n = 10
# by='PDD' is passed explicitly so the comparison matches the description above
df = amd.compare('set1.cif', 'set2.cif', by='PDD', k=100)
dm = df.values
# Uses np.argpartition (partial argsort) and np.take_along_axis to find
# nearest neighbours of each item in set1. Works for any distance matrix
# with more than n columns.
nn_inds = np.array([np.argpartition(row, n)[:n] for row in dm])
nn_dists = np.take_along_axis(dm, nn_inds, axis=-1)
# argpartition does not order the n smallest, so sort them by distance
sorted_inds = np.argsort(nn_dists, axis=-1)
nn_inds = np.take_along_axis(nn_inds, sorted_inds, axis=-1)
nn_dists = np.take_along_axis(nn_dists, sorted_inds, axis=-1)
for i in range(dm.shape[0]):
    print('neighbours of', df.index[i])
    for j in range(n):
        print('neighbour', j+1, df.columns[nn_inds[i][j]], 'dist:', nn_dists[i][j])
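For small distance matrices, the same lookup can be sketched with only the standard library, which may make the logic easier to follow. The function name `nearest_neighbours` and the toy matrix below are hypothetical; `heapq.nsmallest` plays the role of the partial sort, returning the n smallest entries already ordered by distance.

```python
import heapq


def nearest_neighbours(dm, col_names, n):
    """For each row of dm, return the n closest columns as (name, distance)."""
    result = []
    for row in dm:
        # Pair each distance with its column name and keep the n smallest.
        nearest = heapq.nsmallest(n, zip(row, col_names))
        result.append([(name, dist) for dist, name in nearest])
    return result


# Toy 2 x 3 distance matrix: rows are items in set 1, columns items in set 2.
dm = [
    [0.3, 0.1, 0.7],
    [0.9, 0.2, 0.4],
]
col_names = ['x', 'y', 'z']

for neighbours in nearest_neighbours(dm, col_names, n=2):
    print(neighbours)
# [('y', 0.1), ('x', 0.3)]
# [('y', 0.2), ('z', 0.4)]
```

The NumPy version above is the better choice for large matrices, since it avoids Python-level loops over every entry.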
Cite us
Use the following bib references to cite AMD or PDD.
Average minimum distances of periodic point sets - foundational invariants for mapping periodic crystals. MATCH Communications in Mathematical and in Computer Chemistry, 87(3), 529-559 (2022). https://doi.org/10.46793/match.87-3.529W.
@article{10.46793/match.87-3.529W,
title = {Average Minimum Distances of periodic point sets - foundational invariants for mapping periodic crystals},
author = {Widdowson, Daniel and Mosca, Marco M and Pulido, Angeles and Kurlin, Vitaliy and Cooper, Andrew I},
journal = {MATCH Communications in Mathematical and in Computer Chemistry},
doi = {10.46793/match.87-3.529W},
volume = {87},
number = {3},
pages = {529-559},
year = {2022}
}
Pointwise distance distributions of periodic point sets. arXiv preprint arXiv:2108.04798 (2021). https://arxiv.org/abs/2108.04798.
@misc{arXiv:2108.04798,
author = {Widdowson, Daniel and Kurlin, Vitaliy},
title = {Pointwise distance distributions of periodic point sets},
year = {2021},
eprint = {arXiv:2108.04798},
}
File details
Details for the file average-minimum-distance-1.3.1.tar.gz.
File metadata
- Download URL: average-minimum-distance-1.3.1.tar.gz
- Upload date:
- Size: 33.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0baa87bef11f033a8da5651d686bf7ad1d196d317895879f26e63a1fcc865ebd
MD5 | 501ec61c3c8c85092a72f234e20378ac
BLAKE2b-256 | 2da9fb8770bff0aec56da344d8e955d1b548b44c40d5d45314b4b3e05083e3ff
File details
Details for the file average_minimum_distance-1.3.1-py3-none-any.whl.
File metadata
- Download URL: average_minimum_distance-1.3.1-py3-none-any.whl
- Upload date:
- Size: 32.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | b8256359900aea1b482d09742af4d22cd3e04e8c2273dff6118d0c718ffb5611
MD5 | 5c653080b9231bec610e67c06dbf59d9
BLAKE2b-256 | 6842977395539736fad152c1c4fcea54da805a326163fc8c398887ceb39bb519