Implements fingerprints (isometry invariants) of crystals based on geometry: average minimum distances (AMD) and pointwise distance distributions (PDD).

Reason this release was yanked:

bad PDD_cdist bug

average-minimum-distance: isometrically invariant crystal fingerprints

Implements fingerprints (isometry invariants) of crystal structures based on geometry: average minimum distances (AMD) and pointwise distance distributions (PDD).

What's amd?

The typical representation of a crystal as a motif and cell is ambiguous, as there are many ways to define the same crystal. This package implements new isometric invariants: average minimum distances (AMD) and pointwise distance distributions (PDD), which always take the same value for any two (isometrically) identical input crystals. They do this in a continuous way, so similar crystals have a small distance between their invariants.

Brief description of AMD and PDD

The pointwise distance distribution (PDD) records the environment of each atom in a unit cell by listing the distances from each atom to neighbouring atoms in order, with some extra steps to ensure independence of cell and motif. A PDD is a collection of lists with attached weights (a matrix). Two PDDs are compared by finding an optimal matching between the two sets of lists while respecting the weights (Earth Mover's distance), and when the crystals are geometrically identical (regardless of choice of motif and cell) there is always a perfect matching resulting in a distance of zero.
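As a rough illustration of the row-and-weight structure described above, here is a minimal sketch for a finite point set rather than a periodic crystal. The function name `toy_pdd` is hypothetical and not part of this package, and placing the weight in the first column is a choice made for this sketch. It omits the periodic extension that the real PDD needs.

```python
import math
from collections import Counter

def toy_pdd(points, k):
    """Toy PDD of a FINITE point set: rows of [weight, d_1, ..., d_k].

    Hypothetical helper for illustration only; a real crystal is periodic.
    """
    rows = []
    for i, p in enumerate(points):
        # Sorted distances from point i to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        rows.append(tuple(round(d, 10) for d in dists[:k]))
    # Points with identical environments are merged into one weighted row.
    counts = Counter(rows)
    n = len(points)
    return sorted([count / n, *row] for row, count in counts.items())

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(toy_pdd(square, k=2))  # [[1.0, 1.0, 1.0]]
```

All four corners of the square see the same two nearest distances, so the rows collapse into a single row of weight 1. Reordering or relabelling the points gives the same output, which is the finite analogue of independence from the choice of motif.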

The average minimum distance (AMD) averages the PDD over atoms in a unit cell to make a vector, which is also the same for any choice of cell and motif. Since AMDs are just vectors, comparing by AMD is much faster than PDD, though AMD contains less information in theory.
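The averaging step can be sketched in a few lines. This assumes each PDD row is stored as `[weight, d_1, ..., d_k]` with weights summing to 1; the helper name `amd_from_pdd` and the example matrix are made up for illustration and are not the package's API.

```python
def amd_from_pdd(pdd):
    """Weighted average of PDD rows, assuming row layout [weight, d_1, ..., d_k]."""
    k = len(pdd[0]) - 1
    return [sum(row[0] * row[1 + j] for row in pdd) for j in range(k)]

# Hypothetical PDD with two distinct atomic environments of equal weight.
pdd = [
    [0.5, 1.0, 1.5, 2.0],
    [0.5, 1.0, 2.5, 3.0],
]
print(amd_from_pdd(pdd))  # [1.0, 2.0, 2.5]
```

The two rows differ, but the AMD only keeps their column-wise average, which is why it carries less information than the PDD.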

Both AMD and PDD have a parameter k, the number of nearest neighbours to consider for each atom, which is the length of the AMD vector or the number of columns in the PDD (plus an extra column for weights of rows).
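To make the role of k and the independence from the choice of cell concrete, the following toy computes an AMD for a one-dimensional periodic point set. It is a simplified sketch, not the package's algorithm: `toy_amd_1d` is a hypothetical name, motif coordinates are assumed to lie in `[0, cell_length)`, and motif points are assumed distinct.

```python
import math

def toy_amd_1d(motif, cell_length, k):
    """AMD of a 1D periodic point set (motif repeated with period cell_length)."""
    # k + 1 copies of the cell on each side are enough to contain the
    # k nearest neighbours of every motif point.
    reps = k + 1
    images = [p + n * cell_length for n in range(-reps, reps + 1) for p in motif]
    rows = []
    for p in motif:
        dists = sorted(abs(p - q) for q in images)
        rows.append(dists[1:k + 1])  # drop the zero distance to the point itself
    m = len(motif)
    # Average the j-th nearest-neighbour distance over all motif points.
    return [sum(row[j] for row in rows) / m for j in range(k)]

# The same point set described with a unit cell and with a doubled cell.
small_cell = toy_amd_1d([0.0, 0.3], 1.0, k=3)
double_cell = toy_amd_1d([0.0, 0.3, 1.0, 1.3], 2.0, k=3)
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(small_cell, double_cell))
print(small_cell, double_cell, same)  # both are approximately [0.3, 0.7, 1.0]
```

The output vector has length k, and both descriptions of the point set give the same values up to floating-point rounding.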

Getting started

Use pip to install average-minimum-distance:

pip install average-minimum-distance

Then import average-minimum-distance with import amd.

amd.compare() compares sets of crystals by AMD or PDD in one line, e.g. by PDD with k = 100:

import amd
df = amd.compare('crystals.cif', by='PDD', k=100)

It returns the distance matrix as a pandas DataFrame, with crystal names labelling the rows and columns. It can also take two paths and compare the crystals in one file against those in the other, for example

df = amd.compare('crystals_1.cif', 'crystals_2.cif', by='AMD', k=100)

Either the first or the second argument can be a list of cif paths (or file objects), which are combined in the final distance matrix.

amd.compare() reads crystals and calculates their AMD or PDD, but discards the invariants afterwards. If you compare the same crystals more than once, it may be faster to save the invariants to a file (e.g. with pickle); see the sections below on how to read, calculate and compare separately.
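A minimal sketch of saving and reloading with pickle is shown here. The dictionary of plain lists is a made-up stand-in for computed invariants, and the file name is arbitrary; in practice the values returned by amd.AMD or amd.PDD would take the place of these lists.

```python
import os
import pickle
import tempfile

# Stand-in data: crystal name -> invariant (hypothetical values).
invariants = {
    'crystal_A': [1.0, 1.5, 2.0],
    'crystal_B': [1.1, 1.4, 2.2],
}

path = os.path.join(tempfile.mkdtemp(), 'invariants.pkl')

# Save once after the expensive calculation...
with open(path, 'wb') as f:
    pickle.dump(invariants, f)

# ...and reload later without reading the cifs again.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == invariants)  # True
```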

If csd-python-api is installed, the compare function can also accept one or more CSD refcodes or other file formats instead of cifs (pass reader='ccdc').

Choosing a value of k

The parameter k of the invariants is the number of nearest neighbour atoms considered for each atom in the unit cell, e.g. k = 5 looks at the 5 nearest neighbours of each atom. Two crystals with the same unit molecule will have a small AMD/PDD distance for small enough k. A larger k means the environments of atoms in one crystal must line up with those in the other up to a larger radius to have a small AMD/PDD distance. A very large k does not give better comparisons, because the invariants then converge towards values that depend only on density.

Reading crystals from a file, calculating the AMDs and PDDs

This code reads a .cif with amd.CifReader and computes the AMDs (k = 100):

import amd
reader = amd.CifReader('path/to/file.cif')
amds = [amd.AMD(crystal, 100) for crystal in reader]  # calc AMDs

Note: CifReader accepts optional arguments, e.g. for removing hydrogen and handling disorder. See the documentation for details.

To calculate PDDs, just replace amd.AMD with amd.PDD.

If csd-python-api is installed, crystals can be read directly from your local copy of the CSD with amd.CSDReader, which accepts a list of refcodes. CifReader can accept file formats other than cif by passing reader='ccdc'.

Comparing by AMD or PDD

amd.AMD_pdist and amd.PDD_pdist each take a list of invariants and compare them pairwise, returning a condensed distance matrix like SciPy's pdist function.

# read crystals and calculate AMDs and PDDs (k=100)
crystals = list(amd.CifReader('path/to/file.cif'))
amds = [amd.AMD(crystal, 100) for crystal in crystals]
pdds = [amd.PDD(crystal, 100) for crystal in crystals]

amd_cdm = amd.AMD_pdist(amds) # compare a list of AMDs pairwise
pdd_cdm = amd.PDD_pdist(pdds) # compare a list of PDDs pairwise

# Use SciPy's squareform for a symmetric 2D distance matrix
from scipy.spatial.distance import squareform
amd_dm = squareform(amd_cdm)

Note: if you want both AMDs and PDDs like above, it's faster to compute the PDDs first and use amd.PDD_to_AMD() rather than computing both from scratch.

The default metric for comparison is chebyshev (L-infinity), though it can be changed to anything accepted by SciPy's pdist, e.g. euclidean.
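For readers unfamiliar with condensed distance matrices, this pure-Python sketch shows the Chebyshev metric on vectors and the pair ordering used by SciPy's pdist. The helper names (`chebyshev`, `toy_pdist`, `toy_squareform`) and the example vectors are hypothetical; they stand in for AMD_pdist and squareform only to show the layout.

```python
from itertools import combinations

def chebyshev(u, v):
    """L-infinity distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))

def toy_pdist(vectors):
    """Condensed matrix: one distance per pair (i, j) with i < j, row by row."""
    pairs = combinations(range(len(vectors)), 2)
    return [chebyshev(vectors[i], vectors[j]) for i, j in pairs]

def toy_squareform(condensed, n):
    """Expand a condensed matrix into a symmetric n x n matrix."""
    dm = [[0.0] * n for _ in range(n)]
    for d, (i, j) in zip(condensed, combinations(range(n), 2)):
        dm[i][j] = dm[j][i] = d
    return dm

amds = [[1.0, 2.0, 3.0], [1.0, 2.5, 3.0], [2.0, 2.0, 5.0]]  # made-up vectors
cdm = toy_pdist(amds)
print(cdm)                     # [0.5, 2.0, 2.0]
print(toy_squareform(cdm, 3))  # [[0.0, 0.5, 2.0], [0.5, 0.0, 2.0], [2.0, 2.0, 0.0]]
```

The condensed form stores n(n-1)/2 values instead of n squared, which matters when comparing many crystals.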

If you have two sets of crystals and want to compare all crystals in one to the other, use amd.AMD_cdist or amd.PDD_cdist.

set1 = amd.CifReader('set1.cif')
set2 = amd.CifReader('set2.cif')
amds1 = [amd.AMD(crystal, 100) for crystal in set1]
amds2 = [amd.AMD(crystal, 100) for crystal in set2]

# dm[i][j] = distance(amds1[i], amds2[j])
dm = amd.AMD_cdist(amds1, amds2)

Example: PDD-based dendrogram

This example compares some crystals in a cif by PDD (k = 100) and plots a single linkage dendrogram:

import amd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

crystals = list(amd.CifReader('crystals.cif'))
names = [crystal.name for crystal in crystals]
pdds = [amd.PDD(crystal, 100) for crystal in crystals]
cdm = amd.PDD_pdist(pdds)
Z = hierarchy.linkage(cdm, 'single')
dn = hierarchy.dendrogram(Z, labels=names)
plt.show()

Example: Finding n nearest neighbours in one set from another

This example finds the 10 nearest PDD-neighbours in set 2 for every crystal in set 1.

import numpy as np
import amd

n = 10
df = amd.compare('set1.cif', 'set2.cif', by='PDD', k=100)
dm = df.values

# Uses np.argpartition (partial argsort) and np.take_along_axis to find
# nearest neighbours of each item in set1. Works for any distance matrix.
nn_inds = np.array([np.argpartition(row, n)[:n] for row in dm])
nn_dists = np.take_along_axis(dm, nn_inds, axis=-1)
sorted_inds = np.argsort(nn_dists, axis=-1)
nn_inds = np.take_along_axis(nn_inds, sorted_inds, axis=-1)
nn_dists = np.take_along_axis(nn_dists, sorted_inds, axis=-1)

for i in range(len(dm)):
    print('neighbours of', df.index[i])
    for j in range(n):
        print('neighbour', j+1, df.columns[nn_inds[i][j]], 'dist:', nn_dists[i][j])
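The same nearest-neighbour selection can be checked on a small hand-made matrix. The sketch below swaps NumPy's argpartition for the standard library's heapq.nsmallest, which is slower on large matrices but easy to verify by eye; the function name `nearest_neighbours` and the matrix values are invented for illustration.

```python
import heapq

def nearest_neighbours(dm, n):
    """For each row, the column indices of the n smallest entries, nearest first."""
    return [heapq.nsmallest(n, range(len(row)), key=row.__getitem__) for row in dm]

# Toy distance matrix: 2 items in set 1 (rows), 4 items in set 2 (columns).
dm = [
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.4, 0.6],
]
nn = nearest_neighbours(dm, 2)
print(nn)  # [[1, 3], [0, 2]]
```

As with the NumPy version, n must not exceed the number of columns if exactly n neighbours per row are expected.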

Cite us

Use the following bib references to cite AMD or PDD.

Average minimum distances of periodic point sets - foundational invariants for mapping periodic crystals. MATCH Communications in Mathematical and in Computer Chemistry, 87(3), 529-559 (2022). https://doi.org/10.46793/match.87-3.529W.

@article{10.46793/match.87-3.529W,
  title = {Average Minimum Distances of periodic point sets - foundational invariants for mapping periodic crystals},
  author = {Widdowson, Daniel and Mosca, Marco M and Pulido, Angeles and Kurlin, Vitaliy and Cooper, Andrew I},
  journal = {MATCH Communications in Mathematical and in Computer Chemistry},
  doi = {10.46793/match.87-3.529W},
  volume = {87},
  number = {3},
  pages = {529-559},
  year = {2022}
}

Pointwise distance distributions of periodic point sets. arXiv preprint arXiv:2108.04798 (2021). https://arxiv.org/abs/2108.04798.

@misc{arXiv:2108.04798,
  author = {Widdowson, Daniel and Kurlin, Vitaliy},
  title = {Pointwise distance distributions of periodic point sets},
  year = {2021},
  eprint = {arXiv:2108.04798},
}
