
Atom-HiFi: atomistic high-fidelity representative-set selection framework


Atom-HiFi

Atomistic High-Fidelity representative-set selection framework.

Applications include:

  • MLIP training-set curation and active-learning loops
  • Chemical motif identification and distribution analysis
  • Diversity-aware structure sampling from large databases

What is Atom-HiFi?

Atom-HiFi finds the smallest subset S of a structure library that achieves high Fidelity: S covers the library's atomic-environment diversity efficiently, without redundancy. The method is agnostic to the downstream task.


Key concepts

Fidelity = L / R

Fidelity is the single optimisation objective. Like a HiFi audio system, it has two channels — L (Left) and R (Right) — whose ratio is maximised. High Fidelity means the selection is both faithful to the library distribution (high Likeness) and compact (low Redundancy).

L (Likeness) measures how faithfully S reproduces the library's atomic-environment distribution. Each atom is assigned to a microstate (a Voronoi cell of the k-means partition of whitened descriptor space); L is the ratio of the Shannon entropies of the subset and library microstate populations:

L = H(sub) / H(lib)      H = -Σ p_i ln p_i

Shannon entropy H measures distributional diversity — how evenly the population is spread across microstates. L = 1: S perfectly reproduces the library's diversity. L < 1: some environments are under-represented; e.g. L = 0.95 means S retains 95% of the library's distributional diversity.
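As a minimal sketch of the formula above, the snippet below computes L from per-atom microstate labels. The label lists are toy data and the function names are illustrative, not part of the Atom-HiFi API; the sketch also assumes the library occupies more than one microstate, so that H(lib) > 0.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """H = -sum_i p_i ln p_i over the microstate populations in `labels`."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def likeness(sub_labels, lib_labels):
    """L = H(sub) / H(lib); assumes H(lib) > 0."""
    return shannon_entropy(sub_labels) / shannon_entropy(lib_labels)

# Toy library: 4 microstates, evenly populated, so H(lib) = ln 4.
lib_labels = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5

# One atom per microstate: as evenly spread as the library, so L = 1.
full_cover = [0, 1, 2, 3]

# Microstate 3 is missing: H(sub) = ln 3, so L = ln 3 / ln 4 (about 0.79).
partial_cover = [0, 1, 2]

L_full = likeness(full_cover, lib_labels)
L_partial = likeness(partial_cover, lib_labels)
print(f"L_full = {L_full:.3f}, L_partial = {L_partial:.3f}")
```

Because H depends only on the relative populations, a subset far smaller than the library can still reach L = 1.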

R — Redundancy measures how many atoms are packed per occupied microstate, relative to the full library:

R = (N_sub / k_occ^sub) / (N_lib / k_occ^lib)

R = 1: same atoms-per-microstate density as the full library (no compression). R < 1: redundancy has been removed; e.g. R = 0.4 means S holds 40% as many atoms per occupied microstate as the library, a 60% reduction in per-microstate density.
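The redundancy ratio can be checked by hand on toy data. The sketch below follows the formula for R given above; the label lists and helper names are made up for illustration and are not taken from the package.

```python
def redundancy(sub_labels, lib_labels):
    """R = (N_sub / k_occ^sub) / (N_lib / k_occ^lib)."""
    sub_density = len(sub_labels) / len(set(sub_labels))  # atoms per occupied microstate in S
    lib_density = len(lib_labels) / len(set(lib_labels))  # same quantity for the library
    return sub_density / lib_density

def fidelity(L, R):
    """Fidelity = L / R."""
    return L / R

# Library: 40 atoms over 4 microstates (10 atoms per microstate).
lib_labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
# Subset: 16 atoms over the same 4 microstates (4 atoms per microstate).
sub_labels = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4

R = redundancy(sub_labels, lib_labels)  # 4 / 10
# The subset is as evenly spread as the library, so L = 1 in this toy case.
F = fidelity(1.0, R)
print(f"R = {R:.2f}, F = {F:.2f}")
```

Here every microstate stays occupied while the per-microstate density drops, which is the situation in which Fidelity rises above 1.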

The scan sweeps a bandwidth c (scaling factor on ε_noise) and finds c* that maximises Fidelity subject to L ≥ L_TOL (default 0.90). The optimal c* sits at the elbow of the L/R curve — the point where further reducing redundancy begins to cost meaningful distributional diversity.
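The selection rule described above (maximise L/R among points with L ≥ L_TOL) can be sketched as follows. The scan values are invented for illustration and `pick_c_star` is a hypothetical helper; the real scan is produced by the package, not by this snippet.

```python
L_TOL = 0.90  # default tolerance quoted above

# Hypothetical scan results as (c, L, R) triples; the numbers are made up.
scan = [
    (0.5, 0.99, 0.90),
    (1.0, 0.97, 0.60),
    (1.5, 0.94, 0.45),
    (2.0, 0.91, 0.40),
    (2.5, 0.85, 0.30),  # highest L/R, but violates L >= L_TOL
]

def pick_c_star(scan, l_tol=L_TOL):
    """Return the (c, L, R) point with the largest L/R among those with L >= l_tol."""
    feasible = [point for point in scan if point[1] >= l_tol]
    if not feasible:
        raise ValueError("no scan point satisfies L >= l_tol")
    return max(feasible, key=lambda point: point[1] / point[2])

c_star, L_star, R_star = pick_c_star(scan)
print(f"c* = {c_star}, L = {L_star}, R = {R_star}, F = {L_star / R_star:.3f}")
```

In this toy scan the unconstrained maximum of L/R sits at c = 2.5, so the L_TOL constraint is what keeps the selection from trading away too much diversity.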

ED-SOAP descriptor

Embedded Double SOAP — two concatenated SOAP power-spectrum vectors per atom: one short-range (bonding geometry) and one long-range (coordination shell), normalised by a system-specific lengthscale. No GPU required. The full parameter set is exposed in hifi_workflow_tutorial.py under the EDS_* variables.


Installation

Step 1 — install decaf (Descriptor Embedding and Clustering for Atomistic-environment Framework — the clustering backend; not on PyPI):

pip install git+https://gitlab.mpcdf.mpg.de/klai/decaf.git

Step 2 — install Atom-HiFi:

pip install atom-hifi

Python ≥ 3.9 required.


Quick start

pip install atom-hifi installs the atom-hifi command. Write a starter config, edit it, and run:

atom-hifi init                 # writes a commented config.yaml
# edit config.yaml (at minimum: paths.lib_path, paths.focus_elements)
atom-hifi run config.yaml 2>&1 | tee run.out

The generated config.yaml documents every setting inline. The minimum to edit:

paths:
  lib_path: train_structs.xyz   # ASE-readable structure library
  focus_elements: [Ni, O]       # elements to cluster on
  output_dir: fr_results
descriptor:
  kind: eds                     # 'eds' or 'ace'

Python API / custom descriptors

The CLI supports the eds and ace descriptors. A custom descriptor is a Python callable and is supplied via the Python API. hifi_workflow_tutorial.py is the annotated example (included in the repo; pip-only users can fetch it):

curl -O https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi/-/raw/main/hifi_workflow_tutorial.py

Edit its top-level variables (including DESCRIPTOR_FN) and run python hifi_workflow_tutorial.py, or call the runner directly:

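from atom_hifi.runner import run

# my_descriptor_fn is your own descriptor callable; see DESCRIPTOR_FN in
# hifi_workflow_tutorial.py for the annotated example.
run({'paths': {'lib_path': 'train_structs.xyz', 'focus_elements': ['Ni', 'O']},
     'descriptor': {'kind': 'custom', 'custom_fn': my_descriptor_fn}})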

Output files

File                     Description
-----------------------  ------------------------------------------------------------
representatives.xyz      Selected representative structures
fine_scan.out            L, R, F (= L/R), |S|, atoms for every fine-scan point
hifi_final.png           Coarse + fine Fidelity (F = L/R) scan diagnostic plot
learning_curve.png       AL loop convergence (only with RUN_LOOP=True)
eps_noise_raw.npz        Cached per-element ε_noise values
desc_lib.pkl             Cached per-structure descriptors
surroundings_{el}.xyz    Per-group coordination spheres (EXTRACT_SURROUNDINGS=True)

Configuration reference

All settings live in config.yaml (run atom-hifi init to generate a fully commented template). Keys are grouped:

Group                  Keys
---------------------  ------------------------------------------------------------
paths                  lib_path, patient_path, focus_elements, output_dir
descriptor             kind, eds.{lengthscale, s_cut, s_nmax, s_lmax, l_cut, l_nmax, l_lmax, periodic, r_cut}, ace.{model_path, device, r_cut}
selection              method (mu_tiebreak recommended)
scan                   l_tol, n_coarse, n_fine, n_jobs, c_factor_range
eps_noise              per_species, temperature (K; sets σ_thermal ∝ √T/√mass for ε_noise calibration)
loop / grid / nsga2    run + per-stage tuning
refit                  delta, grid_point
output                 delta_pick, extract_surroundings

Unknown keys are rejected. The same configuration can be passed as a nested dict to atom_hifi.runner.run(...); hifi_workflow_tutorial.py is the annotated Python-API equivalent.
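The eps_noise temperature key sets σ_thermal ∝ √T/√mass. As a rough illustration of that scaling only, the sketch below compares relative values between elements and temperatures. The prefactor is not given here, so only ratios are meaningful; the function name is hypothetical and the masses are standard atomic masses in u.

```python
import math

# Standard atomic masses in u (rounded).
MASS = {"Ni": 58.693, "O": 15.999}

def relative_sigma_thermal(temperature_k, mass_u):
    """Unnormalised sqrt(T / m); the proportionality constant is omitted."""
    return math.sqrt(temperature_k / mass_u)

T = 300.0
sigma = {el: relative_sigma_thermal(T, m) for el, m in MASS.items()}

# Lighter atoms get a larger thermal width: ratio = sqrt(m_Ni / m_O), about 1.9.
ratio_O_to_Ni = sigma["O"] / sigma["Ni"]

# Doubling the temperature scales the width by sqrt(2) for any element.
ratio_600_to_300 = relative_sigma_thermal(600.0, MASS["Ni"]) / sigma["Ni"]

print(f"O/Ni = {ratio_O_to_Ni:.3f}, 600K/300K = {ratio_600_to_300:.3f}")
```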


Advanced usage

Active-learning loop (RUN_LOOP=True)

Iteratively expands the training pool by sampling batches from the full library. Inner iterations use a coarse scan only; one final fine scan runs at the end. Set INITIAL_SAMPLE and LOOP_SKIP_FINE_SCAN to control the initial pool size and inner-scan resolution.

Per-element ND grid scan (RUN_GRID_SCAN=True)

Sweeps independent c-factors per focus element on a Cartesian grid, reusing cached per-element DECAF fits from the 1-D scan. Cost is O(n^N_el) cover evaluations instead of O(n^N_el × N_el) DECAF fits — tractable for N_el ≤ 3–4. Results in scan_grid.csv and scan_grid_report.png.
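To make the cost statement concrete, here is a small sketch of a Cartesian grid of per-element c-factors. The element list and candidate values are illustrative and the variable names are not taken from the package; the two counts simply evaluate the expressions quoted above.

```python
import itertools

focus_elements = ["Ni", "O"]      # N_el = 2
c_values = [0.5, 1.0, 1.5, 2.0]   # n = 4 candidate c-factors per element (illustrative)

# One grid point per combination of a c-factor for each focus element.
grid = [
    dict(zip(focus_elements, combo))
    for combo in itertools.product(c_values, repeat=len(focus_elements))
]

n, n_el = len(c_values), len(focus_elements)
cover_evaluations = len(grid)      # n ** n_el
naive_fits = n ** n_el * n_el      # cost if every grid point refit every element

print(grid[0], grid[-1])
print(f"{cover_evaluations} cover evaluations vs {naive_fits} fits without caching")
```

The grid size grows as n^N_el, which is why the approach is described as tractable only for a small number of focus elements.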

NSGA-II Pareto optimisation (RUN_NSGA2=True)

Stochastic multi-objective optimisation of per-element c-factors via NSGA-II (requires pymoo). Use when the grid is too large (N_el ≥ 4) or you want a continuous Pareto front. Results in pareto_front.csv and three diagnostic PNGs.
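For readers unfamiliar with Pareto fronts, the sketch below filters non-dominated points by brute force. It is not NSGA-II and does not use pymoo. It also assumes the two objectives are "maximise L" and "minimise R", which the text above does not state explicitly; the candidate values are invented.

```python
def pareto_front(points):
    """Return the (L, R) points not dominated by any other point.

    A point j dominates i if it is no worse in both objectives
    (L_j >= L_i and R_j <= R_i) and strictly better in at least one.
    """
    front = []
    for i, (L_i, R_i) in enumerate(points):
        dominated = any(
            (L_j >= L_i and R_j <= R_i) and (L_j > L_i or R_j < R_i)
            for j, (L_j, R_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((L_i, R_i))
    return front

# Hypothetical (L, R) outcomes for five per-element c-factor settings.
candidates = [(0.99, 0.90), (0.97, 0.60), (0.95, 0.65), (0.91, 0.40), (0.90, 0.45)]
front = pareto_front(candidates)
print(front)
```

Each point on the front is a different trade-off between likeness and redundancy; an evolutionary optimiser such as NSGA-II approximates the same set without enumerating a full grid.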

Representative environment extraction (EXTRACT_SURROUNDINGS=True)

Exports the local coordination sphere around the centroid-closest atom of each DECAF group. Two modes: 'sphere' (non-periodic ASE Atoms cluster) and 'full_structure' (original cell with center/neighbour/rest tags). Output: surroundings_{el}.xyz per focus element.
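The "centroid-closest atom" of a group can be pictured with the sketch below. It assumes the centroid is the mean descriptor vector of the group and that closeness is Euclidean distance in descriptor space; both are assumptions made for illustration, and the toy 2-D vectors stand in for real descriptors.

```python
import math

def centroid_closest(descriptors):
    """Index of the vector nearest (Euclidean) to the group's mean vector."""
    n = len(descriptors)
    dim = len(descriptors[0])
    centroid = [sum(vec[d] for vec in descriptors) / n for d in range(dim)]
    return min(range(n), key=lambda i: math.dist(descriptors[i], centroid))

# Toy group of four atoms with 2-D descriptors; the mean is (0.975, 0.075).
group = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.2], [2.0, 0.1]]
idx = centroid_closest(group)
print(f"representative atom index: {idx}")
```

The atom picked this way is the one whose environment would then be exported, in either of the two modes described above.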


Citation

If you use Atom-HiFi in your research, please cite:

[paper in preparation — citation will be added upon publication]
