Atom-HiFi: atomistic high-fidelity representative-set selection framework

Atom-HiFi

Atomistic High-Fidelity representative-set selection framework.

Applications include:

  • MLIP training-set curation and active-learning loops
  • Chemical motif identification and distribution analysis
  • Diversity-aware structure sampling from large databases

What is Atom-HiFi?

Atom-HiFi finds the smallest subset S of a structure library that achieves high Fidelity: S covers the library's atomic-environment diversity efficiently, without redundancy. The method is agnostic to the downstream task.


Key concepts

Fidelity = L / R

Fidelity is the single optimisation objective. Like a HiFi audio system, it has two channels — L (Left) and R (Right) — whose ratio is maximised. High Fidelity means the selection is both faithful to the library distribution (high Likeness) and compact (low Redundancy).

L (Likeness) measures how faithfully S reproduces the library's atomic-environment distribution. Each atom is assigned to a microstate, i.e. a Voronoi cell of a k-means partition in whitened descriptor space; L is the ratio of the Shannon entropies of the subset and library microstate populations:

L = H(sub) / H(lib)      H = -Σ p_i ln p_i

Shannon entropy H measures distributional diversity — how evenly the population is spread across microstates. L = 1: S perfectly reproduces the library's diversity. L < 1: some environments are under-represented; e.g. L = 0.95 means S retains 95% of the library's distributional diversity.
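As a minimal sketch of the definition above (not the package's own implementation), L can be computed from per-atom microstate labels. The function names and the toy label lists are illustrative assumptions; in Atom-HiFi the labels come from the clustering backend.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """H = -sum_i p_i ln p_i, with p_i the fraction of atoms in microstate i."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def likeness(sub_labels, lib_labels):
    """L = H(sub) / H(lib); undefined when the library has a single microstate."""
    h_lib = shannon_entropy(lib_labels)
    if h_lib == 0.0:
        raise ValueError("library entropy is zero (only one microstate)")
    return shannon_entropy(sub_labels) / h_lib

# Toy data (hypothetical): four equally populated microstates in the library,
# one of which is missing from the subset.
lib_labels = [0, 0, 1, 1, 2, 2, 3, 3]
sub_labels = [0, 1, 2]

L_val = likeness(sub_labels, lib_labels)
print(f"L = {L_val:.3f}")  # ln(3) / ln(4) -> L = 0.792
```

Here the subset misses one of four equally likely microstates, so L drops to ln 3 / ln 4 ≈ 0.79, below the default tolerance of 0.90.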

R (Redundancy) measures how many atoms are packed per occupied microstate, relative to the full library:

R = (N_sub / k_occ^sub) / (N_lib / k_occ^lib)

R = 1: same atoms-per-microstate density as the full library (no compression). R < 1: redundancy has been removed; e.g. R = 0.4 means S holds 40% as many atoms per occupied microstate as the library. When S occupies the same microstates as the library, this amounts to discarding 60% of the atoms.
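The R formula can be checked on a toy example. This is a sketch of the arithmetic only; the function name and the label lists are assumptions made for illustration.

```python
def redundancy(sub_labels, lib_labels):
    """R = (N_sub / k_occ_sub) / (N_lib / k_occ_lib).

    N is the number of atoms and k_occ the number of occupied microstates.
    """
    density_sub = len(sub_labels) / len(set(sub_labels))
    density_lib = len(lib_labels) / len(set(lib_labels))
    return density_sub / density_lib

# Toy data (hypothetical): 10 library atoms in 3 microstates,
# 4 subset atoms covering the same 3 microstates.
lib_labels = [0] * 6 + [1] * 3 + [2] * 1
sub_labels = [0, 0, 1, 2]

R_val = redundancy(sub_labels, lib_labels)
print(f"R = {R_val:.2f}")  # (4/3) / (10/3) -> R = 0.40
```

Because both sets occupy the same three microstates, R reduces to N_sub / N_lib = 4 / 10.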

The scan sweeps a bandwidth c (scaling factor on ε_noise) and finds c* that maximises Fidelity subject to L ≥ L_TOL (default 0.90). The optimal c* sits at the elbow of the L/R curve — the point where further reducing redundancy begins to cost meaningful distributional diversity.
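The constrained selection of c* can be sketched as follows. The scan values are made-up numbers, and pick_c_star is a hypothetical helper, not part of the atom_hifi API; the real scan obtains L and R from a clustering fit at each c.

```python
L_TOL = 0.90  # default tolerance quoted above

def pick_c_star(scan, l_tol=L_TOL):
    """Return (c*, F*) maximising F = L / R among points with L >= l_tol.

    scan is an iterable of (c, L, R) tuples.
    """
    feasible = [(L / R, c) for c, L, R in scan if L >= l_tol and R > 0]
    if not feasible:
        raise ValueError("no scan point satisfies L >= l_tol")
    fidelity, c_star = max(feasible)
    return c_star, fidelity

# Hypothetical scan results: (c, L, R)
scan = [
    (0.5, 0.99, 0.90),
    (1.0, 0.97, 0.60),
    (1.5, 0.93, 0.42),
    (2.0, 0.85, 0.30),  # highest L/R, but violates L >= 0.90
]

c_star, f_star = pick_c_star(scan)
print(c_star, round(f_star, 3))  # 1.5 2.214
```

The last point has the largest raw ratio, yet it is excluded because its Likeness falls below the tolerance; this is what keeps the optimum at the elbow rather than at maximal compression.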

ED-SOAP descriptor

Embedded Double SOAP — two concatenated SOAP power-spectrum vectors per atom: one short-range (bonding geometry) and one long-range (coordination shell), normalised by a system-specific lengthscale. No GPU required. The full parameter set is exposed in hifi_workflow_tutorial.py under the EDS_* variables.


Installation

Step 1 — install decaf (Descriptor Embedding and Clustering for Atomistic-environment Framework — the clustering backend; not on PyPI):

pip install git+https://gitlab.mpcdf.mpg.de/klai/decaf.git

Step 2 — install Atom-HiFi:

pip install atom-hifi

Python ≥ 3.9 required.


Quick start

pip install atom-hifi installs the atom-hifi command. Write a starter config, edit it, and run:

atom-hifi init                 # writes a commented config.yaml
# edit config.yaml (at minimum: paths.lib_path, paths.focus_elements)
atom-hifi run config.yaml 2>&1 | tee run.out

The generated config.yaml documents every setting inline. The minimum to edit:

paths:
  lib_path: train_structs.xyz   # ASE-readable structure library
  focus_elements: [Ni, O]       # elements to cluster on
  output_dir: fr_results
descriptor:
  kind: eds                     # 'eds' or 'ace'

Python API / custom descriptors

The CLI supports the eds and ace descriptors. A custom descriptor is a Python callable and is supplied via the Python API. hifi_workflow_tutorial.py is the annotated example (included in the repo; pip-only users can fetch it):

curl -O https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi/-/raw/main/hifi_workflow_tutorial.py

Edit its top-level variables (including DESCRIPTOR_FN) and run python hifi_workflow_tutorial.py, or call the runner directly:

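from atom_hifi.runner import run

# my_descriptor_fn is your own descriptor callable (cf. DESCRIPTOR_FN in hifi_workflow_tutorial.py)
run({'paths': {'lib_path': 'train_structs.xyz', 'focus_elements': ['Ni', 'O']},
     'descriptor': {'kind': 'custom', 'custom_fn': my_descriptor_fn}})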

Output files

File                    Description
representatives.xyz     Selected representative structures
fine_scan.out           L, R, F (= L/R), |S|, atoms for every fine-scan point
hifi_final.png          Coarse + fine Fidelity (F = L/R) scan diagnostic plot
learning_curve.png      AL loop convergence (only with RUN_LOOP=True)
eps_noise_raw.npz       Cached per-element ε_noise values
desc_lib.pkl            Cached per-structure descriptors
surroundings_{el}.xyz   Per-group coordination spheres (only with EXTRACT_SURROUNDINGS=True)

Configuration reference

All settings live in config.yaml (run atom-hifi init to generate a fully commented template). Keys are grouped:

Group                   Keys
paths                   lib_path, patient_path, focus_elements, output_dir
descriptor              kind, eds.{lengthscale, s_cut, s_nmax, s_lmax, l_cut, l_nmax, l_lmax, periodic, r_cut}, ace.{model_path, device, r_cut}
selection               method (mu_tiebreak recommended)
scan                    l_tol, n_coarse, n_fine, n_jobs, c_factor_range
eps_noise               per_species, temperature (K; sets σ_thermal ∝ √T/√mass for ε_noise calibration)
loop / grid / nsga2     run + per-stage tuning
refit                   delta, grid_point
output                  delta_pick, extract_surroundings
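The scaling σ_thermal ∝ √T/√mass attached to the eps_noise temperature key can be illustrated with a short calculation. Only the proportionality is taken from this page; the prefactor is not stated here, so the sketch below computes relative values only, and relative_sigma is a hypothetical helper.

```python
import math

# Standard atomic masses in atomic mass units
MASS_AMU = {"Ni": 58.693, "O": 15.999}

def relative_sigma(temperature_k, mass_amu):
    """Thermal amplitude up to an unknown constant: sqrt(T / m)."""
    return math.sqrt(temperature_k / mass_amu)

# At equal temperature, the lighter element gets the larger amplitude.
ratio_o_ni = relative_sigma(300.0, MASS_AMU["O"]) / relative_sigma(300.0, MASS_AMU["Ni"])
print(f"sigma(O) / sigma(Ni) = {ratio_o_ni:.2f}")  # sqrt(58.693 / 15.999) -> 1.92

# Doubling the temperature scales the amplitude by sqrt(2) for any element.
ratio_t = relative_sigma(600.0, MASS_AMU["Ni"]) / relative_sigma(300.0, MASS_AMU["Ni"])
```

Under this scaling, oxygen receives roughly twice the thermal amplitude of nickel at the same temperature, which suggests why a per_species setting can matter for mixed-mass systems.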

Unknown keys are rejected. The same configuration can be passed as a nested dict to atom_hifi.runner.run(...); hifi_workflow_tutorial.py is the annotated Python-API equivalent.
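Since unknown keys are rejected, a typo in a config can be caught before launching a long run. The following pre-flight check is a hypothetical sketch, not Atom-HiFi's own validator: it covers only the groups whose keys are fully listed above and ignores every other group.

```python
# Subset of the configuration groups, copied from the reference above.
ALLOWED = {
    "paths": {"lib_path", "patient_path", "focus_elements", "output_dir"},
    "scan": {"l_tol", "n_coarse", "n_fine", "n_jobs", "c_factor_range"},
    "eps_noise": {"per_species", "temperature"},
    "refit": {"delta", "grid_point"},
    "output": {"delta_pick", "extract_surroundings"},
}

def unknown_keys(config):
    """Return 'group.key' strings for keys not listed in ALLOWED.

    Groups that are absent from ALLOWED are skipped, not flagged.
    """
    problems = []
    for group, settings in config.items():
        allowed = ALLOWED.get(group)
        if allowed is None:
            continue
        problems.extend(f"{group}.{key}" for key in settings if key not in allowed)
    return problems

config = {
    "paths": {"lib_path": "train_structs.xyz", "lib_pth": "typo.xyz"},
    "scan": {"l_tol": 0.9},
    "descriptor": {"kind": "eds"},  # not covered by this sketch, so ignored
}
print(unknown_keys(config))  # ['paths.lib_pth']
```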


Advanced usage

Active-learning loop (RUN_LOOP=True)

Iteratively expands the training pool by sampling batches from the full library. Inner iterations use a coarse scan only; one final fine scan runs at the end. Set INITIAL_SAMPLE and LOOP_SKIP_FINE_SCAN to control the initial pool size and inner-scan resolution.

Per-element ND grid scan (RUN_GRID_SCAN=True)

Sweeps independent c-factors per focus element on a Cartesian grid, reusing cached per-element DECAF fits from the 1-D scan. Cost is O(n^N_el) cover evaluations instead of O(n^N_el × N_el) DECAF fits — tractable for N_el ≤ 3–4. Results in scan_grid.csv and scan_grid_report.png.
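The growth of the grid with the number of focus elements can be seen with itertools.product. The c values and element list are illustrative, and the fit counts simply restate the scaling quoted above; the assumption that the naive approach needs one fit per element per grid point is read from the O(n^N_el × N_el) expression.

```python
from itertools import product

c_values = [0.5, 1.0, 1.5, 2.0]   # n = 4 candidate c-factors (hypothetical)
focus_elements = ["Ni", "O"]      # N_el = 2

# Cartesian grid: one independent c-factor per focus element.
grid = [
    dict(zip(focus_elements, combo))
    for combo in product(c_values, repeat=len(focus_elements))
]

n_points = len(grid)                            # n ** N_el cover evaluations
n_fits_naive = n_points * len(focus_elements)   # n ** N_el * N_el fits if nothing is cached

print(n_points, n_fits_naive)  # 16 32
print(grid[0], grid[-1])
```

With four elements the same n gives 4^4 = 256 grid points, which is where a stochastic search starts to look attractive.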

NSGA-II Pareto optimisation (RUN_NSGA2=True)

Stochastic multi-objective optimisation of per-element c-factors via NSGA-II (requires pymoo). Use when the grid is too large (N_el ≥ 4) or you want a continuous Pareto front. Results in pareto_front.csv and three diagnostic PNGs.
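What a Pareto front means here can be shown with a plain non-dominated filter. This is not NSGA-II and does not use pymoo; it only illustrates the dominance relation that such an optimiser approximates. Treating the objectives as "maximise L, minimise R" is an assumption based on the Fidelity definition, and the candidate points are made up.

```python
def dominates(a, b):
    """a dominates b if it is no worse in both objectives and better in one.

    Points are (L, R) pairs: higher L is better, lower R is better.
    """
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (L, R) outcomes for different per-element c-factor choices
candidates = [(0.99, 0.90), (0.97, 0.60), (0.95, 0.65), (0.93, 0.42), (0.90, 0.45)]

front = pareto_front(candidates)
print(front)  # [(0.99, 0.9), (0.97, 0.6), (0.93, 0.42)]
```

The two dropped points are each beaten on both objectives by a neighbour, so no trade-off justifies choosing them.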

Representative environment extraction (EXTRACT_SURROUNDINGS=True)

Exports the local coordination sphere around the centroid-closest atom of each DECAF group. Two modes: 'sphere' (non-periodic ASE Atoms cluster) and 'full_structure' (original cell with center/neighbour/rest tags). Output: surroundings_{el}.xyz per focus element.
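Picking the centroid-closest atom of each group can be sketched as below. The descriptor vectors, group labels, and the function name are illustrative assumptions; Euclidean distance in descriptor space is also assumed, and the actual extraction then builds the coordination sphere around the chosen atom.

```python
import math
from collections import defaultdict

def centroid_closest(descriptors, groups):
    """Map each group label to the index of the atom nearest the group centroid."""
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)

    picks = {}
    for label, idxs in members.items():
        dim = len(descriptors[idxs[0]])
        # Component-wise mean of the group's descriptor vectors
        centroid = [sum(descriptors[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        picks[label] = min(idxs, key=lambda i: math.dist(descriptors[i], centroid))
    return picks

# Toy 2-D descriptors (hypothetical) for six atoms in two groups
descriptors = [(0.0, 0.0), (1.0, 0.0), (0.4, 0.1), (5.0, 5.0), (5.2, 5.1), (9.0, 9.0)]
groups = [0, 0, 0, 1, 1, 1]

picks = centroid_closest(descriptors, groups)
print(picks)  # {0: 2, 1: 4}
```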


Citation

If you use Atom-HiFi in your research, please cite:

[paper in preparation — citation will be added upon publication]
