Atom-HiFi: atomistic high-fidelity representative-set selection framework
Applications include:
- MLIP training-set curation and active-learning loops
- Chemical motif identification and distribution analysis
- Diversity-aware structure sampling from large databases
What is Atom-HiFi?
Atom-HiFi finds the smallest subset S of a structure library that achieves high Fidelity: S covers the library's atomic-environment diversity efficiently, without redundancy. The selection is agnostic to the downstream task.
Key concepts
Fidelity = L / R
Fidelity is the single optimisation objective. Like a HiFi audio system, it has two channels — L (Left) and R (Right) — whose ratio is maximised. High Fidelity means the selection is both faithful to the library distribution (high Likeness) and compact (low Redundancy).
L — Likeness measures how faithfully S reproduces the library's atomic-environment distribution. Each atom is assigned to a microstate (Voronoi cell in whitened descriptor space from k-means); L is the Shannon entropy ratio over those populations:
L = H(sub) / H(lib)
H = -Σ_i p_i ln p_i
Shannon entropy H measures distributional diversity — how evenly the population is spread across microstates. L = 1: S perfectly reproduces the library's diversity. L < 1: some environments are under-represented; e.g. L = 0.95 means S retains 95% of the library's distributional diversity.
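The entropy ratio can be sketched in a few lines of standard-library Python. This is an illustrative stand-alone calculation, not Atom-HiFi's internal code: the function names `shannon_entropy` and `likeness` and the toy labels are assumptions, and the microstate labels are taken as given (for example, k-means cluster indices).

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """H = -sum_i p_i ln p_i over the microstate populations in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def likeness(sub_labels, lib_labels):
    """L = H(sub) / H(lib)."""
    return shannon_entropy(sub_labels) / shannon_entropy(lib_labels)


# Hypothetical toy library: 24 atoms spread evenly over 4 microstates.
lib_labels = [0] * 6 + [1] * 6 + [2] * 6 + [3] * 6
sub_labels = [0, 1, 2, 3]       # one atom per microstate
narrow_labels = [0, 0, 1, 1]    # microstates 2 and 3 are missing

print(likeness(sub_labels, lib_labels))      # even coverage of all four microstates
print(likeness(narrow_labels, lib_labels))   # lower, because two microstates are absent
```

In this toy case the four-atom subset reproduces the library's entropy exactly, while the subset that misses two microstates retains only about half of it. The sketch does not guard against a library with a single microstate, where H(lib) = 0.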
R — Redundancy measures how many atoms are packed per occupied microstate, relative to the full library:
R = (N_sub / k_occ^sub) / (N_lib / k_occ^lib)
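R = 1: same atoms-per-microstate density as the full library (no compression). R < 1: redundancy has been removed; e.g. R = 0.4 means S holds 40% as many atoms per occupied microstate as the library. If S occupies the same microstates as the library, this corresponds to removing 60% of the atoms while preserving the occupied-microstate coverage.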
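As a worked example of the two formulas, the following sketch evaluates R and F = L / R for made-up counts. The function names and all numbers are illustrative assumptions, not values produced by Atom-HiFi.

```python
def redundancy(n_sub, k_occ_sub, n_lib, k_occ_lib):
    """R = (N_sub / k_occ^sub) / (N_lib / k_occ^lib)."""
    return (n_sub / k_occ_sub) / (n_lib / k_occ_lib)


def fidelity(likeness_value, redundancy_value):
    """F = L / R."""
    return likeness_value / redundancy_value


# Hypothetical counts: the library has 10000 atoms in 50 occupied microstates
# (200 atoms per microstate); the subset keeps 4000 atoms in the same 50
# microstates (80 atoms per microstate).
r = redundancy(n_sub=4000, k_occ_sub=50, n_lib=10000, k_occ_lib=50)
f = fidelity(0.95, r)   # L = 0.95 is assumed here, not computed

print(r)
print(f)
```

With these counts R comes out at 0.4, and an assumed L of 0.95 gives a Fidelity of roughly 2.4.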
The scan sweeps a bandwidth c (scaling factor on ε_noise) and finds c* that maximises Fidelity subject to L ≥ L_TOL (default 0.90). The optimal c* sits at the elbow of the L/R curve — the point where further reducing redundancy begins to cost meaningful distributional diversity.
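The constrained selection of c* can be illustrated with a small sketch. It only shows the final step (the feasible point with the highest F), not the coarse and fine scan passes themselves; `pick_c_star` and the scan values are hypothetical.

```python
def pick_c_star(scan_points, l_tol=0.90):
    """Return the (c, L, R) point with the highest F = L / R among points
    that satisfy L >= l_tol, or None if no point is feasible."""
    feasible = [(c, L, R) for c, L, R in scan_points if L >= l_tol]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p[1] / p[2])


# Hypothetical scan results as (c, L, R) tuples.
scan = [
    (0.5, 0.99, 0.90),
    (1.0, 0.97, 0.60),
    (1.5, 0.93, 0.40),
    (2.0, 0.85, 0.30),   # highest F overall, but violates L >= 0.90
]

best = pick_c_star(scan)
print(best)
```

Here the point at c = 2.0 has the largest L/R ratio but is excluded by the L ≥ L_TOL constraint, so the sketch returns the point at c = 1.5.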
ED-SOAP descriptor
Embedded Double SOAP — two concatenated SOAP power-spectrum vectors per atom: one short-range
(bonding geometry) and one long-range (coordination shell), normalised by a
system-specific lengthscale. No GPU required. The full parameter set is
exposed in hifi_workflow_tutorial.py under the EDS_* variables.
Installation
Step 1 — install decaf (Descriptor Embedding and Clustering for
Atomistic-environment Framework — the clustering backend; not on PyPI):
pip install git+https://gitlab.mpcdf.mpg.de/klai/decaf.git
Step 2 — install Atom-HiFi:
pip install atom-hifi
Python ≥ 3.9 required.
Quick start
pip install atom-hifi installs the atom-hifi command. Write a starter config,
edit it, and run:
atom-hifi init # writes a commented config.yaml
# edit config.yaml (at minimum: paths.lib_path, paths.focus_elements)
atom-hifi run config.yaml 2>&1 | tee run.out
The generated config.yaml documents every setting inline. The minimum to edit:
paths:
  lib_path: train_structs.xyz    # ASE-readable structure library
  focus_elements: [Ni, O]        # elements to cluster on
  output_dir: fr_results
descriptor:
  kind: eds                      # 'eds' or 'ace'
Python API / custom descriptors
The CLI supports the eds and ace descriptors. A custom descriptor is a
Python callable and is supplied via the Python API. hifi_workflow_tutorial.py
is the annotated example (included in the repo; pip-only users can fetch it):
curl -O https://gitlab.mpcdf.mpg.de/yhsong/atom-hifi/-/raw/main/hifi_workflow_tutorial.py
Edit its top-level variables (including DESCRIPTOR_FN) and run
python hifi_workflow_tutorial.py, or call the runner directly:
from atom_hifi.runner import run

# my_descriptor_fn is your own descriptor callable (cf. DESCRIPTOR_FN in the tutorial)
run({'paths': {'lib_path': 'train_structs.xyz', 'focus_elements': ['Ni', 'O']},
     'descriptor': {'kind': 'custom', 'custom_fn': my_descriptor_fn}})
Output files
| File | Description |
|---|---|
| representatives.xyz | Selected representative structures |
| fine_scan.out | L, R, F (=L/R), \|S\|, atoms for every fine-scan point |
| hifi_final.png | Coarse + fine Fidelity (F = L/R) scan diagnostic plot |
| learning_curve.png | AL loop convergence (only with RUN_LOOP=True) |
| eps_noise_raw.npz | Cached per-element ε_noise values |
| desc_lib.pkl | Cached per-structure descriptors |
| surroundings_{el}.xyz | Per-group coordination spheres (EXTRACT_SURROUNDINGS=True) |
Configuration reference
All settings live in config.yaml (run atom-hifi init to generate a fully
commented template). Keys are grouped:
| Group | Keys |
|---|---|
| paths | lib_path, patient_path, focus_elements, output_dir |
| descriptor | kind, eds.{lengthscale, s_cut, s_nmax, s_lmax, l_cut, l_nmax, l_lmax, periodic, r_cut}, ace.{model_path, device, r_cut} |
| selection | method (mu_tiebreak recommended) |
| scan | l_tol, n_coarse, n_fine, n_jobs, c_factor_range |
| eps_noise | per_species, temperature (K; sets σ_thermal ∝ √T/√mass for ε_noise calibration) |
| loop / grid / nsga2 | run + per-stage tuning |
| refit | delta, grid_point |
| output | delta_pick, extract_surroundings |
Unknown keys are rejected. The same configuration can be passed as a nested dict
to atom_hifi.runner.run(...); hifi_workflow_tutorial.py is the annotated
Python-API equivalent.
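A nested-dict configuration might look like the sketch below. It uses only key names that appear in the table above and in the Quick start; the exact accepted values are an assumption, and the `run` call is left commented out so the snippet stands alone without the package installed.

```python
# Group and key names are taken from the configuration table; the values
# repeat the Quick start example or the documented default where one exists.
config = {
    'paths': {
        'lib_path': 'train_structs.xyz',    # ASE-readable structure library
        'focus_elements': ['Ni', 'O'],      # elements to cluster on
        'output_dir': 'fr_results',
    },
    'descriptor': {'kind': 'eds'},          # 'eds' or 'ace'
    'selection': {'method': 'mu_tiebreak'}, # listed as the recommended method
    'scan': {'l_tol': 0.90},                # L_TOL, documented default 0.90
}

# With atom-hifi installed, the dict would be handed to the runner:
# from atom_hifi.runner import run
# run(config)
```

Because unknown keys are rejected, it is safer to start from the template written by atom-hifi init and copy key names from there than to type them from memory.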
Advanced usage
Active-learning loop (RUN_LOOP=True)
Iteratively expands the training pool by sampling batches from the full library.
Inner iterations use a coarse scan only; one final fine scan runs at the end.
Set INITIAL_SAMPLE and LOOP_SKIP_FINE_SCAN to control the initial pool size
and inner-scan resolution.
Per-element ND grid scan (RUN_GRID_SCAN=True)
Sweeps independent c-factors per focus element on a Cartesian grid, reusing
cached per-element DECAF fits from the 1-D scan. Cost is O(n^N_el) cover
evaluations instead of O(n^N_el × N_el) DECAF fits — tractable for N_el ≤ 3–4.
Results in scan_grid.csv and scan_grid_report.png.
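The n^N_el growth of the Cartesian grid can be seen in a short sketch. The helper `c_factor_grid` and the c-values are hypothetical; the snippet only enumerates grid points and performs no cover evaluations.

```python
from itertools import product


def c_factor_grid(elements, c_values):
    """Yield one {element: c} dict per point of the Cartesian grid."""
    for combo in product(c_values, repeat=len(elements)):
        yield dict(zip(elements, combo))


elements = ['Ni', 'O']             # focus elements, N_el = 2
c_values = [0.5, 1.0, 1.5, 2.0]    # hypothetical c-factors, n = 4 per element

grid = list(c_factor_grid(elements, c_values))
print(len(grid))    # n ** N_el = 4 ** 2 = 16 grid points
print(grid[0])      # first point: both elements at the smallest c-factor
```

Adding a third focus element with the same four c-values would raise the count to 64 points, which is why the grid is practical only for a few elements.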
NSGA-II Pareto optimisation (RUN_NSGA2=True)
Stochastic multi-objective optimisation of per-element c-factors via NSGA-II
(requires pymoo). Use when the grid is too large (N_el ≥ 4) or you want a
continuous Pareto front. Results in pareto_front.csv and three diagnostic
PNGs.
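To illustrate what a Pareto front means here, the sketch below applies a brute-force non-dominated filter to a handful of points. This is not NSGA-II and does not use pymoo; it also assumes the two objectives are Likeness (to maximise) and Redundancy (to minimise), which the text above does not state explicitly. All names and numbers are hypothetical.

```python
def dominates(a, b):
    """True if a = (L, R) is at least as good as b in both objectives
    (higher L, lower R) and strictly better in at least one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])


def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (L, R) pairs, one per candidate set of c-factors.
candidates = [
    (0.99, 0.90),
    (0.97, 0.60),
    (0.96, 0.70),   # dominated by (0.97, 0.60)
    (0.93, 0.40),
    (0.90, 0.45),   # dominated by (0.93, 0.40)
]

front = pareto_front(candidates)
print(front)
```

The filter is quadratic in the number of points, which is fine for inspecting a results table but not a substitute for an evolutionary search over continuous c-factors.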
Representative environment extraction (EXTRACT_SURROUNDINGS=True)
Exports the local coordination sphere around the centroid-closest atom of each
DECAF group. Two modes: 'sphere' (non-periodic ASE Atoms cluster) and
'full_structure' (original cell with center/neighbour/rest tags). Output:
surroundings_{el}.xyz per focus element.
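Picking the centroid-closest atom of each group can be sketched as follows. This is a plain-Python illustration with a Euclidean distance on toy 2-D descriptors; `centroid_closest`, the descriptors, and the labels are assumptions and not the package's implementation.

```python
import math


def centroid_closest(descriptors, labels):
    """For each group label, return the index of the member whose
    descriptor is closest (Euclidean) to the group centroid."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)

    reps = {}
    for lab, members in groups.items():
        dim = len(descriptors[members[0]])
        centroid = [
            sum(descriptors[i][d] for i in members) / len(members)
            for d in range(dim)
        ]
        reps[lab] = min(members, key=lambda i: math.dist(descriptors[i], centroid))
    return reps


# Hypothetical 2-D descriptors for six atoms in two groups.
descriptors = [
    (0.0, 0.0), (1.0, 0.0), (4.0, 0.0),        # group 0
    (10.0, 10.0), (10.0, 13.0), (10.0, 11.0),  # group 1
]
labels = [0, 0, 0, 1, 1, 1]

reps = centroid_closest(descriptors, labels)
print(reps)
```

The returned indices identify the atoms whose surroundings would be exported; in the toy data these are atom 1 for group 0 and atom 5 for group 1.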
Citation
If you use Atom-HiFi in your research, please cite:
[paper in preparation — citation will be added upon publication]