Wyckoff Transformer is a machine learning model for generating crystal structures which are symmetric by design.
Wyckoff Transformer: Generation of Symmetric Crystals [ICML 2025]
Installation (PyPI)
WyFormer is published on PyPI. PyTorch is a dependency, so make sure the PyTorch build in your environment (CUDA/CPU) is the one you want, then install:
pip install wyckoff-transformer
Inference
The pre-trained models are published on HuggingFace. To use a HuggingFace model run:
wyformer-generate <output-file.json.gz> --hf-model <model-name>
Using a local model directory (must contain best_model_params.pt, config.yaml, and wyckoff_processor.json):
wyformer-generate <output-file.json.gz> --model-path runs/<run-id>
See this repository for a demo of standalone inference using the PyPI package as a library.
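The generated output is a gzipped JSON file. A minimal sketch for inspecting it with the standard library (the exact record schema depends on the tokenizer configuration, so any field names are illustrative):

```python
import gzip
import json


def load_generated(path: str):
    """Load a wyformer-generate output file (gzipped JSON)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Example: count the generated structures.
# structures = load_generated("generated.json.gz")
# print(len(structures))
```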
WyFormer Generated Datasets
If you just need the generated datasets for benchmarking, they are available at Figshare, including both the original and the DFT-relaxed structures: WyFormer; WyFormer, DiffCSP(++), SymmCD, MiAD, WyCryst, CrystalFormer.
Abstract
Crystal symmetry plays a fundamental role in determining a crystal's physical, chemical, and electronic properties such as electrical and thermal conductivity, optical and polarization behavior, and mechanical strength. Almost all known crystalline materials have internal symmetry. However, this is often inadequately addressed by existing generative models, making the consistent generation of stable and symmetrically valid crystal structures a significant challenge. We introduce WyFormer, a generative model that directly tackles this by formally conditioning on space group symmetry. It achieves this by using Wyckoff positions as the basis for an elegant, compressed, and discrete structure representation. To model the distribution, we develop a permutation-invariant autoregressive model based on the Transformer encoder and the absence of positional encoding. Extensive experimentation demonstrates WyFormer's compelling combination of attributes: it achieves best-in-class symmetry-conditioned generation, incorporates a physics-motivated inductive bias, produces structures with competitive stability, predicts material properties with competitive accuracy even without atomic coordinates, and exhibits unparalleled inference speed.
Local development & training
Installation
- Clone the repository
- Run `uv venv --python 3.12`
- Install the dependencies, including torch. There are several options:
  - Manually install torch with your local flavour, e.g., `uv pip install torch --index-url https://download.pytorch.org/whl/cu130`, then run `uv pip install -e .`
  - Configure `uv.toml` with your desired indices; see `uv.toml.local` and `uv.toml.cpu`

The `wandb` library is used extensively and must be installed. Logging can be disabled via `WANDB_MODE=disabled`; otherwise, log into Wandb. Internally, we use `WANDB_ENTITY=symmetry-advantage`.
Running a pilot model
To verify that the installation is working, run a pilot model. Next token prediction:
python scripts/cache_a_dataset.py mp_20
python scripts/tokenise_a_dataset.py mp_20 yamls/tokenisers/mp_20_sg_multiplicity.yaml --new-tokenizer
python scripts/train.py yamls/models/NextToken/v6/base_sg.yaml mp_20 cuda --pilot
This will train a model and save the results in the runs folder. The files are:
- `best_model_params.pt` - the model weights chosen by the validation loss
- `config.yaml` - the configuration used for training
- `wyckoff_processor.json` - tokenizers and preprocessing metadata (token engineers)
- `generated_wp_no_calibration.json.gz` - Wyckoff representation of the generated structures (if configured to evaluate generation)
- `generated_wp_temperature_calibration.json.gz` - the same, with temperature calibration applied (if configured to evaluate generation)
Training Data Preprocessing
The available datasets correspond to the folders in data and cdvae/data. Dataset identifiers are the folder names; they are used throughout the project. Note that some of the folders are symlinks.
Available datasets (in GitHub): alex_mp_20, mp_20, mp_20_biternary (binary and ternary structures from MP-20), mpts_52, carbon_24, perov_5. It is also possible to download and use matbench_discovery_mp_2022 and matbench_discovery_mp_trj_full via the corresponding notebooks.
For any data to be used for training, we need to do two preprocessing steps.
Compute and cache symmetry information
python scripts/cache_a_dataset.py <dataset-name>
This will create a pickled representation of the dataset in cache/<dataset-name>/data.pkl.gz. The script supports setting the symmetry tolerance via a command-line option; this is not done automatically, and the datasets whose names include a tolerance were obtained by setting it manually.
Tokenization
The tokenization script serves two purposes: it produces the mapping from the real data to token ids, and saves the resulting tensors. To produce a new tokenizer:
python scripts/tokenise_a_dataset.py <dataset-name> <path-to-tokenizer-yaml> --new-tokenizer
Tokenizer configs are stored in yamls/tokenisers. The processor is saved to cache/<dataset-name>/tokenisers/**.json, preserving the folder structure of the config.
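The config-to-cache path mapping can be sketched as follows (a hypothetical helper for illustration, not part of the package):

```python
from pathlib import Path


def processor_cache_path(dataset: str, config_path: str) -> Path:
    """Map yamls/tokenisers/<...>.yaml to cache/<dataset>/tokenisers/<...>.json,
    preserving the folder structure under yamls/tokenisers."""
    rel = Path(config_path).relative_to("yamls/tokenisers")
    return Path("cache") / dataset / "tokenisers" / rel.with_suffix(".json")


# processor_cache_path("mp_20", "yamls/tokenisers/mp_20_sg_multiplicity.yaml")
# -> cache/mp_20/tokenisers/mp_20_sg_multiplicity.json
```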
Alternatively, you can use a cached tokeniser. This is important when a model that was trained on one dataset is applied to a different dataset.
python scripts/tokenise_a_dataset.py <dataset-name> <path-to-tokenizer-yaml> --tokenizer-path cache/<dataset-name>/tokenisers/<tokenizer-name>.json
Training
python scripts/train.py <path-to-model-yaml> <dataset-name> <device>
The model weights are saved to runs/<run-id> and to W&B, along with the processor metadata. See here for the list of configs. Adding --pilot will run the model for a small number of epochs.
Preparing Representative Checkpoints
To train and prepare representative checkpoints for datasets like alex_mp_20 or mp_20, you can follow this end-to-end pipeline. Please replace <dataset-name> with your target dataset (e.g., alex_mp_20 or mp_20).
First, cache and tokenize the dataset:
python scripts/cache_a_dataset.py <dataset-name>
python scripts/tokenise_a_dataset.py <dataset-name> yamls/tokenisers/<dataset-name>_sg_multiplicity.yaml --new-tokenizer
Before training, ensure that your model configuration file points to the correct tokenizer. For example, in yamls/models/NextToken/v6/base_sg_schedule_free.yaml, update the tokenizer name to match your dataset:
tokeniser:
name: <dataset-name>_sg_multiplicity
Then, initiate the training run:
python scripts/train.py yamls/models/NextToken/v6/base_sg_schedule_free.yaml <dataset-name> <device>
Generating structures
Wyckoff representations
Wyckoff representations are produced and stored in W&B during model training. The wyformer-generate CLI (installed with the package) can also generate them from a trained model. Using a HuggingFace model:
wyformer-generate <output-file> --hf-model SymmetryAdvantage/<model-name>
Using a W&B run:
wyformer-generate <output-file> --wandb-run <wandb-id> --use-cached-tensors
Using a local model directory (must contain best_model_params.pt, config.yaml, and wyckoff_processor.json):
wyformer-generate <output-file> --model-path runs/<run-id>
Note that the code does not automatically download Wandb artifacts; when restoring via run ID, you need to download them manually and place them in the runs folder.
To constrain generation to specific elements:
wyformer-generate <output-file> --hf-model SymmetryAdvantage/<model-name> --required-elements Li-S --allowed-elements Li-S-P-O
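Both flags take dash-separated element lists. A sketch of the constraint semantics as they naturally read (hypothetical helpers; the package's internal logic may differ): every required element must appear in a composition, and no element may fall outside the allowed set.

```python
def parse_elements(spec: str) -> frozenset:
    """Parse a dash-separated element list such as 'Li-S-P-O'."""
    return frozenset(spec.split("-"))


def satisfies_constraints(composition, required, allowed) -> bool:
    """Check a composition (iterable of element symbols) against the
    --required-elements / --allowed-elements semantics sketched above."""
    elements = set(composition)
    return required <= elements and elements <= allowed


# satisfies_constraints({"Li", "S", "O"},
#                       parse_elements("Li-S"),
#                       parse_elements("Li-S-P-O"))  # -> True
```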
To override the space group distribution from a cached dataset:
wyformer-generate <output-file> --hf-model SymmetryAdvantage/<model-name> --sg-dist mp_20
3D Structures
There are two ways to generate 3D structures from Wyckoff representations: DiffCSP++ and CHGNet. The resulting structures can later be relaxed with CHGNet and/or DFT.
DiffCSP++
Wyckoff representations can be relaxed with the modified DiffCSP++ code.
CrySPR + MACE
The CrySPR scheme, using pyxtal and a MACE ML force field, is integrated directly into the package. Install the optional extra first:
pip install "wyckoff-transformer[relax]"
Then run from the command line:
wyformer-cryspr WyckoffTransformer_mp_20.json \
--model https://github.com/ACEsuit/mace-foundations/releases/download/mace_mp_0/2023-12-10-mace-128-L0_energy_epoch-249.model \
--output-dir results/ --start 0 --end 1000
# model_name defaults to the model file stem
head results/2023-12-10-mace-128-L0_energy_epoch-249_results.csv
model,id,formula,energy,energy_per_atom
...
2023-12-10-mace-128-L0_energy_epoch-249,35,H6O8Si2,-97.98,...
URL-based models are downloaded once and cached in ~/.cache/wyckoff_transformer/mace_models/. A local path is accepted too: --model /path/to/model.model.
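The download-once-and-cache behaviour described above follows a common pattern; a minimal sketch with the standard library (an illustration, not the package's actual implementation):

```python
import os
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "wyckoff_transformer" / "mace_models"


def resolve_model(model: str) -> Path:
    """Return a local path for a model: URLs are downloaded once and cached
    by file name; local paths are used as-is."""
    if model.startswith(("http://", "https://")):
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        target = CACHE_DIR / os.path.basename(model)
        if not target.exists():  # skip the download on subsequent calls
            urllib.request.urlretrieve(model, target)
        return target
    return Path(model)
```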
Output layout is identical to the CHGNet variant below. Key options:
- `--n-trials N` - number of random PyXtal trials per structure (default 6)
- `--fmax F` - force convergence criterion in eV/Å (default 0.01)
- `--model-name NAME` - label for the results CSV (default: model file stem)
- `--device auto|cpu|cuda` - PyTorch device selection (default: auto)
CHGNet relaxation
The structures from all models can be optionally relaxed with CHGNet.
$ cp scripts/cryspr_chgnet.py mp_20/WyckoffLLM-naive/DiffCSP++/parsed_materials_10000_pyxtal.json_structures.json.gz /your/working/dir/
$ cd /your/working/dir/
$ gzip -d parsed_materials_10000_pyxtal.json_structures.json.gz
$ python ./cryspr_chgnet.py 0 -1 ./parsed_materials_10000_pyxtal.json_structures.json wylm-dcpp
$ head wylm-dcpp_id_formula_energy.csv
model,id,formula,energy
wylm-dcpp,0,Y4Ho4Ir8,-128.26178
wylm-dcpp,1,La4Cu4Si8,-89.13443
wylm-dcpp,2,Rb3Br3,-20.12497
wylm-dcpp,3,Gd1Ni2,-26.83080
wylm-dcpp,4,Pb3O18,-96.37995
wylm-dcpp,5,Tl2Au2S4Br4,-41.40260
wylm-dcpp,6,Ru1As2In3Ce3,-49.81073
wylm-dcpp,7,Pd4As4Ni4Se4,-73.34074
wylm-dcpp,8,Na4Lu4F16,-149.18192
- Output files are in the same format as above (CrySPR + CHGNet).
- `${reduced_formula}_${full_formula}_cell+pos.cif` is the CHGNet relaxed structure.
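Per-atom energies can be recovered from the formula column of the CSV; a minimal sketch, assuming explicit integer counts as in the sample output above:

```python
import re


def atom_count(formula: str) -> int:
    """Count atoms in a formula like 'Y4Ho4Ir8' (a missing count means 1)."""
    return sum(int(n) if n else 1
               for _, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula))


def energy_per_atom(formula: str, energy: float) -> float:
    """Total energy divided by the number of atoms in the formula unit."""
    return energy / atom_count(formula)


# atom_count("Y4Ho4Ir8") -> 16
```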
DFT relaxation
We followed the Materials Project protocol, atomate2.vasp.flows.mp.MPGGADoubleRelaxStaticMaker. There isn't much to add, as the rest of the details of running DFT, unfortunately, depend on the HPC setup, and VASP is not open source. Here is the code to run at ASPIRE2.
Generated Data Analysis
Storage
Public Figshare
Most analyzed datasets in a uniform format are available at Figshare.
Private Dropbox
The raw files are stored in a private Dropbox. To pull:
rclone copy "NUS_Dropbox:/Nikita Kazeev/Wyckoff Transformer data/generated.tar.gz" . --progress
tar -xvf generated.tar.gz
To push:
tar --use-compress-program=pigz -cvf generated.tar.gz generated
rclone copy generated.tar.gz "NUS_Dropbox:/Nikita Kazeev/Wyckoff Transformer data/" --progress
Tar is used to handle the large number of small files, and pigz speeds up the compression. The Dropbox folder is private; if you are a collaborator, please contact us for access.
Preprocessing
In order to be analyzed, the data must be preprocessed and cached. To preprocess all generated datasets listed in generated/datasets.yaml:
uv run python scripts/cache_generated_datasets.py
It supports filtering by dataset and transformations, e.g.:
uv run python scripts/cache_generated_datasets.py --dataset mp_20 --transformations DiffCSP++ DFT
Completing this step enables loading the data with evaluation.generated_dataset.GeneratedDataset.from_cache.
Metric computation
The ICML 2025 results were computed by the notebooks in ICML_eval. They include, but are not limited to, the following metrics:
- S.U.N. - the fraction of stable, unique, and novel structures.
- S.S.U.N. - the fraction of symmetric, stable, unique, and novel structures.
- Space Group $\chi^2$ - the $\chi^2$ statistic of the space group distribution between the generated and the test set.
- P1 - the fraction of generated structures that lack internal symmetries, i.e., belong to space group P1.
- Property similarity and naive validity from Xie et al.
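The space-group $\chi^2$ metric above can be sketched as follows, assuming raw space-group labels for the generated and test sets (the binning and smoothing choices of the actual evaluation notebooks may differ):

```python
from collections import Counter


def space_group_chi2(generated, test) -> float:
    """Chi-squared statistic between the space-group distribution of the
    generated structures and the test set, with expected counts scaled to
    the generated sample size. Space groups absent from the test set would
    have zero expected count and are skipped in this simple sketch."""
    gen_counts = Counter(generated)
    test_counts = Counter(test)
    n_gen, n_test = len(generated), len(test)
    stat = 0.0
    for sg, test_count in test_counts.items():
        expected = test_count / n_test * n_gen
        observed = gen_counts.get(sg, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Identical distributions yield a statistic of 0.
```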