matsci-opt-benchmarks (WIP)

A collection of benchmarking problems and datasets for testing the performance of advanced optimization algorithms in the field of materials science and chemistry for a variety of "hard" problems involving one or several of: constraints, heteroskedasticity, multiple objectives, multiple fidelities, and high-dimensionality.

Several materials-science-specific resources for datasets, surrogate models, and benchmarks already exist:

Matbench focuses on materials property prediction using composition and/or crystal structure
Olympus focuses on small datasets generated via experimental self-driving laboratories
Foundry focuses on delivering ML-ready datasets in materials science and chemistry
Matbench-genmetrics focuses on generative modeling for crystal structure using metrics inspired by guacamol and CDVAE

In March 2021, pymatgen reorganized its code into namespace packages, which make it easier to distribute a collection of related subpackages and modules under an umbrella project. Tangentially, PyScaffold is a project generator for high-quality Python packages that are ready to be shared on PyPI and installable via pip; it also supports namespace package configurations.

My plan for this repository is to host pip-installable packages for loading the datasets, surrogate models, and benchmarks from recent manuscripts I've written. It is primarily intended as a convenience for me, with a secondary benefit of adding value to the community. I will look into hosting the datasets via Foundry and using the surrogate model API via Olympus. I will likely log results to a MongoDB database via Atlas and later take a snapshot of the dataset for Foundry.

Initially, for the surrogate model, I will probably use a basic scikit-learn model, such as RandomForestRegressor or GradientBoostingRegressor, along with cross-validated hyperparameter optimization via RandomizedSearchCV or HalvingRandomSearchCV.
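
As a rough illustration of the surrogate-model plan described above, the sketch below tunes a RandomForestRegressor with RandomizedSearchCV. The synthetic data, the hyperparameter ranges, and the scoring choice are assumptions made for this example; none of them come from this repository.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for a benchmark dataset (hypothetical, not from this repo).
X = rng.uniform(size=(200, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=200)

# Distributions to sample hyperparameters from (illustrative ranges only).
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 12),
    "min_samples_leaf": randint(1, 10),
}

# Cross-validated random search over the hyperparameters.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)

# With the default refit=True, best_estimator_ is retrained on all of the data.
surrogate = search.best_estimator_
y_pred = surrogate.predict(X[:5])
print(search.best_params_)
```

Swapping in GradientBoostingRegressor only requires changing the estimator and the keys of `param_distributions`.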

What will really differentiate the contribution of this repository is the modeling of non-Gaussian, heteroskedastic noise, where the noise can be a complex function of the input parameters. This is contrasted with Gaussian homoskedastic noise, where the noise is Gaussian and its variance is the same for every set of parameters [Wikipedia].
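
To make the distinction concrete, here is a minimal NumPy sketch that compares the two noise types on a toy one-dimensional objective. The objective, the noise scales, and the choice of a shifted log-normal for the non-Gaussian case are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def objective(x):
    """Hypothetical noise-free objective on [0, 1]."""
    return np.sin(2 * np.pi * x)


def homoskedastic_gaussian(x, sigma=0.1):
    """Gaussian noise whose scale is the same for every x."""
    return objective(x) + rng.normal(0.0, sigma, size=np.shape(x))


def heteroskedastic_skewed(x):
    """Right-skewed (log-normal) noise whose scale grows with x."""
    scale = 0.02 + 0.3 * x
    # Subtract the log-normal mean, exp(sigma**2 / 2), so the noise is zero-mean.
    noise = rng.lognormal(mean=0.0, sigma=0.5, size=np.shape(x)) - np.exp(0.125)
    return objective(x) + scale * noise


# Repeat the measurement many times at two fixed inputs and compare the spread.
x_low = np.full(5000, 0.1)
x_high = np.full(5000, 0.9)
std_homo = (homoskedastic_gaussian(x_low).std(), homoskedastic_gaussian(x_high).std())
std_hetero = (heteroskedastic_skewed(x_low).std(), heteroskedastic_skewed(x_high).std())
print("homoskedastic std at x=0.1, 0.9:", std_homo)
print("heteroskedastic std at x=0.1, 0.9:", std_hetero)
```

In the homoskedastic case the two standard deviations are nearly identical, whereas in the heteroskedastic case the spread at x=0.9 is several times larger than at x=0.1.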

My goal is to win a "Turing test" of sorts for the surrogate model, where the model is indistinguishable from the true, underlying objective function.

To accomplish this, I plan to:

Run repeats for every set of parameters and fit separate models for quantiles of the noise distribution
Get a large enough quasi-random sampling of the search space to accurately model intricate interactions between parameters (i.e., the response surface)
Train a classification model that short-circuits the regression model: return NaN values for inaccessible regions of objective functions and return the regression model values for accessible regions
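
The three steps above could be combined roughly as follows. This is a sketch under stated assumptions, not the repository's implementation: the toy `simulate` function, the Sobol sample size, the number of repeats, and the choice of GradientBoostingRegressor with a quantile loss and RandomForestClassifier are all picked for illustration.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier

rng = np.random.default_rng(0)


def simulate(x):
    """Hypothetical noisy objective; NaN where x0 + x1 > 1.5 (inaccessible)."""
    if x[0] + x[1] > 1.5:
        return np.nan
    noise_scale = 0.05 + 0.5 * x[0]  # heteroskedastic: noise depends on x0
    return np.sin(3 * x[0]) + x[1] + noise_scale * rng.standard_normal()


# Quasi-random (Sobol) sample of the unit square, repeated at every point.
points = qmc.Sobol(d=2, seed=0).random_base2(m=7)  # 2**7 = 128 points
n_repeats = 5
X = np.repeat(points, n_repeats, axis=0)
y = np.array([simulate(x) for x in X])

# Classifier that learns which regions are accessible (finite output).
accessible = np.isfinite(y)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, accessible)

# One regression model per noise quantile, trained on accessible rows only.
quantiles = (0.1, 0.5, 0.9)
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q, random_state=0).fit(
        X[accessible], y[accessible]
    )
    for q in quantiles
}


def predict(X_new, q=0.5):
    """Return the q-quantile prediction, or NaN where the classifier says inaccessible."""
    X_new = np.atleast_2d(X_new)
    out = np.full(len(X_new), np.nan)
    ok = clf.predict(X_new).astype(bool)
    if ok.any():
        out[ok] = models[q].predict(X_new[ok])
    return out


inside = predict([[0.2, 0.2]])
outside = predict([[0.95, 0.95]])
print(inside, outside)
```

Because each quantile is fit by a separate model, the predicted quantiles are not guaranteed to be ordered at every point; a production version would likely need to handle such crossings.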

My plans for implementation include:

Packing fraction of a random 3D packing of spheres as a function of the number of spheres, 6 parameters that define three separate truncated log-normal distributions, and 3 parameters that define the weight fractions [code] [paper1] [paper2] [data]
Discrete intensity vs. wavelength spectra (measured experimentally via a spectrophotometer) as a function of red, green, and blue LED powers and three sensor settings: number of integration steps, integration time per step, and signal gain [code] [paper]
Two error metrics (RMSE and MAE) and two hardware performance metrics (runtime and memory) of a CrabNet regression model trained on the Matbench experimental band gap dataset as a function of 23 CrabNet hyperparameters reframed as a composition-based optimization task [code] [paper]
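
To give a feel for the particle packing parameterization, the sketch below draws particle sizes from a mixture of three truncated log-normal distributions. It is only an illustration: the function names, the truncation bounds, the rejection-sampling approach, and the treatment of the weight fractions as mixture proportions by particle count are assumptions, and the referenced code may define these quantities differently.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_truncated_lognormal(mu, sigma, low, high, size):
    """Rejection-sample a log-normal distribution restricted to [low, high]."""
    samples = np.empty(0)
    while samples.size < size:
        draw = rng.lognormal(mean=mu, sigma=sigma, size=2 * size)
        samples = np.concatenate([samples, draw[(draw >= low) & (draw <= high)]])
    return samples[:size]


def sample_particle_sizes(mus, sigmas, weights, n_particles, low=0.1, high=10.0):
    """Draw sizes from a three-component mixture (hypothetical parameterization).

    mus, sigmas: 3 + 3 = 6 log-normal parameters.
    weights: 3 fractions, normalized here so that they sum to one.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    counts = rng.multinomial(n_particles, weights)  # particles per component
    return np.concatenate(
        [
            sample_truncated_lognormal(m, s, low, high, c)
            for m, s, c in zip(mus, sigmas, counts)
        ]
    )


sizes = sample_particle_sizes(
    mus=[0.0, 0.5, 1.0],
    sigmas=[0.2, 0.3, 0.4],
    weights=[0.5, 0.3, 0.2],
    n_particles=1000,
)
print(sizes.min(), sizes.max())
```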

Quick start

```
pip install matsci-opt-benchmarks
```

Not implemented yet:

```
from matsci_opt_benchmarks.core import MatSciOpt

mso = MatSciOpt(dataset="crabnet_hyperparameter")
# mso = MatSciOpt(dataset="particle_packing")

print(mso.features)
#

results = mso.predict(parameterization)
```

Generate benchmark from existing dataset

```
import pandas as pd
from matsci_opt_benchmarks.core import Benchmark

# load dataset
dataset_name = "dummy"
dataset_path = f"data/external/{dataset_name}.csv"
dataset = pd.read_csv(...)

# define inputs/outputs (and parameter types? if so, then Ax-like dict)
parameter_names = [...]
output_names = [...]

# select the input and output columns by name
X = dataset[parameter_names]
y = dataset[output_names]

# fit the surrogate and predict for the first five rows
bench = Benchmark()
bench.fit(X=X, Y=y)
y_pred = bench.predict(X.head(5))
print(y_pred)
# [[...], [...], ...]

bench.save(fpath=f"models/{dataset_name}")
bench.upload(zenodo_id=zenodo_id)

# upload to HuggingFace
...
```
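
Until `Benchmark` exists, the following hypothetical stand-in shows one way the `fit`, `predict`, and `save` calls above could behave. The class name `SketchBenchmark`, the `load` method, the random forest backend, and the joblib serialization are assumptions for illustration and are not part of the package; uploading is left out entirely.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor


class SketchBenchmark:
    """Hypothetical stand-in mirroring the fit/predict/save calls shown above."""

    def __init__(self, random_state=0):
        self.model = RandomForestRegressor(n_estimators=50, random_state=random_state)

    def fit(self, X, Y):
        # RandomForestRegressor supports multiple output columns natively.
        self.model.fit(X, Y)
        return self

    def predict(self, X):
        return self.model.predict(X)

    def save(self, fpath):
        path = Path(fpath)
        path.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(self.model, path)
        return path

    @classmethod
    def load(cls, fpath):
        bench = cls()
        bench.model = joblib.load(fpath)
        return bench


# Synthetic data with four inputs and two outputs (not a real dataset).
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 4))
Y = np.column_stack([X.sum(axis=1), X[:, 0] * X[:, 1]])

bench = SketchBenchmark().fit(X=X, Y=Y)
with tempfile.TemporaryDirectory() as tmp:
    saved = bench.save(Path(tmp) / "models" / "dummy.joblib")
    restored = SketchBenchmark.load(saved)
    y_pred = restored.predict(X[:5])
print(y_pred)
```

The saved-and-reloaded model returns the same predictions as the in-memory one, which is the property a shareable benchmark needs.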

Installation

In order to set up the necessary environment:

review and uncomment what you need in environment.yml and create an environment matsci-opt-benchmarks with the help of conda:
```
conda env create -f environment.yml
```
activate the new environment with:
```
conda activate matsci-opt-benchmarks
```

NOTE: The conda environment will have matsci-opt-benchmarks installed in editable mode. Some changes, e.g. in setup.cfg, might require you to run pip install -e . again.

Optional and needed only once after git clone:

install several pre-commit git hooks with:
```
pre-commit install
# You might also want to run `pre-commit autoupdate`
```
and check out the configuration under .pre-commit-config.yaml. The -n, --no-verify flag of git commit can be used to deactivate pre-commit hooks temporarily.
install nbstripout git hooks to remove the output cells of committed notebooks with:
```
nbstripout --install --attributes notebooks/.gitattributes
```
This is useful to avoid large diffs due to plots in your notebooks. A simple nbstripout --uninstall will revert these changes.

Then take a look into the scripts and notebooks folders.

Dependency Management & Reproducibility

Always keep your abstract (unpinned) dependencies updated in environment.yml and eventually in setup.cfg if you want to ship and install your package via pip later on.
Create concrete dependencies as environment.lock.yml for the exact reproduction of your environment with:
```
conda env export -n matsci-opt-benchmarks -f environment.lock.yml
```
For multi-OS development, consider using --no-builds during the export.
Update your current environment with respect to a new environment.lock.yml using:
```
conda env update -f environment.lock.yml --prune
```

Project Organization

├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md         <- Guidelines for contributing to this project.
├── Dockerfile              <- Build a docker container with `docker build .`.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── models                  <- Trained and serialized models, model predictions,
│                              or model summaries.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml          <- Build configuration. Don't change! Use `pip install -e .`
│                              to install for development or `tox -e build` to build.
├── references              <- Data dictionaries, manuals, and all other materials.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated plots and figures for reports.
├── scripts                 <- Analysis and production scripts which import the
│                              actual PYTHON_PKG, e.g. train_model.
├── setup.cfg               <- Declarative configuration of your project.
├── setup.py                <- [DEPRECATED] Use `python setup.py develop` to install for
│                              development or `python setup.py bdist_wheel` to build.
├── src
│   ├── particle_packing    <- Actual Python package where the main functionality goes.
│   └── crabnet_hyperparameter <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `pytest`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.

Note

This project has been set up using PyScaffold 4.3.1 and the dsproject extension 0.7.2.

Release history

0.2.3 (this version): Jul 8, 2023
0.2.2: Mar 3, 2023
0.2.1: Mar 3, 2023
0.2.0: Mar 1, 2023

