Machine learning tools for computational chemistry and condensed matter physics

cmlkit 🐫🧰

Publications: repbench: Langer, Gößmann, Rupp (2020)

Plugins: cscribe 🐫🖋️ | mortimer 🎩⏰ | skrrt 🚗💨

cmlkit is an extensible Python package providing clean and concise infrastructure to specify, tune, and evaluate machine learning models for computational chemistry and condensed matter physics. It is intended as a common foundation for more specialised systems, not a monolithic user-facing tool: it wants to help you build your own tools! ✨

If you use this code in any scientific work, please mention it in the publication, cite the paper and let me know. Thanks! 🐫

What exactly is `cmlkit`?

💡 A tutorial introduction to cmlkit courtesy of the NOMAD Analytics Toolkit 💡

Sidenote: If you've come across this from outside the "ML for materials and chemistry" world, this will unfortunately be of limited use for you! However, if you're interested in ML infrastructure in general, please take a look at engine and tune, which are not specific to this domain and might be of interest.

Features

Reasonably clean, composable, modern codebase with little magic ✨

Representations

cmlkit provides a unified interface for:

Many-Body Tensor Representation by Huo, Rupp (2017) (qmmlpack and dscribe implementations)
Smooth Overlap of Atomic Positions representation by Bartók, Kondor, Csányi (2013) (quippy‡ and dscribe implementations)
Symmetry Functions representation by Behler (2011) (RuNNer and dscribe implementations), with a semi-automatic parametrisation scheme taken from Gastegger et al. (2018).

‡ The quippy interface was written for an older version that didn't support Python 3.

Regression methods

Kernel Ridge Regression as implemented in qmmlpack (supporting both global and local/atomic representations)

Hyper-parameter tuning

Robust multi-core support (i.e. it can automatically kill timed-out external code, even if that code ignores SIGTERM)
No mongodb required
Extensions to the hyperopt priors (uniform log grids)
Resumable/recoverable runs backed by a readable, atomically written history of the optimisation (stored using son)
Search spaces can be defined entirely in text, i.e. they're easily writeable, portable and serialisable
Possibility to implement multi-step optimisation (experimental at the moment)
Extensible with custom loss functions or training loops
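
To make the "uniform log grids" item above concrete, here is a minimal sketch of such a grid in plain Python. It is not cmlkit's implementation: the function name `log_grid` and its parameters are hypothetical, chosen only for illustration. The idea is that candidate values are spaced evenly in the exponent, so a fixed number of discrete choices covers several orders of magnitude.

```python
def log_grid(start, stop, num, base=2.0):
    """Return `num` values base**e, with exponents e evenly spaced from start to stop.

    Hypothetical helper for illustration; not part of cmlkit or hyperopt.
    """
    if num < 2:
        raise ValueError("num must be at least 2")
    step = (stop - start) / (num - 1)
    return [base ** (start + i * step) for i in range(num)]


# Five candidate values between 2**-2 and 2**2, evenly spaced in the exponent.
grid = log_grid(-2, 2, 5)
print(grid)  # [0.25, 0.5, 1.0, 2.0, 4.0]
```

A discrete grid like this can be handed to any sampler that picks one option from a list.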

Various

Automated loading of datasets by name
Seamless conversion of properties into per-atom or per-system quantities. Models can do this automatically!
Plugin system! ☢️ Isolate one-off nightmares! ☢️
Canonical, stable hashes of models and datasets!
Automatically train models and compute losses!
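
As a rough illustration of what a "canonical, stable hash" of a dict-based config can look like, the sketch below serialises a nested dict with sorted keys and hashes the resulting text. This is an assumption about the general technique, not a description of cmlkit's own hashing scheme; the function name `canonical_hash` is hypothetical.

```python
import hashlib
import json


def canonical_hash(config):
    """Hash a nested dict/list config so that key order does not matter.

    Illustrative only; cmlkit's own hashing scheme may differ.
    """
    # sort_keys and fixed separators give one canonical text form per config
    text = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two trimmed-down configs with the same content but different key order.
a = {"krr": {"nl": 1.0e-07, "kernel": {"gaussian": {"ls": 80}}}}
b = {"krr": {"kernel": {"gaussian": {"ls": 80}}, "nl": 1.0e-07}}

print(canonical_hash(a) == canonical_hash(b))  # True
```

Because the hash depends only on content, it can serve as a stable identifier for a model or dataset across runs and machines.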

But what... is it?

At its core, cmlkit defines a unified dict-based format to specify model components, which can be straightforwardly read and written as yaml. Model components are implemented as pure-ish functions, which is conceptually satisfying and opens the door to easy pipelining and caching. Using this format, cmlkit provides interfaces to many representations and a fast kernel ridge regression implementation.

Here is an example for a SOAP+KRR model:

model:
  per: cell
  regression:
    krr:               # regression method: kernel ridge regression
      kernel:
        kernel_atomic: # soap is a local representation, so we use the appropriate kernel
          kernelf:
            gaussian:  # gaussian kernel
              ls: 80   # ... with length scale 80
      nl: 1.0e-07      # regularisation parameter
  representation:
    ds_soap:           # SOAP representation (dscribe implementation via plugin)
      cutoff: 3
      elems: [8, 13, 31, 49]
      l_max: 8
      n_max: 2
      sigma: 0.5
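
Since the format is dict-based, the YAML above corresponds to a nested Python dict. In this example, each component appears as a single-key dict: the key names the kind of component and the value holds its parameters. The sketch below writes the `representation` entry as a dict and unpacks it. The helper `split_component` is hypothetical and only illustrates the convention; it is not a cmlkit function.

```python
def split_component(config):
    """Split a single-key component dict into (kind, parameters).

    Hypothetical helper for illustration; not part of cmlkit.
    """
    if not isinstance(config, dict) or len(config) != 1:
        raise ValueError("expected a dict with exactly one key")
    kind, params = next(iter(config.items()))
    return kind, params


# The `representation` entry of the YAML example, written as a Python dict.
representation = {
    "ds_soap": {
        "cutoff": 3,
        "elems": [8, 13, 31, 49],
        "l_max": 8,
        "n_max": 2,
        "sigma": 0.5,
    }
}

kind, params = split_component(representation)
print(kind)             # ds_soap
print(params["l_max"])  # 8
```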

Having a canonical model format allows cmlkit to provide a quite pleasant interface to hyperopt. The same mechanism also enables a simple plugin system, making cmlkit easily extensible, so you can isolate one-off task-specific code into separate projects without any problems, while making use of a solid, if opinionated, foundation.

For a gentle, detailed tour please check out the tutorial.

Caveats 😬

Okay then, what are the rough parts?

cmlkit is very inconvenient for interactive and non-automated use: Models cannot be saved and caching is not enabled yet, so all computations (representation, kernel matrices, etc.) must be re-run from scratch upon restart. This is not a problem during HP optimisation, as there the point is to try different models, but it is annoying for exploring a single model in detail. Fixing this is an active consideration, though! After all, the code is written with caching in mind.
cmlkit is and will remain "scientific research software", i.e. it is prone to somewhat haphazard development practices and periods of hibernation. I'll do my best to avoid breaking changes and abandonment, but you know how it is!
cmlkit is currently in an "alpha" state. While it's pretty stable and well-tested for some specific use cases (like writing a large-scale benchmarking paper), it's not tested for more everyday use. There are also some internal loose ends that need to be tied up.
cmlkit is not particularly user friendly at the moment, and expects its users to be python developers. See below for notes on documentation! 😀

Installation and friends

cmlkit is available via pip:

pip install cmlkit

You can also clone this repository! I'd suggest having a look into the codebase in any case, as there is currently no external documentation.

If you want to do any "real" work with cmlkit, you'll need to install qmmlpack on the development branch. It's fairly straightforward!

In order to compute representations with dscribe, you should install the cscribe plugin:

pip install cscribe

You also need to export CML_PLUGINS=cscribe.
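
If you prefer to set the variable from within Python, for example in a notebook, a minimal sketch could look like this. It assumes (this README does not confirm it) that cmlkit reads CML_PLUGINS when it is imported, so the variable is set before the import.

```python
import os

# Assumption: cmlkit reads CML_PLUGINS at import time, so set it first.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("CML_PLUGINS", "cscribe")

print(os.environ["CML_PLUGINS"])

# import cmlkit  # uncomment once cmlkit and cscribe are installed
```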

To set up the quippy and RuNNer interfaces, please consult the readmes in cmlkit/representation/soap and cmlkit/representation/sf.

For details on environment variables and such things, please consult the readme in the cmlkit folder.

"Frequently" Asked Questions

Where is the documentation?

At the moment, I don't think it's feasible for me to maintain separate written docs, and I believe that purely auto-generated docs are basically a worse version of just looking at the formatted source on GitHub or in your text editor. So I highly encourage you to take a look there!

Most submodules in cmlkit have their own README.md documenting what's going on in them, and all "outside facing" classes have extensive docstrings. I hope that's sufficient! Please feel free to file an issue if you have any questions.

I don't work in computational chemistry/condensed matter physics. Should I care?

The short answer is, regrettably, probably no.

However, I think the architecture of this library is quite neat, so maybe it can provide some marginally interesting reading. The tune component is very general and provides, in my opinion, a delightfully clean interface to hyperopt. The engine is also rather general and provides a nice way to serialise specific kinds of python objects to yaml.

Why should I use this?

Well, maybe if you:

need to use any of the libraries mentioned above, especially if you want to use them in the same project with the same infrastructure,
are tired of plain hyperopt,
would like to be able to save your model parameters in a readable format,
think it's neat?

My goal with this is to make it slightly easier for you to build up your own infrastructure for studying models and applications in our field! If you're just starting out, just take a look around!

Release history

2.0.0a26 (pre-release, this version): Mar 25, 2022
2.0.0a25 (pre-release): Mar 24, 2022
2.0.0a24 (pre-release): Jan 27, 2022
2.0.0a23 (pre-release): Apr 24, 2021
2.0.0a22 (pre-release): Oct 7, 2020
2.0.0a21 (pre-release): Jun 15, 2020
2.0.0a20 (pre-release): Mar 30, 2020
2.0.0a19 (pre-release): Mar 30, 2020
2.0.0a18 (pre-release): Dec 5, 2019
2.0.0a17 (pre-release): Sep 10, 2019
2.0.0a16 (pre-release): Aug 8, 2019
2.0.0a15 (pre-release): Aug 7, 2019
2.0.0a14 (pre-release): Jul 16, 2019
2.0.0a13 (pre-release): Jul 14, 2019
2.0.0a12 (pre-release): Jun 18, 2019
2.0.0a11 (pre-release): Jun 12, 2019
2.0.0a10 (pre-release): Jun 12, 2019
2.0.0a9 (pre-release): Jun 12, 2019
2.0.0a8 (pre-release): Jun 10, 2019
2.0.0a7 (pre-release): Jun 10, 2019
2.0.0a6 (pre-release): Jun 9, 2019
2.0.0a5 (pre-release): Jun 4, 2019
2.0.0a4 (pre-release): Jun 3, 2019
2.0.0a3 (pre-release): Jun 3, 2019
2.0.0a2 (pre-release): Jun 3, 2019
2.0.0a1 (pre-release): Jun 3, 2019

Download files

Source Distribution

cmlkit-2.0.0a26.tar.gz (83.7 kB)

Uploaded Mar 25, 2022 Source

Built Distribution

cmlkit-2.0.0a26-py3-none-any.whl (108.1 kB)

Uploaded Mar 25, 2022 Python 3

File details

Details for the file cmlkit-2.0.0a26.tar.gz.

File metadata

Download URL: cmlkit-2.0.0a26.tar.gz
Upload date: Mar 25, 2022
Size: 83.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.13 CPython/3.7.2 Darwin/18.7.0

File hashes

Hashes for cmlkit-2.0.0a26.tar.gz
Algorithm	Hash digest
SHA256	`e83ede322743b995684b709baef4f350c128f69737af08c6b4fbf65530e965ea`
MD5	`9e78afc1e88ae6d0f6947155c8e215cc`
BLAKE2b-256	`ea8062ec14536f7954f7890534061b4524952b35316e698fa931c5cfacf900d0`

File details

Details for the file cmlkit-2.0.0a26-py3-none-any.whl.

File metadata

Download URL: cmlkit-2.0.0a26-py3-none-any.whl
Upload date: Mar 25, 2022
Size: 108.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.13 CPython/3.7.2 Darwin/18.7.0

File hashes

Hashes for cmlkit-2.0.0a26-py3-none-any.whl
Algorithm	Hash digest
SHA256	`144762b19f52fbf4aa9d5184ffe3609d10e748281a299a44a15dab45e66f20d8`
MD5	`8a4042285a0e868375322f4f670a53d4`
BLAKE2b-256	`1dbcba56b480dc677c9d92957410f089a7cc7bfe1dde5ef6ce7b2ba84d427743`
