
Language models for astrochemistry

Project description


Read the documentation at https://astrochem_embedding.readthedocs.io/


Features

The goal of this project is to provide off-the-shelf language models that work for studies in astrochemistry; the needs of astrochemistry differ from those of general molecule discovery/chemistry, for example in the emphasis on transient (e.g. open-shell) molecules and isotopologues.

To support these aspects, we provide lightweight language models (currently just a GRU seq2seq model) based on SELFIES syntax and PyTorch. Elements of this project are designed to strike a balance between research agility and production use: a lot of emphasis is placed on reproducibility through PyTorch Lightning, and on a general user interface that doesn't force the user to know how to develop neural networks.

The current highlight of this package is the VICGAE, or variance-invariance-covariance regularized GRU autoencoder (I guess VICGRUAE would probably be more accurate?). I intend to write this up in more detail in the near future, but the basic premise is this: a pair of GRUs forms a seq2seq model whose task is to complete SELFIES strings in which tokens within the molecule have been randomly masked. To improve chemical representation learning, the VIC regularization uses self-supervision to ensure the token embeddings are chemically descriptive: we encourage variance (e.g. [CH2] should differ from [OH]), invariance (e.g. isotopic substitution should give more or less the same molecule), and covariance (i.e. minimizing information sharing between embedding dimensions). While the GRU does the actual SELFIES reconstruction, the VIC regularization is applied at the token embedding level.
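
To make the three regularization terms concrete, here is a minimal sketch of a VICReg-style loss on token embeddings; the exact formulation, weights, and batching used in this package may differ, so treat this as illustrative only:

import torch
import torch.nn.functional as F

def vic_loss(z_a, z_b, eps=1e-4):
    # z_a, z_b: (batch, dim) embeddings of two views of the same molecule
    # (e.g. with and without isotopic substitution)
    # Invariance: the two views should embed to nearly the same point
    invariance = F.mse_loss(z_a, z_b)
    # Variance: hinge loss keeping each dimension's std above 1, so that
    # distinct tokens (e.g. [CH2] vs. [OH]) remain distinguishable
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    variance = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()
    # Covariance: penalize off-diagonal covariance entries to decorrelate
    # embedding dimensions (minimize information sharing between them)
    n, d = z_a.shape
    centered = z_a - z_a.mean(dim=0)
    cov = (centered.T @ centered) / (n - 1)
    covariance = (cov - torch.diag(cov.diagonal())).pow(2).sum() / d
    return invariance + variance + covariance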

This has been tested on a few simple comparisons with cosine similarity, covering isotopic substitution, element substitution (e.g. C/Si/Ge), and functional group replacement; things seem to work well for these simple cases.
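
For instance, a quick check of the isotopologue case might look like the following sketch, assuming (as in the Usage section below) that embed_smiles returns a tensor that can be flattened to a single vector:

>>> import torch
>>> from astrochem_embedding import VICGAE
>>> model = VICGAE.from_pretrained()
>>> benzene = model.embed_smiles("c1ccccc1").flatten()
>>> benzene_13c = model.embed_smiles("[13c]1ccccc1").flatten()
>>> cn = model.embed_smiles("[C]#N").flatten()
>>> # isotopologues should score much closer than unrelated molecules
>>> torch.cosine_similarity(benzene, benzene_13c, dim=0).item()
>>> torch.cosine_similarity(benzene, cn, dim=0).item()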

Requirements

This package requires Python 3.8+, as it uses some decorators only available from Python 3.8 onward.

Installation

The simplest way to get astrochem_embedding is through PyPI:

$ pip install astrochem_embedding

If you’re interested in development, want to train your own model, or make sure you can take advantage of GPU acceleration, I recommend using conda for your environment specification:

$ conda create -n astrochem_embedding python=3.8
$ conda activate astrochem_embedding
$ pip install poetry
$ poetry install
$ conda install -c pytorch pytorch torchvision cudatoolkit=11.3

Usage

The quickest way to get started is by loading a pre-trained model:

>>> from astrochem_embedding import VICGAE
>>> import torch
>>> model = VICGAE.from_pretrained()
>>> model.embed_smiles("c1ccccc1")

will return a torch.Tensor. The general interface doesn't yet support batching SMILES, so operating on many SMILES strings simply requires looping:

>>> smiles = ["c1ccccc1", "[C]#N", "[13c]1ccccc1"]
>>> embeddings = torch.stack([model.embed_smiles(s) for s in smiles])
>>> # optionally convert back to NumPy arrays
>>> numpy_embeddings = embeddings.numpy()

Project Structure

The project file structure is laid out as follows:

├── CITATION.cff
├── codecov.yml
├── CODE_OF_CONDUCT.rst
├── CONTRIBUTING.rst
├── data
│   ├── external
│   ├── interim
│   ├── processed
│   └── raw
├── docs
│   ├── codeofconduct.rst
│   ├── conf.py
│   ├── contributing.rst
│   ├── index.rst
│   ├── license.rst
│   ├── reference.rst
│   ├── requirements.txt
│   └── usage.rst
├── environment.yml
├── models
├── notebooks
│   ├── dev
│   ├── exploratory
│   └── reports
├── noxfile.py
├── poetry.lock
├── pyproject.toml
├── README.rst
├── scripts
│   └── train.py
└── src
   └── astrochem_embedding
      ├── __init__.py
      ├── layers
      │   ├── __init__.py
      │   ├── layers.py
      │   └── tests
      │       ├── __init__.py
      │       └── test_layers.py
      ├── __main__.py
      ├── models
      │   ├── __init__.py
      │   ├── models.py
      │   └── tests
      │       ├── __init__.py
      │       └── test_models.py
      ├── pipeline
      │   ├── data.py
      │   ├── __init__.py
      │   ├── tests
      │   │   ├── __init__.py
      │   │   ├── test_data.py
      │   │   └── test_transforms.py
      │   └── transforms.py
      └── utils.py

A brief summary of what each folder is designed for:

  1. data contains copies of the data used for this project. It is recommended to form a pipeline whereby the raw data is preprocessed, serialized to interim, and, when ready for analysis, placed into processed (see the sketch after this list).

  2. models contains serialized weights intended for distribution and/or testing.

  3. notebooks contains three subfolders: dev for notebook-based development, exploratory for data exploration, and reports for making figures and visualizations for writeups.

  4. scripts contains files meant for headless routines, generally those with long compute times such as model training and data cleaning.

  5. src/astrochem_embedding contains the common code base for this project.
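
As a hypothetical illustration of the pipeline convention in point 1 (the file and column names here are made up):

from pathlib import Path
import pandas as pd

data = Path("data")
# raw -> interim: parse the untouched source data and serialize it
raw = pd.read_csv(data / "raw" / "molecules.csv")
interim = raw.dropna(subset=["smiles"])
interim.to_pickle(data / "interim" / "molecules.pkl")
# interim -> processed: the final form used for analysis and training
processed = interim.drop_duplicates(subset=["smiles"])
processed.to_pickle(data / "processed" / "molecules.pkl")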

Code development

All of the code used for this project should be contained in src/astrochem_embedding, at least in terms of the high-level functionality (i.e. not scripts), and is intended to be a standalone Python package.

The package is structured to match the abstractions of the deep learning stack used here (PyTorch, PyTorch Lightning, and Weights and Biases), separating data structures and processing from model and layer development.
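
In practice that separation looks something like the following sketch; the class names are illustrative, not the package's actual API:

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# pipeline/: data structures and processing
class SELFIESDataModule(pl.LightningDataModule):  # hypothetical name
    def train_dataloader(self):
        ...

# models/: model and layer development
class Seq2SeqAutoencoder(pl.LightningModule):  # hypothetical name
    def training_step(self, batch, batch_idx):
        ...

def train():
    # scripts/train.py wires the two together, logging to Weights and Biases
    trainer = pl.Trainer(logger=WandbLogger(project="astrochem_embedding"))
    trainer.fit(Seq2SeqAutoencoder(), datamodule=SELFIESDataModule())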

Some concise tenets for development

  • Write unit tests as you go.

  • Commit changes, and commit frequently. Write semantic git commits!

  • Formatting is done with black; don’t fuss about it 😃

  • For new Python dependencies, use poetry add <package>.

  • For new environment dependencies, use conda env export -f environment.yml.

Notes on best practices, particularly regarding CI/CD, can be found in the extensive documentation for the Hypermodern Python Cookiecutter repository.

License

Distributed under the terms of the MIT license, Language models for astrochemistry is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

Credits

This project was generated from @laserkelvin’s PyTorch Project Cookiecutter, a fork of @cjolowicz’s Hypermodern Python Cookiecutter template.

