Implementation of the 'Gotta be SAFE: a new framework for molecular design' paper
🦺 SAFE
Sequential Attachment-based Fragment Embedding (SAFE) is a novel molecular line notation that represents molecules as an unordered sequence of fragment blocks to improve molecule design using generative models.
Paper | Docs | 🤗 Model | 🤗 Training Dataset
Overview of SAFE
SAFE is a molecular representation designed for deep learning. It is an encoding that leverages a peculiarity in the decoding schemes of SMILES to represent molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings and thus preserve the same amount of information. The intuitive representation of molecules as an unordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:
- de novo design
- superstructure generation
- scaffold decoration
- motif extension
- linker generation
- scaffold morphing.
Constructing a SAFE string requires defining a molecular fragmentation algorithm. By default, we use BRICS, but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by datamol or RDKit.
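To make the fragment structure concrete, the sketch below inspects the ibuprofen SAFE string shown in the Examples section using plain string operations. It is only an illustration, not part of the `safe` package: the helper name `attachment_labels` is hypothetical, and the parser assumes single-digit ring-closure labels and no bracket atoms, so that every digit in the string is a ring-closure label.

```python
from collections import defaultdict


def attachment_labels(safe_str: str) -> dict[str, list[int]]:
    """Map each ring-closure digit to the indices of the fragments using it.

    Toy parser (assumption): single-digit labels only, no bracket atoms,
    so every digit encountered is treated as a ring-closure label.
    """
    usage = defaultdict(list)
    for idx, fragment in enumerate(safe_str.split(".")):
        for char in fragment:
            if char.isdigit():
                usage[char].append(idx)
    return dict(usage)


# Ibuprofen SAFE string taken from the Examples section of this README.
safe_str = "c12ccc3cc1.C3(C)C(=O)O.CC(C)C2"
fragments = safe_str.split(".")
labels = attachment_labels(safe_str)

# A label whose two occurrences sit in different fragments connects them;
# a label used twice inside one fragment is an ordinary ring closure.
links = {digit: tuple(frags) for digit, frags in labels.items() if len(set(frags)) > 1}

print(fragments)  # ['c12ccc3cc1', 'C3(C)C(=O)O', 'CC(C)C2']
print(links)      # {'2': (0, 2), '3': (0, 1)}
```

Here label `1` closes the aromatic ring inside the first fragment, while labels `2` and `3` act as attachment points between fragments, which is why the dot-separated string still reads as a single connected molecule.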
Installation
You can install safe using pip:
pip install safe-mol
You can also use conda/mamba. Ask @maclandrol for credentials to conda-forge or for a token:
mamba install -c conda-forge safe-mol
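After installing, one way to confirm that the distribution is visible to the current interpreter is to query its metadata with the standard library. This is a minimal sketch; the helper name `installed_version` is hypothetical, and the distribution name `safe-mol` is taken from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if safe-mol is installed in this environment, else None.
print(installed_version("safe-mol"))
```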
Datasets and Models
Type | Name | Infos | Size | Comment |
---|---|---|---|---|
Model | datamol-io/safe-gpt | 87M params | 350M | Default model |
Dataset | datamol-io/safe-gpt | 1.1B rows | 250GB | Training dataset |
Dataset | datamol-io/safe-drugs | 26 rows | 20 kB | Benchmarking dataset |
Usage
Please refer to the documentation, which contains tutorials for getting started with safe and detailed descriptions of the functions provided.
API
We summarize some key functions provided by the safe package below.
Function | Description |
---|---|
safe.encode | Translates a SMILES string into its corresponding SAFE string. |
safe.decode | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder simply augments RDKit's Chem.MolFromSmiles with an optional correction argument to take care of missing hydrogen bonds. |
safe.split | Tokenizes a SAFE string to build a generative model. |
Examples
Translation between SAFE and SMILES representations
import safe

ibuprofen = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"

# SMILES -> SAFE -> SMILES translation
try:
    ibuprofen_sf = safe.encode(ibuprofen)  # c12ccc3cc1.C3(C)C(=O)O.CC(C)C2
    ibuprofen_smi = safe.decode(ibuprofen_sf, canonical=True)  # CC(C)Cc1ccc(C(C)C(=O)O)cc1
except safe.EncoderError:
    pass
except safe.DecoderError:
    pass

ibuprofen_tokens = list(safe.split(ibuprofen_sf))
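To give a rough idea of what tokenizing such a string involves, the sketch below splits the ibuprofen SAFE string into atom-level tokens with a regular expression commonly used for SMILES. This is an assumption-laden illustration and not the implementation behind `safe.split`, whose output may differ; the names `SMILES_TOKEN_RE` and `toy_tokenize` are hypothetical.

```python
import re

# Regex commonly used for atom-level SMILES tokenization: bracket atoms,
# two-letter halogens, organic-subset atoms, branches, bonds, the "."
# separator, and ring-closure labels (single digit or %nn).
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def toy_tokenize(smiles_like: str) -> list[str]:
    """Split a SMILES-like string into tokens, failing on unrecognized characters."""
    tokens = SMILES_TOKEN_RE.findall(smiles_like)
    if "".join(tokens) != smiles_like:
        raise ValueError(f"could not tokenize the whole string: {smiles_like!r}")
    return tokens


tokens = toy_tokenize("c12ccc3cc1.C3(C)C(=O)O.CC(C)C2")
print(len(tokens))  # 30
print(tokens[:11])  # ['c', '1', '2', 'c', 'c', 'c', '3', 'c', 'c', '1', '.']
```

Because a SAFE string is a valid SMILES string, a SMILES-style tokenizer applies directly; the `.` token marks fragment boundaries and the digit tokens carry the attachment points.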
Training a new model
A command-line interface is available to train a new model. To see the available options, run safe-train --help
For example:
safe-train --config <path to config> \
--model-path <path to model> \
--tokenizer <path to tokenizer> \
--dataset <path to dataset> \
--num_labels 9 \
--torch_compile True \
--optim "adamw_torch" \
--learning_rate 1e-5 \
--prop_loss_coeff 1e-3 \
--gradient_accumulation_steps 1 \
--output_dir "<path to outputdir>" \
--max_steps 5
References
If you use this repository, please cite the following related paper:
@misc{noutahi2023gotta,
title={Gotta be SAFE: A New Framework for Molecular Design},
author={Emmanuel Noutahi and Cristian Gabellini and Michael Craig and Jonathan S. C Lim and Prudencio Tossou},
year={2023},
eprint={2310.10773},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
License
Note that all data and model weights of SAFE are exclusively licensed for research purposes. The accompanying dataset is licensed under CC BY 4.0, which permits solely non-commercial usage. See DATA_LICENSE for details.
This code base is licensed under the Apache-2.0 license. See LICENSE for details.
Development lifecycle
Setup dev environment
mamba create -n safe -f env.yml
mamba activate safe
pip install --no-deps -e .
Tests
You can run tests locally with:
pytest
File details
Details for the file safe-mol-0.1.1.tar.gz.
File metadata
- Download URL: safe-mol-0.1.1.tar.gz
- Upload date:
- Size: 395.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 4b7e3e9f282a2ceb717da5bde59143ce741737df37eb21ee23e871266d50f016 |
MD5 | be769288e08a0e94856bf05629491410 |
BLAKE2b-256 | 4e5a7bb23e86c3c433121425d8cf08567551ac35633c41519b273633e9230de1 |
File details
Details for the file safe_mol-0.1.1-py3-none-any.whl.
File metadata
- Download URL: safe_mol-0.1.1-py3-none-any.whl
- Upload date:
- Size: 50.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | a305d7532ef718641f78f7d3aa2a4b16047e9766f8b19eb5cf0235eda8753091 |
MD5 | 7bbafd669cd30a1e086dab0eed4c9bbf |
BLAKE2b-256 | 09f083769ddb662c652fce7d2699f5e7c42798f2a2f853534afe0f6f1626aaf0 |