unimol_tools is a Python package for property prediction with Uni-Mol on molecules, materials, and proteins.
Uni-Mol Tools
Unimol_tools is an easy-to-use wrapper for property prediction, representation, and downstream tasks with Uni-Mol.
📖 Documentation: unimol-tools.readthedocs.io
Install
- PyTorch is required; please install it according to your environment. If you are using CUDA, install a CUDA-enabled build of PyTorch. More details can be found at https://pytorch.org/get-started/locally/
Option 1: Installing from PyPI (recommended, stable version)
pip install unimol_tools --upgrade
We recommend installing huggingface_hub so that the required Uni-Mol models can be automatically downloaded at runtime. It can be installed with
pip install huggingface_hub
huggingface_hub allows you to easily download and manage models from the Hugging Face Hub, which is key for using Uni-Mol models.
Option 2: Installing from source (for latest version)
## Clone repository
git clone https://github.com/deepmodeling/unimol_tools.git
cd unimol_tools
## Dependencies installation
pip install -r requirements.txt
## Install
python setup.py install
Models on Hugging Face
The Uni-Mol pretrained models can be found at dptech/Uni-Mol-Models.
If pretrained_model_path or pretrained_dict_path is left as None, the
toolkit automatically downloads the corresponding files from this
Hugging Face repository at runtime.
If the download is slow, you can use a mirror, such as:
export HF_ENDPOINT=https://hf-mirror.com
By default unimol_tools first tries the official Hugging Face endpoint. If that fails and HF_ENDPOINT is not set, it automatically retries using https://hf-mirror.com. Set HF_ENDPOINT yourself if you want to explicitly choose a mirror or the official site.
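The endpoint fallback order described above can be illustrated with a small sketch. This only mirrors the documented behavior; the names and structure here are illustrative, not the actual resolution code inside unimol_tools:

```python
import os

OFFICIAL = "https://huggingface.co"
MIRROR = "https://hf-mirror.com"

def pick_endpoint(official_reachable: bool) -> str:
    """Choose a Hugging Face endpoint following the documented fallback order.

    1. If HF_ENDPOINT is set, always honor it.
    2. Otherwise try the official endpoint.
    3. If the official endpoint is unreachable, fall back to the mirror.
    """
    explicit = os.environ.get("HF_ENDPOINT")
    if explicit:
        return explicit
    return OFFICIAL if official_reachable else MIRROR
```

In other words, setting HF_ENDPOINT short-circuits the retry logic entirely, which is why exporting it is the right way to pin either the mirror or the official site.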
Modify the default directory for weights
If the pre-trained weights have been downloaded from another source, set the UNIMOL_WEIGHT_DIR environment variable to the directory containing them.
export UNIMOL_WEIGHT_DIR=/path/to/your/weights/dir/
News
- 2025-09-22: Lightweight pre-training tools are now available in Unimol_tools!
- 2025-05-26: Unimol_tools is now independent from the Uni-Mol repository!
- 2025-03-28: Unimol_tools now supports Distributed Data Parallel (DDP)!
- 2024-11-22: Unimol V2 has been added to Unimol_tools!
- 2024-07-23: User experience improvements: added UNIMOL_WEIGHT_DIR.
- 2024-06-25: unimol_tools has been published to PyPI! Hugging Face is now used to manage the pretrained models.
- 2024-06-20: unimol_tools v0.1.0 released; the dependency on Uni-Core has been removed. PyPI release to follow.
- 2024-03-20: unimol_tools documentation is available at https://unimol-tools.readthedocs.io/en/latest/
Examples
Molecule property prediction
from rdkit.Chem import PandasTools  # needed for LoadSDF below
from unimol_tools import MolTrain, MolPredict
clf = MolTrain(
    task='classification',
    data_type='molecule',
    epochs=10,
    batch_size=16,
    metrics='auc',
    # pretrained weights are downloaded automatically when left as ``None``
    # pretrained_model_path='/path/to/checkpoint.ckpt',
    # pretrained_dict_path='/path/to/dict.txt',
)
clf.fit(data=train_data)
# Supported inputs: SMILES-based csv/txt files, sdf files containing mols,
# and a custom dict such as
# {'atoms': [['C','C'], ['C','H','O']], 'coordinates': [coordinates_1, coordinates_2]}.
# The dict below illustrates that format; it can also be built from an sdf file,
# and the loaded sdf DataFrame itself can be passed to the model directly.
train_sdf = PandasTools.LoadSDF('exp/unimol_conformers_train.sdf')
train_dict = {
    'atoms': [[atom.GetSymbol() for atom in mol.GetAtoms()] for mol in train_sdf['ROMol']],
    # atoms[0]: ['C', 'C', 'O', 'C', 'O', 'C', ...]
    'coordinates': [mol.GetConformers()[0].GetPositions() for mol in train_sdf['ROMol']],
    # coordinates[0]: array([[ 6.6462, -1.8268,  1.9275],
    #                        [ 6.1552, -1.9367,  0.4873],
    #                        [ 5.1832, -0.8757,  0.3007],
    #                        [ 5.4651, -0.0272, -0.7266],
    #                        [ 4.8586, -0.0844, -1.7917],
    #                        [ 6.5362,  0.9767, -0.3742],
    #                        ...])
    'TARGET': train_sdf['TARGET'].tolist(),
    # TARGET: [0, 1, 0, 0, 1, 0, ...]
}
# clf.fit(data=train_sdf)
# clf.fit(data=train_dict)
predictor = MolPredict(load_model='../exp')
res = predictor.predict(data=test_data)
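For file-based input, a SMILES CSV like the sketch below is the kind of file fit accepts. The column names used here ('SMILES', 'TARGET') are assumptions for illustration; adjust them to match your data and configuration:

```python
import csv

# Write a minimal training CSV; column names are illustrative assumptions.
rows = [
    {"SMILES": "CCO", "TARGET": 0},
    {"SMILES": "c1ccccc1", "TARGET": 1},
]
with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["SMILES", "TARGET"])
    writer.writeheader()
    writer.writerows(rows)
# The resulting path can then be passed to training, e.g. clf.fit(data='train.csv')
```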
Molecule representation
import numpy as np
from unimol_tools import UniMolRepr
# single SMILES UniMol representation. If no paths are provided the
# pretrained model and dictionary are fetched from Hugging Face.
clf = UniMolRepr(
data_type='molecule',
remove_hs=False,
# pretrained_model_path='/path/to/checkpoint.ckpt',
# pretrained_dict_path='/path/to/dict.txt',
)
smiles = 'c1ccc(cc1)C2=NCC(=O)Nc3c2cc(cc3)[N+](=O)[O-]'
smiles_list = [smiles]
unimol_repr = clf.get_repr(smiles_list, return_atomic_reprs=True)
# CLS token repr
print(np.array(unimol_repr['cls_repr']).shape)
# atomic-level repr, aligned with rdkit mol.GetAtoms()
print(np.array(unimol_repr['atomic_reprs']).shape)
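The cls_repr vectors can serve as molecule-level embeddings for downstream tasks such as similarity search. A minimal sketch in plain Python; the short vectors below are stand-ins for real embeddings, which in practice come from unimol_repr['cls_repr']:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-in CLS embeddings; in practice use unimol_repr['cls_repr'][i]
emb_a = [0.2, -0.5, 0.1, 0.7]
emb_b = [0.1, -0.4, 0.0, 0.9]
print(cosine_similarity(emb_a, emb_b))
```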
Command-line utilities
Hydra-powered entry points make training, prediction, and representation
available from the command line. Key-value pairs override options from the
YAML files in unimol_tools/config.
Training
python -m unimol_tools.cli.run_train \
train_path=train.csv \
task=regression \
save_path=./exp \
smiles_col=smiles \
target_cols=[target1] \
epochs=10 \
learning_rate=1e-4 \
batch_size=16 \
kfold=5
Prediction
python -m unimol_tools.cli.run_predict load_model=./exp data_path=test.csv
Representation
python -m unimol_tools.cli.run_repr data_path=test.csv smiles_col=smiles
Molecule pretraining
unimol_tools provides a command-line utility for pretraining Uni-Mol models on
your own dataset. The script uses
Hydra so configuration values can be overridden at the
command line. Two common invocation examples are shown below: one for LMDB data
and one for a CSV of SMILES strings.
LMDB dataset
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export HYDRA_FULL_ERROR=1
export OMP_NUM_THREADS=1
torchrun --standalone --nproc_per_node=NUM_GPUS \
-m unimol_tools.cli.run_pretrain \
dataset.train_path=train.lmdb \
dataset.valid_path=valid.lmdb \
dataset.data_type=lmdb \
dataset.dict_path=dict.txt \
training.total_steps=1000000 \
training.batch_size=16 \
training.update_freq=1
dataset.dict_path is optional. The effective batch size is
n_gpu * training.batch_size * training.update_freq.
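That relationship can be captured in a small helper for picking training.update_freq when the per-GPU batch size is capped by memory. The function names here are illustrative, not part of the unimol_tools API:

```python
def effective_batch_size(n_gpu, batch_size, update_freq):
    """Effective batch = n_gpu * per-GPU batch size * gradient-accumulation steps."""
    return n_gpu * batch_size * update_freq

def update_freq_for(target, n_gpu, batch_size):
    """Smallest update_freq whose effective batch size reaches the target."""
    freq = 1
    while effective_batch_size(n_gpu, batch_size, freq) < target:
        freq += 1
    return freq

# e.g. 8 GPUs, per-GPU batch of 16, targeting an effective batch of 512:
print(update_freq_for(512, 8, 16))  # -> 4
```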
CSV dataset
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export HYDRA_FULL_ERROR=1
export OMP_NUM_THREADS=1
torchrun --standalone --nproc_per_node=NUM_GPUS \
-m unimol_tools.cli.run_pretrain \
dataset.train_path=train.csv \
dataset.valid_path=valid.csv \
dataset.data_type=csv \
dataset.smiles_column=smiles \
training.total_steps=1000000 \
training.batch_size=16 \
training.update_freq=1
For multi-node training, specify additional arguments, for example:
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export HYDRA_FULL_ERROR=1
export OMP_NUM_THREADS=1
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
--master_addr=<master-ip> --master_port=<port> \
-m unimol_tools.cli.run_pretrain ...
All available options are defined in
pretrain_config.py, and checkpoints
along with the dictionary are saved to the run directory. When GPU memory is
limited, increase training.update_freq to accumulate gradients while keeping
the effective batch size n_gpu * training.batch_size * training.update_freq.
Credits
We thank all contributors from the community for their suggestions, bug reports, and chemistry advice. Currently unimol-tools is maintained by Yaning Cui, Xiaohong Ji, and Zhifeng Gao from DP Technology and the AI for Science Institute, Beijing.
Please kindly cite our papers if you use this tool.
@article{gao2023uni,
title={Uni-QSAR: an Auto-ML tool for molecular property prediction},
author={Gao, Zhifeng and Ji, Xiaohong and Zhao, Guojiang and Wang, Hongshuai and Zheng, Hang and Ke, Guolin and Zhang, Linfeng},
journal={arXiv preprint arXiv:2304.12239},
year={2023}
}
License
This project is licensed under the terms of the MIT license. See LICENSE for additional details.