GT4SD's submodule for the MolFormer model

This package is the GT4SD submodule for the MolFormer model. The original MolFormer codebase is available at https://github.com/IBM/molformer; refer to that repository for usage information and further details about the model.

Development setup & installation

The recommended way to install gt4sd-molformer is to create a dedicated conda environment, which ensures that all requirements are satisfied:

git clone https://github.com/GT4SD/gt4sd-molformer.git
cd gt4sd-molformer/
conda env create -f conda.yml
conda activate gt4sd-molformer

Then run:

pip install .
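
To confirm that the installation succeeded, one option is to query the version of the installed distribution. The following is a minimal sketch, assuming `python3` resolves to the interpreter of the active conda environment; `installed_version` is a hypothetical helper name and is not part of the package:

```shell
# Hypothetical helper: print the installed version of a distribution,
# or return a non-zero status if it is not installed.
installed_version() {
    python3 -c 'import sys, importlib.metadata as m; print(m.version(sys.argv[1]))' "$1" 2>/dev/null
}

# Report the version, or a short message on stderr if the package is missing.
installed_version gt4sd-molformer || echo "gt4sd-molformer is not installed" >&2
```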

If you would like to contribute to the package, install it in editable mode with the development extras:

pip install -e ".[dev]"

Note: To train or fine-tune a model, the Apex optimizers must be compiled with the CUDA and C++ extensions. The provided install_apex.sh script does this. Before running the script, set the CUDA_HOME environment variable to the path of your CUDA 11 installation.

export CUDA_HOME='<path to your CUDA 11 installation>'
bash install_apex.sh
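
Because the Apex build depends on CUDA_HOME, it can help to validate the variable before running the script. The following is a minimal sketch, not part of the repository: `check_cuda_home` is a hypothetical helper, and it assumes the standard toolkit layout in which the compiler lives at `$CUDA_HOME/bin/nvcc`.

```shell
# Hypothetical helper: sanity-check CUDA_HOME before building Apex.
check_cuda_home() {
    # Fail if the variable is unset or empty.
    if [ -z "${CUDA_HOME:-}" ]; then
        echo "CUDA_HOME is not set" >&2
        return 1
    fi
    # Assumption: the CUDA compiler is located at $CUDA_HOME/bin/nvcc.
    if [ ! -x "$CUDA_HOME/bin/nvcc" ]; then
        echo "nvcc not found under $CUDA_HOME/bin" >&2
        return 1
    fi
    # Print the toolkit version so the CUDA 11 requirement can be confirmed by eye.
    "$CUDA_HOME/bin/nvcc" --version
}

# Usage: check_cuda_home && bash install_apex.sh
```

The helper only reports the toolkit version; it does not itself enforce that the release is CUDA 11.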

References

If you use MolFormer in your projects, please consider citing the following:

@article{10.1038/s42256-022-00580-7, 
year = {2022}, 
title = {{Large-scale chemical language representations capture molecular structure and properties}}, 
author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel}, 
journal = {Nature Machine Intelligence}, 
doi = {10.1038/s42256-022-00580-7}, 
abstract = {{Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties. Large language models have recently emerged with extraordinary capabilities, and these methods can be applied to model other kinds of sequence, such as string representations of molecules. Ross and colleagues have created a transformer-based model, trained on a large dataset of molecules, which provides good results on property prediction tasks.}}, 
pages = {1256--1264}, 
number = {12}, 
volume = {4}
}

@misc{https://doi.org/10.48550/arxiv.2106.09553,
  doi = {10.48550/ARXIV.2106.09553},
  url = {https://arxiv.org/abs/2106.09553},
  author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
  keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), Biomolecules (q-bio.BM), FOS: Computer and information sciences, FOS: Biological sciences},
  title = {Large-Scale Chemical Language Representations Capture Molecular Structure and Properties},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

If you use gt4sd in your projects, please consider citing the following:

@article{manica2022gt4sd,
  title={GT4SD: Generative Toolkit for Scientific Discovery},
  author={Manica, Matteo and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Born, Jannis and Clarke, Dean and Teukam, Yves Gaetan Nana and Hoffman, Samuel C and Buchan, Matthew and Chenthamarakshan, Vijil and others},
  journal={arXiv preprint arXiv:2207.03928},
  year={2022}
}

License

The gt4sd codebase is released under the MIT license. For individual model usage, please refer to the model licenses in the original packages.
