DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers
Abstract
The DECIMER 1.0 [8] (Deep lEarning for Chemical ImagE Recognition) project [1] was launched to address the optical chemical structure recognition (OCSR) problem using the latest computational intelligence methods, with the aim of providing an automated open-source software solution.
The original implementation of DECIMER [1] was trained on a GPU, and training becomes slow on larger datasets of more than 1 million images. A common way to shorten training is to adapt the training script to run on multiple GPUs. We went a step further and implemented our code to run on Google's machine learning hardware, the TPU (Tensor Processing Unit) [2].
Method and model changes
- DECIMER now uses EfficientNet-B3 [3],[4] for image feature extraction and a transformer model [5] for predicting the SMILES.
- The SMILES [6] are encoded to SELFIES [7] during training and prediction.
Changes in the training method
- We converted our datasets into TFRecord files, a binary file format that TPUs can read much faster. The same files can also be used to train on GPUs. Using TFRecords speeds up training by removing the bottleneck of reading many separate files from the hard disks.
- We moved our data to Google Cloud Storage buckets, a storage solution provided by Google Cloud that can be accessed easily and quickly from any Google Cloud VM. (For the highest speed, the cloud storage and the VM should be in the same region.)
- We adopted the TensorFlow data pipeline to load all TFRecord files to the TPUs from Google Cloud Storage buckets.
- We modified the main training code to work on TPUs using the TPU strategy introduced in TensorFlow 2.0.
Documentation
- We are currently working on improving the documentation.
Datasets
The datasets are available in SMILES and SELFIES format and can be downloaded from Zenodo. To generate the images, use the command below:
$ java -cp cdk-2.3.jar:. Smilesdepictor filtered_SMILES.txt
The image augmentations can be generated using the Python imgaug package.
Usage:
How to re-train the models
1. Generate the image data and SMILES data using the provided Java files. Input files should be in SMILES format.
# Keep only the compounds that fit the DECIMER ruleset.
$ java -cp cdk-2.3.jar:. Pubchemfilter Input_SMILES.txt
# Generate images and save them into folders.
$ java -cp cdk-2.3.jar:. Smilesdepictor filtered_SMILES.txt
2. Generate SELFIES and split them.
$ python3 Smiles2SELFIES.py Generated_SMILES.txt
# Use the sed command on Linux to split the SELFIES into tokens at the square brackets.
$ sed -i 's/\]\[/\] \[/g' Generated_SELFIES.txt
3. Create TFRecords.
# Use Create_tokenizer.py to create the tokens and the file paths for the image files. The input is the Generated_SELFIES.txt file.
# This generates multiple files with tokenized SELFIES and image paths. It also generates the final tokenizer.pkl and max_length.pkl, which are used later for training.
# Use Create_TFrecord_From_Vectors.py to generate the TFRecords.
$ python3 Create_TFrecord_From_Vectors.py 1
4. Move the TFRecords to Google Cloud Storage.
$ gsutil -m cp -r path/to/tfrecords/ path/to/cloud/storage
5. Train on Google Cloud TPUs.
Create a VM and a TPU node in the same location as your Google Cloud Storage bucket, then modify the paths to the TFRecords, tokenizer.pkl and max_length.pkl in the training script.
Change the TPU node name.
Once the TPU is ready, execute the following on your virtual machine console:
$ python3 TPU_Trainer_Image2Smiles_transformer.py
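Steps 2 and 3 can be illustrated with a minimal pure-Python sketch. It is not the code in Create_tokenizer.py: the function names, the `<pad>`, `<start>` and `<end>` marker tokens, and the plain dictionary used as a tokenizer are assumptions made for illustration, and the real tokenizer.pkl may store a different object. The sketch splits SELFIES strings at the square brackets (the same result the sed command produces), builds a token-to-index map, records the longest sequence length, and pickles both.

```python
import pickle
import re
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

# Matches one bracketed SELFIES symbol such as "[C]" or "[=O]".
SELFIES_TOKEN = re.compile(r"\[[^\]]*\]")


def split_selfies(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed tokens."""
    return SELFIES_TOKEN.findall(selfies)


def build_tokenizer(lines: List[str]) -> Tuple[Dict[str, int], int]:
    """Return a token-to-index map and the longest sequence length.

    The three marker tokens are hypothetical and only used in this sketch.
    """
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2}
    max_length = 0
    for line in lines:
        tokens = split_selfies(line)
        # Count the start and end markers that encode() adds.
        max_length = max(max_length, len(tokens) + 2)
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab, max_length


def encode(selfies: str, vocab: Dict[str, int], max_length: int) -> List[int]:
    """Turn one SELFIES string into a padded list of token indices."""
    tokens = ["<start>", *split_selfies(selfies), "<end>"]
    ids = [vocab[token] for token in tokens]
    return ids + [vocab["<pad>"]] * (max_length - len(ids))


def save_pickle(obj, path: Path) -> None:
    """Write any picklable object to disk."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


# Toy input standing in for the contents of Generated_SELFIES.txt.
lines = ["[C][C][O]", "[C][=O]"]
vocab, max_length = build_tokenizer(lines)

# Written to a temporary folder here; a real run would keep the files.
with tempfile.TemporaryDirectory() as out_dir:
    save_pickle(vocab, Path(out_dir) / "tokenizer.pkl")
    save_pickle(max_length, Path(out_dir) / "max_length.pkl")

print(" ".join(split_selfies(lines[0])))    # [C] [C] [O]
print(encode(lines[1], vocab, max_length))  # [1, 3, 5, 2, 0]
```

Padding every sequence to max_length gives fixed-size tensors, which is one reason a value like max_length.pkl is useful at training time.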
How to use DECIMER?
We suggest using DECIMER inside a Conda environment, which makes the dependencies easy to install.
- Conda is available as part of the Anaconda or Miniconda platforms (Python 3.7). We recommend installing Miniconda3. On Linux, you can get it with:
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
Instructions
$ sudo apt update
$ sudo apt install default-jdk # Only needed if Java is not already installed
Python Package Installation
Install the latest code from GitHub with:
$ pip install git+https://github.com/Kohulan/DECIMER-Image_Transformer.git
Install in development mode with:
$ git clone https://github.com/Kohulan/DECIMER-Image_Transformer.git decimer
$ cd decimer/
$ pip install -e .
- Where -e means "editable" mode.
Install from PyPI
$ pip install decimer
How to use inside your own Python script
from decimer import DECIMER
# Choose one of the available models and the image to process.
model_name = "Isomeric"
img_path = "caffeine.png"
# Predict the SMILES string for the image and print it.
caffeine_smiles = DECIMER.predict_SMILES(img_path, model_name)
print(caffeine_smiles)
Install tensorflow==2.3.0 if you do not have an Nvidia GPU (on macOS).
CLI Usage
The Python package automatically installs the decimer command-line tool.
$ decimer --help # Use for help
- The first time you run the program, the models are downloaded automatically (note: the total size is about 1 GB). You can also download the models manually. Example usage:
$ decimer --model Canonical --image Sample_Images/caffeine.png # Predict SMILES for a single image.
$ decimer --model Isomeric --dir Sample_Images # Predict SMILES for all the images inside a folder.
DECIMER selects the Canonical model by default, but you can choose any of the following models.
Available Models:
- Canonical: Model trained on images depicted using canonical SMILES
- Isomeric: Model trained on images depicted using isomeric SMILES, which include stereochemical information and ions
- Augmented: Model trained on images depicted using isomeric SMILES with augmentations
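If you want to call the command-line tool from another Python program, a small wrapper can check the model name before launching it. The sketch below is an assumption-laden illustration, not part of the package: `build_decimer_command` and `run_decimer` are hypothetical helper names, and only the `--model`, `--image` and `--dir` flags shown above are used. The format of the tool's output is not assumed; `run_decimer` simply returns whatever is printed to stdout.

```python
import subprocess
from typing import List, Optional

# Model names as listed under "Available Models".
AVAILABLE_MODELS = ("Canonical", "Isomeric", "Augmented")


def build_decimer_command(model: str = "Canonical",
                          image: Optional[str] = None,
                          directory: Optional[str] = None) -> List[str]:
    """Build the argument list for the decimer command-line tool."""
    if model not in AVAILABLE_MODELS:
        raise ValueError(
            f"Unknown model {model!r}; choose one of {AVAILABLE_MODELS}")
    # Exactly one target is allowed: a single image or a folder of images.
    if (image is None) == (directory is None):
        raise ValueError("Pass exactly one of image or directory")
    if image is not None:
        target = ["--image", image]
    else:
        target = ["--dir", directory]
    return ["decimer", "--model", model, *target]


def run_decimer(**kwargs) -> str:
    """Run the tool and return its standard output as text.

    Requires the decimer package to be installed; raises
    subprocess.CalledProcessError if the tool exits with an error.
    """
    result = subprocess.run(build_decimer_command(**kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Only builds and prints the command, so it runs without DECIMER installed.
print(build_decimer_command(model="Isomeric", directory="Sample_Images"))
# ['decimer', '--model', 'Isomeric', '--dir', 'Sample_Images']
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.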
License:
- This project is licensed under the MIT License - see the LICENSE file for details
Citation
- Rajan, K., Zielesny, A. & Steinbeck, C. DECIMER 1.0: deep learning for chemical image recognition using transformers. J Cheminform 13, 61 (2021). https://doi.org/10.1186/s13321-021-00538-8
References
1. Rajan K, Zielesny A, Steinbeck C (2020) DECIMER: towards deep learning for chemical image recognition. J Cheminform 12:65. https://doi.org/10.1186/s13321-020-00469-w
2. Norrie T, Patil N, Yoon DH, Kurian G, Li S, Laudon J, Young C, Jouppi N, Patterson D (2021) The Design Process for Google's Training Chips: TPUv2 and TPUv3. IEEE Micro 41:56–63
3. Tan M, Le Q (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning. PMLR, pp 6105–6114
4. Xie Q, Luong M-T, Hovy E, Le QV (2020) Self-training with noisy student improves imagenet classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 10687–10698
5. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention Is All You Need. arXiv [cs.CL]
6. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31–36
7. Krenn M, Häse F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach Learn: Sci Technol 1:045024
8. Rajan K, Zielesny A, Steinbeck C (2021) DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.14479287.v1
Acknowledgement
- We thank Charles Tapley Hoyt for his valuable advice and help in improving the DECIMER repository.
- We are grateful to Google for making free computing time on their TensorFlow Research Cloud infrastructure available to us.
Author: Kohulan
Project Website: DECIMER
Research Group
File details
Details for the file decimer-1.0.3.tar.gz.
File metadata
- Download URL: decimer-1.0.3.tar.gz
- Upload date:
- Size: 548.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1ad345041fe92b74b31c1174c2d4ea976f3c42127d0a14e29507547ff6bc22c2
MD5 | 09ded96540bd4e8ff656070be911459f
BLAKE2b-256 | 11903264a31974059dbc98c7b7d8c86196500ffdbd0b5f878820223409112087
File details
Details for the file decimer-1.0.3-py3-none-any.whl.
File metadata
- Download URL: decimer-1.0.3-py3-none-any.whl
- Upload date:
- Size: 554.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0ac204358ea6261d2688f264c51370fa8da9f5f70a45d21c30789557fb2a99bd
MD5 | 991725bfe7935d9ad0f3f35f837e57d6
BLAKE2b-256 | 2f557d6dc83759b50ffe2b7e0eb0b6c14869807ed1d7f6fd3ad8f0b5692b66e0
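The SHA256 digests listed above can be used to check a manually downloaded file. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `verify_download` are hypothetical, and the script assumes the files, if present, sit in the current working directory under their original names.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the "File hashes" tables above.
EXPECTED_SHA256 = {
    "decimer-1.0.3.tar.gz":
        "1ad345041fe92b74b31c1174c2d4ea976f3c42127d0a14e29507547ff6bc22c2",
    "decimer-1.0.3-py3-none-any.whl":
        "0ac204358ea6261d2688f264c51370fa8da9f5f70a45d21c30789557fb2a99bd",
}


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256: str) -> bool:
    """Compare a file's digest with the expected value, ignoring case."""
    return sha256_of_file(path) == expected_sha256.lower()


# Report the status of each file that is present in the current folder.
for name, expected in EXPECTED_SHA256.items():
    if Path(name).exists():
        status = "OK" if verify_download(name, expected) else "MISMATCH"
    else:
        status = "not downloaded"
    print(f"{name}: {status}")
```

Reading the file in chunks keeps memory use constant regardless of file size.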