DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers python package

Abstract

The DECIMER 1.0 [8] (Deep lEarning for Chemical ImagE Recognition) project [1] was launched to address the optical chemical structure recognition (OCSR) problem with the latest computational intelligence methods and to provide an automated open-source software solution.

Training the original GPU implementation of DECIMER [1] takes a long time on larger datasets of more than 1 million images. A common way to shorten training is to adapt the training script to run on multiple GPUs. We instead went a step further and implemented our code to run on Google's machine learning hardware, the TPU (Tensor Processing Unit) [2].

Method and model changes

  • DECIMER now uses EfficientNet-B3 [3],[4] for image feature extraction and a transformer model [5] for predicting the SMILES.
  • The SMILES [6] are encoded to SELFIES [7] during training and prediction.

Changes in the training method

  • We converted our datasets into TFRecord files, a binary file format that TPUs can read much faster. The same files can also be used to train on GPUs. Using TFRecords speeds up training by removing the bottleneck of reading many individual files from hard disks.
  • We moved our data to Google Cloud Storage buckets, an efficient storage solution in the Google Cloud environment that lets any Google Cloud VM access the files easily and quickly. (For the highest speed, the bucket and the VM should be in the same region.)
  • We adopted the TensorFlow data pipeline to load all TFRecord files to the TPUs from Google Cloud Storage buckets.
  • We modified the main training code to work on TPUs using the TPU strategy introduced in TensorFlow 2.0.
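
To illustrate why packing many small files into one binary file helps, here is a minimal sketch of a length-prefixed record file. It is a simplified stand-in and not the TFRecord format itself (real TFRecord files also carry checksums and are normally written with TensorFlow's own writer). The function names and the sample payloads are made up for this example.

```python
import io
import struct


def write_records(stream, payloads):
    """Write each payload as an 8-byte little-endian length followed by its bytes."""
    for payload in payloads:
        stream.write(struct.pack("<Q", len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield the payloads back in order, reading one length header at a time."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) != 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<Q", header)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record")
        yield payload


# Pack three made-up "image + label" samples into a single in-memory file.
samples = [b"image-1|[C] [C] [O]", b"image-2|[C] [=O]", b"image-3|[N]"]
buffer = io.BytesIO()
write_records(buffer, samples)

# One sequential pass over one file recovers every sample.
buffer.seek(0)
restored = list(read_records(buffer))
```

Reading the samples back is a single sequential scan, which is the property that avoids opening one file per image during training.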

Documentation

Datasets

The datasets are available in SMILES and SELFIES format and can be downloaded from Zenodo. To generate the images, use the command below.

$ java -cp cdk-2.3.jar:. Smilesdepictor filtered_SMILES.txt

The image augmentations can be generated using the Python imgaug package.

Usage:

How to re-train the models

1. Generate the image data and SMILES data using the provided Java files. Input files should be in SMILES format.

# Keep only the compounds that fit the DECIMER ruleset.
$ java -cp cdk-2.3.jar:. Pubchemfilter Input_SMILES.txt

# Generate images and save them into folders.
$ java -cp cdk-2.3.jar:. Smilesdepictor filtered_SMILES.txt

2. Generate SELFIES and split them.

$ python3 Smiles2SELFIES.py Generated_SMILES.txt

# Use the sed command on Linux to split the SELFIES into tokens at the square brackets.
$ sed -i 's/\]\[/\] \[/g' Generated_SELFIES.txt
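
For readers who prefer to stay in Python, the same splitting can be sketched with the standard re module. This is an illustrative equivalent of the sed substitution above, not part of the DECIMER scripts. The function name split_selfies is made up for this example, and it works on a string rather than editing a file in place.

```python
import re


def split_selfies(selfies):
    """Insert a space between adjacent SELFIES symbols: "[C][O]" becomes "[C] [O]"."""
    return re.sub(r"\]\[", "] [", selfies)


# After the substitution, a plain whitespace split yields one token per symbol.
spaced = split_selfies("[C][=O][N]")
tokens = spaced.split()
```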

3. Create TFRecords.

# Use Create_tokenizer.py to create the tokens and the file paths for the image files. The input is the Generated_SELFIES.txt file.
# This generates multiple files with tokenized SELFIES and image paths. It also generates the final tokenizer.pkl and max_length.pkl, which are used later for training.

# Use Create_TFrecord_From_Vectors.py to generate the TFRecords.
$ python3 Create_TFrecord_From_Vectors.py 1
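
The sketch below shows, in general terms, what a tokenizer and a maximum length derived from space-separated SELFIES could look like, and how such objects can be stored with pickle. It is an assumption-laden illustration, not the code in Create_tokenizer.py: the special tokens, the dictionary layout and all function names are hypothetical.

```python
import pickle


def build_tokenizer(selfies_lines, start="<start>", end="<end>", pad="<pad>"):
    """Map every token to an integer id and record the longest sequence length."""
    vocab = {pad: 0, start: 1, end: 2}  # hypothetical special tokens
    max_length = 0
    for line in selfies_lines:
        tokens = [start] + line.split() + [end]
        max_length = max(max_length, len(tokens))
        for token in tokens:
            vocab.setdefault(token, len(vocab))  # new tokens get the next free id
    return vocab, max_length


def encode(line, vocab, max_length, start="<start>", end="<end>", pad="<pad>"):
    """Turn one SELFIES line into a list of ids padded to max_length."""
    ids = [vocab[token] for token in [start] + line.split() + [end]]
    return ids + [vocab[pad]] * (max_length - len(ids))


lines = ["[C] [C] [O]", "[C] [=O]"]
vocab, max_length = build_tokenizer(lines)
encoded = encode("[C] [=O]", vocab, max_length)

# Serialize and restore, as one would when saving to a .pkl file.
restored_vocab = pickle.loads(pickle.dumps(vocab))
```

Padding every sequence to the same length is what makes it possible to batch the encoded strings as fixed-size arrays during training.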

4. Move the TFRecords to Google Cloud Storage.

$ gsutil -m cp -r path/to/tfrecords/ path/to/cloud/storage

5. Train on Google Cloud TPUs.

Create a VM and a TPU node in the same location as your Google Cloud Storage bucket, and modify the paths to the TFRecords, tokenizer.pkl and max_length.pkl.

Change the TPU node name.

Once the TPU is ready, execute the following on your virtual machine console:

$ python3 TPU_Trainer_Image2Smiles_transformer.py

How to use DECIMER?

We suggest using DECIMER inside a Conda environment, which makes the dependencies easy to install.

  • Conda can be downloaded as part of the Anaconda or Miniconda platforms (Python 3.7). We recommend installing Miniconda3. On Linux, you can get it with:
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Instructions

$ sudo apt update
$ sudo apt install default-jdk # Only needed if Java is not already installed

Python Package Installation

Install the latest code from GitHub with:

$ pip install git+https://github.com/Kohulan/DECIMER-Image_Transformer.git

Install in development mode with:

$ git clone https://github.com/Kohulan/DECIMER-Image_Transformer.git decimer
$ cd decimer/
$ pip install -e .
  • Where -e means "editable" mode.

Install from PyPI:

$ pip install decimer

How to use DECIMER inside your own Python script

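from decimer import DECIMER

# Choose the model and the image to process.
model_name = "Isomeric"
img_path = "caffeine.png"

# Predict the SMILES string for the image and print it.
caffeine_smiles = DECIMER.predict_SMILES(img_path, model_name)
print(caffeine_smiles)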

If you do not have an NVIDIA GPU (for example, on macOS), install tensorflow==2.3.0.
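
To process several images from a script, one option is a small loop around the prediction call. The helper below is a sketch rather than part of the package: predict_folder and IMAGE_SUFFIXES are made-up names, and the predictor is passed in as an argument so the loop can be tried without loading any model. The demo at the end uses a stand-in function instead of DECIMER.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to the formats you actually use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def predict_folder(folder, predict, model_name="Canonical"):
    """Return {file name: prediction} for every image file found in folder."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = predict(str(path), model_name)
    return results


# Demo with a stand-in predictor and a temporary folder (no model required).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("caffeine.png", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    demo_results = predict_folder(
        tmp, lambda path, model: f"{model}:{Path(path).name}"
    )
```

With the package installed, the stand-in could be replaced by the prediction function from the script above, for example predict_folder("Sample_Images", DECIMER.predict_SMILES, "Isomeric").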

CLI Usage

The Python package automatically installs the decimer command-line tool.

$ decimer --help  # Use for help
  • The first time you run the program, the models are downloaded automatically (note: the total size is about 1 GB). You can also download the models manually. Example commands:
$ decimer --model Canonical --image Sample_Images/caffeine.png       # Predict SMILES for a single image.
$ decimer --model Isomeric --dir Sample_Images         # Predict SMILES for all the images inside a folder.

DECIMER uses the Canonical model by default, but you can choose any of the following models.

Available Models:

  • Canonical: Model trained on images depicted using canonical SMILES
  • Isomeric: Model trained on images depicted using isomeric SMILES, which include stereochemical information and ions
  • Augmented: Model trained on images depicted using isomeric SMILES with augmentations
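
When calling the command-line tool from another program, it can help to validate the model name before launching it. The sketch below only assembles the argument list from the flags shown above (--model, --image, --dir). The function name decimer_command is made up for this example, and nothing is executed.

```python
MODELS = ("Canonical", "Isomeric", "Augmented")


def decimer_command(model="Canonical", image=None, directory=None):
    """Build the argument list for one decimer CLI call."""
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    if (image is None) == (directory is None):
        raise ValueError("pass exactly one of image or directory")
    target = ["--image", image] if image is not None else ["--dir", directory]
    return ["decimer", "--model", model, *target]


single = decimer_command("Canonical", image="Sample_Images/caffeine.png")
folder = decimer_command("Isomeric", directory="Sample_Images")
```

The resulting list can be handed to subprocess.run(single, check=True) once the decimer command is installed.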

License:

  • This project is licensed under the MIT License - see the LICENSE file for details

References

  1. Rajan K, Zielesny A, Steinbeck C (2020) DECIMER: towards deep learning for chemical image recognition. J Cheminform 12:65. https://doi.org/10.1186/s13321-020-00469-w
  2. Norrie T, Patil N, Yoon DH, Kurian G, Li S, Laudon J, Young C, Jouppi N, Patterson D (2021) The Design Process for Google's Training Chips: TPUv2 and TPUv3. IEEE Micro 41:56–63
  3. Tan M, Le Q (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning. PMLR, pp 6105–6114
  4. Xie Q, Luong M-T, Hovy E, Le QV (2020) Self-training with noisy student improves imagenet classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 10687–10698
  5. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention Is All You Need. arXiv [cs.CL]
  6. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31–36
  7. Krenn M, Häse F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach Learn: Sci Technol 1:045024
  8. Rajan K, Zielesny A, Steinbeck C (2021) DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.14479287.v1

Acknowledgement

  • We thank Charles Tapley Hoyt for his valuable advice and help in improving the DECIMER repository.
  • We are grateful to Google for making free computing time on their TensorFlow Research Cloud infrastructure available to us.

Author: Kohulan

Project Website: DECIMER
