DECIMER Segmentation - Extraction of chemical structure depictions from scientific literature

DECIMER-Image-Segmentation

Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature.

The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs.
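The idea behind the mask expansion step can be illustrated with a small, dependency-free sketch. It is not the project's post-processing code: it assumes a binarised page given as a 2D list of 0/1 values and uses a plain 8-connected flood fill, so the function name `expand_mask` and the data layout are hypothetical choices made for illustration only.

```python
from collections import deque


def expand_mask(page, mask):
    """Grow `mask` to cover every connected ink region it already touches.

    `page` and `mask` are equally sized 2D lists of 0/1 values:
    page[y][x] == 1 marks a dark (ink) pixel, mask[y][x] == 1 a masked pixel.
    """
    height, width = len(page), len(page[0])
    expanded = [row[:] for row in mask]
    # Seed the search with ink pixels that the initial mask already covers.
    queue = deque(
        (y, x)
        for y in range(height)
        for x in range(width)
        if page[y][x] and mask[y][x]
    )
    while queue:
        y, x = queue.popleft()
        # Visit the 8-connected neighbourhood of the current pixel.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    if page[ny][nx] and not expanded[ny][nx]:
                        expanded[ny][nx] = 1
                        queue.append((ny, nx))
    return expanded
```

An incomplete mask that touches one corner of a depiction grows until it covers the whole connected drawing, while unconnected ink elsewhere on the page stays unmasked.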

By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. To facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at https://decimer.ai, lets the user upload a PDF file and retrieve the segmented structure depictions.

Usage

  • To use DECIMER Segmentation, clone the repository to your local disk. Mask-RCNN runs on either a GPU or a CPU; if you use a GPU, make sure the necessary drivers are installed.
We recommend using DECIMER-Segmentation inside a Conda environment to simplify the installation of the dependencies.
  • Conda is available as part of the Anaconda or Miniconda platforms (Python 3). We recommend installing Miniconda3. On Linux, you can get it with:
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

How to install DECIMER-Segmentation

$ git clone https://github.com/Kohulan/DECIMER-Image-Segmentation
$ cd DECIMER-Image-Segmentation
$ conda create --name DECIMER_IMGSEG python=3.10
$ conda activate DECIMER_IMGSEG
$ conda install pip
$ python -m pip install -U pip  # Upgrade pip
$ pip install .
$ conda install -c conda-forge poppler

# Alternatively, install from PyPI
$ pip install decimer-segmentation
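After either installation route, a quick check can confirm that the package is visible to the active environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and the module name `decimer_segmentation` is taken from the import shown in the usage section.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    status = "found" if is_installed("decimer_segmentation") else "missing"
    print(f"decimer_segmentation: {status}")
```

Locating the module instead of importing it keeps the check fast, since it does not load the model's dependencies.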

The Mask-RCNN Model is available at: DOI

How to use DECIMER-Segmentation

  • The repository contains a script that segments chemical structures from an image of a scanned page or from a PDF document:
$ python3 segment_structures_in_document.py file_name
  • The segmented images are saved in an output folder named after the PDF file.

  • Alternatively, you can integrate DECIMER Segmentation into your Python code:

from decimer_segmentation import segment_chemical_structures, segment_chemical_structures_from_file
import cv2

# Placeholder paths; replace them with your own files
scanned_page_file_path = "scanned_page.png"
path = "article.pdf"

# Segment structures in a scanned page image (np.array)
page = cv2.imread(scanned_page_file_path)
segments = segment_chemical_structures(page, expand=True)

# Segment structures from a file (PDF or image).
# Windows users who want to process PDF files may need to pass the location
# of their poppler installation via the poppler_path argument.
segments = segment_chemical_structures_from_file(path, expand=True, poppler_path=None)
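To keep the returned segments, they can be written to disk one by one. The sketch below assumes that `segments` is a list of image arrays that `cv2.imwrite` can save; the helper names `segment_output_paths` and `save_segments` and the `<name>_<index>.png` naming scheme are hypothetical, chosen to mirror the folder-per-document layout described above.

```python
from pathlib import Path


def segment_output_paths(input_path, n_segments, out_root="."):
    """Return one PNG path per segment inside a folder named after the input file."""
    stem = Path(input_path).stem
    out_dir = Path(out_root) / stem
    return [out_dir / f"{stem}_{i}.png" for i in range(n_segments)]


def save_segments(input_path, out_root="."):
    """Segment a PDF or image file and write each segment as a PNG file."""
    # Imported lazily so the path helper above works without these packages.
    import cv2
    from decimer_segmentation import segment_chemical_structures_from_file

    segments = segment_chemical_structures_from_file(str(input_path), expand=True)
    paths = segment_output_paths(input_path, len(segments), out_root)
    if paths:
        paths[0].parent.mkdir(parents=True, exist_ok=True)
    for path, segment in zip(paths, segments):
        # Assumption: each segment is an array that OpenCV can write directly;
        # the channel order may need converting for colour images.
        cv2.imwrite(str(path), segment)
    return paths
```

Calling `save_segments("article.pdf", "out")` would then place the images in `out/article/`.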

Notes for Windows users:

  • Run the segmentation script (segment_structures_in_document.py) in the Anaconda PowerShell Prompt.

  • If the PDF conversion fails on Windows, download poppler and extract the archive.

  • The method segment_chemical_structures_from_file() takes a 'poppler_path' argument where the user can specify the path of their poppler installation ('PATH/TO/POPPLER/bin').
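A small helper can pick the right folder before calling the method. This is a sketch under stated assumptions: the helper names `find_poppler_bin` and `segment_pdf` are hypothetical, and the check relies on the fact that a poppler `bin` folder contains the `pdftoppm` executable (`pdftoppm.exe` on Windows).

```python
from pathlib import Path


def find_poppler_bin(candidates):
    """Return the first candidate folder that contains pdftoppm, else None."""
    for candidate in candidates:
        folder = Path(candidate)
        if (folder / "pdftoppm.exe").is_file() or (folder / "pdftoppm").is_file():
            return str(folder)
    return None


def segment_pdf(pdf_path, candidates):
    """Segment a PDF, passing along the first usable poppler folder."""
    # Imported lazily so the helper above works without DECIMER installed.
    from decimer_segmentation import segment_chemical_structures_from_file

    # If no candidate matches, None is passed, as in the usage example.
    poppler_path = find_poppler_bin(candidates)
    return segment_chemical_structures_from_file(
        pdf_path, expand=True, poppler_path=poppler_path
    )
```

The candidate list holds whichever folders poppler might have been extracted to on the local machine.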

Authors

decimer.ai

Citation

Rajan, K., Brinkhaus, H.O., Sorokina, M. et al. DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature. J Cheminform 13, 20 (2021). https://doi.org/10.1186/s13321-021-00496-1
