
A curator for chemistry-related pdf files

Project description

pdf2chem



Installation

$ pip install pdf2chem
$ cde data download
In Jupyter or Colab:

!pip install pdf2chem
!cde data download
import pdf2chem as p2c

Features

  • This version curates a folder of chemistry-related pdf files: known chemicals mentioned in each file are extracted to csv files that list each name as written in the pdf alongside the SMILES string for that chemical. Other outputs (e.g., InChI or other known names for the chemical) are possible and may be incorporated into future versions.

  • The package should automatically detect local vs. hosted runtimes and choose the compatible pdf extraction method in textract.
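
As an illustration of how a local vs. hosted check can work, here is a minimal sketch. It is not taken from the pdf2chem source: the function names and the "hosted"/"local" labels are hypothetical, and it only covers Colab, relying on the common idiom that Colab loads the google.colab module at startup.

```python
import sys


def is_colab_runtime(modules=None):
    """Return True when the google.colab module has been loaded.

    `modules` defaults to sys.modules; passing a dict makes the check testable.
    """
    loaded = sys.modules if modules is None else modules
    return "google.colab" in loaded


def choose_extraction_method(modules=None):
    """Pick a label for the extraction path (labels are hypothetical)."""
    return "hosted" if is_colab_runtime(modules) else "local"


print(choose_extraction_method())
```

Checking sys.modules rather than importing google.colab avoids an ImportError on machines where the module is not installed.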

Dependencies

  • The package directly uses the third-party packages cirpy, ChemDataExtractor, and pandas, plus the Python 3 standard-library modules os, re, time, datetime, and sys. The third-party packages in turn have many dependencies of their own.
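
If an import fails after installation, a quick check like the following can show which third-party module is missing. This is a sketch, not part of pdf2chem; the helper name missing_modules is hypothetical, and the module names are the import names of the three packages listed above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Import names for cirpy, ChemDataExtractor, and pandas
print(missing_modules(["cirpy", "chemdataextractor", "pandas"]))
```

An empty list means all three can be located; it does not guarantee that their own dependencies import cleanly.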

Usage

  • Use the self-contained Colab page at https://drive.google.com/file/d/1YYZm-Ew-408q86DjDbQTCUoWUNVl_UN3/view?usp=sharing or

  • Install and import as described above

  • Place pdf files of interest (typically journal articles) in an accessible folder.

  • Execute p2c.curate_folder()

  • If the files are not in the current directory, pass the directory to the function as an argument, e.g. p2c.curate_folder('C:/Users/kfrog/literature')

  • The files are first analyzed locally, and a list of words and phrases suspected to be known chemicals is then sent to NIH's servers to be resolved. The chemicals found and their SMILES strings are written to one csv file per pdf. Once the pdfs have been processed, the data from the per-pdf csv files are combined into a single aggregated csv file covering all the papers in that run.

  • Please note: this program depends on both stable internet access and the uptime and responsiveness of NIH's servers. Those servers are often slower, or down entirely, on weekends, and occasionally during the week as well. We appreciate the team there making the databases as accessible as they do.
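
The resolve-then-aggregate flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pdf2chem implementation: the function names, the csv column names (source, name, smiles), and the retry settings are all hypothetical, and the resolver is passed in as a plain callable so the sketch runs without network access.

```python
import csv
import time
from pathlib import Path


def resolve_with_retry(name, resolver, attempts=3, delay=1.0):
    """Call resolver(name), retrying on errors with a doubling delay.

    Returns the resolver's result, or None if every attempt raises.
    """
    for attempt in range(attempts):
        try:
            return resolver(name)
        except Exception:
            if attempt < attempts - 1:
                time.sleep(delay * (2 ** attempt))
    return None


def write_paper_csv(pdf_name, candidates, resolver, out_dir):
    """Write one csv of (name, SMILES) rows for a single paper."""
    out_path = Path(out_dir) / (Path(pdf_name).stem + ".csv")
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["name", "smiles"])
        for name in candidates:
            smiles = resolve_with_retry(name, resolver)
            if smiles:  # skip phrases the resolver does not recognize
                writer.writerow([name, smiles])
    return out_path


def combine_csvs(paper_csvs, combined_path):
    """Merge per-paper csv files into one file, adding a source column."""
    with open(combined_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["source", "name", "smiles"])
        for path in paper_csvs:
            with open(path, newline="", encoding="utf-8") as fh:
                for row in csv.DictReader(fh):
                    writer.writerow([Path(path).name, row["name"], row["smiles"]])
    return combined_path
```

With CIRpy installed, a resolver such as `lambda name: cirpy.resolve(name, "smiles")` could be passed in. The retry wrapper is one way to ride out brief server outages; it will not help when the servers are down for an extended period.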

Documentation

The official documentation is hosted on Read the Docs: https://pdf2chem.readthedocs.io/en/latest/

Contributors

We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab.

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage template.

This package makes heavy use of ChemDataExtractor and CIRpy, packages developed by Swain and Cole and released under the MIT license. Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904. DOI: 10.1021/acs.jcim.6b00207
