A curator for chemistry-related pdf files
pdf2chem
Installation
$ pip install pdf2chem
$ cde data download
In Jupyter or Colab:
!pip install pdf2chem
!cde data download
import pdf2chem as p2c
Features
- This version allows the user to curate a folder of chemistry-related pdf files, extracting known chemicals mentioned in the files to csv files with the names as written in the pdf and the SMILES string for each chemical. Other outputs (e.g., InChI or other known names for the chemical) are possible and may be incorporated into future versions.
- The package should automatically detect local vs. hosted runtimes and choose the compatible pdf extraction method in textract.
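As an illustration of the runtime detection mentioned above, here is a minimal sketch of one common way to tell a Colab session from a local one. It is not taken from pdf2chem's source: the function name `detect_runtime` and the return values "colab" and "local" are hypothetical, and the package may use a different check.

```python
import sys


def detect_runtime(modules=None):
    """Return "colab" when the google.colab module is loaded, else "local".

    Hypothetical helper for illustration only; pdf2chem's own check may differ.
    `modules` defaults to sys.modules and can be replaced for testing.
    """
    loaded = sys.modules if modules is None else modules
    # Colab notebooks import the google.colab package into the session at startup.
    return "colab" if "google.colab" in loaded else "local"


if __name__ == "__main__":
    print(detect_runtime())
```

A caller could branch on the returned string to pick a pdf extraction method that works in that environment.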
Dependencies
- The package directly uses cirpy, ChemDataExtractor, and pandas, along with the Python 3 standard-library modules os, re, time, datetime, and sys. Several of the third-party packages have a number of dependencies of their own.
Usage
- Use the self-contained Colab page at https://drive.google.com/file/d/1YYZm-Ew-408q86DjDbQTCUoWUNVl_UN3/view?usp=sharing or follow the steps below.
- Install and import the package as described above.
- Place the pdf files of interest (typically journal articles) in an accessible folder.
- Execute p2c.curate_folder()
- If the files are not in the current directory, pass the directory to the function as an argument, e.g. p2c.curate_folder('C:/Users/kfrog/literature')
- The files will then be analyzed internally before a list of words and phrases suspected to be known chemicals is sent to NIH's servers to be resolved. Chemicals found and their SMILES strings will be aggregated in a csv file for each pdf. Once the pdfs have been processed, the data from the individual csv files will be combined into an aggregated csv file covering all the papers in that run.
- Please note: this program depends on both stable internet access and the uptime and responsiveness of NIH's servers. The latter are often slower or down entirely on weekends, and occasionally during the week as well. We appreciate the team there making the databases as accessible as they do.
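The resolve-and-aggregate steps above can be sketched as follows. This is a rough illustration under stated assumptions, not pdf2chem's actual implementation: the helper names `resolve_with_retry` and `combine_csvs`, the retry settings, and the column names `name`, `smiles`, and `source` are all hypothetical. The resolver is passed in as a callable so the sketch runs without network access.

```python
import csv
import time
from pathlib import Path


def resolve_with_retry(name, resolver, attempts=3, delay=2.0):
    """Return a SMILES string for `name`, or None if it cannot be resolved.

    `resolver` is any callable that takes a chemical name and returns a
    SMILES string or None. Hypothetical helper, not part of pdf2chem.
    """
    for attempt in range(attempts):
        try:
            return resolver(name)
        except Exception:
            # Network or server error: pause, then try again if attempts remain.
            if attempt < attempts - 1:
                time.sleep(delay)
    return None


def combine_csvs(csv_paths, combined_path):
    """Concatenate per-pdf csv files into one file and return the row count.

    Assumes each input has `name` and `smiles` columns (an assumption about
    the layout); a `source` column records which file each row came from.
    """
    rows = []
    for path in map(Path, csv_paths):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source"] = path.name
                rows.append(row)
    with open(combined_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=["source", "name", "smiles"],
            extrasaction="ignore",  # drop any columns not listed above
            restval="",  # fill missing columns with an empty string
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With CIRpy installed, a resolver such as `lambda n: cirpy.resolve(n, "smiles")` could be passed to `resolve_with_retry`, so that a slow or briefly unavailable server is retried a few times before the name is skipped.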
Documentation
The official documentation is hosted on Read the Docs: https://pdf2chem.readthedocs.io/en/latest/
Contributors
We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab.
Credits
This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.
This package makes heavy use of ChemDataExtractor and CIRpy, packages developed by Swain and Cole and released under the MIT license. Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904. DOI: 10.1021/acs.jcim.6b00207