Parsing HTML chemistry papers from certain publishers into plain text

Chemistry Paper Parser

Convert HTML/XML chemistry and materials science articles into plain text.

1. Install

Requirements

The current version of Chemistry Paper Parser is built for Python >= 3.9. Please check requirements.txt for other dependencies.

Install package

Chemistry Paper Parser is hosted on PyPI. You can install it with

pip install ChemistryPaperParser

Once installed, you can import the package as chempp in Python:

from chempp import parse_html, parse_xml

html_article, _ = parse_html(path_to_my_local_html)
xml_article, _ = parse_xml(path_to_my_local_xml)
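When a folder mixes both file types, the parser can be picked from the file extension. The sketch below is only an illustration: `parser_name_for` and `parse_article` are hypothetical helpers that are not part of chempp, and it assumes, as in the snippet above, that both parsers accept a local file path and return a tuple whose first element is the article.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the chempp function name.
_PARSERS = {".html": "parse_html", ".xml": "parse_xml"}


def parser_name_for(path):
    """Return the name of the chempp parser matching the extension of `path`."""
    suffix = Path(path).suffix.lower()
    if suffix not in _PARSERS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return _PARSERS[suffix]


def parse_article(path):
    """Parse one local file with the parser chosen from its extension."""
    import chempp  # imported lazily so parser_name_for works without chempp

    parser = getattr(chempp, parser_name_for(path))
    article, _ = parser(str(path))  # second return value is ignored, as above
    return article
```

Keeping the extension check separate from the call into chempp means unsupported files are rejected before any parsing starts.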

Supported publishers:

Currently, Chemistry Paper Parser supports the following publishers and file types.

Publisher Supports HTML Supports XML
RSC
Springer
Nature
Wiley
AIP
ACS
Elsevier
AAAS (Science)

In addition, table parsing is not supported for all publishers.

For figures, only captions will be parsed and saved in the current version.

2. Example

The open-access ACS article Toland et al. (2023) is used here as an example to demonstrate the article parsing process. The offline file is provided at ./examples/Toland.et.al.2023.html. For online articles, you can either download the HTML file manually and load it as demonstrated below, or use the provided chempp.crawler.load_online_html function (requires external dependencies).

To parse the example article, you can try the following example in your shell.

PYTHONPATH="." python ./examples/process_articles.py --input_dir ./examples/ --output_dir ./output/ --output_format pt

The --input_dir argument can be either a file path or a directory. If it is a directory, the program tries to read and parse all HTML and XML files in that folder. --output_format defines the format of the parsed output: pt retains all structural information in the Article class; jsonl saves a Doccano-compatible JSONL file for easy annotation; html saves a simplified HTML file that displays the annotated sentences and tokens, which is also a good way to inspect the quality of the parsed article.
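The file-or-directory behavior of --input_dir can be sketched roughly as follows. This is not the code from ./examples/process_articles.py: `collect_inputs` and `SUPPORTED_SUFFIXES` are hypothetical names, and the sketch assumes that only files directly inside the directory are considered (whether the actual script descends into subfolders is not stated here).

```python
from pathlib import Path

# Assumed set of extensions; the script is described as reading html and xml files.
SUPPORTED_SUFFIXES = {".html", ".xml"}


def collect_inputs(input_path):
    """Return the HTML/XML files named by a file path or found in a directory."""
    path = Path(input_path)
    if path.is_file():
        # A single file is kept only if its extension is supported.
        return [path] if path.suffix.lower() in SUPPORTED_SUFFIXES else []
    if path.is_dir():
        # Non-recursive scan, sorted so the processing order is deterministic.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    raise FileNotFoundError(f"no such file or directory: {input_path}")
```

Each returned path could then be passed to parse_html or parse_xml according to its extension.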

Note that ./examples/process_articles.py is only a partial demonstration of the chempp APIs and their usage. The notebook ./examples/example.ipynb demonstrates the structure of the parsed Article object and some possible use cases. You can find more details regarding Chemistry Paper Parser and its applications in my blog. I'll provide a more comprehensive API introduction in the future if needed.

3. Known issues

Because HTML/XML documents vary widely, not all documents can be parsed successfully. Reporting failed cases in the Issue section would help us improve the package.

  • HTML highlighting may sometimes fail when multiple entities start at the same position, due to incorrect text span alignment.
  • May fail to extract sections from Elsevier articles when section ids are s[\d]+ instead of sec[\d]+, as mentioned in this issue.
  • Fails to extract abstracts from RSC articles due to an updated HTML format, as mentioned in this issue.
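To make the Elsevier section id issue concrete, the snippet below shows one pattern that accepts both id styles. It is a sketch of the idea only, not the fix used in chempp: `SECTION_ID` and `is_section_id` are hypothetical names, and it assumes ids consist solely of the prefix followed by digits.

```python
import re

# Matches "sec" or "s" followed by one or more digits, and nothing else.
# Illustrative pattern only; it is not taken from the chempp source.
SECTION_ID = re.compile(r"s(?:ec)?\d+")


def is_section_id(element_id):
    """Return True if the id looks like sec[\\d]+ or s[\\d]+."""
    return SECTION_ID.fullmatch(element_id) is not None
```

Using fullmatch keeps ids such as "section1" or "abs0010" from being mistaken for section ids.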

Citation

Please consider citing the following article if you find our package useful. Although the article does not mention it, Chemistry Paper Parser is part of that project.

@article{toland.2023.accelerated.scheme,
  author = {Toland, Aubrey and Tran, Huan and Chen, Lihua and Li, Yinghao and Zhang, Chao and Gutekunst, Will and Ramprasad, Rampi},
  title = {Accelerated Scheme to Predict Ring-Opening Polymerization Enthalpy: Simulation-Experimental Data Fusion and Multitask Machine Learning},
  journal = {The Journal of Physical Chemistry A},
  volume = {127},
  number = {50},
  pages = {10709-10716},
  year = {2023},
  doi = {10.1021/acs.jpca.3c05870},
  note ={PMID: 38055927},
  URL = {https://doi.org/10.1021/acs.jpca.3c05870},
  eprint = {https://doi.org/10.1021/acs.jpca.3c05870}
}


Built Distribution

ChemistryPaperParser-0.1.1-py3-none-any.whl (28.0 kB)
