Chemistry Article Parser

Convert HTML/XML chemistry and materials science articles into plain text.
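
To make the idea of "HTML to plain text" concrete, here is a minimal sketch built only on the Python standard library. It is not this package's implementation: the class `TextExtractor`, the function `html_to_text`, and the choice of block tags are hypothetical names and assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags start a new line of output; the real parser may differ.
BLOCK_TAGS = {"p", "div", "li", "title", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, one output line per block-level element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._current = []     # text fragments of the block being read
        self._lines = []       # finished output lines

    def _flush(self):
        # Join fragments, collapse whitespace, and keep non-empty lines only.
        line = " ".join("".join(self._current).split())
        self._current = []
        if line:
            self._lines.append(line)

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)

    def text(self):
        self._flush()
        return "\n".join(self._lines)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><h1>Abstract</h1>"
    "<p>TiO<sub>2</sub> films were annealed.</p>"
    "<script>x=1</script></body></html>"
)
print(html_to_text(sample))
```

Inline tags such as sub are not treated as block boundaries, so a formula like TiO2 stays on one line. Publisher-specific layouts need far more handling than this sketch provides.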

Requirements

See requirements.txt.

The code is tested against the package versions specified in requirements.txt. Other versions are not fully tested but may also work.

Submodule

To fetch the submodule files, run:

git submodule update --init

Supported publishers:

RSC (HTML)
Springer (HTML)
Nature (HTML)
Wiley (HTML)
AIP (HTML)
ACS (HTML & XML)
Elsevier (HTML & XML)
AAAS (Science) (HTML)
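
The list above can also be written as a lookup table, which may be handy for skipping unsupported files before parsing. This is only a sketch: the names `SUPPORTED_FORMATS` and `is_supported` are hypothetical and are not part of the package.

```python
# Publisher -> supported input formats, transcribed from the list above.
SUPPORTED_FORMATS = {
    "RSC": {"html"},
    "Springer": {"html"},
    "Nature": {"html"},
    "Wiley": {"html"},
    "AIP": {"html"},
    "ACS": {"html", "xml"},
    "Elsevier": {"html", "xml"},
    "AAAS": {"html"},
}


def is_supported(publisher, file_format):
    """Return True if the publisher/format pair appears in the list above."""
    return file_format.lower() in SUPPORTED_FORMATS.get(publisher, set())


print(is_supported("Elsevier", "xml"))
print(is_supported("Wiley", "xml"))
```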

Table parsing is supported for some publishers but not all. For figures, the current version parses only the captions.

Example

Fork this repo and clone it to your local machine.

To parse HTML files, run the following command:

python tests/parse_articles.py --input_dir </path/to/html/files> --parse_html

Alternatively, pass a configuration file:

cd tests
python parse_articles.py config.json

where the parameters are stored in the file config.json.

Add --parse_xml to the argument list to enable XML parsing.
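
As a rough illustration of how the --input_dir, --parse_html, and --parse_xml flags could combine, here is a sketch using argparse. It is an assumption about the behavior, not the code in tests/parse_articles.py: the functions `build_parser` and `select_files` are hypothetical, and matching files by the .html and .xml extensions is a guess.

```python
import argparse
from pathlib import Path


def build_parser():
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Sketch of the command-line flags.")
    parser.add_argument("--input_dir", type=Path, required=True,
                        help="Directory containing the article files.")
    parser.add_argument("--parse_html", action="store_true",
                        help="Include .html files.")
    parser.add_argument("--parse_xml", action="store_true",
                        help="Include .xml files.")
    return parser


def select_files(input_dir, parse_html, parse_xml):
    """Return the files in input_dir whose extension matches an enabled flag."""
    suffixes = set()
    if parse_html:
        suffixes.add(".html")
    if parse_xml:
        suffixes.add(".xml")
    return sorted(
        path for path in Path(input_dir).iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--input_dir", ".", "--parse_html"])
print(args.parse_html, args.parse_xml)
```

With both flags set, `select_files` returns HTML and XML files together; with neither, it returns an empty list.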

Issues

Because HTML/XML documents vary widely, not all documents can be parsed successfully. To help us improve the parser, please report failed cases in the Issues section.

Release history

0.1.1 (Jun 15, 2024)
0.1.0 (Jun 15, 2024)
0.0.1 (May 29, 2022), this version

Download files

Source Distribution

ChemistryPaperParser-0.0.1.tar.gz (28.1 kB)

Uploaded May 29, 2022 Source

Built Distribution

ChemistryPaperParser-0.0.1-py3-none-any.whl (30.5 kB)

Uploaded May 29, 2022 Python 3

Hashes for ChemistryPaperParser-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`c95867c5c518855b8652a0ab2e05e874c2f3ac7a9ae58ddf226a06e9db8326d5`
MD5	`f2d89063574f498bd00ea8501ed6846a`
BLAKE2b-256	`fd7763c4e56ceb94e499a99497fb6c64d8087509142e9d8b221999f18c7c95ff`

Hashes for ChemistryPaperParser-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d1a0cb31dbd9486c589768f31fbd165bde21130aa151b9819a4384105b2b6378`
MD5	`204526286a4ba1d10157a8c8f28a6084`
BLAKE2b-256	`d11e325aa9362a055db1e86608f9c4e72d06d85122d1b8d8cc841914f013de3e`
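
A downloaded file can be checked against the SHA256 digests listed above with the standard hashlib module. The sketch below assumes the source archive sits in the current directory; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare the file's digest with a published one, ignoring letter case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest of the source distribution, copied from the table above.
EXPECTED_SDIST_SHA256 = "c95867c5c518855b8652a0ab2e05e874c2f3ac7a9ae58ddf226a06e9db8326d5"

# Assumed download location; the check runs only if the file is present.
archive = Path("ChemistryPaperParser-0.0.1.tar.gz")
if archive.exists():
    print(matches_published_digest(archive, EXPECTED_SDIST_SHA256))
```

A result of False means the file differs from the one that was published and should not be installed.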