Parsing HTML chemistry papers from certain publishers into plain text
Chemistry Article Parser
Convert HTML/XML Chemistry/Material Science articles into plain text.
Requirement
See requirements.txt.
Packages with versions specified in requirements.txt are used to test the code.
Other versions are not fully tested but may also work.
Submodule
To get the submodule files, use
git submodule update --init
Supported publishers:
- RSC (HTML)
- Springer (HTML)
- Nature (HTML)
- Wiley (HTML)
- AIP (HTML)
- ACS (HTML & XML)
- Elsevier (HTML & XML)
- AAAS (Science) (HTML)
Table parsing is supported for some, but not all, publishers. For figures, only the captions are parsed in the current version.
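To make the idea of "HTML to plain text" concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It is not this package's implementation: the names `TextExtractor` and `html_to_text` are hypothetical, and the sketch ignores the publisher-specific handling that the real parser would need. It only shows the general approach of keeping visible text (including figure captions) and dropping scripts and styles.

```python
from html.parser import HTMLParser

# Tags after which a line break is emitted (an illustrative, incomplete set).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "tr", "figcaption", "br", "title"}
# Tags whose content is never visible text.
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        # Inline tags such as <sub> do not break the line, so "H<sub>2</sub>O"
        # is kept together as "H2O".
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of `html`, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Real publisher pages need much more than this (section detection, reference stripping, table handling), which is why support differs between publishers.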
Example
Fork this repo and clone it to your local machine.
To parse HTML files, run the following command:
python tests/parse_articles.py --input_dir </path/to/html/files> --parse_html
or
cd tests
python parse_articles.py config.json
where the parameters are stored in the file config.json.
Add --parse_xml to the argument list to enable XML parsing.
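If you want to drive the script from Python, for example to process several directories in a loop, the command line can be assembled programmatically. The sketch below is an assumption about how one might do that: `build_parse_command` is a hypothetical helper, and it uses only the `--input_dir`, `--parse_html`, and `--parse_xml` flags shown above.

```python
import sys


def build_parse_command(input_dir, parse_html=True, parse_xml=False,
                        script="tests/parse_articles.py"):
    """Hypothetical helper: build the argument list for parse_articles.py.

    Only the flags documented in this README are used.
    """
    command = [sys.executable, script, "--input_dir", str(input_dir)]
    if parse_html:
        command.append("--parse_html")
    if parse_xml:
        command.append("--parse_xml")
    return command


# To actually run the parser from the repository root, pass the list to
# subprocess, e.g.:
#     import subprocess
#     subprocess.run(build_parse_command("/path/to/html/files"), check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when the input path contains spaces.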
Issues
Due to the variety of HTML/XML documents, not all documents can be parsed successfully. It would help us improve the parser if you report failed cases in the Issues section.
Hashes for ChemistryPaperParser-0.0.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | c95867c5c518855b8652a0ab2e05e874c2f3ac7a9ae58ddf226a06e9db8326d5
MD5 | f2d89063574f498bd00ea8501ed6846a
BLAKE2b-256 | fd7763c4e56ceb94e499a99497fb6c64d8087509142e9d8b221999f18c7c95ff
Hashes for ChemistryPaperParser-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d1a0cb31dbd9486c589768f31fbd165bde21130aa151b9819a4384105b2b6378
MD5 | 204526286a4ba1d10157a8c8f28a6084
BLAKE2b-256 | d11e325aa9362a055db1e86608f9c4e72d06d85122d1b8d8cc841914f013de3e
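A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. This is a minimal sketch, and the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, not part of this package.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling `matches_published_hash` with the path to the downloaded ChemistryPaperParser-0.0.1.tar.gz and the SHA256 value from the first table should return True for an intact download.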