SciPDF Parser
A Python parser for scientific PDFs, based on GROBID.
Installation
Use pip to install directly from the GitHub repository:
pip install git+https://github.com/titipata/scipdf_parser
Note
- We also need the `en_core_web_sm` model for spaCy, which you can download by running `python -m spacy download en_core_web_sm`
- You can change the GROBID version in `serve_grobid.sh` to test the parser on a new GROBID version
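As a quick check for the first note, the sketch below tests whether the model is already available. It relies on the fact that spaCy models downloaded this way are installed as regular Python packages; the helper name `is_package_installed` is hypothetical and not part of scipdf or spaCy.

```python
import importlib.util


def is_package_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


MODEL = "en_core_web_sm"  # model name taken from the note above

if not is_package_installed(MODEL):
    # Only a reminder is printed; nothing is downloaded automatically.
    print(f"spaCy model not found; run: python -m spacy download {MODEL}")
```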
Usage
Start the GROBID service with the provided bash script before parsing any PDF:
bash serve_grobid.sh
This script downloads GROBID and runs the service on the default port, 8070.
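Before parsing, it can help to confirm that the service is actually reachable. The following is a minimal sketch that queries GROBID's `/api/isalive` health endpoint; the base URL assumes the default port mentioned above, and the helper names `isalive_url` and `grobid_is_alive` are hypothetical, not part of scipdf.

```python
import http.client
import urllib.error
import urllib.request


def isalive_url(base_url: str = "http://localhost:8070") -> str:
    """Build the health-check URL for a GROBID service at `base_url`."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the service answers the health check, False otherwise."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            # A healthy service replies with HTTP 200 and the body "true".
            return resp.status == 200 and resp.read().strip().lower() == b"true"
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: treat as not running.
        return False
```

Calling `grobid_is_alive()` right after `bash serve_grobid.sh` may return False for a while, since the script has to download and start GROBID first.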
To parse a PDF from the example_data folder or directly from a URL, use the following function:
import scipdf
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf') # return dictionary
# you can also parse directly from a URL to a PDF; if as_list is set to True, the 'text' of each parsed section is a list of paragraphs instead
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)
# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}
xml = scipdf.parse_pdf('example_data/futoma2017improved.pdf', soup=True) # option to parse full XML from GROBID
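Once you have the dictionary, a common next step is to walk through its sections. The sketch below builds a simple outline (heading plus word count) and handles both output shapes described above: 'text' as a single string, or as a list of paragraphs when as_list=True. The helpers `section_text` and `outline` are hypothetical, and `sample` is made-up placeholder data that only mirrors the keys shown in the output example.

```python
def section_text(section: dict) -> str:
    """Return a section's text as one string, whether it is a str or a list of paragraphs."""
    text = section.get("text", "")
    if isinstance(text, list):
        return "\n".join(text)
    return text


def outline(article_dict: dict) -> list:
    """Return (heading, word_count) pairs for every parsed section."""
    return [
        (section.get("heading", ""), len(section_text(section).split()))
        for section in article_dict.get("sections", [])
    ]


# Placeholder data for illustration only; real values come from parse_pdf_to_dict.
sample = {
    "title": "Example title",
    "sections": [
        {"heading": "Introduction", "text": "Sepsis is common."},
        {"heading": "Methods", "text": ["We fit a model.", "Then we evaluate."]},
    ],
}

for heading, n_words in outline(sample):
    print(f"{heading}: {n_words} words")
```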
To extract figures from PDFs using pdffigures2, run:
scipdf.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files
You can see example output figures in the figures folder.
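Because the input folder should contain only PDF files, it may be worth checking it before running figure extraction. This is a minimal sketch of such a check; `non_pdf_files` is a hypothetical helper, not part of scipdf.

```python
from pathlib import Path


def non_pdf_files(folder) -> list:
    """Return the sorted names of files in `folder` that do not end in .pdf."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() != ".pdf"
    )
```

One way to use it is to call `scipdf.parse_figures` only when `non_pdf_files('example_data')` returns an empty list, and otherwise move the listed files elsewhere first.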
Download files
File details
Details for the file scipdf_parser-0.52.tar.gz.
File metadata
- Download URL: scipdf_parser-0.52.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 750df1af1adf9393d0844fa1292a2689e0e2c05e14285389212b16181d406716 |
| MD5 | 6f631e9c4b81fe4bf371a9b9152486b6 |
| BLAKE2b-256 | 741f7f7371f54696b30f2e436934cafd8bb3b294ef874558894207dc022c57ed |
File details
Details for the file scipdf_parser-0.52-py3-none-any.whl.
File metadata
- Download URL: scipdf_parser-0.52-py3-none-any.whl
- Upload date:
- Size: 10.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 63a7fa588ac039bd913fb0ea175f533535ad70f89b602ba726caac64e81c423a |
| MD5 | 35ad7ab74db7918071bb4d675d3331e6 |
| BLAKE2b-256 | d29c755427ef9f58815d19cf060e9f0d85a4851267dbdf39189097bf8b47892c |
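If you download one of these files manually, you can compare it against the SHA256 digest listed for it. The sketch below uses only the standard library; the helper names and the local file path are assumptions, while the digests are copied from the file details above.

```python
import hashlib

# SHA256 digests as published in the file details above.
PUBLISHED_SHA256 = {
    "scipdf_parser-0.52.tar.gz": "750df1af1adf9393d0844fa1292a2689e0e2c05e14285389212b16181d406716",
    "scipdf_parser-0.52-py3-none-any.whl": "63a7fa588ac039bd913fb0ea175f533535ad70f89b602ba726caac64e81c423a",
}


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals `expected_hex`."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("scipdf_parser-0.52.tar.gz", PUBLISHED_SHA256["scipdf_parser-0.52.tar.gz"])` should be True for an intact download saved under that name in the current directory.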