SciPDF Parser

A Python parser for scientific PDFs, based on GROBID.

Installation

Use pip to install from the GitHub repository:

pip install git+https://github.com/titipata/scipdf_parser

Note

  • The parser also needs the en_core_web_sm model for spaCy; run python -m spacy download en_core_web_sm to download it
  • You can change the GROBID version in serve_grobid.sh to test the parser against a newer GROBID version
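
As a quick way to confirm the first note, the sketch below checks whether spaCy and the en_core_web_sm model can be found in the current environment. It assumes the model is installed as a regular Python package, which is how python -m spacy download normally installs it; missing_packages is a hypothetical helper, not part of scipdf.

```python
import importlib.util


def missing_packages(names):
    """Return the entries of `names` that cannot be located as importable packages."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical check: an empty list means both packages were found
print(missing_packages(["spacy", "en_core_web_sm"]))
```

If en_core_web_sm appears in the printed list, run the download command from the note above.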

Usage

Run GROBID using the provided bash script before parsing a PDF:

bash serve_grobid.sh
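
Parsing depends on the GROBID service being reachable, so it can help to check that first. The following is a minimal sketch, assuming the service listens on localhost at the default port 8070 and exposes GROBID's /api/isalive health endpoint; grobid_is_alive is a hypothetical helper, not part of scipdf.

```python
import http.client
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=2.0):
    """Return True if a GROBID service answers the health check at base_url."""
    # Assumption: the health endpoint lives at /api/isalive and replies "true"
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and body.strip().lower() == "true"
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, bad response, or malformed URL
        return False
```

If grobid_is_alive() returns False, start or restart the service with bash serve_grobid.sh.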

This script downloads GROBID and runs the service on the default port, 8070. To parse a PDF from the example_data folder or from a direct URL, use the following function:

import scipdf
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf') # returns a dictionary

# Parse directly from a URL to a PDF. If as_list is set to True, the 'text' of each
# parsed section is returned as a list of paragraphs instead of a single string.
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)

# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}

xml = scipdf.parse_pdf('example_data/futoma2017improved.pdf', soup=True) # option to parse full XML from GROBID
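
To illustrate working with the returned dictionary, here is a minimal sketch that builds an outline of section headings with word counts. It relies only on the 'sections', 'heading', and 'text' keys shown in the output example; section_outline is a hypothetical helper, and sample is a stand-in dictionary rather than real parser output.

```python
def section_outline(article_dict):
    """Return (heading, word_count) pairs for each parsed section."""
    outline = []
    for section in article_dict.get("sections", []):
        text = section.get("text", "")
        # With as_list=True, 'text' is a list of paragraphs; join it before counting
        if isinstance(text, list):
            text = " ".join(text)
        outline.append((section.get("heading", ""), len(text.split())))
    return outline


# Stand-in data with the same keys as the output example (not real parser output)
sample = {
    "title": "...",
    "sections": [
        {"heading": "Introduction", "text": "First paragraph of text."},
        {"heading": "Methods", "text": ["Paragraph one.", "Paragraph two."]},
    ],
}

for heading, words in section_outline(sample):
    print(f"{heading}: {words} words")
```

In practice, pass the dictionary returned by scipdf.parse_pdf_to_dict in place of sample. The isinstance check lets the same helper handle output produced with either as_list setting.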

To extract figures from PDFs using pdffigures2, run:

scipdf.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files

Example output figures are in the figures folder.
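
Since the input folder for parse_figures should contain only PDF files, a small check before calling it may save a failed run. This is a sketch under that assumption; non_pdf_files is a hypothetical helper, not part of scipdf.

```python
from pathlib import Path


def non_pdf_files(folder):
    """Return sorted names of files in `folder` whose extension is not .pdf."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        # Compare case-insensitively so that names like PAPER.PDF are accepted
        if path.is_file() and path.suffix.lower() != ".pdf"
    )
```

For example, call non_pdf_files('example_data') and move or delete any files it lists before running scipdf.parse_figures on that folder. Hidden files such as .DS_Store are reported as well, since they have no .pdf extension.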
