SciPDF Parser

A Python parser for scientific PDFs, based on GROBID.

Installation

Use pip to install from the GitHub repository:

pip install git+https://github.com/titipata/scipdf_parser

Note

  • The parser also needs the en_core_web_sm model for spaCy; run python -m spacy download en_core_web_sm to download it
  • You can change the GROBID version in serve_grobid.sh to test the parser against a newer GROBID version
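As a quick check for the first note, the sketch below tests whether the spaCy model is importable before any parsing is attempted. It relies on the fact that spaCy models installed via spacy download are regular Python packages; the helper name spacy_model_installed is hypothetical and not part of scipdf.

```python
import importlib.util


def spacy_model_installed(model_name="en_core_web_sm"):
    """Return True if a package with this name can be found on the import path.

    spaCy models installed with `python -m spacy download <name>` are
    ordinary Python packages, so looking up the import spec is enough
    for a presence check (it does not load the model).
    """
    return importlib.util.find_spec(model_name) is not None


def install_hint(model_name="en_core_web_sm"):
    """Return the download command from the note above for a missing model."""
    return f"python -m spacy download {model_name}"
```

If spacy_model_installed() returns False, running the command returned by install_hint() should install the model.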

Usage

Run GROBID using the provided bash script before parsing a PDF:

bash serve_grobid.sh

This script downloads GROBID and runs the service on the default port 8070 (see the GROBID documentation for more). To parse a PDF from the example_data folder or from a direct URL, use the following function:

import scipdf
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf') # returns a dictionary

# Parse directly from a URL to a PDF. If as_list is set to True, the 'text'
# of each parsed section is returned as a list of paragraphs instead.
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)

# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}

xml = scipdf.parse_pdf('example_data/futoma2017improved.pdf', soup=True) # option to parse full XML from GROBID
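Parsing depends on a running GROBID service, and the returned dictionary is easiest to inspect with a small summary. The following is a minimal sketch under stated assumptions: GROBID is assumed to be at http://localhost:8070 and to expose its /api/isalive health endpoint, and the helper names grobid_is_alive and summarize_article are hypothetical (they are not part of scipdf). The summary only uses the keys shown in the output example above.

```python
import urllib.request

# Assumption: the service started by serve_grobid.sh listens here.
GROBID_URL = "http://localhost:8070"


def grobid_is_alive(base_url=GROBID_URL, timeout=3.0):
    """Return True if the GROBID health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False


def summarize_article(article_dict):
    """Collect a few counts from a dictionary shaped like the output example."""
    sections = article_dict.get("sections", [])
    word_counts = []
    for section in sections:
        text = section.get("text", "")
        # With as_list=True the text is a list of paragraphs; join it first.
        if isinstance(text, list):
            text = " ".join(text)
        word_counts.append(len(text.split()))
    return {
        "title": article_dict.get("title", ""),
        "headings": [section.get("heading", "") for section in sections],
        "section_words": word_counts,
        "n_references": len(article_dict.get("references", [])),
        "n_figures": len(article_dict.get("figures", [])),
    }
```

A typical use would be to call grobid_is_alive() first, then pass the result of scipdf.parse_pdf_to_dict(...) to summarize_article and print the headings and counts.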

To extract figures from PDFs using pdffigures2, run:

scipdf.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files

Example output figures are in the figures folder.
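Because the input folder for parse_figures should contain only PDF files, it may be worth checking the folder first. The sketch below lists any files that are not PDFs; the helper name non_pdf_files is hypothetical, and the check is by file extension only.

```python
from pathlib import Path


def non_pdf_files(folder):
    """Return the sorted names of files in `folder` that lack a .pdf extension.

    Only the top level of the folder is inspected; subdirectories are ignored.
    """
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() != ".pdf"
    )
```

An empty list from non_pdf_files('example_data') suggests the folder is ready to pass to scipdf.parse_figures.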
