SciPDF Parser

A Python parser for scientific PDFs, based on GROBID.

Installation

Use pip to install directly from the GitHub repository:

pip install git+https://github.com/titipata/scipdf_parser

Note

The parser also needs the en_core_web_sm model for spaCy. Run python -m spacy download en_core_web_sm to download it.
To test the parser against a different GROBID version, change the version in serve_grobid.sh.
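
As a quick sanity check before parsing, the sketch below tests whether the spaCy model is importable. It relies on spaCy models being installed as regular Python packages; spacy_model_installed is a hypothetical helper name and is not part of scipdf.

```python
import importlib.util


def spacy_model_installed(model_name="en_core_web_sm"):
    """Return True if a package with this name can be found for import.

    Hypothetical helper (not part of scipdf). It only checks that the
    package is discoverable; it does not load the model.
    """
    return importlib.util.find_spec(model_name) is not None


if not spacy_model_installed():
    print("Missing model. Run: python -m spacy download en_core_web_sm")
```

The check is deliberately cheap: it avoids importing spaCy itself, so it can run at the top of a script without a noticeable delay.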

Usage

Start GROBID with the provided bash script before parsing a PDF:

bash serve_grobid.sh
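
Once the service has had time to start, a reachability check can save a confusing parse failure later. The sketch below is not part of scipdf. It assumes GROBID's /api/isalive health endpoint and the default address http://localhost:8070; grobid_is_alive is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if a GROBID service answers at base_url, else False.

    Assumption: the service exposes GET /api/isalive and replies "true".
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and body.strip().lower() == "true"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as not running.
        return False
```

Calling grobid_is_alive() before parsing lets a program print a clear "start GROBID first" message instead of failing inside the parser.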

This script downloads GROBID and runs the service on the default port 8070. To parse a PDF from the example_data folder or from a direct URL, use the following function:

import scipdf
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf')  # returns a dictionary

# Parse directly from a URL to a PDF. If as_list is True, the 'text' of each
# parsed section is a list of paragraphs rather than a single string.
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)

# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}

xml = scipdf.parse_pdf('example_data/futoma2017improved.pdf', soup=True) # option to parse full XML from GROBID
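
The returned dictionary can be walked with ordinary Python. The sketch below builds a short text summary from the keys shown in the output example; summarize_article is a hypothetical helper, and the sample values are made-up placeholders rather than real parser output.

```python
def summarize_article(article_dict):
    """Return a short plain-text summary of a parsed article dictionary."""
    lines = [f"Title: {article_dict.get('title', '')}"]
    doi = article_dict.get("doi")
    if doi:
        lines.append(f"DOI: {doi}")
    for i, section in enumerate(article_dict.get("sections", []), start=1):
        text = section.get("text", "")
        # With as_list=True, 'text' is a list of paragraphs; join it first.
        if isinstance(text, list):
            text = "\n".join(text)
        lines.append(f"{i}. {section.get('heading', '')} ({len(text)} chars)")
    lines.append(f"References: {len(article_dict.get('references', []))}")
    lines.append(f"Figures: {len(article_dict.get('figures', []))}")
    return "\n".join(lines)


# Placeholder data shaped like the output example above (not real output).
sample = {
    "title": "Example title",
    "abstract": "...",
    "sections": [
        {"heading": "Introduction", "text": "First paragraph."},
        {"heading": "Methods", "text": ["Para one.", "Para two."]},
    ],
    "references": [{"title": "...", "year": "...", "journal": "...", "author": "..."}],
    "figures": [],
    "doi": "",
}
print(summarize_article(sample))
```

Using .get with defaults keeps the helper tolerant of missing keys, which is a cautious choice given that the exact output can vary between PDFs.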

To extract figures from PDFs using pdffigures2, run:

scipdf.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files

You can see example output figures in the figures folder.
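
Because the input folder should contain only PDF files, it may be worth checking it before running the figure extraction. The sketch below is a hypothetical pre-flight helper (non_pdf_files is not part of scipdf) that lists any stray files by extension.

```python
from pathlib import Path


def non_pdf_files(folder):
    """Return sorted names of files in folder whose extension is not .pdf.

    The comparison is case-insensitive, so 'paper.PDF' counts as a PDF.
    Subdirectories are ignored; hidden files such as '.DS_Store' are reported.
    """
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() != ".pdf"
    )
```

A caller could run non_pdf_files('example_data') and move or delete whatever it reports before calling scipdf.parse_figures.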

Release history

0.52 (this version): Nov 12, 2023
0.42: Sep 8, 2023
0.5: Nov 12, 2023
0.4: Sep 8, 2023
0.3: Aug 27, 2023
0.2: Aug 27, 2023
0.1.dev0 (pre-release): Aug 27, 2023

Download files

Source Distribution

scipdf_parser-0.52.tar.gz (10.4 kB)

Uploaded Nov 12, 2023 Source

Built Distribution

scipdf_parser-0.52-py3-none-any.whl (10.7 kB)

Uploaded Nov 12, 2023 Python 3

File details

Details for the file scipdf_parser-0.52.tar.gz.

File metadata

Download URL: scipdf_parser-0.52.tar.gz
Upload date: Nov 12, 2023
Size: 10.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for scipdf_parser-0.52.tar.gz
Algorithm	Hash digest
SHA256	`750df1af1adf9393d0844fa1292a2689e0e2c05e14285389212b16181d406716`
MD5	`6f631e9c4b81fe4bf371a9b9152486b6`
BLAKE2b-256	`741f7f7371f54696b30f2e436934cafd8bb3b294ef874558894207dc022c57ed`

File details

Details for the file scipdf_parser-0.52-py3-none-any.whl.

File metadata

Download URL: scipdf_parser-0.52-py3-none-any.whl
Upload date: Nov 12, 2023
Size: 10.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for scipdf_parser-0.52-py3-none-any.whl
Algorithm	Hash digest
SHA256	`63a7fa588ac039bd913fb0ea175f533535ad70f89b602ba726caac64e81c423a`
MD5	`35ad7ab74db7918071bb4d675d3331e6`
BLAKE2b-256	`d29c755427ef9f58815d19cf060e9f0d85a4851267dbdf39189097bf8b47892c`
