
grobid

Python library for serializing GROBID TEI XML to dataclasses


Installation

Use pip to install:

$ pip install grobid
$ pip install grobid[json] # for JSON serializable dataclass objects

You can also download the .whl file from the release section:

$ pip install *.whl

Usage

Client

To convert an academic PDF to a TEI XML file, we use GROBID's REST services, specifically the processFulltextDocument endpoint.

from pathlib import Path

from grobid.client import Client, GrobidClientError  # used below; adjust the path if your version exposes them elsewhere
from grobid.models.form import Form, File
from grobid.models.response import Response

pdf_file = Path("<your-academic-article>.pdf")
with open(pdf_file, "rb") as file:
    form = Form(
        file=File(
            payload=file.read(),
            file_name=pdf_file.name,
            mime_type="application/pdf",
        )
    )
    c = Client(base_url="<base-url>", form=form)
    try:
        xml_content = c.sync_request().content  # TEI XML file in bytes
    except GrobidClientError as e:
        print(e)

where base-url is the URL of the GROBID REST service.

You can use https://cloud.science-miner.com/grobid/ for testing.

Form

The Form class supports most of the optional parameters of the processFulltextDocument endpoint.
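Under the hood, the File payload reaches GROBID as a multipart/form-data upload. As a rough standard-library sketch of what such a request body looks like (build_multipart is an illustrative helper, not part of this library; "input" is the form-field name the GROBID API expects for the PDF):

```python
import uuid


def build_multipart(file_name: str, payload: bytes, mime_type: str) -> tuple[bytes, str]:
    # Build a minimal multipart/form-data body with a single "input" file
    # field, and return it together with the matching Content-Type header.
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="input"; filename="{file_name}"\r\n'
        f"Content-Type: {mime_type}\r\n"
        "\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


body, content_type = build_multipart("article.pdf", b"%PDF-1.4 ...", "application/pdf")
```

In practice the Client assembles and sends this for you; the sketch only shows why Form needs the payload, file name, and MIME type together.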

Parser

If you want to serialize the XML content, use the Parser class to create dataclass objects.

Not all of the GROBID annotation guidelines are met, but compliance is a goal. See #1.

from grobid.tei import Parser

xml_content: bytes
parser = Parser(xml_content)
article = parser.parse()
article.to_json()  # raises RuntimeError if the 'json' extra is not installed

where xml_content is the same as in the Client section.
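If you just want a quick look at the structure the Parser is working with, TEI XML can also be inspected directly with the standard library. A minimal sketch (the fragment below is hand-written for illustration, not real GROBID output):

```python
import xml.etree.ElementTree as ET

# A minimal TEI fragment of the kind processFulltextDocument returns.
xml_content = b"""<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Article</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""

# All TEI elements live in the TEI namespace, so queries must qualify it.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(xml_content)
title = root.findtext("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", namespaces=ns)
print(title)  # An Example Article
```

The Parser class does this kind of traversal for you and returns typed dataclasses instead of raw elements.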

Alternately, you can load the XML from a file:

from grobid.tei import Parser

with open("<your-academic-article>.xml", "rb") as xml_file:
    xml_content = xml_file.read()
    parser = Parser(xml_content)
    article = parser.parse()
    article.to_json()  # raises RuntimeError if the 'json' extra is not installed

We use mashumaro to serialize the dataclasses into JSON (mashumaro supports other formats; you can submit a PR if you want one of them). By default, mashumaro isn't installed; use pip install grobid[json].
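The idea behind to_json() can be sketched with the standard library alone. Title and Article below are toy stand-ins, not this library's actual models, and dataclasses.asdict plus json.dumps is a simplified analogue of what mashumaro generates:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Title:
    text: str


@dataclass
class Article:
    title: Title


# asdict recurses into nested dataclasses, producing plain dicts
# that json.dumps can serialize directly.
article = Article(title=Title(text="An Example Article"))
print(json.dumps(asdict(article)))
# {"title": {"text": "An Example Article"}}
```

mashumaro does the same job with generated, type-aware code, which is faster and supports round-tripping back into dataclasses.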

License

MIT

Contributing

You are welcome to add missing features by submitting a PR; however, I won't accept any requests other than those improving GROBID annotation compliance.

Disclaimer

This module was originally part of a group university project; however, all the code and tests were authored by me.
