grobid
Python library for serializing GROBID TEI XML to dataclasses
Installation
Use pip to install:
$ pip install grobid
$ pip install grobid[json] # for JSON serializable dataclass objects
You can also download the .whl file from the release section:
$ pip install *.whl
Usage
Client
To convert an academic PDF to a TEI XML file, we use GROBID's REST services, specifically the processFulltextDocument endpoint.
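from pathlib import Path

# NOTE: the module path for Client and GrobidClientError is an assumption;
# check the package source if this import fails.
from grobid.client import Client, GrobidClientError
from grobid.models.form import Form, File
from grobid.models.response import Response

pdf_file = Path("<your-academic-article>.pdf")

# Read the PDF and wrap it in the form sent to GROBID
with open(pdf_file, "rb") as file:
    form = Form(
        file=File(
            payload=file.read(),
            file_name=pdf_file.name,
            mime_type="application/pdf",
        )
    )

c = Client(base_url="<base-url>", form=form)
try:
    xml_content = c.sync_request().content  # TEI XML file in bytes
except GrobidClientError as e:
    print(e)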
where base-url is the URL of the GROBID REST service. You can use https://cloud.science-miner.com/grobid/ to test.
Form
The Form class supports most of the optional parameters of the processFulltextDocument endpoint.
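For orientation, those optional parameters are ordinary form fields on GROBID's REST API. The sketch below builds the endpoint URL and a field dictionary using only the standard library. It does not use this package's Form class, whose attribute names are not listed here. The helper name build_fulltext_request and the localhost URL are assumptions for illustration; the field names come from GROBID's REST documentation.

```python
from typing import Dict, Tuple


def build_fulltext_request(base_url: str, **options: object) -> Tuple[str, Dict[str, str]]:
    """Return the processFulltextDocument URL and its optional form fields.

    Hypothetical helper, not part of the grobid package.
    """
    # GROBID serves its REST API under <base-url>/api/
    url = base_url.rstrip("/") + "/api/processFulltextDocument"
    # Multipart form values are sent as strings
    fields = {name: str(value) for name, value in options.items()}
    return url, fields


url, fields = build_fulltext_request(
    "http://localhost:8070/",  # assumed local GROBID instance
    consolidateHeader=1,
    consolidateCitations=0,
    segmentSentences=1,
)
print(url)
print(fields)
```

The PDF itself would travel in the same multipart request under the field name input.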
Parser
If you want to serialize the XML content, you can use the Parser class to create dataclass objects.
Not all of the GROBID annotation guidelines are met, but compliance is a goal. See #1.
from grobid.tei import Parser
xml_content: bytes
parser = Parser(xml_content)
article = parser.parse()
article.to_json()  # raises RuntimeError if the 'json' extra is not installed
where xml_content is the same as in the Client section.
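To give a feel for what mapping TEI XML onto dataclasses involves, here is a minimal standard-library sketch that pulls the title and author surnames out of a TEI header. It is not the Parser implementation: MiniArticle and parse_header are hypothetical names, and the sample document is a hand-written fragment in the shape GROBID emits.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

# TEI documents live in this XML namespace
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


@dataclass
class MiniArticle:
    """Hypothetical model, far smaller than the library's own dataclasses."""
    title: str
    author_surnames: List[str] = field(default_factory=list)


def parse_header(xml_content: bytes) -> MiniArticle:
    root = ET.fromstring(xml_content)
    # The main title sits under teiHeader/fileDesc/titleStmt
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    # Authors of the article itself are listed under biblStruct/analytic
    surnames = [
        el.text or ""
        for el in root.iterfind(".//tei:analytic/tei:author/tei:persName/tei:surname", TEI_NS)
    ]
    return MiniArticle(title=title.strip(), author_surnames=surnames)


sample = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

article = parse_header(sample)
print(article)
```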
Alternatively, you can load the XML from a file:
from grobid.tei import Parser

with open("<your-academic-article>.xml", "rb") as xml_file:
    xml_content = xml_file.read()

parser = Parser(xml_content)
article = parser.parse()
article.to_json()  # raises RuntimeError if the 'json' extra is not installed
We use orjson to provide a to_json method that serializes the dataclasses into JSON. orjson isn't installed by default; to get it, use pip install grobid[json].
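The optional-dependency behaviour described here can be sketched as follows. This is an illustration of the general pattern, not the package's code: the Reference class is hypothetical, and whether the library's to_json returns bytes or str is not stated in this README (orjson.dumps itself returns bytes).

```python
from dataclasses import dataclass

try:
    import orjson  # only present when the optional extra is installed
except ImportError:
    orjson = None


@dataclass
class Reference:
    """Hypothetical dataclass standing in for the library's models."""
    title: str
    year: int

    def to_json(self) -> bytes:
        if orjson is None:
            # Fail at call time rather than import time, so the rest of
            # the module stays usable without the extra
            raise RuntimeError("orjson is missing; install with: pip install grobid[json]")
        # orjson serializes dataclass instances natively
        return orjson.dumps(self)


ref = Reference(title="A Sample Paper", year=2020)
try:
    print(ref.to_json())
except RuntimeError as err:
    print(err)
```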
License
MIT
Contributing
You are welcome to add missing features by submitting a PR. However, I won't be accepting any requests other than those for GROBID annotation compliance.
Disclaimer
This module was originally part of a group university project; however, all the code and tests were authored by me.