grobid_tei_xml: Python parser and transforms for GROBID-flavor TEI-XML
This is a simple Python library for parsing the TEI-XML structured documents returned by GROBID, a machine learning tool for extracting text and bibliographic metadata from research article PDFs.
TEI-XML is a standard format, and other libraries exist to parse entire documents and work with annotated text. This library focuses specifically on extracting three things: "header" metadata from documents (e.g., title, authors, journal name, volume, issue); content in flattened text form (full abstract and body text as single strings, for things like search indexing); and structured citation metadata.
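To make the idea of "header" metadata and flattened text concrete, here is a minimal sketch using only the standard library's xml.etree.ElementTree. It is not how grobid_tei_xml is implemented; the SAMPLE_TEI fragment is hand-written rather than real GROBID output, and the extract_header function and the element paths it queries are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A tiny hand-written TEI fragment (illustrative only, not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Article</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
            <author><persName><forename type="first">Alan</forename><surname>Turing</surname></persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <abstract><p>First paragraph.</p><p>Second paragraph.</p></abstract>
    </profileDesc>
  </teiHeader>
</TEI>"""


def flatten_text(element):
    """Join all text inside an element into one whitespace-normalized string."""
    return " ".join(t.strip() for t in element.itertext() if t.strip())


def extract_header(xml_text):
    """Hypothetical helper: pull title, author names, and abstract from TEI-XML."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    abstract_el = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    author_els = root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS)
    return {
        "title": title_el.text if title_el is not None else None,
        "authors": [flatten_text(el) for el in author_els],
        "abstract": flatten_text(abstract_el) if abstract_el is not None else None,
    }


print(extract_header(SAMPLE_TEI))
```

Real GROBID output has many more variations (missing fields, affiliations, multiple title elements), which is the kind of detail a dedicated parser is meant to absorb.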
Quickstart
grobid_tei_xml works with Python 3, using only the standard library. It does not talk to the GROBID HTTP API or read files off disk on its own, but see the examples below. The library is packaged on pypi.org.
Install using pip, usually within a virtualenv:
pip install grobid_tei_xml
The main entry points are the functions parse_document_xml(xml_text) and parse_citation_xml(xml_text) (or parse_citation_list_xml(xml_text) for multiple citations), which return Python dataclass objects. The helper method .to_dict() can be useful for, e.g., serializing these objects to JSON.
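The dataclass-to-JSON pattern can be sketched with the standard library alone. The ExampleAuthor and ExampleHeader classes below are hypothetical stand-ins, not the classes grobid_tei_xml returns, and this to_dict() is simply a wrapper around dataclasses.asdict(); the library's own .to_dict() may behave differently (for example, in how it treats empty fields).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExampleAuthor:
    """Hypothetical stand-in for an author record."""
    full_name: str


@dataclass
class ExampleHeader:
    """Hypothetical stand-in for parsed header metadata."""
    title: Optional[str] = None
    doi: Optional[str] = None
    authors: List[ExampleAuthor] = field(default_factory=list)

    def to_dict(self):
        # asdict() recurses into nested dataclasses and lists,
        # producing plain dicts and lists that json.dumps() accepts.
        return asdict(self)


header = ExampleHeader(
    title="An Example Article",
    authors=[ExampleAuthor("Ada Lovelace")],
)
as_json = json.dumps(header.to_dict(), indent=2, sort_keys=True)
print(as_json)
```

Converting to plain dicts first keeps the serialization step independent of the dataclass definitions, so the same output can also be passed to other encoders.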
Usage Examples
Read an XML file from disk, parse it, and print to stdout as JSON:
import json
import grobid_tei_xml

xml_path = "./tests/files/small.xml"

with open(xml_path, 'r') as xml_file:
    doc = grobid_tei_xml.parse_document_xml(xml_file.read())

print(json.dumps(doc.to_dict(), indent=2))
Use requests to download a PDF from the web, submit it to GROBID (via the HTTP API), parse the TEI-XML response with grobid_tei_xml, and print some metadata fields:
import requests
import grobid_tei_xml

# Download the PDF to submit
pdf_resp = requests.get("https://arxiv.org/pdf/1802.01168v3")
pdf_resp.raise_for_status()

# The PDF goes in the multipart 'input' field; options are sent as regular form fields
grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processFulltextDocument",
    files={
        'input': pdf_resp.content,
    },
    data={
        'consolidateCitations': 0,
        'includeRawCitations': 1,
    },
    timeout=60.0,
)
grobid_resp.raise_for_status()

doc = grobid_tei_xml.parse_document_xml(grobid_resp.text)

# Fields that GROBID could not extract may be None, so convert with str()
print("title: " + str(doc.header.title))
print("authors: " + ", ".join([a.full_name for a in doc.header.authors]))
print("doi: " + str(doc.header.doi))
print("citation count: " + str(len(doc.citations)))
print("abstract: " + str(doc.abstract))
Use requests to submit a "raw" citation string to GROBID for extraction, parse the response with grobid_tei_xml, and print the structured output to stdout:
import requests
import grobid_tei_xml

raw_citation = "Kvenvolden K.A. and Field M.E. 1981. Thermogenic hydrocarbons in unconsolidated sediment of Eel River Basin, offshore northern California. AAPG Bulletin 65:1642-1646"

grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processCitation",
    data={
        'citations': raw_citation,
        'consolidateCitations': 0,
        'includeRawCitations': 1,
    },
    timeout=10.0,
)
grobid_resp.raise_for_status()

citation = grobid_tei_xml.parse_citation_xml(grobid_resp.text)
print(citation)
See Also
grobid_client_python: Python client and CLI tool for making requests to GROBID via HTTP API. Returns TEI-XML; could be used with this library (grobid_tei_xml) for parsing into Python objects or, e.g., JSON.
s2orc-doc2json: Python library from AI2 that includes a similar parser for extracting both bibliographic metadata and (structured) full text from GROBID TEI-XML. Has nice features like resolving in-text references to bibliography entries.
delb: a more flexible and powerful interface to TEI-XML documents. It would be a better tool for working with structured text (body, abstract, etc.).
"Parsing TEI XML documents with Python" (2019): blog post about basic parsing of GROBID TEI-XML files into Pandas DataFrames
License
This library is available under the permissive MIT License. See LICENSE.txt for a copy.