grobid_tei_xml: Python parser and transforms for GROBID-flavor TEI-XML

This is a simple Python library for parsing the TEI-XML structured documents returned by GROBID, a machine learning tool for extracting text and bibliographic metadata from research article PDFs.

TEI-XML is a standard format, and other libraries exist to parse entire documents and work with annotated text. This library focuses specifically on extracting three things: "header" metadata from a document (e.g., title, authors, journal name, volume, issue); content in flattened text form (full abstract and body text as single strings, for things like search indexing); and structured citation metadata.
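To make "header metadata" and "flattened text" concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree. It is not how grobid_tei_xml is implemented; the sample document is hand-written for illustration (real GROBID output is much larger), and the function name extract_title_and_body is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only, not actual GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Paper</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p>First paragraph.</p>
      </div>
      <div>
        <p>Second paragraph.</p>
      </div>
    </body>
  </text>
</TEI>"""


def extract_title_and_body(xml_text):
    """Return (title, flattened body text) from a TEI-XML string.

    Hypothetical helper; either value is None if the element is missing.
    """
    root = ET.fromstring(xml_text)

    title_el = root.find(".//tei:titleStmt/tei:title", NS)
    title = title_el.text if title_el is not None else None

    body_el = root.find(".//tei:text/tei:body", NS)
    if body_el is None:
        return title, None

    # itertext() yields every nested text node in document order;
    # split() + join() collapses runs of whitespace into single spaces.
    body = " ".join("".join(body_el.itertext()).split())
    return title, body


title, body = extract_title_and_body(SAMPLE_TEI)
print(title)
print(body)
```

Flattening discards section structure on purpose: a single string is convenient for search indexing, while tools that need the structure should work on the XML tree directly.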

Quickstart

grobid_tei_xml works with Python 3, using only the standard library. It does not talk to the GROBID HTTP API or read files off disk on its own, but see the examples below. The library is packaged on pypi.org.

Install using pip, usually within a virtualenv:

pip install grobid_tei_xml

The main entry points are the functions parse_document_xml(xml_text) and parse_citation_xml(xml_text) (or parse_citation_list_xml(xml_text) for multiple citations), which return Python dataclass objects. The helper method .to_dict() can be useful for, e.g., serializing these objects to JSON.
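As a rough illustration of the dataclass-plus-.to_dict() pattern, the sketch below defines its own stand-in classes. ExampleAuthor and ExampleBiblio are hypothetical and are not classes from grobid_tei_xml; dropping None fields in to_dict() is a choice made for this sketch, and the library's own behavior may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExampleAuthor:  # hypothetical stand-in, not part of grobid_tei_xml
    full_name: str


@dataclass
class ExampleBiblio:  # hypothetical stand-in, not part of grobid_tei_xml
    title: Optional[str] = None
    doi: Optional[str] = None
    authors: List[ExampleAuthor] = field(default_factory=list)

    def to_dict(self):
        """Convert to plain dicts and lists, dropping fields that are None."""
        # asdict() recurses into nested dataclasses and lists of them.
        return {k: v for k, v in asdict(self).items() if v is not None}


biblio = ExampleBiblio(
    title="An Example Paper",
    authors=[ExampleAuthor("Ada Lovelace")],
)

# A plain dict is directly serializable with the json module.
as_json = json.dumps(biblio.to_dict(), sort_keys=True)
print(as_json)
```

Returning plain dicts keeps serialization concerns out of the parsing code: the same object can be written as JSON, stored in a database row, or inspected in a REPL.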

Usage Examples

Read an XML file from disk, parse it, and print to stdout as JSON:

import json
import grobid_tei_xml

xml_path = "./tests/files/small.xml"

with open(xml_path, 'r') as xml_file:
    doc = grobid_tei_xml.parse_document_xml(xml_file.read())

print(json.dumps(doc.to_dict(), indent=2))

Use requests to download a PDF from the web, submit it to GROBID (via the HTTP API), parse the TEI-XML response with grobid_tei_xml, and print some metadata fields:

import requests
import grobid_tei_xml

pdf_resp = requests.get("https://arxiv.org/pdf/1802.01168v3")
pdf_resp.raise_for_status()

grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processFulltextDocument",
    # The PDF goes in the multipart 'files' part; options are plain form fields.
    files={
        'input': pdf_resp.content,
    },
    data={
        'consolidateCitations': 0,
        'includeRawCitations': 1,
    },
    timeout=60.0,
)
grobid_resp.raise_for_status()

doc = grobid_tei_xml.parse_document_xml(grobid_resp.text)

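# Fields can be missing when GROBID could not extract them, so avoid bare string concatenation.
print(f"title: {doc.header.title}")
authors = doc.header.authors or []
print("authors: " + ", ".join(a.full_name or "" for a in authors))
print(f"doi: {doc.header.doi}")
print(f"citation count: {len(doc.citations or [])}")
print(f"abstract: {doc.abstract}")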

Use requests to submit a "raw" citation string to GROBID for extraction, parse the response with grobid_tei_xml, and print the structured output to stdout:

import requests
import grobid_tei_xml

raw_citation = "Kvenvolden K.A. and Field M.E. 1981. Thermogenic hydrocarbons in unconsolidated sediment of Eel River Basin, offshore northern California. AAPG Bulletin 65:1642-1646"

grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processCitation",
    data={
        'citations': raw_citation,
        'consolidateCitations': 0,
        'includeRawCitations': 1,
    },
    timeout=10.0,
)
grobid_resp.raise_for_status()

citation = grobid_tei_xml.parse_citation_xml(grobid_resp.text)
print(citation)
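For a sense of what "structured output" means here, the following standard-library sketch pulls a few fields out of a TEI biblStruct element. The sample XML is hand-written, modeled on the raw citation string above, and is not captured GROBID output; real responses may use different elements and attributes. The function summarize_citation is hypothetical and is not part of grobid_tei_xml.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample modeled on the raw citation above (an assumption,
# not an actual GROBID response).
SAMPLE_BIBL = """<biblStruct xmlns="http://www.tei-c.org/ns/1.0">
  <analytic>
    <title level="a" type="main">Thermogenic hydrocarbons in unconsolidated sediment of Eel River Basin, offshore northern California</title>
    <author><persName><forename type="first">K</forename><surname>Kvenvolden</surname></persName></author>
    <author><persName><forename type="first">M</forename><surname>Field</surname></persName></author>
  </analytic>
  <monogr>
    <title level="j">AAPG Bulletin</title>
    <imprint>
      <biblScope unit="volume">65</biblScope>
      <biblScope unit="page" from="1642" to="1646" />
      <date type="published" when="1981" />
    </imprint>
  </monogr>
</biblStruct>"""


def summarize_citation(xml_text):
    """Return a dict of selected fields from a biblStruct XML string."""
    root = ET.fromstring(xml_text)

    def text_of(path):
        # Text content of the first match, or None if nothing matches.
        el = root.find(path, NS)
        return el.text if el is not None else None

    def attr_of(path, name):
        # Attribute value of the first match, or None if absent.
        el = root.find(path, NS)
        return el.get(name) if el is not None else None

    return {
        "title": text_of("./tei:analytic/tei:title"),
        "journal": text_of("./tei:monogr/tei:title"),
        "volume": text_of(".//tei:biblScope[@unit='volume']"),
        "first_page": attr_of(".//tei:biblScope[@unit='page']", "from"),
        "last_page": attr_of(".//tei:biblScope[@unit='page']", "to"),
        "year": attr_of(".//tei:imprint/tei:date", "when"),
        "surnames": [
            el.text
            for el in root.findall(".//tei:author/tei:persName/tei:surname", NS)
        ],
    }


summary = summarize_citation(SAMPLE_BIBL)
print(summary)
```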

See Also

grobid_client_python: Python client and CLI tool for making requests to GROBID via the HTTP API. Returns TEI-XML; could be used with this library (grobid_tei_xml) for parsing into Python objects or, e.g., JSON.

GROBID Documentation

s2orc-doc2json: Python library from AI2 that includes similar code for extracting both bibliographic metadata and (structured) full text from GROBID TEI-XML. Has nice features like resolving references to bibliography entries.

delb: a more flexible and powerful interface to TEI-XML documents. It would be a better tool for working with structured text (body, abstract, etc.).

"Parsing TEI XML documents with Python" (2019): blog post about basic parsing of GROBID TEI-XML files into Pandas DataFrames

License

This library is available under the permissive MIT License. See LICENSE.txt for a copy.
