PubMed XML parsing and type hints.

pubmed-types (v0.2.0)

Built with: xsdata-pydantic

Introduction

A complete implementation of the XML schema for PMC Open Access articles and PubMed article sets (citations).

This package parses PubMed XML data into Pydantic models. Doing so validates the input XML and provides type hints for working with the complex XML structures present in PubMed data.

Most Recent Changes

  • Breaking change: parse_pubmed_xml is replaced by pmc_article and pubmed_article_set.
  • More test coverage
  • MathML in PubMed articles can now be parsed
  • Restructured code to separate jats (PMC Open Access articles) from pubmed (PubMed article sets)
  • One unit test with 99% coverage
  • Added CHANGELOG.md

Why do I need this?

PubMed tracks tens of millions of research citations and stores them in a complex XML structure. Parsing XML is challenging enough on its own. Add the feature-rich data inside each citation, and you can easily spend hours or days navigating the XML structure.

The approach here was to autogenerate Pydantic classes that parse the XML, using the xsdata-pydantic tool. This has the benefit of making sure every piece of data is parsed properly, and an error is raised if something is missing or incorrect. Instead of holding the data in dictionaries, Pydantic classes provide type hints with tab completion in IDEs, making it easier to navigate the complex structure of the citation data.
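The contrast between dictionary-style parsing and typed models can be illustrated with a small standard-library sketch. It does not use pubmed-types or xsdata-pydantic: the `Citation` dataclass, the `parse_citation` helper, and the tiny XML snippets are hypothetical stand-ins for the much larger autogenerated models.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical, hand-written stand-in for an autogenerated model."""

    pmid: int
    title: str


def parse_citation(xml_text: str) -> Citation:
    """Build a typed Citation, failing loudly when a required element is absent."""
    root = ET.fromstring(xml_text)
    values = {}
    for tag in ("PMID", "ArticleTitle"):
        text = root.findtext(tag)
        if text is None:
            raise ValueError(f"missing required element: {tag}")
        values[tag] = text
    # int() also rejects a PMID that is not numeric
    return Citation(pmid=int(values["PMID"]), title=values["ArticleTitle"])


good = "<Citation><PMID>123</PMID><ArticleTitle>An example title</ArticleTitle></Citation>"
bad = "<Citation><PMID>123</PMID></Citation>"

citation = parse_citation(good)
print(citation.title)  # attribute access that an IDE can autocomplete

try:
    parse_citation(bad)
except ValueError as err:
    print(f"rejected: {err}")
```

The generated Pydantic classes apply the same idea to the whole schema, so the checks do not have to be written by hand.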

How do I use it?

It is possible to use xsdata-pydantic and the autogenerated classes directly to parse an XML file, but the package provides convenience functions to easily open PubMed XML citations and PMC Open Access articles.

Example 1: A PMC Open Access Article

import tarfile
import urllib.request as request
from contextlib import closing
from pathlib import Path

from pubmed_types import pmc_article

# Input file source and output file destination
source = (
    "ftp://ftp.ncbi.nlm.nih.gov"
    + "/pub/pmc/oa_bulk/oa_comm/xml"
    + "/oa_comm_xml.incr.2023-03-21.tar.gz"
)
destination = Path("downloads")
destination.mkdir(exist_ok=True)

# 1. Get an open access article dataset from the FTP server
with closing(request.urlopen(source)) as url:
    with tarfile.open(fileobj=url, mode="r:gz") as fr:
        fr.extractall(destination)

# 2. Parse the file
file_path = destination.joinpath("PMC009xxxxxx").joinpath("PMC9970662.xml")
full_text = pmc_article(file_path)

# 3. Print out the article title
print(f"Title: {full_text.front.article_meta.title_group.article_title.content[0]}")

Output:

Title: Lactate as a myokine and exerkine: drivers and signals of physiology and metabolism
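Example 1 hard-codes the subdirectory that holds the article. When that directory is not known in advance, a search below the extraction folder may help. This is a minimal sketch using only the standard library; `find_article_xml` is a hypothetical helper and not part of pubmed-types, and it assumes the file is named after its PMC identifier as in the example above.

```python
from pathlib import Path
from typing import Optional


def find_article_xml(root: Path, pmcid: str) -> Optional[Path]:
    """Return the first file named '<pmcid>.xml' anywhere below root, or None."""
    if not root.is_dir():
        return None
    matches = sorted(root.rglob(f"{pmcid}.xml"))
    return matches[0] if matches else None


if __name__ == "__main__":
    path = find_article_xml(Path("downloads"), "PMC9970662")
    if path is None:
        print("article not found; run the download step first")
    else:
        print(path)
```

The returned path can then be passed to pmc_article in place of the hand-built file_path.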

Example 2: A PubMed daily update citation file

import gzip
import urllib.request as request
from contextlib import closing
from pathlib import Path

from pubmed_types import pubmed_article_set

# Input file source and output file destination
source = "ftp://ftp.ncbi.nlm.nih.gov" + "/pubmed/updatefiles" + "/pubmed23n1168.xml.gz"
destination = Path("downloads").joinpath("pubmed23n1168.xml")
destination.parent.mkdir(exist_ok=True)

# 1. Get a pubmed citation daily update file from the FTP server
with closing(request.urlopen(source)) as url:
    with gzip.GzipFile(fileobj=url, mode="rb") as fr:
        with open(destination, mode="wb") as fw:
            fw.write(fr.read())

# 2. Parse the file
article_set = pubmed_article_set(destination)

# 3. Get the number of citations in the file
print(f"Number of citations: {len(article_set.pubmed_article)}")
print(
    f"{article_set.pubmed_article[0].medline_citation.article.article_title.content[0]}"
)

Output:

Number of citations: 2543
A Patent and Pattern Mother.
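Step 1 of Example 2 reads the whole decompressed file into memory before writing it. For large files, a chunked copy may be gentler on memory. The sketch below uses only the standard library; `gunzip_to` and its `chunk_size` parameter are hypothetical names, and the in-memory payload merely stands in for the FTP response.

```python
import gzip
import io
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO


def gunzip_to(compressed: BinaryIO, destination: Path, chunk_size: int = 1024 * 1024) -> int:
    """Decompress a gzip stream to destination in chunks; return the bytes written."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with gzip.GzipFile(fileobj=compressed, mode="rb") as fr:
        with open(destination, mode="wb") as fw:
            # copyfileobj reads and writes chunk_size bytes at a time
            shutil.copyfileobj(fr, fw, chunk_size)
    return destination.stat().st_size


# Demonstration with an in-memory gzip payload instead of a network download
payload = b"<PubmedArticleSet></PubmedArticleSet>"
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sample.xml"
    written = gunzip_to(io.BytesIO(gzip.compress(payload)), out)
    print(f"wrote {written} bytes")
```

In Example 2, the `url` object returned by request.urlopen(source) would take the place of the in-memory stream.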

FAQ

Why does it take so long to parse a PubMed citation set?

There is a lot of data, and the schema is deep and complex.

Why are the return structures so complicated?

The return structures are a direct reflection of the XML format defined by the NLM. In the future some utility classes might be made for common components (title, authors, etc), but for now this is intended to be an unbiased way of parsing the XML.
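Until such utilities exist, a small helper can make the long attribute chains easier to handle. The sketch below is hypothetical and not part of pubmed-types: `dig` walks a path of attribute names and list indexes and returns a default when any step is missing. The path mirrors the one used in Example 2, and a `SimpleNamespace` object stands in for a parsed citation so the snippet runs without downloading data.

```python
from types import SimpleNamespace
from typing import Any, Sequence, Union

Step = Union[str, int]


def dig(obj: Any, path: Sequence[Step], default: Any = None) -> Any:
    """Follow attribute names (str) and indexes (int); return default if a step is missing."""
    current = obj
    for step in path:
        try:
            current = current[step] if isinstance(step, int) else getattr(current, step)
        except (AttributeError, IndexError, TypeError):
            return default
        if current is None:
            return default
    return current


# Same attribute path as in Example 2, relative to one pubmed_article entry
TITLE_PATH = ("medline_citation", "article", "article_title", "content", 0)

# Stand-in shaped like one entry of article_set.pubmed_article
fake_citation = SimpleNamespace(
    medline_citation=SimpleNamespace(
        article=SimpleNamespace(
            article_title=SimpleNamespace(content=["A Patent and Pattern Mother."])
        )
    )
)

print(dig(fake_citation, TITLE_PATH))
print(dig(SimpleNamespace(), TITLE_PATH, default="<no title>"))
```

With real data, a call such as `dig(article_set.pubmed_article[0], TITLE_PATH)` would follow the same chain as the print statement in Example 2.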
