Download and parse pubmed publication data

Read XML files and pull out selected values. The values to collect are determined by paths listed in a structure file. The structure file also includes a key, which associates each value with a parent element, and names, which determine the output file each value is written to.

Files can be passed as gzipped or uncompressed XML files, or read from standard input.
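For readers preparing their own inputs, the following is a minimal sketch of how a script could accept the same three kinds of input. The helper open_xml is hypothetical and is not part of pubmedparser; it only assumes that gzip files start with the standard two-byte gzip magic number.

```python
import gzip
import sys
from typing import BinaryIO, Optional

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def open_xml(path: Optional[str] = None) -> BinaryIO:
    """Return a binary stream for path, or standard input when path is None.

    Hypothetical helper for illustration; not part of pubmedparser.
    """
    if path is None:
        return sys.stdin.buffer
    # Peek at the first two bytes to decide whether the file is gzipped.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")
```

Checking the magic number rather than the file extension means a gzipped file is still detected when it lacks a .gz suffix.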

For more info on Pubmed's XML files see: pubmed190101.dtd.

Usage:

xml_read --cache-dir=cache --structure-file=structure.yml \
    data/*.xml.gz

Or, with python:

import pubmedparser
import pubmedparser.ftp

# Download data
files = pubmedparser.ftp.download(range(1, 6))

# Read XML files using a YAML file to describe what data to collect.
data_dir = "file_example"
structure_file = "example/structure.yml"
results = pubmedparser.read_xml(files, structure_file, data_dir)

See the example file for more options.

In Python, a dictionary of dictionaries can be used in place of the structure file.

Building CLI

Requires zlib.

Clone the repository and in the directory run:

make cli

or using nix:

nix shell "gitlab:DavidRConnell/pubmedparser"

Installing with pip

pip install pubmedparser2

Structure file

The structure file is a YAML file containing key-value pairs for different tags and paths. There are two required keys: root and key. Root provides the top-level tag; for the pubmed files this is PubmedArticleSet.

root: "/PubmedArticleSet"

The leading / is not strictly required, as the program ignores it, but it is used to conform to XPath syntax (although this program does not handle all XPath cases).
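As a rough illustration of why the slash is optional, the sketch below splits a path into its tags and discards empty segments, so paths with and without a leading slash give the same result. The function split_path is hypothetical and is not part of pubmedparser; the real parser may treat paths differently.

```python
def split_path(path):
    """Split a structure-file path into tags, ignoring any slashes.

    Hypothetical helper for illustration; not the parser's own logic.
    """
    return [part for part in path.split("/") if part]


print(split_path("/PubmedArticleSet"))  # ['PubmedArticleSet']
print(split_path("PubmedArticleSet"))   # ['PubmedArticleSet']
print(split_path("/PubmedArticle/MedlineCitation/PMID"))
# ['PubmedArticle', 'MedlineCitation', 'PMID']
```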

Only tags below the root tag will be considered and the parsing will terminate once the program has left the root of the tree.

Key is a reference tag. In the pubmed case, all data is with respect to a publication, so the key should identify the publication the values are linked to. The PMID tag is a suitable candidate.

key: "/PubmedArticle/MedlineCitation/PMID"

After root, all paths are taken as relative to the root node.

The other name-value pairs in the file determine what other items to collect. These can either be a simple name and path, like the key, such as:

Language: "/PubmedArticle/MedlineCitation/Article/Language"
Keywords: "/PubmedArticle/MedlineCitation/KeywordList/Keyword"

Or they can use a hierarchical representation to get multiple values below a child. This is mainly used to handle lists of items where there is an indefinite number of items below the list.

Author: {
  root: "/PubmedArticle/MedlineCitation/Article/AuthorList",
  key: "/Author/auto_index",
  LastName: "/Author/LastName",
  ForeName: "/Author/ForeName",
  Affiliation: "/Author/AffiliationInfo/Affiliation",
  Orcid: "/Author/Identifier/[@Source='ORCID']"
}

Here, all paths are relative to the sub-structure's root path, which is in turn relative to the parent structure's root. The sub-structure follows the same rules as the parent structure, so it needs both a root and a key name-value pair. The results of searching each path are written to separate files, and each file gets a column for the parent key and a column for the child key. So in this case, each element of an author is linked by an author key, and that author is related to the publication they authored through the parent key.

The main parser is called recursively to parse this structure, so it is worth choosing the root with that in mind: the parser will be called once for each match of the root. If, instead of stopping at /AuthorList, /Author were added to the end of the root, the parser would be called for each individual author rather than once per author list, and every author would get the index 0.

There are a number of additional syntax constructs to note in the above example. The key uses the special name auto_index. Since there is no author ID in the XML data, an index is used to count the authors in the order they appear. The index resets for each publication and starts at 0. Treating auto_index as the tail of a path lets you control when the indexing occurs: the index is incremented whenever the parser hits an /Author tag.
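The indexing behavior described above can be mimicked with the standard library. This is only a sketch of the resulting numbering, not the parser's implementation: it uses xml.etree.ElementTree on a small invented sample, and the function last_name_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample data, trimmed to the tags used in the Author example.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>1</PMID>
    <Article><AuthorList>
      <Author><LastName>Smith</LastName></Author>
      <Author><LastName>Jones</LastName></Author>
    </AuthorList></Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>2</PMID>
    <Article><AuthorList>
      <Author><LastName>Lee</LastName></Author>
    </AuthorList></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>
"""


def last_name_rows(xml_text):
    """Return (parent key, author index, last name) rows."""
    rows = []
    root = ET.fromstring(xml_text.strip())
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        for author_list in article.iter("AuthorList"):
            # The index restarts at 0 for every author list.
            for index, author in enumerate(author_list.findall("Author")):
                rows.append((pmid, index, author.findtext("LastName")))
    return rows


for row in last_name_rows(SAMPLE):
    print(row)
# ('1', 0, 'Smith')
# ('1', 1, 'Jones')
# ('2', 0, 'Lee')
```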

In addition to the auto_index key, there is a second special index name, condensed.

Reference: {
  root: "/PubmedArticle/PubmedData/ReferenceList/Reference/ArticleIdList",
  key: "/condensed",
  PMID: "/ArticleId/[@IdType='pubmed']",
  DOI: "/ArticleId/[@IdType='doi']"
}

In the case of condensed, instead of writing the results to separate files, they will be printed as columns in the same file, and therefore do not need an additional key for the sub-structure. If any of the elements are missing, they will be left blank. For example, if the parser does not find a pubmed ID for a given reference, the row will look like "%s\t\t%s", where the first string contains the parent key (the PMID of the publication citing this reference) and the second string contains the reference's DOI.
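To make the row layout concrete, here is a small sketch that builds such a tab-separated row with a blank column for a missing value. The function condensed_row and the sample values are invented for illustration and are not part of pubmedparser.

```python
def condensed_row(parent_key, values):
    """Join a parent key and its values into one tab-separated row.

    Missing values (None) become empty columns. Hypothetical helper.
    """
    return "\t".join([parent_key] + ["" if v is None else v for v in values])


# No pubmed ID found for the reference, only a DOI (invented values).
row = condensed_row("12345", [None, "10.1000/example"])
print(repr(row))  # '12345\t\t10.1000/example'
```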

The /[@attribute='value'] syntax at the end of a path tells the parser to collect an element only if it has the named attribute and the attribute's value matches the supplied value. Similarly, the /@attribute syntax tells the parser to collect the value of the named attribute along with the element's value; both values are then written to the output file. Currently only a single attribute can be specified.
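The two attribute forms can be approximated with the standard library's xml.etree.ElementTree, which supports a similar [@attribute='value'] predicate. This is a sketch on invented sample data, not the parser's code, and the column order pubmedparser uses when writing the attribute value next to the element value is not assumed here.

```python
import xml.etree.ElementTree as ET

# Invented sample data.
SAMPLE = """
<ArticleIdList>
  <ArticleId IdType="pubmed">12345</ArticleId>
  <ArticleId IdType="doi">10.1000/example</ArticleId>
</ArticleIdList>
"""
id_list = ET.fromstring(SAMPLE.strip())

# Analogue of "/ArticleId/[@IdType='doi']": keep only matching elements.
dois = [el.text for el in id_list.findall("ArticleId[@IdType='doi']")]

# Analogue of "/ArticleId/@IdType": collect the attribute value along
# with the element's own value.
typed_ids = [(el.text, el.get("IdType")) for el in id_list.findall("ArticleId")]

print(dois)       # ['10.1000/example']
print(typed_ids)  # [('12345', 'pubmed'), ('10.1000/example', 'doi')]
```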

Lastly, there is a special syntax for writing condensed sub-structures:

Date: "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/{Year,Month,Day}"

The {child,child,child} syntax allows you to select multiple children at the same level to be printed to a single file. This is useful when multiple children make up a single piece of information (e.g. the publication date).
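One way to read the brace syntax is as shorthand for several sibling paths that share a parent. The sketch below expands such a path into its individual child paths; expand_children is a hypothetical helper written for illustration and does not reflect how pubmedparser processes the syntax internally.

```python
import re


def expand_children(path):
    """Expand a trailing {a,b,c} group into one path per child.

    Paths without a trailing brace group are returned unchanged.
    """
    match = re.fullmatch(r"(.*)/\{([^{}]+)\}", path)
    if match is None:
        return [path]
    parent, children = match.groups()
    return [f"{parent}/{child.strip()}" for child in children.split(",")]


date_path = (
    "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue"
    "/PubDate/{Year,Month,Day}"
)
for child_path in expand_children(date_path):
    print(child_path)
```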

A similar example structure file can be found in the example directory of this project at example/structure.yml.

Structure dictionary

The structure of the XML data to read can also be described as a Python dictionary of dictionaries.

The form is similar to the file:

structure = {
    "root": "/PubmedArticleSet",
    "key": "/PubmedArticle/MedlineCitation/PMID",
    "DOI": "/PubmedArticle/PubmedData/ArticleIdList/ArticleId/[@IdType='doi']",
    "Date": "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/{Year,Month,Day}",
    "Journal": "/PubmedArticle/MedlineCitation/Article/Journal/{Title,ISOAbbreviation}",
    "Language": "/PubmedArticle/MedlineCitation/Article/Language",
    "Author": {
        "root": "/PubmedArticle/MedlineCitation/Article/AuthorList",
        "key": "/Author/auto_index",
        "LastName": "/Author/LastName",
        "ForeName": "/Author/ForeName",
        "Affiliation": "/Author/AffiliationInfo/Affiliation",
        "Orcid": "/Author/Identifier/[@Source='ORCID']",
    },
    "Grant": {
        "root": "/PubmedArticle/MedlineCitation/Article/GrantList",
        "key": "/Grant/auto_index",
        "ID": "/Grant/GrantID",
        "Agency": "/Grant/Agency",
    },
    "Chemical": "/PubmedArticle/MedlineCitation/ChemicalList/Chemical/NameOfSubstance/@UI",
    "Qualifier": "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/QualifierName/@UI",
    "Descriptor": "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName/@UI",
    "Keywords": "/PubmedArticle/MedlineCitation/KeywordList/Keyword",
    "Reference": {
        "root": (
            "/PubmedArticle/PubmedData/ReferenceList/Reference/ArticleIdList"
        ),
        "key": "/condensed",
        "PMID": "/ArticleId/[@IdType='pubmed']",
        "DOI": "/ArticleId/[@IdType='doi']",
    },
}

This can then be passed to pubmedparser.read_xml in place of the structure file.
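Because every structure and sub-structure needs both a root and a key, it can be handy to check a dictionary before passing it on. The following is a minimal sketch of such a check; missing_required_keys is a hypothetical helper, not part of pubmedparser, and it only tests for the presence of the two names.

```python
def missing_required_keys(structure, name="<top level>"):
    """Return (structure name, missing key) pairs for a nested structure."""
    problems = [
        (name, required)
        for required in ("root", "key")
        if required not in structure
    ]
    # Sub-structures are nested dictionaries and follow the same rules.
    for child_name, value in structure.items():
        if isinstance(value, dict):
            problems.extend(missing_required_keys(value, child_name))
    return problems


incomplete = {
    "root": "/PubmedArticleSet",
    "key": "/PubmedArticle/MedlineCitation/PMID",
    "Author": {
        "root": "/PubmedArticle/MedlineCitation/Article/AuthorList",
        "LastName": "/Author/LastName",
    },
}
print(missing_required_keys(incomplete))  # [('Author', 'key')]
```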

Future goals

Improve printing logic

Currently, values are printed as they are read in. Since the results for the different paths are written to separate files, this shouldn't matter, except for the case of the key. The key is not printed to its own results file, instead whatever the last seen key was is printed as the key for the current value being printed. If the key is not the first element to be read in the subtree, there will be a mismatch between value and publication ID.

For PMID, the key is consistently the first element, so there should not be a problem; in other scenarios, however, a mismatch could occur.
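The mismatch can be reproduced with a toy model of the "last seen key" rule. The sketch below is not the parser's code: attach_last_seen_key is a hypothetical function, and the event lists are invented to stand in for elements arriving in document order.

```python
def attach_last_seen_key(events, key_tag="PMID"):
    """Pair each non-key value with whichever key was seen most recently."""
    rows, last_key = [], None
    for tag, value in events:
        if tag == key_tag:
            last_key = value
        else:
            rows.append((last_key, tag, value))
    return rows


# Key appears before the values in each publication: rows are correct.
key_first = [("PMID", "1"), ("Language", "eng"), ("PMID", "2"), ("Language", "fre")]
print(attach_last_seen_key(key_first))
# [('1', 'Language', 'eng'), ('2', 'Language', 'fre')]

# Key appears after the values: each value is paired with the wrong key.
key_last = [("Language", "eng"), ("PMID", "1"), ("Language", "fre"), ("PMID", "2")]
print(attach_last_seen_key(key_last))
# [(None, 'Language', 'eng'), ('1', 'Language', 'fre')]
```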

Error handling

After refactoring the code, I have started adding some error handling code; however, this has not been applied consistently. Ideally, the default behavior will be for functions to return error codes, with an error checking macro used to test that the result was not an error. I would also like to add a set of error strings that would be printed depending on the error code, and possibly use a structure to represent errors so that the erroring function could supply an additional string along with the error.

Better error handling like this could also allow the Python package to write its own error handling function in the C API, overriding the default error mechanism to use Python-level errors. This would be done by testing whether an error handler function was defined: if so, the error checking macro would use that function; otherwise it would fall back to a default function.
