Download and parse pubmed publication data

Read XML files and pull out selected values. The values to collect are determined by paths listed in a structure file. The structure file also defines a key, which links each value to its parent element, and names, which determine the output file each value is written to.

Input can be passed as gzipped or uncompressed XML files, or read from standard input.
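As an illustration of handling both kinds of file, the sketch below picks a reader by checking the gzip magic bytes. The helper name open_xml and the use of "-" for standard input are assumptions made for this example; they are not part of pubmedparser, which does its own detection internally.

```python
import gzip
import sys
from typing import BinaryIO


def open_xml(path: str) -> BinaryIO:
    """Return a binary stream for a gzipped file, a plain file, or stdin.

    Hypothetical helper: "-" is an assumed convention for standard input.
    """
    if path == "-":
        return sys.stdin.buffer
    with open(path, "rb") as probe:
        magic = probe.read(2)
    # Gzip streams always start with the two bytes 0x1f 0x8b.
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")
```

Checking the content rather than the file extension means a misnamed file is still read correctly.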

For more information on PubMed's XML files, see pubmed_190101_.dtd.

Usage:

import pubmedparser
import pubmedparser.ftp

# Download data
files = pubmedparser.ftp.download(range(1, 6))

# Read XML files using a YAML file to describe what data to collect.
data_dir = "file_example"
structure_file = "example/structure.yml"
results = pubmedparser.read_xml(files, structure_file, data_dir)

See the example file for more options.

In Python, a dictionary of dictionaries can be used in place of the structure file.

Or, as a CLI:

xml_read --cache-dir=cache --structure-file=structure.yml \
    data/*.xml.gz

Installing with pip

pip install pubmedparser2

Building python package

Requires zlib.

Clone the repository and cd into the directory. Then use poetry to build and install the package.

make python

Structure file

The structure file is a YAML file containing key-value pairs for different tags and paths. There are two required keys: root and key. Root provides the top-level tag; for the PubMed files this is PubmedArticleSet.

root: "/PubmedArticleSet"

The leading / is not strictly required, since the program ignores it, but it is kept to conform to XPath syntax (although this program does not handle all XPath cases).

Only tags below the root tag will be considered and the parsing will terminate once the program has left the root of the tree.

Key is a reference tag. In the pubmed case, all data is with respect to a publication, so the key should identify the publication the values are linked to. The PMID tag is a suitable candidate.

key: "/PubmedArticle/MedlineCitation/PMID"

After root, all paths are taken as relative to the root node.
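The two rules above (slashes are ignored, and paths are relative to the root) can be sketched in a few lines. The function names split_path and absolute_path are hypothetical and only illustrate the idea; they do not mirror the parser's internals.

```python
def split_path(path: str) -> list[str]:
    """Split a structure path into tag names, ignoring empty segments."""
    return [part for part in path.split("/") if part]


def absolute_path(root: str, relative: str) -> list[str]:
    """Tags from the document root down to the requested element."""
    return split_path(root) + split_path(relative)


# With or without the leading slash, the root resolves to the same tag.
print(split_path("/PubmedArticleSet") == split_path("PubmedArticleSet"))
print(absolute_path("/PubmedArticleSet", "/PubmedArticle/MedlineCitation/PMID"))
```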

The other name-value pairs in the file determine what other items to collect. These can be a simple name and path, like the key:

Language: "/PubmedArticle/MedlineCitation/Article/Language"
Keywords: "/PubmedArticle/MedlineCitation/KeywordList/Keyword"

Or they can use a hierarchical representation to get multiple values below a child. This is mainly used to handle lists of items where there is an indefinite number of items below the list.

Author: {
  root: "/PubmedArticle/MedlineCitation/Article/AuthorList",
  key: "/Author/auto_index",
  LastName: "/Author/LastName",
  ForeName: "/Author/ForeName",
  Affiliation: "/Author/AffiliationInfo/Affiliation",
  Orcid: "/Author/Identifier/[@Source='ORCID']"
}

Here, all paths are relative to the sub-structure's root path, which is in turn relative to the parent structure's root. The sub-structure follows the same rules as the parent structure, so it needs both a root and a key name-value pair. The results for each path are written to separate files, and each file gets a column for the parent key and a column for the child key. In this case, each element of an author is linked by the author key, which is related to the publication through the parent key.
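To show how the parent and child keys tie the separate files back together, here is a small sketch that joins two tables on both keys. It assumes each output file is tab-separated with three columns (parent key, child key, value); the exact layout and the sample rows are made up for illustration and should be checked against real output.

```python
import csv
import io

# Made-up contents of two output files: PMID, author index, value.
last_names = "100\t0\tDoe\n100\t1\tRoe\n200\t0\tPoe\n"
fore_names = "100\t0\tJane\n100\t1\tRichard\n200\t0\tPat\n"


def read_rows(text: str) -> dict[tuple[str, str], str]:
    """Map (parent key, child key) to the value in a three-column table."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {(pmid, index): value for pmid, index, value in reader}


def join_authors(last: str, fore: str) -> list[tuple[str, str, str]]:
    """Combine both tables into (PMID, fore name, last name) triples."""
    fore_by_key = read_rows(fore)
    return [
        (pmid, fore_by_key.get((pmid, index), ""), value)
        for (pmid, index), value in read_rows(last).items()
    ]
```

Joining on the pair of keys rather than the author index alone matters, because the index restarts for every publication.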

The main parser is called recursively to parse this sub-structure, so choose the root with that in mind: the parser is invoked once for each element matching the root. If /Author were appended to the root instead of stopping at /AuthorList, the parser would be called once per author rather than once per author list, and every author would get the index 0.

The example above uses some additional syntax. The key uses the special name auto_index: since the XML data has no author ID, an index counts the authors in the order they appear. The index starts at 0 and resets for each publication. Treating auto_index as the tail of a path lets you control when indexing occurs: the index is incremented each time the parser hits an /Author tag.
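The indexing behaviour can be imitated with the standard library to make it concrete. This is a sketch using xml.etree.ElementTree on a made-up, heavily trimmed document; it mimics the numbering described above and is not how pubmedparser itself is implemented.

```python
import xml.etree.ElementTree as ET

# Made-up input with two publications; real records carry many more tags.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID><Article><AuthorList>
    <Author><LastName>A</LastName></Author>
    <Author><LastName>B</LastName></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID><Article><AuthorList>
    <Author><LastName>C</LastName></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>
"""


def author_rows(xml_text: str) -> list[tuple[str, int, str]]:
    """Return (PMID, author index, last name) for every author."""
    rows = []
    for article in ET.fromstring(xml_text.strip()).iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        authors = article.iterfind("MedlineCitation/Article/AuthorList/Author")
        # enumerate restarts at 0 for each publication's author list.
        for index, author in enumerate(authors):
            rows.append((pmid, index, author.findtext("LastName")))
    return rows
```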

In addition to the auto_index key, there is a second special index name, condensed.

Reference: {
  root: "/PubmedArticle/PubmedData/ReferenceList/Reference/ArticleIdList",
  key: "/condensed",
  PMID: "/ArticleId/[@IdType='pubmed']",
  DOI: "/ArticleId/[@IdType='doi']"
}

With condensed, the results are printed as columns in the same file instead of being written to separate files, so the sub-structure does not need an additional key. Any missing element is left blank. For example, if the parser does not find a pubmed ID for a given reference, the row will look like "%s\t\t%s", where the first string is the parent key (the PMID of the publication citing this reference) and the second string is the reference's DOI.
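The row layout described above can be reproduced with a short sketch. The function name condensed_row and the sample values are hypothetical; the point is only that a missing value leaves an empty field so the remaining columns stay aligned.

```python
def condensed_row(parent_key: str, values: dict[str, str], columns: list[str]) -> str:
    """Join the parent key and one value per column with tabs.

    Missing values become empty strings so the columns stay aligned.
    """
    return "\t".join([parent_key] + [values.get(name, "") for name in columns])


# A reference with a DOI but no pubmed ID leaves the PMID column empty.
row = condensed_row("123", {"DOI": "10.1000/example"}, ["PMID", "DOI"])
print(repr(row))
```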

The /[@attribute='value'] syntax at the end of a path tells the parser to collect an element only if it has the given attribute and the attribute's value matches the supplied value. Similarly, the /@attribute syntax tells the parser to collect the value of the named attribute along with the element's value; both values are then written to the output file. Currently only a single attribute can be specified.
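The difference between the two attribute forms can be shown with xml.etree.ElementTree on a made-up fragment. The helper names and the order of the returned pair are assumptions for this sketch, not a description of the parser's output format.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; the UI value is a placeholder, not a real identifier.
FRAGMENT = """<Root>
  <ArticleIdList>
    <ArticleId IdType="pubmed">123</ArticleId>
    <ArticleId IdType="doi">10.1000/example</ArticleId>
  </ArticleIdList>
  <Chemical><NameOfSubstance UI="X0001">Example substance</NameOfSubstance></Chemical>
</Root>"""


def match_attribute(parent, tag, attribute, value):
    """Like /[@attribute='value']: keep elements whose attribute matches."""
    return [el.text for el in parent.iter(tag) if el.get(attribute) == value]


def with_attribute(parent, tag, attribute):
    """Like /@attribute: collect the element text and the attribute value."""
    return [(el.text, el.get(attribute)) for el in parent.iter(tag)]


root = ET.fromstring(FRAGMENT)
dois = match_attribute(root, "ArticleId", "IdType", "doi")
chemicals = with_attribute(root, "NameOfSubstance", "UI")
```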

Lastly, there is a special syntax for writing condensed sub-structures:

Date: "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/{Year,Month,Day}"

The {child,child,child} syntax selects multiple children at the same level and prints them to a single file. This is useful when several children together make up one piece of information (e.g. the publication date).
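One way to read the brace syntax is as shorthand for several sibling paths. The sketch below expands a trailing group into one path per child; expand_children is a hypothetical helper for illustration and says nothing about how the parser handles the syntax internally.

```python
import re


def expand_children(path: str) -> list[str]:
    """Expand a trailing {a,b,c} group into one path per child."""
    match = re.fullmatch(r"(.*)/\{([^{}]*)\}", path)
    if match is None:
        # No brace group: the path already names a single element.
        return [path]
    parent, children = match.groups()
    return [f"{parent}/{child.strip()}" for child in children.split(",")]


print(expand_children("/JournalIssue/PubDate/{Year,Month,Day}"))
```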

A similar example structure file can be found in the example directory of this project at example/structure.yml.

Structure dictionary

The structure of the XML data to read can also be described as a Python dictionary of dictionaries.

The form is similar to the file:

structure = {
    "root": "/PubmedArticleSet",
    "key": "/PubmedArticle/MedlineCitation/PMID",
    "DOI": "/PubmedArticle/PubmedData/ArticleIdList/ArticleId/[@IdType='doi']",
    "Date": "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/{Year,Month,Day}",
    "Journal": "/PubmedArticle/MedlineCitation/Article/Journal/{Title,ISOAbbreviation}",
    "Language": "/PubmedArticle/MedlineCitation/Article/Language",
    "Author": {
        "root": "/PubmedArticle/MedlineCitation/Article/AuthorList",
        "key": "/Author/auto_index",
        "LastName": "/Author/LastName",
        "ForeName": "/Author/ForeName",
        "Affiliation": "/Author/AffiliationInfo/Affiliation",
        "Orcid": "/Author/Identifier/[@Source='ORCID']",
    },
    "Grant": {
        "root": "/PubmedArticle/MedlineCitation/Article/GrantList",
        "key": "/Grant/auto_index",
        "ID": "/Grant/GrantID",
        "Agency": "/Grant/Agency",
    },
    "Chemical": "/PubmedArticle/MedlineCitation/ChemicalList/Chemical/NameOfSubstance/@UI",
    "Qualifier": "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/QualifierName/@UI",
    "Descriptor": "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName/@UI",
    "Keywords": "/PubmedArticle/MedlineCitation/KeywordList/Keyword",
    "Reference": {
        "root": (
            "/PubmedArticle/PubmedData/ReferenceList/Reference/ArticleIdList"
        ),
        "key": "/condensed",
        "PMID": "/ArticleId/[@IdType='pubmed']",
        "DOI": "/ArticleId/[@IdType='doi']",
    },
}

This can then be passed to pubmedparser.read_xml in place of the structure file.
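Since every structure and sub-structure needs both a root and a key, it can help to check a dictionary before handing it to the parser. The following is a minimal sketch; missing_keys is a hypothetical helper written for this example and is not provided by pubmedparser.

```python
def missing_keys(structure: dict, name: str = "top level") -> list[str]:
    """List every (sub-)structure that lacks a "root" or "key" entry."""
    problems = [
        f"{name}: missing {required!r}"
        for required in ("root", "key")
        if required not in structure
    ]
    for child_name, child in structure.items():
        # Nested dictionaries are sub-structures and follow the same rules.
        if isinstance(child, dict):
            problems.extend(missing_keys(child, child_name))
    return problems


incomplete = {
    "root": "/PubmedArticleSet",
    "Author": {"root": "/PubmedArticle/MedlineCitation/Article/AuthorList"},
}
print(missing_keys(incomplete))
```

An empty list means both required entries are present at every level.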
