Rxiv XML/JSON parsing and typehints.

Project description

rxiv-types (v0.1.0)

Built with: xsdata-pydantic

Introduction

A complete implementation of the XML/JSON schema for *Rxiv preprint servers. This covers arXiv, medRxiv, bioRxiv, ChemRxiv, and DOAJ.

This package parses XML/JSON data into Pydantic models. Doing so validates the input XML and provides type hints for working with the complex XML structures present in *Rxiv data.

Why do I need this?

Parsing XML on its own is challenging. Add the feature-rich data inside each citation, and you can easily spend hours or days navigating the XML structure.

The approach here was to autogenerate Pydantic classes to parse the XML using the xsdata-pydantic tool. This approach has the benefit of making sure every piece of data is parsed properly, and an error is thrown if something is missing or incorrect. Instead of using dictionaries to hold the data, Pydantic classes have the benefit of providing type hints with tab completion for IDEs, making it easier to navigate the complex structure of the citation data.
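To make the contrast with plain dictionaries concrete, here is a minimal sketch using only the standard library. It is not the code that xsdata-pydantic generates: the `DublinCoreRecord` class, the `parse_record` function, and the sample XML are hypothetical stand-ins, and a dataclass is swapped in for a Pydantic model so the example runs without extra dependencies. It only illustrates the idea of typed attributes plus an error when a required element is missing.

```python
from dataclasses import dataclass
from typing import List
import xml.etree.ElementTree as ET

# Dublin Core element namespace used by oai_dc metadata.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


@dataclass
class DublinCoreRecord:
    """Hypothetical stand-in for a generated model; not part of rxiv-types."""

    title: List[str]
    creator: List[str]


def parse_record(xml_text: str) -> DublinCoreRecord:
    """Parse a small Dublin Core fragment and fail loudly if the title is missing."""
    root = ET.fromstring(xml_text)
    titles = [el.text or "" for el in root.findall("dc:title", DC_NS)]
    creators = [el.text or "" for el in root.findall("dc:creator", DC_NS)]
    if not titles:
        raise ValueError("record has no <dc:title> element")
    return DublinCoreRecord(title=titles, creator=creators)


# Made-up sample data for illustration only.
SAMPLE = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example preprint</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
</metadata>"""

record = parse_record(SAMPLE)
print(record.title[0])            # attribute access that an IDE can complete
print("; ".join(record.creator))
```

With a dictionary, a misspelled key such as `data["titel"]` only fails at run time, and a missing element can go unnoticed. With a typed class, the editor can flag the misspelled attribute and the parser rejects incomplete input up front.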

How do I use it?

It is possible to use xsdata-pydantic and the autogenerated classes directly to parse an XML file, but the package also provides convenience functions, such as chemrxiv_records in the example below, for opening downloaded records.

Example 1: Parse ChemRxiv Data

from pathlib import Path

import requests

from rxiv_types import chemrxiv_records

chemrxiv_url = "https://chemrxiv.org/engage/chemrxiv/public-api/v1/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2000-01-01"

# 1. Get some ChemRxiv data from the API and save it to disk
response = requests.get(chemrxiv_url, timeout=60)
response.raise_for_status()  # stop on an HTTP error instead of saving an error page
destination = Path("downloads/data/chemrxiv.xml")
destination.parent.mkdir(parents=True, exist_ok=True)
destination.write_bytes(response.content)

# 2. Parse the downloaded file into Pydantic models
result = chemrxiv_records(destination)

# 3. Print some information about the first record
first = result.list_records.record[0].metadata.dc
print("Paper 1:")
print(f"Title: {''.join(first.title)}")
print(f"Authors: {'; '.join(first.creator)}")
print(f"Abstract: {''.join(first.description)}")

Output:

Paper 1:
Title: Excitonics: A universal set of binary gates for molecular exciton processing and signaling
Authors: Nicolas, Sawaya; Dmitrij, Rappoport; Daniel, Tabor; Alan, Aspuru-Guzik
Abstract: The ability to regulate energy transfer pathways through materials is an
 important goal of nanotechnology, as a greater degree of control is 
crucial for developing sensing, solar energy, and bioimaging 
applications. Such control necessitates a toolbox of actuation methods 
that can direct energy transfer based on user input. Here we propose a 
novel molecular exciton gate, analogous to a traditional transistor, for
 controlling exciton migration in chromophoric systems. The gate may be 
activated with an input of light or an input flow of excitons. Unlike 
previous gates and switches that control exciton transfer, our proposal 
does not require isomerization or molecular rearrangement, instead 
relying on excitation migration via the second singlet (S2) state of the
 gate molecule--hence the system is named an "S2 exciton gate." After 
presenting a set of system properties required for proper function of 
the S2 exciton gate, we show how one would overcome the two possible 
challenges: short-lived excited states and suppression of false 
positives. Precision and error rates are studied computationally in a 
model system with respect to excited-state decay rates and variations in
 molecular orientation. Finally, we demonstrate that the S2 exciton gate
 gate can be used to produce binary logical AND, OR, and NOT operations,
 providing a universal excitonic computation platform with a range of 
potential applications, including e.g. in signal processing for 
microscopy.
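
The URL in Example 1 is an OAI-PMH ListRecords request. As a hedged sketch, the query string can be built from its parts rather than written by hand. The helper name `list_records_url` is hypothetical and not part of rxiv-types. The `resumption_token` branch follows the OAI-PMH rule that a resumption token is sent on its own, without the other arguments. Whether and how a given server pages its results is an assumption to check against that server.

```python
from typing import Optional
from urllib.parse import urlencode

# Base endpoint taken from Example 1.
BASE_URL = "https://chemrxiv.org/engage/chemrxiv/public-api/v1/oai"


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords URL (hypothetical helper, not part of rxiv-types)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it replaces metadataPrefix and from on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(BASE_URL, from_date="2000-01-01"))
```

Called with `from_date="2000-01-01"`, this reproduces the URL used in Example 1.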

FAQ

Why are the return structures so complicated?

The return structures are a direct reflection of the XML format defined by OAI and any customizations from the hosting preprint servers. In the future some utility classes might be made for common components (title, authors, etc), but for now this is intended to be an unbiased way of parsing the XML.
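
Until such utility classes exist, a small helper of your own can flatten a record into the common fields. The sketch below is an assumption-laden illustration, not part of rxiv-types: the `summarize` function is hypothetical, and it relies only on the attribute path shown in Example 1 (`record.metadata.dc` with list-valued `title`, `creator`, and `description`). A `SimpleNamespace` with made-up values stands in for a parsed record so the snippet runs without downloading anything.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize(record: Any) -> Dict[str, Any]:
    """Flatten one record into a plain dict (hypothetical helper)."""
    dc = record.metadata.dc
    return {
        "title": "".join(dc.title),
        "authors": list(dc.creator),
        "abstract": "".join(dc.description),
    }


# Stand-in object mimicking the attribute layout from Example 1; values are made up.
fake_record = SimpleNamespace(
    metadata=SimpleNamespace(
        dc=SimpleNamespace(
            title=["An example preprint"],
            creator=["Doe, Jane", "Roe, Richard"],
            description=["A short abstract."],
        )
    )
)

summary = summarize(fake_record)
print(summary["title"])
print("; ".join(summary["authors"]))
```

With real data, the same function could be applied to each item in `result.list_records.record` from Example 1.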
