Rxiv XML/JSON parsing and typehints.

rxiv-types (v0.1.0)

Built with: xsdata-pydantic

Introduction

A complete implementation of the XML/JSON schema for *Rxiv preprint servers. This covers arXiv, medRxiv, bioRxiv, ChemRxiv, and DOAJ.

This package parses XML/JSON data into Pydantic models. Doing so validates the input XML and provides type hints for working with the complex structures present in preprint metadata.

Why do I need this?

Parsing XML on its own is challenging. Add the feature-rich data inside each citation, and you can easily spend hours or days navigating the XML structure.

The approach here was to autogenerate Pydantic classes to parse the XML using the xsdata-pydantic tool. This approach has the benefit of making sure every piece of data is parsed properly, and an error is thrown if something is missing or incorrect. Instead of using dictionaries to hold the data, Pydantic classes have the benefit of providing type hints with tab completion for IDEs, making it easier to navigate the complex structure of the citation data.

How do I use it?

It is possible to use xsdata-pydantic and the autogenerated classes directly to parse an XML file, but we provide convenience functions, such as chemrxiv_records in Example 1, to open and parse the data easily.

Example 1: Parse ChemRxiv Data

from pathlib import Path

import requests

from rxiv_types import chemrxiv_records

chemrxiv_url = "https://chemrxiv.org/engage/chemrxiv/public-api/v1/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2000-01-01"

# 1. Get some ChemRxiv data from the API and save it to disk
response = requests.get(chemrxiv_url, timeout=30)
response.raise_for_status()  # fail early rather than saving an HTTP error page
destination = Path("downloads/data/chemrxiv.xml")
destination.parent.mkdir(parents=True, exist_ok=True)
destination.write_bytes(response.content)

# 2. Parse the saved XML into Pydantic models
result = chemrxiv_records(destination)

# 3. Print some information about the first record
# (each Dublin Core field is a list of strings, hence the joins)
first = result.list_records.record[0].metadata.dc
print("Paper 1:")
print(f"Title: {''.join(first.title)}")
print(f"Authors: {'; '.join(first.creator)}")
print(f"Abstract: {''.join(first.description)}")

Output:

Paper 1:
Title: Excitonics: A universal set of binary gates for molecular exciton processing and signaling
Authors: Nicolas, Sawaya; Dmitrij, Rappoport; Daniel, Tabor; Alan, Aspuru-Guzik
Abstract: The ability to regulate energy transfer pathways through materials is an
 important goal of nanotechnology, as a greater degree of control is 
crucial for developing sensing, solar energy, and bioimaging 
applications. Such control necessitates a toolbox of actuation methods 
that can direct energy transfer based on user input. Here we propose a 
novel molecular exciton gate, analogous to a traditional transistor, for
 controlling exciton migration in chromophoric systems. The gate may be 
activated with an input of light or an input flow of excitons. Unlike 
previous gates and switches that control exciton transfer, our proposal 
does not require isomerization or molecular rearrangement, instead 
relying on excitation migration via the second singlet (S2) state of the
 gate molecule--hence the system is named an "S2 exciton gate." After 
presenting a set of system properties required for proper function of 
the S2 exciton gate, we show how one would overcome the two possible 
challenges: short-lived excited states and suppression of false 
positives. Precision and error rates are studied computationally in a 
model system with respect to excited-state decay rates and variations in
 molecular orientation. Finally, we demonstrate that the S2 exciton gate
 gate can be used to produce binary logical AND, OR, and NOT operations,
 providing a universal excitonic computation platform with a range of 
potential applications, including e.g. in signal processing for 
microscopy.
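
Example 1 fetches a single response. OAI-PMH 2.0 lets a server split a ListRecords result into pages and signal the next page with a resumptionToken element. The sketch below shows one way to build the follow-up URL from a raw response using only the standard library. The helper name next_page_url is hypothetical and not part of rxiv-types, and whether a given preprint server pages its results this way is an assumption you should verify against its API.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

# Namespace of the OAI-PMH 2.0 envelope elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_page_url(base_url: str, xml_bytes: bytes) -> Optional[str]:
    """Return the URL of the next ListRecords page, or None on the last page.

    Hypothetical helper for illustration; not provided by rxiv-types.
    """
    root = ET.fromstring(xml_bytes)
    token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    # A missing or empty token means there are no further pages.
    if token is None or not (token.text or "").strip():
        return None
    # A resumptionToken is sent on its own, without metadataPrefix or from.
    query = urlencode({"verb": "ListRecords", "resumptionToken": token.text.strip()})
    return f"{base_url}?{query}"


# Minimal made-up response, used only to exercise the helper offline.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

base = "https://chemrxiv.org/engage/chemrxiv/public-api/v1/oai"
print(next_page_url(base, sample))
```

Each page can then be saved and parsed the same way as in Example 1.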

FAQ

Why are the return structures so complicated?

The return structures are a direct reflection of the XML format defined by OAI and any customizations from the hosting preprint servers. In the future some utility classes might be made for common components (title, authors, etc), but for now this is intended to be an unbiased way of parsing the XML.
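
Until such utility classes exist, a thin wrapper of your own can flatten the fields you care about. The following is a minimal sketch, assuming the attribute path used in Example 1 (metadata.dc.title, creator, description). PaperSummary and summarize are hypothetical names and not part of rxiv-types, and the stand-in record only mimics the shape of a parsed record so the sketch runs without the package or network access.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import List


@dataclass
class PaperSummary:
    """Hypothetical flat view of one record; not part of rxiv-types."""

    title: str
    authors: List[str]
    abstract: str


def summarize(record) -> PaperSummary:
    """Flatten the Dublin Core fields used in Example 1 into one object."""
    dc = record.metadata.dc
    return PaperSummary(
        title="".join(dc.title),
        authors=list(dc.creator),
        abstract="".join(dc.description),
    )


# Stand-in with the same attribute path as a parsed record (made-up values).
fake_record = SimpleNamespace(
    metadata=SimpleNamespace(
        dc=SimpleNamespace(
            title=["An example title"],
            creator=["Doe, Jane", "Roe, Richard"],
            description=["An example abstract."],
        )
    )
)

summary = summarize(fake_record)
print(summary.title)
print("; ".join(summary.authors))
```

With real data you would pass result.list_records.record[0] from Example 1 instead of the stand-in.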
