Skip to main content

Pythonic XML parsing/reading

Project description

dunder_xml_reader

Description/Overview

The dunder_xml_reader package provides classes and functions that aim to make navigating and extracting data from an XML document easier and more "Pythonic". All you have to do is pass a string containing valid XML to the parse_xml() function and it will return a Python object that exposes properties as dictionary keys and child-nodes as attributes. The following sample code illustrates. Consider the following XML:

<cXML payloadID="1233444-2001@premier.workchairs.com" xml:lang="en-CA" timestamp="2000-10-12T18:41:29-08:00">
    <Header>
        <To>
            <Credential domain="AribaNetworkUserId" type="marketplace">
                <Identity>bigadmin@marketplace.org</Identity>
            </Credential>
            <Credential domain="AribaNetworkUserId">
                <Identity>admin@acme.com</Identity>
            </Credential>
        </To>
    </Header>
</cXML>

It can be easily nagivated with the following Python code:

>>> with open('order_request.xml') as infile:
...     xml = dunder_xml_reader.parse_xml(infile.read())
>>>
>>> print(xml['payloadID'])
1233444-2001@premier.workchairs.com
>>>
>>> print(xml.Header.To.Credential[0].Identity.text())
bigadmin@marketplace.org
>>>

As you can see, the property payloadID is accessed as if it were a dictionary key of the xml object. Likewise, the Header child node of the cXML document is accessed as if it were an attribute of the xml object.

The Python object(s) returned by parse_xml() use Python dunder methods to fulfill functionality provided by other Python built-ins:

>>> # Python's "in" operator
>>>
>>> 'payloadID' in xml
True
>>> 'someOtherThing' in xml
False
>>>

>>> # Python's "hasattr" function
>>>
>>> hasattr(xml, 'Header')
True
>> hasattr(xml, 'Footer')
False
>>>

>>> # Python's "len" function
>>>
>>> len(xml.Header.To.Credential)
2
>>>

The Python objects returns by parse_xml() have the following methods:

  • tag() - Returns the XML element tag (i.e. "<cXML>" returns 'cXML')
  • text() - Returns the text between the tag and it's closer (i.e. "<p>blah</p>" returns 'blah')

Nodes that have siblings with the same element tag (i.e. Credential in the above example) are represented as a XmlNodeList, a subclass of Python's built-in list. The XmlNodeList class has a number of extra convenience methods:

Installation

The easiest way to install dunder_xml_reader is to use pip:

$ pip install dunder-xml-reader

Some Examples

The following are some examples of how you can use the Python objects returns by parse_xml():

Getting property values by dereferencing as a dictionary

>>> xml = parse_xml(raw_xml_text)
>>> payload_id = xml['payloadID']
>>> print(payload_id)
1233444-2001@premier.workchairs.com
>>>

You can even check to see if a property exists using Python's built-in in operator

>>> xml = parse_xml(raw_xml_text)
>>> 'payloadID' in xml
True
>>> 'someOtherWeirdThing` in xml
False

Getting siblings by dereferencing as a list

>>> xml = parse_xml(raw_xml_text)
>>> second_credential = xml.Header.To.Credential[1]
>>> print(second_credential.Identity.text())
admin@acme.com
>>>

A nice side effect of being able to dereference an CxmlNode as a list is that you can for-loop over CxmlNodes:

>>> xml = parse_xml(raw_xml_text)
>>> for cred in xml.Header.To.Credential
...   print(cred['domain'], cred.Identity.text())
...
AribaNetworkUserId bigadmin@marketplace.org
AribaNetworkUserId admin@acme.com
>>>

You can also use the built-in len() function with CxmlNodes:

>>> xml = parse_xml(raw_xml_text)
>>> print(len(xml.Header.To.Credential))
2
>>>

Looking up the original element name with the .tag() method

>>> request = parse_xml(raw_xml_text).Request.ConfirmationRequest
>>> print(request.tag())
ConfirmationRequest
>>>

Getting the inner-text of the element with the .text() method

>>> xml = parse_xml(raw_xml_text)
>>> node = xml.Header.From.Credential.Identity
>>> print(node.text())
942888711
>>>

Getting the first of a list of siblings with the .first() method

>>> xml = parse_xml(raw_xml_text)
>>> node = xml.Header.To.Credential.first()
>>> print node['type']
marketplace
>>>

Getting the last of a list of siblings with the .last() method

>>> xml = parse_xml(raw_xml_text)
>>> print(xml.Header.To.Credential.last().Identity.text())
admin@acme.com
>>>

Filtering nodes with the .filter() method

>>> extrinsics = parse_xml(raw_xml_text).Request.OrderRequest.OrderRequestHeader.Extrinsic
>>> user_nodes = extrinsics.filter(lambda n: n['name'].startswith('User'))
>>> print(len(user_nodes), len(extrinsics))
5 12
>>>

Simpler filtering nodes with the .filter_prop() method

>>> extrinsics = parse_xml(raw_xml_text).Request.OrderRequest.OrderRequestHeader.Extrinsic
>>> user_nodes = extrinsics.filter_prop('name', 'User', startswith_predicate)
>>> print(len(user_nodes), len(extrinsics))
5 12
>>>

Mapping nodes to something else with a lambda expression

>>> line_item = parse_xml(raw_xml_text).Request.OrderRequest.ItemOut[0]
>>> taxes = line_item.tax.map(lambda n: float(n.TaxDetail.TaxAmount.text()))
>>> print(taxes)
[0.805063025, 0.0, 3.703289915]
>>>

Simpler mapping with the .map_attr() method

>>> from_header = parse_xml(raw_xml_text).Header.From
>>> identity_nodes = from_header.Credential.map_attr('Identity')
>>> print(identity_nodes[1].text())
test_customer
>>>

Simpler mapping with the .map_prop() method

>>> from_header = parse_xml(raw_xml_text).Header.From
>>> credential_domains = from_header.Credential.map_prop('domain')
>>> print(credential_domains)
['DUNS', 'CompanyName', 'InteropKey']
>>>

Simpler mapping with the .map_text() method

>>> bill_to = parse_xml(raw_xml_text).Request.OrderRequest.OrderRequestHeader.BillTo
>>> street = bill_to.Address.PostalAddress.Street.map_text()
>>> print(street)
['1242 West Main Street', 'Suite 6a']
>>>

String joining with the .join_text() method

>>> address = parse_xml(raw_xml).Request.OrderRequest.OrderRequestHeader.BillTo.Address
>>> print(address.PostalAddress.Street.join_text("\n"))
1231 West Main Street
Suite 6a
>>>

String joining with the .join_prop() method

>>> xml = parse_xml(raw_xml_text)
>>> print(xml.Header.From.Credential.join_prop('domain'))
DUNS, CompanyName, InteropKey
>>>

XML with namespaces

The dunder_xml_reader package can handle XML with varying defined namespaces. It does this by optionally replacing namespaces in tags with an alias prefix. Take the following XML for example:

<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <ReceiveClaimResponse xmlns="http://insurance.com/webservices/ReceiveClaim.job">
      <ClaimResponse>
        <status>S</status>
        <errorDetail/>
        <claimNumber>LC0000001742</claimNumber>
      </ClaimResponse>
    </ReceiveClaimResponse>
  </soapenv:Body>
</soapenv:Envelope>

Using the namespaces argument to parse_xml(), you can have the attributes prefixed with an alias:

>>> soap = parse_xml(raw_soap_text, namespaces={
...   'http://schemas.xmlsoap.org/soap/envelope/': 'soap',
...   'http://insurance.com/webservices/ReceiveClaim.job': 'claim'
... }
>>> soap.soap_Body.claim_ReceiveClaimResponse.claim_ClaimResponse.claim_status.text()
S
>>>

If you don't want any prefixes and you just want to strip out the namespace references you can either pass a set of namespace strings to parse_xml():

>>> soap = parse_xml(raw_soap_text, namespaces={
...   'http://schemas.xmlsoap.org/soap/envelope/',
...   'http://insurance.com/webservices/ReceiveClaim.job'
... }
>>> soap.Body.ReceiveClaimResponse.ClaimResponse.status.text()
S
>>>

Bonus: SafeReference

SafeReference is a wrapper that you can wrap a xmlNode instance in (or any other Python object graph for that matter) that will prevent errors like:

TypeError: 'NoneType' object is not subscriptable
AttributeError: 'NoneType' object has no attribute 'blah'

A quick example. Say you want to issue the following line of code:

>>> x = xml.Request.OrderRequest.OrderRequestHeader.BillTo.Address.PostalAddress.City.text()

But, the XML that was parsed in order to create the xml object doesn't have a BillTo element. You'd expect to get the following exception:

AttributeError: 'OrderRequestHeader' object has no attribute 'BillTo'

Now you could test your way through the graph heirchy (aka look-before-you-leap):

>>> if hasattr(xml, 'Request') and \
...    hasattr(xml.Request, 'OrderRequest') and \
...    hasattr(xml.Request.OrderRequest, 'OrderRequestHeader') and \
...    hasattr(xml.Request.OrderRequest.OrderRequestHeader, 'BillTo'):
...   x = xml.Request.OrderRequest.OrderRequestHeader.BillTo.Address.PostalAddress.City.text()
... else:
...   x = ''

Or, you could just wrap your assignment in a try/except (aka ask-for-forgiveness):

>>> try:
...   x = xml.Request.OrderRequest.OrderRequestHeader.BillTo.Address.PostalAddress.City.text()
... except AttributeError:
...   x = ''

But in both cases you're writing a lot of "check to see if it's really there"-type code. The goal of SafeReference is to prevent that code from needing to be written in the first place. If your object-graph is wrapped in a SafeReference instance, it will check at each level for the requested attribute to be there or missing. If it's there, it will return the requested attribute. However, if it's not there, it will return a default-value. Here's an example:

>>> xml = parse_xml(raw_xml_text_thats_missing_a_BillTo_element)
>>> safe_xml = safe_reference(xml, 'n/a')
>>> x = safe_xml.Request.OrderRequest.OrderRequestHeader.BillTo.Address.PostalAddress.City.text()
>>> print(x)
n/a
>>>

This result of n/a will be the result no matter what item is missing (Request, OrderRequest, OrderRequestHeader, BillTo, etc.) along the way. If any of the objects in that long dereferencing are missing, the default will be returned. However, if nothing is missing, the actual graph value will be returned.

>>> xml = parse_xml(raw_xml_text_with_nothing_missing)
>>> safe_xml = safe_reference(xml, 'n/a')
>>> x = safe_xml.Request.OrderRequest.OrderRequestHeader.BillTo.Address.PostalAddress.City.text()
>>> print(x)
New York
>>>

Bonus 2: extract_namespaces()

If you have a chunk of XML or a SOAP message that conains numerous namespace references, figuring our what the namespaces are (so you can pass them as an argument to the parse_xml()), you can use the extract_namespaces() function:

>>> from dunder_xml_reader import extract_namespaces
>>> xml = """
...     <section xmlns="http://www.ibm.com/events" xmlns:bk="urn:loc.gov:books" xmlns:pi="urn:personalInformation"
...              xmlns:isbn='urn:ISBN:0-395-36341-6'>
...         <title>Book-Signing Event</title>
...         <signing>
...             <bk:author pi:title="Mr" pi:name="Jim Ross"/>
...             <book bk:title="Writing COBOL for Fun and Profit" isbn:number="0426070806"/>
...             <comment xmlns=''>What a great issue!</comment>
...         </signing>
...     </section>
...  """
...
>>> ns = extract_namespaces(xml)
>>> print(ns)
{(None, 'http://www.ibm.com/events'), ('bk', 'urn:loc.gov:books'), ('pi', 'urn:personalInformation'), ('isbn', 'urn:ISBN:0-395-36341-6') }
>>>

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dunder_xml_reader-0.1.1.tar.gz (12.2 kB view details)

Uploaded Source

Built Distribution

dunder_xml_reader-0.1.1-py3-none-any.whl (10.0 kB view details)

Uploaded Python 3

File details

Details for the file dunder_xml_reader-0.1.1.tar.gz.

File metadata

  • Download URL: dunder_xml_reader-0.1.1.tar.gz
  • Upload date:
  • Size: 12.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.8.18 Linux/6.2.0-1018-azure

File hashes

Hashes for dunder_xml_reader-0.1.1.tar.gz
Algorithm Hash digest
SHA256 688fd2d240443505e3be46f523abe1f6ad5d179227eed35a972782df1cc6fb02
MD5 0bc22d2de8fe558acf8a72ddb81e5ef5
BLAKE2b-256 055be21bb8c35599cc4784fdbd31a8ed43edd5e47d3844da1907972060b56e65

See more details on using hashes here.

File details

Details for the file dunder_xml_reader-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: dunder_xml_reader-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 10.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.8.18 Linux/6.2.0-1018-azure

File hashes

Hashes for dunder_xml_reader-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f38d738de4991e4c12924f4a19465ba3a9f11df74b397c6fa4c86afc21b0b8b5
MD5 c6af272f457ff8c448f9b65abe3ece70
BLAKE2b-256 01dffb8f3f8d4d7f17d9bf98d67ad6c0398353774f8a6db1746a8d5f4d488fc3

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page