Python library that allows customized parsing of XML files using a set of configurations. The output is a dictionary. This library builds on the xml2dict library.

Custom XML to Dict Parser

Overview

This package parses XML files. It uses the xml2dict package to read an XML file in raw form and return its data as a Python dictionary, and builds on that to let you tailor which information is returned. In other words, with a configuration file, you can return specific data from the XML file in a specific format.

Library Installation

To install the library, run the following command in a terminal (cmd, shell, etc.):

# It's recommended to create a virtual environment

# Windows
pip install CustomXMLParser

# Linux
pip3 install CustomXMLParser

Library usage

Example usage

If you wish to read the XML file as is and simply convert it to a Python dictionary, do the following:

from CustomXMLParser import XmlParser

xml_parser = XmlParser(parser_type='raw')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)

If you wish to read specific portions of the XML file and format them in a particular way, do the following:

from CustomXMLParser import XmlParser

config_file = 'path_to_config_file'
xml_parser = XmlParser(config_file=config_file, parser_type='custom')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)

Note that the XmlParser class uses the following default XML attributes:

'''
name_key (str, optional): custom/XML configuration parameter; the name of the primary tag. Defaults to "@name".
table_key (str, optional): custom/XML configuration parameter; the table identifier. Defaults to "th".
header_key (str, optional): custom/XML configuration parameter; the header identifier. Defaults to "header".
data_key (str, optional): custom/XML configuration parameter; the data identifier. Defaults to "rows".
header_text_key (str, optional): custom/XML configuration parameter; the table's key identifier. Defaults to "#text".
'''

You can override those attributes by passing them to the constructor of the XmlParser class as follows:

from CustomXMLParser import XmlParser

config_file = 'path_to_config_file'
xml_parser = XmlParser(config_file=config_file, parser_type='custom', encoding='utf-8',
                       name_key='<desired_name_key>', table_key='<desired_table_key>', header_key='<desired_header_key>',
                       data_key='<desired_data_key>', header_text_key='<desired_header_text_key>')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)

Config file

Below is an example configuration for custom parsing of XML.

{
  "TREE":{
    "TABLE_A": {},
    "TABLE_B": {"TABLE_C": {}}
  },

  "TABLE_A":
    [
      "element0_tag,element0_name",
      "element1_tag,element1_name"
    ],
  "TABLE_B":
    [
      "element0_tag*,element1_tag*,element2_tag,element2_name"
    ],
  "TABLE_C":
    [
      "element0_tag,element0_name"
    ]
}

General Rules

  • Capitalize all dictionary keys.
  • * is wildcard notation: a tag followed by * returns data for all available elements with that tag.
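The capitalization rule can be checked before a config is handed to the parser. The sketch below assumes the config file is JSON, as the example above suggests. The helper `non_uppercase_keys` is hypothetical and not part of CustomXMLParser.

```python
import json

# The example configuration from above, written as a Python dictionary.
config = {
    "TREE": {"TABLE_A": {}, "TABLE_B": {"TABLE_C": {}}},
    "TABLE_A": ["element0_tag,element0_name", "element1_tag,element1_name"],
    "TABLE_B": ["element0_tag*,element1_tag*,element2_tag,element2_name"],
    "TABLE_C": ["element0_tag,element0_name"],
}


def non_uppercase_keys(node):
    """Hypothetical helper: collect every dictionary key, at any depth,
    that is not fully upper-case."""
    bad = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key != key.upper():
                bad.append(key)
            bad.extend(non_uppercase_keys(value))
    return bad


bad_keys = non_uppercase_keys(config)  # empty list for the example config

# Serialize to JSON text; this string could be saved as the config file.
config_text = json.dumps(config, indent=2)
```

Only dictionary keys are checked; the path strings inside the lists are left alone, since they hold XML tag names rather than table names.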

Tree structure

The structure can be flat or nested. To return child data for a particular parent, include the child as a value of the parent. For example, parent TABLE_B has child TABLE_C. If TABLE_C had a child of its own, it would be added to TABLE_C in the same way.
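To make the nesting concrete, the following sketch walks a TREE dictionary and lists each table with its parent. The function `parent_child_pairs` is a hypothetical illustration; it does not show how the library itself traverses the tree.

```python
def parent_child_pairs(tree, parent=None):
    """Hypothetical helper: return (parent, table) pairs for a nested TREE.

    Top-level tables have parent None.
    """
    pairs = []
    for name, children in tree.items():
        pairs.append((parent, name))
        # Recurse into the children, which are nested under this table.
        pairs.extend(parent_child_pairs(children, parent=name))
    return pairs


tree = {"TABLE_A": {}, "TABLE_B": {"TABLE_C": {}}}
pairs = parent_child_pairs(tree)
# pairs == [(None, "TABLE_A"), (None, "TABLE_B"), ("TABLE_B", "TABLE_C")]
```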

Data structure

Let's make some assumptions about elements to make this example easy to follow.

  • For TABLE_A, assume element0_tag and element1_tag map to table, element0_name to info, and element1_name to metadata.
  • For TABLE_B, assume element0_tag maps to container, element1_tag to node, element2_tag to table, and element2_name to info.
  • For TABLE_C, assume element0_tag maps to table and element0_name to images.

In the above config example, we are interested in returning data for TABLE_A, TABLE_B, and TABLE_C. Each key requires a path, or a list of paths (xpath), that is used to retrieve data from the XML file. For example:

  • TABLE_A has two paths, ["table,info", "table,metadata"]: data under the info and metadata tables will be returned and stored in TABLE_A.
  • TABLE_B has a single path, ["container*,node*,table,info"]: data under the info table for all nodes and all containers will be returned and stored in TABLE_B.
  • TABLE_C has a single path, ["table,images"]: data under the images table for all parent nodes and containers will be returned and stored in TABLE_C.
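As a rough illustration of how such a path string reads, the sketch below splits a comma-separated path into steps and flags the wildcard ones. The function `split_path` is hypothetical and only mirrors the notation described here; the library's actual parsing may differ.

```python
def split_path(path):
    """Hypothetical helper: split a config path into (tag, is_wildcard) steps."""
    steps = []
    for token in path.split(","):
        token = token.strip()
        # A trailing * marks the step as "all available elements with this tag".
        steps.append((token.rstrip("*"), token.endswith("*")))
    return steps


# The TABLE_B path under the assumptions listed above.
steps = split_path("container*,node*,table,info")
# steps == [("container", True), ("node", True), ("table", False), ("info", False)]
```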

Notice that the full path isn't required for TABLE_C. The greatest common factor (GCF) between the child TABLE_C and the parent TABLE_B, that is, the portion of the path they share, only needs to be specified in the parent table. Since TABLE_C is a child of TABLE_B, it falls under the same path; it breaks away at "table,images", which is why that is the only path specified for it. In other words, since TABLE_C is a child of TABLE_B, all TABLE_B rules carry over to TABLE_C.
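One way to read this rule is that the child's path is appended to the parent's path at the point where the two diverge. The sketch below works under that assumption. `effective_child_path` is a hypothetical helper, not a CustomXMLParser function, and the fallback branch is likewise an assumption.

```python
def effective_child_path(parent_path, child_path):
    """Hypothetical helper: combine a parent path with a child's short path.

    Assumption: the child path starts at a tag that also appears in the
    parent path, and everything before that tag is inherited from the parent.
    """
    parent_steps = [step.strip() for step in parent_path.split(",")]
    child_steps = [step.strip() for step in child_path.split(",")]
    first_child_tag = child_steps[0].rstrip("*")
    for index, step in enumerate(parent_steps):
        if step.rstrip("*") == first_child_tag:
            # Keep the parent's steps before the shared tag, then the child's.
            return ",".join(parent_steps[:index] + child_steps)
    # Assumed fallback: no shared tag, so the child path is left as written.
    return ",".join(child_steps)


full_path = effective_child_path("container*,node*,table,info", "table,images")
# full_path == "container*,node*,table,images"
```

Under this reading, TABLE_C inherits the container and node wildcards from TABLE_B and only needs to state "table,images" itself.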


Author: Hamdan, Muhammad (@mhamdan91 - ©)
