Python library that allows customized parsing of XML files using a set of configurations. The output is a dictionary. This library builds on the xml2dict library.

Custom XML to Dict Parser

Overview

This package parses XML files. It uses the xml2dict package to parse an XML file in raw form into a Python dictionary, then builds on that result to let you tailor which information is returned. In other words, with a configuration file you can extract specific data from the XML file in a specific format.

Library Installation

To install the library, run the following command in a command prompt or shell:

# It's recommended to create a virtual environment

# Windows
pip install CustomXMLParser

# Linux
pip3 install CustomXMLParser

Library usage

Example usage

If you wish to read the XML file as is and simply convert it to a python dictionary, then do the following:

from xmlparser import XmlParser

xml_parser = XmlParser(parser_type='raw')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)

If you wish to read specific portions of the XML file and format them in a particular way, then do the following:

from xmlparser import XmlParser

config_file = 'path_to_config_file'
xml_parser = XmlParser(config_file=config_file, parser_type='custom')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)

Note that the XmlParser class uses the following default XML attribute keys:

'''
name_key (str, optional): custom XML configuration parameter; the name of the primary tag. Defaults to "@name".
table_key (str, optional): custom XML configuration parameter; the table identifier. Defaults to "th".
header_key (str, optional): custom XML configuration parameter; the header identifier. Defaults to "header".
data_key (str, optional): custom XML configuration parameter; the data identifier. Defaults to "rows".
header_text_key (str, optional): custom XML configuration parameter; the table's key identifier. Defaults to "#text".
'''

You can override those attributes by passing them to the constructor of the XmlParser class as follows:

from xmlparser import XmlParser

config_file = 'path_to_config_file'
xml_parser = XmlParser(config_file=config_file, parser_type='custom', encoding='utf-8',
                       name_key='<desired_name_key>', table_key='<desired_table_key>', header_key='<desired_header_key>',
                       data_key='<desired_data_key>', header_text_key='<desired_header_text_key>')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)

Config file

Below is an example configuration for custom parsing of XML.

{
  "TREE":{
    "TABLE_A": {},
    "TABLE_B": {"TABLE_C": {}}
  },

  "TABLE_A":
    [
      "element0_tag,element0_name",
      "element1_tag,element1_name"
    ],
  "TABLE_B":
    [
      "element0_tag*,element1_tag*,element2_tag,element2_name"
    ],
  "TABLE_C":
    [
      "element0_tag,element0_name"
    ]
}
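As a rough sketch, the configuration above could be written to disk from Python and the resulting path passed as config_file. This assumes the config file is plain JSON, which the example suggests but the text does not state; the temporary directory and the file name config.json are placeholders.

```python
import json
import os
import tempfile

# The example configuration from above, expressed as a Python dict.
config = {
    "TREE": {"TABLE_A": {}, "TABLE_B": {"TABLE_C": {}}},
    "TABLE_A": ["element0_tag,element0_name", "element1_tag,element1_name"],
    "TABLE_B": ["element0_tag*,element1_tag*,element2_tag,element2_name"],
    "TABLE_C": ["element0_tag,element0_name"],
}

# Assumption: the parser reads its configuration from a JSON file.
config_file = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_file, "w", encoding="utf-8") as fh:
    json.dump(config, fh, indent=2)

# Reload the file to confirm it is valid JSON and round-trips unchanged.
with open(config_file, encoding="utf-8") as fh:
    loaded = json.load(fh)
print(loaded == config)
```

The resulting config_file path would then be used as in the custom-parsing example under "Example usage".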

General Rules

  • Capitalize all dictionary keys.
  • * is wildcard notation: returns data for all available elements
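A configuration can be sanity-checked before it is handed to the parser. The sketch below is not part of the library; check_config and tree_tables are hypothetical helpers, and "capitalized" is interpreted here as fully upper-case, which is an assumption based on the example keys.

```python
def tree_tables(tree):
    """Yield every table name found in a (possibly nested) TREE mapping."""
    for name, children in tree.items():
        yield name
        yield from tree_tables(children)


def check_config(cfg):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Rule: all dictionary keys are capitalized (assumed to mean upper-case).
    for key in cfg:
        if key != key.upper():
            problems.append(f"key {key!r} is not capitalized")
    # Every table named in TREE should have a non-empty list of paths.
    for table in tree_tables(cfg.get("TREE", {})):
        paths = cfg.get(table)
        if not isinstance(paths, list) or not paths:
            problems.append(f"table {table!r} has no list of paths")
    return problems


config = {
    "TREE": {"TABLE_A": {}, "TABLE_B": {"TABLE_C": {}}},
    "TABLE_A": ["element0_tag,element0_name", "element1_tag,element1_name"],
    "TABLE_B": ["element0_tag*,element1_tag*,element2_tag,element2_name"],
    "TABLE_C": ["element0_tag,element0_name"],
}
print(check_config(config))
```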

Tree structure

The structure can be flat or nested. If you wish to return child data for a particular parent, then you have to include the child as a value for the parent. For example, parent TABLE_B has child TABLE_C. If TABLE_C has a child of its own, then we add it to TABLE_C in the same way.
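To make the nesting concrete, the following sketch walks a TREE mapping and lists each table with its parent. parent_child_pairs is a hypothetical helper, and TABLE_D is an invented extra level added only to show deeper nesting.

```python
def parent_child_pairs(tree, parent=None):
    """Return (parent, child) pairs for a nested TREE mapping, depth first."""
    pairs = []
    for name, children in tree.items():
        pairs.append((parent, name))
        pairs.extend(parent_child_pairs(children, parent=name))
    return pairs


# TABLE_D is hypothetical: a child added under TABLE_C in the same way.
tree = {"TABLE_A": {}, "TABLE_B": {"TABLE_C": {"TABLE_D": {}}}}
pairs = parent_child_pairs(tree)
for parent, child in pairs:
    print(f"{child}: parent={parent}")
```

Top-level tables are reported with a parent of None.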

Data structure

Let's make some assumptions about elements to make this example easy to follow.

  • For TABLE_A, assume element0_tag and element1_tag map to table, element0_name to info, and element1_name to metadata.
  • For TABLE_B, assume element0_tag maps to container, element1_tag to node, and element2_tag to table, and element2_name to info.
  • For TABLE_C, assume element0_tag maps to table and element0_name to images

In the above config example, we are interested in returning data for TABLE_A, TABLE_B, and TABLE_C. Each key requires a path or a list of paths (xpath) in order to retrieve data from the XML file. For example:

  • TABLE_A has two paths ["table,info", "table,metadata"]; data under the info and metadata tables will be returned and stored in TABLE_A.
  • TABLE_B has a single path ["container*,node*,table,info"]; data under the info table for all nodes and all containers will be returned and stored in TABLE_B.
  • TABLE_C has a single path ["table,images"]; data under the images table for all parent nodes and containers will be returned and stored in TABLE_C.

Notice that the full path isn't required for TABLE_C: the GCF (greatest common factor), i.e. the path portion shared by the child TABLE_C and the parent TABLE_B, only needs to be specified in the parent table. Since TABLE_C is a child of TABLE_B, it falls under the same path and breaks away only at "table,images", which is why that is the only path specified for it. In other words, all TABLE_B rules carry over to TABLE_C.
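The wildcard and inheritance rules can be illustrated with the standard library alone. This is only a sketch of the assumed semantics, not the library's implementation: the XML layout, the resolve helper, the use of a name attribute to identify tables, and the way the shared prefix is derived from the wildcard segments are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML laid out to match the element assumptions above.
XML = """
<root>
  <container name="c0">
    <node name="n0">
      <table name="info"><row>a</row></table>
      <table name="images"><row>img0</row></table>
    </node>
    <node name="n1">
      <table name="info"><row>b</row></table>
      <table name="images"><row>img1</row></table>
    </node>
  </container>
</root>
"""


def resolve(element, path):
    """Return the elements matched by a path such as "container*,node*,table,info".

    Assumed semantics: "tag*" matches every direct child with that tag, and a
    "tag,name" pair matches direct children whose name attribute equals name.
    """
    parts = path.split(",")
    current = [element]
    i = 0
    while i < len(parts):
        part = parts[i]
        if part.endswith("*"):
            tag = part[:-1]
            current = [c for el in current for c in el.findall(tag)]
            i += 1
        else:
            tag, name = part, parts[i + 1]
            current = [c for el in current for c in el.findall(tag)
                       if c.get("name") == name]
            i += 2
    return current


root = ET.fromstring(XML)
parent_path = "container*,node*,table,info"  # TABLE_B
child_path = "table,images"                  # TABLE_C, relative to TABLE_B

# Assumption: the wildcard segments of the parent form the shared prefix.
shared_prefix = ",".join(p for p in parent_path.split(",") if p.endswith("*"))
full_child_path = f"{shared_prefix},{child_path}"

table_b = [t.find("row").text for t in resolve(root, parent_path)]
table_c = [t.find("row").text for t in resolve(root, full_child_path)]
print(table_b)
print(table_c)
```

Under these assumptions the child only needs to state "table,images", because the container and node wildcards come from the parent.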


Author: Hamdan, Muhammad (@mhamdan91 - ©)
