Python library that allows customized parsing of XML files using a set of configurations. The output is a dictionary. This library builds on the xml2dict library.
Custom XML to Dict Parser
Overview
This package parses XML files. It uses the xml2dict package to parse an XML file in raw form into a Python dictionary, and builds on that to let you tailor what information is returned from the XML file. In other words, with a configuration file, you can return specific data from the XML file in a specific format.
Library Installation
To install the library, run the following command in a command prompt or shell:
# It's recommended to create a virtual environment
# Windows
pip install CustomXMLParser
# Linux
pip3 install CustomXMLParser
Library usage
Example usage
If you wish to read the XML file as is and simply convert it to a python dictionary, then do the following:
from CustomXMLParser import XmlParser
xml_parser = XmlParser(parser_type='raw')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)
If you wish to read specific portions of the XML file and format them in a particular way, then do the following:
from CustomXMLParser import XmlParser
config_file = 'path_to_config_file'
xml_parser = XmlParser(config_file=config_file, parser_type='custom')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)
Note that the XmlParser class uses the following default XML attributes:
'''
name_key (str, optional): this is a custom/xml configuration parameter, and it is the name of primary tag. Defaults to "@name".
table_key (str, optional): this is a custom/xml configuration parameter, and it is the table identifier. Defaults to "th".
header_key (str, optional): this is a custom/xml configuration parameter, and it is the header identifier. Defaults to 'header'.
data_key (str, optional): this is a custom/xml configuration parameter, and it is the data identifier. Defaults to "rows".
header_text_key (str, optional): this is a custom/xml configuration parameter, and it is the table's key identifier. Defaults to "#text".
'''
You can override these attributes by passing them to the constructor of the XmlParser class as follows:
from CustomXMLParser import XmlParser
config_file = 'path_to_config_file'
xml_parser = XmlParser(config_file=config_file, parser_type='custom', encoding='utf-8',
name_key='<desired_name_key>', table_key='<desired_table_key>', header_key='<desired_header_key>',
data_key='<desired_data_key>', header_text_key='<desired_header_text_key>')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)
Config file
Below is an example configuration for custom XML parsing.
{
"TREE":{
"TABLE_A": {},
"TABLE_B": {"TABLE_C": {}}
},
"TABLE_A":
[
"element0_tag,element0_name",
"element1_tag,element1_name"
],
"TABLE_B":
[
"element0_tag*,element1_tag*,element2_tag,element2_name"
],
"TABLE_C":
[
"element0_tag,element0_name"
]
}
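As a minimal sketch, the configuration above can be built as a Python dictionary and written to disk before being passed as config_file. This assumes the library reads the configuration as a JSON file, which the example syntax suggests but the text does not state; the file name xml_config.json is hypothetical.

```python
import json
import os
import tempfile

# Same structure as the example configuration above.
config = {
    "TREE": {"TABLE_A": {}, "TABLE_B": {"TABLE_C": {}}},
    "TABLE_A": ["element0_tag,element0_name", "element1_tag,element1_name"],
    "TABLE_B": ["element0_tag*,element1_tag*,element2_tag,element2_name"],
    "TABLE_C": ["element0_tag,element0_name"],
}

# Assumption: the configuration file is plain JSON.
# "xml_config.json" is a hypothetical file name.
config_file = os.path.join(tempfile.mkdtemp(), "xml_config.json")
with open(config_file, "w", encoding="utf-8") as handle:
    json.dump(config, handle, indent=2)

# Read the file back to confirm it is valid JSON with the expected content.
with open(config_file, encoding="utf-8") as handle:
    loaded = json.load(handle)
```

The resulting config_file path could then be handed to XmlParser(config_file=config_file, parser_type='custom') as in the usage examples.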
General Rules
- Capitalize all dictionary keys.
- * is wildcard notation: returns data for all available elements
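The two rules above can be checked before handing a configuration to the parser. The helpers below are a hypothetical sketch, not part of CustomXMLParser: parse_path splits a comma-separated path into steps and flags the ones carrying the * wildcard, and non_uppercase_keys lists top-level keys that break the capitalization rule.

```python
def parse_path(path):
    """Split a comma-separated path into (name, is_wildcard) steps."""
    steps = []
    for part in path.split(","):
        part = part.strip()
        # A trailing "*" marks the step as a wildcard.
        steps.append((part.rstrip("*"), part.endswith("*")))
    return steps


def non_uppercase_keys(config):
    """Return the top-level keys that are not fully capitalized."""
    return [key for key in config if key != key.upper()]


steps = parse_path("element0_tag*,element1_tag*,element2_tag,element2_name")
bad_keys = non_uppercase_keys({"TREE": {}, "TABLE_A": [], "Table_b": []})
```

Here steps holds four entries, the first two flagged as wildcards, and bad_keys contains only "Table_b".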
Tree structure
The structure can be flat or nested. If you wish to return child data for a particular parent, you have to include the child as a value of the parent. For example, parent TABLE_B has child TABLE_C. If TABLE_C has a child of its own, it is added to TABLE_C in the same way.
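To make the nesting concrete, the following sketch walks the TREE entry from the example configuration and lists each table with its parent. The walk_tree helper is hypothetical and only illustrates how the nested dictionary encodes parent and child relations; it says nothing about how the library traverses the tree internally.

```python
def walk_tree(tree, parent=None):
    """Yield (parent, table) pairs for every table in a nested TREE dict."""
    for name, children in tree.items():
        yield parent, name
        # Children are nested dictionaries of the same shape.
        yield from walk_tree(children, parent=name)


tree = {"TABLE_A": {}, "TABLE_B": {"TABLE_C": {}}}
pairs = list(walk_tree(tree))
```

For the example tree, pairs lists TABLE_A and TABLE_B with no parent, followed by TABLE_C with parent TABLE_B.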
Data structure
Let's make some assumptions about elements to make this example easy to follow.
- For TABLE_A, assume element0_tag and element1_tag map to table, element0_name to info, and element1_name to metadata.
- For TABLE_B, assume element0_tag maps to container, element1_tag to node, element2_tag to table, and element2_name to info.
- For TABLE_C, assume element0_tag maps to table and element0_name to images.
In the above config example, we are interested in returning data for TABLE_A, TABLE_B, and TABLE_C. Each key requires a path or a list of paths (xpath) in order to retrieve data from the XML file. For example:
- TABLE_A has two paths ["table,info", "table,metadata"]; data under the info and metadata tables will be returned and stored in TABLE_A.
- TABLE_B has a single path ["container*,node*,table,info"]; data under the info table for all nodes and all containers will be returned and stored in TABLE_B.
- TABLE_C has a single path ["table,images"]; data under the images table for all parent nodes and containers will be returned and stored in TABLE_C.
Notice that the full path isn't required for TABLE_C: the part of the path shared by the child TABLE_C and the parent TABLE_B (their greatest common factor, GCF) only needs to be specified in the parent table. Since TABLE_C is a child of TABLE_B, it falls under the same path but breaks away at "table,images", which is why that is the only path specified for it. In other words, since TABLE_C is a child of TABLE_B, all TABLE_B rules carry over to TABLE_C.
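One way to picture this carry-over is to combine the parent's leading wildcard steps with the child's own path. The sketch below uses the element names assumed earlier (TABLE_B ending in table,info and TABLE_C given as table,images). Both helpers are hypothetical, and the rule that the shared prefix is exactly the run of leading wildcard steps is an assumption made for illustration, not a description of the library's internals.

```python
def leading_wildcards(path):
    """Return the leading steps of a path that end with the * wildcard."""
    prefix = []
    for step in path.split(","):
        if not step.endswith("*"):
            break
        prefix.append(step)
    return prefix


def effective_child_path(parent_path, child_path):
    """Prepend the parent's wildcard prefix to the child's own path.

    Assumption: the part shared between parent and child is the parent's
    run of leading wildcard steps.
    """
    return ",".join(leading_wildcards(parent_path) + child_path.split(","))


effective = effective_child_path("container*,node*,table,info", "table,images")
```

Under that assumption, effective is "container*,node*,table,images", which is the full path TABLE_C would otherwise have to spell out.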
Author: Hamdan, Muhammad (@mhamdan91 - ©)