Python library that allows customized parsing of XML files using a set of configurations. The output is a dictionary. This library builds on the xml2dict library.
Custom XML to Dict Parser
Overview
This package allows you to parse XML files. It uses the xml2dict package to parse an XML file in raw format and return the data as a Python dictionary, then builds on that to let you tailor which information is returned from the XML file. In other words, with a configuration file, you can return specific data from the XML file in a specific format.
Library Installation
To install the library, run the following command in a command prompt or shell:
# It's recommended to create a virtual environment
# Windows
pip install CustomXMLParser
# Linux
pip3 install CustomXMLParser
Library usage
Example usage
If you wish to read the XML file as is and simply convert it to a python dictionary, then do the following:
from CustomXMLParser import XmlParser
xml_parser = XmlParser(parser_type='raw')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)
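To give a rough idea of what a "raw" dictionary can look like, here is a minimal sketch that uses only the standard library. It is not this library's implementation: `element_to_dict` is a hypothetical helper, and the `@` prefix for attributes and the `#text` key are assumptions borrowed from the default key names listed further down.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an Element into nested dicts (illustrative only)."""
    # Attributes become keys prefixed with "@" (assumed convention).
    node = {f"@{name}": value for name, value in element.attrib.items()}
    for child in element:
        child_value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(child_value)
        else:
            node[child.tag] = child_value
    text = (element.text or "").strip()
    if text:
        if not node:
            return text  # leaf element: just return its text
        node["#text"] = text
    return node or None


xml_string = '<root><table name="info"><row>1</row><row>2</row></table></root>'
root = ET.fromstring(xml_string)
raw_dict = {root.tag: element_to_dict(root)}
```

The nesting of `raw_dict` mirrors the nesting of the XML elements, which is the structure you need to preserve if you later dump the dictionary back to XML.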
If you wish to dump a dict to an XML file or a string, then do the following:
from CustomXMLParser import XmlParser
xml_parser = XmlParser(parser_type='raw')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)
my_dict = manipulate(xml_dict)  # manipulate() stands for your own function; it MUST maintain the dict structure
out_xml_file = 'path_to_out_xml_file'
xml_parser.dump(out_xml_file, my_dict, pretty=True)  # Dumps the dict to an XML file
my_xml_string = xml_parser.dumps(my_dict, pretty=True)  # Dumps the dict to a string
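The `manipulate` call above is a placeholder for your own code. As one possible sketch of a manipulation that keeps the structure intact, the hypothetical function below rebuilds the same nesting of dicts and lists and only changes the string values at the leaves. It assumes the raw dict consists of nested dicts, lists, and strings.

```python
def manipulate(node):
    """Return a copy of a nested dict/list with surrounding whitespace
    stripped from every string leaf; keys and nesting are unchanged."""
    if isinstance(node, dict):
        return {key: manipulate(value) for key, value in node.items()}
    if isinstance(node, list):
        return [manipulate(item) for item in node]
    if isinstance(node, str):
        return node.strip()
    return node  # None, numbers, etc. are passed through untouched


example = {"root": {"table": {"@name": " info ", "row": [" 1", "2 "]}}}
cleaned = manipulate(example)
```

Because every dict and list is rebuilt with the same keys and order, the result can be dumped in the same way as the original dictionary.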
If you wish to read specific portions of the XML file and format them in a particular way, then do the following:
from CustomXMLParser import XmlParser
config_file = 'path_to_config_file'
xml_parser = XmlParser(config_file=config_file, parser_type='custom')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)
If you wish to dump any dict to an XML file or a string, then do the following:
from CustomXMLParser import XmlParser
xml_parser = XmlParser(parser_type='raw')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)
my_dict = manipulate(xml_dict)  # manipulate() stands for your own function; it MUST maintain the dict structure
out_xml_file = 'path_to_out_xml_file'
xml_parser.dump(my_dict, out_xml_file, input_format='custom', root='root', pretty=True)  # Dumps the dict to an XML file
my_xml_string = xml_parser.dumps(my_dict, input_format='custom', root='root', pretty=True)  # Dumps the dict to a string
Note that the XmlParser class uses the following default XML attributes:
'''
name_key (str, optional): this is a custom/xml configuration parameter, and it is the name of primary tag. Defaults to "@name".
table_key (str, optional): this is a custom/xml configuration parameter, and it is the table identifier. Defaults to "th".
header_key (str, optional): this is a custom/xml configuration parameter, and it is the header identifier. Defaults to 'header'.
data_key (str, optional): this is a custom/xml configuration parameter, and it is the data identifier. Defaults to "rows".
header_text_key (str, optional): this is a custom/xml configuration parameter, and it is the table's key identifier. Defaults to "#text".
'''
You can override those attributes by passing them to the constructor of the XmlParser class as follows:
from CustomXMLParser import XmlParser
config_file = 'path_to_config_file'
xml_parser = XmlParser(config_file=config_file, parser_type='custom', encoding='utf-8',
                       name_key='<desired_name_key>', table_key='<desired_table_key>', header_key='<desired_header_key>',
                       data_key='<desired_data_key>', header_text_key='<desired_header_text_key>')
xml_file = 'path_to_xml_file'
xml_dict = xml_parser.parse(xml_file)
Config file
Below shows an example of configurations for custom parsing of XML.
{
    "TREE": {
        "TABLE_A": {},
        "TABLE_B": {"TABLE_C": {"KEYS": "key1,key2"}}
    },
    "TABLE_A": [
        "element0_tag,element0_name",
        "element1_tag,element1_name"
    ],
    "TABLE_B": [
        "element0_tag*,element1_tag*,element2_tag,element2_name"
    ],
    "TABLE_C": [
        "element0_tag,element0_name"
    ]
}
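If you prefer to build the configuration in Python, the sketch below writes the same structure to disk so that the resulting path can be used as `config_file`. It assumes the configuration file is stored as JSON, which the example above suggests but this page does not state explicitly. `write_config` and the file name `xml_config.json` are hypothetical.

```python
import json
import os
import tempfile

config = {
    "TREE": {
        "TABLE_A": {},
        "TABLE_B": {"TABLE_C": {"KEYS": "key1,key2"}},
    },
    "TABLE_A": [
        "element0_tag,element0_name",
        "element1_tag,element1_name",
    ],
    "TABLE_B": ["element0_tag*,element1_tag*,element2_tag,element2_name"],
    "TABLE_C": ["element0_tag,element0_name"],
}


def write_config(config, directory):
    """Write the config dict as JSON (assumed format) and return the file path."""
    path = os.path.join(directory, "xml_config.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)
    return path


config_file = write_config(config, tempfile.mkdtemp())
```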
General Rules
- Capitalize all dictionary keys.
- * is wildcard notation: returns data for all available elements
Tree structure
The structure can be flat or nested. If you wish to return child data for a particular parent, then you have to include the child as a value of the parent. For example, parent TABLE_B has child TABLE_C. If TABLE_C has a child of its own, then it is added to TABLE_C in the same way.
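One way to picture the nesting is to walk the `TREE` entry and list each table together with its parent. The sketch below is only an illustration of that reading, not the library's code. `iter_tables` is a hypothetical helper, and it treats an entry named `KEYS` as an option rather than as a child table.

```python
def iter_tables(tree, parent=None, reserved=("KEYS",)):
    """Yield (table_name, parent_name) pairs in depth-first order."""
    for name, children in tree.items():
        if name in reserved:
            continue  # an option such as "KEYS", not a child table
        yield name, parent
        yield from iter_tables(children, parent=name, reserved=reserved)


tree = {
    "TABLE_A": {},
    "TABLE_B": {"TABLE_C": {"KEYS": "key1,key2"}},
}
relations = list(iter_tables(tree))
```

Here `TABLE_A` and `TABLE_B` are top-level tables (their parent is `None`), and `TABLE_C` is reported with `TABLE_B` as its parent.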
REQUESTING_SPECIFIC_KEYS:
Notice that **TABLE_C** specifies a key called `KEYS` with a value of `key1,key2`.
This configuration returns only the matching keys `key1` and `key2` for **TABLE_C**.
If the `KEYS` key is not specified, then all keys are returned by default.
The `KEYS` name must be unique, meaning it must not also appear as a key in the XML file. If it does,
you can change the default key name through class attributes.
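The effect of `KEYS` can be sketched as a simple filter over a record. This is an illustration of the behaviour described above under the assumption that a record is a flat dict; `select_keys` is a hypothetical helper and not part of the library.

```python
def select_keys(record, table_options):
    """Keep only the keys listed in the comma-separated "KEYS" option.

    If "KEYS" is absent, every key is returned, matching the default
    described in the text.
    """
    keys_spec = table_options.get("KEYS")
    if keys_spec is None:
        return dict(record)
    wanted = [key.strip() for key in keys_spec.split(",")]
    return {key: value for key, value in record.items() if key in wanted}


record = {"key1": "a", "key2": "b", "key3": "c"}
filtered = select_keys(record, {"KEYS": "key1,key2"})
unfiltered = select_keys(record, {})
```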
Data structure
Let's make some assumptions about elements to make this example easy to follow.
- For TABLE_A, assume element0_tag and element1_tag map to table, element0_name to info, and element1_name to metadata.
- For TABLE_B, assume element0_tag maps to container, element1_tag to node, element2_tag to table, and element2_name to info.
- For TABLE_C, assume element0_tag maps to table and element0_name to images.
In the above config example, we are interested in returning data for TABLE_A, TABLE_B, and TABLE_C. For each key, a path or a list of paths (xpath) must be provided in order to retrieve data from the XML file. For example:
- TABLE_A has two paths ["table,info", "table,metadata"]; data under the info and metadata tables will be returned and stored in TABLE_A.
- TABLE_B has a single path ["container*,node*,table,info"]; data under the info table for all nodes and all containers will be returned and stored in TABLE_B.
- TABLE_C has a single path ["table,images"]; data under the images table for all parent nodes and containers will be returned and stored in TABLE_C.
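To make the path format concrete, the sketch below splits a path string into its tag steps and its final name, and records which steps carry the `*` wildcard. It reflects one reading of the `tag*,tag,name` layout used in the config example; `parse_path` is a hypothetical helper and not the library's parser.

```python
def parse_path(path):
    """Split 'tag*,tag,name' into ([(tag, is_wildcard), ...], name)."""
    *tags, name = [part.strip() for part in path.split(",")]
    steps = [(tag.rstrip("*"), tag.endswith("*")) for tag in tags]
    return steps, name


steps, name = parse_path("container*,node*,table,info")
```

For the TABLE_B path, `steps` marks `container` and `node` as wildcard steps and `table` as a plain step, while `name` is `info`.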
NOTICE:
The full path isn't required for **TABLE_C**. The *GCF* (greatest common factor) between the child **TABLE_C**
and the parent **TABLE_B**, that is, the part of the path they share, only needs to be specified in the parent table. Since **TABLE_C** is a child of **TABLE_B**,
it falls under the same path, but **TABLE_C** breaks away at "table,images", which is why that is the only path specified for it.
In other words, since **TABLE_C** is a child of **TABLE_B**, all **TABLE_B** rules carry over to **TABLE_C**.
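One way to read this note is that the child's short path is joined onto the parent's path at the point where the two diverge. The sketch below illustrates that interpretation only. It is an assumption about the behaviour, not the library's actual algorithm, and `resolve_child_path` is a hypothetical helper.

```python
def resolve_child_path(parent_path, child_path):
    """Join a child's short path onto the shared prefix of its parent's path."""
    parent_steps = [step.strip() for step in parent_path.split(",")]
    child_steps = [step.strip() for step in child_path.split(",")]
    first = child_steps[0].rstrip("*")
    # Search from the end so the deepest matching parent step is used.
    for index in range(len(parent_steps) - 1, -1, -1):
        if parent_steps[index].rstrip("*") == first:
            return ",".join(parent_steps[:index] + child_steps)
    return child_path  # nothing in common: leave the child path unchanged


full_path = resolve_child_path("container*,node*,table,info", "table,images")
```

Under this reading, TABLE_C inherits the `container*,node*` prefix from TABLE_B and then branches off at `table,images`.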
Author: Hamdan, Muhammad (@mhamdan - ©)