semi structured xml to dict

ssxtd is an XML reader similar to xmltodict that also supports semi-structured XML and offers a more flexible environment. ssxtd can use any of the following:

  • the native xml package
  • the lxml package
  • the defusedxml package

As a rule, stick to the native xml package: it is the fastest in most situations.

Note: ssxtd was created to parse very big files. As a result, the default parsing depth is 2 and the parsing functions are generators.

Quickstart

pip install ssxtd
from ssxtd import parsers
result = next(parsers.xml_parse(my_file, depth=0))

What if ...

...I want to parse big files?

Use:
parsers.xml_iterparse(my_file)

Note: by default, it returns depth-2 elements.
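
To illustrate why yielding elements at a fixed depth suits big files, here is a minimal sketch built only on the standard library's xml.etree.ElementTree.iterparse. It is not ssxtd's implementation: the helper name iter_at_depth is hypothetical, and the depth convention used here (root = 0) is an assumption that may not match ssxtd's.

```python
import io
import xml.etree.ElementTree as ET

def iter_at_depth(source, depth=1):
    """Yield each element found at `depth` (root = 0), then clear it.

    Hypothetical helper for illustration only, not part of ssxtd.
    """
    level = -1  # becomes 0 when the root's "start" event arrives
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            level += 1
        else:
            if level == depth:
                yield elem
                elem.clear()  # release the subtree's children and text
            level -= 1

xml_bytes = b"<root><item><id>1</id></item><item><id>2</id></item></root>"
ids = [item.findtext("id") for item in iter_at_depth(io.BytesIO(xml_bytes), depth=1)]
print(ids)  # → ['1', '2']
```

Because the function is a generator, only one element at the requested depth needs to be fully materialized at a time, which is the same reason the ssxtd parsers are generators.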

...my XML file has mixed tags and text?

ssxtd automatically converts mixed tags and text to a string, preserving their order.
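
As a rough illustration of what "keeping the right order" means, the sketch below flattens mixed content with the standard library. The function name mixed_to_string is hypothetical, and the exact string ssxtd produces may differ; this only shows the idea.

```python
import xml.etree.ElementTree as ET

def mixed_to_string(elem):
    """Serialize the text and child tags of `elem` in document order.

    Hypothetical helper for illustration only, not part of ssxtd.
    """
    parts = [elem.text or ""]
    for child in elem:
        # ET.tostring also emits the child's tail text, so order is preserved
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()

root = ET.fromstring("<p>Water is H<sub>2</sub>O, not H<sub>3</sub>O.</p>")
flat = mixed_to_string(root)
print(flat)  # → Water is H<sub>2</sub>O, not H<sub>3</sub>O.
```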

...I want to use defusedxml to secure my app?

Use:
parsers.dxml_parse(my_file)
or
parsers.dxml_iterparse(my_file)

...I want to use lxml?

Use:
parsers.lxml_parse(my_file)
or
parsers.lxml_iterparse(my_file)

Note: in the benchmarks, both are slower than their native counterparts, xml_parse and xml_iterparse.

Options

depth

You can adjust the depth level of the returned objects, even when not using iterparse.

Note: depth=0 cannot be used with iterparse.

trim_spaces

Trims spaces from each value found. This can be useful when you have some ugly XML like:

<root>
  <text>we have
  some indentation
  problems
  </text>
</root>
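
The sketch below shows the problem on the XML above and one plausible way to clean it up. It assumes trimming means stripping both ends and collapsing inner whitespace runs; ssxtd's actual trim_spaces behavior may differ, and collapse_whitespace is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def collapse_whitespace(value):
    """Strip both ends and collapse inner whitespace runs to one space.

    Assumed behavior for illustration; not taken from ssxtd.
    """
    return " ".join(value.split())

doc = """<root>
  <text>we have
  some indentation
  problems
  </text>
</root>"""

raw = ET.fromstring(doc).findtext("text")
print(repr(raw))                       # newlines and indentation are kept
print(repr(collapse_whitespace(raw)))  # → 'we have some indentation problems'
```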

del_empty (defaults to True)

If set to False, empty tags are not removed.

cleanup_namespaces

If set to False, namespaces are not removed.

verbose

If set to True, shows a progress bar \o/

recover

If set to True, recovers from malformed XML (cf. test_malformed_xml.py).

Note: lxml_parse and lxml_iterparse use lxml's own recovery abilities, whereas the others use a BeautifulSoup transformation.

compression

ssxtd can handle ZIP files, GZIP files, BytesIO objects, and paths to files.

For ZIP and GZIP, you must set the parameter "compression" to either "gz" or "zip":

parsers.xml_parse(my_file, compression="gz")

You can also set the parameter to "auto"; ssxtd will then detect the file type from the extension (.xml, .zip, or .gz).

Note: at the moment, in ZIP compression mode, only .xml files located at the root of the ZIP file are read.
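
For readers curious about how extension-based detection can work, here is a minimal standard-library sketch. The helpers detect_compression and open_xml_streams are hypothetical and are not ssxtd functions; the ZIP branch mirrors the "root-level .xml files only" rule described above.

```python
import gzip
import os
import tempfile
import zipfile

def detect_compression(path):
    """Map a file extension to "gz", "zip", or None (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    return {".gz": "gz", ".zip": "zip"}.get(ext)  # None means a plain file

def open_xml_streams(path):
    """Yield binary file objects holding the XML found at `path`."""
    mode = detect_compression(path)
    if mode == "gz":
        with gzip.open(path, "rb") as handle:
            yield handle
    elif mode == "zip":
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                # keep only .xml members located at the archive root
                if "/" not in name and name.lower().endswith(".xml"):
                    with archive.open(name) as handle:
                        yield handle
    else:
        with open(path, "rb") as handle:
            yield handle

# Round trip on a throwaway gzip file
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "sample.xml.gz")
    with gzip.open(gz_path, "wb") as out:
        out.write(b"<root/>")
    contents = [stream.read() for stream in open_xml_streams(gz_path)]

print(contents)  # → [b'<root/>']
```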

object_processor

If you specify the parameter "object_processor=my_function" when calling a parser, your function will be called for each object.

See bin/run_exemple.py

This allows special actions, such as merging tags, to be done directly during parsing.

value_processor

If you specify the parameter "value_processor=my_function" when calling a parser, your function will be called for each value found.

E.g. a simple type conversion:

def try_conversion(value):
    # try the strictest type first: int, then float
    try:
        return int(value)
    except (ValueError, TypeError):
        pass
    try:
        return float(value)
    except (ValueError, TypeError):
        pass
    # not a number: return the value unchanged
    return value
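
To see what such a converter returns for typical values, the sketch below restates the same logic in compact form and applies it to a few samples. The list comprehension only stands in for the parser; how ssxtd actually invokes the function is not shown here.

```python
def try_conversion(value):
    """Same logic as above: int first, then float, else unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except (ValueError, TypeError):
            continue
    return value

samples = ["42", "3.14", "1e3", "PMID", None, " 7 "]
converted = [try_conversion(s) for s in samples]
print(converted)  # → [42, 3.14, 1000.0, 'PMID', None, 7]
```

Note that this kind of conversion is lossy for identifier-like strings: "007" becomes 7, and "nan" or "inf" become floats, so it is worth checking that no field relies on its original text form.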

Requirements

Python

Python >= 3.7.0b1 is required, due to https://github.com/python/cpython/commit/066df4fd454d6ff9be66e80b2a65995b10af174f
You CAN use older versions of Python (tested as far back as 3.5), but you won't be able to read ZIP files.

Libs

  • bs4
  • tqdm

Run tests

Install pytest:
pip install pytest
In the root directory, run:
pytest
To run a single file, go to the root folder and run:
python -c "import ssxtd.tests.test_malformed_xml"

Performances of the parsing functions

Time to parse https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed19n0001.xml.gz
GZ file size: 19MB
Extracted size: 185MB
CPU: i7-7700HQ

Function          XML file              GZ file               ZIP file (no compression)
xml_parse         32.76501545000065     36.07715339999959     33.419777400000385
xml_iterparse     37.56028480000168     42.16279835000023     39.137448499999664
lxml_parse        37.464776250000796    38.880011499999455    37.046347550000064
lxml_iterparse    47.04024449999997     45.959421049999946    45.05521540000154
dxml_parse        41.52063830000043     40.07632935000038     38.88691465000011
dxml_iterparse    45.195273199999065    44.895784000000276    44.13825424999959

For more details, see ssxtd\benchmarks\results\result.csv
