A tool to read XML files as pandas dataframes.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Pandas Read XML

A tool to help read XML files as pandas dataframes.

See the example in Google Colab.

Isn't it annoying working with data in XML format? I think so. Take a look at this simple example.

<first-tag>
    <not-interested>
        blah blah
    </not-interested>
    <second-tag>
        <the-tag-you-want-as-root>
            <row>
                <columnA>
                    The data that you want
                </columnA>
                <columnB>
                    More data that you want
                </columnB>
            </row>
            <row>
                <columnA>
                    Yet more data that you want
                </columnA>
                <columnB>
                    Eh, get this data too
                </columnB>
            </row>
        </the-tag-you-want-as-root>
    </second-tag>
    <another-irrelevant-tag>
        some other info that you do not want
    </another-irrelevant-tag>
</first-tag>

I wish there were a simple df = pd.read_xml('some_file.xml'), like the pd.read_csv() and pd.read_json() that we all love.

I can't solve this with my time and skills, but perhaps this package will help get you started.

Install

pip install pandas_read_xml

Import package

import pandas_read_xml as pdx

Read XML as pandas dataframe

You will need to identify the path to the "root" tag in the XML from which you want to extract the data.

df = pdx.read_xml("test.xml", ['first-tag', 'second-tag', 'the-tag-you-want-as-root'])

By default, pandas-read-xml treats the root tag as being the "rows" of the pandas dataframe. If this is not the case, pass the argument root_is_rows=False.

*Sometimes the XML is structured so that pandas interprets rows and columns the opposite way from what we expect, and read_xml may fail. In such cases, try passing transpose=True. This argument only affects the reading if root_is_rows=False is also passed.
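
To make the idea of a "path to the root tag" concrete, here is a minimal sketch that uses only the Python standard library. It is not how pandas_read_xml is implemented; the helper name rows_under and the shortened XML sample are assumptions made for illustration. It walks down the list of tags and returns one dictionary per child of the final tag, which is the shape that pandas.DataFrame(rows) would accept.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example document shown earlier.
XML_TEXT = """
<first-tag>
    <not-interested>blah blah</not-interested>
    <second-tag>
        <the-tag-you-want-as-root>
            <row>
                <columnA>The data that you want</columnA>
                <columnB>More data that you want</columnB>
            </row>
            <row>
                <columnA>Yet more data that you want</columnA>
                <columnB>Eh, get this data too</columnB>
            </row>
        </the-tag-you-want-as-root>
    </second-tag>
</first-tag>
"""


def rows_under(xml_text, root_path):
    """Hypothetical helper: follow root_path from the document element and
    return one dict per child element of the final tag."""
    node = ET.fromstring(xml_text)
    # The first entry of the path names the document element itself.
    if node.tag != root_path[0]:
        raise ValueError(f"expected <{root_path[0]}>, found <{node.tag}>")
    # Each remaining entry names a direct child of the previous tag.
    for tag in root_path[1:]:
        node = node.find(tag)
        if node is None:
            raise ValueError(f"tag <{tag}> not found on the path")
    # Every child of the final tag becomes a row; its children become columns.
    return [
        {cell.tag: (cell.text or "").strip() for cell in row}
        for row in node
    ]


rows = rows_under(XML_TEXT, ["first-tag", "second-tag", "the-tag-you-want-as-root"])
```

Each entry in rows is a dictionary keyed by columnA and columnB, so the tags outside the chosen path are simply never visited.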

Auto Flatten

The truly cumbersome part of working with XML data (or JSON data) is that it does not represent a single table. Rather, it is a (nested) tree representation of what were probably relational databases. Often, XML data is exported without a clearly documented schema and, more often still, with no clear way of navigating the data.

What is even more annoying is that, in comparison to JSON, the data structures are not consistent across XML files from the same schema. Some files may have multiples of the same tag, resulting in list-type data, while other files of the same schema will have only one of that tag, resulting in non-list-type data. At other times, a tag is not present at all, which means that the resulting "column" is not just null but does not exist. This makes it difficult to "flatten".
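
The mixed shapes described above can be illustrated with plain Python dictionaries. The records below are made up, and as_list is a hypothetical helper rather than part of this package. The sketch shows one common workaround: coerce every value of the inconsistent tag into a list before trying to flatten it.

```python
def as_list(value):
    """Hypothetical helper: give every cell the same list shape."""
    if value is None:
        return []          # tag was absent
    if isinstance(value, list):
        return value       # tag was repeated
    return [value]         # tag appeared exactly once


# Made-up records, as a dict-based XML parser might produce them
# from three files that share one schema.
record_a = {"id": 1, "item": [{"name": "x"}, {"name": "y"}]}  # repeated <item>
record_b = {"id": 2, "item": {"name": "z"}}                   # single <item>
record_c = {"id": 3}                                          # no <item> at all

records = [record_a, record_b, record_c]

# After this step the "item" key exists everywhere and always holds a list.
normalised = [{**r, "item": as_list(r.get("item"))} for r in records]
```

With the shapes made uniform, each record can be treated the same way regardless of how many times the tag appeared in its source file.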

Pandas already has some tools to help "explode" (items in list become separate rows) and "normalise" (key, value pairs in one column become separate columns of data), but they fail when there are these mixed types within the same tags (columns). Besides, "flattening" (combining exploding and normalising) duplicates other data in the dataframe as well, leading to an explosion of memory requirements.
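
As a small illustration of the pandas tools mentioned above, the sketch below uses DataFrame.explode and pandas.json_normalize on a made-up table (the column names are assumptions). It also shows the duplication problem: the order_id value is repeated for every exploded item.

```python
import pandas as pd

# Made-up data: each order holds a list of item dictionaries.
df = pd.DataFrame({
    "order_id": [1, 2],
    "items": [
        [{"sku": "a", "qty": 1}, {"sku": "b", "qty": 2}],
        [{"sku": "c", "qty": 5}],
    ],
})

# "Explode": each list element becomes its own row.
exploded = df.explode("items").reset_index(drop=True)

# "Normalise": the keys of each dictionary become separate columns.
item_columns = pd.json_normalize(exploded["items"].tolist())

# Combining both steps gives a flat table, with order_id duplicated per item.
flat = pd.concat([exploded.drop(columns="items"), item_columns], axis=1)
```

This works here because every cell in the items column is a list of dictionaries. If some cells held a single dictionary or were missing, the same steps would not give a clean result, which is the mixed-type problem described above.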

So, in this tool, I have also attempted to make a few different tools to separate the relational tables.

See the example in Colab (or run the notebook elsewhere)

The auto_separate_tables method will separate out what it guesses to be separate tables. The resulting data is a dictionary where the keys are the "table names" and the corresponding values are the pandas dataframes. Each of the separate tables will have the key_columns as common columns.

You can see the list of separated tables by using Python dictionary methods.

data.keys()

And then view the table of interest.
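
To show the dictionary-of-tables shape described above, here is a standard-library sketch. It is not the algorithm used by auto_separate_tables; the function separate_tables, the table name "main" and the sample records are all assumptions made for illustration. List-valued fields are moved into their own tables, and every child row carries the key columns so that it can be joined back to its parent.

```python
def separate_tables(records, key_columns):
    """Hypothetical sketch: split list-valued fields into child tables
    that share key_columns with the main table."""
    tables = {"main": []}
    for record in records:
        keys = {k: record[k] for k in key_columns}
        main_row = {}
        for name, value in record.items():
            if isinstance(value, list):
                # Each list element (assumed to be a dict) becomes a row
                # in a table named after the field.
                for child in value:
                    tables.setdefault(name, []).append({**keys, **child})
            else:
                main_row[name] = value
        tables["main"].append(main_row)
    return tables


# Made-up nested records.
records = [
    {"order_id": 1, "customer": "Ann", "items": [{"sku": "a"}, {"sku": "b"}]},
    {"order_id": 2, "customer": "Bob", "items": [{"sku": "c"}]},
]

data = separate_tables(records, key_columns=["order_id"])

table_names = list(data.keys())   # the "table names"
items_table = data["items"]       # rows of one table of interest
```

Because the parent values are not copied into every child row, apart from the key columns, this layout avoids the duplication that full flattening causes.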

There are also other "smaller" functions that do parts of the job:

flatten(df)
auto_flatten(df, key_columns)
fully_flatten(df, key_columns)

There are even more if you look through the code.

Release history

- 0.3.1 (this version): Apr 8, 2021
- 0.3.0: Apr 8, 2021
- 0.2.0: Apr 7, 2021
- 0.1.1: Apr 7, 2021
- 0.1.0: Mar 20, 2021
- 0.0.9: Dec 21, 2020
- 0.0.8: Oct 22, 2020
- 0.0.7: Aug 31, 2020
- 0.0.6: Aug 30, 2020
- 0.0.5: May 7, 2020
- 0.0.4: Apr 30, 2020
- 0.0.3: Apr 30, 2020
- 0.0.2: Apr 29, 2020
- 0.0.1: Apr 29, 2020

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

pandas_read_xml-0.3.1-py3-none-any.whl (6.3 kB)

Uploaded Apr 8, 2021 Python 3

File details

Details for the file pandas_read_xml-0.3.1-py3-none-any.whl.

File metadata

Download URL: pandas_read_xml-0.3.1-py3-none-any.whl
Upload date: Apr 8, 2021
Size: 6.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.0

File hashes

Hashes for pandas_read_xml-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`76b0047e9c81c6ba47bc2e788b866280d862f0eea52f0aac0f1d65ba9ff72e3c`
MD5	`892f968506bc15f7d28177003cab2ccd`
BLAKE2b-256	`dd67033ecb058eb44bfabc1f1b4f92e4a80f59c9b423c442255a56e1826776b5`
