Efficiently convert XML data to Apache Arrow format.

Project description

License: MIT

XML2ARROW-PYTHON

A Python package for efficiently converting XML files to Apache Arrow tables using a YAML configuration. This package leverages the xml2arrow Rust crate for high performance.

Installation

pip install xml2arrow

Usage

  1. Create a Configuration File (YAML):

The configuration file (YAML format) defines how your XML structure maps to Arrow tables and fields. Here's a detailed explanation of the configuration structure:

tables:
  - name: <table_name>         # The name of the resulting Arrow table
    xml_path: <xml_path>       # The XML path to the *parent* element of the table's row elements
    row_element: <row_element> # The name of the XML element that represents a row
    levels:                    # Index levels for nested XML structures.
    - <level1>
    - <level2> 
    fields:
    - name: <field_name>       # The name of the Arrow field
      xml_path: <field_path>   # The XML path to the field within a row
      data_type: <data_type>   # The Arrow data type (see below)
      nullable: <true|false>   # Whether the field can be null
      scale: <number>          # Optional scaling factor for float fields
      offset: <number>         # Optional offset for float fields
  - name: ...
  • tables: A list of table configurations. Each entry defines a separate Arrow table to be extracted from the XML.
  • name: The name given to the resulting Arrow RecordBatch (which represents a table).
  • xml_path: An XPath-like string that specifies the XML element that is the parent of the elements representing rows in the table. For example, if your XML contains <library><book>...</book><book>...</book></library>, the xml_path would be /library. The book elements are then identified by the row_element configuration.
  • row_element: The name of the XML element that represents a single row. For example, if the xml_path is /library, the row_element is book.
  • levels: A list of names used to build index columns for nested structures, one per nesting level. If the XML structure is /library/shelfs/shelf/books/book, define the levels like this: levels: ["shelfs", "books"]. This creates index columns named <shelfs> and <books>.
  • fields: A list of field configurations for each column in the Arrow table.
    • name: The name of the field in the Arrow schema.
    • xml_path: An XPath-like string that specifies the XML element or attribute containing the field's value. To select an attribute, append @ followed by the attribute name to the element's path. For example, /library/book/@id selects the id attribute of the book element.
    • data_type: The Arrow data type of the field. Supported types are:
      • Boolean (true or false)
      • Int16
      • UInt16
      • Int32
      • UInt32
      • Int64
      • UInt64
      • Float32
      • Float64
      • Utf8 (Strings)
    • nullable: A boolean value indicating whether the field can contain null values.
    • scale (Optional): A scaling factor for float fields (e.g., to convert units).
    • offset (Optional): An offset value for float fields (e.g., to convert units).
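
To make the path rules above concrete, the following sketch mimics them on the library example using Python's standard xml.etree.ElementTree. It only illustrates the configuration semantics and is not how xml2arrow is implemented; the helper names select_rows and read_field, the title element, and the id values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document based on the <library><book> example.
XML = """
<library>
  <book id="b1"><title>Dune</title></book>
  <book id="b2"><title>Emma</title></book>
</library>
""".strip()


def select_rows(root, table_xml_path, row_element):
    """Return the row elements below the parent addressed by table_xml_path."""
    parts = [p for p in table_xml_path.split("/") if p]
    if not parts or parts[0] != root.tag:
        return []
    parent = root
    for name in parts[1:]:
        parent = parent.find(name)
        if parent is None:
            return []
    return parent.findall(row_element)


def read_field(row, row_path, field_xml_path):
    """Resolve an absolute field path (element text or @attribute) for one row."""
    relative = field_xml_path[len(row_path):].strip("/")
    if not relative:
        return row.text
    *element_parts, last = relative.split("/")
    if last.startswith("@"):
        # Attribute of the row itself or of a nested element.
        target = row.find("/".join(element_parts)) if element_parts else row
        return None if target is None else target.get(last[1:])
    return row.findtext(relative)


root = ET.fromstring(XML)
row_path = "/library/book"  # table xml_path + "/" + row_element
records = [
    {
        "id": read_field(row, row_path, "/library/book/@id"),
        "title": read_field(row, row_path, "/library/book/title"),
    }
    for row in select_rows(root, "/library", "book")
]
print(records)
```

Here the table's xml_path (/library) names the parent, row_element (book) names the rows, and each field path is written in full from the document root.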
  2. Parse the XML:

from xml2arrow import XmlToArrowParser

parser = XmlToArrowParser("config.yaml")     # Load configuration
record_batches = parser.parse("data.xml")    # Parse XML using configuration
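
A small wrapper can make missing input files easier to diagnose before the parser is invoked. This is a minimal sketch: parse_with_checks is a hypothetical helper, not part of xml2arrow, and it uses only the two calls shown above (the XmlToArrowParser constructor and parse).

```python
from pathlib import Path


def parse_with_checks(config_path, xml_path):
    """Verify that both input files exist, then parse as in the snippet above."""
    for label, path in (("configuration", config_path), ("XML", xml_path)):
        if not Path(path).is_file():
            raise FileNotFoundError(f"{label} file not found: {path}")

    # Imported here so the helper can be defined even where xml2arrow
    # is not installed; the import fails only when parsing is attempted.
    from xml2arrow import XmlToArrowParser

    parser = XmlToArrowParser(config_path)  # Load configuration
    return parser.parse(xml_path)           # Parse XML using configuration
```

Calling parse_with_checks("config.yaml", "data.xml") then behaves like the two-line snippet above, except that a missing file is reported with an explicit message first.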

Example

Suppose we have the following XML file (stations.xml):

<report>
  <header>
    <title>Meteorological Station Data</title>
    <created_by>National Weather Service</created_by>
    <creation_time>2024-12-30T13:59:15Z</creation_time>
  </header>
  <monitoring_stations>
    <monitoring_station id="MS001">
      <location>
        <latitude>-61.39110459389277</latitude>
        <longitude>48.08662749089257</longitude>
        <elevation unit="m">547.1050788360882</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">35.486545480326114</temperature>
          <pressure unit="hPa">950.439973486407</pressure>
          <humidity unit="%">49.77716576844861</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">29.095166644493865</temperature>
          <pressure unit="hPa">1049.3215015450517</pressure>
          <humidity unit="%">32.5687148391251</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Arctic Tundra area, used for Scientific Research.</description>
        <install_date>2024-03-31</install_date>
      </metadata>
    </monitoring_station>
    <monitoring_station id="MS002">
      <location>
        <latitude>11.891496388319311</latitude>
        <longitude>135.09336983543022</longitude>
        <elevation unit="m">174.53349357280004</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">24.791842953632283</temperature>
          <pressure unit="hPa">989.4054287187706</pressure>
          <humidity unit="%">57.70794884397625</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">15.153690541845911</temperature>
          <pressure unit="hPa">1001.413052919951</pressure>
          <humidity unit="%">45.45094598045342</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:49:15Z</timestamp>
          <temperature unit="C">-4.022555715139081</temperature>
          <pressure unit="hPa">1000.5225751769922</pressure>
          <humidity unit="%">70.40117458947834</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:54:15Z</timestamp>
          <temperature unit="C">25.852920542644185</temperature>
          <pressure unit="hPa">953.762785698162</pressure>
          <humidity unit="%">42.62088244545566</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Desert area, used for Weather Forecasting.</description>
        <install_date>2024-01-17</install_date>
      </metadata>
    </monitoring_station>
  </monitoring_stations>
</report>

We can define a YAML configuration file (stations.yaml) to specify how to convert the XML data to Arrow tables:

tables:
  - name: report
    xml_path: /
    row_element: report
    levels: []
    fields:
    - name: title
      xml_path: /report/header/title
      data_type: Utf8
      nullable: false
    - name: created_by
      xml_path: /report/header/created_by
      data_type: Utf8
      nullable: false
    - name: creation_time
      xml_path: /report/header/creation_time
      data_type: Utf8
      nullable: false
  - name: stations
    xml_path: /report/monitoring_stations
    row_element: monitoring_station
    levels:
    - station
    fields:
    - name: id
      xml_path: /report/monitoring_stations/monitoring_station/@id  # Path to an attribute
      data_type: Utf8
      nullable: false
    - name: latitude
      xml_path: /report/monitoring_stations/monitoring_station/location/latitude
      data_type: Float32
      nullable: false
    - name: longitude
      xml_path: /report/monitoring_stations/monitoring_station/location/longitude
      data_type: Float32
      nullable: false
    - name: elevation
      xml_path: /report/monitoring_stations/monitoring_station/location/elevation
      data_type: Float32
      nullable: false
    - name: description
      xml_path: /report/monitoring_stations/monitoring_station/metadata/description
      data_type: Utf8
      nullable: false
    - name: install_date
      xml_path: /report/monitoring_stations/monitoring_station/metadata/install_date
      data_type: Utf8
      nullable: false
  - name: measurements
    xml_path: /report/monitoring_stations/monitoring_station/measurements
    row_element: measurement
    levels:
    - station
    - measurement
    fields:
    - name: timestamp
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/timestamp
      data_type: Utf8
      nullable: false
    - name: temperature
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/temperature
      data_type: Float64
      nullable: false
      offset: 273.15  # Convert from Celsius to Kelvin
    - name: pressure
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/pressure
      data_type: Float64
      nullable: false
      scale: 100.0  # Convert from hPa to Pa
    - name: humidity
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/humidity
      data_type: Float64
      nullable: false
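
The offset and scale entries in this configuration can be checked by hand against the first measurement of station MS001 in stations.xml. The sketch below assumes the transform is value * scale + offset; that order is an assumption, since the description only says both are optional, and apply_scale_offset is a hypothetical helper rather than a function of the package.

```python
def apply_scale_offset(value, scale=None, offset=None):
    """Assumed transform: multiply by scale, then add offset; unset steps are skipped."""
    if scale is not None:
        value *= scale
    if offset is not None:
        value += offset
    return value


# First measurement of station MS001 in stations.xml.
temperature_k = apply_scale_offset(35.486545480326114, offset=273.15)  # Celsius to Kelvin
pressure_pa = apply_scale_offset(950.439973486407, scale=100.0)        # hPa to Pa

print(round(temperature_k, 6), round(pressure_pa, 6))
```

Because each field in the example sets only one of the two options, the result does not depend on the assumed order.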

Here's how to use xml2arrow to parse the XML and YAML files and get the resulting Arrow tables:

from xml2arrow import XmlToArrowParser

parser = XmlToArrowParser("stations.yaml")     # Load configuration
record_batches = parser.parse("stations.xml")  # Parse XML using configuration

The resulting record batches contain the following tables:

- report:
 ┌─────────────────────────────┬──────────────────────────┬──────────────────────┐
 │ title                       ┆ created_by               ┆ creation_time        │
 │ ---                         ┆ ---                      ┆ ---                  │
 │ str                         ┆ str                      ┆ str                  │
 ╞═════════════════════════════╪══════════════════════════╪══════════════════════╡
 │ Meteorological Station Data ┆ National Weather Service ┆ 2024-12-30T13:59:15Z │
 └─────────────────────────────┴──────────────────────────┴──────────────────────┘
- stations:
 ┌───────────┬───────┬────────────┬────────────┬────────────┬────────────────────────┬──────────────┐
 │ <station> ┆ id    ┆ latitude   ┆ longitude  ┆ elevation  ┆ description            ┆ install_date │
 │ ---       ┆ ---   ┆ ---        ┆ ---        ┆ ---        ┆ ---                    ┆ ---          │
 │ u32       ┆ str   ┆ f32        ┆ f32        ┆ f32        ┆ str                    ┆ str          │
 ╞═══════════╪═══════╪════════════╪════════════╪════════════╪════════════════════════╪══════════════╡
 │ 0         ┆ MS001 ┆ -61.391106 ┆ 48.086628  ┆ 547.105103 ┆ Located in the Arctic  ┆ 2024-03-31   │
 │           ┆       ┆            ┆            ┆            ┆ Tundra a…              ┆              │
 │ 1         ┆ MS002 ┆ 11.891497  ┆ 135.093369 ┆ 174.533493 ┆ Located in the Desert  ┆ 2024-01-17   │
 │           ┆       ┆            ┆            ┆            ┆ area, us…              ┆              │
 └───────────┴───────┴────────────┴────────────┴────────────┴────────────────────────┴──────────────┘
- measurements:
 ┌───────────┬───────────────┬──────────────────────┬─────────────┬───────────────┬───────────┐
 │ <station> ┆ <measurement> ┆ timestamp            ┆ temperature ┆ pressure      ┆ humidity  │
 │ ---       ┆ ---           ┆ ---                  ┆ ---         ┆ ---           ┆ ---       │
 │ u32       ┆ u32           ┆ str                  ┆ f64         ┆ f64           ┆ f64       │
 ╞═══════════╪═══════════════╪══════════════════════╪═════════════╪═══════════════╪═══════════╡
 │ 0         ┆ 0             ┆ 2024-12-30T12:39:15Z ┆ 308.636545  ┆ 95043.997349  ┆ 49.777166 │
 │ 0         ┆ 1             ┆ 2024-12-30T12:44:15Z ┆ 302.245167  ┆ 104932.150155 ┆ 32.568715 │
 │ 1         ┆ 2             ┆ 2024-12-30T12:39:15Z ┆ 297.941843  ┆ 98940.542872  ┆ 57.707949 │
 │ 1         ┆ 3             ┆ 2024-12-30T12:44:15Z ┆ 288.303691  ┆ 100141.305292 ┆ 45.450946 │
 │ 1         ┆ 4             ┆ 2024-12-30T12:49:15Z ┆ 269.127444  ┆ 100052.257518 ┆ 70.401175 │
 │ 1         ┆ 5             ┆ 2024-12-30T12:54:15Z ┆ 299.002921  ┆ 95376.27857   ┆ 42.620882 │
 └───────────┴───────────────┴──────────────────────┴─────────────┴───────────────┴───────────┘
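
The <station> and <measurement> index columns in the output above follow a simple pattern: the station index identifies the parent station, while the measurement index keeps counting across stations. The sketch below reproduces that pattern with the standard library on a stripped-down copy of stations.xml. It describes the observed output only, not the library's internals, and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Stripped-down stations.xml: two stations with 2 and 4 measurements.
XML = """
<report>
  <monitoring_stations>
    <monitoring_station id="MS001">
      <measurements>
        <measurement/><measurement/>
      </measurements>
    </monitoring_station>
    <monitoring_station id="MS002">
      <measurements>
        <measurement/><measurement/><measurement/><measurement/>
      </measurements>
    </monitoring_station>
  </monitoring_stations>
</report>
""".strip()

root = ET.fromstring(XML)
stations = root.findall("monitoring_stations/monitoring_station")

index_rows = []       # (station index, measurement index) pairs
measurement_idx = 0   # running counter, not reset per station
for station_idx, station in enumerate(stations):
    for _ in station.findall("measurements/measurement"):
        index_rows.append((station_idx, measurement_idx))
        measurement_idx += 1

# The station index can be used to look up the parent station row.
station_ids = [s.get("id") for s in stations]
joined = [(station_ids[s], m) for s, m in index_rows]

print(index_rows)
print(joined)
```

The same lookup lets the measurements table be joined back to the stations table on the <station> column.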
