Efficiently convert XML data to Apache Arrow format.

PyPI version Downloads Build Status License: MIT Python Versions

XML2ARROW-PYTHON

A Python package for efficiently converting XML files to Apache Arrow tables using a YAML configuration. This package leverages the xml2arrow Rust crate for high performance.

Installation

pip install xml2arrow

Usage

  1. Create a Configuration File (YAML):

The configuration file (YAML format) defines how your XML structure maps to Arrow tables and fields. Here's a detailed explanation of the configuration structure:

tables:
  - name: <table_name>         # The name of the resulting Arrow table
    xml_path: <xml_path>       # The XML path to the *parent* element of the table's row elements
    levels:                    # Index levels for nested XML structures.
    - <level1>
    - <level2> 
    fields:
    - name: <field_name>       # The name of the Arrow field
      xml_path: <field_path>   # The XML path to the field within a row
      data_type: <data_type>   # The Arrow data type (see below)
      nullable: <true|false>   # Whether the field can be null
      scale: <number>          # Optional scaling factor for float fields
      offset: <number>         # Optional offset for float fields
  - name: ...
  • tables: A list of table configurations. Each entry defines a separate Arrow table to be extracted from the XML.
  • name: The name given to the resulting Arrow RecordBatch (which represents a table).
  • xml_path: An XPath-like string that specifies the XML element that is the parent of the elements representing rows in the table. For example, if your XML contains <library><book>...</book><book>...</book></library>, the xml_path would be /library.
  • levels: A list of strings naming the parent levels used to build index columns for nested structures. For example, if the XML structure is /library/shelfs/shelf/books/book, define levels: ["shelfs", "books"]. This creates index columns named <shelfs> and <books>.
  • fields: A list of field configurations for each column in the Arrow table.
    • name: The name of the field in the Arrow schema.
    • xml_path: An XPath-like string that specifies the XML element or attribute containing the field's value. To select an attribute, append @ followed by the attribute name to the element's path. For example, /library/book/@id selects the id attribute of the book element.
    • data_type: The Arrow data type of the field. Supported types are:
      • Boolean (true or false)
      • Int16
      • UInt16
      • Int32
      • UInt32
      • Int64
      • UInt64
      • Float32
      • Float64
      • Utf8 (Strings)
    • nullable: A boolean value indicating whether the field can contain null values. This field is optional and defaults to false if not specified.
    • scale (Optional): A scaling factor for float fields (e.g., to convert units).
    • offset (Optional): An offset value for float fields (e.g., to convert units).
  2. Parse the XML:

from xml2arrow import XmlToArrowParser

parser = XmlToArrowParser("config.yaml")     # Load configuration
record_batches = parser.parse("data.xml")    # Parse XML using configuration

Example

Suppose we have the following XML file (stations.xml):

<report>
  <header>
    <title>Meteorological Station Data</title>
    <created_by>National Weather Service</created_by>
    <creation_time>2024-12-30T13:59:15Z</creation_time>
  </header>
  <monitoring_stations>
    <monitoring_station id="MS001">
      <location>
        <latitude>-61.39110459389277</latitude>
        <longitude>48.08662749089257</longitude>
        <elevation unit="m">547.1050788360882</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">35.486545480326114</temperature>
          <pressure unit="hPa">950.439973486407</pressure>
          <humidity unit="%">49.77716576844861</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">29.095166644493865</temperature>
          <pressure unit="hPa">1049.3215015450517</pressure>
          <humidity unit="%">32.5687148391251</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Arctic Tundra area, used for Scientific Research.</description>
        <install_date>2024-03-31</install_date>
      </metadata>
    </monitoring_station>
    <monitoring_station id="MS002">
      <location>
        <latitude>11.891496388319311</latitude>
        <longitude>135.09336983543022</longitude>
        <elevation unit="m">174.53349357280004</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">24.791842953632283</temperature>
          <pressure unit="hPa">989.4054287187706</pressure>
          <humidity unit="%">57.70794884397625</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">15.153690541845911</temperature>
          <pressure unit="hPa">1001.413052919951</pressure>
          <humidity unit="%">45.45094598045342</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:49:15Z</timestamp>
          <temperature unit="C">-4.022555715139081</temperature>
          <pressure unit="hPa">1000.5225751769922</pressure>
          <humidity unit="%">70.40117458947834</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:54:15Z</timestamp>
          <temperature unit="C">25.852920542644185</temperature>
          <pressure unit="hPa">953.762785698162</pressure>
          <humidity unit="%">42.62088244545566</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Desert area, used for Weather Forecasting.</description>
        <install_date>2024-01-17</install_date>
      </metadata>
    </monitoring_station>
  </monitoring_stations>
</report>

We can define a YAML configuration file (stations.yaml) to specify how to convert the XML data to Arrow tables:

tables:
  - name: report
    xml_path: /
    levels: []
    fields:
    - name: title
      xml_path: /report/header/title
      data_type: Utf8
      nullable: false
    - name: created_by
      xml_path: /report/header/created_by
      data_type: Utf8
      nullable: false
    - name: creation_time
      xml_path: /report/header/creation_time
      data_type: Utf8
      nullable: false
  - name: stations
    xml_path: /report/monitoring_stations
    levels:
    - station
    fields:
    - name: id
      xml_path: /report/monitoring_stations/monitoring_station/@id  # Path to an attribute
      data_type: Utf8
      nullable: false
    - name: latitude
      xml_path: /report/monitoring_stations/monitoring_station/location/latitude
      data_type: Float32
      nullable: false
    - name: longitude
      xml_path: /report/monitoring_stations/monitoring_station/location/longitude
      data_type: Float32
      nullable: false
    - name: elevation
      xml_path: /report/monitoring_stations/monitoring_station/location/elevation
      data_type: Float32
      nullable: false
    - name: description
      xml_path: /report/monitoring_stations/monitoring_station/metadata/description
      data_type: Utf8
      nullable: false
    - name: install_date
      xml_path: /report/monitoring_stations/monitoring_station/metadata/install_date
      data_type: Utf8
      nullable: false
  - name: measurements
    xml_path: /report/monitoring_stations/monitoring_station/measurements
    levels:
    - station
    - measurement
    fields:
    - name: timestamp
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/timestamp
      data_type: Utf8
      nullable: false
    - name: temperature
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/temperature
      data_type: Float64
      nullable: false
      offset: 273.15  # Convert from Celsius to Kelvin
    - name: pressure
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/pressure
      data_type: Float64
      nullable: false
      scale: 100.0  # Convert from hPa to Pa
    - name: humidity
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/humidity
      data_type: Float64
      nullable: false

Here's how to use xml2arrow to parse the XML and YAML files and get the resulting Arrow tables:

from xml2arrow import XmlToArrowParser

parser = XmlToArrowParser("stations.yaml")     # Load configuration
record_batches = parser.parse("stations.xml")  # Parse XML using configuration

The resulting record batches, displayed as tables:

- report:
 ┌─────────────────────────────┬──────────────────────────┬──────────────────────┐
 │ title                       ┆ created_by               ┆ creation_time        │
 │ ---                         ┆ ---                      ┆ ---                  │
 │ str                         ┆ str                      ┆ str                  │
 ╞═════════════════════════════╪══════════════════════════╪══════════════════════╡
 │ Meteorological Station Data ┆ National Weather Service ┆ 2024-12-30T13:59:15Z │
 └─────────────────────────────┴──────────────────────────┴──────────────────────┘
- stations:
 ┌───────────┬───────┬────────────┬────────────┬────────────┬────────────────────────┬──────────────┐
 │ <station> ┆ id    ┆ latitude   ┆ longitude  ┆ elevation  ┆ description            ┆ install_date │
 │ ---       ┆ ---   ┆ ---        ┆ ---        ┆ ---        ┆ ---                    ┆ ---          │
 │ u32       ┆ str   ┆ f32        ┆ f32        ┆ f32        ┆ str                    ┆ str          │
 ╞═══════════╪═══════╪════════════╪════════════╪════════════╪════════════════════════╪══════════════╡
 │ 0         ┆ MS001 ┆ -61.391106 ┆ 48.086628  ┆ 547.105103 ┆ Located in the Arctic  ┆ 2024-03-31   │
 │           ┆       ┆            ┆            ┆            ┆ Tundra a…              ┆              │
 │ 1         ┆ MS002 ┆ 11.891497  ┆ 135.093369 ┆ 174.533493 ┆ Located in the Desert  ┆ 2024-01-17   │
 │           ┆       ┆            ┆            ┆            ┆ area, us…              ┆              │
 └───────────┴───────┴────────────┴────────────┴────────────┴────────────────────────┴──────────────┘
- measurements:
 ┌───────────┬───────────────┬──────────────────────┬─────────────┬───────────────┬───────────┐
 │ <station> ┆ <measurement> ┆ timestamp            ┆ temperature ┆ pressure      ┆ humidity  │
 │ ---       ┆ ---           ┆ ---                  ┆ ---         ┆ ---           ┆ ---       │
 │ u32       ┆ u32           ┆ str                  ┆ f64         ┆ f64           ┆ f64       │
 ╞═══════════╪═══════════════╪══════════════════════╪═════════════╪═══════════════╪═══════════╡
 │ 0         ┆ 0             ┆ 2024-12-30T12:39:15Z ┆ 308.636545  ┆ 95043.997349  ┆ 49.777166 │
 │ 0         ┆ 1             ┆ 2024-12-30T12:44:15Z ┆ 302.245167  ┆ 104932.150155 ┆ 32.568715 │
 │ 1         ┆ 2             ┆ 2024-12-30T12:39:15Z ┆ 297.941843  ┆ 98940.542872  ┆ 57.707949 │
 │ 1         ┆ 3             ┆ 2024-12-30T12:44:15Z ┆ 288.303691  ┆ 100141.305292 ┆ 45.450946 │
 │ 1         ┆ 4             ┆ 2024-12-30T12:49:15Z ┆ 269.127444  ┆ 100052.257518 ┆ 70.401175 │
 │ 1         ┆ 5             ┆ 2024-12-30T12:54:15Z ┆ 299.002921  ┆ 95376.27857   ┆ 42.620882 │
 └───────────┴───────────────┴──────────────────────┴─────────────┴───────────────┴───────────┘
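
The <station> and <measurement> index columns can be read as counters over the nested elements. The sketch below rebuilds those two columns with the standard library only, using a trimmed copy of stations.xml that keeps just the timestamps. It is one way to reproduce the columns shown in the measurements table, not a description of how xml2arrow computes them. In that table the <measurement> counter keeps running across stations instead of restarting at 0, and the sketch follows that behaviour.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of stations.xml: two stations with 2 and 4 measurements.
XML = """
<report>
  <monitoring_stations>
    <monitoring_station id="MS001">
      <measurements>
        <measurement><timestamp>2024-12-30T12:39:15Z</timestamp></measurement>
        <measurement><timestamp>2024-12-30T12:44:15Z</timestamp></measurement>
      </measurements>
    </monitoring_station>
    <monitoring_station id="MS002">
      <measurements>
        <measurement><timestamp>2024-12-30T12:39:15Z</timestamp></measurement>
        <measurement><timestamp>2024-12-30T12:44:15Z</timestamp></measurement>
        <measurement><timestamp>2024-12-30T12:49:15Z</timestamp></measurement>
        <measurement><timestamp>2024-12-30T12:54:15Z</timestamp></measurement>
      </measurements>
    </monitoring_station>
  </monitoring_stations>
</report>
"""

root = ET.fromstring(XML)  # root element is <report>

rows = []
measurement_index = 0  # running counter, not reset per station
stations = root.iterfind("monitoring_stations/monitoring_station")
for station_index, station in enumerate(stations):
    for measurement in station.iterfind("measurements/measurement"):
        rows.append((station_index, measurement_index, measurement.findtext("timestamp")))
        measurement_index += 1

for row in rows:
    print(row)
```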

Download files

Source Distribution

  • xml2arrow-0.5.0.tar.gz (22.3 kB): Source

Built Distributions

  • xml2arrow-0.5.0-cp310-abi3-win_amd64.whl (794.8 kB): CPython 3.10+, Windows x86-64
  • xml2arrow-0.5.0-cp310-abi3-win32.whl (732.2 kB): CPython 3.10+, Windows x86
  • xml2arrow-0.5.0-cp310-abi3-musllinux_1_2_x86_64.whl (1.2 MB): CPython 3.10+, musllinux: musl 1.2+ x86-64
  • xml2arrow-0.5.0-cp310-abi3-musllinux_1_2_i686.whl (1.2 MB): CPython 3.10+, musllinux: musl 1.2+ i686
  • xml2arrow-0.5.0-cp310-abi3-musllinux_1_2_armv7l.whl (1.3 MB): CPython 3.10+, musllinux: musl 1.2+ ARMv7l
  • xml2arrow-0.5.0-cp310-abi3-musllinux_1_2_aarch64.whl (1.1 MB): CPython 3.10+, musllinux: musl 1.2+ ARM64
  • xml2arrow-0.5.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (998.9 kB): CPython 3.10+, manylinux: glibc 2.17+ x86-64
  • xml2arrow-0.5.0-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (1.1 MB): CPython 3.10+, manylinux: glibc 2.17+ s390x
  • xml2arrow-0.5.0-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (1.0 MB): CPython 3.10+, manylinux: glibc 2.17+ ppc64le
  • xml2arrow-0.5.0-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (988.2 kB): CPython 3.10+, manylinux: glibc 2.17+ ARMv7l
  • xml2arrow-0.5.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (948.9 kB): CPython 3.10+, manylinux: glibc 2.17+ ARM64
  • xml2arrow-0.5.0-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl (1.1 MB): CPython 3.10+, manylinux: glibc 2.5+ i686
  • xml2arrow-0.5.0-cp310-abi3-macosx_11_0_arm64.whl (835.3 kB): CPython 3.10+, macOS 11.0+ ARM64
  • xml2arrow-0.5.0-cp310-abi3-macosx_10_12_x86_64.whl (918.3 kB): CPython 3.10+, macOS 10.12+ x86-64
