Efficiently convert XML data to Apache Arrow format.

XML2ARROW-PYTHON

A Python package for efficiently converting XML files to Apache Arrow tables using a YAML configuration. This package leverages the xml2arrow Rust crate for high performance.

Installation

pip install xml2arrow

Usage

  1. Create a Configuration File (YAML):

The configuration file (YAML format) defines how your XML structure maps to Arrow tables and fields. Here's a detailed explanation of the configuration structure:

tables:
  - name: <table_name>         # The name of the resulting Arrow table
    xml_path: <xml_path>       # The XML path to the *parent* element of the table's row elements
    row_element: <row_element> # The name of the XML element that represents a row
    levels:                    # Index levels for nested XML structures.
    - <level1>
    - <level2> 
    fields:
    - name: <field_name>       # The name of the Arrow field
      xml_path: <field_path>   # The XML path to the field within a row
      data_type: <data_type>   # The Arrow data type (see below)
      nullable: <true|false>   # Whether the field can be null
      scale: <number>          # Optional scaling factor for float fields
      offset: <number>         # Optional offset for float fields
  - name: ...
  • tables: A list of table configurations. Each entry defines a separate Arrow table to be extracted from the XML.
  • name: The name given to the resulting Arrow RecordBatch (which represents a table).
  • xml_path: An XPath-like string that specifies the XML element that is the parent of the elements representing rows in the table. For example, if your XML contains <library><book>...</book><book>...</book></library>, the xml_path would be /library. The book elements are then identified by the row_element configuration.
  • row_element: The name of the XML element that represents a single row. It is a child of the element named by xml_path. For example, if the xml_path is /library, the row_element is book.
  • levels: An array of strings naming the index levels for nested XML structures. If the XML structure is /library/shelfs/shelf/books/book, define the levels as levels: ["shelfs", "books"]. This creates index columns named <shelfs> and <books> in the resulting table.
  • fields: A list of field configurations for each column in the Arrow table.
    • name: The name of the field in the Arrow schema.
    • xml_path: An XPath-like string that specifies the XML element containing the field's value.
    • data_type: The Arrow data type of the field. Supported types are:
      • Boolean (true or false)
      • Int16
      • UInt16
      • Int32
      • UInt32
      • Int64
      • UInt64
      • Float32
      • Float64
      • Utf8 (Strings)
    • nullable: A boolean value indicating whether the field can contain null values.
    • scale (Optional): A factor by which values of float fields are multiplied (e.g., 100.0 to convert hPa to Pa).
    • offset (Optional): A value added to values of float fields (e.g., 273.15 to convert Celsius to Kelvin).
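
The field rules above can be checked before a configuration is handed to the parser. The following is a minimal sketch and not part of xml2arrow: SUPPORTED_TYPES copies the list above, the dictionary mirrors the YAML layout for the <library> example, and check_table, the title field, and the book level name are hypothetical, chosen only for illustration.

```python
# Data types listed as supported in the configuration reference above.
SUPPORTED_TYPES = {
    "Boolean", "Int16", "UInt16", "Int32", "UInt32",
    "Int64", "UInt64", "Float32", "Float64", "Utf8",
}
FLOAT_TYPES = {"Float32", "Float64"}


def check_table(table: dict) -> list[str]:
    """Return a list of problems found in one table configuration (hypothetical helper)."""
    problems = []
    for key in ("name", "xml_path", "row_element", "fields"):
        if key not in table:
            problems.append(f"table is missing '{key}'")
    for field in table.get("fields", []):
        name = field.get("name", "<unnamed>")
        data_type = field.get("data_type")
        if data_type not in SUPPORTED_TYPES:
            problems.append(f"field '{name}': unsupported data_type {data_type!r}")
        # scale and offset are documented for float fields only.
        if data_type not in FLOAT_TYPES and ("scale" in field or "offset" in field):
            problems.append(f"field '{name}': scale/offset on a non-float type")
    return problems


# Same layout as the YAML file, for <library><book>...</book></library>.
library_table = {
    "name": "books",
    "xml_path": "/library",   # parent of the row elements
    "row_element": "book",    # one row per <book>
    "levels": ["book"],       # assumed index name, chosen for illustration
    "fields": [
        {"name": "title", "xml_path": "/library/book/title",
         "data_type": "Utf8", "nullable": False},
    ],
}

print(check_table(library_table))  # prints []
```

Whether xml2arrow itself rejects such configurations is not stated here; the helper only encodes the rules as documented above.
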
  2. Parse the XML:

from xml2arrow import XmlToArrowParser

parser = XmlToArrowParser("config.yaml")     # Load configuration
record_batches = parser.parse("data.xml")    # Parse XML using configuration

Example

Suppose we have the following XML file (stations.xml):

<report>
  <header>
    <title>Meteorological Station Data</title>
    <created_by>National Weather Service</created_by>
    <creation_time>2024-12-30T13:59:15Z</creation_time>
  </header>
  <monitoring_stations>
    <monitoring_station>
      <id>MS001</id>
      <location>
        <latitude>-61.39110459389277</latitude>
        <longitude>48.08662749089257</longitude>
        <elevation unit="m">547.1050788360882</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">35.486545480326114</temperature>
          <pressure unit="hPa">950.439973486407</pressure>
          <humidity unit="%">49.77716576844861</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">29.095166644493865</temperature>
          <pressure unit="hPa">1049.3215015450517</pressure>
          <humidity unit="%">32.5687148391251</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Arctic Tundra area, used for Scientific Research.</description>
        <install_date>2024-03-31</install_date>
      </metadata>
    </monitoring_station>
    <monitoring_station>
      <id>MS002</id>
      <location>
        <latitude>11.891496388319311</latitude>
        <longitude>135.09336983543022</longitude>
        <elevation unit="m">174.53349357280004</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">24.791842953632283</temperature>
          <pressure unit="hPa">989.4054287187706</pressure>
          <humidity unit="%">57.70794884397625</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">15.153690541845911</temperature>
          <pressure unit="hPa">1001.413052919951</pressure>
          <humidity unit="%">45.45094598045342</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:49:15Z</timestamp>
          <temperature unit="C">-4.022555715139081</temperature>
          <pressure unit="hPa">1000.5225751769922</pressure>
          <humidity unit="%">70.40117458947834</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:54:15Z</timestamp>
          <temperature unit="C">25.852920542644185</temperature>
          <pressure unit="hPa">953.762785698162</pressure>
          <humidity unit="%">42.62088244545566</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Desert area, used for Weather Forecasting.</description>
        <install_date>2024-01-17</install_date>
      </metadata>
    </monitoring_station>
  </monitoring_stations>
</report>

We can define a YAML configuration file (stations.yaml) to specify how to convert the XML data to Arrow tables:

tables:
  - name: report
    xml_path: /
    row_element: report
    levels: []
    fields:
    - name: title
      xml_path: /report/header/title
      data_type: Utf8
      nullable: false
    - name: created_by
      xml_path: /report/header/created_by
      data_type: Utf8
      nullable: false
    - name: creation_time
      xml_path: /report/header/creation_time
      data_type: Utf8
      nullable: false
  - name: stations
    xml_path: /report/monitoring_stations
    row_element: monitoring_station
    levels:
    - station
    fields:
    - name: id
      xml_path: /report/monitoring_stations/monitoring_station/id
      data_type: Utf8
      nullable: false
    - name: latitude
      xml_path: /report/monitoring_stations/monitoring_station/location/latitude
      data_type: Float32
      nullable: false
    - name: longitude
      xml_path: /report/monitoring_stations/monitoring_station/location/longitude
      data_type: Float32
      nullable: false
    - name: elevation
      xml_path: /report/monitoring_stations/monitoring_station/location/elevation
      data_type: Float32
      nullable: false
    - name: description
      xml_path: /report/monitoring_stations/monitoring_station/metadata/description
      data_type: Utf8
      nullable: false
    - name: install_date
      xml_path: /report/monitoring_stations/monitoring_station/metadata/install_date
      data_type: Utf8
      nullable: false
  - name: measurements
    xml_path: /report/monitoring_stations/monitoring_station/measurements
    row_element: measurement
    levels:
    - station
    - measurement
    fields:
    - name: timestamp
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/timestamp
      data_type: Utf8
      nullable: false
    - name: temperature
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/temperature
      data_type: Float64
      nullable: false
      offset: 273.15  # Convert from Celsius to Kelvin
    - name: pressure
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/pressure
      data_type: Float64
      nullable: false
      scale: 100.0  # Convert from hPa to Pa
    - name: humidity
      xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/humidity
      data_type: Float64
      nullable: false
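
To make the levels entries concrete: in the output shown in this example, the measurements table carries a <station> index that repeats for every measurement of a station, and a <measurement> index that keeps counting across stations. The sketch below reproduces that numbering with the standard library's xml.etree.ElementTree on a trimmed copy of the XML. It illustrates the index semantics as they appear in the output and is not xml2arrow's implementation; index_measurements is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of stations.xml: two stations with 2 and 1 measurements.
XML = """
<report>
  <monitoring_stations>
    <monitoring_station>
      <id>MS001</id>
      <measurements>
        <measurement><timestamp>2024-12-30T12:39:15Z</timestamp></measurement>
        <measurement><timestamp>2024-12-30T12:44:15Z</timestamp></measurement>
      </measurements>
    </monitoring_station>
    <monitoring_station>
      <id>MS002</id>
      <measurements>
        <measurement><timestamp>2024-12-30T12:39:15Z</timestamp></measurement>
      </measurements>
    </monitoring_station>
  </monitoring_stations>
</report>
"""


def index_measurements(xml_text: str) -> list[tuple[int, int, str]]:
    """Return (station index, measurement index, timestamp) rows."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    measurement_idx = 0  # running counter, not reset per station (as in the output)
    stations = root.findall("./monitoring_stations/monitoring_station")
    for station_idx, station in enumerate(stations):
        for m in station.findall("./measurements/measurement"):
            rows.append((station_idx, measurement_idx, m.findtext("timestamp")))
            measurement_idx += 1
    return rows


for row in index_measurements(XML):
    print(row)
```

The shared <station> values are what relate a measurement row to its station row in the output tables.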

Here's how to use xml2arrow to parse the XML and YAML files and get the resulting Arrow tables:

from xml2arrow import XmlToArrowParser

parser = XmlToArrowParser("stations.yaml")     # Load configuration
record_batches = parser.parse("stations.xml")  # Parse XML using configuration

- report:
 ┌─────────────────────────────┬──────────────────────────┬──────────────────────┐
 │ title                       ┆ created_by               ┆ creation_time        │
 │ ---                         ┆ ---                      ┆ ---                  │
 │ str                         ┆ str                      ┆ str                  │
 ╞═════════════════════════════╪══════════════════════════╪══════════════════════╡
 │ Meteorological Station Data ┆ National Weather Service ┆ 2024-12-30T13:59:15Z │
 └─────────────────────────────┴──────────────────────────┴──────────────────────┘
- stations:
 ┌───────────┬───────┬────────────┬────────────┬────────────┬────────────────────────┬──────────────┐
 │ <station> ┆ id    ┆ latitude   ┆ longitude  ┆ elevation  ┆ description            ┆ install_date │
 │ ---       ┆ ---   ┆ ---        ┆ ---        ┆ ---        ┆ ---                    ┆ ---          │
 │ u32       ┆ str   ┆ f32        ┆ f32        ┆ f32        ┆ str                    ┆ str          │
 ╞═══════════╪═══════╪════════════╪════════════╪════════════╪════════════════════════╪══════════════╡
 │ 0         ┆ MS001 ┆ -61.391106 ┆ 48.086628  ┆ 547.105103 ┆ Located in the Arctic  ┆ 2024-03-31   │
 │           ┆       ┆            ┆            ┆            ┆ Tundra a…              ┆              │
 │ 1         ┆ MS002 ┆ 11.891497  ┆ 135.093369 ┆ 174.533493 ┆ Located in the Desert  ┆ 2024-01-17   │
 │           ┆       ┆            ┆            ┆            ┆ area, us…              ┆              │
 └───────────┴───────┴────────────┴────────────┴────────────┴────────────────────────┴──────────────┘
- measurements:
 ┌───────────┬───────────────┬──────────────────────┬─────────────┬───────────────┬───────────┐
 │ <station> ┆ <measurement> ┆ timestamp            ┆ temperature ┆ pressure      ┆ humidity  │
 │ ---       ┆ ---           ┆ ---                  ┆ ---         ┆ ---           ┆ ---       │
 │ u32       ┆ u32           ┆ str                  ┆ f64         ┆ f64           ┆ f64       │
 ╞═══════════╪═══════════════╪══════════════════════╪═════════════╪═══════════════╪═══════════╡
 │ 0         ┆ 0             ┆ 2024-12-30T12:39:15Z ┆ 308.636545  ┆ 95043.997349  ┆ 49.777166 │
 │ 0         ┆ 1             ┆ 2024-12-30T12:44:15Z ┆ 302.245167  ┆ 104932.150155 ┆ 32.568715 │
 │ 1         ┆ 2             ┆ 2024-12-30T12:39:15Z ┆ 297.941843  ┆ 98940.542872  ┆ 57.707949 │
 │ 1         ┆ 3             ┆ 2024-12-30T12:44:15Z ┆ 288.303691  ┆ 100141.305292 ┆ 45.450946 │
 │ 1         ┆ 4             ┆ 2024-12-30T12:49:15Z ┆ 269.127444  ┆ 100052.257518 ┆ 70.401175 │
 │ 1         ┆ 5             ┆ 2024-12-30T12:54:15Z ┆ 299.002921  ┆ 95376.27857   ┆ 42.620882 │
 └───────────┴───────────────┴──────────────────────┴─────────────┴───────────────┴───────────┘
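
The temperature and pressure columns above differ from the raw XML values because of the offset and scale settings in stations.yaml. The following sketch redoes that arithmetic for the first measurement of MS001. apply_scale_offset is a hypothetical helper, and applying scale before offset is an assumption: the example only ever sets one of the two per field, so it does not show the order.

```python
def apply_scale_offset(value: float, scale: float = 1.0, offset: float = 0.0) -> float:
    """Multiply by scale, then add offset (the order is an assumption)."""
    return value * scale + offset


# Raw values from the first <measurement> of station MS001.
temperature_c = 35.486545480326114
pressure_hpa = 950.439973486407

temperature_k = apply_scale_offset(temperature_c, offset=273.15)  # Celsius to Kelvin
pressure_pa = apply_scale_offset(pressure_hpa, scale=100.0)       # hPa to Pa

print(round(temperature_k, 6))  # 308.636545
print(round(pressure_pa, 6))    # 95043.997349
```

Both results match the first row of the measurements table.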
