Efficiently convert XML data to Apache Arrow format.
XML2ARROW-PYTHON
A Python package for efficiently converting XML files to Apache Arrow tables using a YAML configuration. This package leverages the xml2arrow Rust crate for high performance.
Installation
pip install xml2arrow
Usage
- Create a Configuration File (YAML):
The configuration file (YAML format) defines how your XML structure maps to Arrow tables and fields. Here's a detailed explanation of the configuration structure:
tables:
  - name: <table_name>         # The name of the resulting Arrow table
    xml_path: <xml_path>       # The XML path to the *parent* element of the table's row elements
    levels:                    # Index levels for nested XML structures
      - <level1>
      - <level2>
    fields:
      - name: <field_name>     # The name of the Arrow field
        xml_path: <field_path> # The XML path to the field within a row
        data_type: <data_type> # The Arrow data type (see below)
        nullable: <true|false> # Whether the field can be null
        scale: <number>        # Optional scaling factor for floats
        offset: <number>       # Optional offset for floats
  - name: ...
- tables: A list of table configurations. Each entry defines a separate Arrow table to be extracted from the XML.
  - name: The name given to the resulting Arrow RecordBatch (which represents a table).
  - xml_path: An XPath-like string that specifies the XML element that is the parent of the elements representing rows in the table. For example, if your XML contains <library><book>...</book><book>...</book></library>, the xml_path would be /library.
  - levels: An array of strings that represent parent tables to create an index for nested structures. If the XML structure is /library/shelfs/shelf/books/book you should define levels like this: levels: ["shelfs", "books"]. This will create indexes named <shelfs> and <books>.
  - fields: A list of field configurations for each column in the Arrow table.
    - name: The name of the field in the Arrow schema.
    - xml_path: An XPath-like string that specifies the XML element or attribute containing the field's value. To select an attribute, append @ followed by the attribute name to the element's path. For example, /library/book/@id selects the id attribute of the book element.
    - data_type: The Arrow data type of the field. Supported types are:
      - Boolean (true or false)
      - Int16
      - UInt16
      - Int32
      - UInt32
      - Int64
      - UInt64
      - Float32
      - Float64
      - Utf8 (Strings)
    - nullable: A boolean value indicating whether the field can contain null values. This field is optional and defaults to false if not specified.
    - scale (Optional): A scaling factor for float fields (e.g., to convert units).
    - offset (Optional): An offset value for float fields (e.g., to convert units).
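The effect of scale and offset can be sketched in plain Python. The snippet below assumes the parsed value is multiplied by scale first and offset is added afterwards. That ordering is an assumption, since the stations example later in this document only ever uses one of the two per field, and apply_scale_offset is a hypothetical helper that is not part of xml2arrow.

```python
from typing import Optional


def apply_scale_offset(raw: float, scale: Optional[float] = None,
                       offset: Optional[float] = None) -> float:
    """Hypothetical helper: assumed order is value * scale + offset."""
    value = raw
    if scale is not None:
        value *= scale
    if offset is not None:
        value += offset
    return value


# Celsius to Kelvin: offset only.
kelvin = apply_scale_offset(35.486545480326114, offset=273.15)
# hPa to Pa: scale only.
pascal = apply_scale_offset(950.439973486407, scale=100.0)
print(round(kelvin, 6), round(pascal, 6))
```

The two inputs are the first temperature and pressure readings from the stations example, so the printed values can be compared with the first row of the measurements table shown there.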
- Parse the XML
from xml2arrow import XmlToArrowParser
parser = XmlToArrowParser("config.yaml") # Load configuration
record_batches = parser.parse("data.xml") # Parse XML using configuration
Example
Suppose we have the following XML file (stations.xml):
<report>
  <header>
    <title>Meteorological Station Data</title>
    <created_by>National Weather Service</created_by>
    <creation_time>2024-12-30T13:59:15Z</creation_time>
  </header>
  <monitoring_stations>
    <monitoring_station id="MS001">
      <location>
        <latitude>-61.39110459389277</latitude>
        <longitude>48.08662749089257</longitude>
        <elevation unit="m">547.1050788360882</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">35.486545480326114</temperature>
          <pressure unit="hPa">950.439973486407</pressure>
          <humidity unit="%">49.77716576844861</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">29.095166644493865</temperature>
          <pressure unit="hPa">1049.3215015450517</pressure>
          <humidity unit="%">32.5687148391251</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Arctic Tundra area, used for Scientific Research.</description>
        <install_date>2024-03-31</install_date>
      </metadata>
    </monitoring_station>
    <monitoring_station id="MS002">
      <location>
        <latitude>11.891496388319311</latitude>
        <longitude>135.09336983543022</longitude>
        <elevation unit="m">174.53349357280004</elevation>
      </location>
      <measurements>
        <measurement>
          <timestamp>2024-12-30T12:39:15Z</timestamp>
          <temperature unit="C">24.791842953632283</temperature>
          <pressure unit="hPa">989.4054287187706</pressure>
          <humidity unit="%">57.70794884397625</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:44:15Z</timestamp>
          <temperature unit="C">15.153690541845911</temperature>
          <pressure unit="hPa">1001.413052919951</pressure>
          <humidity unit="%">45.45094598045342</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:49:15Z</timestamp>
          <temperature unit="C">-4.022555715139081</temperature>
          <pressure unit="hPa">1000.5225751769922</pressure>
          <humidity unit="%">70.40117458947834</humidity>
        </measurement>
        <measurement>
          <timestamp>2024-12-30T12:54:15Z</timestamp>
          <temperature unit="C">25.852920542644185</temperature>
          <pressure unit="hPa">953.762785698162</pressure>
          <humidity unit="%">42.62088244545566</humidity>
        </measurement>
      </measurements>
      <metadata>
        <description>Located in the Desert area, used for Weather Forecasting.</description>
        <install_date>2024-01-17</install_date>
      </metadata>
    </monitoring_station>
  </monitoring_stations>
</report>
We can define a YAML configuration file (stations.yaml) to specify how to convert the XML data to Arrow tables:
tables:
  - name: report
    xml_path: /
    levels: []
    fields:
      - name: title
        xml_path: /report/header/title
        data_type: Utf8
        nullable: false
      - name: created_by
        xml_path: /report/header/created_by
        data_type: Utf8
        nullable: false
      - name: creation_time
        xml_path: /report/header/creation_time
        data_type: Utf8
        nullable: false
  - name: stations
    xml_path: /report/monitoring_stations
    levels:
      - station
    fields:
      - name: id
        xml_path: /report/monitoring_stations/monitoring_station/@id # Path to an attribute
        data_type: Utf8
        nullable: false
      - name: latitude
        xml_path: /report/monitoring_stations/monitoring_station/location/latitude
        data_type: Float32
        nullable: false
      - name: longitude
        xml_path: /report/monitoring_stations/monitoring_station/location/longitude
        data_type: Float32
        nullable: false
      - name: elevation
        xml_path: /report/monitoring_stations/monitoring_station/location/elevation
        data_type: Float32
        nullable: false
      - name: description
        xml_path: /report/monitoring_stations/monitoring_station/metadata/description
        data_type: Utf8
        nullable: false
      - name: install_date
        xml_path: /report/monitoring_stations/monitoring_station/metadata/install_date
        data_type: Utf8
        nullable: false
  - name: measurements
    xml_path: /report/monitoring_stations/monitoring_station/measurements
    levels:
      - station
      - measurement
    fields:
      - name: timestamp
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/timestamp
        data_type: Utf8
        nullable: false
      - name: temperature
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/temperature
        data_type: Float64
        nullable: false
        offset: 273.15 # Convert from Celsius to Kelvin
      - name: pressure
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/pressure
        data_type: Float64
        nullable: false
        scale: 100.0 # Convert from hPa to Pa
      - name: humidity
        xml_path: /report/monitoring_stations/monitoring_station/measurements/measurement/humidity
        data_type: Float64
        nullable: false
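To make the xml_path convention concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree rather than xml2arrow itself. The resolve_path helper is hypothetical: it walks an absolute path like those in stations.yaml and treats a trailing @name segment as an attribute lookup. It only illustrates how the paths relate to the document; xml2arrow's own parser lives in the Rust crate and may behave differently.

```python
import xml.etree.ElementTree as ET

# A trimmed-down copy of stations.xml, kept inline so the sketch is self-contained.
XML = """
<report>
  <monitoring_stations>
    <monitoring_station id="MS001">
      <location><latitude>-61.39110459389277</latitude></location>
    </monitoring_station>
    <monitoring_station id="MS002">
      <location><latitude>11.891496388319311</latitude></location>
    </monitoring_station>
  </monitoring_stations>
</report>
""".strip()


def resolve_path(root: ET.Element, path: str) -> list[str]:
    """Hypothetical helper: return every value an absolute xml_path selects."""
    parts = path.strip("/").split("/")
    attribute = None
    if parts[-1].startswith("@"):
        # A trailing "@name" segment selects an attribute of the parent element.
        attribute = parts.pop()[1:]
    if not parts or parts[0] != root.tag:
        return []
    # ElementTree paths are relative to the root element, so drop its tag.
    relative = "/".join(parts[1:])
    elements = root.findall(relative) if relative else [root]
    if attribute is not None:
        return [e.attrib[attribute] for e in elements if attribute in e.attrib]
    return [(e.text or "").strip() for e in elements]


root = ET.fromstring(XML)
ids = resolve_path(root, "/report/monitoring_stations/monitoring_station/@id")
latitudes = resolve_path(
    root, "/report/monitoring_stations/monitoring_station/location/latitude"
)
print(ids)
print(latitudes)
```

Each path selects one value per monitoring_station element, which is why the stations table ends up with one row per station.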
Here's how to use xml2arrow to parse the XML and YAML files and get the resulting Arrow tables:
from xml2arrow import XmlToArrowParser
parser = XmlToArrowParser("stations.yaml") # Load configuration
record_batches = parser.parse("stations.xml") # Parse XML using configuration
- report:
┌─────────────────────────────┬──────────────────────────┬──────────────────────┐
│ title                       ┆ created_by               ┆ creation_time        │
│ ---                         ┆ ---                      ┆ ---                  │
│ str                         ┆ str                      ┆ str                  │
╞═════════════════════════════╪══════════════════════════╪══════════════════════╡
│ Meteorological Station Data ┆ National Weather Service ┆ 2024-12-30T13:59:15Z │
└─────────────────────────────┴──────────────────────────┴──────────────────────┘
- stations:
┌───────────┬───────┬────────────┬────────────┬────────────┬────────────────────────┬──────────────┐
│ <station> ┆ id    ┆ latitude   ┆ longitude  ┆ elevation  ┆ description            ┆ install_date │
│ ---       ┆ ---   ┆ ---        ┆ ---        ┆ ---        ┆ ---                    ┆ ---          │
│ u32       ┆ str   ┆ f32        ┆ f32        ┆ f32        ┆ str                    ┆ str          │
╞═══════════╪═══════╪════════════╪════════════╪════════════╪════════════════════════╪══════════════╡
│ 0         ┆ MS001 ┆ -61.391106 ┆ 48.086628  ┆ 547.105103 ┆ Located in the Arctic  ┆ 2024-03-31   │
│           ┆       ┆            ┆            ┆            ┆ Tundra a…              ┆              │
│ 1         ┆ MS002 ┆ 11.891497  ┆ 135.093369 ┆ 174.533493 ┆ Located in the Desert  ┆ 2024-01-17   │
│           ┆       ┆            ┆            ┆            ┆ area, us…              ┆              │
└───────────┴───────┴────────────┴────────────┴────────────┴────────────────────────┴──────────────┘
- measurements:
┌───────────┬───────────────┬──────────────────────┬─────────────┬───────────────┬───────────┐
│ <station> ┆ <measurement> ┆ timestamp            ┆ temperature ┆ pressure      ┆ humidity  │
│ ---       ┆ ---           ┆ ---                  ┆ ---         ┆ ---           ┆ ---       │
│ u32       ┆ u32           ┆ str                  ┆ f64         ┆ f64           ┆ f64       │
╞═══════════╪═══════════════╪══════════════════════╪═════════════╪═══════════════╪═══════════╡
│ 0         ┆ 0             ┆ 2024-12-30T12:39:15Z ┆ 308.636545  ┆ 95043.997349  ┆ 49.777166 │
│ 0         ┆ 1             ┆ 2024-12-30T12:44:15Z ┆ 302.245167  ┆ 104932.150155 ┆ 32.568715 │
│ 1         ┆ 2             ┆ 2024-12-30T12:39:15Z ┆ 297.941843  ┆ 98940.542872  ┆ 57.707949 │
│ 1         ┆ 3             ┆ 2024-12-30T12:44:15Z ┆ 288.303691  ┆ 100141.305292 ┆ 45.450946 │
│ 1         ┆ 4             ┆ 2024-12-30T12:49:15Z ┆ 269.127444  ┆ 100052.257518 ┆ 70.401175 │
│ 1         ┆ 5             ┆ 2024-12-30T12:54:15Z ┆ 299.002921  ┆ 95376.27857   ┆ 42.620882 │
└───────────┴───────────────┴──────────────────────┴─────────────┴───────────────┴───────────┘
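The <station> index column in the measurements table matches the <station> value of the corresponding row in the stations table, so the two tables can be related through it. Here is a minimal sketch of that lookup in plain Python, with a few values copied from the tables above into ordinary lists of dicts. The variable names are illustrative only; with real output you would more likely join the Arrow tables using a dataframe library.

```python
# Rows copied from the stations and measurements tables above (subset of columns).
stations = [
    {"<station>": 0, "id": "MS001"},
    {"<station>": 1, "id": "MS002"},
]
measurements = [
    {"<station>": 0, "<measurement>": 0, "timestamp": "2024-12-30T12:39:15Z"},
    {"<station>": 0, "<measurement>": 1, "timestamp": "2024-12-30T12:44:15Z"},
    {"<station>": 1, "<measurement>": 2, "timestamp": "2024-12-30T12:39:15Z"},
    {"<station>": 1, "<measurement>": 3, "timestamp": "2024-12-30T12:44:15Z"},
    {"<station>": 1, "<measurement>": 4, "timestamp": "2024-12-30T12:49:15Z"},
    {"<station>": 1, "<measurement>": 5, "timestamp": "2024-12-30T12:54:15Z"},
]

# Map each <station> index to its station id, then attach it to every measurement.
station_ids = {row["<station>"]: row["id"] for row in stations}
joined = [
    {"station_id": station_ids[m["<station>"]], "timestamp": m["timestamp"]}
    for m in measurements
]

# Count how many measurements belong to each station.
counts = {}
for row in joined:
    counts[row["station_id"]] = counts.get(row["station_id"], 0) + 1
print(counts)
```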