<yaxp ⚡> Yet Another XSD Parser
Introduction
yaxp uses roxmltree to parse XML files.
It converts an XSD schema to:
- json
- arrow
- avro
- protobuf
- jsonschema
- json representation of spark schema
- duckdb (read_csv columns/types)
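To make the idea of converting an XSD into column names and types concrete, here is a minimal standard-library sketch. It is not how yaxp is implemented (yaxp is written in Rust on top of roxmltree). The type table `XSD_TO_DUCKDB`, the helper `xsd_to_columns`, and the inline sample schema are all hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# XML Schema namespace, in the "{uri}" form ElementTree uses for tag names.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical, deliberately small mapping from XSD built-in types to DuckDB types.
XSD_TO_DUCKDB = {
    "xs:string": "VARCHAR",
    "xs:dateTime": "TIMESTAMP",
    "xs:date": "DATE",
    "xs:decimal": "DECIMAL",
    "xs:integer": "INTEGER",
}

# Illustrative schema only; it is not the example.xsd used elsewhere in this README.
SAMPLE_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Record">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field1" type="xs:string"/>
        <xs:element name="Field5" type="xs:dateTime"/>
        <xs:element name="Field6" type="xs:date" minOccurs="0"/>
        <xs:element name="Field21" type="xs:integer" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def xsd_to_columns(xsd_text):
    """Return {name: {"type": ..., "nullable": ...}} for every typed xs:element."""
    root = ET.fromstring(xsd_text)
    columns = {}
    for element in root.iter(f"{XS}element"):
        xsd_type = element.get("type")
        if xsd_type is None:
            continue  # container element without a simple type, e.g. "Record"
        columns[element.get("name")] = {
            # Unknown types fall back to VARCHAR in this sketch.
            "type": XSD_TO_DUCKDB.get(xsd_type, "VARCHAR"),
            # minOccurs="0" marks an optional element, treated here as nullable.
            "nullable": element.get("minOccurs") == "0",
        }
    return columns


columns = xsd_to_columns(SAMPLE_XSD)
for name, spec in columns.items():
    print(name, spec)
```

The sketch ignores restrictions such as `maxLength` or `totalDigits`, which a real converter needs in order to emit types like `VARCHAR(15)` or `DECIMAL(25, 7)`.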
User Guide
Python
- create and activate a Python virtual environment (or use poetry, uv, etc.)
- install maturin (cargo install, pip install into venv, etc.)
(.venv) ~/projects/yaxp/crates/pyaxp $
🔗 Found pyo3 bindings
🐍 Found CPython 3.12 at ~/projects/yaxp/crates/pyaxp/.venv/bin/python
📡 Using build options features from pyproject.toml
warning: ~/projects/yaxp/Cargo.toml: unused manifest key: workspace.name
Blocking waiting for file lock on build directory
Compiling pyo3-build-config v0.23.4
Compiling pyo3-macros-backend v0.23.4
Compiling pyo3-ffi v0.23.4
Compiling pyo3 v0.23.4
Compiling pyo3-macros v0.23.4
Compiling yaxp-common v0.1.0 (~/Users/jeroen~/projects/yaxp/crates/yaxp-common)
Compiling pyaxp v0.1.0 (~/Users/jeroen~/projects/yaxp/crates/pyaxp)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 5.03s
📦 Built wheel for CPython 3.12 to /var/folders/gr/gl3fzn_n0_g4fzpcfv2g40gh0000gn/T/.tmp3wQ0CY/pyaxp-0.1.0-cp312-cp312-macosx_11_0_arm64.whl
✏️ Setting installed package as editable
🛠 Installed pyaxp-0.1.0
(.venv) ~/projects/yaxp/crates/pyaxp $
Python 3.12.3 (main, Apr 15 2024, 17:43:11) [Clang 17.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>>
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import (
... StructType, StructField, StringType, TimestampType, DateType, DecimalType, IntegerType
... )
>>> from pyaxp import parse_xsd
>>>
>>> from datetime import datetime, date
>>> from decimal import Decimal
>>>
>>> data = [
... ("A1", "B1", "C1", "D1", datetime(2024, 2, 1, 10, 30, 0), date(2024, 2, 1), date(2024, 1, 31),
... "E1", "F1", "G1", "H1", Decimal("123456789012345678.1234567"), "I1", "J1", "K1", "L1",
... date(2024, 2, 1), "M1", "N1", Decimal("100"), 10),
...
... ("A2", "B2", "C2", None, datetime(2024, 2, 1, 11, 0, 0), None, date(2024, 1, 30),
... "E2", None, "G2", "H2", None, "I2", "J2", "K2", "L2",
... date(2024, 2, 2), "M2", "N2", Decimal("200"), 20),
...
... ("A3", "B3", "C3", "D3", datetime(2024, 2, 1, 12, 15, 0), date(2024, 2, 3), None,
... "E3", "F3", None, "H3", Decimal("98765432109876543.7654321"), "I3", None, "K3", "L3",
... date(2024, 2, 3), "M3", "N3", None, None)
... ]
>>>
>>>
>>> spark = SparkSession.builder.master("local").appName("Test Data").getOrCreate()
25/02/01 16:27:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/01 16:27:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 25/02/01 16:27:42 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
>>> j = parse_xsd("example.xsd", "spark")
>>> spark_schema = StructType.fromJson(json.loads(j))
>>> df = spark.createDataFrame(data, schema=spark_schema)
>>>
>>> df.printSchema()
root
|-- Field1: string (nullable = false)
|-- Field2: string (nullable = false)
|-- Field3: string (nullable = false)
|-- Field4: string (nullable = true)
|-- Field5: timestamp (nullable = false)
|-- Field6: date (nullable = true)
|-- Field7: date (nullable = true)
|-- Field8: string (nullable = false)
|-- Field9: string (nullable = true)
|-- Field10: string (nullable = true)
|-- Field11: string (nullable = true)
|-- Field12: decimal(25,7) (nullable = true)
|-- Field13: string (nullable = true)
|-- Field14: string (nullable = true)
|-- Field15: string (nullable = false)
|-- Field16: string (nullable = true)
|-- Field17: date (nullable = false)
|-- Field18: string (nullable = true)
|-- Field19: string (nullable = true)
|-- Field20: decimal(10,0) (nullable = true)
|-- Field21: integer (nullable = true)
>>> df.schema
StructType([StructField('Field1', StringType(), False), StructField('Field2', StringType(), False), StructField('Field3', StringType(), False), StructField('Field4', StringType(), True), StructField('Field5', TimestampType(), False), StructField('Field6', DateType(), True), StructField('Field7', DateType(), True), StructField('Field8', StringType(), False), StructField('Field9', StringType(), True), StructField('Field10', StringType(), True), StructField('Field11', StringType(), True), StructField('Field12', DecimalType(25,7), True), StructField('Field13', StringType(), True), StructField('Field14', StringType(), True), StructField('Field15', StringType(), False), StructField('Field16', StringType(), True), StructField('Field17', DateType(), False), StructField('Field18', StringType(), True), StructField('Field19', StringType(), True), StructField('Field20', DecimalType(10,0), True), StructField('Field21', IntegerType(), True)])
>>> df.dtypes
[('Field1', 'string'), ('Field2', 'string'), ('Field3', 'string'), ('Field4', 'string'), ('Field5', 'timestamp'), ('Field6', 'date'), ('Field7', 'date'), ('Field8', 'string'), ('Field9', 'string'), ('Field10', 'string'), ('Field11', 'string'), ('Field12', 'decimal(25,7)'), ('Field13', 'string'), ('Field14', 'string'), ('Field15', 'string'), ('Field16', 'string'), ('Field17', 'date'), ('Field18', 'string'), ('Field19', 'string'), ('Field20', 'decimal(10,0)'), ('Field21', 'int')]
>>>
>>> df.show()
+------+------+------+------+-------------------+----------+----------+------+------+-------+-------+--------------------+-------+-------+-------+-------+----------+-------+-------+-------+-------+
|Field1|Field2|Field3|Field4| Field5| Field6| Field7|Field8|Field9|Field10|Field11| Field12|Field13|Field14|Field15|Field16| Field17|Field18|Field19|Field20|Field21|
+------+------+------+------+-------------------+----------+----------+------+------+-------+-------+--------------------+-------+-------+-------+-------+----------+-------+-------+-------+-------+
| A1| B1| C1| D1|2024-02-01 10:30:00|2024-02-01|2024-01-31| E1| F1| G1| H1|12345678901234567...| I1| J1| K1| L1|2024-02-01| M1| N1| 100| 10|
| A2| B2| C2| NULL|2024-02-01 11:00:00| NULL|2024-01-30| E2| NULL| G2| H2| NULL| I2| J2| K2| L2|2024-02-02| M2| N2| 200| 20|
| A3| B3| C3| D3|2024-02-01 12:15:00|2024-02-03| NULL| E3| F3| NULL| H3|98765432109876543...| I3| NULL| K3| L3|2024-02-03| M3| N3| NULL| NULL|
+------+------+------+------+-------------------+----------+----------+------+------+-------+-------+--------------------+-------+-------+-------+-------+----------+-------+-------+-------+-------+
>>>
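In the session above, `parse_xsd("example.xsd", "spark")` returns a JSON string that is decoded with `json.loads` and passed to `StructType.fromJson`. The sketch below shows the general layout Spark uses for a schema in JSON form and how to list the non-nullable fields before building a DataFrame. The `schema_json` value is hand-written for illustration and is not actual pyaxp output; `required_fields` is a hypothetical helper.

```python
import json

# Hand-written sample in Spark's schema JSON layout (an assumption for this
# sketch; the real string comes from parse_xsd(..., "spark")).
schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "Field1", "type": "string", "nullable": False, "metadata": {}},
        {"name": "Field4", "type": "string", "nullable": True, "metadata": {}},
        {"name": "Field12", "type": "decimal(25,7)", "nullable": True, "metadata": {}},
    ],
})


def required_fields(schema_text):
    """Return the names of fields that may not contain None."""
    schema = json.loads(schema_text)
    return [field["name"] for field in schema["fields"] if not field["nullable"]]


print(required_fields(schema_json))
```

Checking this list first can help explain errors from `createDataFrame`, which rejects `None` in a field declared with `nullable = false`.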
With DuckDB
$ python
Python 3.12.3 (main, Apr 15 2024, 17:43:11) [Clang 17.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import duckdb
>>> from pyaxp import parse_xsd
>>>
>>> j = parse_xsd("example.xsd", "duckdb")
>>> res = duckdb.sql(f"select * from read_csv('example-data.csv', columns={j})")
>>> res
┌─────────┬─────────┬─────────┬─────────┬─────────────────────┬────────────┬────────────┬─────────┬───┬─────────┬─────────┬─────────┬─────────┬────────────┬─────────┬─────────┬───────────────┬─────────┐
│ Field1 │ Field2 │ Field3 │ Field4 │ Field5 │ Field6 │ Field7 │ Field8 │ … │ Field13 │ Field14 │ Field15 │ Field16 │ Field17 │ Field18 │ Field19 │ Field20 │ Field21 │
│ varchar │ varchar │ varchar │ varchar │ timestamp │ date │ date │ varchar │ │ varchar │ varchar │ varchar │ varchar │ date │ varchar │ varchar │ decimal(25,7) │ int32 │
├─────────┼─────────┼─────────┼─────────┼─────────────────────┼────────────┼────────────┼─────────┼───┼─────────┼─────────┼─────────┼─────────┼────────────┼─────────┼─────────┼───────────────┼─────────┤
│ A1 │ B1 │ C1 │ D1 │ 2024-02-01 09:30:00 │ 2024-02-01 │ 2024-01-31 │ E1 │ … │ I1 │ J1 │ K1 │ L1 │ 2024-02-01 │ M1 │ N1 │ 100.0000000 │ 10 │
│ A2 │ B2 │ C2 │ NULL │ 2024-02-01 10:00:00 │ NULL │ 2024-01-30 │ E2 │ … │ I2 │ J2 │ K2 │ L2 │ 2024-02-02 │ M2 │ N2 │ 200.0000000 │ 20 │
│ A3 │ B3 │ C3 │ D3 │ 2024-02-01 11:15:00 │ 2024-02-03 │ NULL │ E3 │ … │ I3 │ NULL │ K3 │ L3 │ 2024-02-03 │ M3 │ N3 │ NULL │ NULL │
├─────────┴─────────┴─────────┴─────────┴─────────────────────┴────────────┴────────────┴─────────┴───┴─────────┴─────────┴─────────┴─────────┴────────────┴─────────┴─────────┴───────────────┴─────────┤
│ 3 rows 21 columns (17 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
>>> j
{'Field1': 'VARCHAR(15)', 'Field2': 'VARCHAR(20)', 'Field3': 'VARCHAR(10)', 'Field4': 'VARCHAR(50)', 'Field5': 'TIMESTAMP', 'Field6': 'DATE', 'Field7': 'DATE', 'Field8': 'VARCHAR(10)', 'Field9': 'VARCHAR(3)', 'Field10': 'VARCHAR(30)', 'Field11': 'VARCHAR(10)', 'Field12': 'DECIMAL(25, 7)', 'Field13': 'VARCHAR(255)', 'Field14': 'VARCHAR(255)', 'Field15': 'VARCHAR(255)', 'Field16': 'VARCHAR(255)', 'Field17': 'DATE', 'Field18': 'VARCHAR(30)', 'Field19': 'VARCHAR(255)', 'Field20': 'DECIMAL(25, 7)', 'Field21': 'INTEGER'}
>>>
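The query above works because `j` is a Python dict, and formatting a dict inside an f-string yields text of the form `{'name': 'TYPE', ...}`, which DuckDB reads as a struct literal for the `columns` argument of `read_csv`. The sketch below reproduces only the string-building step with a shortened, hand-written dict; `build_read_csv_query` is a hypothetical helper, not part of pyaxp.

```python
# Shortened, hand-written stand-in for the dict returned by parse_xsd(..., "duckdb").
columns = {"Field1": "VARCHAR(15)", "Field5": "TIMESTAMP", "Field21": "INTEGER"}


def build_read_csv_query(path, columns):
    """Embed the column dict in a read_csv call via Python's dict formatting."""
    return f"select * from read_csv('{path}', columns={columns})"


query = build_read_csv_query("example-data.csv", columns)
print(query)
# select * from read_csv('example-data.csv', columns={'Field1': 'VARCHAR(15)', 'Field5': 'TIMESTAMP', 'Field21': 'INTEGER'})
```

This relies on column names and type strings containing no single quotes; a name with a quote in it would need escaping before being placed in the SQL text.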
TODO
- Add pyo3/maturin support
- Add tests