Import complex XML files to a relational database

Loading XML files into a relational database

xml2db is a Python package that parses XML files and loads them into a relational database. It handles complex XML files that cannot be denormalized into flat tables, and it works out of the box, without any custom mapping rules.

It fits an Extract, Load, Transform (ELT) data pipeline pattern: it loads XML files into a relational data model that is very close to the source data, yet easy to work with.

Starting from an XSD schema that describes a given XML structure, xml2db builds a data model, i.e. a set of database tables linked to each other by foreign key relationships. It can then parse XML files and load them into the database, and get them back from the database in XML format if needed.

Loading XML files into a relational database with xml2db can be as simple as:

from xml2db import DataModel

# Create a data model of tables with relations based on the XSD file
data_model = DataModel(
    xsd_file="path/to/file.xsd", 
    connection_string="postgresql+psycopg2://testuser:testuser@localhost:5432/testdb",
)
# Parse an XML file based on this XSD
document = data_model.parse_xml(
    xml_file="path/to/file.xml"
)
# Insert the document content into the database
document.insert_into_target_tables()

The data model created by xml2db will be close to the XSD schema. However, xml2db will perform a few systematic simplifications aimed at limiting the complexity of the resulting data model and the storage footprint. The resulting data model can be configured, but the above code will work out of the box, with reasonable defaults.
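To picture what "tables linked by foreign keys" means for a nested document, here is a hand-written sketch using only the Python standard library. It does not use xml2db and does not reflect its internals or the tables it would actually generate; the XML sample, the table names (`orders`, `items`) and the column names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample: a parent element with a repeated child element.
XML = """
<orders>
  <order id="A1">
    <customer>Alice</customer>
    <item sku="X" qty="2"/>
    <item sku="Y" qty="1"/>
  </order>
  <order id="B2">
    <customer>Bob</customer>
    <item sku="X" qty="5"/>
  </order>
</orders>
"""


def load(conn, xml_text):
    """Store each <order> as a parent row and each <item> as a child row."""
    conn.executescript(
        """
        CREATE TABLE orders (pk INTEGER PRIMARY KEY, id TEXT, customer TEXT);
        CREATE TABLE items (
            pk INTEGER PRIMARY KEY,
            fk_order INTEGER REFERENCES orders(pk),
            sku TEXT,
            qty INTEGER
        );
        """
    )
    root = ET.fromstring(xml_text)
    for order in root.findall("order"):
        cur = conn.execute(
            "INSERT INTO orders (id, customer) VALUES (?, ?)",
            (order.get("id"), order.findtext("customer")),
        )
        order_pk = cur.lastrowid  # key of the parent row just inserted
        for item in order.findall("item"):
            # The repeated child cannot fit in the parent row, so it gets
            # its own table and points back to the parent via fk_order.
            conn.execute(
                "INSERT INTO items (fk_order, sku, qty) VALUES (?, ?, ?)",
                (order_pk, item.get("sku"), int(item.get("qty"))),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, XML)

# Joining on the foreign key rebuilds the parent/child relationship.
rows = conn.execute(
    "SELECT o.id, i.sku, i.qty FROM orders o "
    "JOIN items i ON i.fk_order = o.pk ORDER BY o.id, i.sku"
).fetchall()
print(rows)
```

The repeated `item` element is what prevents flattening into a single table: each order can hold any number of items, so the child rows live in a separate table and reference their parent.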

The raw data loaded into the database can then be processed, if need be, using for instance dbt, SQL views or stored procedures that extract, correct and format the data into more user-friendly tables.
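As an illustration of that post-processing step, the sketch below defines a SQL view on top of raw tables. It uses an in-memory SQLite database from the standard library only so that it runs anywhere; the `raw_order` and `raw_item` tables and their columns are hypothetical stand-ins, not tables generated by xml2db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical raw tables, standing in for a freshly loaded data model.
    CREATE TABLE raw_order (pk INTEGER PRIMARY KEY, order_ref TEXT, order_date TEXT);
    CREATE TABLE raw_item (
        pk INTEGER PRIMARY KEY,
        fk_order INTEGER REFERENCES raw_order(pk),
        sku TEXT,
        qty TEXT
    );
    INSERT INTO raw_order VALUES (1, ' a1 ', '2024-01-15'), (2, 'B2', '2024-02-03');
    INSERT INTO raw_item VALUES (1, 1, 'X', '2'), (2, 1, 'Y', '1'), (3, 2, 'X', '5');

    -- User-friendly view: cleaned reference, typed quantity, one row per order.
    CREATE VIEW order_summary AS
    SELECT
        UPPER(TRIM(o.order_ref)) AS order_ref,
        o.order_date,
        COUNT(i.pk) AS line_count,
        SUM(CAST(i.qty AS INTEGER)) AS total_qty
    FROM raw_order o
    LEFT JOIN raw_item i ON i.fk_order = o.pk
    GROUP BY o.pk;
    """
)

summary = conn.execute("SELECT * FROM order_summary ORDER BY order_ref").fetchall()
print(summary)
```

Keeping the raw tables untouched and putting corrections in a view means the cleaning logic can change without reloading the XML files.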

This package uses SQLAlchemy to interact with the database, so it should work with different database backends. Automated integration tests run against PostgreSQL, MySQL, MS SQL Server and DuckDB. You may have to install additional packages to connect to your database (e.g. psycopg2 for PostgreSQL, pymysql for MySQL, pyodbc for MS SQL Server or duckdb_engine for DuckDB).
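A small helper can make the driver requirement explicit before connecting. The sketch below uses only the standard library to read the dialect and optional driver from a SQLAlchemy-style connection string and report which of the packages named above it calls for. The function name `required_driver`, the mapping and the DuckDB example URL are illustrative assumptions, not part of xml2db; check your backend's documentation for the exact URL format.

```python
from urllib.parse import urlsplit

# Dialect -> driver package, following the list in the paragraph above.
DRIVER_PACKAGES = {
    "postgresql": "psycopg2",
    "mysql": "pymysql",
    "mssql": "pyodbc",
    "duckdb": "duckdb_engine",
}


def required_driver(connection_string):
    """Return the driver package suggested by a connection string.

    The URL scheme has the form "dialect" or "dialect+driver"; a driver
    named explicitly in the URL takes precedence over the default mapping.
    """
    scheme = urlsplit(connection_string).scheme
    dialect, _, driver = scheme.partition("+")
    if dialect not in DRIVER_PACKAGES:
        raise ValueError(f"Unknown dialect: {dialect!r}")
    return driver or DRIVER_PACKAGES[dialect]


pg_driver = required_driver(
    "postgresql+psycopg2://testuser:testuser@localhost:5432/testdb"
)
duck_driver = required_driver("duckdb:///example.duckdb")  # hypothetical path
print(pg_driver, duck_driver)
```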

Please read the package documentation website for all the details!

Installation

The package can be installed, preferably in a virtual environment, using pip:

pip install xml2db

Testing

Running the tests requires installing additional development dependencies, after cloning the repo, with:

pip install -e ".[tests,docs]"

Run all tests with the following command:

python -m pytest

Integration tests require write access to a PostgreSQL or MS SQL Server database; the connection string is provided through the DB_STRING environment variable. If you want to run only the conversion tests, which do not require a database, you can run:

pytest -m "not dbtest"

Contributing

xml2db is developed and used at the French energy regulation authority (CRE) to process complex XML data.

Contributions and bug reports are welcome; the project's issue page is the place to start.
