xml-to-pydantic

xml-to-pydantic is a Python library that converts XML or HTML into pydantic models. It can be used to:

  • Parse and validate a scraped HTML page into a python object
  • Parse and validate an XML response from an XML-based API
  • Parse and validate data stored in XML format

(Please note that this project is not affiliated in any way with the great team at pydantic.)

pydantic is a Python library for data validation based on type hints (annotations). It lets you define simple or complex validation rules for processing external data. That data usually arrives as JSON or as a Python dictionary.

Processing and validating HTML or XML with pydantic would therefore normally take two steps: convert the HTML or XML to a Python dictionary, then convert that dictionary to a pydantic model. This library provides a convenient way to combine those steps.
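To make the two-step approach concrete, here is a minimal sketch of the first step done by hand with the standard library's xml.etree.ElementTree. The helper name xml_to_dict is hypothetical and is not part of xml-to-pydantic; it only handles a flat document whose root has simple text children.

```python
import xml.etree.ElementTree as ET

xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
    <element>4.53</element>
    <element>3.25</element>
</root>
"""


def xml_to_dict(data: bytes) -> dict[str, list[str]]:
    """Step 1 (hypothetical helper): group the text of each direct child of the root by tag."""
    root = ET.fromstring(data)
    result: dict[str, list[str]] = {}
    for child in root:
        result.setdefault(child.tag, []).append(child.text or "")
    return result


raw = xml_to_dict(xml_bytes)
print(raw)
#> {'element': ['4.53', '3.25']}

# Step 2 would pass `raw` to an ordinary pydantic model, for example
# SomeModel.model_validate(raw) with a field `element: list[float]`
# (SomeModel is an assumed name), which validates and converts the strings.
```

Every new document shape needs its own conversion code in step 1, which is the part this library aims to take over.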

Note: if you use this library to parse external, untrusted HTML or XML, be aware of the attack vectors that XML makes possible; see https://github.com/tiran/defusedxml for an overview. This library uses lxml under the hood.

Installation

Use pip, or your favorite Python package manager (pipenv, poetry, pdm, ...):

pip install xml-to-pydantic

Usage

The HTML or XML data is extracted using XPath. For simple documents, the XPath can be calculated from the model:

from xml_to_pydantic import ConfigDict, XmlBaseModel

html_bytes = b"""
<!doctype html>
<html lang="en-US">
  <head>
    <meta charset="utf-8" />
    <title>My page title</title>
  </head>

  <body>
    <header>
      <h1>Header</h1>
    </header>

    <main>
      <p>Paragraph1</p>
      <p>Paragraph2</p>
      <p>Paragraph3</p>
    </main>
  </body>
</html>
"""

class MainContent(XmlBaseModel):
    model_config = ConfigDict(xpath_root="/html/body/main")
    p: list[str]

result = MainContent.model_validate_html(html_bytes)
print(result)
#> p=['Paragraph1', 'Paragraph2', 'Paragraph3']

The same works for XML documents:

from xml_to_pydantic import XmlBaseModel


xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
    <element>4.53</element>
    <element>3.25</element>
</root>
"""


class MyModel(XmlBaseModel):
    element: list[float]


model = MyModel.model_validate_xml(xml_bytes)
print(model)
#> element=[4.53, 3.25]

However, for more complicated XML, this one-to-one correspondence between field names and element names may not be convenient. A better approach is then to supply the XPath directly (similar to how pydantic allows specifying an alias for a field):

from xml_to_pydantic import XmlBaseModel, XmlField


xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
    <element>4.53</element>
    <a href="https://example.com">Link</a>
</root>
"""


class MyModel(XmlBaseModel):
    number: float = XmlField(xpath="./element/text()")
    href: str = XmlField(xpath="./a/@href")


model = MyModel.model_validate_xml(xml_bytes)
print(model)
#> number=4.53 href='https://example.com'
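To show what the two XPath expressions above select, the following sketch looks up the same nodes with the standard library's xml.etree.ElementTree. This is only an approximation for illustration: ElementTree supports a subset of XPath without text() or @attr steps, so the text and the attribute are read from the matched elements instead. It is not how the library itself (which uses lxml) is implemented.

```python
import xml.etree.ElementTree as ET

xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
    <element>4.53</element>
    <a href="https://example.com">Link</a>
</root>
"""

root = ET.fromstring(xml_bytes)

# Rough equivalent of "./element/text()": the text content of <element>.
number_text = root.find("./element").text
# Rough equivalent of "./a/@href": the href attribute of <a>.
href = root.find("./a").get("href")

print(number_text, href)
#> 4.53 https://example.com
```

Both values come back as strings; in the model above, the conversion of '4.53' to a float is done by pydantic validation.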

The parsing can also deal with nested models and lists:

from xml_to_pydantic import XmlBaseModel, XmlField


xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
    <level1>
        <level2>value1</level2>
        <level2>value2</level2>
        <level2>value3</level2>
    </level1>
    <level11>value11</level11>
</root>
"""

class NextLevel(XmlBaseModel):
    level2: list[str] = XmlField(xpath="./level2/text()")


class MyModel(XmlBaseModel):
    next_level: NextLevel = XmlField(xpath="./level1")
    level_11: list[str] = XmlField(xpath="./level11/text()")


model = MyModel.model_validate_xml(xml_bytes)
print(model)
#> next_level=NextLevel(level2=['value1', 'value2', 'value3']) level_11=['value11']
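Judging by the example above, the XPath of a nested model appears to be evaluated relative to the element matched by the parent field. The sketch below mimics that behaviour with the standard library's xml.etree.ElementTree; it is an illustration under that assumption, not the library's actual code.

```python
import xml.etree.ElementTree as ET

xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
    <level1>
        <level2>value1</level2>
        <level2>value2</level2>
        <level2>value3</level2>
    </level1>
    <level11>value11</level11>
</root>
"""

root = ET.fromstring(xml_bytes)

# The parent field's XPath ("./level1") picks the context element ...
level1 = root.find("./level1")
# ... and the nested model's path is then resolved relative to it.
level2_values = [el.text for el in level1.findall("./level2")]

# Resolved against the root instead, the same relative path matches nothing.
assert root.findall("./level2") == []

level11_values = [el.text for el in root.findall("./level11")]
print(level2_values, level11_values)
#> ['value1', 'value2', 'value3'] ['value11']
```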

Development

Prerequisites:

  • Any Python version from 3.8 through 3.12
  • poetry for dependency management
  • git
  • make (to use the helper scripts in the Makefile)

Autoformatting can be applied by running

make lintable

Before committing, remember to run

make lint
make test
