
pydantic-to-pyarrow


pydantic-to-pyarrow is a Python library that helps convert pydantic models to pyarrow schemas.

(Please note that this project is not affiliated in any way with the great teams at pydantic or pyarrow.)

pydantic is a Python library for data validation using type hints / annotations. It enables the creation of simple or complex data validation rules.

pyarrow is a Python library for using Apache Arrow, a development platform for in-memory analytics. The library also makes it easy to write parquet files.

Why might you want to convert models to schemas? One scenario is for a data processing pipeline:

  1. Import / extract the data from its source
  2. Validate the data using pydantic
  3. Process the data in pyarrow / pandas / polars
  4. Store the raw and / or processed data in parquet.

The easiest approach for steps 3 and 4 above is to let pyarrow infer the schema from the data. The most involved approach is to specify the pyarrow schema separately from the pydantic model. In between, many applications could benefit from converting the pydantic model to a pyarrow schema. This library aims to achieve that.
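
As a rough sketch of that pipeline (the Record model and the sample rows below are hypothetical, and pydantic v2 method names are assumed):

import pyarrow as pa
import pyarrow.parquet as pq
from pydantic import BaseModel

from pydantic_to_pyarrow import get_pyarrow_schema


class Record(BaseModel):
    name: str
    value: int


# 1 & 2. Extract the raw data and validate it with pydantic
raw_rows = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]
records = [Record.model_validate(row) for row in raw_rows]

# 3. Process in pyarrow, using the schema converted from the model
schema = get_pyarrow_schema(Record)
table = pa.Table.from_pylist([r.model_dump() for r in records], schema=schema)

# 4. Store in parquet
pq.write_table(table, "records.parquet")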

Installation

pip install pydantic-to-pyarrow

Note: PyArrow versions < 15 are only compatible with NumPy 1.x, but they do not express this in their dependency constraints. If other constraints force you to use PyArrow < 15 on Python 3.9+, and you see errors like 'A module that was compiled using NumPy 1.x cannot be run in NumPy 2.x ...', then try pinning NumPy 1.x in your project's dependencies.
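
For example, with pip this might look like the following (the version bounds are illustrative, not exact requirements):

pip install pydantic-to-pyarrow "numpy<2"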

Conversion Table

The conversions below still run into the possibility of overflow in the pyarrow types. For example, in Python 3 the int type is unbounded, whereas the pa.int64() type has a fixed maximum. In most cases this should not be an issue, but if you are concerned about overflows, you should not use this library and should manually specify the full schema.

Python / Pydantic               Pyarrow                                                     Overflow
str                             pa.string()
Literal[strings]                pa.dictionary(pa.int32(), pa.string())
int                             pa.int64() if no minimum constraint,                        Yes, at 2^63 (signed) or 2^64 (unsigned)
                                pa.uint64() if the minimum is zero
Literal[ints]                   pa.int64()
float                           pa.float64()                                                Yes
decimal.Decimal                 pa.decimal128 ONLY if the pydantic field supplies           Yes
                                max_digits and decimal_places
datetime.date                   pa.date32()
datetime.time                   pa.time64("us")
datetime.datetime               pa.timestamp("ms", tz=None) ONLY if allow_losing_tz=True
pydantic.types.NaiveDatetime    pa.timestamp("ms", tz=None)
pydantic.types.AwareDatetime    pa.timestamp("ms", tz=None) ONLY if allow_losing_tz=True
Optional[...]                   the pyarrow field is nullable
Pydantic Model                  pa.struct()
List[...]                       pa.list_(...)
Dict[..., ...]                  pa.map_(key_type, value_type)
Enum of str                     pa.dictionary(pa.int32(), pa.string())
Enum of int                     pa.int64()
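
As a quick illustration of a few rows from the table (the Example model and its fields are hypothetical; the comments reflect the mappings described above):

import decimal
from typing import Optional

from pydantic import BaseModel, Field

from pydantic_to_pyarrow import get_pyarrow_schema


class Example(BaseModel):
    plain_int: int  # pa.int64()
    non_negative: int = Field(ge=0)  # minimum of zero, so pa.uint64()
    price: decimal.Decimal = Field(max_digits=10, decimal_places=2)  # pa.decimal128(10, 2)
    maybe_str: Optional[str]  # nullable pa.string() field


print(get_pyarrow_schema(Example))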

Settings

In a model, if a field is marked as excluded, i.e. Field(exclude=True), then it will be excluded from the pyarrow schema if get_pyarrow_schema is called with exclude_fields=True (defaults to False).

If get_pyarrow_schema is called with allow_losing_tz=True, then it will allow conversion of timezone-aware Python datetimes to timezone-naive pyarrow timestamps (defaults to False, in which case any loss of timezone information will raise an exception).

By default, get_pyarrow_schema will use the field names for the pyarrow schema fields. If by_alias=True is supplied, then the serialization_alias is used. More information about aliases is available in the Pydantic documentation.
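
A minimal sketch combining these three settings (the Event model is hypothetical):

from pydantic import AwareDatetime, BaseModel, Field

from pydantic_to_pyarrow import get_pyarrow_schema


class Event(BaseModel):
    created_at: AwareDatetime
    internal_note: str = Field(exclude=True)
    user_id: int = Field(serialization_alias="userId")


schema = get_pyarrow_schema(
    Event,
    allow_losing_tz=True,  # created_at becomes pa.timestamp("ms", tz=None)
    exclude_fields=True,   # internal_note is dropped from the schema
    by_alias=True,         # user_id appears as "userId"
)
print(schema)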

An Example

from typing import Dict, List, Optional

from pydantic import BaseModel, Field
from pydantic_to_pyarrow import get_pyarrow_schema

class NestedModel(BaseModel):
    str_field: str


class MyModel(BaseModel):
    int_field: int
    opt_str_field: Optional[str]
    py310_opt_str_field: str | None
    nested: List[NestedModel]
    dict_field: Dict[str, int]
    excluded_field: str = Field(exclude=True)


pa_schema = get_pyarrow_schema(MyModel, exclude_fields=True)
print(pa_schema)
#> int_field: int64 not null
#> opt_str_field: string
#> py310_opt_str_field: string
#> nested: list<item: struct<str_field: string not null>> not null
#>   child 0, item: struct<str_field: string not null>
#>       child 0, str_field: string not null
#> dict_field: map<string, int64> not null
#>   child 0, entries: struct<key: string not null, value: int64> not null
#>       child 0, key: string not null
#>       child 1, value: int64
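
Note that excluded_field is omitted from the output only because exclude_fields=True was passed; with the default exclude_fields=False, it would appear in the schema as an ordinary string field.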

Development

Prerequisites:

  • Any Python version from 3.8 through 3.11
  • poetry for dependency management
  • git
  • make
