Conversion from pydantic models to pyarrow schemas

pydantic-to-pyarrow

pydantic-to-pyarrow is a library for Python to help with conversion of pydantic models to pyarrow schemas.

pydantic is a Python library for data validation based on type hints (annotations). It supports validation rules ranging from simple to complex.

pyarrow is a Python library for using Apache Arrow, a development platform for in-memory analytics. The library also enables easy writing to parquet files.

Why might you want to convert models to schemas? One scenario is for a data processing pipeline:

  1. Import / extract the data from its source
  2. Validate the data using pydantic
  3. Process the data in pyarrow / pandas / polars
  4. Store the raw and / or processed data in parquet.

The easiest approach for steps 3 and 4 above is to let pyarrow infer the schema from the data. The most involved approach is to specify the pyarrow schema separately from the pydantic model. In between, many applications could benefit from converting the pydantic model to a pyarrow schema. This library aims to achieve that.

Installation

This library is not yet available on PyPI.

Conversion Table

The conversions below can still overflow the pyarrow types. For example, in Python 3 the int type is unbounded, whereas pa.int64() has a fixed maximum. In most cases this should not be an issue, but if you are concerned about overflows, you should not use this library and should instead specify the full schema manually.
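The overflow concern can be made concrete with plain Python. The sketch below uses only built-in integers; the helper names fits_int64 and fits_uint64 are hypothetical and are not part of this library or of pyarrow. They show one way to check a value against the 64-bit ranges before loading it.

```python
# Bounds of the 64-bit integer types that pa.int64() and pa.uint64() represent.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_int64(value: int) -> bool:
    """Hypothetical helper: True if value fits a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def fits_uint64(value: int) -> bool:
    """Hypothetical helper: True if value fits an unsigned 64-bit integer."""
    return 0 <= value <= UINT64_MAX


big = 2**63  # a perfectly valid Python int, one past the int64 maximum

print(fits_int64(big))   # False: overflows the signed range
print(fits_uint64(big))  # True: still fits the unsigned range
```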

| Python / Pydantic | Pyarrow | Overflow |
| --- | --- | --- |
| str | pa.string() | |
| Literal[strings] | pa.dictionary(pa.int32(), pa.string()) | |
| . | . | . |
| int | pa.int64() if no minimum constraint, pa.uint64() if minimum is zero | Yes, at 2^63 (for signed) or 2^64 (for unsigned) |
| Literal[ints] | pa.int64() | Yes, at 2^63 |
| float | pa.float64() | Yes |
| decimal.Decimal | pa.decimal128 ONLY if supplying max_digits and decimal_places for pydantic field | Yes |
| . | . | . |
| datetime.date | pa.date32() | |
| datetime.time | pa.time64("us") | |
| datetime.datetime | pa.timestamp("ms", tz=None) ONLY if param allow_losing_tz=True | |
| pydantic.types.NaiveDatetime | pa.timestamp("ms", tz=None) | |
| pydantic.types.AwareDatetime | pa.timestamp("ms", tz=None) ONLY if param allow_losing_tz=True | |
| . | . | |
| Optional[...] | The pyarrow field is nullable | |
| Pydantic Model | pa.struct() | |
| List[...] | pa.list_(...) | |
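The datetime rows map to a timestamp without a time zone and with millisecond resolution, which is why allow_losing_tz must be set explicitly for values that may carry a zone. The sketch below uses only the standard-library datetime module to show what such a target type cannot carry. It does not show how this library or pyarrow performs the conversion, and truncate_to_ms is a hypothetical helper written for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Two aware datetimes with the same wall-clock time but different offsets.
utc_noon = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
plus_two_noon = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))

# As aware values they are different instants.
print(utc_noon == plus_two_noon)  # False

# Once the zone is dropped, they can no longer be told apart.
naive_a = utc_noon.replace(tzinfo=None)
naive_b = plus_two_noon.replace(tzinfo=None)
print(naive_a == naive_b)  # True


def truncate_to_ms(value: datetime) -> datetime:
    """Hypothetical helper: drop the digits below one millisecond."""
    return value.replace(microsecond=value.microsecond // 1000 * 1000)


# A millisecond-resolution timestamp cannot represent the last three
# microsecond digits of this value.
precise = datetime(2024, 1, 1, 12, 0, 0, 123456)
print(truncate_to_ms(precise).microsecond)  # 123000
```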

An Example

from typing import List, Optional

from pydantic import BaseModel
from pydantic_to_pyarrow import get_pyarrow_schema

class NestedModel(BaseModel):
    str_field: str


class MyModel(BaseModel):
    int_field: int
    opt_str_field: Optional[str]
    py310_opt_str_field: str | None  # the X | None syntax requires Python 3.10+
    nested: List[NestedModel]


pa_schema = get_pyarrow_schema(MyModel)
print(pa_schema)
#> int_field: int64 not null
#> opt_str_field: string
#> py310_opt_str_field: string
#> nested: list<item: struct<str_field: string not null>> not null
#>   child 0, item: struct<str_field: string not null>
#>       child 0, str_field: string not null

Development

Prerequisites:

  • Any Python version from 3.8 through 3.11
  • poetry for dependency management
  • git
  • make
