Conversion from pydantic models to pyarrow schemas

Maintainers

simw

pydantic-to-pyarrow

pydantic-to-pyarrow is a library for Python to help with conversion of pydantic models to pyarrow schemas.

(Please note that this project is not affiliated in any way with the great teams at pydantic or pyarrow.)

pydantic is a Python library for data validation driven by type hints / annotations. It supports validation rules ranging from simple to complex.

pyarrow is a Python library for using Apache Arrow, a development platform for in-memory analytics. The library also enables easy writing to parquet files.

Why might you want to convert models to schemas? One scenario is for a data processing pipeline:

1. Import / extract the data from its source
2. Validate the data using pydantic
3. Process the data in pyarrow / pandas / polars
4. Store the raw and / or processed data in parquet

The easiest approach for steps 3 and 4 above is to let pyarrow infer the schema from the data. The most involved approach is to specify the pyarrow schema separate from the pydantic model. In the middle, many applications could benefit from converting the pydantic model to a pyarrow schema. This library aims to achieve that.

Installation

pip install pydantic-to-pyarrow

Note: PyArrow versions < 15 are only compatible with NumPy 1.x, but they do not express this in their dependency constraints. If other constraints are forcing you to use PyArrow < 15 on Python 3.9+, and you see errors like 'A module that was compiled using NumPy 1.x cannot be run in Numpy 2.x ...', then try forcing NumPy 1.x in your project's dependencies.

Conversion Table

The conversions below can still overflow the pyarrow types. For example, in Python 3 the int type is unbounded, whereas the pa.int64() type has a fixed maximum. In most cases, this should not be an issue, but if you are concerned about overflows, you should not use this library and should manually specify the full schema.
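To make the overflow concern concrete, here is a small pure-Python sketch that checks whether integers fit the 64-bit ranges. The helper names (fits_int64, fits_uint64, out_of_range) are hypothetical and not part of this library; they could be used as a pre-check on data before building a table.

```python
# Bounds of the fixed-width 64-bit integer types.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_int64(value: int) -> bool:
    """True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def fits_uint64(value: int) -> bool:
    """True if value is representable as an unsigned 64-bit integer."""
    return 0 <= value <= UINT64_MAX


def out_of_range(values, unsigned=False):
    """Return the values that would not fit the chosen 64-bit type."""
    check = fits_uint64 if unsigned else fits_int64
    return [v for v in values if not check(v)]


samples = [0, 2**63 - 1, 2**63, -1]
print(out_of_range(samples))                 # [9223372036854775808]
print(out_of_range(samples, unsigned=True))  # [-1]
```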

Python / Pydantic	Pyarrow	Overflow
str	pa.string()
Literal[strings]	pa.dictionary(pa.int32(), pa.string())
.	.	.
int	pa.int64() if no minimum constraint, pa.uint64() if minimum is zero	Yes, at 2^63 (for signed) or 2^64 (for unsigned)
Literal[ints]	pa.int64()
float	pa.float64()	Yes
decimal.Decimal	pa.decimal128 ONLY if supplying max_digits and decimal_places for pydantic field	Yes
.	.	.
datetime.date	pa.date32()
datetime.time	pa.time64("us")
datetime.datetime	pa.timestamp("ms", tz=None) ONLY if param allow_losing_tz=True
pydantic.types.NaiveDatetime	pa.timestamp("ms", tz=None)
pydantic.types.AwareDatetime	pa.timestamp("ms", tz=None) ONLY if param allow_losing_tz=True
.	.
Optional[...]	The pyarrow field is nullable
Pydantic Model	pa.struct()
List[...]	pa.list_(...)
Dict[..., ...]	pa.map_(pa key_type, pa value_type)
Enum of str	pa.dictionary(pa.int32(), pa.string())
Enum of int	pa.int64()
UUID (uuid.UUID or pydantic.types.UUID*)	pa.uuid()	SEE NOTE BELOW!

Note on UUIDs: the UUID type is only supported in pyarrow 18.0 and above. However, as of pyarrow 19.0, when pyarrow creates a table (e.g. with pa.Table.from_pylist(objs, schema=schema)), it expects bytes rather than uuid.UUID objects. Hence, if you use .model_dump() to create the data for pyarrow, you need to add a serializer to your pydantic model that converts the UUID to bytes. This may be fixed in later versions (see https://github.com/apache/arrow/issues/43855).

For example (with pyarrow >= 18.0):

import uuid
from typing import Annotated

import pyarrow as pa
from pydantic import BaseModel, PlainSerializer
from pydantic_to_pyarrow import get_pyarrow_schema

class ModelWithUuid(BaseModel):
    uuid: Annotated[uuid.UUID, PlainSerializer(lambda x: x.bytes, return_type=bytes)]


schema = get_pyarrow_schema(ModelWithUuid)

model1 = ModelWithUuid(uuid=uuid.uuid1())
model2 = ModelWithUuid(uuid=uuid.uuid4())
data = [model1.model_dump(), model2.model_dump()]
# No schema is passed here, so pyarrow infers a plain binary column (see output)
table = pa.Table.from_pylist(data)
print(table)
#> pyarrow.Table
#> uuid: binary
#> ----
#> uuid: [[BF206AC0DA4711EF8271EF4F4B7A3587,211C4C5D94C74876AE5E32DBCCDC16C7]]

Settings

If a model field is marked for exclusion with Field(exclude=True), it is omitted from the pyarrow schema when get_pyarrow_schema is called with exclude_fields=True (the default is False).

If get_pyarrow_schema is called with allow_losing_tz=True, it will convert timezone-aware Python datetimes to pyarrow timestamps without a timezone. The default is False, in which case the loss of timezone information raises an exception.

By default, get_pyarrow_schema will use the field names for the pyarrow schema fields. If by_alias=True is supplied, then the serialization_alias is used. More information about aliases is available in the Pydantic documentation.

An Example

from typing import Dict, List, Optional

from pydantic import BaseModel, Field
from pydantic_to_pyarrow import get_pyarrow_schema

class NestedModel(BaseModel):
    str_field: str


class MyModel(BaseModel):
    int_field: int
    opt_str_field: Optional[str]
    py310_opt_str_field: str | None  # union syntax requires Python 3.10+
    nested: List[NestedModel]
    dict_field: Dict[str, int]
    excluded_field: str = Field(exclude=True)


# exclude_fields=True drops excluded_field, matching the output shown below
pa_schema = get_pyarrow_schema(MyModel, exclude_fields=True)
print(pa_schema)
#> int_field: int64 not null
#> opt_str_field: string
#> py310_opt_str_field: string
#> nested: list<item: struct<str_field: string not null>> not null
#>   child 0, item: struct<str_field: string not null>
#>       child 0, str_field: string not null
#> dict_field: map<string, int64> not null
#>   child 0, entries: struct<key: string not null, value: int64> not null
#>       child 0, key: string not null
#>       child 1, value: int64

Development

Prerequisites:

Any Python 3.8 through 3.13
uv for dependency management
git
make
nox (to run tests across dependency versions)

Release history

0.1.6 (this version): Jan 31, 2025
0.1.5: Nov 8, 2024
0.1.4: Nov 5, 2024
0.1.3: May 23, 2024
0.1.2: Mar 5, 2024
0.1.1: Nov 13, 2023
0.1: Nov 2, 2023

Download files

Source Distribution

pydantic_to_pyarrow-0.1.6.tar.gz (57.5 kB)

Uploaded Jan 31, 2025 Source

Built Distribution

pydantic_to_pyarrow-0.1.6-py3-none-any.whl (8.3 kB)

Uploaded Jan 31, 2025 Python 3

File details

Details for the file pydantic_to_pyarrow-0.1.6.tar.gz.

File metadata

Download URL: pydantic_to_pyarrow-0.1.6.tar.gz
Upload date: Jan 31, 2025
Size: 57.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pydantic_to_pyarrow-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`519285df1eff07d606c46aa54f619da8e3eccb09d023593e02c90e2e2bef378c`
MD5	`88558c4ef87227e5a2b333feedfd3ca1`
BLAKE2b-256	`8de977c78ac01de8e4a7b27d928aa3ae4377e271a74d5fe2c49c04c89002d48e`

Provenance

The following attestation bundles were made for pydantic_to_pyarrow-0.1.6.tar.gz:

Publisher: publish.yml on simw/pydantic-to-pyarrow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pydantic_to_pyarrow-0.1.6.tar.gz
- Subject digest: 519285df1eff07d606c46aa54f619da8e3eccb09d023593e02c90e2e2bef378c
- Sigstore transparency entry: 167475850
- Sigstore integration time: Jan 31, 2025
Source repository:
- Permalink: simw/pydantic-to-pyarrow@6f0c164dce7b511471122b8b500c0a11ef9daaae
- Branch / Tag: refs/tags/v0.1.6
- Owner: https://github.com/simw
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6f0c164dce7b511471122b8b500c0a11ef9daaae
- Trigger Event: release

File details

Details for the file pydantic_to_pyarrow-0.1.6-py3-none-any.whl.

File metadata

Download URL: pydantic_to_pyarrow-0.1.6-py3-none-any.whl
Upload date: Jan 31, 2025
Size: 8.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pydantic_to_pyarrow-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`60ec966433a7856e84a69113b1ae9bc248e9ca1a9b688f1b9f344079fa9dddd3`
MD5	`e711b82860d2c836ac2aa8abf531e810`
BLAKE2b-256	`f14ba14496f596e6161622ee06a9cd0d6fbfc721fec8d11898407899b60c3e6d`

Provenance

The following attestation bundles were made for pydantic_to_pyarrow-0.1.6-py3-none-any.whl:

Publisher: publish.yml on simw/pydantic-to-pyarrow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pydantic_to_pyarrow-0.1.6-py3-none-any.whl
- Subject digest: 60ec966433a7856e84a69113b1ae9bc248e9ca1a9b688f1b9f344079fa9dddd3
- Sigstore transparency entry: 167475851
- Sigstore integration time: Jan 31, 2025
Source repository:
- Permalink: simw/pydantic-to-pyarrow@6f0c164dce7b511471122b8b500c0a11ef9daaae
- Branch / Tag: refs/tags/v0.1.6
- Owner: https://github.com/simw
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6f0c164dce7b511471122b8b500c0a11ef9daaae
- Trigger Event: release
