Convert pydantic models to an AWS Glue schema for Terraform

JSON Schema to AWS Glue schema converter

Installation

pip install pydantic-glue

What?

Converts pydantic models to JSON Schema and then to an AWS Glue schema, so in theory anything that can be converted to JSON Schema could also work.

Why?

When using AWS Kinesis Firehose in a configuration that receives JSON records and writes parquet files to S3, you need to define an AWS Glue table so that Firehose knows which schema to use when creating the parquet files.

AWS Glue lets you define a schema using Avro or JSON Schema and then create a table from that schema, but as of May 2022 tables created that way can't be used with Kinesis Firehose.

https://stackoverflow.com/questions/68125501/invalid-schema-error-in-aws-glue-created-via-terraform

This is also confirmed by AWS support.

What you could do instead is create a table and set the columns manually, but then you have two sources of truth to maintain.

This tool lets you define a table in pydantic and generate a JSON file with column types that Terraform can use to create a Glue table.

Example

Take the following pydantic classes, saved as example.py:

from pydantic import BaseModel
from typing import List


class Bar(BaseModel):
    name: str
    age: int


class Foo(BaseModel):
    nums: List[int]
    bars: List[Bar]
    other: str

Running pydantic-glue

pydantic-glue -f example.py -c Foo

you get this JSON in the terminal:

{
  "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
  "columns": {
    "nums": "array<int>",
    "bars": "array<struct<name:string,age:int>>",
    "other": "string"
  }
}

which can be used in Terraform like this:

locals {
  columns = jsondecode(file("${path.module}/glue_schema.json")).columns
}

resource "aws_glue_catalog_table" "table" {
  name          = "table_name"
  database_name = "db_name"

  storage_descriptor {
    dynamic "columns" {
      for_each = local.columns

      content {
        name = columns.key
        type = columns.value
      }
    }
  }
}
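
To make the Terraform snippet above easier to reason about, here is a minimal Python sketch of what the dynamic "columns" block expands to: one name/type pair per entry in the generated JSON. The variable names are hypothetical, the JSON is inlined instead of read from glue_schema.json, and sorting by key reflects my understanding that Terraform iterates maps in lexical key order, which is an assumption worth verifying for your Terraform version.

```python
import json

# Inlined copy of the generated file from the example above
# (normally this would be read from glue_schema.json).
generated = """
{
  "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
  "columns": {
    "nums": "array<int>",
    "bars": "array<struct<name:string,age:int>>",
    "other": "string"
  }
}
"""

# Only the "columns" key is consumed; the "//" comment key is ignored.
columns = json.loads(generated)["columns"]

# One entry per column, in the Name/Type shape of Glue table columns.
# Sorted by key on the assumption that Terraform iterates maps lexically.
glue_columns = [
    {"Name": name, "Type": glue_type}
    for name, glue_type in sorted(columns.items())
]

for column in glue_columns:
    print(column)
```

If column order matters for your table, it is worth checking the order Terraform actually produces with terraform plan.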

Alternatively, you can run the CLI with the -o flag to set the output file location:

pydantic-glue -f example.py -c Foo -o example.json -l

If your pydantic models use field aliases but you prefer the field names in the generated schema, pass the --schema-by-name flag.

See the pydantic documentation on aliases for details.

The following model is converted differently with and without the --schema-by-name argument.

from pydantic import BaseModel, Field

class A(BaseModel):
    hey: str = Field(alias="h")
    ho: str

pydantic-glue -f tests/data/input.py -c A

2025-02-01 00:08:45,046 - INFO - Generated file content:
{
  "//": "Generated by pydantic-glue at 2025-01-31 23:08:45.046012+00:00. DO NOT EDIT",
  "columns": {
    "h": "string",
    "ho": "string"
  }
}

pydantic-glue -f tests/data/input.py -c A --schema-by-name
2025-02-01 00:09:18,381 - INFO - Generated file content:
{
  "//": "Generated by pydantic-glue at 2025-01-31 23:09:18.380586+00:00. DO NOT EDIT",
  "columns": {
    "hey": "string",
    "ho": "string"
  }
}

Override the type for the AWS Glue Schema

Wherever there is a type key in the input JSON Schema, an additional key glue_type may be defined to override the type that is used in the AWS Glue Schema. This is, for example, useful for a pydantic model that has a field of type int that is unix epoch time, while the column type you would like in Glue is of type timestamp.

Additional JSON Schema keys can be added to a pydantic model field by passing the json_schema_extra argument to the Field function, like so:

from pydantic import BaseModel, Field

class A(BaseModel):
    epoch_time: int = Field(
        ...,
        json_schema_extra={
            "glue_type": "timestamp",
        },
    )

The resulting JSON Schema will be:

{
    "properties": {
        "epoch_time": {
            "glue_type": "timestamp",
            "title": "Epoch Time",
            "type": "integer"
        }
    },
    "required": [
        "epoch_time"
    ],
    "title": "A",
    "type": "object"
}

And the result after processing with pydantic-glue:

{
  "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
  "columns": {
    "epoch_time": "timestamp"
  }
}

Recursion through a field's properties stops as soon as a glue_type is supplied. If the type is complex (for example a struct or an array), you must supply the full complex Glue type yourself.
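
The short-circuit described above can be sketched in plain Python. This is an illustration under stated assumptions, not pydantic-glue's actual code: the function name resolve_type, the field names, and the small type table are all hypothetical. It shows that a node carrying glue_type is returned verbatim and its nested properties are never visited.

```python
# Hypothetical sketch; not taken from pydantic-glue's source.
def resolve_type(node: dict) -> str:
    # An explicit glue_type wins and stops the recursion at this node.
    if "glue_type" in node:
        return node["glue_type"]
    if node["type"] == "object":
        inner = ",".join(
            f"{name}:{resolve_type(child)}"
            for name, child in node["properties"].items()
        )
        return f"struct<{inner}>"
    if node["type"] == "array":
        return f"array<{resolve_type(node['items'])}>"
    # Minimal primitive table, just enough for this example.
    return {"string": "string", "integer": "int"}[node["type"]]


schema = {
    "type": "object",
    "properties": {
        # Simple override: integer epoch stored as a Glue timestamp.
        "epoch_time": {"type": "integer", "glue_type": "timestamp"},
        # Complex override: the full struct type must be spelled out,
        # because the nested integer fields are not inspected.
        "window": {
            "type": "object",
            "glue_type": "struct<start_ts:timestamp,end_ts:timestamp>",
            "properties": {
                "start_ts": {"type": "integer"},
                "end_ts": {"type": "integer"},
            },
        },
        # No override: falls through to the normal mapping.
        "count": {"type": "integer"},
    },
}

overridden = {
    name: resolve_type(child) for name, child in schema["properties"].items()
}
print(overridden)
```

Without the glue_type on window, this sketch would produce struct<start_ts:int,end_ts:int>, which is why the override has to carry the whole type string.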

How does it work?

  • the pydantic model is converted to JSON Schema
  • the JSON Schema types are mapped to Glue types recursively
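
The second step can be sketched as a small recursive function. This is a simplified illustration, not the library's implementation: the names to_glue and build_columns are hypothetical, the mappings for number and boolean are assumptions (only string and int appear in the outputs above), and the input is a hand-written JSON Schema of the shape pydantic emits for the Foo model, with the nested Bar model under "$defs".

```python
# Hypothetical sketch; names and the type table are assumptions.
PRIMITIVES = {
    "string": "string",
    "integer": "int",
    "number": "float",     # assumption, not confirmed by the README
    "boolean": "boolean",  # assumption, not confirmed by the README
}


def to_glue(node: dict, root: dict) -> str:
    """Map one JSON Schema node to a Glue type string, recursing into children."""
    if "$ref" in node:
        # Nested models live under "$defs" and are referenced with "$ref".
        name = node["$ref"].rsplit("/", 1)[-1]
        return to_glue(root["$defs"][name], root)
    kind = node["type"]
    if kind == "array":
        return f"array<{to_glue(node['items'], root)}>"
    if kind == "object":
        fields = ",".join(
            f"{key}:{to_glue(child, root)}"
            for key, child in node["properties"].items()
        )
        return f"struct<{fields}>"
    return PRIMITIVES[kind]


def build_columns(schema: dict) -> dict:
    """Top-level properties become columns; nested objects become structs."""
    return {
        key: to_glue(child, schema)
        for key, child in schema["properties"].items()
    }


# Hand-written schema mirroring the Foo/Bar example.
foo_schema = {
    "$defs": {
        "Bar": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
        }
    },
    "type": "object",
    "properties": {
        "nums": {"type": "array", "items": {"type": "integer"}},
        "bars": {"type": "array", "items": {"$ref": "#/$defs/Bar"}},
        "other": {"type": "string"},
    },
}

result = build_columns(foo_schema)
print(result)
```

For this input the sketch yields the same three column types as the Foo example: array<int>, array<struct<name:string,age:int>> and string.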

Future work

  • Not all types are supported. I add types as I need them, but adding a type is very easy, so feel free to open an issue or send a PR if you stumble upon an unsupported use case
  • the tool could easily be extended to work with JSON Schema directly, so that anything that can be converted to JSON Schema would also work
