
A lightweight library for managing and validating data schemas from YAML specifications


yads

Requires Python 3.10+.

yads is an expressive, canonical data specification for managing schemas throughout your data stack. Proudly open source and built in the open, with and for the data community.

Check our documentation to learn more, and the quick start guide to get started.

Installation

# With pip
pip install yads
# With uv
uv add yads

yads is a lightweight dependency designed to run in your existing Python workflows. Each loader and converter supports a wide range of versions of your source or target format.

You can install the yads Python API alongside the optional dependency required for your use case.

uv add yads[pyarrow]

Or simply add yads to a project that already uses the optional dependency within the supported version range. See the supported versions here.

Overview

As the universal format for columnar data representation, Arrow is central to yads, but the specification is expressive enough to be derivable from the most common data formats used by data teams.

Format    Loader             Converter          Installation
PyArrow   yads.from_pyarrow  yads.to_pyarrow    pip install yads[pyarrow]
PySpark   yads.from_pyspark  yads.to_pyspark    pip install yads[pyspark]
Polars    yads.from_polars   yads.to_polars     pip install yads[polars]
Pydantic  Not implemented    yads.to_pydantic   pip install yads[pydantic]
SQL       Not implemented    yads.to_sql        pip install yads[sql]
YAML      yads.from_yaml     Not implemented    pip install yads

See the loaders and converters API for advanced usage. A list of supported SQL dialects is available here.

yads specification

Typical workflows start with an expressive yads specification that can then be used throughout the data lifecycle.

The latest yads specification JSON schema is available here.

# docs/src/specs/customers.yaml
name: "catalog.crm.customers"
version: 1
yads_spec_version: "0.0.2"

columns:
  - name: "id"
    type: "bigint"
    constraints:
      not_null: true

  - name: "email"
    type: "string"

  - name: "created_at"
    type: "timestamptz"

  - name: "spend"
    type: "decimal"
    params:
      precision: 10
      scale: 2

  - name: "tags"
    type: "array"
    element:
      type: "string"

Load a yads spec and generate a Pydantic BaseModel

import yads

spec = yads.from_yaml("docs/src/specs/customers.yaml")

# Generate a Pydantic BaseModel
Customers = yads.to_pydantic(spec, model_name="Customers")

print(Customers)
print(list(Customers.model_fields.keys()))
<class 'yads.converters.pydantic_converter.Customers'>
['id', 'email', 'created_at', 'spend', 'tags']

Validate and serialize data

from datetime import datetime, timezone

record = Customers(
    id=123,
    email="alice@example.com",
    created_at=datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc),
    spend="42.50",
    tags=["vip", "beta"],
)

print(record.model_dump())
{'id': 123, 'email': 'alice@example.com', 'created_at': datetime.datetime(2024, 5, 1, 12, 0, tzinfo=datetime.timezone.utc), 'spend': Decimal('42.50'), 'tags': ['vip', 'beta']}
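The spend column's decimal(precision=10, scale=2) parameters have concrete semantics: at most 10 significant digits in total, 2 of them after the decimal point. A stdlib-only sketch of that check (the function name is hypothetical, not part of yads):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def fits_decimal_10_2(value: str) -> bool:
    """Check that a value fits DECIMAL(10, 2): 10 total digits, scale 2."""
    d = Decimal(value)
    q = d.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    # Quantizing must not change the value, and at most 8 integer digits remain.
    return q == d and abs(q) < Decimal("1e8")

print(fits_decimal_10_2("42.50"))         # True
print(fits_decimal_10_2("42.505"))        # False: exceeds scale 2
print(fits_decimal_10_2("123456789.00"))  # False: exceeds precision 10
```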

Emit DDL for multiple SQL dialects from the same spec

spark_ddl = yads.to_sql(spec, dialect="spark", pretty=True)
print(spark_ddl)
CREATE TABLE catalog.crm.customers (
  id BIGINT NOT NULL,
  email STRING,
  created_at TIMESTAMP,
  spend DECIMAL(10, 2),
  tags ARRAY<STRING>
)
duckdb_ddl = yads.to_sql(spec, dialect="duckdb", pretty=True)
print(duckdb_ddl)
CREATE TABLE catalog.crm.customers (
  id BIGINT NOT NULL,
  email TEXT,
  created_at TIMESTAMPTZ,
  spend DECIMAL(10, 2),
  tags TEXT[]
)

Create a Polars DataFrame schema

import yads

pl_schema = yads.to_polars(spec)
print(pl_schema)
Schema({'id': Int64, 'email': String, 'created_at': Datetime(time_unit='ns', time_zone='UTC'), 'spend': Decimal(precision=10, scale=2), 'tags': List(String)})

Create a PyArrow schema with constraint preservation

import yads

pa_schema = yads.to_pyarrow(spec)
print(pa_schema)
id: int64 not null
email: string
created_at: timestamp[ns, tz=UTC]
spend: decimal128(10, 2)
tags: list<item: string>
  child 0, item: string

Configurable conversions

The canonical yads spec is immutable, but conversions can be customized with configuration options.

import yads

spec = yads.from_yaml("docs/src/specs/customers.yaml")
ddl_min = yads.to_sql(
    spec,
    dialect="spark",
    include_columns={"id", "email"},
    pretty=True,
)

print(ddl_min)
CREATE TABLE catalog.crm.customers (
  id BIGINT NOT NULL,
  email STRING
)
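Stripped of yads's internals, include_columns is a set-membership filter over the spec's columns; a generic sketch of the pattern (the (name, DDL) pairs here are illustrative, not yads's data model):

```python
# Columns as (name, rendered DDL) pairs, mirroring the spec above.
columns = [
    ("id", "BIGINT NOT NULL"),
    ("email", "STRING"),
    ("spend", "DECIMAL(10, 2)"),
    ("tags", "ARRAY<STRING>"),
]
include_columns = {"id", "email"}

# Keep only the requested columns, preserving spec order.
filtered = [(name, ddl) for name, ddl in columns if name in include_columns]
print([name for name, _ in filtered])  # ['id', 'email']
```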

Column overrides can be used to apply custom validation to specific columns, or to supersede default conversions.

from pydantic import Field

def email_override(field, conv):
    # Enforce example.com domain with a regex pattern
    return str, Field(pattern=r"^.+@example\.com$")

Model = yads.to_pydantic(spec, column_overrides={"email": email_override})

try:
    Model(
        id=1,
        email="user@other.com",
        created_at="2024-01-01T00:00:00+00:00",
        spend="42.50",
        tags=["beta"],
    )
except Exception as e:
    print(type(e).__name__ + ":\n" + str(e))
ValidationError:
1 validation error for catalog_crm_customers
email
  String should match pattern '^.+@example\.com$' [type=string_pattern_mismatch, input_value='user@other.com', input_type=str]
    For further information visit https://errors.pydantic.dev/2.12/v/string_pattern_mismatch
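The override above boils down to a standard Pydantic v2 field constraint. The same regex can be written directly on a hand-rolled model (Customer here is illustrative, not yads-generated), assuming Pydantic v2:

```python
from pydantic import BaseModel, Field, ValidationError

class Customer(BaseModel):
    id: int
    email: str = Field(pattern=r"^.+@example\.com$")

print(Customer(id=1, email="alice@example.com").email)  # alice@example.com

try:
    Customer(id=2, email="user@other.com")
except ValidationError as e:
    print(e.errors()[0]["type"])  # string_pattern_mismatch
```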

Round-trip conversions

yads attempts to preserve the complete representation of data schemas across conversions. The following example demonstrates a round-trip from a PyArrow schema to a yads spec, then to a DuckDB DDL and PySpark schema, while preserving metadata and column constraints.

import yads
import pyarrow as pa

schema = pa.schema(
    [
        pa.field(
            "id",
            pa.int64(),
            nullable=False,
            metadata={"description": "Customer ID"},
        ),
        pa.field(
            "name",
            pa.string(),
            metadata={"description": "Customer preferred name"},
        ),
        pa.field(
            "email",
            pa.string(),
            metadata={"description": "Customer email address"},
        ),
        pa.field(
            "created_at",
            pa.timestamp("ns", tz="UTC"),
            metadata={"description": "Customer creation timestamp"},
        ),
    ]
)

spec = yads.from_pyarrow(schema, name="catalog.crm.customers", version=1)
print(spec)
spec catalog.crm.customers(version=1)(
  columns=[
    id: integer(bits=64)(
      description='Customer ID',
      constraints=[NotNullConstraint()]
    )
    name: string(
      description='Customer preferred name'
    )
    email: string(
      description='Customer email address'
    )
    created_at: timestamptz(unit=ns, tz=UTC)(
      description='Customer creation timestamp'
    )
  ]
)

Nullability and metadata are preserved as long as the target format supports them.

duckdb_ddl = yads.to_sql(spec, dialect="duckdb", pretty=True)
print(duckdb_ddl)
CREATE TABLE catalog.crm.customers (
  id BIGINT NOT NULL,
  name TEXT,
  email TEXT,
  created_at TIMESTAMPTZ
)
pyspark_schema = yads.to_pyspark(spec)
for field in pyspark_schema.fields:
    print(f"{field.name}, {field.dataType}, {field.nullable=}")
    print(f"{field.metadata=}\n")
id, LongType(), field.nullable=False
field.metadata={'description': 'Customer ID'}

name, StringType(), field.nullable=True
field.metadata={'description': 'Customer preferred name'}

email, StringType(), field.nullable=True
field.metadata={'description': 'Customer email address'}

created_at, TimestampType(), field.nullable=True
field.metadata={'description': 'Customer creation timestamp'}

Design Philosophy

yads is spec-first, deterministic, and safe-by-default: given the same spec and backend, converters and loaders produce the same schema and the same validation diagnostics.

Conversions proceed silently only when they are lossless and fully semantics-preserving. When a backend cannot represent a type parameter but semantics are preserved (a constraint loss, e.g. String(length=10) → String()), yads converts and emits structured warnings per affected field.

Backend type gaps are handled with value-preserving substitutes only; otherwise conversion requires an explicit fallback_type. Potentially lossy or reinterpreting changes (range narrowing, precision downgrades, sign changes, or unit changes) are never applied implicitly. Types with no value-preserving representation fail fast with clear errors and extension guidance.

Single rule: preserve semantics or notify; never lose or reinterpret data without explicit opt-in.
