A lightweight library for managing and validating data schemas from YAML specifications

yads

yads (Yet Another Data Spec, or YAML-Augmented Data Specification) is a Python library for managing data specifications using YAML. It helps you define and manage your data warehouse tables, schemas, and documentation in a structured, version-controlled way. With yads, you define your data assets once in YAML and then generate outputs such as DDL statements for different databases, data schemas for tools like Avro or PyArrow, and human-readable, LLM-ready documentation.

Why yads?

The modern data stack is complex, with data assets defined across a multitude of platforms and tools. This often leads to fragmented and inconsistent documentation, making data discovery and governance a challenge. yads was created to address this by providing a centralized, version-controllable, and extensible way to manage metadata for modern data platforms.

The main goal of yads is to provide a single source of truth for your data assets using simple YAML files. These files can capture everything from table schemas and column descriptions to governance policies and usage notes. From these specifications, yads can transpile the information into various formats, such as DDL statements for different SQL dialects and Avro or PyArrow schemas, and can generate documentation that is ready for both humans and Large Language Models (LLMs).

Getting Started

Installation

pip install yads

To include support for PySpark DataFrame schema generation, install the optional pyspark extra:

pip install 'yads[pyspark]'
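
Because PySpark is optional, code that should run with or without the extra can feature-detect it at runtime. This is a generic sketch using only the standard library, not a yads API:

```python
import importlib.util

# Detect whether the optional PySpark extra is installed before
# calling any Spark-dependent functionality.
has_pyspark = importlib.util.find_spec("pyspark") is not None
print("pyspark available:", has_pyspark)
```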

Usage

Defining a Specification

Create a YAML file to define your table schema and properties. For example, specs/dim_user.yaml:

# specs/dim_user.yaml

table_name: "dim_user"
database: "dm_product_performance"
database_schema: "curated"
description: "Dimension table for users."
dimensional_table_type: "dimension"
owner: "data_engineering"
version: "1.0.0"
scd_type: 2

location: "s3://lakehouse/dm_product_performance/curated/dim_user"
partitioning:
  - column: "created_date"
    strategy: "month"

properties:
  table_type: "ICEBERG"
  format: "parquet"
  write_compression: "snappy"

table_schema:
  - name: "id"
    type: "integer"
    description: "Unique identifier for the user"
    constraints:
      - not_null: true
  - name: "username"
    type: "string"
    description: "Username for the user"
    constraints:
      - not_null: true
  - name: "email"
    type: "string"
    description: "Email address for the user"
    constraints:
      - not_null: true
  - name: "preferences"
    type: "map"
    key_type: "string"
    value_type: "string"
  - name: "created_at"
    type: "timestamp"
    description: "Timestamp of user creation"
    constraints:
      - not_null: true

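Once parsed, a spec like the one above is just a plain Python dictionary, which makes it easy to sanity-check in CI before handing it to yads. The check_spec helper below is an illustrative sketch, not part of the yads API:

```python
# Parsed form of a yads-style spec (what yaml.safe_load would return).
spec = {
    "table_name": "dim_user",
    "database": "dm_product_performance",
    "table_schema": [
        {"name": "id", "type": "integer", "constraints": [{"not_null": True}]},
        {"name": "email", "type": "string"},
    ],
}

def check_spec(spec: dict) -> list:
    """Return a list of structural problems (empty means the spec looks sane)."""
    problems = []
    for key in ("table_name", "database", "table_schema"):
        if key not in spec:
            problems.append(f"missing top-level key: {key}")
    for i, col in enumerate(spec.get("table_schema", [])):
        if "name" not in col or "type" not in col:
            problems.append(f"column {i} needs both 'name' and 'type'")
    return problems

assert check_spec(spec) == []
```

A lightweight pre-check like this only catches obvious mistakes; it is independent of whatever validation yads itself performs when loading a spec.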
Generating Spark DDL

You can generate a Spark DDL CREATE TABLE statement from the specification:

from yads import TableSpecification

# Load the specification
spec = TableSpecification("specs/dim_user.yaml")

# Generate the DDL
ddl = spec.to_ddl(dialect="spark")

print(ddl)

CREATE OR REPLACE TABLE dm_product_performance.curated.dim_user (
  `id` INTEGER NOT NULL,
  `username` STRING NOT NULL,
  `email` STRING NOT NULL,
  `preferences` MAP<STRING, STRING>,
  `created_at` TIMESTAMP NOT NULL
)
USING ICEBERG
PARTITIONED BY (month(`created_date`))
LOCATION 's3://lakehouse/dm_product_performance/curated/dim_user'
TBLPROPERTIES (
  'table_type' = 'ICEBERG',
  'format' = 'parquet',
  'write_compression' = 'snappy'
);
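
Under the hood, any spec-to-DDL transpiler has to map the spec's abstract column types onto dialect-specific SQL types and render constraints. The following is a simplified, self-contained sketch of that idea, not yads's actual implementation:

```python
# Illustrative only: a toy mapper from spec column types to Spark SQL types.
SPARK_TYPES = {
    "integer": "INTEGER",
    "string": "STRING",
    "timestamp": "TIMESTAMP",
}

def to_spark_type(column: dict) -> str:
    """Render a column's spec type as a Spark SQL type string."""
    t = column["type"]
    if t == "map":  # parameterized types carry key_type/value_type subfields
        key = SPARK_TYPES[column["key_type"]]
        value = SPARK_TYPES[column["value_type"]]
        return f"MAP<{key}, {value}>"
    return SPARK_TYPES[t]

def to_column_ddl(column: dict) -> str:
    """Render one column definition, honoring not_null constraints."""
    not_null = any(c.get("not_null") for c in column.get("constraints", []))
    suffix = " NOT NULL" if not_null else ""
    return f"`{column['name']}` {to_spark_type(column)}{suffix}"

print(to_column_ddl({"name": "id", "type": "integer",
                     "constraints": [{"not_null": True}]}))
# `id` INTEGER NOT NULL
```

A real transpiler would additionally handle nested types, partitioning clauses, and per-dialect quoting rules, but the core of the work is a type-mapping table like the one above.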

Generating a PySpark DataFrame Schema

You can generate a pyspark.sql.types.StructType schema for a PySpark DataFrame:

from yads import TableSpecification
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the specification
spec = TableSpecification("specs/dim_user.yaml")

# Generate the PySpark schema
spark_schema = spec.to_spark_schema()

df = spark.createDataFrame([], schema=spark_schema)
df.printSchema()

root
 |-- id: integer (nullable = false)
 |-- username: string (nullable = false)
 |-- email: string (nullable = false)
 |-- preferences: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- created_at: timestamp (nullable = false)

Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.
