
Dataform Warehouse Blueprints

Production-ready Dataform SQLX templates for data warehouse patterns

Generate clean, maintainable Dataform SQLX files from simple YAML configurations. Supports staging, dimension, fact, and aggregate table patterns with best practices built-in.

PyPI version · Python 3.8+ · License: MIT

🎯 Features

  • Multiple Table Patterns: Staging, Dimension, Fact, Aggregate/Mart tables
  • Best Practices: Partitioning, clustering, incremental loading built-in
  • Simple YAML Config: Define tables without writing boilerplate SQL
  • CLI & Python API: Use as command-line tool or Python library
  • Validation: Catch configuration errors before generation
  • Batch Generation: Process multiple configs at once
  • Dataform-Ready: Generates valid SQLX for Dataform projects

📦 Installation

pip install dataform-warehouse-blueprints

🚀 Quick Start

CLI Usage

1. Initialize a configuration file:

dataform-blueprints init \
  -t dim_employee \
  -l warehouse \
  --table-type dimension \
  -o dim_employee_config.yaml

2. Edit the generated YAML:

table_name: dim_employee
layer: warehouse
table_type: dimension
source_table: ${ref('stg_employees')}
columns:
  - employee_id
  - first_name
  - last_name
  - email
partition_by: updated_at
cluster_by: [employee_id]
primary_keys: [employee_id]
incremental: true
tags: [dimension, hr]

3. Generate SQLX:

dataform-blueprints generate \
  -c dim_employee_config.yaml \
  -o definitions/dim_employee.sqlx

4. Generated SQLX output:

config {
  type: "incremental",
  schema: "warehouse",
  name: "dim_employee",
  bigquery: {
    partitionBy: "updated_at",
    clusterBy: ["employee_id"],
  },
  tags: ["dimension", "hr"],
}

-- Dimension Table: dim_employee
-- Layer: WAREHOUSE (Gold)

SELECT
  employee_id,
  first_name,
  last_name,
  email,
  CURRENT_TIMESTAMP() AS _loaded_at
FROM ${ref('stg_employees')}

${ when(incremental(), `
  WHERE updated_at > (SELECT MAX(updated_at) FROM ${self()})
`) }
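The `when(incremental(), …)` guard compiles to a high-watermark filter: on incremental runs, only source rows newer than the latest `updated_at` already in the target are selected. A minimal Python sketch of that logic (illustrative only — the data and function are made up, not part of the library):

```python
# Illustrative high-watermark filter: keep only source rows strictly newer
# than the maximum updated_at already present in the target table.

def incremental_filter(source_rows, target_rows, ts_key="updated_at"):
    """Return source rows newer than the target's high watermark."""
    if not target_rows:          # first (full-refresh) run: take everything
        return list(source_rows)
    watermark = max(row[ts_key] for row in target_rows)
    return [row for row in source_rows if row[ts_key] > watermark]

target = [{"employee_id": 1, "updated_at": "2024-01-01"}]
source = [
    {"employee_id": 1, "updated_at": "2024-01-01"},  # already loaded, skipped
    {"employee_id": 2, "updated_at": "2024-02-01"},  # new row, kept
]

print(incremental_filter(source, target))
```

On the first run the target is empty, so every source row loads; afterwards only rows past the watermark are appended, which is what keeps incremental runs cheap.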

Python API

from dataform_blueprints import TableConfig, BlueprintGenerator, LayerType, TableType

# Define configuration
config = TableConfig(
    table_name="dim_employee",
    layer=LayerType.WAREHOUSE,
    table_type=TableType.DIMENSION,
    source_table="${ref('stg_employees')}",
    columns=["employee_id", "first_name", "last_name"],
    primary_keys=["employee_id"],
    partition_by="updated_at",
    cluster_by=["employee_id"],
    incremental=True,
    tags=["dimension", "hr"]
)

# Generate SQLX
generator = BlueprintGenerator(config)
sqlx_content = generator.generate_sqlx()

# Write to file
generator.write_sqlx("definitions/dim_employee.sqlx")

📚 Table Patterns

1. Staging Tables (Raw → Staging)

Clean and standardize data from raw sources.

table_name: stg_employees
layer: staging
table_type: source
source_table: ${ref('raw_employees')}
columns:
  - employee_id
  - first_name
  - last_name
incremental: true

2. Dimension Tables (SCD Type 1)

Business entities for analytics.

table_name: dim_employee
layer: warehouse
table_type: dimension
source_table: ${ref('stg_employees')}
columns:
  - employee_id
  - first_name
  - last_name
primary_keys: [employee_id]
partition_by: updated_at
cluster_by: [employee_id]

3. Fact Tables

Transactional business events.

table_name: fact_payroll_run
layer: warehouse
table_type: fact
source_table: ${ref('stg_payroll')}
columns:
  - payroll_id
  - employee_id
  - gross_pay
  - net_pay
primary_keys: [payroll_id]
partition_by: pay_date
cluster_by: [employee_id, pay_date]

4. Aggregate/Mart Tables

Pre-aggregated data for reporting.

table_name: mart_payroll_summary
layer: marts
table_type: aggregate
source_table: ${ref('fact_payroll_run')}
columns:
  - department
  - SUM(gross_pay) as total_pay
  - COUNT(*) as payroll_count
primary_keys: [department]

🛠️ CLI Commands

init - Create Configuration

dataform-blueprints init -t <table_name> -l <layer> --table-type <type>

Options:

  • -t, --table-name: Table name (required)
  • -l, --layer: Layer (raw/staging/warehouse/marts) (required)
  • --table-type: Pattern (source/dimension/fact/aggregate) (required)
  • -o, --output: Output file path (default: table_config.yaml)

generate - Generate SQLX

dataform-blueprints generate -c <config.yaml> -o <output.sqlx>

Options:

  • -c, --config: YAML config file (required)
  • -o, --output: Output SQLX file
  • --validate-only: Only validate, don't generate

batch - Batch Generate

dataform-blueprints batch -d <configs_dir> -o <output_dir>

Options:

  • -d, --directory: Directory with YAML configs (required)
  • -o, --output-dir: Output directory (default: definitions)

examples - Show Examples

dataform-blueprints examples

📋 Configuration Reference

Required Fields

table_name: string       # Table name
layer: string           # raw | staging | warehouse | marts
table_type: string      # source | dimension | fact | aggregate | view
columns: list           # Column names

Optional Fields

source_table: string         # Source table reference (e.g., ${ref('table')})
partition_by: string         # Partition column
cluster_by: list            # Clustering columns
primary_keys: list          # Primary key columns
description: string         # Table description
tags: list                  # Tags for organization
incremental: bool           # Enable incremental loading (default: true)
dataset_id: string          # Override default dataset
dependencies: list          # Explicit dependencies
assertions: list            # Data quality assertions
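The kind of checking that `--validate-only` performs can be approximated from this reference. A minimal sketch using the field names and allowed values above (the function itself is hypothetical, not the library's API):

```python
# Hypothetical config validator mirroring the reference above.
REQUIRED = {"table_name", "layer", "table_type", "columns"}
LAYERS = {"raw", "staging", "warehouse", "marts"}
TABLE_TYPES = {"source", "dimension", "fact", "aggregate", "view"}

def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing required field: {f}" for f in sorted(REQUIRED - cfg.keys())]
    if "layer" in cfg and cfg["layer"] not in LAYERS:
        errors.append(f"unknown layer: {cfg['layer']!r}")
    if "table_type" in cfg and cfg["table_type"] not in TABLE_TYPES:
        errors.append(f"unknown table_type: {cfg['table_type']!r}")
    return errors

cfg = {"table_name": "dim_employee", "layer": "warehouse",
       "table_type": "dimension", "columns": ["employee_id"]}
print(validate_config(cfg))          # valid config -> no errors
print(validate_config({"layer": "gold"}))
```

Running validation like this before generation is what lets configuration mistakes surface in review or CI rather than at Dataform compile time.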

🎯 Use Cases

Data Warehouse Modernization

Generate consistent SQLX files across your entire warehouse:

# Generate all staging tables
dataform-blueprints batch -d configs/staging -o definitions/staging

# Generate all dimensions
dataform-blueprints batch -d configs/dimensions -o definitions/warehouse

# Generate all facts
dataform-blueprints batch -d configs/facts -o definitions/warehouse
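A repository layout that fits this batch workflow might look like the following (directory and file names are illustrative, not prescribed by the tool):

```
configs/
  staging/
    stg_employees.yaml
  dimensions/
    dim_employee.yaml
  facts/
    fact_payroll_run.yaml
definitions/
  staging/
  warehouse/
```

Keeping one YAML file per table under `configs/` makes each `batch` invocation map cleanly onto one layer of the warehouse.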

Rapid Prototyping

Quickly scaffold new tables:

dataform-blueprints init -t fact_sales -l warehouse --table-type fact
# Edit config...
dataform-blueprints generate -c fact_sales.yaml

CI/CD Integration

Validate configs in CI:

dataform-blueprints generate -c dim_employee.yaml --validate-only

🔧 Development

Setup

git clone https://github.com/yourusername/dataform-warehouse-blueprints.git
cd dataform-warehouse-blueprints
pip install -e ".[dev]"

Run Tests

pytest tests/ -v --cov=dataform_blueprints

Code Quality

black src/
flake8 src/

📖 Examples

See the examples/ directory for:

  • dim_employee_config.yaml - Dimension table
  • fact_payroll_config.yaml - Fact table
  • mart_payroll_summary_config.yaml - Aggregate table

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

📄 License

MIT License - see LICENSE file for details


💡 Why Dataform Blueprints?

  • Consistency: Standardize table patterns across your warehouse
  • Speed: Scaffold new tables in minutes instead of hand-writing boilerplate SQL
  • Best Practices: Partitioning, clustering, and incremental loading built in
  • Maintainability: YAML configs are easier to review than hand-written SQL
  • Reusability: Share table patterns across projects/teams

🎓 Related Projects

This module is part of the Payroll & Workforce Analytics Modernization project:

  • synthetic-payroll-lab - Generate test data
  • scd2-bq-engine - SCD Type 2 automation
  • bq-finops-cli - BigQuery cost optimization

Built with ❤️ for data engineers who value clean, maintainable SQL.
