Skip to main content

Data Vault 2 SQL generator using sqlglot

Project description

datavault4sqlglot

A Python library for generating Data Vault 2 SQL using sqlglot. Produces dialect-agnostic SQL for Hubs, Links, Satellites, and Staging layers — no dbt, no live database connection required.

Quick Start

from datavault4sqlglot.generators.hub import HubGenerator
from datavault4sqlglot.metadata import SourceModel, SourceBinding

# Describe the staged source table
source = SourceModel(
    schema="stage",
    table_name="stg_orders",
    load_date_col="ldts",
    record_source_col="rsrc",
)

# Bind the source to a Hub: declare which column is the business key
binding = SourceBinding(
    source=source,
    business_keys=["order_id"],
)

# Generate the Hub SQL
generator = HubGenerator(
    sources=[binding],
    hashkey="hk_order_h",
    target_schema="dv",
    target_table="order_h",
    is_incremental=True,
)

print(generator.to_sql())

Configuration

The library uses a global configuration object (datavault4sqlglot.config) with the following defaults:

Key Default Description
dialect snowflake Target SQL dialect
hash MD5 Hash algorithm (MD5, SHA256, SHA1)
ldts_alias ldts Load date timestamp column name
rsrc_alias rsrc Record source column name
ledts_alias ledts Load end date column name
end_of_all_times 9999-12-31 Sentinel for open-ended records
beginning_of_all_times 0001-01-01 Sentinel for ghost records
hashkey_input_case_sensitive false Apply UPPER() before hashing hash keys
hashdiff_input_case_sensitive false Apply UPPER() before hashing hash diffs
use_trim true Apply TRIM() before hashing
quote_identifiers true Quote table and column identifiers
ghost_record_rsrc SYSTEM Record source value for the unknown ghost row
ghost_record_error_rsrc ERROR Record source value for the error ghost row

You can override any of these by placing a config.json file in your current working directory. The library will automatically load and apply these settings when imported.

Example config.json:

{
  "ldts_alias": "load_date_timestamp",
  "hash": "SHA256",
  "dialect": "bigquery"
}

You can also manually load a configuration file from a specific path:

from datavault4sqlglot.config import config, load_config

load_config(config, "/path/to/my/custom_config.json")

Models

Three distinct classes, each for a different layer:

StageModel — used exclusively by StageGenerator

Describes a raw source table that needs to be hashed and prepared. It owns everything about that transformation: which columns to hash, derived expressions, missing columns for schema evolution, etc. It's a self-contained description of one staging job — no binding to anything else.

StageModel(
    table_name="raw.orders",
    hashed_columns={"HK_ORDER_H": ["ORDER_ID"]},
    derived_columns={"ldts": "CURRENT_TIMESTAMP()"},
)

SourceModel — the physical table pointer used by vault generators (Hub, Link, Sat)

Just says where to find the already-staged data: table name, optional schema/database, and which columns are ldts/rsrc if they differ from the config defaults. No transformation logic.

SourceModel(
    database="RAW_DB",
    schema="STAGE",
    table_name="STG_ORDERS",
    load_date_col="LOAD_DATE",
    record_source_col="RECORD_SOURCE",
)

SourceBinding — wraps a SourceModel with DV-loading intent

Answers what to extract from that staged table for a specific vault entity: per-source physical bk_columns (when they differ from the hub's canonical names), foreign hash keys (for links), rsrc_statics for HWM scoping, etc. A single SourceModel can be wrapped in different SourceBindings for different vault entities.

The hub-level canonical business_keys live on HubGenerator itself (not on the binding) — every binding into the same hub maps onto that same canonical name set.

SourceBinding(
    source=_SRC_ORDERS_MODEL,        # the SourceModel
    bk_columns=["SAP_ORDER_ID"],     # only needed when the source's column
                                     # name differs from the hub canonical
    rsrc_statics=["ERP/ORDERS"],
)

Conceptual split

→ Staging table
 → StageModelStageGenerator → staged table (to calculate hash keys)

→ Raw Data Vault table
 → SourceModel — where is the data?
  → SourceBinding — what to extract, for which vault entity?
   → HubGenerator / LinkGenerator / SatelliteGenerator

StageModel and SourceModel are both Pydantic models (validated on construction). SourceBinding is a plain dataclass — it's just a lightweight container pairing a SourceModel with extraction metadata.

Built by Scalefree

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datavault4sqlglot-0.1.0.tar.gz (57.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

datavault4sqlglot-0.1.0-py3-none-any.whl (36.7 kB view details)

Uploaded Python 3

File details

Details for the file datavault4sqlglot-0.1.0.tar.gz.

File metadata

  • Download URL: datavault4sqlglot-0.1.0.tar.gz
  • Upload date:
  • Size: 57.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.6

File hashes

Hashes for datavault4sqlglot-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e4b477a1d98d81e7b837e176f14f84ab99c43854e5b472ff2a30430a2f524d6b
MD5 2060e11b04bc58dd29005a007f6d5fbb
BLAKE2b-256 8ea66470e884ca41bb70147e1452c937493fd96202ce87e00d9a70371166f965

See more details on using hashes here.

File details

Details for the file datavault4sqlglot-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for datavault4sqlglot-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fb5eccf96405347b439d065c2f2b8b5ece515769e90f9a6f219bd91575be3054
MD5 ae8c7a6710e1c43baa222554cf2c276e
BLAKE2b-256 7fdf24823d69905dd6227a3b36db16f8887289eac42bc9b41e84cc7756238322

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page