Data Vault 2 SQL generator using sqlglot
Project description
datavault4sqlglot
A Python library for generating Data Vault 2 SQL using sqlglot. Produces dialect-agnostic SQL for Hubs, Links, Satellites, and Staging layers — no dbt, no live database connection required.
Quick Start
from datavault4sqlglot.generators.hub import HubGenerator
from datavault4sqlglot.metadata import SourceModel, SourceBinding
# Describe the staged source table
source = SourceModel(
schema="stage",
table_name="stg_orders",
load_date_col="ldts",
record_source_col="rsrc",
)
# Bind the source to a Hub: declare which column is the business key
binding = SourceBinding(
source=source,
business_keys=["order_id"],
)
# Generate the Hub SQL
generator = HubGenerator(
sources=[binding],
hashkey="hk_order_h",
target_schema="dv",
target_table="order_h",
is_incremental=True,
)
print(generator.to_sql())
Configuration
The library uses a global configuration object (datavault4sqlglot.config) with the following defaults:
| Key | Default | Description |
|---|---|---|
dialect |
snowflake |
Target SQL dialect |
hash |
MD5 |
Hash algorithm (MD5, SHA256, SHA1) |
ldts_alias |
ldts |
Load date timestamp column name |
rsrc_alias |
rsrc |
Record source column name |
ledts_alias |
ledts |
Load end date column name |
end_of_all_times |
9999-12-31 |
Sentinel for open-ended records |
beginning_of_all_times |
0001-01-01 |
Sentinel for ghost records |
hashkey_input_case_sensitive |
false |
Apply UPPER() before hashing hash keys |
hashdiff_input_case_sensitive |
false |
Apply UPPER() before hashing hash diffs |
use_trim |
true |
Apply TRIM() before hashing |
quote_identifiers |
true |
Quote table and column identifiers |
ghost_record_rsrc |
SYSTEM |
Record source value for the unknown ghost row |
ghost_record_error_rsrc |
ERROR |
Record source value for the error ghost row |
You can override any of these by placing a config.json file in your current working directory.
The library will automatically load and apply these settings when imported.
Example config.json:
{
"ldts_alias": "load_date_timestamp",
"hash": "SHA256",
"dialect": "bigquery"
}
You can also manually load a configuration file from a specific path:
from datavault4sqlglot.config import config, load_config
load_config(config, "/path/to/my/custom_config.json")
Models
Three distinct classes, each for a different layer:
StageModel — used exclusively by StageGenerator
Describes a raw source table that needs to be hashed and prepared. It owns everything about that transformation: which columns to hash, derived expressions, missing columns for schema evolution, etc. It's a self-contained description of one staging job — no binding to anything else.
StageModel(
table_name="raw.orders",
hashed_columns={"HK_ORDER_H": ["ORDER_ID"]},
derived_columns={"ldts": "CURRENT_TIMESTAMP()"},
)
SourceModel — the physical table pointer used by vault generators (Hub, Link, Sat)
Just says where to find the already-staged data: table name, optional schema/database, and which
columns are ldts/rsrc if they differ from the config defaults. No transformation logic.
SourceModel(
database="RAW_DB",
schema="STAGE",
table_name="STG_ORDERS",
load_date_col="LOAD_DATE",
record_source_col="RECORD_SOURCE",
)
SourceBinding — wraps a SourceModel with DV-loading intent
Answers what to extract from that staged table for a specific vault entity: per-source physical
bk_columns (when they differ from the hub's canonical names), foreign hash keys (for links),
rsrc_statics for HWM scoping, etc. A single SourceModel can be wrapped in different
SourceBindings for different vault entities.
The hub-level canonical business_keys live on HubGenerator itself (not on the binding) —
every binding into the same hub maps onto that same canonical name set.
SourceBinding(
source=_SRC_ORDERS_MODEL, # the SourceModel
bk_columns=["SAP_ORDER_ID"], # only needed when the source's column
# name differs from the hub canonical
rsrc_statics=["ERP/ORDERS"],
)
Conceptual split
→ Staging table
→ StageModel → StageGenerator → staged table (to calculate hash keys)
→ Raw Data Vault table
→ SourceModel — where is the data?
→ SourceBinding — what to extract, for which vault entity?
→ HubGenerator / LinkGenerator / SatelliteGenerator
StageModel and SourceModel are both Pydantic models (validated on construction). SourceBinding is a plain dataclass — it's just a lightweight container pairing a SourceModel with extraction metadata.
Built by Scalefree
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file datavault4sqlglot-0.1.0.tar.gz.
File metadata
- Download URL: datavault4sqlglot-0.1.0.tar.gz
- Upload date:
- Size: 57.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e4b477a1d98d81e7b837e176f14f84ab99c43854e5b472ff2a30430a2f524d6b
|
|
| MD5 |
2060e11b04bc58dd29005a007f6d5fbb
|
|
| BLAKE2b-256 |
8ea66470e884ca41bb70147e1452c937493fd96202ce87e00d9a70371166f965
|
File details
Details for the file datavault4sqlglot-0.1.0-py3-none-any.whl.
File metadata
- Download URL: datavault4sqlglot-0.1.0-py3-none-any.whl
- Upload date:
- Size: 36.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fb5eccf96405347b439d065c2f2b8b5ece515769e90f9a6f219bd91575be3054
|
|
| MD5 |
ae8c7a6710e1c43baa222554cf2c276e
|
|
| BLAKE2b-256 |
7fdf24823d69905dd6227a3b36db16f8887289eac42bc9b41e84cc7756238322
|