DVT — Data Virtualization Tool

Connect every database. Move data across engines. Transform in place.
The declarative pipeline tool that brings dbt's power to the ETLT domain.


What is DVT?

DVT (Data Virtualization Tool) is a fork of dbt-core that extends the dbt experience into the ETLT (Extract, Transform, Load, Transform) domain. If you know dbt, you already know DVT -- same CLI, same SQL models, same Jinja templating, same DAG-based execution. The difference is what happens under the hood.

Where dbt connects to one data warehouse at a time, DVT connects to all of them at once. You define sources across PostgreSQL, Snowflake, Databricks, and any other database in a single project, and DVT figures out how to get the data where it needs to go.

The ETLT Pattern

Traditional ETL tools force you to write imperative extraction scripts. dbt gave us declarative transformations but only within a single warehouse. DVT closes the gap:

  1. Extract -- DVT reads from any source database using Spark JDBC connectors, automatically resolving which sources need to be pulled from which targets.
  2. Load (Staging) -- Extracted data lands in a Delta Lake staging layer, giving you ACID transactions, time travel, and efficient incremental reads.
  3. Transform -- Your SQL models run against the target warehouse, exactly like dbt. Same ref(), same source(), same materializations.
  4. Load (Final) -- Results materialize in the target warehouse as tables or views, just as you'd expect.

Adapter Pushdown vs. Spark Federation

DVT is smart about when to use Spark and when not to:

  • Same-engine query? DVT pushes the SQL directly to the target database adapter -- zero Spark overhead, identical to dbt behavior.
  • Cross-engine query? DVT spins up a Spark session, extracts the relevant sources via JDBC, registers them as DataFrames, and executes your SQL across engines using Spark's federation capabilities.

You never have to think about this. You write SQL. DVT decides the execution path.
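
A sketch of how the two paths show up in a project (model names are illustrative; the sources follow the connection-based definitions shown later under Configuration Files). A model whose sources all live on its own target is pushed down, while a model joining sources from two different connections is federated:

-- models/staging/stg_orders.sql (illustrative)
-- Every source here lives on the model's own PostgreSQL target, so DVT
-- pushes the compiled SQL straight to the adapter -- no Spark session.
SELECT o.order_id, o.customer_id, o.amount
FROM {{ source('crm_data', 'orders') }} o

-- models/marts/orders_with_accounts.sql (illustrative)
-- crm_data is on PostgreSQL and finance_data is on Snowflake, so DVT routes
-- this model through Spark Federation (JDBC extract, Delta staging, cross-engine join).
SELECT o.order_id, a.account_id
FROM {{ source('crm_data', 'orders') }} o
JOIN {{ source('finance_data', 'accounts') }} a
    ON o.customer_id = a.customer_id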


Key Features

  • Multi-Source Federation -- Join data from PostgreSQL, Snowflake, Databricks, and more in a single SQL model. DVT handles the cross-engine orchestration transparently.

  • Adapter Pushdown -- When your model and its sources live on the same database, DVT pushes execution directly to the adapter with zero overhead. Identical performance to dbt.

  • Delta Lake Staging -- All intermediate data flows through Delta Lake, giving you ACID transactions, schema enforcement, and efficient incremental processing with append/overwrite modes.

  • Incremental Models -- Full support for merge, append, and delete+insert strategies. {{ this }} resolves to the staging Delta table for correct anti-join patterns across engines (see the sketch after this list).

  • Predicate Pushdown -- DVT's optimizer analyzes your SQL and pushes WHERE clauses down to the source databases during extraction, minimizing data movement over the wire.

  • dvt sync -- One command to set up your entire environment: installs database adapters, PySpark, JDBC drivers, and native cloud connectors. No manual JAR hunting.

  • dvt debug -- Comprehensive diagnostic command that validates all targets, computes, buckets, and database connections in one shot. Know exactly what's configured and what's reachable before you run anything.

  • dbt-Compatible -- Your existing dbt projects, packages, macros, and tests work with DVT. The CLI commands you know (run, build, test, seed, docs) all work the same way.
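
A minimal sketch of the incremental pattern, assuming a hypothetical events source table and column names; the config options and the is_incremental() guard are standard dbt, and {{ this }} resolves to the staging Delta table when the model takes the federation path:

-- models/marts/fct_events.sql (illustrative)
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='event_id'
    )
}}

SELECT event_id, customer_id, event_type, event_ts
FROM {{ source('crm_data', 'events') }}

{% if is_incremental() %}
  -- only pull rows newer than what has already been loaded;
  -- {{ this }} points at the staging Delta table on federated runs
  WHERE event_ts > (SELECT max(event_ts) FROM {{ this }})
{% endif %}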


How DVT Works

                            DVT Execution Flow
                            ==================

  +--------------+    +--------------+    +--------------+    +-------+
  |  PostgreSQL  |    |  Snowflake   |    |  Databricks  |    |  ...  |
  +--------------+    +--------------+    +--------------+    +-------+
         |                   |                   |                |
         +-------------------+-------------------+----------------+
                             |  JDBC
                             v
                   +--------------------+
                   |  Execution Router  |
                   +--------------------+
                      /              \
                     /                \
            Same engine?         Cross engine?
                   /                    \
                  v                      v
       +------------------+    +--------------------+
       | Adapter Pushdown |    | Spark Federation   |
       | (direct SQL,     |    | (JDBC extract,     |
       |  zero overhead)  |    |  Delta staging,    |
       +------------------+    |  cross-engine SQL) |
                  \            +--------------------+
                   \                   /
                    \                 /
                     v               v
              +-------------------------+
              |    Target Warehouse     |
              | (Tables, Views, Incr.)  |
              +-------------------------+

Execution flow:

  • The Execution Router inspects every model and its upstream sources. If the model's target matches all source targets, it uses Adapter Pushdown (direct SQL, no Spark). If sources span multiple engines, it routes through Spark Federation, which extracts data via JDBC, stages it in Delta Lake, executes the cross-engine SQL, and loads the results into the Target Warehouse.

Tested Adapters

DVT has been tested and validated against the following database adapters across all core commands (seed, run, compile, test, build, show, snapshot, and docs generate) as well as cross-engine federation:

  Adapter      Pushdown   Federation   Notes
  PostgreSQL   Tested     Tested       Full support -- pushdown and federation
  MySQL        Tested     Tested       Full support via dbt-mysql adapter
  MariaDB      Tested     Tested       Full support via dbt-mysql adapter (MariaDB mode)
  SQL Server   Tested     Tested       Full support via dbt-sqlserver adapter
  Oracle       Tested     Tested       Full support via dbt-oracle adapter
  Snowflake    Tested     Tested       Full support -- JDBC warehouse passthrough
  Databricks   Tested     Tested       Full support -- native connector for bulk transfer

DVT's architecture is adapter-agnostic -- any database with a dbt adapter and a JDBC driver can be integrated. BigQuery, Redshift, and others are on the roadmap.


Quick Start

Prerequisites

  • Python 3.10 -- 3.13 (Python 3.14 is not yet supported by the ecosystem)
  • Java 17 (required for Apache Spark -- Spark 4.x needs Java 17 or 21)
  • uv (recommended) or pip for package management

1. Install DVT

Using uv (recommended):

# Create a new project with a supported Python version
uv init my-dvt-project --python 3.12
cd my-dvt-project

# Add dvt-ce
uv add dvt-ce

Important: If your system default Python is 3.14+, you must specify --python 3.12 (or 3.10/3.11/3.13) when creating the project. Python 3.14 is not yet supported by DVT's upstream dependencies.

Using pip:

# Ensure you're using Python 3.10-3.13
python3.12 -m venv .venv
source .venv/bin/activate

pip install dvt-ce

2. Initialize Your Project

dvt init

This scaffolds a new DVT project and creates the configuration directory at ~/.dvt/ with:

  File           Purpose
  profiles.yml   Database connection profiles (targets)
  computes.yml   Spark compute engine configuration
  buckets.yml    Staging storage configuration (local filesystem)

3. Configure Your Connections

Edit ~/.dvt/profiles.yml to define your database targets. See Configuration Files below for full annotated examples.

4. Sync Your Environment

dvt sync

This is where DVT shines. A single command that:

  • Installs the correct database adapters (dbt-postgres, dbt-snowflake, etc.) based on your profiles
  • Installs the correct PySpark version from your computes.yml
  • Downloads JDBC drivers for every database in your profile
  • Downloads native cloud connectors (Snowflake, Databricks) for optimized data transfer
  • Validates Java version compatibility with your Spark version

No more hunting for JAR files, no more version mismatches. One command, everything set up.

5. Validate Your Setup

Before running anything against real databases, validate that everything is properly configured:

Check all connections, computes, and buckets:

dvt debug

dvt debug gives you a full diagnostic report:

  • All configured targets and whether they're reachable
  • Compute engine configuration and Spark readiness
  • Bucket configuration and storage accessibility
  • Connection test for any specific target with --connection <target_name>

You can also filter the output:

dvt debug --debug-target          # Show only target configurations
dvt debug --debug-compute         # Show only compute configurations
dvt debug --debug-bucket          # Show only bucket configurations
dvt debug --connection pg_dev     # Test a specific database connection

Validate your project parses correctly:

dvt parse

dvt parse reads every model, source, test, seed, and macro in your project and validates that:

  • All ref() and source() references resolve correctly
  • The DAG has no cycles
  • All configurations are valid
  • Jinja compilation succeeds

If dvt debug says your connections are good and dvt parse says your project is valid, you're ready to run.

6. Run Your Models

# Run all models
dvt run

# Run a specific model
dvt run --select my_model

# Run with full refresh (DROP + CREATE instead of TRUNCATE + INSERT)
dvt run --full-refresh

# Run everything: seeds, models, and tests in DAG order
dvt build

Configuration Files

profiles.yml -- Database Connections

Location: ~/.dvt/profiles.yml

my_project:
  target: pg_dev                         # Default target
  outputs:

    # PostgreSQL target (read/write)
    pg_dev:
      type: postgres
      host: localhost
      port: 5432
      user: analytics
      password: "{{ env_var('PG_PASSWORD') }}"
      dbname: warehouse
      schema: public
      threads: 4

    # Snowflake target (read/write)
    sf_prod:
      type: snowflake
      account: xy12345.us-east-1
      user: DVT_USER
      password: "{{ env_var('SF_PASSWORD') }}"
      role: TRANSFORM_ROLE
      warehouse: COMPUTE_WH
      database: ANALYTICS
      schema: PUBLIC
      threads: 4

    # Databricks target (read/write)
    dbx_dev:
      type: databricks
      host: adb-1234567890.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123
      token: "{{ env_var('DBX_TOKEN') }}"
      schema: default
      catalog: main
      threads: 4

Multi-source federation: In your schema YAML files, each source block can specify which target to read from using the connection property:

# models/staging/sources.yml
sources:
  - name: crm_data
    connection: pg_dev                   # Read from PostgreSQL
    tables:
      - name: customers
      - name: orders

  - name: finance_data
    connection: sf_prod                  # Read from Snowflake
    tables:
      - name: transactions
      - name: accounts

Now a model can join data from both sources:

-- models/marts/customer_transactions.sql
SELECT
    c.customer_id,
    c.name,
    t.amount,
    t.transaction_date
FROM {{ source('crm_data', 'customers') }} c
JOIN {{ source('finance_data', 'transactions') }} t
    ON c.customer_id = t.customer_id

DVT detects that crm_data lives on PostgreSQL and finance_data lives on Snowflake, extracts both via Spark JDBC, joins them in Spark, and loads the result into your default target.
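
To see Predicate Pushdown in the same setting: adding a filter to this model lets DVT's optimizer push the WHERE clause into the JDBC extraction queries, so only matching rows leave the source databases (model name and date literal below are illustrative):

-- models/marts/recent_customer_transactions.sql (illustrative variant)
SELECT
    c.customer_id,
    c.name,
    t.amount,
    t.transaction_date
FROM {{ source('crm_data', 'customers') }} c
JOIN {{ source('finance_data', 'transactions') }} t
    ON c.customer_id = t.customer_id
-- pushed down to the Snowflake-side extraction of transactions
WHERE t.transaction_date >= '2025-01-01'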

computes.yml -- Spark Configuration

Location: ~/.dvt/computes.yml

my_project:
  target: local_spark                    # Default compute
  computes:
    local_spark:
      type: spark
      version: "4.0.0"                  # PySpark version (run 'dvt sync' after changing)
      master: "local[*]"                # Use all available CPU cores

      config:
        spark.driver.memory: "4g"
        spark.executor.memory: "4g"
        spark.sql.adaptive.enabled: "true"
        spark.sql.parquet.compression.codec: "zstd"

      delta:
        optimize_on_write: false
        cleanup_retention_hours: 0
        parallel_cleanup: false

      jdbc_extraction:                   # Settings for reading from databases
        num_partitions: 8
        fetch_size: 10000

      jdbc_load:                         # Settings for writing to databases
        num_partitions: 4
        batch_size: 10000

Java compatibility: Spark 3.x requires Java 8, 11, or 17. Spark 4.x requires Java 17 or 21. dvt sync will warn you if your Java version is incompatible.
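
A quick sanity check before running dvt sync (standard JDK commands; adjust for your shell):

java -version       # should report 17 or 21 for Spark 4.x, or 8/11/17 for Spark 3.x
echo $JAVA_HOME     # confirm it points at the installation you expect Spark to use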

buckets.yml -- Staging Storage

Location: ~/.dvt/buckets.yml

my_project:
  target: local_staging                  # Default bucket
  buckets:
    local_staging:
      type: filesystem                   # Local disk staging (default)

dbt_project.yml -- Project Configuration

Location: Project root

name: "my_project"
version: "1.0.0"
profile: "my_project"                    # References a profile in profiles.yml

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets:
  - "target"
  - "dvt_packages"

# Optional: pin adapter versions for dvt sync
# require-adapters:
#   postgres: ">=1.0.0"
#   snowflake: ">=1.5.0"

CLI Command Reference

DVT provides all the commands you know from dbt, plus DVT-specific additions for multi-engine orchestration.

DVT-Specific Commands

  Command     Description
  dvt sync    Sync your environment: install adapters, PySpark, JDBC drivers, and cloud connectors from your config files. Run this after changing profiles.yml or computes.yml.
  dvt debug   Comprehensive diagnostics: validate all targets, computes, buckets, and test database connections. Supports --connection <target> to test a specific target.
  dvt clean   Extended cleanup: removes build artifacts and supports --optimize for Delta Lake OPTIMIZE + VACUUM, and --older-than for age-based staging cleanup.

Core Commands

  Command             Description
  dvt run             Compile SQL models and execute against the target database. Supports --select, --exclude, --full-refresh.
  dvt build           Run all seeds, models, and tests in DAG order. The "do everything" command.
  dvt seed            Load CSV seed files into any target warehouse. By default, uses the database adapter (same as dbt). Use --spark for Spark-based bulk loading, or --compute <name> for a specific Spark compute engine.
  dvt compile         Generate executable SQL without running it. Output goes to target/.
  dvt show            Compile and run a query, returning a preview without materializing results.
  dvt retry           Retry only the nodes that failed in the previous run.
  dvt docs generate   Generate the documentation website for your project.
  dvt init            Initialize a new DVT project with scaffolded directory structure and config files.
  dvt parse           Parse the full project and validate all references, configs, and Jinja compilation.
  dvt deps            Install dbt packages from packages.yml. Supports --add-package and --upgrade.
  dvt list / dvt ls   List all resources (models, tests, sources, etc.) in your project.
  dvt test            Run data tests (schema tests, singular tests, custom tests) on deployed models. Tested across all supported adapters.
  dvt snapshot        Execute snapshot definitions for slowly changing dimensions. Uses adapter-only execution (same as dbt). See Snapshot Notes below.
  dvt docs serve      Serve the documentation website locally for browsing.

Inherited from dbt (not yet validated in DVT)

These dbt-core commands are available in DVT but have not been fully tested or adapted for multi-engine federation. They may work as expected for single-target use cases, but behavior across federated sources is not guaranteed. Use at your own discretion.

  Command                Description
  dvt clone              Create clones of selected nodes based on a prior manifest state.
  dvt run-operation      Run a named macro with supplied arguments.
  dvt source freshness   Check the current freshness of your project's sources against defined thresholds.

DVT-Specific Flags

These flags are available on commands that interact with the federation engine:

  Flag                    Available On        Description
  --compute / -c          run, build, seed    Select which compute engine to use from computes.yml. For seed, activates Spark-based bulk loading with the named compute.
  --spark                 seed, build         Use Spark for seeding (bulk JDBC load) with the default compute from computes.yml. Shorthand for --compute <default>.
  --bucket / -b           run, build          Select which staging bucket to use from buckets.yml.
  --connection <target>   debug               Test connectivity to a specific database target.
  --debug-target          debug               Show only target configurations in debug output.
  --debug-compute         debug               Show only compute configurations in debug output.
  --debug-bucket          debug               Show only bucket configurations in debug output.
  --optimize              clean               Run Delta Lake OPTIMIZE + VACUUM on staging tables.
  --older-than            clean               Clean staging files older than the specified age.
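
For example, combining these flags with names taken from the configuration examples above (a usage sketch; only the documented flags are assumed):

# Run one model with an explicit compute engine and staging bucket
dvt run --select customer_transactions --compute local_spark --bucket local_staging

# Bulk-load seeds through Spark using the default compute
dvt seed --spark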

dbt Compatibility Note

All standard dbt-core CLI commands are available in DVT; the tables above represent the full set available today, with validation status noted where relevant. Some dbt Cloud-specific features (such as dbt cloud subcommands) are not included, as DVT is focused on the open-source, self-hosted experience.


Cross-Dialect Compilation

DVT compiles each model using the adapter that matches its target. When a model has config(target='mssql_docker'), dvt compile uses the MSSQL adapter for macro resolution and Jinja rendering, then transpiles the SQL to the target dialect via SQLGlot. This means:

  • Macros resolve per-target: {{ ref() }}, {{ source() }}, and adapter-specific macros all render using the correct adapter for each model's target.
  • No live connection needed: dvt compile only needs the adapter plugin installed, not a live database connection.
  • Automatic transpilation: If a model's source dialect differs from its execution dialect, SQLGlot handles the translation (identifier quoting, function names, cast syntax, etc.).

Write your SQL in the dialect of the target database. DVT handles the rest.
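
As an illustration (model name and columns are hypothetical; the config(target=...) form follows the example above), a model pinned to a SQL Server target can be written directly in T-SQL:

-- models/marts/mssql_top_orders.sql (illustrative)
{{ config(target='mssql_docker', materialized='table') }}

-- T-SQL syntax for the MSSQL target; if an upstream source is in another
-- dialect, SQLGlot handles the translation at compile time.
SELECT TOP 100
    order_id,
    amount
FROM {{ ref('stg_orders') }}
ORDER BY amount DESC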

For adapter dispatch (writing macros that work across engines), use dbt's standard adapter.dispatch() pattern:

{% macro my_timestamp_macro() %}
    {{ return(adapter.dispatch('my_timestamp_macro')()) }}
{% endmacro %}

{% macro postgres__my_timestamp_macro() %}
    current_timestamp
{% endmacro %}

{% macro sqlserver__my_timestamp_macro() %}
    GETDATE()
{% endmacro %}
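
Calling the macro from a model then resolves to the right implementation for that model's target (model and column names below are illustrative):

-- models/staging/stg_orders_audit.sql (illustrative)
SELECT
    order_id,
    {{ my_timestamp_macro() }} AS loaded_at
FROM {{ ref('stg_orders') }}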

Snapshot Notes

DVT snapshots work via adapter-only execution (same as dbt) -- no federation path. Key points:

  • Same-engine only: All sources referenced by a snapshot must be on the same engine as the snapshot's target. Cross-engine snapshots are not supported.
  • Automatic schema resolution: When a snapshot has config(target='mssql_docker'), DVT automatically resolves the correct schema from the target's profile config (e.g., dbo for MSSQL, SYSTEM for Oracle, devdb for MySQL). You don't need to hardcode target_schema.
  • Omit target_schema: For cross-engine portability, omit target_schema from your snapshot config and let DVT resolve it from the target's profile. If you do set target_schema, it will be respected as an explicit override.

For example, a snapshot that relies on this automatic schema resolution:

{% snapshot my_snapshot %}
{{
    config(
        unique_key='id',
        strategy='check',
        check_cols=['name', 'status']
    )
}}
SELECT * FROM {{ ref('my_model') }}
{% endsnapshot %}

Built on dbt-core

DVT is a proud fork of dbt-core by dbt Labs. We stand on the shoulders of the incredible work done by the dbt Labs team and the dbt community.

dbt pioneered the idea that data analysts should be able to transform data using the same practices that software engineers use to build applications -- version control, modularity, testing, and documentation. DVT extends this vision to the ETLT domain, where data needs to move across multiple databases before it can be transformed.

What DVT inherits from dbt-core:

  • The full CLI framework and command structure
  • SQL model compilation with Jinja templating
  • DAG-based dependency resolution and execution
  • Incremental materializations and seeds
  • Testing framework (schema tests, data tests, custom tests)
  • Documentation generation
  • Package management

What DVT adds:

  • Multi-datasource federation via Apache Spark
  • JDBC-based extraction from any database
  • Delta Lake staging with ACID transactions
  • Intelligent execution routing (pushdown vs. federation)
  • dvt sync for automated environment setup
  • Extended dvt debug with multi-target diagnostics
  • Cross-engine incremental models with {{ this }} resolution

We are grateful to dbt Labs for open-sourcing dbt-core under the Apache License 2.0, which makes projects like DVT possible.


License

DVT is free and open-source software licensed under the Apache License 2.0.

Copyright 2025-2026 Hesham Badawi.

Licensed under the Apache License, Version 2.0
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

This product includes software developed by dbt Labs, Inc. See the NOTICE file for full attribution details.

You are free to use DVT for any purpose -- personal, commercial, or enterprise. Build on it, extend it, deploy it in production. The Apache 2.0 license ensures DVT will remain free and open-source.


Community

We're building DVT in the open and we'd love for you to be part of it.

  • Discord -- Join the DVT community server for discussions, support, and announcements.
  • GitHub Issues -- Found a bug or have a feature request? Open an issue.
  • Contributing -- Want to help build DVT? Contributions are welcome. Start by opening an issue to discuss your idea.

Built with conviction by data engineers, for data engineers.
