Command-line tool and Python library to efficiently diff rows across two different databases.
Project description
data-diff: Compare datasets fast, within or across SQL databases
Use Cases
Data Migration & Replication Testing
Compare source to target and check for discrepancies when moving data between systems:
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
Data Development Testing
Test SQL code and preview changes by comparing development/staging environment data to production:
- Make a change to some SQL code
- Run the SQL code to create a new dataset
- Compare the dataset with its production version or another iteration
data-diff integrates with dbt Core to seamlessly compare local development to production datasets
dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration
:eyes: Watch 4-min demo video
Get started with data-diff & dbt
Reach out on the dbt Slack in #tools-datafold for advice and support
How it works
When comparing the data, data-diff
utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
joindiff
- Recommended for comparing data within the same database
- Uses the outer join operation to diff the rows as efficiently as possible within the same database
- Fully relies on the underlying database engine for computation
- Requires both datasets to be queryable with a single SQL query
- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
hashdiff
- Recommended for comparing datasets across different databases
- Can also be helpful in diffing very large tables with few expected differences within the same database
- Employs a divide-and-conquer algorithm based on hashing and binary search
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
- Time complexity approximates COUNT(*) operation when there are few differences
- Performance degrades when datasets have a large number of differences
More information about the algorithm and performance considerations can be found here
Get started
Validating dbt model changes between dev and prod
⚡ Looking to use data-diff
in dbt development? Head over to our data-diff
+ dbt
documentation to get started!
Compare data tables between databases
🔀 To compare data between databases, install data-diff
with specific database adapters, e.g.:
pip install data-diff 'data-diff[postgresql,snowflake]' -U
Run data-diff
with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
data-diff \
postgresql://<username>:'<password>'@localhost:5432/<database> \
<table> \
"snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
<TABLE> \
-k <primary key column> \
-c <columns to compare> \
-w <filter condition>
Run data-diff
with a toml
configuration file. In the following example, we compare tables between MotherDuck(hosted DuckDB) and Snowflake using the hashdiff algorithm:
## DATABASE CONNECTION ##
[database.duckdb_connection]
driver = "duckdb"
# filepath = "datafold_demo.duckdb" # local duckdb file example
# filepath = "md:" # default motherduck connection example
filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
database = "datafold_demo"
[database.snowflake_connection]
driver = "snowflake"
database = "DEV"
user = "sung"
password = "${SNOWFLAKE_PASSWORD}" # or "<PASSWORD_STRING>"
# the info below is only required for snowflake
account = "${ACCOUNT}" # by33919
schema = "DEVELOPMENT"
warehouse = "DEMO"
role = "DEMO_ROLE"
## RUN PARAMETERS ##
[run.default]
verbose = true
## EXAMPLE DATA DIFF JOB ##
[run.demo_xdb_diff]
# Source 1 ("left")
1.database = "duckdb_connection"
1.table = "development.raw_orders"
# Source 2 ("right")
2.database = "snowflake_connection"
2.table = "RAW_ORDERS" # note that snowflake table names are case-sensitive
verbose = false
# export relevant environment variables, example below
export motherduck_token=<MOTHERDUCK_TOKEN>
# run the configured data-diff job
data-diff --conf datadiff.toml \
--run demo_xdb_diff \
-k "id" \
-c status
# output example
- 1, completed
+ 1, returned
Check out documentation for the full command reference.
Supported databases
Database | Status | Connection string |
---|---|---|
PostgreSQL >=10 | 🟢 | postgresql://<user>:<password>@<host>:5432/<database> |
MySQL | 🟢 | mysql://<user>:<password>@<hostname>:5432/<database> |
Snowflake | 🟢 | "snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]" |
BigQuery | 🟢 | bigquery://<project>/<dataset> |
Redshift | 🟢 | redshift://<username>:<password>@<hostname>:5439/<database> |
DuckDB | 🟢 | duckdb://<dbname>@<filepath> |
MotherDuck | 🟢 | duckdb://<dbname>@<filepath> |
Oracle | 🟡 | oracle://<username>:<password>@<hostname>/servive_or_sid |
Presto | 🟡 | presto://<username>:<password>@<hostname>:8080/<database> |
Databricks | 🟡 | databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema> |
Trino | 🟡 | trino://<username>:<password>@<hostname>:8080/<database> |
Clickhouse | 🟡 | clickhouse://<username>:<password>@<hostname>:9000/<database> |
Vertica | 🟡 | vertica://<username>:<password>@<hostname>:5433/<database> |
ElasticSearch | 📝 | |
Planetscale | 📝 | |
Pinot | 📝 | |
Druid | 📝 | |
Kafka | 📝 | |
SQLite | 📝 |
- 🟢: Implemented and thoroughly tested.
- 🟡: Implemented, but not thoroughly tested yet.
- ⏳: Implementation in progress.
- 📝: Implementation planned. Contributions welcome.
Your database not listed here?
- Contribute a new database adapter – we accept pull requests!
- Get in touch about enterprise support and adding new adapters and features
Contributors
We thank everyone who contributed so far!
Analytics
License
This project is licensed under the terms of the MIT License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for data_diff-0.10.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | c7540efc61cf07554e339d601fdc252d4d99411f64866b31512771c1ec96a034 |
|
MD5 | 72c384ab893f27900a2d87f3651730c1 |
|
BLAKE2b-256 | 6d320b47b44874b3cb5ce7e9de3b1b3c765057e58fae2b4edcef93fbccd1fc3b |