Skip to main content

Command-line tool and Python library to efficiently diff rows across two different databases.

Project description

Datafold

data-diff: Compare datasets fast, within or across SQL databases

data-diff-logo


Use Cases

Data Migration & Replication Testing

Compare source to target and check for discrepancies when moving data between systems:

  • Migrating to a new data warehouse (e.g., Oracle > Snowflake)
  • Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
  • Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)

Data Development Testing

Test SQL code and preview changes by comparing development/staging environment data to production:

  1. Make a change to some SQL code
  2. Run the SQL code to create a new dataset
  3. Compare the dataset with its production version or another iteration

dbt

data-diff integrates with dbt Core to seamlessly compare local development to production datasets

data-development-testing

dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration

:eyes: Watch 4-min demo video

Get started with data-diff & dbt

Also available in a VS Code Extension

Reach out on the dbt Slack in #tools-datafold for advice and support

How it works

When comparing the data, data-diff utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:

joindiff

  • Recommended for comparing data within the same database
  • Uses the outer join operation to diff the rows as efficiently as possible within the same database
  • Fully relies on the underlying database engine for computation
  • Requires both datasets to be queryable with a single SQL query
  • Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset

hashdiff

  • Recommended for comparing datasets across different databases
  • Can also be helpful in diffing very large tables with few expected differences within the same database
  • Employs a divide-and-conquer algorithm based on hashing and binary search
  • Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
  • Time complexity approximates COUNT(*) operation when there are few differences
  • Performance degrades when datasets have a large number of differences

More information about the algorithm and performance considerations can be found here

Get started

Validating dbt model changes between dev and prod

⚡ Looking to use data-diff in dbt development? Head over to our data-diff + dbt documentation to get started!

Compare data tables between databases

🔀 To compare data between databases, install data-diff with specific database adapters, e.g.:

pip install data-diff 'data-diff[postgresql,snowflake]' -U

Run data-diff with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using hashdiff algorithm:

data-diff \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  <table> \
  "snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  <TABLE> \
  -k <primary key column> \
  -c <columns to compare> \
  -w <filter condition>

Check out documentation for the full command reference.

Supported databases

Database Status Connection string
PostgreSQL >=10 🟢 postgresql://<user>:<password>@<host>:5432/<database>
MySQL 🟢 mysql://<user>:<password>@<hostname>:5432/<database>
Snowflake 🟢 "snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"
BigQuery 🟢 bigquery://<project>/<dataset>
Redshift 🟢 redshift://<username>:<password>@<hostname>:5439/<database>
Oracle 🟡 oracle://<username>:<password>@<hostname>/servive_or_sid
Presto 🟡 presto://<username>:<password>@<hostname>:8080/<database>
Databricks 🟡 databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>
Trino 🟡 trino://<username>:<password>@<hostname>:8080/<database>
Clickhouse 🟡 clickhouse://<username>:<password>@<hostname>:9000/<database>
Vertica 🟡 vertica://<username>:<password>@<hostname>:5433/<database>
DuckDB 🟡
ElasticSearch 📝
Planetscale 📝
Pinot 📝
Druid 📝
Kafka 📝
SQLite 📝
  • 🟢: Implemented and thoroughly tested.
  • 🟡: Implemented, but not thoroughly tested yet.
  • ⏳: Implementation in progress.
  • 📝: Implementation planned. Contributions welcome.

Your database not listed here?


Contributors

We thank everyone who contributed so far!


Analytics


License

This project is licensed under the terms of the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cz_data_diff-0.0.4.tar.gz (101.0 kB view details)

Uploaded Source

Built Distribution

cz_data_diff-0.0.4-py3-none-any.whl (129.4 kB view details)

Uploaded Python 3

File details

Details for the file cz_data_diff-0.0.4.tar.gz.

File metadata

  • Download URL: cz_data_diff-0.0.4.tar.gz
  • Upload date:
  • Size: 101.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.9

File hashes

Hashes for cz_data_diff-0.0.4.tar.gz
Algorithm Hash digest
SHA256 ca1012a56909ff7c18aacba018d1e027393cf83bc7d2b846c586657f5edbf037
MD5 a3f930debd42b3b40af5e1b252f07301
BLAKE2b-256 452d251a80e4db4bce0cf4f582ff61c9b8e24c289b1f2222445f4f4f056f304f

See more details on using hashes here.

File details

Details for the file cz_data_diff-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: cz_data_diff-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 129.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.9

File hashes

Hashes for cz_data_diff-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 8f749e498d377ddd1275a6de110eb41c3868ad9b4b27a674d798d351c674c092
MD5 fa726fd7014f55eb632cc3d0ec6dac34
BLAKE2b-256 e5c79d7eb3e3fc72f7ce09f9e1495fb6d10baa1c7cb2dc7fc41c73e783d77aff

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page