Command-line tool and Python library to efficiently diff rows across two different databases.

Reladiff is a high-performance tool and library designed for diffing large datasets across databases. By running the diff computation inside the databases themselves, Reladiff minimizes data transfer and keeps comparisons fast.

This tool is specifically tailored for data professionals, DevOps engineers, and system administrators.

Reladiff is free, open-source, user-friendly, extensively tested, and delivers fast results, even at massive scale.

Key Features:

  1. Cross-Database Diff: Reladiff employs a divide-and-conquer algorithm, based on matching hashes, to efficiently identify modified segments and download only the necessary data for comparison. This approach ensures exceptional performance when differences are minimal.

    • ⇄ Diffs across over a dozen different databases (e.g., PostgreSQL -> Snowflake)!

    • 🧠 Gracefully handles reduced precision (e.g., timestamp(9) -> timestamp(3)) by rounding according to the database specification.

    • 🔥 Benchmarked to diff over 25M rows in under 10 seconds and over 1B rows in approximately 5 minutes, given no differences.

    • ♾️ Capable of handling tables with tens of billions of rows.

  2. Intra-Database Diff: When both tables reside in the same database, Reladiff compares them using a join operation, with additional optimizations for enhanced speed.

    • Supports materializing the diff into a local table.
    • Can collect various extra statistics about the tables.
  3. Threaded: Utilizes multiple threads to significantly boost performance during diffing operations.

  4. Configurable: Offers numerous options for power-users to customize and optimize their usage.

  5. Automation-Friendly: Outputs both JSON and git-like diffs (with + and -), facilitating easy integration into CI/CD pipelines.

  6. Broad Database Support: Works with over a dozen databases, including MySQL, PostgreSQL, Snowflake, BigQuery, Oracle, ClickHouse, and more. See full list
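
The divide-and-conquer idea behind the cross-database diff (feature 1) can be illustrated with a small in-memory sketch. This is not Reladiff's implementation: the "tables" are plain Python dicts keyed by an integer id, and the names `checksum`, `diff_segment`, and `bisection_threshold` are made up for this illustration. A segment whose hashes match is skipped entirely; a mismatching segment is split in half until it is small enough to compare row by row.

```python
import hashlib

def checksum(table, lo, hi):
    """Hash all rows with lo <= key < hi (stands in for an in-database aggregate)."""
    h = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        h.update(repr((key, table[key])).encode())
    return h.hexdigest()

def diff_segment(t1, t2, lo, hi, bisection_threshold=4):
    """Yield ('-', row) for rows only in t1 and ('+', row) for rows only in t2."""
    if checksum(t1, lo, hi) == checksum(t2, lo, hi):
        return  # identical segment: no rows need to be fetched
    if hi - lo <= bisection_threshold:
        # Small segment: "download" its rows and compare them locally.
        rows1 = {(k, v) for k, v in t1.items() if lo <= k < hi}
        rows2 = {(k, v) for k, v in t2.items() if lo <= k < hi}
        for row in sorted(rows1 - rows2):
            yield ("-", row)
        for row in sorted(rows2 - rows1):
            yield ("+", row)
        return
    # Mismatch in a large segment: split it and recurse into both halves.
    mid = (lo + hi) // 2
    yield from diff_segment(t1, t2, lo, mid, bisection_threshold)
    yield from diff_segment(t1, t2, mid, hi, bisection_threshold)

source = {i: f"row-{i}" for i in range(1000)}
target = dict(source)
target[417] = "changed"   # one modified row
del target[900]           # one missing row

result = list(diff_segment(source, target, 0, 1000))
print(result)
```

With only two differing keys, most of the key range is dismissed by a single hash comparison, which is why this approach works best when differences are few.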

Reladiff is a fork of an archived project called data-diff.

Get Started

🗎 Read the Documentation - our detailed documentation has everything you need to start diffing.

Quickstart

For the impatient ;)

Install

Reladiff is available on PyPI. You may install it by running:

pip install reladiff

Requires Python 3.8+ with pip.

We advise installing it within a virtual environment.

How to Use

Once you've installed Reladiff, you can run it from the command-line:

# Cross-DB diff, using hashes
reladiff  DB1_URI  TABLE1_NAME  DB2_URI  TABLE2_NAME  [OPTIONS]

When both tables belong to the same database, a shorter syntax is available:

# Same-DB diff, using outer join
reladiff  DB1_URI  TABLE1_NAME  TABLE2_NAME  [OPTIONS]
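
To give a feel for what a join-based diff computes, here is a minimal sketch using Python's built-in `sqlite3` module. It is an illustration only, not the SQL that Reladiff generates: the table contents are invented, and the full outer join is emulated with two LEFT JOINs so that the example also runs on older SQLite versions. In this sketch, `-` marks rows found only in the first table (or differing there) and `+` marks their counterparts in the second.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events     (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE old_events (id INTEGER PRIMARY KEY, data TEXT);
    INSERT INTO events     VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO old_events VALUES (1, 'a'), (2, 'B'), (4, 'd');
""")

# Each half keeps rows that are missing from, or different in, the other table.
# "IS NOT" is SQLite's NULL-safe inequality operator.
query = """
    SELECT '-' AS sign, a.id, a.data
    FROM events AS a LEFT JOIN old_events AS b ON a.id = b.id
    WHERE b.id IS NULL OR a.data IS NOT b.data
    UNION ALL
    SELECT '+' AS sign, b.id, b.data
    FROM old_events AS b LEFT JOIN events AS a ON a.id = b.id
    WHERE a.id IS NULL OR a.data IS NOT b.data
    ORDER BY 2, 1 DESC
"""
diff_rows = conn.execute(query).fetchall()
for row in diff_rows:
    print(row)
```

Because both tables live in the same database, the whole comparison is a single query and no row data has to leave the database until the differences are reported.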

Or, you can import and run it from Python:

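from typing import Literal, Tuple

from reladiff import connect_to_table, diff_tables

# Arguments: database URI, table name, key column
table1 = connect_to_table("postgresql:///", "table_name", "id")
table2 = connect_to_table("mysql:///", "table_name", "id")

# Each result is a sign ('-' or '+') and the row it applies to
sign: Literal['+', '-']
row: Tuple[str, ...]
for sign, row in diff_tables(table1, table2):
    print(sign, row)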
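
The `(sign, row)` pairs are easy to post-process. The sketch below groups them by key to tell changed rows apart from rows that exist on only one side. It runs on a hard-coded sample list rather than a live `diff_tables` call, and it assumes that the first element of each row is the key column and that `-` refers to the first table; `summarize` is a hypothetical helper, not part of Reladiff.

```python
from collections import defaultdict

def summarize(diff):
    """Group (sign, row) pairs by key and classify each key."""
    signs_by_key = defaultdict(set)
    for sign, row in diff:
        signs_by_key[row[0]].add(sign)  # assumption: row[0] is the key

    summary = {"only_in_table1": [], "only_in_table2": [], "changed": []}
    for key, signs in sorted(signs_by_key.items()):
        if signs == {"-", "+"}:
            summary["changed"].append(key)
        elif signs == {"-"}:
            summary["only_in_table1"].append(key)
        else:
            summary["only_in_table2"].append(key)
    return summary

# Invented sample data in the same shape as the loop above prints.
sample_diff = [
    ("-", ("1", "alice")),
    ("+", ("1", "alicia")),
    ("-", ("2", "bob")),
    ("+", ("3", "carol")),
]
summary = summarize(sample_diff)
print(summary)
```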

Read our detailed instructions:

"Real-world" example: Diff "events" table between Postgres and Snowflake

# -k: key column that identifies each event
# -c: extra column to compare
# -w: filter applied to the rows on both databases
reladiff \
  postgresql:/// \
  events \
  "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  events \
  -k event_id \
  -c event_data \
  -w "event_time < '2024-10-10'"
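
When a command like this is launched from a script or CI job, building the argument list in Python sidesteps shell quoting problems with URIs and filter expressions. The helper below is a hypothetical sketch: `build_reladiff_command` is not part of Reladiff, `DB2_URI` is a placeholder, and only the `-k`, `-c`, and `-w` options shown above are used.

```python
import shlex

def build_reladiff_command(db1, table1, db2, table2, key, columns=(), where=None):
    """Return an argv list for a cross-database reladiff invocation."""
    cmd = ["reladiff", db1, table1, db2, table2, "-k", key]
    for col in columns:
        cmd += ["-c", col]          # one -c per extra column to compare
    if where is not None:
        cmd += ["-w", where]        # passed as a single argument, no manual quoting
    return cmd

cmd = build_reladiff_command(
    "postgresql:///", "events",
    "DB2_URI", "events",            # placeholder for the second database URI
    key="event_id",
    columns=["event_data"],
    where="event_time < '2024-10-10'",
)
print(shlex.join(cmd))  # shell-safe rendering, useful for logging

# To actually run it (requires reladiff installed and reachable databases):
# import subprocess; subprocess.run(cmd, check=True)
```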

"Real-world" example: Diff "events" and "old_events" tables in the same Postgres DB

Materializes the results into a new table, containing the current timestamp in its name.

reladiff \
  postgresql:///  events  old_events \
  -k org_id \
  -c created_at -c is_internal \
  -w "org_id != 1 and org_id < 2000" \
  -m test_results_%t \
  --materialize-all-rows \
  --table-write-limit 10000

Technical Explanation

Check out this technical explanation of how cross-database reladiff works.

We're here to help!

How to Contribute

  • Please read the contributing guidelines to get started.
  • Feel free to open a new issue or work on an existing one.

Big thanks to everyone who contributed so far:

License

This project is licensed under the terms of the MIT License.
