Git diff for datasets: compare datasets and understand what changed.

Dift

Dift is an open-source CLI tool that helps data professionals compare two datasets and instantly understand:

  • what changed
  • why it matters
  • whether the new data is safe to trust

What's New in v0.2.1

Dift v0.2.1 introduces a more polished CLI experience and broader file support.

Improvements

  • Better console formatting
  • Rich terminal colors
  • Cleaner summary tables
  • Risk level highlighting
  • Percentage row change display
  • Better missing file error messages
  • JSON dataset support
  • JSON example datasets
  • Excel example datasets
  • Parquet example datasets
  • Improved installation instructions

Why Dift?

Bad data breaks:

  • dashboards
  • reports
  • ETL pipelines
  • analytics workflows
  • ML models
  • business decisions

Dift helps teams catch risky data changes before they cause damage.


Features (v0.2.1)

Compare two datasets in seconds.

Supported Formats

  • CSV
  • Parquet
  • Excel (.xlsx, .xls)
  • JSON
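As a rough illustration of how the formats above could be told apart, here is a minimal sketch that maps a file extension to a format name. The mapping and the detect_format helper are hypothetical and written for this example only; Dift's own reader logic may work differently.

```python
from pathlib import Path

# Hypothetical mapping for illustration; not taken from Dift's source.
FORMAT_BY_SUFFIX = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".xlsx": "excel",
    ".xls": "excel",
    ".json": "json",
}


def detect_format(path: str) -> str:
    """Return a format name for a file path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return FORMAT_BY_SUFFIX[suffix]
    except KeyError:
        # Fail early with a readable message instead of a bare KeyError.
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Lower-casing the suffix means OLD.CSV and old.csv are treated the same way.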

Detect Changes

  • Schema diff
  • Row count diff
  • Added rows
  • Removed rows
  • Changed rows (with key column)
  • Column type changes
  • Null spikes
  • Duplicate increases
  • Numeric stats diff
  • Categorical value changes
  • Risk scoring (low, medium, high)
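To make a few of the checks above concrete, the following sketch shows one way schema diffs, keyed row diffs, and null rates can be computed with plain Python. It assumes rows are held as a list of dicts; the function names are made up for this example and do not describe Dift's internals.

```python
def schema_diff(old_cols, new_cols):
    """List columns that appear only in the new or only in the old dataset."""
    added = [c for c in new_cols if c not in old_cols]
    removed = [c for c in old_cols if c not in new_cols]
    return {"added": added, "removed": removed}


def row_diff(old_rows, new_rows, key):
    """Classify rows as added, removed, or changed by matching on a key column."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    added = [k for k in new_by_key if k not in old_by_key]
    removed = [k for k in old_by_key if k not in new_by_key]
    # A row counts as changed when the key exists on both sides but values differ.
    changed = [
        k for k in old_by_key
        if k in new_by_key and old_by_key[k] != new_by_key[k]
    ]
    return {"added": added, "removed": removed, "changed": changed}


def null_rate(rows, column):
    """Fraction of rows where the column is missing, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)
```

Without a key column there is no reliable way to pair an old row with its new version, which is why changed-row detection depends on the key.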

Output

  • Rich CLI report
  • JSON report export

Requirements

  • Python 3.10+

Quick Install

pip install dift-cli

Then run:

dift --help

Cross Platform Setup

Windows (Git Bash)

python -m venv .venv
source .venv/Scripts/activate
pip install dift-cli

Windows (PowerShell)

python -m venv .venv
.venv\Scripts\Activate.ps1
pip install dift-cli

Mac / Linux

python3 -m venv .venv
source .venv/bin/activate
pip install dift-cli

pipx (Recommended for CLI Tools)

pipx install dift-cli

If pipx is not installed:

python -m pip install pipx
python -m pipx ensurepath

Verify Install

dift --help

or

python -m dift.cli --help

Upgrade Later

pip install --upgrade dift-cli

If Command Not Found

Run the module directly:

python -m dift.cli --help

Or restart your terminal so the updated PATH takes effect.
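If you call Dift from a Python script, the same fallback can be automated. This is a small sketch: resolve_dift_command is a hypothetical helper, and it relies only on the two invocation forms shown above.

```python
import shutil
import sys


def resolve_dift_command(which=shutil.which):
    """Prefer the dift executable on PATH; otherwise use the module form."""
    if which("dift"):
        return ["dift"]
    # Falls back to the current interpreter, so dift-cli must be
    # installed in the same environment for this to work.
    return [sys.executable, "-m", "dift.cli"]
```

The returned list can be extended with file paths and options, then passed to subprocess.run.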


Quick Start

Compare CSV Files

dift examples/old.csv examples/new.csv --key customer_id

Compare Parquet Files

dift examples/old.parquet examples/new.parquet --key customer_id

Compare Excel Files

dift examples/old.xlsx examples/new.xlsx --key customer_id

Compare JSON Files

dift examples/old.json examples/new.json --key customer_id

Generate JSON Report

dift examples/old.csv examples/new.csv --key customer_id --report json --output report.json
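For scripted runs, the commands above can be assembled in Python. The sketch below uses only the options shown in this section (--key, --report json, --output). The helper names are hypothetical, and Dift's exit codes are not documented here, so the example does not interpret them.

```python
import subprocess


def build_dift_command(old_path, new_path, key=None, json_report=None):
    """Build the argument list for a Dift comparison."""
    cmd = ["dift", old_path, new_path]
    if key is not None:
        cmd += ["--key", key]
    if json_report is not None:
        cmd += ["--report", "json", "--output", json_report]
    return cmd


def run_dift(old_path, new_path, key=None, json_report=None):
    """Run Dift and return the CompletedProcess without raising on failure."""
    cmd = build_dift_command(old_path, new_path, key=key, json_report=json_report)
    # check=False leaves the decision about non-zero exit codes to the caller.
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.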

Example Output

╭─────────────────────────╮
│ Dift Dataset Comparison │
│ Risk Level: HIGH        │
╰─────────────────────────╯

Summary
Rows old: 10
Rows new: 11
Row delta: +1
Row change %: +10.00%

Warnings:
Nulls increased in revenue by 9.09%
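The figures in this report can be reproduced by hand. The sketch below shows the arithmetic under two assumptions that the output itself does not confirm: that the null warning is a difference in null rates expressed in percentage points, and that revenue had 0 nulls in the old file and 1 null in the new file. Function names are illustrative only.

```python
def row_change_pct(rows_old, rows_new):
    """Percentage change in row count relative to the old dataset."""
    if rows_old == 0:
        raise ValueError("rows_old must be greater than zero")
    return (rows_new - rows_old) * 100 / rows_old


def null_rate_increase(nulls_old, rows_old, nulls_new, rows_new):
    """Change in null rate between datasets, in percentage points."""
    return nulls_new * 100 / rows_new - nulls_old * 100 / rows_old


# Row change: (11 - 10) / 10 = +10.00%
print(f"{row_change_pct(10, 11):+.2f}%")
# Assumed null counts: 0 of 10 rows before, 1 of 11 rows after.
print(f"{null_rate_increase(0, 10, 1, 11):.2f}%")
```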

Example Files Included

examples/
├── old.csv
├── new.csv
├── old.parquet
├── new.parquet
├── old.xlsx
├── new.xlsx
├── old.json
└── new.json

Use them to try the Quick Start commands above.


Example Use Cases

ETL Validation

dift before.csv after.csv

Daily Snapshot Checks

dift yesterday.parquet today.parquet

Excel File Audits

dift old.xlsx new.xlsx --key id

JSON API Export Checks

dift old.json new.json --key id

Production vs Staging

dift prod.csv staging.csv --key id

ML Dataset Drift Checks

dift train_v1.csv train_v2.csv

Project Structure

dift/
├── cli.py
├── core/
│   ├── comparator.py
│   ├── schema_diff.py
│   ├── row_diff.py
│   ├── quality_diff.py
│   ├── risk.py
│   └── stats_diff.py
├── io/
│   └── readers.py
├── reports/
│   ├── console_report.py
│   ├── json_report.py
│   └── models.py
└── utils/

tests/
examples/

Run Tests

pytest

Lint Code

ruff check .

Roadmap

v0.3.0 — Report Exports

  • HTML report export
  • CSV summary export
  • Excel report export
  • Better JSON report structure
  • Report templates
  • --output-dir

v0.4.0

  • Improve null spike detection
  • Improve duplicate detection

v0.5.0

  • Outlier detection
  • Numeric drift thresholds
  • Categorical shift warnings
  • Better risk scoring

v0.6.0

  • SQL database support
  • Postgres connector

v0.7.0

  • Snowflake connector
  • BigQuery connector

v0.8.0

  • CI/CD fail checks
  • dbt integration

v0.9.0

  • Drift alerts
  • Python API
  • Plugin system

v1.0.0

  • Stable CLI
  • Stable Python API
  • Full test coverage
  • Full docs site
  • Benchmarks
  • Security review
  • Production-ready install

Contributing

Contributions are welcome.

Please read:

CONTRIBUTING.md

Ways to help:

  • Fix bugs
  • Improve docs
  • Add tests
  • Improve performance
  • Add connectors
  • Improve CLI UX

License

MIT License


Vision

Dift aims to become the standard open-source tool for dataset comparison and trust checks.

If Git has git diff, data teams should have dift.
