Git diff for datasets: compare datasets and understand what changed.

Dift

Dift is an open-source CLI tool that helps data professionals compare two datasets and instantly understand:

  • what changed
  • why it matters
  • whether the new data is safe to trust

What's New in v0.3.0

Dift v0.3.0 adds report export in HTML, CSV, and Excel formats, making it easier to analyze and share dataset changes.

New Features

  • HTML report export
  • CSV summary export
  • Excel report export
  • Improved JSON report structure
  • Report templates (HTML)
  • --output-dir support for directory-based exports

Why Dift?

Bad data breaks:

  • dashboards
  • reports
  • ETL pipelines
  • analytics workflows
  • ML models
  • business decisions

Dift helps teams catch risky data changes before they cause damage.


Features (v0.3.0)

Compare two datasets in seconds.

Supported Formats

  • CSV
  • Parquet
  • Excel (.xlsx, .xls)
  • JSON

Detect Changes

  • Schema diff
  • Row count diff
  • Added rows
  • Removed rows
  • Changed rows (requires a key column, set with --key)
  • Column type changes
  • Null spikes
  • Duplicate increases
  • Numeric stats diff
  • Categorical value changes
  • Risk scoring (low, medium, high)
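
To make the added, removed, and changed row categories concrete, here is a minimal sketch of a key-based row diff using only the Python standard library. It is not Dift's implementation: the helper names load_rows and keyed_diff and the sample data are hypothetical, chosen only to show why a key column is needed to tell a changed row apart from a removed row plus an added row.

```python
import csv
import io


def load_rows(csv_text, key):
    """Index CSV rows by the value of the key column (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def keyed_diff(old_rows, new_rows):
    """Classify keys as added, removed, or changed between two row indexes."""
    old_keys, new_keys = set(old_rows), set(new_rows)
    added = sorted(new_keys - old_keys)
    removed = sorted(old_keys - new_keys)
    # A row counts as changed when its key exists on both sides
    # but at least one other column differs.
    changed = sorted(k for k in old_keys & new_keys if old_rows[k] != new_rows[k])
    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data, for illustration only.
OLD_CSV = "customer_id,revenue\n1,100\n2,200\n3,300\n"
NEW_CSV = "customer_id,revenue\n1,100\n2,250\n4,400\n"

result = keyed_diff(
    load_rows(OLD_CSV, "customer_id"),
    load_rows(NEW_CSV, "customer_id"),
)
print(result)
# {'added': ['4'], 'removed': ['3'], 'changed': ['2']}
```

Without a key, customer 2 could only be reported as one removed row and one added row, because nothing links the old and new versions of the record.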

Output Options

  • Rich CLI report
  • JSON report
  • CSV summary
  • Excel report
  • HTML report

HTML Templates

Customize your HTML reports:

dift old.csv new.csv --report html --template clean

Available templates:

  • default
  • clean
  • compact
  • enterprise
  • dark

Output Directory Support

Save reports to a directory without specifying filenames:

dift old.csv new.csv --report json --output-dir reports/

Auto-generated filenames:

  • dift_report.json
  • dift_report.csv
  • dift_report.xlsx
  • dift_report.html
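
The naming rule above can be sketched as a small lookup from report format to file path. This is an illustration only: the function default_report_path and the REPORT_EXTENSIONS table are hypothetical names, and the mapping is inferred from the filenames listed above rather than taken from Dift's source.

```python
from pathlib import Path

# Extensions inferred from the auto-generated filenames listed above.
REPORT_EXTENSIONS = {
    "json": ".json",
    "csv": ".csv",
    "excel": ".xlsx",
    "html": ".html",
}


def default_report_path(output_dir, report_format):
    """Return the path a report would get inside output_dir (illustrative only)."""
    try:
        extension = REPORT_EXTENSIONS[report_format]
    except KeyError:
        raise ValueError(f"unsupported report format: {report_format!r}") from None
    return Path(output_dir) / f"dift_report{extension}"


print(default_report_path("reports/", "json").as_posix())
# reports/dift_report.json
```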

Requirements

  • Python 3.10+

Quick Install

pip install dift-cli

Then run:

dift --help

Quick Update (Latest version: 0.3.0)

pip install --upgrade dift-cli

Cross Platform Setup

Windows (Git Bash)

python -m venv .venv
source .venv/Scripts/activate
pip install dift-cli

Windows (PowerShell)

python -m venv .venv
.venv\Scripts\Activate.ps1
pip install dift-cli

Mac / Linux

python3 -m venv .venv
source .venv/bin/activate
pip install dift-cli

pipx (Recommended)

pipx install dift-cli

Verify Install

dift --help

or

python -m dift.cli --help
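
For scripts that need to locate Dift before calling it, the two checks above can be combined. The sketch below is an assumption-laden helper, not part of Dift: find_dift_command is a hypothetical name, and it only tests whether a dift executable is on PATH or a top-level dift package is importable.

```python
import importlib.util
import shutil
import sys


def find_dift_command():
    """Return an argument prefix that should launch Dift, or None if not found."""
    # Prefer the console script if it is on PATH.
    if shutil.which("dift"):
        return ["dift"]
    # Otherwise fall back to the module form shown above. find_spec only
    # checks that a top-level package named "dift" can be imported.
    if importlib.util.find_spec("dift") is not None:
        return [sys.executable, "-m", "dift.cli"]
    return None


command = find_dift_command()
if command is None:
    print("Dift not found; try: pip install dift-cli")
else:
    print("Dift can be launched with:", " ".join(command))
```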

Quick Start

Compare CSV Files

dift examples/old.csv examples/new.csv --key customer_id

Generate Reports

JSON

dift examples/old.csv examples/new.csv --key customer_id --report json --output report.json

CSV

dift examples/old.csv examples/new.csv --key customer_id --report csv --output report.csv

Excel

dift examples/old.csv examples/new.csv --key customer_id --report excel --output report.xlsx

HTML

dift examples/old.csv examples/new.csv --key customer_id --report html --output report.html

HTML with Template

dift examples/old.csv examples/new.csv --key customer_id --report html --template dark --output report.html
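
When every report format is needed, the commands above can be generated rather than typed by hand. The following sketch only builds the argument lists and prints them; it uses the flags shown in this section (--key, --report, --output) and nothing else. The function name build_commands and the FORMATS table are hypothetical.

```python
import shlex

# Report formats and file extensions as used in the commands above.
FORMATS = {"json": "json", "csv": "csv", "excel": "xlsx", "html": "html"}


def build_commands(old, new, key, stem="report"):
    """Build one dift argument list per report format, without running anything."""
    commands = []
    for report, extension in FORMATS.items():
        commands.append(
            [
                "dift", old, new,
                "--key", key,
                "--report", report,
                "--output", f"{stem}.{extension}",
            ]
        )
    return commands


for command in build_commands("examples/old.csv", "examples/new.csv", "customer_id"):
    print(shlex.join(command))
```

Each list can be passed to subprocess.run(command, check=True) once Dift is installed. How Dift sets its exit code, for example on a high risk level, is not documented here, so a wrapper should not rely on it without checking.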

Example Output

╭─────────────────────────╮
│ Dift Dataset Comparison │
│ Risk Level: HIGH        │
╰─────────────────────────╯

Summary
Rows old: 10
Rows new: 11
Row delta: +1
Row change %: +10.00%

Warnings:
Nulls increased in revenue by 9.09%
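
The summary figures above can be reproduced with simple arithmetic. The sketch below is one plausible reading, not Dift's confirmed formula: it assumes the null warning reports the change in null rate in percentage points, and that revenue had 0 nulls in the 10 old rows and 1 null in the 11 new rows. Those null counts are assumptions; the output above does not state them.

```python
def percent_change(old_count, new_count):
    """Relative change from old_count to new_count, in percent."""
    return (new_count - old_count) / old_count * 100


def null_rate(null_count, row_count):
    """Share of rows with a null value, in percent."""
    return null_count / row_count * 100


rows_old, rows_new = 10, 11
print(f"Row delta: {rows_new - rows_old:+d}")
print(f"Row change %: {percent_change(rows_old, rows_new):+.2f}%")

# Assumed null counts: 0 of 10 old rows, 1 of 11 new rows.
increase = null_rate(1, rows_new) - null_rate(0, rows_old)
print(f"Nulls increased in revenue by {increase:.2f}%")
```

Under these assumptions the script prints +1, +10.00%, and 9.09%, matching the example report.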

Example Files

examples/
├── old.csv
├── new.csv
├── old.parquet
├── new.parquet
├── old.xlsx
├── new.xlsx
├── old.json
└── new.json

Use Cases

ETL Validation

dift before.csv after.csv

ML Dataset Drift

dift train_v1.csv train_v2.csv

Production vs Staging

dift prod.csv staging.csv --key id

Project Structure

dift/
├── cli.py
├── core/
├── io/
├── reports/
│   ├── console_report.py
│   ├── json_report.py
│   ├── csv_report.py
│   ├── excel_report.py
│   ├── html_report.py
│   └── models.py
└── utils/

tests/
examples/

Run Tests

pytest

Lint:

ruff check .

Roadmap

v0.4.0

  • Improve null detection
  • Improve duplicate detection

v0.5.0

  • Drift thresholds
  • Outlier detection
  • Numeric drift thresholds
  • Categorical shift warnings
  • Better risk scoring

v0.6.0

  • SQL database support
  • Postgres connector

Contributing

Contributions are welcome.

See CONTRIBUTING.md.

Ways to help:

  • Fix bugs
  • Improve docs
  • Add tests
  • Improve performance
  • Add connectors
  • Improve CLI UX

License

MIT License


Vision

Dift aims to become the standard open-source tool for dataset comparison and trust checks.

If Git has git diff, data teams should have dift.

