Git diff for datasets: compare datasets and understand what changed.
Dift
Dift is an open-source CLI tool that helps data professionals compare two datasets and instantly understand:
- what changed
- why it matters
- whether the new data is safe to trust
What's New in v0.3.0
Dift v0.3.0 adds reporting and export options that make it easier to analyze and share dataset changes.
New Features
- HTML report export
- CSV summary export
- Excel report export
- Improved JSON report structure
- Report templates (HTML)
- --output-dir support for directory-based exports
Why Dift?
Bad data breaks:
- dashboards
- reports
- ETL pipelines
- analytics workflows
- ML models
- business decisions
Dift helps teams catch risky data changes before they cause damage.
Features (v0.3.0)
Compare two datasets in seconds.
Supported Formats
- CSV
- Parquet
- Excel (.xlsx, .xls)
- JSON
Detect Changes
- Schema diff
- Row count diff
- Added rows
- Removed rows
- Changed rows (with key column)
- Column type changes
- Null spikes
- Duplicate increases
- Numeric stats diff
- Categorical value changes
- Risk scoring (low, medium, high)
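To make the added, removed, and changed row categories concrete, here is a minimal sketch of a key-based row comparison using only the Python standard library. It is not Dift's implementation: the helper names `load_rows` and `diff_rows` and the inline sample data are hypothetical, and the sketch assumes the key column is unique in both files.

```python
import csv
import io


def load_rows(csv_text, key):
    """Index CSV rows by the value of the key column (assumed unique)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_rows(old, new):
    """Classify keys as added, removed, or changed between two keyed row maps."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # A row counts as changed when the key exists on both sides but any value differs.
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical sample data, keyed on customer_id as in the Quick Start.
OLD_CSV = "customer_id,revenue\n1,100\n2,200\n3,300\n"
NEW_CSV = "customer_id,revenue\n1,100\n2,250\n4,400\n"

result = diff_rows(
    load_rows(OLD_CSV, "customer_id"),
    load_rows(NEW_CSV, "customer_id"),
)
print(result)
```

Without a key column there is no stable way to pair an old row with its new version, which is why changed-row detection depends on one.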
Output Options
- Rich CLI report
- JSON report
- CSV summary
- Excel report
- HTML report
HTML Templates
Customize your HTML reports:
dift old.csv new.csv --report html --template clean
Available templates:
- default
- clean
- compact
- enterprise
- dark
Output Directory Support
Save reports to a directory without specifying filenames:
dift old.csv new.csv --report json --output-dir reports/
Auto-generated filenames:
- dift_report.json
- dift_report.csv
- dift_report.xlsx
- dift_report.html
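If a script needs to locate the generated file afterwards, the naming scheme above can be mirrored in a few lines. This is a sketch, not Dift's code: `resolve_report_path` is a hypothetical helper, and the pairing of each --report value with a filename is inferred from the examples in this README.

```python
from pathlib import Path

# Filenames as listed above; which --report value maps to which file is an assumption.
DEFAULT_FILENAMES = {
    "json": "dift_report.json",
    "csv": "dift_report.csv",
    "excel": "dift_report.xlsx",
    "html": "dift_report.html",
}


def resolve_report_path(report_format, output_dir):
    """Return the path where a report of the given format is expected to be written."""
    if report_format not in DEFAULT_FILENAMES:
        raise ValueError(f"unknown report format: {report_format!r}")
    return Path(output_dir) / DEFAULT_FILENAMES[report_format]


print(resolve_report_path("json", "reports/"))
```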
Requirements
- Python 3.10+
Quick Install
pip install dift-cli
Then run:
dift --help
Quick Update (Latest version: 0.3.0)
pip install --upgrade dift-cli
Cross Platform Setup
Windows (Git Bash)
python -m venv .venv
source .venv/Scripts/activate
pip install dift-cli
Windows (PowerShell)
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install dift-cli
Mac / Linux
python3 -m venv .venv
source .venv/bin/activate
pip install dift-cli
pipx (Recommended)
pipx install dift-cli
Verify Install
dift --help
or
python -m dift.cli --help
Quick Start
Compare CSV Files
dift examples/old.csv examples/new.csv --key customer_id
Generate Reports
JSON
dift examples/old.csv examples/new.csv --key customer_id --report json --output report.json
CSV
dift examples/old.csv examples/new.csv --key customer_id --report csv --output report.csv
Excel
dift examples/old.csv examples/new.csv --key customer_id --report excel --output report.xlsx
HTML
dift examples/old.csv examples/new.csv --key customer_id --report html --output report.html
HTML with Template
dift examples/old.csv examples/new.csv --key customer_id --report html --template dark --output report.html
Example Output
╭─────────────────────────╮
│ Dift Dataset Comparison │
│ Risk Level: HIGH │
╰─────────────────────────╯
Summary
Rows old: 10
Rows new: 11
Row delta: +1
Row change %: +10.00%
Warnings:
Nulls increased in revenue by 9.09%
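The figures in this summary can be reproduced by hand. The sketch below assumes the old revenue column had no nulls and the new one has a single null among its 11 rows, and that the warning reports the percentage-point increase in null rate. Both are assumptions that happen to match the numbers shown, not a statement of how Dift computes them. The function names are hypothetical.

```python
def row_change(rows_old, rows_new):
    """Return (delta, percent change) in row count; percent is None if the old count is 0."""
    delta = rows_new - rows_old
    pct = delta / rows_old * 100 if rows_old else None
    return delta, pct


def null_rate(values):
    """Percentage of values that are missing (None or empty string)."""
    if not values:
        return 0.0
    nulls = sum(1 for v in values if v is None or v == "")
    return nulls / len(values) * 100


# Hypothetical column contents consistent with the example output.
old_revenue = ["100"] * 10
new_revenue = ["100"] * 10 + [""]

delta, pct = row_change(len(old_revenue), len(new_revenue))
null_increase = null_rate(new_revenue) - null_rate(old_revenue)

print(f"Row delta: {delta:+d}")
print(f"Row change %: {pct:+.2f}%")
print(f"Nulls increased in revenue by {null_increase:.2f}%")
```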
Example Files
examples/
├── old.csv
├── new.csv
├── old.parquet
├── new.parquet
├── old.xlsx
├── new.xlsx
├── old.json
└── new.json
Use Cases
ETL Validation
dift before.csv after.csv
ML Dataset Drift
dift train_v1.csv train_v2.csv
Production vs Staging
dift prod.csv staging.csv --key id
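For pipelines that call Dift from Python, the command line can be assembled programmatically. The sketch below uses only flags shown in this README (--key, --report, --output); `build_dift_command` is a hypothetical helper, not part of Dift.

```python
import shlex


def build_dift_command(old, new, key=None, report=None, output=None):
    """Build the argument list for a dift invocation from optional settings."""
    cmd = ["dift", old, new]
    if key:
        cmd += ["--key", key]
    if report:
        cmd += ["--report", report]
    if output:
        cmd += ["--output", output]
    return cmd


cmd = build_dift_command(
    "prod.csv", "staging.csv", key="id", report="json", output="report.json"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. This README does not document Dift's exit codes, so a script should not assume a non-zero exit on a high risk level.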
Project Structure
dift/
├── cli.py
├── core/
├── io/
├── reports/
│ ├── console_report.py
│ ├── json_report.py
│ ├── csv_report.py
│ ├── excel_report.py
│ ├── html_report.py
│ └── models.py
└── utils/
tests/
examples/
Run Tests
pytest
Lint:
ruff check .
Roadmap
v0.4.0
- Improve null detection
- Improve duplicate detection
v0.5.0
- Drift thresholds
- Outlier detection
- Numeric drift thresholds
- Categorical shift warnings
- Better risk scoring
v0.6.0
- SQL database support
- Postgres connector
Contributing
Contributions are welcome.
See:
CONTRIBUTING.md
Ways to help:
- Fix bugs
- Improve docs
- Add tests
- Improve performance
- Add connectors
- Improve CLI UX
License
MIT License
Vision
Dift aims to become the standard open-source tool for dataset comparison and trust checks.
If Git has git diff, data teams should have dift.