Git diff for datasets: compare datasets and understand what changed.
Dift
Dift is an open-source CLI tool that helps data professionals compare two datasets and instantly understand:
- what changed
- why it matters
- whether the new data is safe to trust
What's New in v0.2.1
Dift v0.2.1 introduces a more polished CLI experience and broader file support.
Improvements
- Better console formatting
- Rich terminal colors
- Cleaner summary tables
- Risk level highlighting
- Percentage row change display
- Better missing file error messages
- JSON dataset support
- JSON example datasets
- Excel example datasets
- Parquet example datasets
- Improved installation instructions
Why Dift?
Bad data breaks:
- dashboards
- reports
- ETL pipelines
- analytics workflows
- ML models
- business decisions
Dift helps teams catch risky data changes before they cause damage.
Features (v0.2.1)
Compare two datasets in seconds.
Supported Formats
- CSV
- Parquet
- Excel (.xlsx, .xls)
- JSON
Detect Changes
- Schema diff
- Row count diff
- Added rows
- Removed rows
- Changed rows (with key column)
- Column type changes
- Null spikes
- Duplicate increases
- Numeric stats diff
- Categorical value changes
- Risk scoring (low, medium, high)
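To make the added, removed, and changed row checks concrete, here is a minimal sketch of a key-based row diff in plain Python. It is not Dift's implementation: the function name diff_rows and the sample rows are hypothetical, and only the customer_id key column is borrowed from the examples in this README.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def diff_rows(old: List[Row], new: List[Row], key: str) -> Tuple[list, list, list]:
    """Return (added_keys, removed_keys, changed_keys) for two lists of rows.

    Hypothetical helper for illustration only; Dift's internals may differ.
    """
    # Index both datasets by the key column so rows can be matched directly.
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    # Keys present only in the new dataset are added rows.
    added = sorted(k for k in new_by_key if k not in old_by_key)
    # Keys present only in the old dataset are removed rows.
    removed = sorted(k for k in old_by_key if k not in new_by_key)
    # Keys present in both, with at least one differing value, are changed rows.
    changed = sorted(
        k for k in old_by_key
        if k in new_by_key and old_by_key[k] != new_by_key[k]
    )
    return added, removed, changed


# Made-up sample data, not taken from the bundled example files.
old_rows = [
    {"customer_id": 1, "revenue": 100},
    {"customer_id": 2, "revenue": 250},
    {"customer_id": 3, "revenue": 80},
]
new_rows = [
    {"customer_id": 1, "revenue": 100},
    {"customer_id": 2, "revenue": 300},
    {"customer_id": 4, "revenue": 40},
]

added, removed, changed = diff_rows(old_rows, new_rows, key="customer_id")
print("added:", added, "removed:", removed, "changed:", changed)
```

Without a key column there is no reliable way to pair an old row with its new version, which is presumably why changed-row detection requires one.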
Output
- Rich CLI report
- JSON report export
Requirements
- Python 3.10+
Quick Install
pip install dift-cli
Then run:
dift --help
Cross Platform Setup
Windows (Git Bash)
python -m venv .venv
source .venv/Scripts/activate
pip install dift-cli
Windows (PowerShell)
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install dift-cli
Mac / Linux
python3 -m venv .venv
source .venv/bin/activate
pip install dift-cli
pipx (Recommended for CLI Tools)
pipx install dift-cli
If pipx is not installed:
python -m pip install pipx
python -m pipx ensurepath
Verify Install
dift --help
or
python -m dift.cli --help
Upgrade Later
pip install --upgrade dift-cli
If Command Not Found
If the dift command is not found, run the module directly:
python -m dift.cli --help
Or restart your terminal so the updated PATH takes effect.
Quick Start
Compare CSV Files
dift examples/old.csv examples/new.csv --key customer_id
Compare Parquet Files
dift examples/old.parquet examples/new.parquet --key customer_id
Compare Excel Files
dift examples/old.xlsx examples/new.xlsx --key customer_id
Compare JSON Files
dift examples/old.json examples/new.json --key customer_id
Generate JSON Report
dift examples/old.csv examples/new.csv --key customer_id --report json --output report.json
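If you want to run the commands above from a Python script (for example in a scheduled job), one option is to assemble the argument list programmatically. The sketch below uses only the flags shown in this README (--key, --report json, --output). The helper name build_dift_command is hypothetical and is not part of Dift.

```python
import shlex
from typing import List, Optional


def build_dift_command(
    old_path: str,
    new_path: str,
    key: Optional[str] = None,
    report_path: Optional[str] = None,
) -> List[str]:
    """Assemble a dift invocation as an argument list.

    Hypothetical helper; it only uses flags documented in this README.
    """
    cmd = ["dift", old_path, new_path]
    if key is not None:
        # Key column used to match rows between the two datasets.
        cmd += ["--key", key]
    if report_path is not None:
        # Request the JSON report and write it to the given path.
        cmd += ["--report", "json", "--output", report_path]
    return cmd


cmd = build_dift_command(
    "examples/old.csv",
    "examples/new.csv",
    key="customer_id",
    report_path="report.json",
)
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd). This README does not document Dift's exit codes, so check the result by reading the generated report rather than relying on the return code.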
Example Output
╭─────────────────────────╮
│ Dift Dataset Comparison │
│ Risk Level: HIGH │
╰─────────────────────────╯
Summary
Rows old: 10
Rows new: 11
Row delta: +1
Row change %: +10.00%
Warnings:
Nulls increased in revenue by 9.09%
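The figures in the summary above can be reproduced with simple arithmetic. The sketch below assumes the row change percentage is measured relative to the old row count, and that the null warning reports the increase in null rate in percentage points, with 0 nulls among the 10 old revenue values and 1 null among the 11 new ones. Those null counts are an assumption chosen to match the output; the README does not state them, and the function names are hypothetical.

```python
def row_change_percent(rows_old: int, rows_new: int) -> float:
    """Percentage change in row count, relative to the old dataset."""
    if rows_old == 0:
        raise ValueError("rows_old must be greater than zero")
    return (rows_new - rows_old) / rows_old * 100


def null_rate_percent(null_count: int, row_count: int) -> float:
    """Share of null values in a column, as a percentage."""
    return null_count / row_count * 100 if row_count else 0.0


rows_old, rows_new = 10, 11
delta = rows_new - rows_old
change = row_change_percent(rows_old, rows_new)

# Assumed null counts (not stated in the README): 0 of 10 old, 1 of 11 new.
null_increase = null_rate_percent(1, rows_new) - null_rate_percent(0, rows_old)

print(f"Row delta: {delta:+d}")
print(f"Row change %: {change:+.2f}%")
print(f"Nulls increased in revenue by {null_increase:.2f}%")
```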
Example Files Included
examples/
├── old.csv
├── new.csv
├── old.parquet
├── new.parquet
├── old.xlsx
├── new.xlsx
├── old.json
└── new.json
Use them to test instantly.
Example Use Cases
ETL Validation
dift before.csv after.csv
Daily Snapshot Checks
dift yesterday.parquet today.parquet
Excel File Audits
dift old.xlsx new.xlsx --key id
JSON API Export Checks
dift old.json new.json --key id
Production vs Staging
dift prod.csv staging.csv --key id
ML Dataset Drift Checks
dift train_v1.csv train_v2.csv
Project Structure
dift/
├── cli.py
├── core/
│ ├── comparator.py
│ ├── schema_diff.py
│ ├── row_diff.py
│ ├── quality_diff.py
│ ├── risk.py
│ └── stats_diff.py
├── io/
│ └── readers.py
├── reports/
│ ├── console_report.py
│ ├── json_report.py
│ └── models.py
└── utils/
tests/
examples/
Run Tests
pytest
Lint Code
ruff check .
Roadmap
v0.3.0 — Report Exports
- HTML report export
- CSV summary export
- Excel report export
- Better JSON report structure
- Report templates
- --output-dir option
v0.4.0
- Improve null spike detection
- Improve duplicate detection
v0.5.0
- Outlier detection
- Numeric drift thresholds
- Categorical shift warnings
- Better risk scoring
v0.6.0
- SQL database support
- Postgres connector
v0.7.0
- Snowflake connector
- BigQuery connector
v0.8.0
- CI/CD fail checks
- dbt integration
v0.9.0
- Drift alerts
- Python API
- Plugin system
v1.0.0
- Stable CLI
- Stable Python API
- Full test coverage
- Full docs site
- Benchmarks
- Security review
- Production-ready install
Contributing
Contributions are welcome.
Please read:
CONTRIBUTING.md
Ways to help:
- Fix bugs
- Improve docs
- Add tests
- Improve performance
- Add connectors
- Improve CLI UX
License
MIT License
Vision
Dift aims to become the standard open-source tool for dataset comparison and trust checks.
If Git has git diff, data teams should have dift.