Git diff for datasets: compare datasets and understand what changed.
Dift
Git diff for datasets.
Dift is an open-source CLI tool that helps data professionals compare two datasets and instantly understand:
- what changed
- why it matters
- whether the new data is safe to trust
Why Dift?
Bad data breaks:
- dashboards
- reports
- ETL pipelines
- analytics workflows
- ML models
- business decisions
Dift helps teams catch risky data changes before they cause damage.
Features (v0.1 MVP)
Compare two datasets in seconds
Supported Formats
- CSV
- Parquet
Detect Changes
- Schema diff
- Row count diff
- Added / removed rows
- Changed rows (with key column)
- Column type changes
- Null spikes
- Duplicate increases
- Numeric stats diff
- Categorical value changes
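To make the "added / removed / changed rows (with key column)" idea concrete, here is a minimal sketch using only the Python standard library. It is not Dift's implementation; the function names `load_rows` and `diff_rows` and the sample data are hypothetical, chosen purely for illustration.

```python
import csv
import io


def load_rows(csv_text, key):
    """Parse CSV text into a dict mapping each key value to its row.

    If a key value appears more than once, the last row wins.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_rows(old_rows, new_rows):
    """Classify key values as added, removed, or changed between two snapshots."""
    old_keys, new_keys = set(old_rows), set(new_rows)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
        # A shared key counts as changed when any column value differs.
        "changed": sorted(
            k for k in old_keys & new_keys if old_rows[k] != new_rows[k]
        ),
    }


OLD = """customer_id,name,plan
1,Ada,free
2,Grace,pro
3,Linus,free
"""

NEW = """customer_id,name,plan
1,Ada,pro
3,Linus,free
4,Margaret,pro
"""

result = diff_rows(load_rows(OLD, "customer_id"), load_rows(NEW, "customer_id"))
print(result)
# {'added': ['4'], 'removed': ['2'], 'changed': ['1']}
```

Without a key column there is no stable way to pair a row in the old file with its counterpart in the new one, which is presumably why changed-row detection is listed as requiring a key.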
Output
- Rich CLI report
- JSON report export
Installation
Clone Repository
git clone https://github.com/ReginaldErzoah/Dift.git
cd Dift
Create Virtual Environment
python -m venv .venv
source .venv/bin/activate        # Linux / macOS
source .venv/Scripts/activate    # Windows (Git Bash)
Install Dependencies
pip install -r requirements.txt
Install CLI Locally
pip install -e .
Quick Start
Run a comparison:
dift examples/old.csv examples/new.csv --key customer_id
Or:
python -m dift.cli examples/old.csv examples/new.csv --key customer_id
Generate JSON report:
dift examples/old.csv examples/new.csv --key customer_id --report json --output report.json
Example Output
Dift Comparison Report
Rows old: 10
Rows new: 11
Added rows: 2
Removed rows: 1
Changed rows: 6
Schema changes: 1
Null spikes: 1
Risk Level: MEDIUM
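The report above counts "null spikes" without defining them. One plausible reading is a column whose share of empty values rises sharply between the two files. The sketch below illustrates that reading only; the 10-percentage-point threshold and the function names are assumptions, not Dift's actual rule.

```python
import csv
import io


def null_rates(csv_text):
    """Return the fraction of empty cells for each column in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if not rows:
        return {}
    return {
        col: sum(1 for r in rows if not (r.get(col) or "").strip()) / len(rows)
        for col in reader.fieldnames
    }


def null_spikes(old_text, new_text, threshold=0.10):
    """Report shared columns whose null rate rose by more than `threshold`.

    The threshold is a hypothetical default chosen for this example.
    """
    old, new = null_rates(old_text), null_rates(new_text)
    return {
        col: (old[col], new[col])
        for col in old.keys() & new.keys()
        if new[col] - old[col] > threshold
    }


OLD = """id,email
1,a@example.com
2,b@example.com
3,c@example.com
4,d@example.com
"""

NEW = """id,email
1,a@example.com
2,
3,
4,d@example.com
"""

spikes = null_spikes(OLD, NEW)
print(spikes)
# {'email': (0.0, 0.5)}
```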
Example Use Cases
ETL Validation
dift before.csv after.csv
Daily Snapshot Checks
dift yesterday.parquet today.parquet
Production vs Staging
dift prod.csv staging.csv --key id
ML Dataset Drift Checks
dift train_v1.csv train_v2.csv
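For scheduled checks, the same commands can be assembled from a Python script. The sketch below only builds the argument list, using the flags shown in Quick Start (`--key`, `--report json`, `--output`); the helper name `build_dift_command` is hypothetical. It makes no assumption about Dift's exit codes, since CI/CD fail checks are listed on the roadmap rather than in v0.1.

```python
import shlex


def build_dift_command(old_path, new_path, key=None, json_output=None):
    """Build the argument list for a dift comparison."""
    cmd = ["dift", str(old_path), str(new_path)]
    if key is not None:
        cmd += ["--key", key]
    if json_output is not None:
        cmd += ["--report", "json", "--output", str(json_output)]
    return cmd


cmd = build_dift_command(
    "prod.csv", "staging.csv", key="id", json_output="report.json"
)
print(shlex.join(cmd))
# dift prod.csv staging.csv --key id --report json --output report.json
```

To run the comparison, pass the list to `subprocess.run(cmd)` in an environment where the `dift` CLI is installed.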
Project Structure
dift/
├── cli.py
├── core/
│   ├── comparator.py
│   ├── schema_diff.py
│   ├── row_diff.py
│   └── stats_diff.py
├── io/
├── reports/
└── utils/
tests/
examples/
Run Tests
pytest
Roadmap
v0.2
- HTML reports
- Better console formatting
- Performance improvements
v0.5
- SQL database support
- Postgres connector
- Snowflake connector
- BigQuery connector
v1.0
- CI/CD fail checks
- dbt integration
- Drift alerts
- Python API
- Plugin system
Contributing
Contributions are welcome.
Please read:
CONTRIBUTING.md
Ways to help:
- Fix bugs
- Improve docs
- Add tests
- Improve performance
- Add connectors
- Improve CLI UX
License
MIT License
Vision
Dift aims to become the standard open-source tool for dataset comparison and trust checks.
If Git has git diff, data teams should have dift.