A git-like state management system for data pipelines
🕰️ Data Time Machine
Git for Your Data Pipelines
Never lose track of your data states again. Roll back, debug, and restore with confidence.
Features • Installation • Quick Start • Cloud & Remote • Dashboard • Integrations
🌟 Overview
Data Time Machine (DTM) is a state management system for data pipelines, modeled on Git's version control workflow. When a complex data transformation fails in production, DTM lets you snapshot the entire data environment and roll back to a known-good state.
Why DTM?
- 🔍 Debug Complex Failures: Capture exact data states before and after pipeline runs
- ☁️ Cloud Native: Push snapshots to S3, GCS, or Azure Blob Storage
- 🎨 Visual Insights: Explore commit history and diffs via a built-in Web Dashboard
- ⚡ Optimized Storage: Deduplication and gzip compression for handling large datasets efficiently
- 🔌 Pipeline Ready: Native integrations for Apache Airflow and Prefect
✨ Features
Core Capabilities
- 🔐 Content-Addressable Storage: Efficient deduplication and compression
- 📊 Metadata & Diffs: View unified diffs of data changes between snapshots
- ⚡ Incremental Snapshots: Automatically stores only the files that changed since the previous snapshot
- 🌐 Remote Support: Push/Pull to S3, Google Cloud Storage, and Azure Blob
- 🎨 Web Dashboard: Interactive browser-based visualization of your data history
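To make the storage capabilities above more concrete, here is a minimal sketch of how content-addressable storage with gzip compression can yield both deduplication and incremental snapshots. It is not DTM's actual implementation: the function names (`store_blob`, `load_blob`, `snapshot`), the choice of SHA-256, and the two-level directory layout are all assumptions made for illustration.

```python
import gzip
import hashlib
import tempfile
from pathlib import Path


def store_blob(objects_dir: Path, data: bytes) -> str:
    """Store data under its SHA-256 digest (hypothetical layout); return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    target = objects_dir / digest[:2] / digest[2:]
    if not target.exists():  # identical content is written only once
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(gzip.compress(data))
    return digest


def load_blob(objects_dir: Path, digest: str) -> bytes:
    """Read a stored object back and decompress it."""
    return gzip.decompress((objects_dir / digest[:2] / digest[2:]).read_bytes())


def snapshot(objects_dir: Path, data_dir: Path) -> dict[str, str]:
    """Return a manifest mapping each relative file path to its content digest."""
    manifest = {}
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            key = path.relative_to(data_dir).as_posix()
            manifest[key] = store_blob(objects_dir, path.read_bytes())
    return manifest


def count_objects(objects_dir: Path) -> int:
    """Count the objects currently held in the store."""
    return sum(1 for p in objects_dir.rglob("*") if p.is_file())


def demo() -> tuple[int, int, bool, bytes]:
    """Take two snapshots and report how many objects each one left in the store."""
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "data"
        objects_dir = Path(tmp) / "objects"
        data_dir.mkdir()
        objects_dir.mkdir()
        (data_dir / "a.csv").write_bytes(b"id,value\n1,10\n")
        (data_dir / "b.csv").write_bytes(b"id,value\n2,20\n")
        first = snapshot(objects_dir, data_dir)
        after_first = count_objects(objects_dir)
        # Change one file only; the unchanged file maps to an existing object.
        (data_dir / "b.csv").write_bytes(b"id,value\n2,99\n")
        second = snapshot(objects_dir, data_dir)
        after_second = count_objects(objects_dir)
        unchanged_reused = first["a.csv"] == second["a.csv"]
        restored = load_blob(objects_dir, second["b.csv"])
        return after_first, after_second, unchanged_reused, restored
```

Because each object is named after the hash of its content, the second snapshot in `demo()` adds only one new object for the changed file, and restoring a file is a lookup by digest followed by decompression.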
Command Set
dtm init # Initialize a new DTM repository
dtm snapshot -m "message" # Snapshot current state
dtm checkout <commit-id> # Restore to a specific snapshot
dtm diff <commit_a> <commit_b> # Compare two snapshots
dtm log # View snapshot history
dtm web # Launch Visualization Dashboard
dtm remote add origin s3://... # Add a remote storage backend
dtm push origin # Push snapshots to cloud
dtm pull origin # Pull snapshots from cloud
🚀 Installation
Prerequisites
- Python 3.10 or higher
- pip package manager
Install from PyPI
pip install data-time-machine
Install with Cloud Support
To enable S3, GCS, or Azure support, install the corresponding client libraries alongside DTM:
pip install boto3 google-cloud-storage azure-storage-blob
For the web dashboard, install fastapi and uvicorn:
pip install fastapi uvicorn
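If you are unsure which optional dependencies are present in an environment, a small check like the following can help. It is only a sketch: the feature grouping in `OPTIONAL_MODULES` is an assumption, and the module names are the import names of the packages listed above, not anything DTM itself exposes.

```python
from importlib.util import find_spec

# Import names of the optional packages; the grouping by feature is illustrative.
OPTIONAL_MODULES = {
    "s3": ["boto3"],
    "gcs": ["google.cloud.storage"],
    "azure": ["azure.storage.blob"],
    "dashboard": ["fastapi", "uvicorn"],
}


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package (e.g. "google") is missing.
        return False


def available_features(modules: dict[str, list[str]] = OPTIONAL_MODULES) -> dict[str, bool]:
    """Map each feature to whether all of its modules are importable."""
    return {
        feature: all(is_importable(name) for name in names)
        for feature, names in modules.items()
    }


if __name__ == "__main__":
    for feature, ok in available_features().items():
        print(f"{feature}: {'available' if ok else 'missing'}")
```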
🏁 Quick Start
1️⃣ Initialize
cd /path/to/data
dtm init
2️⃣ Snapshot
echo "important data" > dataset.csv
dtm snapshot -m "Initial baseline"
3️⃣ Visualize Changes
echo "bad data" >> dataset.csv
cid=$(dtm snapshot -m "Corrupted run")
dtm diff HEAD^ HEAD
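For a sense of what a unified diff between the two states of dataset.csv looks like, the sketch below builds one with Python's standard `difflib`. It only illustrates the diff format; how `dtm diff` computes or renders its output is not shown here, and the helper name `unified_text_diff` is hypothetical.

```python
import difflib


def unified_text_diff(before: str, after: str, name: str = "dataset.csv") -> str:
    """Return a unified diff between two versions of a text file."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
    )
    return "".join(lines)


# The two states produced by the Quick Start commands.
baseline = "important data\n"
corrupted = "important data\nbad data\n"

print(unified_text_diff(baseline, corrupted), end="")
```

Lines prefixed with `+` were added in the second snapshot, lines prefixed with `-` were removed, and unprefixed lines are unchanged context.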
4️⃣ Use the Dashboard
dtm web
# Open http://localhost:8000 to browse history visually!
☁️ Cloud & Remote
Push your data snapshots to the cloud for backup or sharing.
# S3
dtm remote add s3-backup s3://my-bucket/dtm-repo
dtm push s3-backup
# Google Cloud Storage
dtm remote add gcs-origin gs://my-data-lake/dtm
dtm pull gcs-origin
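The remote URLs above encode the backend in the scheme and the bucket in the host part. As a rough sketch of how such a URL can be split into its components, the following uses `urllib.parse` from the standard library. The `Remote` class, `parse_remote` function, and backend labels are hypothetical and not part of DTM's API. Only the `s3://` and `gs://` schemes shown above are covered, since the Azure URL format is not documented here.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Schemes taken from the examples above; the labels are illustrative.
BACKENDS = {"s3": "Amazon S3", "gs": "Google Cloud Storage"}


@dataclass(frozen=True)
class Remote:
    backend: str
    bucket: str
    prefix: str


def parse_remote(url: str) -> Remote:
    """Split a remote URL into backend, bucket, and key prefix."""
    parts = urlparse(url)
    if parts.scheme not in BACKENDS:
        raise ValueError(f"unsupported remote scheme: {parts.scheme!r}")
    if not parts.netloc:
        raise ValueError(f"remote URL has no bucket: {url!r}")
    return Remote(BACKENDS[parts.scheme], parts.netloc, parts.path.strip("/"))


print(parse_remote("s3://my-bucket/dtm-repo"))
print(parse_remote("gs://my-data-lake/dtm"))
```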
🔌 Integrations
Apache Airflow
Use DTMSnapshotOperator to automatically snapshot data in your DAGs.
from src.integrations.airflow import DTMSnapshotOperator

snapshot_task = DTMSnapshotOperator(
    task_id='snapshot_data',
    message='Post-transformation snapshot',
    repo_path='/data/project'
)
Prefect
Use the create_dtm_snapshot task in your flows.
from prefect import flow
from src.integrations.prefect import create_dtm_snapshot

@flow
def data_pipeline():
    # ... processing ...
    create_dtm_snapshot(message="Pipeline Success", repo_path=".")
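For pipelines that use neither Airflow nor Prefect, the same pattern can be approximated by shelling out to the `dtm snapshot -m` command from the Command Set. The sketch below is an assumption-laden illustration, not part of DTM: `with_snapshots`, `snapshot_command`, and `run_cli` are hypothetical helpers, and it assumes the `dtm` executable is on the PATH and that `repo_path` is an initialized DTM repository.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str], str], None]


def run_cli(argv: Sequence[str], repo_path: str) -> None:
    """Run a CLI command inside the repository; raise if it exits non-zero."""
    subprocess.run(list(argv), cwd=repo_path, check=True)


def snapshot_command(message: str) -> list[str]:
    """Build the argument list for `dtm snapshot -m <message>`."""
    return ["dtm", "snapshot", "-m", message]


def with_snapshots(
    step: Callable[[], object],
    name: str,
    repo_path: str = ".",
    runner: Runner = run_cli,
) -> object:
    """Snapshot before and after a pipeline step (hypothetical helper)."""
    runner(snapshot_command(f"Before {name}"), repo_path)
    # If the step raises, the "Before" snapshot remains as the restore point.
    result = step()
    runner(snapshot_command(f"After {name}"), repo_path)
    return result
```

Passing the runner as a parameter keeps the helper testable without the CLI installed. In a real pipeline you would call `with_snapshots(transform, "transform", repo_path="/data/project")` and use `dtm checkout` to restore the earlier snapshot if the step fails.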
🏗️ Project Structure
data-time-machine/
├── src/
│ ├── cli.py # CLI Entry point
│ ├── core/
│ │ ├── backends.py # Storage Backends (Local, S3, GCS, Azure)
│ │ ├── remote.py # Remote Manager (Push/Pull)
│ │ ├── storage.py # Storage Engine & Compression
│ │ └── controller.py # Business Logic
│ ├── web/ # FastAPI Web Dashboard
│ └── integrations/ # Airflow & Prefect modules
├── scripts/ # Utility scripts
└── README.md
📋 Roadmap (Completed)
- Add diff visualization between snapshots
- Implement remote repository support
- Add compression for large file storage
- Create web-based visualization dashboard
- Support for incremental snapshots
- Integration with popular data pipeline frameworks (Airflow, Prefect)
- Cloud storage backends (S3, GCS, Azure Blob)
👤 Author
Azmat Siddique
- GitHub: @azmatsiddique
- Project Link: github.com/azmatsiddique/data-time-machine
⭐ Star this repo if you find it useful!
Made with ❤️ by Azmat Siddique
File details
Details for the file data_time_machine-0.2.3.tar.gz.
File metadata
- Download URL: data_time_machine-0.2.3.tar.gz
- Upload date:
- Size: 3.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a2027a2d3c2cc803859ea58b71dd24ba15333d59b69a8d870e7f1d79546b85f9 |
| MD5 | a92c50f59ee106a8d484b0e949370f86 |
| BLAKE2b-256 | b4fc44f447a14966f76ebf3da545c2732306ee8ed0d424d9ae25a38e30ec342d |
File details
Details for the file data_time_machine-0.2.3-py3-none-any.whl.
File metadata
- Download URL: data_time_machine-0.2.3-py3-none-any.whl
- Upload date:
- Size: 19.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b87b26fe158d2ff52bec3242f099520def3c93bea15479f6bc96436ad086cc1 |
| MD5 | 415eb346c83e34a6c37a970e023bdf66 |
| BLAKE2b-256 | 477fd3dd1ee9128c586fd68090b5799807d1fca2a860ebf1170ed37fdb2997f1 |