A git-like state management system for data pipelines

🕰️ Data Time Machine

Git for Your Data Pipelines

Python 3.10+ • PyPI version • License: MIT • Code Style: Black

Never lose track of your data states again. Roll back, debug, and restore with confidence.

Features • Installation • Quick Start • Cloud & Remote • Dashboard • Integrations


🌟 Overview

Data Time Machine (DTM) is a state management system for data pipelines, inspired by Git's version control model. When a complex data transformation fails in production, DTM lets you snapshot the entire data environment and roll back to a known-good state.

Why DTM?

  • 🔍 Debug Complex Failures: Capture exact data states before and after pipeline runs
  • ☁️ Cloud Native: Push snapshots to S3, GCS, or Azure Blob Storage
  • Visual Insights: Explore commit history and diffs via a built-in Web Dashboard
  • Optimized Storage: Deduplication and gzip compression for handling large datasets efficiently
  • Pipeline Ready: Native integrations for Apache Airflow and Prefect

✨ Features

Core Capabilities

  • 🔐 Content-Addressable Storage: Efficient deduplication and compression
  • 📊 Metadata & Diffs: View unified diffs of data changes between snapshots
  • ⚡ Incremental Snapshots: Automatically stores only the files that changed
  • 🌐 Remote Support: Push/Pull to S3, Google Cloud Storage, and Azure Blob
  • 🎨 Web Dashboard: Interactive browser-based visualization of your data history
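
To make the content-addressable storage idea concrete, here is a minimal sketch in Python. It is not DTM's actual storage engine: the `ContentStore` class, its method names, and the on-disk layout are all hypothetical, and SHA-256 is an assumed choice of hash. It only illustrates how keying blobs by a content hash gives deduplication for free, with gzip applied on write.

```python
import gzip
import hashlib
from pathlib import Path


class ContentStore:
    """Toy content-addressable store (hypothetical, not DTM's real engine).

    Each blob is stored under the SHA-256 hex digest of its uncompressed bytes.
    """

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store data and return its digest; identical content is written once."""
        digest = hashlib.sha256(data).hexdigest()
        blob = self.root / digest
        if not blob.exists():
            blob.write_bytes(gzip.compress(data))
        return digest

    def get(self, digest: str) -> bytes:
        """Return the original bytes for a previously stored digest."""
        return gzip.decompress((self.root / digest).read_bytes())
```

Storing the same bytes twice returns the same digest and leaves a single file on disk, which is the property the deduplication bullet relies on.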

Command Set

dtm init                       # Initialize a new DTM repository
dtm snapshot -m "message"      # Snapshot current state
dtm checkout <commit-id>       # Restore to a specific snapshot
dtm diff <commit_a> <commit_b> # Compare two snapshots
dtm log                        # View snapshot history
dtm web                        # Launch Visualization Dashboard
dtm remote add origin s3://... # Add a remote storage backend
dtm push origin                # Push snapshots to cloud
dtm pull origin                # Pull snapshots from cloud
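
If you want to drive these commands from a Python script, one option is a thin `subprocess` wrapper. This is a sketch under assumptions: `dtm_command` and `run_dtm` are hypothetical helper names, not part of DTM, and the wrapper only forwards the arguments shown in the command list above.

```python
import shutil
import subprocess


def dtm_command(*args: str) -> list[str]:
    """Build the argv list for a dtm invocation, e.g. dtm_command("log")."""
    return ["dtm", *args]


def run_dtm(*args: str, cwd: str = ".") -> str:
    """Run dtm inside `cwd` and return its stdout.

    Raises FileNotFoundError if dtm is not on PATH, and
    subprocess.CalledProcessError if the command exits non-zero.
    """
    if shutil.which("dtm") is None:
        raise FileNotFoundError("dtm executable not found on PATH")
    result = subprocess.run(
        dtm_command(*args), cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout
```

For example, `run_dtm("snapshot", "-m", "Initial baseline", cwd="/path/to/data")` would run the snapshot command in that directory.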

🚀 Installation

Prerequisites

  • Python 3.10 or higher
  • pip package manager

Install from PyPI

pip install data-time-machine

Install with Cloud Support

To enable S3, GCS, or Azure support, install the corresponding cloud SDKs alongside DTM:

pip install boto3 google-cloud-storage azure-storage-blob

(For the web dashboard, install fastapi and uvicorn.)
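
A quick way to see which of these optional packages are present in the current environment is to probe their import names. The helper below is an illustrative sketch, not something DTM ships; `OPTIONAL_MODULES`, `is_installed`, and `missing_packages` are hypothetical names.

```python
from importlib.util import find_spec

# pip package name -> import name, for the optional packages mentioned above.
OPTIONAL_MODULES = {
    "boto3": "boto3",
    "google-cloud-storage": "google.cloud.storage",
    "azure-storage-blob": "azure.storage.blob",
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
}


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package (e.g. "google") is missing.
        return False


def missing_packages() -> list[str]:
    """List the pip names of optional packages that are not installed."""
    return [pip for pip, mod in OPTIONAL_MODULES.items() if not is_installed(mod)]
```

Calling `missing_packages()` returns the pip names you would still need to install for full cloud and dashboard support.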


🏁 Quick Start

1️⃣ Initialize

cd /path/to/data
dtm init

2️⃣ Snapshot

echo "important data" > dataset.csv
dtm snapshot -m "Initial baseline"
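
Conceptually, a snapshot can be thought of as a manifest that maps each file to a hash of its contents; comparing two manifests shows which files an incremental snapshot would need to store. The sketch below assumes that model. `build_manifest` and `changed_paths` are hypothetical helpers, and DTM's real snapshot format may differ.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    manifest: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            manifest[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def changed_paths(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return paths that are new or whose content hash differs from `previous`."""
    return sorted(p for p, digest in current.items() if previous.get(p) != digest)
```

Only the paths returned by `changed_paths` would need new blobs; unchanged files can keep pointing at the content already stored.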

3️⃣ Visualize Changes

# Append a bad row, take a second snapshot, then compare it with the previous one
echo "bad data" >> dataset.csv
cid=$(dtm snapshot -m "Corrupted run")
dtm diff HEAD^ HEAD
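
For a sense of what a unified diff of the two dataset.csv states looks like, the standard library's `difflib` can produce one. This is only an illustration of the diff format using the file contents from the steps above; it does not call DTM, and `unified` is a hypothetical helper name.

```python
import difflib


def unified(before: str, after: str, name: str = "dataset.csv") -> str:
    """Return a unified diff between two versions of a text file."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
    )
    return "".join(lines)


# The baseline content versus the content after the corrupted run.
print(unified("important data\n", "important data\nbad data\n"))
```

The appended row shows up as a line prefixed with `+`, while the unchanged first row appears as context.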

4️⃣ Use the Dashboard

dtm web
# Open http://localhost:8000 to browse history visually!

☁️ Cloud & Remote

Push your data snapshots to the cloud for backup or sharing.

# S3
dtm remote add s3-backup s3://my-bucket/dtm-repo
dtm push s3-backup

# Google Cloud Storage
dtm remote add gcs-origin gs://my-data-lake/dtm
dtm pull gcs-origin
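
The remote URLs above follow the usual scheme://bucket/prefix shape. As a rough sketch of how such a URL could be split into its parts, the snippet below uses `urllib.parse`. The `parse_remote` helper and the `SCHEME_TO_BACKEND` table are hypothetical and cover only the two schemes shown in the examples; the Azure URL form is not shown in this README, so it is left out rather than guessed.

```python
from urllib.parse import urlparse

# Only the schemes that appear in the examples above (assumed mapping).
SCHEME_TO_BACKEND = {"s3": "S3", "gs": "Google Cloud Storage"}


def parse_remote(url: str) -> tuple[str, str, str]:
    """Split a remote URL into (backend name, bucket, key prefix)."""
    parts = urlparse(url)
    if parts.scheme not in SCHEME_TO_BACKEND:
        raise ValueError(f"unsupported remote scheme: {parts.scheme!r}")
    return SCHEME_TO_BACKEND[parts.scheme], parts.netloc, parts.path.lstrip("/")
```

For instance, `parse_remote("s3://my-bucket/dtm-repo")` yields the backend name, the bucket `my-bucket`, and the prefix `dtm-repo`.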

🔌 Integrations

Apache Airflow

Use DTMSnapshotOperator to automatically snapshot data in your DAGs.

from src.integrations.airflow import DTMSnapshotOperator

snapshot_task = DTMSnapshotOperator(
    task_id='snapshot_data',
    message='Post-transformation snapshot',
    repo_path='/data/project'
)

Prefect

Use the create_dtm_snapshot task in your flows.

from prefect import flow

from src.integrations.prefect import create_dtm_snapshot

@flow
def data_pipeline():
    # ... processing ...
    # Record the resulting data state once processing has finished
    create_dtm_snapshot(message="Pipeline Success", repo_path=".")

🏗️ Project Structure

data-time-machine/
├── src/
│   ├── cli.py              # CLI Entry point
│   ├── core/
│   │   ├── backends.py     # Storage Backends (Local, S3, GCS, Azure)
│   │   ├── remote.py       # Remote Manager (Push/Pull)
│   │   ├── storage.py      # Storage Engine & Compression
│   │   └── controller.py   # Business Logic
│   ├── web/                # FastAPI Web Dashboard
│   └── integrations/       # Airflow & Prefect modules
├── scripts/                # Utility scripts
└── README.md

📋 Roadmap (Completed)

  • Add diff visualization between snapshots
  • Implement remote repository support
  • Add compression for large file storage
  • Create web-based visualization dashboard
  • Support for incremental snapshots
  • Integration with popular data pipeline frameworks (Airflow, Prefect)
  • Cloud storage backends (S3, GCS, Azure Blob)

👤 Author

Azmat Siddique


⭐ Star this repo if you find it useful!

Made with ❤️ by Azmat Siddique
