A git-like state management system for data pipelines

🕰️ Data Time Machine

Git for Your Data Pipelines

Python 3.10+ • PyPI version • License: MIT • Code Style: Black

Never lose track of your data states again. Roll back, debug, and restore with confidence.

Features • Installation • Quick Start • Documentation • Contributing


🌟 Overview

Data Time Machine (DTM) is a state management system for data pipelines, modeled on Git's approach to version control. DTM lets you snapshot entire data environments and, when a complex data transformation fails in production, roll back to a known-good state.

Why DTM?

  • 🔍 Debug Complex Failures: Capture exact data states before and after pipeline runs
  • ⏮️ Instant Rollbacks: Restore entire environments to previous snapshots in seconds
  • 📸 Automatic Snapshots: Configure automatic state capture at critical pipeline stages
  • 🎯 Lightweight & Fast: Content-addressable storage means duplicate data is stored only once
  • 🔗 Git-Like Workflow: Familiar commands (init, snapshot, checkout, log)

✨ Features

Core Capabilities

  • 🔐 Content-Addressable Storage: Efficient deduplication using SHA-256 hashing
  • 📊 Metadata Tracking: Complete audit trail of all data state changes
  • 🌳 Branch Support: Manage multiple data environments simultaneously
  • ⚡ Fast Restoration: Quickly restore files from any snapshot
  • 🎨 Clean CLI: Intuitive command-line interface built with Click
  • 🧪 Fully Tested: Comprehensive test suite with pytest
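
To make the deduplication claim concrete, here is a minimal sketch of content-addressable storage. It is an illustration only, not DTM's actual storage engine: the `BlobStore` class, its `put`/`get` methods, and the `objects` directory name are all hypothetical. The only detail taken from this README is the use of SHA-256 via `hashlib`.

```python
import hashlib
import tempfile
from pathlib import Path


class BlobStore:
    """Toy content-addressable store (hypothetical, not DTM's real class).

    Each blob is saved in a file named after the SHA-256 digest of its bytes.
    """

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        blob_path = self.root / digest
        # Identical content always hashes to the same name,
        # so a second copy is never written.
        if not blob_path.exists():
            blob_path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        return (self.root / digest).read_bytes()


def demo() -> tuple[str, str, int]:
    """Store the same bytes twice and count the blobs on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        store = BlobStore(Path(tmp) / "objects")
        first = store.put(b"id,value\n1,100\n2,200\n")
        second = store.put(b"id,value\n1,100\n2,200\n")  # same bytes again
        blob_count = len(list(store.root.iterdir()))
        return first, second, blob_count
```

Calling `demo()` returns the same digest twice and a blob count of one, which is the property that lets many snapshots share unchanged files.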

Command Set

dtm init                    # Initialize a new DTM repository
dtm snapshot -m "message"   # Snapshot current state
dtm checkout <commit-id>    # Restore to a specific snapshot
dtm log                     # View snapshot history
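
This README documents only the CLI, so one way to drive these commands from a Python pipeline is to shell out to `dtm`. The sketch below is an assumption-laden illustration: `run_dtm` and `snapshot_around` are hypothetical helper names, and the only DTM behavior relied on is the documented `dtm snapshot -m "message"` command.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

# A runner receives the arguments that follow `dtm` on the command line.
Runner = Callable[[Sequence[str]], object]


def run_dtm(args: Sequence[str]) -> object:
    """Default runner: invoke the `dtm` executable and raise if it fails."""
    return subprocess.run(["dtm", *args], check=True)


@contextmanager
def snapshot_around(stage: str, runner: Runner = run_dtm) -> Iterator[None]:
    """Snapshot before a pipeline stage, and again once it succeeds."""
    runner(["snapshot", "-m", f"before {stage}"])
    yield
    # Reached only if the stage body did not raise, so a failed stage
    # leaves the "before" snapshot as the latest entry in the history.
    runner(["snapshot", "-m", f"after {stage}"])
```

A stage would then be wrapped as `with snapshot_around("clean"): ...`. Passing a custom `runner` makes the helper testable without a DTM repository on disk.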

๐Ÿš€ Installation

Prerequisites

  • Python 3.10 or higher
  • pip package manager

Install from PyPI (Recommended)

The easiest way to install Data Time Machine:

pip install data-time-machine

Install from Source

For development or to get the latest changes:

# Clone the repository
git clone https://github.com/azmatsiddique/data-time-machine.git
cd data-time-machine

# Install in editable mode
pip install -e .

Verify Installation

dtm --help

🏁 Quick Start

1️⃣ Initialize Your Data Environment

cd /path/to/your/data/project
dtm init

2️⃣ Create Your First Snapshot

# Make some changes to your data files
echo "id,value" > data.csv
echo "1,100" >> data.csv
echo "2,200" >> data.csv

# Snapshot the current state
dtm snapshot -m "Initial clean dataset"

3️⃣ Simulate Data Corruption

# Oops! Pipeline bug corrupts your data
echo "id,value" > data.csv
echo "1,ERROR" >> data.csv
echo "2,200" >> data.csv

4️⃣ Roll Back to Safety

# View your snapshot history
dtm log

# Restore to the last good state, using the commit ID shown by dtm log
dtm checkout <commit-id>

# Your data is back! ✨
cat data.csv
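
Beyond eyeballing the output of `cat`, a byte-for-byte check can confirm that the rollback worked. The following is a small sketch using only the standard library. `file_digest` is a hypothetical helper defined here, not part of DTM.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Record `file_digest(Path("data.csv"))` right after taking the snapshot in step 2, then compare it with the digest computed after the checkout in step 4. Equal digests mean the restored file matches the snapshotted bytes exactly.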

📖 Documentation

How It Works

DTM uses a three-tier architecture:

  1. Storage Layer: Content-addressable blob storage for deduplication
  2. Metadata Layer: Tracks commits, branches, and file relationships
  3. Controller Layer: Orchestrates snapshots, checkouts, and workspace management

┌─────────────────────────────────────────┐
│           CLI Interface (Click)         │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│       Controller (DTMController)        │
│  • Snapshot creation & restoration      │
│  • High-level workflow orchestration    │
└─────┬────────────────────────┬──────────┘
      │                        │
┌─────▼──────────┐    ┌───────▼──────────┐
│ MetadataManager│    │  StorageEngine   │
│ • Commits      │    │  • Hashing       │
│ • Branches     │    │  • Blobs         │
│ • References   │    │  • Restoration   │
└────────────────┘    └──────────────────┘
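
The interplay of the three layers can be sketched in a few lines of in-memory Python. This is a simplified model under stated assumptions, not the code in `src/core/`: `MiniDTM`, `Commit`, and the way commit IDs are derived are all hypothetical, and plain dataclasses are used here instead of the project's Pydantic models to keep the example dependency-free.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Commit:
    commit_id: str
    message: str
    parent: Optional[str]
    files: dict[str, str]  # relative path -> blob digest


@dataclass
class MiniDTM:
    """In-memory stand-in for the three layers (hypothetical names)."""

    blobs: dict[str, bytes] = field(default_factory=dict)     # storage layer
    commits: dict[str, Commit] = field(default_factory=dict)  # metadata layer
    head: Optional[str] = None

    def snapshot(self, workspace: dict[str, bytes], message: str) -> str:
        files = {}
        for path, data in sorted(workspace.items()):
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)  # identical content stored once
            files[path] = digest
        # Assumed scheme: derive the ID from parent, message, and manifest.
        payload = repr((self.head, message, sorted(files.items()))).encode()
        commit_id = hashlib.sha256(payload).hexdigest()[:12]
        self.commits[commit_id] = Commit(commit_id, message, self.head, files)
        self.head = commit_id
        return commit_id

    def checkout(self, commit_id: str) -> dict[str, bytes]:
        """Rebuild the workspace contents recorded by a commit."""
        commit = self.commits[commit_id]
        self.head = commit_id
        return {path: self.blobs[digest] for path, digest in commit.files.items()}

    def log(self) -> list[str]:
        """Walk parent links from HEAD, newest message first."""
        messages, current = [], self.head
        while current is not None:
            commit = self.commits[current]
            messages.append(commit.message)
            current = commit.parent
        return messages
```

Here `snapshot` and `checkout` play the controller role: they ask the storage dictionary for blobs and the metadata dictionary for commits, mirroring the split between StorageEngine and MetadataManager in the diagram.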

Running the Demo

Experience DTM in action with the included demo script:

python demo.py

This demonstrates:

  • ✅ Repository initialization
  • ✅ Data state snapshotting
  • ✅ Simulated pipeline failure
  • ✅ Successful state restoration

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=src tests/

# Run specific test file
pytest tests/test_controller.py -v

🏗️ Project Structure

data-time-machine/
├── src/
│   ├── cli.py              # Command-line interface
│   ├── core/
│   │   ├── controller.py   # Main orchestration logic
│   │   ├── metadata.py     # Metadata management
│   │   └── storage.py      # Storage engine
│   └── models/
│       └── schema.py       # Pydantic data models
├── tests/
│   ├── test_controller.py
│   ├── test_metadata.py
│   ├── test_storage.py
│   └── conftest.py
├── demo.py                 # Interactive demonstration
├── pyproject.toml          # Project configuration
└── README.md

🛠️ Technology Stack

  • Language: Python 3.10+
  • CLI Framework: Click 8.1+
  • Data Validation: Pydantic 2.5+
  • Testing: pytest 7.4+
  • Hashing: SHA-256 (hashlib)
  • Build System: Hatchling

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. 🍴 Fork the repository
  2. 🌿 Create a feature branch (git checkout -b feature/amazing-feature)
  3. ✅ Make your changes and add tests
  4. ✔️ Ensure all tests pass (pytest)
  5. 💬 Commit your changes (git commit -m 'Add amazing feature')
  6. 📤 Push to your branch (git push origin feature/amazing-feature)
  7. 🎉 Open a Pull Request

Development Setup

# Clone your fork (replace <your-username> with your GitHub username)
git clone https://github.com/<your-username>/data-time-machine.git
cd data-time-machine

# Install in development mode with test dependencies
pip install -e ".[dev]"

# Run tests to verify setup
pytest

📋 Roadmap

  • Add diff visualization between snapshots
  • Implement remote repository support
  • Add compression for large file storage
  • Create web-based visualization dashboard
  • Support for incremental snapshots
  • Integration with popular data pipeline frameworks (Airflow, Prefect)
  • Cloud storage backends (S3, GCS, Azure Blob)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


👤 Author

Azmat Siddique


🙏 Acknowledgments

  • Inspired by Git's elegant version control design
  • Built with modern Python best practices
  • Thanks to the open-source community for amazing tools

⭐ Star this repo if you find it useful!

Made with ❤️ by Azmat Siddique
