A git-like state management system for data pipelines
Data Time Machine
Git for Your Data Pipelines
Never lose track of your data states again. Roll back, debug, and restore with confidence.
Features • Installation • Quick Start • Documentation • Contributing
Overview
Data Time Machine (DTM) is a state management system for data pipelines, modeled on Git's approach to version control. DTM snapshots an entire data environment, so when a complex transformation fails in production you can roll back to a known-good state.
Why DTM?
- Debug Complex Failures: Capture exact data states before and after pipeline runs
- Instant Rollbacks: Restore entire environments to previous snapshots in seconds
- Automatic Snapshots: Configure automatic state capture at critical pipeline stages
- Lightweight & Fast: Content-addressable storage means duplicate data is stored only once
- Git-Like Workflow: Familiar commands (init, snapshot, checkout, log)
Features
Core Capabilities
- Content-Addressable Storage: Efficient deduplication using SHA-256 hashing
- Metadata Tracking: Complete audit trail of all data state changes
- Branch Support: Manage multiple data environments simultaneously
- Fast Restoration: Quickly restore files from any snapshot
- Clean CLI: Intuitive command-line interface built with Click
- Fully Tested: Comprehensive test suite with pytest
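To make the deduplication claim concrete, here is a minimal sketch of content-addressable storage using SHA-256. It is an illustration only and not DTM's actual storage engine: the `BlobStore` class, its `put` and `get` methods, and the one-file-per-digest layout are all assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


class BlobStore:
    """Toy content-addressable store (hypothetical, not DTM's real class).

    Each blob is saved in a file named after the SHA-256 digest of its content.
    """

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store data and return its digest; identical content is written once."""
        digest = hashlib.sha256(data).hexdigest()
        blob_path = self.root / digest
        if not blob_path.exists():  # deduplication: skip content already stored
            blob_path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Return the bytes previously stored under this digest."""
        return (self.root / digest).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    store = BlobStore(Path(tmp) / "blobs")
    first = store.put(b"id,value\n1,100\n2,200\n")
    second = store.put(b"id,value\n1,100\n2,200\n")  # same content again
    print(first == second)                    # True: same content, same digest
    print(len(list(store.root.iterdir())))    # 1: only one blob on disk
```

Because the digest doubles as the storage key, two snapshots that contain an unchanged file can both point at the same blob.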
Command Set
dtm init # Initialize a new DTM repository
dtm snapshot -m "message" # Snapshot current state
dtm checkout <commit-id> # Restore to a specific snapshot
dtm log # View snapshot history
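If your pipeline is written in Python, one way to take snapshots around a stage is to shell out to the CLI. The sketch below uses only the `dtm snapshot -m` command listed above; the helper names `snapshot_command` and `run_stage_with_snapshots` are hypothetical, and nothing here assumes that DTM exposes a Python API. The `runner` parameter defaults to `subprocess.run` and is replaced by a recording function in the demo, so the example runs even where `dtm` is not installed.

```python
import subprocess
from typing import Callable


def snapshot_command(message: str) -> list[str]:
    """Build the argument list for `dtm snapshot -m <message>`."""
    return ["dtm", "snapshot", "-m", message]


def run_stage_with_snapshots(
    name: str,
    stage: Callable[[], None],
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Snapshot before and after a pipeline stage (hypothetical helper)."""
    runner(snapshot_command(f"before {name}"), check=True)
    stage()  # if this raises, the "before" snapshot is the state to restore
    runner(snapshot_command(f"after {name}"), check=True)


# Dry run: record the commands instead of executing them.
recorded: list[list[str]] = []
run_stage_with_snapshots(
    "clean_orders",
    stage=lambda: None,
    runner=lambda cmd, **kwargs: recorded.append(cmd),
)
for cmd in recorded:
    print(" ".join(cmd))
```

With the default runner, `check=True` makes a failing `dtm` call raise `subprocess.CalledProcessError` so the stage does not proceed without a snapshot.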
Installation
Prerequisites
- Python 3.10 or higher
- pip package manager
Install from PyPI (Recommended)
The easiest way to install Data Time Machine:
pip install data-time-machine
Install from Source
For development or to get the latest changes:
# Clone the repository
git clone https://github.com/azmatsiddique/data-time-machine.git
cd data-time-machine
# Install in editable mode
pip install -e .
Verify Installation
dtm --help
Quick Start
1. Initialize Your Data Environment
cd /path/to/your/data/project
dtm init
2. Create Your First Snapshot
# Make some changes to your data files
echo "id,value" > data.csv
echo "1,100" >> data.csv
echo "2,200" >> data.csv
# Snapshot the current state
dtm snapshot -m "Initial clean dataset"
3. Simulate a Data Corruption
# Oops! Pipeline bug corrupts your data
echo "id,value" > data.csv
echo "1,ERROR" >> data.csv
echo "2,200" >> data.csv
4. Roll Back to Safety
# View your snapshot history
dtm log
# Restore to the last good state
dtm checkout <commit-id>
# Your data is back!
cat data.csv
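Beyond eyeballing the output of `cat`, you can confirm that a rollback restored a file byte for byte by comparing SHA-256 digests taken before the corruption and after the checkout. This is a small sketch that is independent of DTM; `file_sha256` is a hypothetical helper, not something the package provides.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Intended use alongside the Quick Start steps:
#   good = file_sha256(Path("data.csv"))      # right after the first snapshot
#   ... data gets corrupted, then: dtm checkout <commit-id> ...
#   assert file_sha256(Path("data.csv")) == good
```

Reading in chunks keeps memory use flat, which matters once the data files are larger than the three-row CSV used here.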
Documentation
How It Works
DTM uses a three-tier architecture:
- Storage Layer: Content-addressable blob storage for deduplication
- Metadata Layer: Tracks commits, branches, and file relationships
- Controller Layer: Orchestrates snapshots, checkouts, and workspace management
┌─────────────────────────────────────────┐
│          CLI Interface (Click)          │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│       Controller (DTMController)        │
│  • Snapshot creation & restoration      │
│  • High-level workflow orchestration    │
└─────┬────────────────────────┬──────────┘
      │                        │
┌─────▼──────────┐    ┌────────▼──────────┐
│ MetadataManager│    │  StorageEngine    │
│  • Commits     │    │  • Hashing        │
│  • Branches    │    │  • Blobs          │
│  • References  │    │  • Restoration    │
└────────────────┘    └───────────────────┘
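As a rough illustration of what the metadata layer tracks, the sketch below models commits as records holding a message, a parent pointer, and a map from file path to blob digest, with one head pointer per branch. The real `MetadataManager` interface is not shown in this README, so the `Commit` and `MetadataStore` names, their fields, and the way commit ids are derived are all assumptions. The digest values in the demo are placeholders.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    """One snapshot record (hypothetical shape)."""

    message: str
    files: dict[str, str]          # file path -> blob digest
    parent: Optional[str] = None   # id of the previous commit, if any

    @property
    def commit_id(self) -> str:
        # Assumption: derive the id by hashing the commit's own content.
        payload = json.dumps(
            {"message": self.message, "files": self.files, "parent": self.parent},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MetadataStore:
    """Toy metadata layer: commits by id plus one head pointer per branch."""

    def __init__(self) -> None:
        self.commits: dict[str, Commit] = {}
        self.branches: dict[str, str] = {}

    def commit(self, branch: str, message: str, files: dict[str, str]) -> str:
        """Record a new commit on a branch and move the branch head to it."""
        commit = Commit(message, dict(files), parent=self.branches.get(branch))
        self.commits[commit.commit_id] = commit
        self.branches[branch] = commit.commit_id
        return commit.commit_id

    def log(self, branch: str) -> list[str]:
        """Return commit messages, newest first, by following parent pointers."""
        messages: list[str] = []
        current = self.branches.get(branch)
        while current is not None:
            commit = self.commits[current]
            messages.append(commit.message)
            current = commit.parent
        return messages


meta = MetadataStore()
first_id = meta.commit("main", "Initial clean dataset", {"data.csv": "digest-a"})
meta.commit("main", "After nightly load", {"data.csv": "digest-b"})
print(meta.log("main"))  # ['After nightly load', 'Initial clean dataset']
print(meta.commits[first_id].files)  # {'data.csv': 'digest-a'}
```

In this model a checkout amounts to looking up a commit's `files` map and asking the storage layer for each digest, which is why the controller sits above both layers in the diagram.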
Running the Demo
Experience DTM in action with the included demo script:
python demo.py
This demonstrates:
- Repository initialization
- Data state snapshotting
- Simulated pipeline failure
- Successful state restoration
Running Tests
# Run all tests
pytest
# Run with coverage (requires the pytest-cov plugin)
pytest --cov=src tests/
# Run specific test file
pytest tests/test_controller.py -v
Project Structure
data-time-machine/
├── src/
│   ├── cli.py              # Command-line interface
│   ├── core/
│   │   ├── controller.py   # Main orchestration logic
│   │   ├── metadata.py     # Metadata management
│   │   └── storage.py      # Storage engine
│   └── models/
│       └── schema.py       # Pydantic data models
├── tests/
│   ├── test_controller.py
│   ├── test_metadata.py
│   ├── test_storage.py
│   └── conftest.py
├── demo.py                 # Interactive demonstration
├── pyproject.toml          # Project configuration
└── README.md
Technology Stack
- Language: Python 3.10+
- CLI Framework: Click 8.1+
- Data Validation: Pydantic 2.5+
- Testing: pytest 7.4+
- Hashing: SHA-256 (hashlib)
- Build System: Hatchling
Contributing
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes and add tests
- Ensure all tests pass (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to your branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Setup
# Clone your fork
git clone https://github.com/<your-username>/data-time-machine.git
cd data-time-machine
# Install in development mode with test dependencies
pip install -e ".[dev]"
# Run tests to verify setup
pytest
Roadmap
- Add diff visualization between snapshots
- Implement remote repository support
- Add compression for large file storage
- Create web-based visualization dashboard
- Support for incremental snapshots
- Integration with popular data pipeline frameworks (Airflow, Prefect)
- Cloud storage backends (S3, GCS, Azure Blob)
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Azmat Siddique
- GitHub: @azmatsiddique
- Project Link: github.com/azmatsiddique/data-time-machine
Acknowledgments
- Inspired by Git's elegant version control design
- Built with modern Python best practices
- Thanks to the open-source community for amazing tools
Star this repo if you find it useful!
Made with ❤️ by Azmat Siddique