Dataset versioning and migration framework for ML data
DataShift
DataShift is a dataset versioning and migration framework designed for machine learning workflows. Think of it as "Git for data": it lets you track changes, compare versions, and manage dataset lifecycles.
Key Features
- Dataset Versioning: Snapshot datasets (CSV, Parquet) and track their evolution over time.
- Diffing: Compare two versions of a dataset to see added/removed rows and schema changes.
- Drift Detection: Guardrails to check for data drift between versions (e.g., row count changes, null distribution).
- Tags & Channels: Reference specific versions with tags (e.g., #latest) or moving channels (e.g., :prod).
- Python API & CLI: Flexible usage through a command-line interface or directly within your Python code.
- Experiment Tracking: Link datasets to experiments for reproducibility.
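As a rough illustration of what a row and schema diff involves, the sketch below compares two small CSV snapshots using only the Python standard library. It is not DataShift's implementation: `read_rows` and `diff_csv` are hypothetical helper names, and comparing rows as multisets over the shared columns is an assumed strategy.

```python
import csv
import io
from collections import Counter


def read_rows(csv_text):
    """Parse CSV text into (header, list of row tuples)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return header, [tuple(row) for row in reader]


def diff_csv(old_text, new_text):
    """Illustrative diff: schema changes plus added/removed rows.

    Rows are compared on the columns both versions share, as multisets,
    so duplicate rows are counted rather than collapsed.
    """
    old_header, old_rows = read_rows(old_text)
    new_header, new_rows = read_rows(new_text)
    shared = [c for c in old_header if c in new_header]

    def project(header, rows):
        # Keep only the shared columns, in the old version's column order.
        idx = [header.index(c) for c in shared]
        return Counter(tuple(row[i] for i in idx) for row in rows)

    old_counts = project(old_header, old_rows)
    new_counts = project(new_header, new_rows)
    return {
        "added_columns": [c for c in new_header if c not in old_header],
        "removed_columns": [c for c in old_header if c not in new_header],
        "added_rows": sorted((new_counts - old_counts).elements()),
        "removed_rows": sorted((old_counts - new_counts).elements()),
    }


# Hypothetical sample data standing in for two dataset versions.
v1 = "id,name\n1,ann\n2,bob\n3,cat\n"
v2 = "id,name,age\n2,bob,30\n3,cat,41\n4,dan,25\n"
summary = diff_csv(v1, v2)
print(summary)
```

Here the summary reports one added column (`age`), one added row, and one removed row. A production tool would likely diff on a primary key so that edited rows show up as modifications rather than as a remove plus an add.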
Installation
From Source
pip install .
With Optional Dependencies
For PyTorch integration:
pip install ".[torch]"
For Parquet support:
pip install ".[parquet]"
For everything (including dev tools):
pip install ".[all]"
For Development
pip install -e ".[dev]"
Quick Start
CLI Usage
- Initialize DataShift in your project directory:
datashift init
- Snapshot a Dataset:
# Create a version of your customers data
datashift snapshot ./data/customers.csv --name customers
- List Datasets:
datashift list
- Show Dataset Details:
datashift show customers
- Compare Versions:
# Compare version 1 and version 2
datashift diff customers@v1 customers@v2
- Checkout a Version:
# Restore a specific version to a file
datashift checkout customers@v1 ./restored_customers.csv
- Drift Check (Guardrails):
# Check if the new version deviates too much from the baseline
datashift check customers@v2 --baseline customers@v1 --max-row-change 0.1
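To make the guardrail idea concrete, here is a minimal sketch of a drift check over two in-memory versions. It assumes that `--max-row-change 0.1` means "fail if the row count moves by more than 10% relative to the baseline", which the README does not spell out. The function names and the `max_null_change` threshold are hypothetical and are not part of DataShift's API.

```python
def row_change_ratio(baseline_count, candidate_count):
    """Relative change in row count, measured against the baseline."""
    if baseline_count == 0:
        return 0.0 if candidate_count == 0 else float("inf")
    return abs(candidate_count - baseline_count) / baseline_count


def null_fraction(rows, column):
    """Share of rows whose value for `column` is missing (None or empty string)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def check_drift(baseline, candidate, max_row_change=0.1, max_null_change=0.05):
    """Return a list of violation messages; an empty list means the check passed."""
    violations = []
    ratio = row_change_ratio(len(baseline), len(candidate))
    if ratio > max_row_change:
        violations.append(
            f"row count changed by {ratio:.0%} (limit {max_row_change:.0%})"
        )
    # Compare the null fraction of every column seen in either version.
    columns = set().union(*(row.keys() for row in baseline + candidate))
    for column in sorted(columns):
        delta = abs(null_fraction(candidate, column) - null_fraction(baseline, column))
        if delta > max_null_change:
            violations.append(f"null fraction of '{column}' moved by {delta:.0%}")
    return violations


# Hypothetical data: the candidate has 20% more rows and three blank emails.
baseline = [{"id": i, "email": f"u{i}@example.com"} for i in range(10)]
candidate = [
    {"id": i, "email": "" if i < 3 else f"u{i}@example.com"} for i in range(12)
]
violations = check_drift(baseline, candidate)
for message in violations:
    print(message)
```

With this sample data both guardrails trip: the row count grows by 20% against a 10% limit, and a quarter of the `email` values become empty. Comparing a version against itself returns no violations.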
Python API Usage
import pandas as pd
from datashift import snapshot_dataset, load, diff_datasets, format_diff_summary
# 1. Snapshot a dataset
result = snapshot_dataset(dataset_name="metrics", source_path="metrics.csv")
print(f"Created version: {result.version}")
# 2. Load a specific version into a DataFrame
df = load("metrics#latest")
print(df.head())
# 3. Diff two versions
diff = diff_datasets("metrics@v1", "metrics@v2")
print(format_diff_summary(diff))
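The examples above address versions with strings such as `metrics@v1` and `metrics#latest`, and the feature list mentions channels like `:prod`. The sketch below shows one way such references could be split into a dataset name and a selector. The `DatasetRef` type, the `parse_ref` function, and the allowed character set are assumptions made for illustration; DataShift's own parsing rules may differ.

```python
import re
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    name: str
    kind: str  # "version", "tag", "channel", or "name" for a bare dataset name
    selector: Optional[str]


# Assumed grammar: <name>[@version | #tag | :channel]
_REF = re.compile(
    r"^(?P<name>[A-Za-z0-9_.-]+)"
    r"(?:(?P<sep>[@#:])(?P<selector>[A-Za-z0-9_.-]+))?$"
)
_KINDS = {"@": "version", "#": "tag", ":": "channel"}


def parse_ref(ref):
    """Split a reference like 'customers@v1' into its name and selector."""
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"not a valid dataset reference: {ref!r}")
    sep = match.group("sep")
    kind = _KINDS[sep] if sep else "name"
    return DatasetRef(match.group("name"), kind, match.group("selector"))


for ref in ("customers@v1", "metrics#latest", "customers:prod", "customers"):
    print(parse_ref(ref))
```

Keeping the three separators distinct lets a resolver treat `@v1` as immutable while looking up `#latest` or `:prod` at call time, since tags and channels can move between versions.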
Development
- Clone the repository.
- Install dependencies:
pip install -e ".[dev]"
- Run tests:
pytest
License
MIT License
File details
Details for the file hyper_flux_data_shift-0.1.0.tar.gz.
File metadata
- Download URL: hyper_flux_data_shift-0.1.0.tar.gz
- Upload date:
- Size: 16.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.0 CPython/3.13.3 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5be97c68f2aad7a67e569e5ac81abe8aa11dec911934de8cd00085b4f84aae2d |
| MD5 | 2dcdabf37a88342c55617b1fece79e69 |
| BLAKE2b-256 | cb19e511703c9332b4af0f4d179323934471269c53018f0cda3a1be71120e7aa |
File details
Details for the file hyper_flux_data_shift-0.1.0-py3-none-any.whl.
File metadata
- Download URL: hyper_flux_data_shift-0.1.0-py3-none-any.whl
- Upload date:
- Size: 20.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.0 CPython/3.13.3 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8513ab8ac830e00d9eed60d765cfe6acddea9137c301a84731d4ad257cc3c0f |
| MD5 | 42db4fb828d357506a584c961dde4914 |
| BLAKE2b-256 | c0c11096aaebfdfa7b1dd3c5946efdb91135d5b1447220f570a43ffac04d3e4b |