Dataset versioning and migration framework for ML data

DataShift

DataShift is a dataset versioning and migration framework designed for machine learning workflows. Think of it as "Git for data": it lets you track changes, compare versions, and manage dataset lifecycles.

Key Features

Dataset Versioning: Snapshot datasets (CSV, Parquet) and track their evolution over time.
Diffing: Compare two versions of a dataset to see added/removed rows and schema changes.
Drift Detection: Guardrails to check for data drift between versions (e.g., row count changes, null distribution).
Tags & Channels: Mark specific versions with tags (e.g., #latest) or moving channels (e.g., :prod).
Python API & CLI: Flexible usage through a command-line interface or directly within your Python code.
Experiment Tracking: Link datasets to experiments for reproducibility.
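
The features above, together with the CLI examples further down, suggest three ways to address a dataset version: an explicit version (`customers@v1`), a tag (`#latest`), or a channel (`:prod`). The following is a minimal sketch of how such reference strings could be split into parts. `DatasetRef` and `parse_ref` are hypothetical names used for illustration, not part of DataShift's API, and the exact grammar is an assumption.

```python
from dataclasses import dataclass

# Hypothetical helper for illustration; not part of DataShift's API.
# Assumed separators: "@" for a version, "#" for a tag, ":" for a channel.
_SEPARATORS = {"@": "version", "#": "tag", ":": "channel"}


@dataclass(frozen=True)
class DatasetRef:
    name: str
    kind: str       # "version", "tag", "channel", or "bare"
    selector: str   # e.g. "v1", "latest", "prod"; empty for a bare name


def parse_ref(ref: str) -> DatasetRef:
    """Split a reference such as 'customers@v1' into its parts."""
    # Locate the earliest separator character in the string, if any.
    positions = [(ref.find(sep), sep) for sep in _SEPARATORS if sep in ref]
    if not positions:
        return DatasetRef(name=ref, kind="bare", selector="")
    index, sep = min(positions)
    name, selector = ref[:index], ref[index + 1:]
    if not name or not selector:
        raise ValueError(f"malformed dataset reference: {ref!r}")
    return DatasetRef(name=name, kind=_SEPARATORS[sep], selector=selector)


if __name__ == "__main__":
    for text in ("customers@v1", "metrics#latest", "customers:prod"):
        print(parse_ref(text))
```

Keeping the three kinds distinct matters because a version is presumably immutable, while a tag or channel can be moved to point at a newer version.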

Installation

From Source

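```
pip install .
```

With Optional Dependencies

Quote the extras so that shells such as zsh do not treat the square brackets as a glob pattern.

For PyTorch integration:

```
pip install ".[torch]"
```

For Parquet support:

```
pip install ".[parquet]"
```

For everything (including dev tools):

```
pip install ".[all]"
```

For Development

```
pip install -e ".[dev]"
```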

Quick Start

CLI Usage

Initialize DataShift in your project directory:
```
datashift init
```

Snapshot a Dataset:

```
# Create a version of your customers data
datashift snapshot ./data/customers.csv --name customers
```
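
To make the idea of a snapshot concrete, here is a minimal sketch of a versioned copy with a content hash. The `store/<name>/vN/` layout, the `meta.json` file, and the `snapshot_file` function are all assumptions made for illustration; this README does not describe DataShift's actual storage format.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def snapshot_file(source: Path, name: str, store: Path) -> str:
    """Copy `source` into `store/<name>/vN/` and return the new version label.

    Hypothetical layout for illustration only; not DataShift's real format.
    """
    dataset_dir = store / name
    dataset_dir.mkdir(parents=True, exist_ok=True)

    # Next version number = number of existing version directories + 1.
    existing = [p for p in dataset_dir.iterdir() if p.is_dir()]
    version = f"v{len(existing) + 1}"

    version_dir = dataset_dir / version
    version_dir.mkdir()
    shutil.copy2(source, version_dir / source.name)

    # Record a content hash so a later restore can be verified.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    meta = {"version": version, "file": source.name, "sha256": digest}
    (version_dir / "meta.json").write_text(json.dumps(meta, indent=2))
    return version


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "customers.csv"
        data.write_text("id,name\n1,Ada\n")
        print(snapshot_file(data, "customers", root / "store"))
```

Copying the whole file per version is the simplest possible design; a real tool would likely deduplicate unchanged content.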

List Datasets:
```
datashift list
```

Show Dataset Details:
```
datashift show customers
```

Compare Versions:

```
# Compare version 1 and version 2
datashift diff customers@v1 customers@v2
```
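
As a rough illustration of what a diff between two versions reports (added and removed rows, plus schema changes), the sketch below compares two CSV documents using only the standard library. Matching rows as exact tuples over the shared columns is an assumption chosen for simplicity, and `diff_csv` is a hypothetical name; DataShift's own algorithm and output format may differ.

```python
import csv
import io
from collections import Counter


def _read_csv(text):
    """Return (column names, rows as dicts) for a CSV document."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def diff_csv(old_text, new_text):
    """Summarize column and row differences between two CSV documents."""
    old_cols, old_rows = _read_csv(old_text)
    new_cols, new_rows = _read_csv(new_text)

    # Compare rows only on columns present in both versions, so that a
    # schema change alone does not mark every row as changed.
    shared = [c for c in old_cols if c in new_cols]
    old_counts = Counter(tuple(row[c] for c in shared) for row in old_rows)
    new_counts = Counter(tuple(row[c] for c in shared) for row in new_rows)

    return {
        "added_columns": [c for c in new_cols if c not in old_cols],
        "removed_columns": [c for c in old_cols if c not in new_cols],
        # Counter subtraction keeps only positive counts, which also
        # handles duplicate rows correctly.
        "added_rows": sum((new_counts - old_counts).values()),
        "removed_rows": sum((old_counts - new_counts).values()),
    }


if __name__ == "__main__":
    v1 = "id,name\n1,Ada\n2,Grace\n"
    v2 = "id,name,country\n1,Ada,UK\n3,Linus,FI\n"
    print(diff_csv(v1, v2))
```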

Checkout a Version:

```
# Restore a specific version to a file
datashift checkout customers@v1 ./restored_customers.csv
```

Drift Check (Guardrails):

```
# Check if the new version deviates too much from the baseline
datashift check customers@v2 --baseline customers@v1 --max-row-change 0.1
```
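
The README does not define how the threshold is computed. The sketch below assumes that `--max-row-change 0.1` means "the row count may differ from the baseline by at most 10%", and adds a simple null-fraction helper for the null-distribution guardrail mentioned under Key Features. All function names here are hypothetical and only illustrate the idea.

```python
def row_change_ratio(baseline_rows: int, candidate_rows: int) -> float:
    """Relative change in row count, measured against the baseline."""
    if baseline_rows == 0:
        # No baseline rows: any new rows count as unbounded change.
        return 0.0 if candidate_rows == 0 else float("inf")
    return abs(candidate_rows - baseline_rows) / baseline_rows


def check_row_change(baseline_rows: int, candidate_rows: int,
                     max_row_change: float) -> bool:
    """Return True when the change stays within the allowed ratio.

    Assumption: the threshold is a relative ratio, so 0.1 means 10%.
    """
    return row_change_ratio(baseline_rows, candidate_rows) <= max_row_change


def null_fraction(values) -> float:
    """Share of values that are missing (None or empty string)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v is None or v == "" for v in values) / len(values)


if __name__ == "__main__":
    print(check_row_change(1000, 1080, 0.1))   # 8% change
    print(check_row_change(1000, 1200, 0.1))   # 20% change
    print(null_fraction(["a", "", None, "b"]))
```

A check like this fits well in a CI step that blocks promotion of a new version when it fails.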

Python API Usage

```
import pandas as pd
from datashift import snapshot_dataset, load, diff_datasets, format_diff_summary

# 1. Snapshot a dataset
result = snapshot_dataset(dataset_name="metrics", source_path="metrics.csv")
print(f"Created version: {result.version}")

# 2. Load a specific version into a DataFrame
df = load("metrics#latest")
print(df.head())

# 3. Diff two versions (requires at least two snapshots of "metrics")
diff = diff_datasets("metrics@v1", "metrics@v2")
print(format_diff_summary(diff))
```

Development

Clone the repository.
Install dependencies:
```
pip install -e ".[dev]"
```
Run tests:
```
pytest
```

License

MIT License

Release history

0.1.0 (Jan 20, 2026)

PyPI package: hyper-flux-data-shift 0.1.0