
Dataset versioning and migration framework for ML data


DataShift

DataShift is a dataset versioning and migration framework for machine learning workflows. Think of it as "Git for data": it lets you track changes, compare versions, and manage dataset lifecycles.

Key Features

  • Dataset Versioning: Snapshot datasets (CSV, Parquet) and track their evolution over time.
  • Diffing: Compare two versions of a dataset to see added/removed rows and schema changes.
  • Drift Detection: Guardrails to check for data drift between versions (e.g., row count changes, null distribution).
  • Tags & Channels: Reference specific versions with tags (e.g., #latest) or moving channels (e.g., :prod).
  • Python API & CLI: Flexible usage through a command-line interface or directly within your Python code.
  • Experiment Tracking: Link datasets to experiments for reproducibility.
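
To make the Diffing bullet concrete, here is a minimal, standard-library sketch of what comparing two CSV snapshots can look like. It is an illustration only and not DataShift's implementation: the helper names `read_rows` and `simple_diff` are hypothetical, and rows are compared as whole tuples, so duplicate rows collapse into one.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into (header, set of row tuples)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # A set is used for simplicity, so duplicate rows are counted once.
    return header, {tuple(row) for row in reader}


def simple_diff(old_text, new_text):
    """Return schema and row differences between two CSV snapshots."""
    old_header, old_rows = read_rows(old_text)
    new_header, new_rows = read_rows(new_text)
    result = {
        "added_columns": [c for c in new_header if c not in old_header],
        "removed_columns": [c for c in old_header if c not in new_header],
    }
    # Row tuples are only comparable when both versions share one schema.
    if old_header == new_header:
        result["added_rows"] = sorted(new_rows - old_rows)
        result["removed_rows"] = sorted(old_rows - new_rows)
    return result


v1 = "id,name\n1,Ada\n2,Grace\n"
v2 = "id,name\n1,Ada\n3,Linus\n"
print(simple_diff(v1, v2))
```

A real tool would more likely key rows on a primary column so that edits to a row show up as changes rather than as one removal plus one addition.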

Installation

From Source

pip install .

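With Optional Dependencies

Quote the extras specifier so that shells such as zsh do not interpret the square brackets as a glob pattern.

For PyTorch integration:

pip install ".[torch]"

For Parquet support:

pip install ".[parquet]"

For everything (including dev tools):

pip install ".[all]"

For Development

pip install -e ".[dev]"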

Quick Start

CLI Usage

  1. Initialize DataShift in your project directory:

    datashift init
    
  2. Snapshot a Dataset:

    # Create a version of your customers data
    datashift snapshot ./data/customers.csv --name customers
    
  3. List Datasets:

    datashift list
    
  4. Show Dataset Details:

    datashift show customers
    
  5. Compare Versions:

    # Compare version 1 and version 2
    datashift diff customers@v1 customers@v2
    
  6. Checkout a Version:

    # Restore a specific version to a file
    datashift checkout customers@v1 ./restored_customers.csv
    
  7. Drift Check (Guardrails):

    # Check if the new version deviates too much from the baseline
    datashift check customers@v2 --baseline customers@v1 --max-row-change 0.1
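
As a rough illustration of the drift check in step 7, the sketch below treats `--max-row-change 0.1` as a 10% tolerance on the relative change in row count against the baseline. That reading is an assumption, not documented behavior, and the function names `row_change_ratio` and `passes_row_guardrail` are hypothetical.

```python
def row_change_ratio(baseline_rows, candidate_rows):
    """Relative change in row count, measured against the baseline."""
    if baseline_rows == 0:
        # Any growth from an empty baseline is treated as unbounded change.
        return 0.0 if candidate_rows == 0 else float("inf")
    return abs(candidate_rows - baseline_rows) / baseline_rows


def passes_row_guardrail(baseline_rows, candidate_rows, max_row_change):
    """True when the relative row-count change is within the threshold."""
    return row_change_ratio(baseline_rows, candidate_rows) <= max_row_change


# 1000 -> 1080 rows is an 8% change, which fits a 10% tolerance.
print(passes_row_guardrail(1000, 1080, 0.1))
# 1000 -> 1250 rows is a 25% change, which does not.
print(passes_row_guardrail(1000, 1250, 0.1))
```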
    

Python API Usage

from datashift import snapshot_dataset, load, diff_datasets, format_diff_summary

# 1. Snapshot a dataset
result = snapshot_dataset(dataset_name="metrics", source_path="metrics.csv")
print(f"Created version: {result.version}")

# 2. Load a specific version into a DataFrame
df = load("metrics#latest")
print(df.head())

# 3. Diff two versions
diff = diff_datasets("metrics@v1", "metrics@v2")
print(format_diff_summary(diff))
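
The examples above address datasets with reference strings such as `metrics@v1` and `metrics#latest`, and the feature list mentions channels like `:prod`. The sketch below shows one way such strings could be split into a dataset name and a selector. It is not DataShift's parser: `DatasetRef` and `parse_ref` are hypothetical names, and the `name:channel` form (e.g., `customers:prod`) is inferred from the feature list rather than shown in the usage examples.

```python
import re
from typing import NamedTuple


class DatasetRef(NamedTuple):
    name: str
    kind: str   # "version", "tag", or "channel"
    value: str


# Assumed grammar: <name><separator><value>, where the separator is @, #, or :
_REF_PATTERN = re.compile(r"^(?P<name>[^@#:]+)(?P<sep>[@#:])(?P<value>.+)$")
_KINDS = {"@": "version", "#": "tag", ":": "channel"}


def parse_ref(ref: str) -> DatasetRef:
    """Split a reference like 'metrics@v1' into its parts."""
    match = _REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"not a dataset reference: {ref!r}")
    return DatasetRef(match["name"], _KINDS[match["sep"]], match["value"])


for ref in ("metrics@v1", "metrics#latest", "customers:prod"):
    print(parse_ref(ref))
```

Keeping the three selector kinds distinct matters because a version is fixed, while a channel is described as moving and may resolve to a different version over time.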

Development

  1. Clone the repository.
  2. Install dependencies:
    pip install -e ".[dev]"
    
  3. Run tests:
    pytest
    

License

MIT License

