Data Refresh and Intelligent Fast Training for Azure ML

Project description

Drift - Data Refresh and Intelligent Fast Training

Drift is an automation framework for Azure Machine Learning that streamlines the process of registering training datasets and retraining models in response to data updates. Built on top of PyDataIO, Drift provides a robust solution for managing ML pipelines in Databricks and Azure ML environments.

📋 Table of Contents

Overview
Design Architecture
Features
Installation
Getting Started
Usage
- Training Dataset Registration
- Model Retraining
Running Jobs in Databricks
Configuration
Contributing
License

🎯 Overview

Drift automates two critical ML operations workflows:

Dataset Registration: Automatically registers new versions of training datasets from Azure Data Lake Storage Gen2 to Azure ML as both MLTable and URI folder assets.
Model Retraining: Intelligently triggers retraining of ML models when new data becomes available, ensuring your models stay up-to-date with the latest data.

The framework is built on PyDataIO, leveraging its pipeline orchestration capabilities to create robust, configurable ML workflows.

🏗️ Design Architecture

Drift Architecture

The architecture consists of two main components:

Dataset Registrator: Monitors data sources and registers new dataset versions
Model Retrainer: Detects new dataset versions and triggers model retraining jobs

✨ Features

Automated Dataset Registration: Register Delta Lake tables as Azure ML data assets
Intelligent Model Retraining: Automatically retrain models based on data asset updates
Version Management: Track dataset and model versions with timestamp-based versioning
Flexible Configuration: YAML-based configuration for easy customization
Status Monitoring: Built-in job status tracking and timeout management
Group-based Training: Support for training multiple model groups with selective retraining
Azure Integration: Seamless integration with Azure ML, Databricks, and Azure Key Vault

📦 Installation

Prerequisites

Python 3.11 or higher
Access to Azure ML workspace
Databricks workspace (for job execution)
Azure Data Lake Storage Gen2

Install from source

git clone https://github.com/yourusername/Drift.git
cd Drift
pip install -e .

Install dependencies

pip install -r requirements.txt

Or if using uv:

uv sync

🚀 Getting Started

1. Set up Azure credentials

Drift uses Azure Service Principal for authentication. Store your credentials in Azure Key Vault:

ApplicationID: Your Azure AD application ID (base64 encoded)
ApplicationPassword: Your Azure AD application secret (base64 encoded)

2. Configure your job

Create a configuration file based on the examples in the docs/ directory.

3. Run a job

spark_job --config path/to/config.conf \
          --tenant <tenant-id> \
          --vault_name <key-vault-name> \
          --data_asset_version <version> (optional)

📖 Usage

Training Dataset Registration

The Dataset Registrator creates new versions of training data assets in Azure ML from Delta Lake tables stored in Azure Data Lake Storage Gen2.

Configuration Example (example-training-dataset-registrator.conf):

parameters:
    azml:
        subscriptionId: <azure-ml-subscription-id>
        resourceGroup: <azure-ml-resource-group>
        mlWorkspaceName: <azure-ml-mlWorkspace-name>
    storageAccountName: <azure-storage-account-name>
    containerName: <azure-container-name>
    containerDataPath: <path-into-container>

What it does:

Connects to Azure Data Lake Storage Gen2
Creates/updates a datastore in Azure ML
Registers the Delta Lake table as an MLTable asset
Registers the data as a URI folder asset
Generates a timestamp-based version number
Publishes the new version to Databricks job task values

Model Retraining

The Model Retrainer automatically triggers retraining jobs for models that use updated data assets.

Configuration Example (example-model-retrainer.conf):

parameters:
    azml:
        subscriptionId: <azure-ml-subscription-id>
        resourceGroup: <azure-ml-resource-group>
        mlWorkspaceName: <azure-ml-mlWorkspace-name>
    dataAssets:
      - name: mltable_seat_spinning_features
        value: <ml-table-name>
      - name: raw_data
        value: <raw-dataset-name>
    refreshTimeout: "2700"
    refreshDelay: "10"

What it does:

Retrieves all Azure ML pipeline jobs matching the naming pattern
Identifies the most recent job for each model group
Creates new retraining jobs with updated data asset versions
Monitors job completion with configurable timeout and refresh intervals
Reports success or failure status

Job Naming Convention: Jobs should follow the pattern: {model_name_prefix}_{timestamp}_{random_string}

🔧 Running Jobs in Databricks

Drift is designed to run as Databricks jobs. The recommended way to deploy is using Databricks Asset Bundles (DABs).

Using Databricks Asset Bundles (Recommended)

Create a databricks.yml file to define your workflow. Here's a complete example:

resources:
  jobs:
    retraining_models:
      name: retraining-models
      email_notifications:
        no_alert_for_skipped_runs: true
      schedule:
        quartz_cron_expression: 0 0 8 ? * MON *
        timezone_id: UTC
        pause_status: PAUSED
      tasks:
        - task_key: training-dataset-registrator
          python_wheel_task:
            package_name: drift
            entry_point: spark_job
            parameters:
              - --tenant
              - <tenant-id>
              - --vault_name
              - <vault-name>
              - --config
              - <path-to-configuration-file>
          job_cluster_key: retraining-models-cluster
          libraries:
            - whl: <path-to-drift-python-wheel>
        - task_key: model-retrainer
          depends_on:
            - task_key: training-dataset-registrator
          python_wheel_task:
            package_name: drift
            entry_point: spark_job
            parameters:
              - --tenant
              - <tenant-id>
              - --vault_name
              - <vault-name>
              - --data_asset_version
              - "{{tasks.`training-dataset-registrator`.values.data_asset_version}}"
              - --config
              - <path-to-configuration-file>
          job_cluster_key: retraining-models-cluster
          libraries:
            - whl: <path-to-drift-python-wheel>
      job_clusters:
        - job_cluster_key: retraining-models-cluster
          new_cluster:
            cluster_name: ""
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_D4s_v5
            enable_elastic_disk: true
            data_security_mode: DATA_SECURITY_MODE_DEDICATED
            kind: CLASSIC_PREVIEW
            is_single_node: true
      queue:
        enabled: false

See the complete example in docs/example-databricks-workflow.yaml.

Deployment Steps:

Build the Drift wheel package:
```
uv build
```

Upload the wheel to Databricks:

databricks fs cp dist/drift-*.whl dbfs:/path/to/drift-wheel/

Deploy the workflow:
```
databricks bundle deploy
```

Run the workflow:

databricks bundle run retraining_models

Key Configuration Details

Task 1: Dataset Registration

Registers new training dataset versions
Outputs the new version via data_asset_version task value

Task 2: Model Retraining

Depends on Task 1 completion
Retrieves the dataset version using: {{tasks.training-dataset-registrator.values.data_asset_version}}
Triggers retraining jobs with the new data version

Scheduling:

The example runs weekly on Mondays at 8 AM UTC
Modify the quartz_cron_expression to adjust the schedule
Set pause_status: UNPAUSED to activate the schedule

Prerequisites

Databricks CLI installed:
```
pip install databricks-cli
```
Databricks secrets configured:
- Set up a secret scope linked to Azure Key Vault
- Ensure it contains ApplicationID and ApplicationPassword (base64 encoded)
Configuration files uploaded:
- Upload your configuration files to DBFS
- Update the --config parameter paths in the workflow
Drift wheel package:
- Build and upload the Drift wheel to DBFS
- Update the whl path in the libraries section

Manual Setup (Alternative)

If you prefer to create jobs manually through the Databricks UI:

Create a new job with two tasks
Task 1 Configuration:
- Task type: Python wheel
- Package name: drift
- Entry point: spark_job
- Parameters: --config <config-path> --tenant <tenant-id> --vault_name <vault-name>
Task 2 Configuration:
- Depends on: Task 1
- Task type: Python wheel
- Package name: drift
- Entry point: spark_job
- Parameters: --config <config-path> --tenant <tenant-id> --vault_name <vault-name> --data_asset_version {{tasks.training-dataset-registrator.values.data_asset_version}}
Attach libraries: Add the Drift wheel to both tasks
Configure cluster: Use Spark 15.4.x with appropriate node type

⚙️ Configuration

Required Parameters

For Dataset Registration:

azml.subscriptionId: Azure subscription ID
azml.resourceGroup: Azure resource group name
azml.mlWorkspaceName: Azure ML workspace name
storageAccountName: Azure Storage account name
containerName: Storage container name
containerDataPath: Path within the container

For Model Retraining:

azml.subscriptionId: Azure subscription ID
azml.resourceGroup: Azure resource group name
azml.mlWorkspaceName: Azure ML workspace name
dataAssets: List of data assets to monitor (name and value)
refreshTimeout: Maximum time to wait for job completion (seconds)
refreshDelay: Interval between status checks (seconds)

Optional Parameters

data_asset_version: Specific version to use for retraining (if not provided, uses latest)
model_name_prefix: Comma-separated list of model prefix to retrain (if not provided, retrains all matching training jobs)

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

How to report issues
Development setup and workflow
Code quality standards and testing
Pull request process

Please also read our Code of Conduct before contributing.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

Built on top of PyDataIO
Developed by Simone DE SANTIS and Guillaume LECLERC at Amadeus

📞 Support

For issues and questions:

Open an issue on GitHub
Check existing documentation in the docs/ directory

Note: This project is designed for Azure ML and Databricks environments. Ensure you have the necessary permissions and resources configured before running jobs.

Project details

Release history Release notifications | RSS feed

1.0.9

Dec 1, 2025

1.0.8

Dec 1, 2025

This version

1.0.6

Dec 1, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

amadeus_drift-1.0.6.tar.gz (16.3 kB view details)

Uploaded Dec 1, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

amadeus_drift-1.0.6-py3-none-any.whl (18.4 kB view details)

Uploaded Dec 1, 2025 Python 3

File details

Details for the file amadeus_drift-1.0.6.tar.gz.

File metadata

Download URL: amadeus_drift-1.0.6.tar.gz
Upload date: Dec 1, 2025
Size: 16.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for amadeus_drift-1.0.6.tar.gz
Algorithm	Hash digest
SHA256	`6b8e195b944e33dd7de3304b27c537584d116450298c37e0325dfcce90a33a9a`
MD5	`1e551c071e5ae8e788fe8b202d5264db`
BLAKE2b-256	`34cd8e43300a9aa1a35ff32aaa831fee2f2e30246d12ead5856501e6452ad82e`

See more details on using hashes here.

Provenance

The following attestation bundles were made for amadeus_drift-1.0.6.tar.gz:

Publisher: release-pypi.yml on AmadeusITGroup/Drift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amadeus_drift-1.0.6.tar.gz
- Subject digest: 6b8e195b944e33dd7de3304b27c537584d116450298c37e0325dfcce90a33a9a
- Sigstore transparency entry: 732293574
- Sigstore integration time: Dec 1, 2025
Source repository:
- Permalink: AmadeusITGroup/Drift@9422bc0a65339bd578509b60ae175237f6f23a33
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/AmadeusITGroup
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-pypi.yml@9422bc0a65339bd578509b60ae175237f6f23a33
- Trigger Event: push

File details

Details for the file amadeus_drift-1.0.6-py3-none-any.whl.

File metadata

Download URL: amadeus_drift-1.0.6-py3-none-any.whl
Upload date: Dec 1, 2025
Size: 18.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for amadeus_drift-1.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2a737c17e9155367088b39890311202d30d1719918669a51937db87f0c65151b`
MD5	`1465e98040f306dfb40a032a0a035010`
BLAKE2b-256	`fa7a155d4f2fc270168e700ef567c855c92e03a3060f96b3c8ae08126940ec06`

See more details on using hashes here.

Provenance

The following attestation bundles were made for amadeus_drift-1.0.6-py3-none-any.whl:

Publisher: release-pypi.yml on AmadeusITGroup/Drift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amadeus_drift-1.0.6-py3-none-any.whl
- Subject digest: 2a737c17e9155367088b39890311202d30d1719918669a51937db87f0c65151b
- Sigstore transparency entry: 732293586
- Sigstore integration time: Dec 1, 2025
Source repository:
- Permalink: AmadeusITGroup/Drift@9422bc0a65339bd578509b60ae175237f6f23a33
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/AmadeusITGroup
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-pypi.yml@9422bc0a65339bd578509b60ae175237f6f23a33
- Trigger Event: push

amadeus-drift 1.0.6

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Drift - Data Refresh and Intelligent Fast Training

📋 Table of Contents

🎯 Overview

🏗️ Design Architecture

✨ Features

📦 Installation

Prerequisites

Install from source

Install dependencies

🚀 Getting Started

1. Set up Azure credentials

2. Configure your job

3. Run a job

📖 Usage

Training Dataset Registration

Model Retraining

🔧 Running Jobs in Databricks

Using Databricks Asset Bundles (Recommended)

Key Configuration Details

Prerequisites

Manual Setup (Alternative)

⚙️ Configuration

Required Parameters

Optional Parameters

🤝 Contributing

📄 License

🙏 Acknowledgments

📞 Support

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance