Skip to main content

Data Refresh and Intelligent Fast Training for Azure ML

Project description

Drift - Data Refresh and Intelligent Fast Training

Python Version License

Drift is an automation framework for Azure Machine Learning that streamlines the process of registering training datasets and retraining models in response to data updates. Built on top of PyDataIO, Drift provides a robust solution for managing ML pipelines in Databricks and Azure ML environments.

📋 Table of Contents

🎯 Overview

Drift automates two critical ML operations workflows:

  1. Dataset Registration: Automatically registers new versions of training datasets from Azure Data Lake Storage Gen2 to Azure ML as both MLTable and URI folder assets.

  2. Model Retraining: Intelligently triggers retraining of ML models when new data becomes available, ensuring your models stay up-to-date with the latest data.

The framework is built on PyDataIO, leveraging its pipeline orchestration capabilities to create robust, configurable ML workflows.

🏗️ Design Architecture

Drift Architecture

The architecture consists of two main components:

  • Dataset Registrator: Monitors data sources and registers new dataset versions
  • Model Retrainer: Detects new dataset versions and triggers model retraining jobs

✨ Features

  • Automated Dataset Registration: Register Delta Lake tables as Azure ML data assets
  • Intelligent Model Retraining: Automatically retrain models based on data asset updates
  • Version Management: Track dataset and model versions with timestamp-based versioning
  • Flexible Configuration: YAML-based configuration for easy customization
  • Status Monitoring: Built-in job status tracking and timeout management
  • Group-based Training: Support for training multiple model groups with selective retraining
  • Azure Integration: Seamless integration with Azure ML, Databricks, and Azure Key Vault

📦 Installation

Prerequisites

  • Python 3.11 or higher
  • Access to Azure ML workspace
  • Databricks workspace (for job execution)
  • Azure Data Lake Storage Gen2

Install from source

git clone https://github.com/yourusername/Drift.git
cd Drift
pip install -e .

Install dependencies

pip install -r requirements.txt

Or if using uv:

uv sync

🚀 Getting Started

1. Set up Azure credentials

Drift uses Azure Service Principal for authentication. Store your credentials in Azure Key Vault:

  • ApplicationID: Your Azure AD application ID (base64 encoded)
  • ApplicationPassword: Your Azure AD application secret (base64 encoded)

2. Configure your job

Create a configuration file based on the examples in the docs/ directory.

3. Run a job

spark_job --config path/to/config.conf \
          --tenant <tenant-id> \
          --vault_name <key-vault-name> \
          --data_asset_version <version> (optional)

📖 Usage

Training Dataset Registration

The Dataset Registrator creates new versions of training data assets in Azure ML from Delta Lake tables stored in Azure Data Lake Storage Gen2.

Configuration Example (example-training-dataset-registrator.conf):

parameters:
    azml:
        subscriptionId: <azure-ml-subscription-id>
        resourceGroup: <azure-ml-resource-group>
        mlWorkspaceName: <azure-ml-mlWorkspace-name>
    storageAccountName: <azure-storage-account-name>
    containerName: <azure-container-name>
    containerDataPath: <path-into-container>

What it does:

  1. Connects to Azure Data Lake Storage Gen2
  2. Creates/updates a datastore in Azure ML
  3. Registers the Delta Lake table as an MLTable asset
  4. Registers the data as a URI folder asset
  5. Generates a timestamp-based version number
  6. Publishes the new version to Databricks job task values

Model Retraining

The Model Retrainer automatically triggers retraining jobs for models that use updated data assets.

Configuration Example (example-model-retrainer.conf):

parameters:
    azml:
        subscriptionId: <azure-ml-subscription-id>
        resourceGroup: <azure-ml-resource-group>
        mlWorkspaceName: <azure-ml-mlWorkspace-name>
    dataAssets:
      - name: mltable_seat_spinning_features
        value: <ml-table-name>
      - name: raw_data
        value: <raw-dataset-name>
    refreshTimeout: "2700"
    refreshDelay: "10"

What it does:

  1. Retrieves all Azure ML pipeline jobs matching the naming pattern
  2. Identifies the most recent job for each model group
  3. Creates new retraining jobs with updated data asset versions
  4. Monitors job completion with configurable timeout and refresh intervals
  5. Reports success or failure status

Job Naming Convention: Jobs should follow the pattern: {model_name_prefix}_{timestamp}_{random_string}

🔧 Running Jobs in Databricks

Drift is designed to run as Databricks jobs. The recommended way to deploy is using Databricks Asset Bundles (DABs).

Using Databricks Asset Bundles (Recommended)

Create a databricks.yml file to define your workflow. Here's a complete example:

resources:
  jobs:
    retraining_models:
      name: retraining-models
      email_notifications:
        no_alert_for_skipped_runs: true
      schedule:
        quartz_cron_expression: 0 0 8 ? * MON *
        timezone_id: UTC
        pause_status: PAUSED
      tasks:
        - task_key: training-dataset-registrator
          python_wheel_task:
            package_name: drift
            entry_point: spark_job
            parameters:
              - --tenant
              - <tenant-id>
              - --vault_name
              - <vault-name>
              - --config
              - <path-to-configuration-file>
          job_cluster_key: retraining-models-cluster
          libraries:
            - whl: <path-to-drift-python-wheel>
        - task_key: model-retrainer
          depends_on:
            - task_key: training-dataset-registrator
          python_wheel_task:
            package_name: drift
            entry_point: spark_job
            parameters:
              - --tenant
              - <tenant-id>
              - --vault_name
              - <vault-name>
              - --data_asset_version
              - "{{tasks.`training-dataset-registrator`.values.data_asset_version}}"
              - --config
              - <path-to-configuration-file>
          job_cluster_key: retraining-models-cluster
          libraries:
            - whl: <path-to-drift-python-wheel>
      job_clusters:
        - job_cluster_key: retraining-models-cluster
          new_cluster:
            cluster_name: ""
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_D4s_v5
            enable_elastic_disk: true
            data_security_mode: DATA_SECURITY_MODE_DEDICATED
            kind: CLASSIC_PREVIEW
            is_single_node: true
      queue:
        enabled: false

See the complete example in docs/example-databricks-workflow.yaml.

Deployment Steps:

  1. Build the Drift wheel package:

    uv build
    
  2. Upload the wheel to Databricks:

    databricks fs cp dist/drift-*.whl dbfs:/path/to/drift-wheel/
    
  3. Deploy the workflow:

    databricks bundle deploy
    
  4. Run the workflow:

    databricks bundle run retraining_models
    

Key Configuration Details

Task 1: Dataset Registration

  • Registers new training dataset versions
  • Outputs the new version via data_asset_version task value

Task 2: Model Retraining

  • Depends on Task 1 completion
  • Retrieves the dataset version using: {{tasks.training-dataset-registrator.values.data_asset_version}}
  • Triggers retraining jobs with the new data version

Scheduling:

  • The example runs weekly on Mondays at 8 AM UTC
  • Modify the quartz_cron_expression to adjust the schedule
  • Set pause_status: UNPAUSED to activate the schedule

Prerequisites

  1. Databricks CLI installed:

    pip install databricks-cli
    
  2. Databricks secrets configured:

    • Set up a secret scope linked to Azure Key Vault
    • Ensure it contains ApplicationID and ApplicationPassword (base64 encoded)
  3. Configuration files uploaded:

    • Upload your configuration files to DBFS
    • Update the --config parameter paths in the workflow
  4. Drift wheel package:

    • Build and upload the Drift wheel to DBFS
    • Update the whl path in the libraries section

Manual Setup (Alternative)

If you prefer to create jobs manually through the Databricks UI:

  1. Create a new job with two tasks

  2. Task 1 Configuration:

    • Task type: Python wheel
    • Package name: drift
    • Entry point: spark_job
    • Parameters: --config <config-path> --tenant <tenant-id> --vault_name <vault-name>
  3. Task 2 Configuration:

    • Depends on: Task 1
    • Task type: Python wheel
    • Package name: drift
    • Entry point: spark_job
    • Parameters: --config <config-path> --tenant <tenant-id> --vault_name <vault-name> --data_asset_version {{tasks.training-dataset-registrator.values.data_asset_version}}
  4. Attach libraries: Add the Drift wheel to both tasks

  5. Configure cluster: Use Spark 15.4.x with appropriate node type

⚙️ Configuration

Required Parameters

For Dataset Registration:

  • azml.subscriptionId: Azure subscription ID
  • azml.resourceGroup: Azure resource group name
  • azml.mlWorkspaceName: Azure ML workspace name
  • storageAccountName: Azure Storage account name
  • containerName: Storage container name
  • containerDataPath: Path within the container

For Model Retraining:

  • azml.subscriptionId: Azure subscription ID
  • azml.resourceGroup: Azure resource group name
  • azml.mlWorkspaceName: Azure ML workspace name
  • dataAssets: List of data assets to monitor (name and value)
  • refreshTimeout: Maximum time to wait for job completion (seconds)
  • refreshDelay: Interval between status checks (seconds)

Optional Parameters

  • data_asset_version: Specific version to use for retraining (if not provided, uses latest)
  • model_name_prefix: Comma-separated list of model prefix to retrain (if not provided, retrains all matching training jobs)

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

  • How to report issues
  • Development setup and workflow
  • Code quality standards and testing
  • Pull request process

Please also read our Code of Conduct before contributing.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

  • Built on top of PyDataIO
  • Developed by Simone DE SANTIS and Guillaume LECLERC at Amadeus

📞 Support

For issues and questions:

  • Open an issue on GitHub
  • Check existing documentation in the docs/ directory

Note: This project is designed for Azure ML and Databricks environments. Ensure you have the necessary permissions and resources configured before running jobs.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

amadeus_drift-1.0.6.tar.gz (16.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

amadeus_drift-1.0.6-py3-none-any.whl (18.4 kB view details)

Uploaded Python 3

File details

Details for the file amadeus_drift-1.0.6.tar.gz.

File metadata

  • Download URL: amadeus_drift-1.0.6.tar.gz
  • Upload date:
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for amadeus_drift-1.0.6.tar.gz
Algorithm Hash digest
SHA256 6b8e195b944e33dd7de3304b27c537584d116450298c37e0325dfcce90a33a9a
MD5 1e551c071e5ae8e788fe8b202d5264db
BLAKE2b-256 34cd8e43300a9aa1a35ff32aaa831fee2f2e30246d12ead5856501e6452ad82e

See more details on using hashes here.

Provenance

The following attestation bundles were made for amadeus_drift-1.0.6.tar.gz:

Publisher: release-pypi.yml on AmadeusITGroup/Drift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file amadeus_drift-1.0.6-py3-none-any.whl.

File metadata

  • Download URL: amadeus_drift-1.0.6-py3-none-any.whl
  • Upload date:
  • Size: 18.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for amadeus_drift-1.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 2a737c17e9155367088b39890311202d30d1719918669a51937db87f0c65151b
MD5 1465e98040f306dfb40a032a0a035010
BLAKE2b-256 fa7a155d4f2fc270168e700ef567c855c92e03a3060f96b3c8ae08126940ec06

See more details on using hashes here.

Provenance

The following attestation bundles were made for amadeus_drift-1.0.6-py3-none-any.whl:

Publisher: release-pypi.yml on AmadeusITGroup/Drift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page