Data migration orchestrator for Snowflake

These details have not been verified by PyPI

Project links

Project description

Snowflake Data Migration Orchestrator

The Cloud Data Migration feature of SnowConvert provides a fault-tolerant, scalable solution for moving data from external sources into Snowflake. This tool is specifically designed for cases where a user is moving data from a system they plan to decommission. For replication purposes, other solutions are available that might better fit your use case.

Architecture

1 Orchestrator is connected to the Snowflake Account.
- It requires privileges to create/operate the SNOWCONVERT_AI database, in which metadata is stored.
1 or more Workers connect to the Source System and to the Snowflake Account.
- Workers read data from the Source System and upload it to a Snowflake Stage.
- Workers pick up tasks created by the Orchestrator and process them in parallel.
Files uploaded to the Snowflake Stage are copied into the Target Tables using a COPY INTO statement.
- The COPY INTO statement is submitted and monitored by the Orchestrator.

Where to Deploy Orchestrator and Worker(s)?

The Orchestrator and Worker(s) can be deployed in multiple ways:

Both on Snowpark Container Services (in the Snowflake Account).
Both on the Customer's Environment (custom hardware, virtual machines, containers, etc.).
Orchestrator on Snowpark Container Services and Worker(s) on the Customer's Environment (or the other way around).

Requirements for the environment:

The Orchestrator and Worker(s) are Python packages, so Python must be installed.
The Worker(s) will typically require an ODBC driver to connect to the Source System.
The Orchestrator needs to be able to connect to the Snowflake Account. The connection used must have privileges to create the SNOWCONVERT_AI database and create schemas/objects on that database.

Setup

Additional Configuration on Snowflake Account

When starting the Orchestrator, it will automatically try to set up resources in your Snowflake Account in the SNOWCONVERT_AI database (if it does not exist yet, it will be created). This is a one-time step and transparent to the user. Some considerations:

The Orchestrator should connect with a role that has privileges to create the SNOWCONVERT_AI database and its objects.
Whenever the Orchestrator starts, it should use a role that allows it to interact with the SNOWCONVERT_AI database and its resources. The easiest way to guarantee this is to always run it with the same role that was used for creating SNOWCONVERT_AI in the first place.

Usage

In general, for migrating data using this solution, you will need to:

Start the Orchestrator.
Start the Worker(s).
Create a Data Migration Workflow.
Monitor the Data Migration Workflow (asynchronously) until completion.

A Data Migration Workflow is essentially an action/goal for the system to complete, such as migrating a specific set of tables with a given configuration. You can submit multiple workflows simultaneously and monitor them. The Orchestrator breaks Data Migration Workflows into smaller tasks. Normally, this also involves splitting a table into partitions before extracting its data and loading it to Snowflake.

Starting the Orchestrator

After installation, start the Orchestrator by running:

python -m data_migration_orchestrator

Before running, make sure that the SNOWFLAKE_CONNECTION_NAME environment variable is set to a value that matches one of the connection names in your Snowflake config.toml or connections.toml. That is the name of the connection used to connect to the Target Snowflake Account.

Connection session defaults use SNOWFLAKE_DATABASE and SNOWFLAKE_SCHEMA where your environment provides them (for example in SPCS). Those are separate from metadata object locations: by default, workflow and task-queue objects live in SNOWCONVERT_AI.DATA_MIGRATION and data-validation objects in SNOWCONVERT_AI.DATA_VALIDATION. To deploy metadata under different database or schema names, set:

CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA (default SNOWCONVERT_AI)
CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA (default DATA_MIGRATION)
CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_VALIDATION_METADATA (default DATA_VALIDATION)

Workers that call task-queue stored procedures must use the same CUSTOM_* values as the orchestrator.

The Orchestrator will run until you stop it. Data Migration Workflows need an active Orchestrator to be completed. However, the Orchestrator can be safely stopped at any point and resumed later (ongoing Data Migration Workflows will be resumed at that point).

Starting the Worker(s)

After installation (pip install snowflake-data-exchange-agent), start a Worker by running:

python -m data_exchange_agent -c <configuration-file-path>

The path to the configuration file can be omitted. In that case, the worker will look for a file called configuration.toml in your current directory. See the Worker Configuration section below for the full specification.

Workers will run until you stop them. Data Migration Workflows and Cloud Data Validation Workflows need at least one active Worker to be completed. However, the Workers can be safely stopped at any point and resumed later (ongoing workflows will be resumed at that point).

Creating a Data Migration Workflow

After installation, create workflows by running:

python -m data_migration_orchestrator create-data-migration-workflow <workflow-config-file-path> --name <workflow-name> --connection-name <connection-name> --source-platform <source-platform>

The Workflow Configuration specification can be found in the Workflow Configuration Reference section.
The workflow name must be composed of alphanumerical characters and cannot start with a digit.
You must specify the name of the Snowflake connection you want to use, as it appears in your config.toml or connections.toml file.
--source-platform is required for this subcommand. Supported values are sqlserver, redshift, and postgresql.

Monitoring a Data Migration Workflow

Each Workflow will go through different stages through its lifecycle:

Pending: No tasks have been created for this workflow yet.
Executing: Tasks have been created for this workflow and there are still tasks that haven't reached a terminal state (COMPLETED or FAILED).
Completed: All tasks have reached a terminal state (COMPLETED or FAILED).

In the data migration metadata schema (by default SNOWCONVERT_AI.DATA_MIGRATION) there are tables/views that can be queried to understand the status of one or more Workflows:

View/Table	Description
WORKFLOW	One row per workflow. Includes start/end time, status, and configuration.
TABLE_PROGRESS_WITH_EXAMPLE_ERROR	One row per table being migrated. Shows how many partitions are in each stage (extraction, loading, completed, or failed), along with related errors. Filterable by `WORKFLOW_ID`.
DATA_MIGRATION_ERROR	For each failed partition, contains the first known error. Filterable by `WORKFLOW_ID`.

In the same schema, there is a Streamlit dashboard called DATA_MIGRATION_DASHBOARD that can be used to monitor the workflows. This dashboard presents data from those tables/views.

Cloud Data Validation Workflows

The orchestrator can run Cloud Data Validation workflows in addition to data migration. Validation work is queued as data_validation tasks; the same Worker package (snowflake-data-exchange-agent) executes them when the optional snowflake-data-validation dependency is available in the worker environment. Create a validation workflow with:

python -m data_migration_orchestrator create-data-validation-workflow <validation-config-file-path> --source-platform <source-platform> [--name <workflow-name>] [--connection-name <connection-name>]

--source-platform is required for this subcommand. Its value must match the source_platform field in the JSON configuration file.
New workflow rows are inserted into the WORKFLOW table in the data migration metadata schema (default SNOWCONVERT_AI.DATA_MIGRATION) with WORKFLOW_TYPE set to data-validation. Validation results and related objects are stored under the data validation metadata schema (default SNOWCONVERT_AI.DATA_VALIDATION, configurable with CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_VALIDATION_METADATA).

Data Validation Workflow Configuration (Top Level)

Property	Type	Required	Description
`source_platform`	String	Yes	Source dialect identifier (for example `SqlServer`, `Redshift`). Must match the `--source-platform` argument when creating the workflow from the CLI.
`target_platform`	String	No	Defaults to `Snowflake`.
`target_database`	String	No	Default target database name for tables when not specified per table.
`validation_configuration`	Object	No	Global validation levels and options (see below).
`comparison_configuration`	Object	No	Numeric tolerance and optional type mapping file.
`database_mappings`	Object	No	Map of source database names to Snowflake database names.
`schema_mappings`	Object	No	Map of source schema names to Snowflake schema names.
`tables`	Array	Yes	At least one table (or view) to validate.
`views`	Array	No	Additional view entries using the same shape as `tables`.
`affinity`	String	No	Same affinity semantics as data migration workflows.
`use_snowflake_compute`	Boolean	No	When `true`, enables Snowflake-side computation paths where supported. Default `false`.
`max_rows_per_partition`	Integer	No	Global default for the target number of rows per partition when a table specifies `partition_column`. Must be greater than 0. Overridable per table. When omitted, the orchestrator uses an internal default of 500 000.

validation_configuration (global defaults)

When validation_configuration is omitted, the orchestrator applies these defaults: schema and metrics validation are enabled; row validation is disabled; row_validation_mode defaults to row; continue_on_failure defaults to false; max_failed_rows_number defaults to 100; exclude_metrics defaults to false; apply_metric_column_modifier defaults to true. Any field set here can be overridden per table via a nested validation_configuration on that table entry.

Property	Type	Description
`schema_validation`	Boolean	Level 1: schema / column consistency checks.
`metrics_validation`	Boolean	Level 2: statistical metrics comparison.
`row_validation`	Boolean	Level 3: row-level or cell-level data comparison.
`row_validation_mode`	String	For row validation: typically `row` or `cell`.
`continue_on_failure`	Boolean	Whether to continue to the next validation level after a failure.
`max_failed_rows_number`	Integer	Cap on failed rows reported for L3 validation (must be greater than 0 when set).
`exclude_metrics`	Boolean	Whether to exclude unsupported metric columns.
`apply_metric_column_modifier`	Boolean	Whether to apply metric column modifiers.

comparison_configuration

Property	Type	Description
`tolerance`	Number	Numeric comparison tolerance for metrics (must be greater than 0 when set). Default applied by the orchestrator when omitted is 0.001.
`type_mapping_file_path`	String	Optional path to a custom type mapping file for comparisons.

Per-table / per-view entry (`tables` and `views`)

Property	Type	Required	Description
`fully_qualified_name`	String	Yes	Source object name (format depends on source platform).
`use_column_selection_as_exclude_list`	Boolean	No	Default `false`.
`column_selection_list`	String[]	No	Columns to include or exclude per `use_column_selection_as_exclude_list`.
`target_name`	String	No	Target object name override.
`target_database`	String	No	Per-table target database override.
`target_schema`	String	No	Per-table target schema override.
`where_clause`	String	No	Filter on the source side.
`target_where_clause`	String	No	Filter on the target side.
`index_column_list`	String[]	No	Columns used to align rows on the source.
`target_index_column_list`	String[]	No	Columns used to align rows on the target.
`column_mappings`	Object	No	Map of source column name to target column name.
`is_case_sensitive`	Boolean	No	Case sensitivity for identifiers.
`chunk_number`	Integer	No	Chunking hint for validation (greater than 0 when set).
`max_failed_rows_number`	Integer	No	Overrides global cap for this object.
`exclude_metrics`	Boolean	No	Per-object metrics exclusion override.
`apply_metric_column_modifier`	Boolean	No	Per-object modifier override.
`object_type`	String	No	Typically `TABLE` or `VIEW`.
`partition_column`	String	No	Column used for range-based (NTILE) partitioning during validation. Without this, the table is processed as a single partition.
`max_rows_per_partition`	Integer	No	Per-table override for the target number of rows per partition. Must be greater than 0. Overrides the global `max_rows_per_partition` when set.
`validation_configuration`	Object	No	Nested object with the same fields as global `validation_configuration` to override defaults for this object only.

Advanced Features

Redshift UNLOAD

For Redshift, it is recommended to use the UNLOAD extraction strategy. The main idea behind this is:

Large query results are written directly to an S3 Bucket instead of being downloaded to the machine in which the Worker is running.
On Snowflake side, an External Stage is set up to reference the corresponding S3 Bucket, so that COPY INTO statements can be done directly from that stage.

See the Extraction Strategy section for configuration details.

Incremental Synchronization

It is possible to migrate some tables and then re-migrate them in the future, moving only the data that has changed. See the Synchronization Strategies section for the available strategies and their configuration.

Considerations and Recommendations

Connecting to Snowflake with a PAT

It is recommended to use Programmatic Access Tokens for connections used by the Orchestrator and Workers. This ensures there won't be a need to constantly authenticate through the browser or with an Authenticator app. You will need to establish a Network Policy or temporarily bypass the requirement for a Network Policy (this can be done from Snowsight).

Running Orchestrator and/or Workers on SPCS

If you want to leverage Snowflake compute for these tasks, you can:

Prepare Docker images that use the Python modules and have the appropriate configuration.
Push those Docker images to an Image Repository in Snowflake.
Execute the Orchestrator and/or Worker(s) images using Snowpark Container Services.

Some considerations:

It is recommended to execute them as Services, not Jobs.
It is possible to run only one component (Orchestrator or Workers) in SPCS and the other on another platform.
It is a good practice to monitor the SPCS service and suspend it when it is not being used.
Depending on the network configuration of the Source System, you might need to configure an External Access Integration so that these services can connect to your Source System.

Initial Testing

It is recommended to deploy the DDL for the tables you want to migrate before starting data migration. This ensures the target types match the behavior you want in those tables and their related views/procedures. Converting the DDL from your source dialect into Snowflake SQL can be done through the Code Conversion capabilities of SnowConvert AI and/or Cortex Code. If you don't deploy the DDL before starting data migration, the types will be inferred and might not be as accurate as desired.

Additionally, it is a good practice to move a few rows from each table as a test before starting the full migration. This helps detect configuration or connectivity issues early.

Managing Workers

The time it takes to complete a workflow depends heavily on many variables. One of the variables that affects the most is the number of workers (and the number of threads per worker), since that determines how many extraction tasks can be executed in parallel. Consider:

It is not necessary to run two workers on the same machine. If you want more parallelism on one machine, increase the thread count instead.
Network bandwidth greatly affects the speed of workers and is effectively shared between threads of a worker.
Even with many workers/threads processing tasks in parallel, your source system might not have enough resources to handle the load.
You might want to keep a low worker count to avoid overloading your source system.
You might want to stop some (or all) of your workers at times when your source system is already overloaded by unrelated operations, to avoid disrupting those operations.

Workflow Configuration Reference

The workflow configuration file is a JSON object. Its structure is described below using named models -- each model's properties reference other models by name.

WorkflowConfiguration (Top Level)

Property	Type	Required	Description
`schemaVersion`	String	No	Version of the configuration schema (e.g. "1.0.0"). Accepts formats "major", "major.minor", or "major.minor.patch". Defaults to "1.0.0" if omitted.
`tables`	TableConfiguration[]	Yes	An array of table-specific configurations defining which tables to migrate and how.
`defaultTableConfiguration`	TableConfiguration	No	Shared settings inherited by all tables. Table-specific values override these defaults (see merging rules below).
`affinity`	String	No	Affinity group string. Ensures that only orchestrator and worker instances with a matching affinity process this workflow.

TableConfiguration

Defines the settings for migrating a single table.

Property	Type	Required	Description
`source`	SourceTargetIdentifier	Yes	Identifies the source table.
`target`	SourceTargetIdentifier	Yes	Identifies the target table in Snowflake.
`columnNamesToPartitionBy`	String[]	Yes	Columns used to partition data during extraction.
`extraction`	ExtractionStrategy	No	Configures how data is extracted from the source database.
`synchronization`	SynchronizationStrategy	No	Configures incremental synchronization behavior.
`columnTypeMappings`	ColumnTypeMapping[]	No	Type conversions applied during migration.
`columnNameMappings`	ColumnNameMapping[]	No	Column renaming mappings.
`primaryKeyColumns`	String[]	No	Primary key columns for the source table. Required when using `trackModifications` in the watermark synchronization strategy.
`partitionSize`	PartitionSize	No	Configures the target size of each partition. Defaults to `"auto"`.
`whereClauseCriteria`	String	No	SQL-like filter to select a subset of rows (e.g., `"is_deleted = 0"`).

Default Table Configuration Merging Rules

When defaultTableConfiguration is provided, its values are merged into each table entry using these rules:

Nested objects (source, target, synchronization, extraction): Deep merge -- fields within are merged individually.
Collections (columnTypeMappings, columnNameMappings, etc.): Table value replaces default entirely.
Scalars (whereClauseCriteria): Table value overrides default.

SourceTargetIdentifier

Used by source and target in TableConfiguration to identify a database object.

Property	Type	Required	Description
`databaseName`	String	Yes	Name of the database.
`schemaName`	String	Yes	Name of the schema.
`tableName`	String	Yes	Name of the table.

The target object accepts two additional optional fields: tableType ("native" or "iceberg") and icebergConfig (required when using Iceberg). If tableType is omitted or null, it defaults to "native" (standard Snowflake table). See IcebergConfig and the Redshift UNLOAD with Iceberg Tables example.

IcebergConfig (target.icebergConfig)

Used when target.tableType is "iceberg". Fields are merged with defaultTableConfiguration.target.icebergConfig; table-level keys override defaults.

Property	Type	Required	Description
`catalog`	String	No	Default `SNOWFLAKE` for Snowflake-managed Iceberg. Use a catalog integration name for externally cataloged tables (for example AWS Glue).
`externalVolume`	String	For `catalog` `SNOWFLAKE`	Snowflake external volume for Iceberg data and metadata.
`baseLocationPrefix`	String	No	Optional path prefix for `BASE_LOCATION` when using Snowflake-managed Iceberg (`catalog` `SNOWFLAKE`).
`catalogTableName`	String	For external `catalog`	Fully qualified name of the table in the external catalog (for example `glue_db.my_table`).
`catalogSync`	String	No	Optional catalog integration used to sync Snowflake-managed metadata back to an external catalog.
`sourceDataStage`	String	No	Stage path starting with `@` pointing at existing Parquet files; used for `copy_files`-style loads with Snowflake-managed Iceberg.
`migrationStrategy`	String	No	One of `catalog_link`, `convert_to_managed`, `copy_files`. When omitted, the orchestrator infers a strategy from `catalog` and `sourceDataStage`.

Snowflake account setup for Iceberg (external volumes, catalog integrations, stages, and privileges) follows Snowflake’s Iceberg documentation; use the examples above as a template for JSON fields.

ColumnTypeMapping

Property	Type	Required	Description
`sourceType`	String	Yes	Type name in the source system.
`targetType`	String	Yes	Target type in Snowflake.

ColumnNameMapping

Property	Type	Required	Description
`sourceName`	String	Yes	Column name in the source system.
`targetName`	String	Yes	Target column name in Snowflake.

ExtractionStrategy

Field	Type	Required	Description
`strategy`	`"regular"` or `"unload"`	Yes	Extraction method. `"regular"` is the default; `"unload"` is available for Redshift sources.
`externalStage`	String	UNLOAD only	Fully-qualified Snowflake external stage name (e.g., `"MY_DB.MY_SCHEMA.S3_STAGE"`).

regular (default) -- Data is queried and downloaded through the Worker:

"extraction": { "strategy": "regular" }

unload (Redshift only) -- Data is written to S3 via Redshift UNLOAD and loaded from an external stage:

"extraction": { "strategy": "unload", "externalStage": "MY_DB.MY_SCHEMA.S3_EXTERNAL_STAGE" }

PartitionSize

Controls how large each partition should be during extraction. Accepts a string or an object.

Form	Description
`"auto"` (default)	The system picks optimal partition sizes based on the source platform, extraction strategy, and table size.
`{ "targetSizeMb": N }`	Each partition targets approximately `N` megabytes of data.
`{ "maxRowsPerPartition": N }`	Each partition contains at most `N` rows, regardless of data size.

Only one of targetSizeMb or maxRowsPerPartition may be specified when using the object form.

auto (default) -- The system selects partition sizes tuned for the platform and extraction strategy:

"partitionSize": "auto"

Auto mode uses larger partitions for Redshift UNLOAD (where S3 handles large files well) and smaller partitions for ODBC-based extraction (SQL Server, PostgreSQL, Redshift REGULAR) where data flows through the Worker's memory. For very large tables (100+ GB), the maximum number of partitions is increased to allow more parallelism.

Fixed size in MB -- Specify a target size per partition:

"partitionSize": { "targetSizeMb": 2048 }

Fixed row count -- Specify a maximum number of rows per partition:

"partitionSize": { "maxRowsPerPartition": 500000 }

SynchronizationStrategy

Controls whether subsequent workflow runs perform a full re-extraction or only sync changed data.

Field	Type	Required	Description
`strategy`	`"none"`, `"checksum"`, or `"watermark"`	Yes	The synchronization method.
`watermarkColumn`	String	Watermark only	Column name to track (must be monotonically increasing).
`trackModifications`	Boolean	No	If `true`, uses the primary key to deduplicate modified rows. Requires `primaryKeyColumns` in TableConfiguration.

none (default) -- Full extraction on every run. No synchronization metadata is stored.

"synchronization": { "strategy": "none" }

Use when: Data is small, changes are unpredictable, or guaranteed consistency is needed.

checksum -- Computes a hash of all column values per partition. Only changed partitions are cleared and re-extracted.

"synchronization": { "strategy": "checksum" }

Use when: You need to detect any change but lack a reliable monotonic column (e.g., dimension tables).
Trade-offs: Requires a checksum computation on the source for every partition on every run.

watermark -- Tracks a monotonic column (timestamp, ID, version) to sync only rows newer than the last observed maximum.

"synchronization": { "strategy": "watermark", "watermarkColumn": "UPDATED_AT" }

Use when: Your table has a reliable monotonic column that increases on insert/update (e.g., fact tables, event logs).
Limitation: Watermark alone cannot currently track deletions. Support for this will be added in the future.

Quoting Identifiers

Names that need quoting (or brackets) must be manually quoted as they would normally be in JSON. For example: "tableName": "\"MyCaseSensitiveTable\"".

Workflow Configuration Examples

Basic Migration (SQL Server)

Migrates two tables with shared source/target schemas, type mappings, column renaming, watermark sync, and row filtering:

{
  "defaultTableConfiguration": {
    "source": {
      "schemaName": "data_migration_cloud_test",
      "databaseName": "SampleStoreDB"
    },
    "target": {
      "schemaName": "data_migration_cloud_test",
      "databaseName": "samplestoredb"
    }
  },
  "tables": [
    {
      "source": { "tableName": "store_employee" },
      "target": { "tableName": "target_employee" },
      "columnNamesToPartitionBy": ["ID"]
    },
    {
      "source": { "tableName": "Sales_Simple" },
      "target": { "tableName": "Sales_Simple" },
      "columnNamesToPartitionBy": ["ID"],
      "columnTypeMappings": [
        { "sourceType": "MONEY", "targetType": "DECIMAL(19,4)" }
      ],
      "columnNameMappings": [
        { "sourceName": "id", "targetName": "old_id" },
        { "sourceName": "name", "targetName": "full_name" }
      ],
      "synchronization": {
        "strategy": "watermark",
        "watermarkColumn": "UPDATED_AT"
      },
      "partitionSize": { "targetSizeMb": 2048 },
      "whereClauseCriteria": "is_deleted = 0"
    }
  ]
}

Redshift UNLOAD

Uses the UNLOAD extraction strategy with an external stage for S3-based data transfer:

{
  "defaultTableConfiguration": {
    "source": {
      "schemaName": "ecommerce_raw",
      "databaseName": "snowconvert_demo"
    },
    "target": {
      "schemaName": "ecommerce_raw",
      "databaseName": "TARGET_DB"
    },
    "extraction": {
      "strategy": "unload",
      "externalStage": "MY_DB.MY_SCHEMA.S3_EXTERNAL_STAGE"
    },
    "partitionSize": "auto"
  },
  "tables": [
    {
      "source": { "tableName": "customers" },
      "target": { "tableName": "customers" },
      "columnNamesToPartitionBy": ["customer_id"]
    },
    {
      "source": { "tableName": "orders" },
      "target": { "tableName": "orders" },
      "columnNamesToPartitionBy": ["order_id"],
      "columnTypeMappings": [
        { "sourceType": "NUMERIC(10,2)", "targetType": "DECIMAL(10,2)" }
      ]
    }
  ]
}

Redshift UNLOAD with Iceberg Tables

Combines Redshift UNLOAD with Iceberg table targets, including Snowflake-managed and Glue catalog configurations:

{
  "defaultTableConfiguration": {
    "source": {
      "schemaName": "public",
      "databaseName": "analytics_db"
    },
    "target": {
      "schemaName": "public",
      "databaseName": "TARGET_DB",
      "tableType": "iceberg",
      "icebergConfig": {
        "catalog": "SNOWFLAKE",
        "externalVolume": "my_iceberg_ext_vol",
        "baseLocationPrefix": "migrations/redshift",
        "sourceDataStage": "@TARGET_DB.PUBLIC.ICEBERG_SOURCE_STAGE"
      }
    },
    "extraction": {
      "strategy": "unload",
      "externalStage": "TARGET_DB.PUBLIC.S3_EXTERNAL_STAGE"
    },
    "partitionSize": "auto"
  },
  "tables": [
    {
      "source": { "tableName": "customers" },
      "target": { "tableName": "customers" },
      "columnNamesToPartitionBy": ["customer_id"]
    },
    {
      "source": { "tableName": "events" },
      "target": {
        "tableName": "events",
        "tableType": "iceberg",
        "icebergConfig": {
          "catalog": "my_glue_catalog_integration",
          "externalVolume": "my_iceberg_ext_vol",
          "catalogTableName": "glue_db.events"
        }
      },
      "columnNamesToPartitionBy": ["event_id"]
    },
    {
      "source": { "tableName": "orders" },
      "target": {
        "tableName": "orders",
        "tableType": "iceberg",
        "icebergConfig": {
          "catalog": "my_glue_catalog_integration",
          "externalVolume": "my_iceberg_ext_vol",
          "catalogTableName": "glue_db.orders",
          "migrationStrategy": "convert_to_managed"
        }
      },
      "columnNamesToPartitionBy": ["order_id"]
    }
  ]
}

Incremental Sync with Watermark

Uses watermark-based synchronization with modification tracking for incremental data migration:

{
  "defaultTableConfiguration": {
    "source": { "databaseName": "SRC", "schemaName": "dbo" },
    "target": { "databaseName": "TGT", "schemaName": "public" },
    "synchronization": {
      "strategy": "watermark",
      "watermarkColumn": "updated_at"
    }
  },
  "tables": [
    {
      "source": { "tableName": "orders" },
      "target": { "tableName": "orders" },
      "columnNamesToPartitionBy": ["order_id"],
      "primaryKeyColumns": ["order_id"],
      "synchronization": {
        "trackModifications": true
      }
    }
  ]
}

Worker Configuration

This section documents the configuration for the Worker (snowflake-data-exchange-agent package). The Worker configuration file uses TOML format.

Section	Property	Type	Description
Top Level	`selected_task_source`	String	Currently should always be set to `"snowflake_stored_procedure"`.
`[application]`	`max_parallel_tasks`	Integer	Maximum number of tasks the worker will process in parallel (using threads).
`[application]`	`task_fetch_interval`	Integer	Interval (in seconds) between attempts to fetch new tasks from the Orchestrator.
`[application]`	`snowflake_database_for_metadata`	String	Optional. Database where the orchestrator deployed the task queue (default `SNOWCONVERT_AI`). Must match the orchestrator's `CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA` if you override it there.
`[application]`	`snowflake_schema_for_data_migration_metadata`	String	Optional. Schema for `PULL_TASKS` / `COMPLETE_TASK` / `FAIL_TASK` (default `DATA_MIGRATION`). Must match the orchestrator's `CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA` if overridden.
`[connections.source.*]`		Object	Configuration for source system connections. The Worker typically requires an ODBC driver. See examples below.
`[connections.target.snowflake_connection_name]`	`connection_name`	String	The name of the connection entry in the `~/.snowflake/config.toml` file to use.

When selected_task_source is snowflake_stored_procedure, the worker calls the task-queue procedures using application.snowflake_database_for_metadata and application.snowflake_schema_for_data_migration_metadata. These values are independent of Snowflake session defaults (SNOWFLAKE_DATABASE, SNOWFLAKE_SCHEMA) in the connection profile.

Example: SQL Server (Standard Authentication)

[connections.source.sqlserver]
username = "username"
password = "password"
database = "database_name"
host = "127.0.0.1"
port = 1433

Example: Amazon Redshift (IAM Authentication)

[connections.source.redshift]
username = "demo-user"
database = "demo_db"
auth_method = "iam-provisioned-cluster"
cluster_id = "my-aws-cluster"
region = "us-west-2"
access_key_id = "your-access-key-id"
secret_access_key = "your-secret-access-key"

Example: Amazon Redshift (Standard Authentication)

[connections.source.redshift]
username = "myuser"
password = "mypassword"
database = "mydatabase"
host = "my-cluster.abcdef123456.us-west-2.redshift.amazonaws.com"
port = 5439
auth_method = "standard"

Example: Teradata (ODBC)

[connections.source.teradata]
driver_name = "Teradata Database ODBC Driver 17.20"
host = "your-teradata-host.example.com"
port = 1025
database = "tpcds"
username = "your_username"
password = "your_password"

Note: Only one source connection is needed. The Snowflake target connection should point to a valid entry in your ~/.snowflake/config.toml.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.11.2

May 25, 2026

0.11.1

May 25, 2026

0.11.0

May 21, 2026

0.10.0

May 13, 2026

0.9.1

Apr 29, 2026

0.9.0

Apr 28, 2026

0.8.0

Apr 16, 2026

0.7.2

Apr 9, 2026

0.7.1

Apr 8, 2026

This version

0.7.0

Apr 7, 2026

0.6.0

Apr 6, 2026

0.5.0

Apr 1, 2026

0.4.0

Mar 31, 2026

0.3.0

Mar 26, 2026

0.2.0

Mar 19, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

snowflake_data_migration_orchestrator-0.7.0.tar.gz (721.3 kB view details)

Uploaded Apr 7, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

snowflake_data_migration_orchestrator-0.7.0-py3-none-any.whl (396.3 kB view details)

Uploaded Apr 7, 2026 Python 3

File details

Details for the file snowflake_data_migration_orchestrator-0.7.0.tar.gz.

File metadata

Download URL: snowflake_data_migration_orchestrator-0.7.0.tar.gz
Upload date: Apr 7, 2026
Size: 721.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for snowflake_data_migration_orchestrator-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`d860993afb51dffe38bea0b2d8a24cea6faa5ab5e5d9a2e6ec52e13a741d8abe`
MD5	`e4a61c0b896c0308fd047f3d7dde9ded`
BLAKE2b-256	`41d67504817ac7da4821541dd5fc34ef1e18864855df0db0d07eb935b972c006`

See more details on using hashes here.

File details

Details for the file snowflake_data_migration_orchestrator-0.7.0-py3-none-any.whl.

File metadata

Download URL: snowflake_data_migration_orchestrator-0.7.0-py3-none-any.whl
Upload date: Apr 7, 2026
Size: 396.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for snowflake_data_migration_orchestrator-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`70fd8f8c383508b37b23b7bc83dbe20bec61d1df81130dfc28d199902db28e52`
MD5	`db3200f1cfd6c2a11dcdd943e057710d`
BLAKE2b-256	`e67fcac15b09fa0e7984bdd880f2bad0be2c1f50688bf688e7c377da2eb8832a`

See more details on using hashes here.

snowflake-data-migration-orchestrator 0.7.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Snowflake Data Migration Orchestrator

Architecture

Where to Deploy Orchestrator and Worker(s)?

Setup

Additional Configuration on Snowflake Account

Usage

Starting the Orchestrator

Starting the Worker(s)

Creating a Data Migration Workflow

Monitoring a Data Migration Workflow

Cloud Data Validation Workflows

Data Validation Workflow Configuration (Top Level)

validation_configuration (global defaults)

comparison_configuration

Per-table / per-view entry (tables and views)

Advanced Features

Redshift UNLOAD

Incremental Synchronization

Considerations and Recommendations

Connecting to Snowflake with a PAT

Running Orchestrator and/or Workers on SPCS

Initial Testing

Managing Workers

Workflow Configuration Reference

WorkflowConfiguration (Top Level)

TableConfiguration

Default Table Configuration Merging Rules

SourceTargetIdentifier

IcebergConfig (target.icebergConfig)

ColumnTypeMapping

ColumnNameMapping

ExtractionStrategy

PartitionSize

SynchronizationStrategy

Quoting Identifiers

Workflow Configuration Examples

Basic Migration (SQL Server)

Redshift UNLOAD

Redshift UNLOAD with Iceberg Tables

Incremental Sync with Watermark

Worker Configuration

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Per-table / per-view entry (`tables` and `views`)