
mkpipe-loader-snowflake

Snowflake loader plugin for MkPipe. Writes Spark DataFrames into Snowflake tables using the native Snowflake Spark connector (spark-snowflake), which stages data through internal cloud storage (S3/Azure/GCS) and is significantly faster than JDBC for large datasets.
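
As an illustration of the mechanism, here is a minimal sketch of a connector write from PySpark. This is not mkpipe's internal code; the option keys are the spark-snowflake connector's documented names, the credentials and table name are placeholders, and the connector plus the Snowflake JDBC driver must be on the Spark classpath:

from pyspark.sql import SparkSession

SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

spark = SparkSession.builder.appName("snowflake-write-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Connector options; the connector stages the data in cloud storage
# and bulk-loads it, rather than inserting row by row over JDBC.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "myuser",
    "sfPassword": "mypassword",
    "sfDatabase": "MY_DATABASE",
    "sfSchema": "MY_SCHEMA",
    "sfWarehouse": "MY_WAREHOUSE",
}

(df.write
    .format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("dbtable", "STG_EVENTS")
    .mode("append")
    .save())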

Documentation

For more detailed documentation, please visit the GitHub repository.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


Connection Configuration

Password-based authentication:

connections:
  snowflake_target:
    variant: snowflake
    host: myaccount.snowflakecomputing.com
    port: 443
    database: MY_DATABASE
    schema: MY_SCHEMA
    user: myuser
    password: mypassword
    warehouse: MY_WAREHOUSE

With RSA key pair authentication:

connections:
  snowflake_target:
    variant: snowflake
    host: myaccount.snowflakecomputing.com
    port: 443
    database: MY_DATABASE
    schema: MY_SCHEMA
    user: myuser
    warehouse: MY_WAREHOUSE
    private_key_file: /path/to/rsa_key.p8
    private_key_file_pwd: mypassphrase
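
How the loader consumes the key file and passphrase is internal to mkpipe. As a sketch of the common pattern only, this is how a PKCS#8 key is typically decrypted with the cryptography package and handed to the spark-snowflake connector's pem_private_key option, extending the sf_options dict from the earlier sketch (path and passphrase are placeholders):

from cryptography.hazmat.primitives import serialization

# Decrypt the PKCS#8 key with its passphrase, then re-serialize it
# unencrypted so it can be passed to the connector as a string.
with open("/path/to/rsa_key.p8", "rb") as f:
    private_key = serialization.load_pem_private_key(
        f.read(), password=b"mypassphrase"
    )

pem_bytes = private_key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

# The connector expects the bare base64 body, without PEM headers.
pem_body = (
    pem_bytes.decode()
    .replace("-----BEGIN PRIVATE KEY-----", "")
    .replace("-----END PRIVATE KEY-----", "")
    .replace("\n", "")
)

sf_options["pem_private_key"] = pem_body  # replaces sfPassword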

Table Configuration

pipelines:
  - name: pg_to_snowflake
    source: pg_source
    destination: snowflake_target
    tables:
      - name: public.events
        target_name: STG_EVENTS
        replication_method: full
        batchsize: 50000

Write Strategy

Control how data is written to Snowflake:

      - name: public.events
        target_name: STG_EVENTS
        write_strategy: upsert       # append | replace | upsert | merge
        write_key: [id]              # required for upsert/merge
| Strategy | Snowflake Behavior |
| --- | --- |
| append | Insert via Spark connector (default for incremental) |
| replace | Overwrite table via Spark connector (default for full). Use if_exists: append to preserve the existing table |
| upsert | Write to a temp table, then MERGE INTO target USING temp ON ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ... |
| merge | Same as upsert for Snowflake |

Note: upsert/merge requires write_key. The loader creates a temp table, writes data there, executes a MERGE statement, then drops the temp table.
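
A sketch of that flow, continuing the earlier example and using the connector's Utils.runQuery helper from PySpark. The table name (STG_EVENTS), key column (id), and column list are illustrative; the SQL mkpipe actually generates may differ:

# Stage the incoming batch in a temp table.
temp_table = "STG_EVENTS_TMP"
(df.write
    .format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("dbtable", temp_table)
    .mode("overwrite")
    .save())

# Merge the temp table into the target on the write_key, then clean up.
merge_sql = f"""
    MERGE INTO STG_EVENTS AS t
    USING {temp_table} AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.val = s.val
    WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val)
"""
sf_utils = spark._jvm.net.snowflake.spark.snowflake.Utils
sf_utils.runQuery(sf_options, merge_sql)
sf_utils.runQuery(sf_options, f"DROP TABLE IF EXISTS {temp_table}")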


Write Parallelism & Throughput

The Snowflake loader uses the native Spark connector. Two parameters control write performance:

      - name: public.events
        target_name: STG_EVENTS
        replication_method: full
        batchsize: 50000        # rows per batch insert (default: 10000)
        write_partitions: 4     # coalesce DataFrame to N partitions before writing

How they work

  • batchsize: number of rows buffered before sending to Snowflake. Larger batches reduce round-trips and staging overhead.
  • write_partitions: calls coalesce(N) on the DataFrame before writing, controlling the number of concurrent write operations to Snowflake (see the sketch below).
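
A sketch of how write_partitions maps onto a Spark write, continuing the earlier example. How mkpipe forwards batchsize to its writer is internal to the loader, so only write_partitions appears as an explicit Spark call here:

# coalesce(4) bounds the number of concurrent writer tasks, which is
# what write_partitions: 4 controls in the table configuration.
(df.coalesce(4)
    .write
    .format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("dbtable", "STG_EVENTS")
    .mode("overwrite")
    .save())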

Performance Notes

  • Snowflake Warehouse size is the primary write performance lever. A larger warehouse processes inserts faster regardless of partition count.
  • The Spark connector stages data internally before committing. Large batchsize (50,000+) reduces staging overhead.
  • For very large loads, consider using Snowflake's native COPY INTO from an external stage (S3/GCS) instead; it is significantly faster but requires additional infrastructure (see the sketch after this list).
  • write_partitions: 4–8 is a good default to balance throughput and connection count.
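
For reference, a hedged sketch of that alternative: a COPY INTO issued through the same sf_utils helper as in the upsert sketch. The stage name, path, and file format are placeholders, and creating and populating the external stage is the extra infrastructure mentioned above:

copy_sql = """
    COPY INTO MY_SCHEMA.STG_EVENTS
    FROM @my_external_stage/events/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
"""
sf_utils.runQuery(sf_options, copy_sql)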

All Table Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Source table name |
| target_name | string | required | Snowflake destination table name |
| replication_method | full / incremental | full | Replication strategy |
| batchsize | int | 10000 | Rows per batch insert |
| write_partitions | int | (unset) | Coalesce DataFrame to N partitions before writing |
| write_strategy | string | append (incremental) / replace (full) | One of append, replace, upsert, merge |
| write_key | list | (none) | Key columns for upsert/merge (required for those strategies) |
| if_exists | string | inherited from settings | replace (drop+create) or append (preserve table) |
| dedup_columns | list | (none) | Columns used for mkpipe_id hash deduplication |
| tags | list | [] | Tags for selective pipeline execution |
| pass_on_error | bool | false | Skip table on error instead of failing |
