DuckLake provider for Apache Airflow (based on DuckDB)

DuckLake Provider for Apache Airflow

This is a custom provider for integrating DuckLake (based on DuckDB) with Apache Airflow.

DuckLake Configuration

The DuckLakeHook uses Airflow connection fields and extras to configure the connection. Standard fields are relabeled for common use:

  • Host: Used for metadata host (e.g., Postgres/MySQL host) or file path (e.g., for DuckDB/SQLite metadata file).
  • Login: Username (for Postgres/MySQL).
  • Password: Password (for Postgres/MySQL).
  • Schema: Metadata schema (defaults to 'duckdb').
  • Extra: JSON dict for all other settings (required for engine, storage_type, and conditional fields).

Example extras JSON (adjust based on engine and storage_type). The install_extensions and load_extensions keys are optional and inherited from the DuckDB provider; connect_stack is optional and overrides the default DuckLake install/load commands. JSON does not allow comments, so keep annotations out of the value you paste into the Extra field:

{
  "engine": "postgres",
  "dbname": "my_ducklake",
  "pgdbname": "dev_nophiml_db",
  "storage_type": "s3",
  "s3_bucket": "your-s3-bucket",
  "s3_path": "your/s3/path/",
  "encrypted": true,
  "aws_access_key_id": "your-access-key-id",
  "aws_secret_access_key": "your-secret-access-key",
  "aws_region": "us-east-1",
  "expire_older_than": "1 day",
  "delete_older_than": "1 day",
  "parquet_version": 2,
  "max_temp_directory_size": "100GB",
  "install_extensions": ["spatial"],
  "load_extensions": ["spatial"],
  "connect_stack": [
    "INSTALL httpfs;",
    "LOAD httpfs;",
    "INSTALL ducklake;",
    "LOAD ducklake;"
  ]
}
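
As one way to supply these settings without the UI, Airflow 2.3 and later accept a JSON-serialized connection in an AIRFLOW_CONN_<CONN_ID> environment variable. The sketch below builds such a value with the standard library only. The conn_type value "ducklake", the host name, and the credentials are placeholders assumed for illustration; check the provider for the connection type it actually registers.

```python
import json

# Extras taken from the example above, trimmed to the keys this engine/storage pair requires.
extras = {
    "engine": "postgres",
    "dbname": "my_ducklake",
    "pgdbname": "dev_nophiml_db",
    "storage_type": "s3",
    "s3_bucket": "your-s3-bucket",
    "s3_path": "your/s3/path/",
}

connection = {
    "conn_type": "ducklake",            # assumption: not confirmed by this README
    "host": "metadata-db.example.com",  # placeholder metadata host
    "login": "ducklake_user",           # placeholder
    "password": "change-me",            # placeholder
    "schema": "duckdb",
    "extra": extras,
}

# Airflow derives the connection ID from the variable name (here: ducklake_default).
env_name = "AIRFLOW_CONN_DUCKLAKE_DEFAULT"
env_value = json.dumps(connection)
print(f"export {env_name}='{env_value}'")
```

Serializing with json.dumps also guarantees the extras are valid JSON, which a hand-edited value with inline comments would not be.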

Supported Engines (set in extras['engine'])

  • duckdb: Requires 'metadata_file' in extras or host as file path.
  • sqlite: Requires 'metadata_file' in extras or host as file path.
  • postgres: Requires host, login, password, and 'pgdbname' in extras.
  • mysql: Requires host, login, password, and 'mysqldbname' in extras.

Supported Storage Types (set in extras['storage_type'], default 's3')

  • s3: Requires 's3_bucket', 's3_path'; optional AWS creds.
  • azure: Requires 'azure_account_name', 'azure_container', 'azure_path'; optional 'azure_connection_string'.
  • gcs: Requires 'gcs_bucket', 'gcs_path'; optional service_account_key (JSON string).
  • local: Requires 'local_data_path'.
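
The requirements in the two lists above can be checked before a connection is saved. The following is a minimal sketch of such a pre-flight check, not the provider's own validation logic; the function name missing_settings and the wording of its messages are hypothetical.

```python
# Required extras per storage type, as listed above.
REQUIRED_EXTRAS_BY_STORAGE = {
    "s3": ["s3_bucket", "s3_path"],
    "azure": ["azure_account_name", "azure_container", "azure_path"],
    "gcs": ["gcs_bucket", "gcs_path"],
    "local": ["local_data_path"],
}
# Server-backed engines need a database name under an engine-specific key.
ENGINE_DB_KEY = {"postgres": "pgdbname", "mysql": "mysqldbname"}
# File-backed engines need a metadata file (or the host field used as a path).
FILE_ENGINES = {"duckdb", "sqlite"}


def missing_settings(extras, host=None, login=None, password=None):
    """Return the names of required settings that are absent (hypothetical helper)."""
    missing = []
    engine = extras.get("engine")
    if engine in FILE_ENGINES:
        if not (extras.get("metadata_file") or host):
            missing.append("metadata_file (or host as file path)")
    elif engine in ENGINE_DB_KEY:
        for name, value in (("host", host), ("login", login), ("password", password)):
            if not value:
                missing.append(name)
        if not extras.get(ENGINE_DB_KEY[engine]):
            missing.append(ENGINE_DB_KEY[engine])
    else:
        missing.append("engine (one of duckdb, sqlite, postgres, mysql)")

    storage = extras.get("storage_type", "s3")  # 's3' is the documented default
    required = REQUIRED_EXTRAS_BY_STORAGE.get(storage)
    if required is None:
        missing.append("storage_type (one of s3, azure, gcs, local)")
    else:
        missing.extend(key for key in required if not extras.get(key))
    return missing


print(missing_settings({"engine": "postgres"}, host="db.example.com"))
```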

The Airflow UI shows only the core fields; engine- and storage-specific settings go in extras. Set engine and storage_type there and provide the keys that the chosen combination requires. To customize the static DuckLake connection commands (for example, to install additional extensions), provide a connect_stack list in extras. The hook always appends the commands that depend on runtime values (secrets, thread settings, attachments, and so on) automatically.
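
The ordering described here (static commands first, runtime-dependent commands appended) can be sketched as follows. This is an illustration only: build_connect_commands is a hypothetical helper, and the DEFAULT_CONNECT_STACK contents are an assumption, since this README does not state the hook's actual defaults.

```python
# Assumed default stack; the real defaults used by the hook are not documented here.
DEFAULT_CONNECT_STACK = ["INSTALL ducklake;", "LOAD ducklake;"]


def build_connect_commands(extras, runtime_commands):
    """Static stack (override or default) followed by runtime-dependent commands."""
    stack = extras.get("connect_stack") or DEFAULT_CONNECT_STACK
    return [*stack, *runtime_commands]


extras = {
    "connect_stack": [
        "INSTALL httpfs;",
        "LOAD httpfs;",
        "INSTALL ducklake;",
        "LOAD ducklake;",
    ]
}
# Runtime commands are computed per connection, for example from the threads setting.
for command in build_connect_commands(extras, ["SET threads=8;"]):
    print(command)
```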

Performance and Resource Controls

The hook exposes several settings for tuning concurrency and memory usage, along with DuckLake catalog defaults:

  • threads: (int/string) Overrides DuckDB's worker thread count. Non-numeric/blank values are ignored and the default of 4 is used.
  • memory_limit: (string) A DuckDB-formatted limit such as "4GB" or "512MB". If provided, it takes precedence over memory_plan.
  • memory_plan: ("conservative", "midtier", "aggressive") Lets the hook auto-size memory_limit based on available RAM. Defaults to "midtier" if not configured.
  • max_temp_directory_size: (string) A DuckDB-formatted spill limit such as "100GB". If provided, the hook issues SET max_temp_directory_size=... before queries run.
  • encrypted: (bool/string) Adds ENCRYPTED to the DuckLake ATTACH options. Defaults to true.
  • expire_older_than: (string) Sets DuckLake's global expiry window after attach. Defaults to "1 day".
  • delete_older_than: (string) Sets DuckLake's global delete retention window after attach. Defaults to "1 day".
  • parquet_version: (int/string) Sets DuckLake's parquet writer version after attach. Defaults to 2.
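
Several of these settings accept either a native value or a string. A minimal sketch of the kind of normalization described for threads and encrypted follows. The helper names, the accepted boolean spellings, and the treatment of non-positive thread counts as invalid are assumptions, not the hook's confirmed behavior.

```python
def resolve_threads(value, default=4):
    """Return an int thread count; blank or non-numeric input falls back to the default."""
    try:
        threads = int(str(value).strip())
    except (TypeError, ValueError):
        return default
    # Assumption: zero or negative counts are treated like invalid input.
    return threads if threads > 0 else default


def resolve_bool(value, default=True):
    """Interpret a bool or a common string spelling; unknown input keeps the default."""
    if isinstance(value, bool):
        return value
    if value is None:
        return default
    text = str(value).strip().lower()
    if text in {"true", "1", "yes", "on"}:
        return True
    if text in {"false", "0", "no", "off"}:
        return False
    return default


print(resolve_threads("8"), resolve_threads("abc"), resolve_bool("false"))
```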

When memory_limit is omitted, the hook estimates available physical memory (using psutil, /proc/meminfo, POSIX sysconf, or Windows APIs), applies the selected plan’s fraction, and clamps the result within defined min/max bounds. This keeps the limit below what the machine can spare while still capping it at a sane maximum. If the machine’s available memory cannot be determined, DuckDB’s default memory settings are used.
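
The auto-sizing steps above can be sketched with the standard library alone. The plan fractions and the min/max bounds below are hypothetical values chosen for illustration; the README does not state the numbers the hook uses, and this sketch skips the psutil and Windows probes.

```python
import os

# Hypothetical fractions and bounds; the hook's real values are not documented here.
PLAN_FRACTIONS = {"conservative": 0.25, "midtier": 0.5, "aggressive": 0.75}
MIN_BYTES = 512 * 10**6
MAX_BYTES = 64 * 10**9


def available_memory_bytes():
    """Best-effort estimate of available RAM; None when it cannot be determined."""
    try:
        with open("/proc/meminfo") as meminfo:
            for line in meminfo:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) * 1024  # the value is reported in kB
    except (OSError, ValueError, IndexError):
        pass
    try:
        pages = os.sysconf("SC_AVPHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, OSError, ValueError):
        return None  # no usable probe on this platform
    return pages * page_size if pages > 0 and page_size > 0 else None


def plan_memory_limit(available, plan="midtier"):
    """Return a DuckDB-style limit string, or None to keep DuckDB's default."""
    if not available:
        return None
    fraction = PLAN_FRACTIONS.get(plan, PLAN_FRACTIONS["midtier"])
    target = int(available * fraction)
    clamped = max(MIN_BYTES, min(target, MAX_BYTES))
    return f"{clamped // 10**6}MB"


print(plan_memory_limit(available_memory_bytes(), "conservative"))
```

Keeping the probe separate from the sizing function makes the sizing logic easy to test with fixed inputs.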

The hook includes ENCRYPTED on every ATTACH by default. After the DuckLake catalog is attached, it also ensures these defaults exist when the catalog does not already have values for them:

  • CALL <dbname>.set_option('expire_older_than', '1 day')
  • CALL <dbname>.set_option('delete_older_than', '1 day')
  • CALL <dbname>.set_option('parquet_version', 2)

If DuckLake already has values for those three options, the hook leaves the existing catalog settings unchanged.
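
The "set only when missing" behavior can be illustrated with a small pure function that produces the CALL statements for options the catalog does not already define. This is a sketch under assumptions: default_option_statements is a hypothetical helper, and the caller is assumed to have already looked up which option names exist in the catalog.

```python
DEFAULT_OPTIONS = {
    "expire_older_than": "1 day",
    "delete_older_than": "1 day",
    "parquet_version": 2,
}


def default_option_statements(dbname, existing_options, overrides=None):
    """Build set_option calls for defaults that the catalog does not define yet."""
    desired = {**DEFAULT_OPTIONS, **(overrides or {})}
    statements = []
    for name, value in desired.items():
        if name in existing_options:
            continue  # leave existing catalog settings unchanged
        if isinstance(value, int):
            literal = str(value)
        else:
            literal = "'" + str(value).replace("'", "''") + "'"  # escape single quotes
        statements.append(f"CALL {dbname}.set_option('{name}', {literal})")
    return statements


# Only the two missing options produce statements.
for statement in default_option_statements("my_ducklake", {"parquet_version"}):
    print(statement)
```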

You can also pass these parameters directly when instantiating the hook in a DAG:

from ducklake_provider.hooks.ducklake_hook import DuckLakeHook

hook = DuckLakeHook(
    ducklake_conn_id="ducklake_default",
    memory_plan="conservative",  # or set memory_limit="6GB"
    max_temp_directory_size="100GB",
    threads=8,
)
