DuckLake provider for Apache Airflow (based on DuckDB)
DuckLake Provider for Apache Airflow
This is a custom provider for integrating DuckLake (based on DuckDB) with Apache Airflow.
DuckLake Configuration
The DuckLakeHook uses Airflow connection fields and extras to configure the connection. Standard fields are relabeled for common use:
- Host: Used for metadata host (e.g., Postgres/MySQL host) or file path (e.g., for DuckDB/SQLite metadata file).
- Login: Username (for Postgres/MySQL).
- Password: Password (for Postgres/MySQL).
- Schema: Metadata schema (defaults to 'duckdb').
- Extra: JSON dict for all other settings (required for engine, storage_type, and conditional fields).
Example extras JSON (adjust based on engine and storage_type). The install_extensions and load_extensions keys are optional and inherited from the DuckDB provider; connect_stack is optional and overrides the default DuckLake install/load commands:
{
  "engine": "postgres",
  "dbname": "my_ducklake",
  "pgdbname": "dev_nophiml_db",
  "storage_type": "s3",
  "s3_bucket": "your-s3-bucket",
  "s3_path": "your/s3/path/",
  "encrypted": true,
  "aws_access_key_id": "your-access-key-id",
  "aws_secret_access_key": "your-secret-access-key",
  "aws_region": "us-east-1",
  "expire_older_than": "1 day",
  "delete_older_than": "1 day",
  "parquet_version": 2,
  "max_temp_directory_size": "100GB",
  "install_extensions": ["spatial"],
  "load_extensions": ["spatial"],
  "connect_stack": [
    "INSTALL httpfs;",
    "LOAD httpfs;",
    "INSTALL ducklake;",
    "LOAD ducklake;"
  ]
}
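One way to supply these settings without the UI is Airflow's JSON connection format in an AIRFLOW_CONN_<CONN_ID> environment variable. The sketch below builds such a value with the standard library only. The conn_type value "ducklake", the host name, and the credentials are placeholders assumed for illustration; check the provider for the connection type it actually registers.

```python
import json

# Extras taken from the example; optional keys are marked in comments.
extras = {
    "engine": "postgres",
    "dbname": "my_ducklake",
    "pgdbname": "dev_nophiml_db",
    "storage_type": "s3",
    "s3_bucket": "your-s3-bucket",
    "s3_path": "your/s3/path/",
    "encrypted": True,
    "install_extensions": ["spatial"],  # optional
    "load_extensions": ["spatial"],  # optional
}

connection = {
    "conn_type": "ducklake",  # assumption: the registered conn_type may differ
    "host": "metadata-db.example.com",  # placeholder metadata host
    "login": "ducklake_user",  # placeholder
    "password": "change-me",  # placeholder
    "schema": "duckdb",
    "extra": extras,
}

# Airflow resolves the connection id "ducklake_default" from this variable name.
env_name = "AIRFLOW_CONN_DUCKLAKE_DEFAULT"
env_value = json.dumps(connection)
print(f"export {env_name}='{env_value}'")
```

Serializing with json.dumps also guarantees the extras are valid JSON, which rules out stray comments or trailing commas.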
Supported Engines (set in extras['engine'])
- duckdb: Requires 'metadata_file' in extras or host as file path.
- sqlite: Requires 'metadata_file' in extras or host as file path.
- postgres: Requires host, login, password, and 'pgdbname' in extras.
- mysql: Requires host, login, password, and 'mysqldbname' in extras.
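The per-engine requirements above can be checked before a DAG ever runs. The helper below is a hypothetical sketch, not part of the provider: missing_engine_settings is an invented name, and it only encodes the rules listed in this section.

```python
from typing import List, Mapping, Optional


def missing_engine_settings(
    extras: Mapping[str, object],
    host: Optional[str] = None,
    login: Optional[str] = None,
    password: Optional[str] = None,
) -> List[str]:
    """Return the names of settings the selected engine still needs."""
    engine = extras.get("engine")
    if engine in ("duckdb", "sqlite"):
        # Either extras['metadata_file'] or a file path in Host is enough.
        if extras.get("metadata_file") or host:
            return []
        return ["metadata_file or host"]
    if engine in ("postgres", "mysql"):
        db_key = "pgdbname" if engine == "postgres" else "mysqldbname"
        candidates = {
            "host": host,
            "login": login,
            "password": password,
            db_key: extras.get(db_key),
        }
        return [name for name, value in candidates.items() if not value]
    raise ValueError(f"unsupported engine: {engine!r}")


print(missing_engine_settings({"engine": "mysql"}, host="db.example.com"))
```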
Supported Storage Types (set in extras['storage_type'], default 's3')
- s3: Requires 's3_bucket', 's3_path'; optional AWS creds.
- azure: Requires 'azure_account_name', 'azure_container', 'azure_path'; optional 'azure_connection_string'.
- gcs: Requires 'gcs_bucket', 'gcs_path'; optional service_account_key (JSON string).
- local: Requires 'local_data_path'.
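The storage requirements lend themselves to a lookup table. The following is an illustrative sketch with invented names (REQUIRED_STORAGE_KEYS, missing_storage_keys); it mirrors the list above, including the 's3' default, and makes no claim about how the hook validates extras internally.

```python
from typing import Dict, List, Mapping, Tuple

# Required extras keys per storage type, copied from the list above.
REQUIRED_STORAGE_KEYS: Dict[str, Tuple[str, ...]] = {
    "s3": ("s3_bucket", "s3_path"),
    "azure": ("azure_account_name", "azure_container", "azure_path"),
    "gcs": ("gcs_bucket", "gcs_path"),
    "local": ("local_data_path",),
}


def missing_storage_keys(extras: Mapping[str, object]) -> List[str]:
    """Return required keys that are absent or empty for the chosen storage type."""
    storage_type = str(extras.get("storage_type", "s3"))  # documented default
    if storage_type not in REQUIRED_STORAGE_KEYS:
        raise ValueError(f"unsupported storage_type: {storage_type!r}")
    return [key for key in REQUIRED_STORAGE_KEYS[storage_type] if not extras.get(key)]


print(missing_storage_keys({"s3_bucket": "your-s3-bucket"}))
```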
The Airflow connection UI shows only the core fields. Set engine and storage_type in extras, then provide the keys that the selected engine and storage type require.
If you need to customize the static DuckLake connection commands (for example to install additional extensions),
provide a connect_stack list in extras. Commands that depend on runtime variables (secrets, thread settings,
attachments, etc.) are always appended automatically by the hook.
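The ordering described here (static commands first, runtime-dependent commands appended afterwards) can be pictured roughly as follows. This is an illustration of the described behavior, not the hook's code: build_connect_commands and ASSUMED_DEFAULT_STACK are invented names, and the default stack shown is a guess.

```python
from typing import List, Mapping, Sequence

# Assumption: the hook's real default install/load commands may differ.
ASSUMED_DEFAULT_STACK: List[str] = ["INSTALL ducklake;", "LOAD ducklake;"]


def build_connect_commands(
    extras: Mapping[str, object],
    runtime_commands: Sequence[str],
) -> List[str]:
    """Use connect_stack from extras if present, then append runtime commands."""
    static = extras.get("connect_stack") or ASSUMED_DEFAULT_STACK
    return list(static) + list(runtime_commands)


custom = {
    "connect_stack": [
        "INSTALL httpfs;",
        "LOAD httpfs;",
        "INSTALL ducklake;",
        "LOAD ducklake;",
    ]
}
# The runtime part stands in for whatever the hook appends (secrets, threads, ATTACH).
for command in build_connect_commands(custom, ["SET threads=8;"]):
    print(command)
```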
Performance and Resource Controls
The hook exposes a few knobs for tuning concurrency and memory usage:
- threads: (int/string) Overrides DuckDB's worker thread count. Non-numeric/blank values are ignored and the default of 4 is used.
- memory_limit: (string) A DuckDB-formatted limit such as "4GB" or "512MB". If provided, this always wins.
- memory_plan: ("conservative", "midtier", "aggressive") Lets the hook auto-size memory_limit based on available RAM. Defaults to "midtier" if not configured.
- max_temp_directory_size: (string) A DuckDB-formatted spill limit such as "100GB". If provided, the hook issues SET max_temp_directory_size=... before queries run.
- encrypted: (bool/string) Adds ENCRYPTED to the DuckLake ATTACH options. Defaults to true.
- expire_older_than: (string) Sets DuckLake's global expiry window after attach. Defaults to "1 day".
- delete_older_than: (string) Sets DuckLake's global delete retention window after attach. Defaults to "1 day".
- parquet_version: (int/string) Sets DuckLake's parquet writer version after attach. Defaults to 2.
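Because threads and encrypted accept either native values or strings, some normalization has to happen before they reach DuckDB. A minimal sketch of that idea follows. The function names, the treatment of zero or negative thread counts, and the accepted spellings of a true string are all assumptions, not the hook's documented behavior.

```python
def resolve_threads(value: object, default: int = 4) -> int:
    """Return a positive thread count, falling back to the documented default of 4."""
    try:
        threads = int(str(value).strip())
    except ValueError:
        # Blank, None, or non-numeric input is ignored.
        return default
    # Assumption: zero or negative counts are treated like invalid input.
    return threads if threads > 0 else default


def resolve_encrypted(value: object, default: bool = True) -> bool:
    """Interpret a bool or string flag; the accepted spellings are an assumption."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


print(resolve_threads("16"), resolve_threads(""), resolve_encrypted("false"))
```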
When memory_limit is omitted, the hook estimates available physical memory (using psutil, /proc/meminfo, POSIX sysconf, or Windows APIs), multiplies it by the selected plan's fraction, and clamps the result within defined min/max bounds. This ensures the hook never grabs more than the machine can spare and still caps the limit at a sane maximum. If the machine's free memory cannot be determined, DuckDB's default memory settings are used.
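The sizing logic can be sketched as a pure function. The plan fractions and the min/max bounds below are invented for illustration, since this README does not state the real values, and the caller is assumed to supply the available-memory estimate.

```python
from typing import Optional

# Hypothetical values: the provider's actual fractions and bounds are not documented here.
PLAN_FRACTIONS = {"conservative": 0.25, "midtier": 0.50, "aggressive": 0.75}
MIN_LIMIT_MB = 512
MAX_LIMIT_MB = 64_000


def resolve_memory_limit(
    memory_limit: Optional[str],
    memory_plan: Optional[str] = "midtier",
    available_bytes: Optional[int] = None,
) -> Optional[str]:
    """Return a DuckDB memory_limit string, or None to keep DuckDB's default."""
    if memory_limit:
        # An explicit limit always wins over the plan.
        return memory_limit
    if not available_bytes:
        # Available memory unknown: leave DuckDB's default in place.
        return None
    fraction = PLAN_FRACTIONS.get(memory_plan or "midtier", PLAN_FRACTIONS["midtier"])
    target_mb = int(available_bytes * fraction) // 1_000_000
    clamped_mb = max(MIN_LIMIT_MB, min(target_mb, MAX_LIMIT_MB))
    return f"{clamped_mb}MB"


print(resolve_memory_limit(None, "midtier", 16_000_000_000))
```

Keeping the memory probe outside the function makes the arithmetic easy to unit test with fixed byte counts.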
After the DuckLake catalog is attached, the hook also executes these defaults unless you override them in connection extras:
CALL <dbname>.set_option('expire_older_than', '1 day')
CALL <dbname>.set_option('delete_older_than', '1 day')
CALL <dbname>.set_option('parquet_version', 2)
By default the ATTACH statement also includes ENCRYPTED.
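To see how overrides in extras would change these statements, the sketch below renders the three CALLs from an extras dict. post_attach_statements is a hypothetical helper, not the hook's API, and it assumes dbname is a plain SQL identifier that needs no quoting.

```python
from typing import List, Mapping


def _sql_literal(text: object) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(text).replace("'", "''") + "'"


def post_attach_statements(dbname: str, extras: Mapping[str, object]) -> List[str]:
    """Render the set_option calls, using the documented defaults when a key is absent."""
    expire = extras.get("expire_older_than", "1 day")
    delete = extras.get("delete_older_than", "1 day")
    parquet_version = int(str(extras.get("parquet_version", 2)))  # accepts int or string
    return [
        f"CALL {dbname}.set_option('expire_older_than', {_sql_literal(expire)})",
        f"CALL {dbname}.set_option('delete_older_than', {_sql_literal(delete)})",
        f"CALL {dbname}.set_option('parquet_version', {parquet_version})",
    ]


for statement in post_attach_statements("my_ducklake", {"expire_older_than": "7 days"}):
    print(statement)
```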
You can also pass these parameters directly when instantiating the hook in a DAG:
from ducklake_provider.hooks.ducklake_hook import DuckLakeHook

hook = DuckLakeHook(
    ducklake_conn_id="ducklake_default",
    memory_plan="conservative",  # or set memory_limit="6GB"
    max_temp_directory_size="100GB",
    threads=8,
)
File details
Details for the file airflow_provider_ducklake-0.0.11.tar.gz.
File metadata
- Download URL: airflow_provider_ducklake-0.0.11.tar.gz
- Upload date:
- Size: 13.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7132a4ee8129b6b70766ab5f9294df5d3da5da5175068b36d069935f87062cb1 |
| MD5 | 26c00caba083a162693b7f696f6eb96d |
| BLAKE2b-256 | 407729da4ded19236aa582cbd8a73e006945bf2110582b1777b4507770fd52be |
File details
Details for the file airflow_provider_ducklake-0.0.11-py3-none-any.whl.
File metadata
- Download URL: airflow_provider_ducklake-0.0.11-py3-none-any.whl
- Upload date:
- Size: 13.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f598404699ea9701f540ad3e4b84054900fb8664009498f80c9d4277450257ae |
| MD5 | 71dd5a32d9b75d89c9599571bbf9c7c7 |
| BLAKE2b-256 | c5bcc65d06d6241d3acb7b19d9c463f7fd53b7a8b672efa74a1e708a93df24c6 |