A package to run DuckDB queries from Apache Airflow
Airflow DuckDB on Kubernetes
DuckDB is an in-process analytical database for running analytical queries on large data sets.
Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.
Apache Airflow is not an ETL tool itself; it is a workflow scheduler used to schedule and monitor ETL jobs. Airflow users create DAGs that schedule Spark, Hive, Athena, Trino, BigQuery, and other ETL jobs to process their data.
By using DuckDB with Airflow, users can run analytical queries on large local or remote data sets and store the results without needing these ETL tools.
To use DuckDB with Airflow, users can use the PythonOperator with the DuckDB Python library, the BashOperator with the DuckDB CLI, or one of the available Airflow operators that support DuckDB (e.g. airflow-provider-duckdb, developed by Astronomer). All of these operators run in the worker pod and are limited by its resources. For that reason, some users use the Kubernetes Executor to run each task in a dedicated Kubernetes pod and request more resources when needed.
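As a rough illustration of the BashOperator route mentioned above, the shell command for the DuckDB CLI can be assembled with the standard library. This is only a sketch and not part of this package: `build_duckdb_command` is a hypothetical helper, and it assumes a `duckdb` binary is available on the worker's PATH and that its `-c` option runs a statement and exits.

```python
import shlex


def build_duckdb_command(query: str) -> str:
    """Return a shell command that runs `query` with the DuckDB CLI.

    Hypothetical helper for illustration; assumes a `duckdb` binary on PATH.
    """
    # shlex.join quotes each argument, so the quotes and spaces inside the
    # SQL text reach the CLI unchanged.
    return shlex.join(["duckdb", "-c", query])


command = build_duckdb_command(
    "SELECT MAX(col1) FROM READ_PARQUET('data.parquet');"
)
# The resulting string could be passed as `bash_command` to a BashOperator.
print(command)
```

Whatever runs this command still shares the worker's CPU and memory, which is the limitation described above.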
Setting up the Kubernetes Executor can be challenging for some users, especially maintaining the worker Docker image. This project provides an alternative way to run DuckDB with Airflow, using the KubernetesPodOperator.
How to use
The operator is based entirely on the KubernetesPodOperator, so it requires the cncf-kubernetes provider to be installed in the Airflow environment (preferably the latest version, to benefit from all its features).
Install the package
To use the operator, you need to install the package in your Airflow environment. You can install the package using pip:
pip install airflow-duckdb
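To confirm that both the package and the cncf-kubernetes provider are present in the environment, a quick check along these lines may help. It is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and `apache-airflow-providers-cncf-kubernetes` is assumed to be the distribution name under which the provider is installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Check this package and the provider it builds on.
for name in ("airflow-duckdb", "apache-airflow-providers-cncf-kubernetes"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```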
Use the operator
The operator supports all the parameters of the KubernetesPodOperator, plus some additional parameters that simplify the use of DuckDB.
Here is an example of how to use the operator:
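from airflow import DAG
from kubernetes.client import models as k8s

# DuckDBPodOperator and S3FSConfig are provided by the airflow-duckdb package.

with DAG("duckdb_dag", ...) as dag:
    DuckDBPodOperator(
        task_id="duckdb_task",
        # Read a Parquet file from S3 and compute the maximum of col1.
        query="SELECT MAX(col1) FROM READ_PARQUET('s3://my_bucket/data.parquet');",
        # Push the query result to XCom.
        do_xcom_push=True,
        # Credentials are templated from the Airflow connection "duckdb_s3".
        s3_fs_config=S3FSConfig(
            access_key_id="{{ conn.duckdb_s3.login }}",
            secret_access_key="{{ conn.duckdb_s3.password }}",
        ),
        # Resources for the pod that runs the query.
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "1", "memory": "8Gi"},
            limits={"cpu": "1", "memory": "8Gi"},
        ),
    )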
Features
The current version of the operator supports the following features:
- Running one or more DuckDB queries in a Kubernetes pod
- Configuring the pod resources (requests and limits) to run the queries
- Configuring the S3 credentials securely with a Kubernetes secret to read and write data from/to S3 (AWS S3, MinIO or GCS with S3 compatibility)
- Using Jinja templating to configure the query
- Loading the queries from a file
- Pushing the query result to XCom
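To make the "one or more queries" and "loading the queries from a file" items more concrete, here is a small sketch of reading a SQL file and splitting it into individual statements. It does not describe how the operator does this internally, which is not documented here; `split_sql_statements` is a hypothetical helper, and the splitting is deliberately naive.

```python
import tempfile
from pathlib import Path
from typing import List


def split_sql_statements(sql: str) -> List[str]:
    """Naively split SQL text on semicolons that are outside quotes.

    Hypothetical helper; it does not handle SQL comments or
    dollar-quoted strings.
    """
    statements: List[str] = []
    current: List[str] = []
    quote = None  # the quote character we are currently inside, if any
    for char in sql:
        if quote is None and char in ("'", '"'):
            quote = char
        elif char == quote:
            quote = None
        if char == ";" and quote is None:
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(char)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


# Write a throwaway SQL file, then load and split it.
with tempfile.TemporaryDirectory() as tmp:
    sql_file = Path(tmp) / "queries.sql"
    sql_file.write_text(
        "CREATE TABLE t AS SELECT 'a;b' AS col1;\nSELECT MAX(col1) FROM t;\n"
    )
    statements = split_sql_statements(sql_file.read_text())

print(statements)
```

The semicolon inside the string literal `'a;b'` is kept as part of the first statement, which is the main case a plain `str.split(";")` would get wrong.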
The project also provides a Docker image with DuckDB CLI and some extensions to use it with Airflow.
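For readers unfamiliar with Kubernetes secrets, which the S3 credentials feature listed above relies on, the following sketch builds a Secret manifest for a pair of S3 credentials. It is an illustration under assumptions rather than the operator's actual behaviour: `build_s3_secret_manifest`, the secret name, and the key names are all hypothetical.

```python
import base64
import json
from typing import Dict


def build_s3_secret_manifest(
    name: str, access_key_id: str, secret_access_key: str
) -> Dict[str, object]:
    """Build a Kubernetes Secret manifest holding S3 credentials.

    Hypothetical helper; the key names below are illustrative only.
    """

    def encode(value: str) -> str:
        # Kubernetes expects values under `data` to be base64-encoded.
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {
            "AWS_ACCESS_KEY_ID": encode(access_key_id),
            "AWS_SECRET_ACCESS_KEY": encode(secret_access_key),
        },
    }


manifest = build_s3_secret_manifest("duckdb-s3", "my-key-id", "my-secret")
print(json.dumps(manifest, indent=2))
```

Base64 is an encoding, not encryption; the benefit of a secret is that the credentials stay out of the pod spec and can be access-controlled by the cluster.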
File details
Details for the file airflow_duckdb-0.1.2.tar.gz.
File metadata
- Download URL: airflow_duckdb-0.1.2.tar.gz
- Upload date:
- Size: 8.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 62434195af038c57a9374aaf1c502a3d56ee13448bf1ef8bab69a2e41de338fc
MD5 | 0f90802cbfd0b305c4c4ee23daa61eef
BLAKE2b-256 | 18e8cdb10fed3178c55a2cb921e3c983810dc213af9de0caf36d93f3764a5f64
File details
Details for the file airflow_duckdb-0.1.2-py3-none-any.whl.
File metadata
- Download URL: airflow_duckdb-0.1.2-py3-none-any.whl
- Upload date:
- Size: 9.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6e95c0bbc8d48584cbdd7dffd5d61f2489144df6976a6ae59b9b17d8e370fe24
MD5 | 49236bb0e29daecccf2802217d1af6a1
BLAKE2b-256 | 4583f3271b42634373b110fcd1fa41accb535086d07a0c4c275d7085eb099389