Apache Airflow Provider for Vineyard
The Apache Airflow provider for Vineyard contains components for sharing intermediate data among tasks in Airflow workflows using Vineyard.

Vineyard works as an XCom backend for Airflow workers, allowing large-scale data objects to be transferred between tasks even when they cannot fit into Airflow's database backend, and without involving external storage systems like HDFS. The Vineyard XCom backend also handles object migration when the required inputs are not located where the task is scheduled to execute.
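The core idea of such an XCom backend can be illustrated with a minimal, self-contained sketch: only a small reference is kept in Airflow's metadata database, while the real payload lives in an external object store. This is NOT the actual `VineyardXCom` implementation; `ObjectStore` below is a hypothetical in-memory stand-in for the vineyard daemon.

```python
import uuid


class ObjectStore:
    """Hypothetical in-memory stand-in for a vineyard instance."""

    def __init__(self):
        self._objects = {}

    def put(self, value):
        # Store the payload and hand back a short object id.
        object_id = uuid.uuid4().hex
        self._objects[object_id] = value
        return object_id

    def get(self, object_id):
        return self._objects[object_id]


store = ObjectStore()


def serialize_value(value):
    # Only the short object id would go into Airflow's XCom table.
    return store.put(value)


def deserialize_value(object_id):
    # The consuming task fetches the real payload from the object store.
    return store.get(object_id)
```

The payload never touches the metadata database; a task downstream resolves the id back to the full object.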
Requirements
The following packages are needed to run Airflow on Vineyard:
- airflow >= 2.1.0
- vineyard >= 0.2.12
Configuration and Usage
- Install the required packages:

  ```shell
  pip3 install airflow-provider-vineyard
  ```
- Configure Vineyard locally

  The vineyard server can be launched locally with the following command:

  ```shell
  python3 -m vineyard --socket=/tmp/vineyard.sock
  ```

  See also our documentation about launching vineyard.
- Configure Airflow to use the vineyard XCom backend by setting the environment variable

  ```shell
  export AIRFLOW__CORE__XCOM_BACKEND=vineyard.contrib.airflow.xcom.VineyardXCom
  ```

  and configure the location of the UNIX-domain IPC socket for the vineyard client with

  ```shell
  export AIRFLOW__VINEYARD__IPC_SOCKET=/tmp/vineyard.sock
  ```

  or

  ```shell
  export VINEYARD_IPC_SOCKET=/tmp/vineyard.sock
  ```
- Launch your Airflow scheduler and workers, and run the following DAG as an example:
  ```python
  import numpy as np
  import pandas as pd

  from airflow.decorators import dag, task
  from airflow.utils.dates import days_ago

  default_args = {
      'owner': 'airflow',
  }

  @dag(default_args=default_args, schedule_interval=None, start_date=days_ago(2), tags=['example'])
  def taskflow_etl_pandas():
      @task()
      def extract():
          order_data_dict = pd.DataFrame({
              'a': np.random.rand(100000),
              'b': np.random.rand(100000)
          })
          return order_data_dict

      @task(multiple_outputs=True)
      def transform(order_data_dict: dict):
          return {"total_order_value": order_data_dict["a"].sum()}

      @task()
      def load(total_order_value: float):
          print(f"Total order value is: {total_order_value:.2f}")

      order_data = extract()
      order_summary = transform(order_data)
      load(order_summary["total_order_value"])

  taskflow_etl_pandas_dag = taskflow_etl_pandas()
  ```
In the above example, the tasks `extract` and `transform` share a `pandas.DataFrame` as intermediate data, which is impossible with the default XCom backend, as it cannot be pickled into the XCom table and, when the data is large, cannot fit into Airflow's backend database at all.

The example is adapted from the Airflow documentation; see also the Tutorial on the Taskflow API.
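The two socket environment variables from the configuration section can be illustrated with a small lookup helper. This is a hedged sketch only: the actual lookup order in the provider is an assumption here (the Airflow-namespaced variable is assumed to take precedence), and `resolve_ipc_socket` is a hypothetical helper, not part of the provider's API.

```python
import os


def resolve_ipc_socket(env=None):
    """Resolve the vineyard IPC socket path from the environment.

    Assumption: AIRFLOW__VINEYARD__IPC_SOCKET is preferred over the
    plain VINEYARD_IPC_SOCKET; consult the provider source for the
    authoritative precedence.
    """
    env = os.environ if env is None else env
    for key in ('AIRFLOW__VINEYARD__IPC_SOCKET', 'VINEYARD_IPC_SOCKET'):
        value = env.get(key)
        if value:
            return value
    raise RuntimeError('no vineyard IPC socket configured')
```

Either variable alone is sufficient; setting both is harmless since they point the client at the same daemon socket.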
Run the tests
- Start your vineyardd with the following command:

  ```shell
  python3 -m vineyard
  ```
- Configure Airflow to use the vineyard XCom backend, and run the tests with pytest:

  ```shell
  export AIRFLOW__CORE__XCOM_BACKEND=vineyard.contrib.airflow.xcom.VineyardXCom
  pytest -s -vvv python/vineyard/contrib/airflow/tests/test_python_dag.py
  pytest -s -vvv python/vineyard/contrib/airflow/tests/test_pandas_dag.py
  ```
The pandas test suite cannot run with the default XCom backend; vineyard enables Airflow to exchange complex and large data without modifying the DAG and its tasks!