Apache Airflow provider for managing Reverse ETL syncs and Profiles runs in RudderStack.
The Customer Data Platform for Developers
RudderStack Airflow Provider
The RudderStack Airflow Provider lets you programmatically schedule and trigger your Reverse ETL syncs and Profiles runs outside RudderStack and integrate them with your existing Airflow workflows. Refer to orchestration docs.
Installation
pip install rudderstack-airflow-provider
Usage
RudderstackRETLOperator
[!NOTE]
Use RudderstackRETLOperator for reverse ETL connections
A simple DAG for triggering syncs for a RudderStack Reverse ETL source:
with DAG(
    "rudderstack-retl-sample",
    default_args=default_args,
    description="A simple tutorial DAG for reverse etl",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["rs-retl"],
) as dag:
    # retl_connection_id, sync_type are template fields
    rs_operator = RudderstackRETLOperator(
        retl_connection_id="connection_id",
        task_id="<a unique, meaningful id for the airflow task>",
        connection_id="<rudderstack api airflow connection id>"
    )
For the complete code, refer to this example.
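The DAG snippet above references default_args without defining it. A minimal sketch of such a dictionary follows; the keys are standard Airflow task arguments, while the values are hypothetical and should be adjusted to your deployment.

```python
from datetime import timedelta

# Hypothetical values for illustration only; they are not taken from the
# provider's example file.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "retries": 1,                         # Airflow-level task retries
    "retry_delay": timedelta(minutes=5),  # wait between task retries
}
```

Note that these Airflow-level retries re-run the whole task, whereas request_max_retries (described below) applies to individual requests to the RudderStack API.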
Mandatory parameters for RudderstackRETLOperator:
- retl_connection_id: The Reverse ETL connection ID for the sync job.
- connection_id: The Airflow connection to use for connecting to the RudderStack API. Default value is rudderstack_default.
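One way to supply the Airflow connection named by connection_id is an environment variable, which Airflow reads when it is named AIRFLOW_CONN_ followed by the upper-cased connection ID. The sketch below builds such a variable in Airflow's JSON connection format. It assumes the provider reads the API base URL from the connection's host field and the access token from its password field; the host value and the helper name connection_env are also assumptions, so check them against the orchestration docs.

```python
import json
import os


def connection_env(conn_id, host, access_token):
    """Return the (name, value) pair for an Airflow connection env variable."""
    name = f"AIRFLOW_CONN_{conn_id.upper()}"
    # Assumption: an HTTP-type connection with the token stored as password.
    value = json.dumps(
        {"conn_type": "http", "host": host, "password": access_token}
    )
    return name, value


name, value = connection_env(
    "rudderstack_default",
    "https://api.rudderstack.com",  # assumed API base URL
    "<your access token>",
)
os.environ[name] = value  # must be set before the Airflow task runs
```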
RudderstackRETLOperator also exposes other configurable parameters. In most cases, the default values are recommended.
- request_max_retries: The maximum number of times a request to the RudderStack API is retried before failing.
- request_retry_delay: Time (in seconds) to wait between request retries.
- request_timeout: Time (in seconds) after which a request to RudderStack is declared timed out.
- poll_interval: Time (in seconds) between status polls of the triggered job.
- poll_timeout: Time (in seconds) after which polling for a triggered job is declared timed out.
- wait_for_completion: Boolean indicating whether the task should poll and wait until the sync completes. Default value is True.
- sync_type: Type of sync to trigger: incremental or full. Default is None, in which case RudderStack determines the sync type.
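As a sketch of how the parameters above might be assembled, the hypothetical helper below collects keyword arguments for RudderstackRETLOperator and rejects a sync_type other than the documented values. The helper name and the early validation are illustrative assumptions, not part of the provider.

```python
ALLOWED_SYNC_TYPES = {None, "incremental", "full"}


def retl_operator_kwargs(retl_connection_id, sync_type=None, **overrides):
    """Build keyword arguments for RudderstackRETLOperator (hypothetical helper)."""
    if sync_type not in ALLOWED_SYNC_TYPES:
        raise ValueError(f"sync_type must be incremental, full or None, got {sync_type!r}")
    kwargs = {
        "retl_connection_id": retl_connection_id,
        "connection_id": "rudderstack_default",  # documented default
        "wait_for_completion": True,              # documented default
        "sync_type": sync_type,                   # None lets RudderStack decide
    }
    # Optional tuning such as poll_interval or request_timeout.
    kwargs.update(overrides)
    return kwargs
```

Inside a DAG, the result could be passed along with a task ID, for example RudderstackRETLOperator(task_id="retl_sync", **retl_operator_kwargs("<retl connection id>", sync_type="full", poll_interval=30)).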
RudderstackProfilesOperator
RudderstackProfilesOperator can be used to trigger a Profiles run. A simple DAG for triggering runs of a Profiles project:
with DAG(
    "rudderstack-profiles-sample",
    default_args=default_args,
    description="A simple tutorial DAG for profiles run.",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["rs-profiles"],
) as dag:
    # profile_id is a template field
    rs_operator = RudderstackProfilesOperator(
        profile_id="<profile_id>",
        task_id="<a unique, meaningful id for the airflow task>",
        connection_id="<rudderstack api connection id>",
    )
Mandatory parameters for RudderstackProfilesOperator:
- profile_id: The ID of the Profiles project to run.
- connection_id: The Airflow connection to use for connecting to the RudderStack API. Default value is rudderstack_default.
RudderstackProfilesOperator also exposes other configurable parameters. In most cases, the default values are recommended.
- request_max_retries: The maximum number of times a request to the RudderStack API is retried before failing.
- request_retry_delay: Time (in seconds) to wait between request retries.
- request_timeout: Time (in seconds) after which a request to RudderStack is declared timed out.
- poll_interval: Time (in seconds) between status polls of the triggered job.
- poll_timeout: Time (in seconds) after which polling for a triggered job is declared timed out.
- wait_for_completion: Boolean indicating whether the task should poll and wait until the run completes. Default value is True.
- parameters: Additional parameters to pass to the profiles run command, as supported by the API endpoint. Default value is None.
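To illustrate how poll_interval and poll_timeout interact when wait_for_completion is True, here is a generic polling loop. It is a sketch of the general pattern only, not the provider's implementation; the function name, the status strings, and the default values are assumptions.

```python
import time


def wait_for_run(get_status, poll_interval=10, poll_timeout=None,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal status or until poll_timeout elapses."""
    start = clock()
    while True:
        status = get_status()
        if status in ("finished", "failed"):  # hypothetical terminal states
            return status
        if poll_timeout is not None and clock() - start >= poll_timeout:
            raise TimeoutError("polling timed out before the run completed")
        sleep(poll_interval)  # wait before asking for the status again
```

A shorter poll_interval detects completion sooner at the cost of more API requests, while poll_timeout bounds how long the task can occupy a worker slot.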
RudderstackETLOperator
RudderstackETLOperator can be used to trigger ETL sync runs. A simple DAG for triggering an ETL sync:
with DAG(
    "rudderstack-etl-sample",
    default_args=default_args,
    description="A simple tutorial DAG for etl sync.",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["rs-etl"],
) as dag:
    # etl_source_id is a template field
    rs_operator = RudderstackETLOperator(
        etl_source_id="<etl_source_id>",
        task_id="<a unique, meaningful id for the airflow task>",
        connection_id="<rudderstack api connection id>",
    )
Mandatory parameters for RudderstackETLOperator:
- etl_source_id: The source ID of the ETL source.
- connection_id: The Airflow connection to use for connecting to the RudderStack API. Default value is rudderstack_default.
RudderstackETLOperator also exposes other configurable parameters. In most cases, the default values are recommended.
- request_max_retries: The maximum number of times a request to the RudderStack API is retried before failing.
- request_retry_delay: Time (in seconds) to wait between request retries.
- request_timeout: Time (in seconds) after which a request to RudderStack is declared timed out.
- poll_interval: Time (in seconds) between status polls of the triggered job.
- poll_timeout: Time (in seconds) after which polling for a triggered job is declared timed out.
- wait_for_completion: Boolean indicating whether the task should poll and wait until the sync completes. Default value is True.
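The semantics of request_max_retries and request_retry_delay can be pictured with a generic retry wrapper. This is a sketch of the usual pattern under assumed defaults, not the provider's own code; call_with_retries and its default values are hypothetical.

```python
import time


def call_with_retries(request, request_max_retries=3, request_retry_delay=1,
                      sleep=time.sleep):
    """Call request(), retrying up to request_max_retries times on failure."""
    attempt = 0
    while True:
        try:
            return request()
        except Exception:
            if attempt >= request_max_retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(request_retry_delay)  # pause before the next attempt
```

Under this reading, a request is attempted at most request_max_retries + 1 times in total: once initially, plus the allowed retries.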
Contribute
We would love to see you contribute to this project. Get more information on how to contribute here.
License
The RudderStack Airflow Provider is released under the MIT License.
Contact Us
For more information or queries on this feature, you can contact us or start a conversation in our Slack community.