
Apache Airflow provider for managing Reverse ETL syncs and Profiles runs in RudderStack.

Project description

The Customer Data Platform for Developers

Website · Slack Community


RudderStack Airflow Provider

The RudderStack Airflow Provider lets you programmatically schedule and trigger your Reverse ETL syncs and Profiles runs from outside RudderStack and integrate them with your existing Airflow workflows. Refer to the orchestration docs for details.

Installation

pip install rudderstack-airflow-provider

Usage

RudderstackRETLOperator

[!NOTE]
Use RudderstackRETLOperator for Reverse ETL connections.

A simple DAG for triggering syncs for a RudderStack Reverse ETL source:

# Import paths below are assumed; verify them against your installed provider version.
from datetime import datetime, timedelta

from airflow import DAG
from rudder_airflow_provider.operators.rudderstack import RudderstackRETLOperator

with DAG(
    "rudderstack-retl-sample",
    default_args=default_args,
    description="A simple tutorial DAG for Reverse ETL",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["rs-retl"],
) as dag:
    # retl_connection_id and sync_type are template fields
    rs_operator = RudderstackRETLOperator(
        retl_connection_id="<retl connection id>",
        task_id="<a unique, meaningful id for the airflow task>",
        connection_id="<rudderstack api airflow connection id>",
    )

For the complete code, refer to this example.

Mandatory parameters for RudderstackRETLOperator:

  • retl_connection_id: The ID of the Reverse ETL connection whose sync should be triggered.
  • connection_id: The Airflow connection to use for connecting to the RudderStack API. Default value is rudderstack_default.
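Airflow can resolve a connection ID from an `AIRFLOW_CONN_<CONN_ID>` environment variable holding a connection URI. A minimal sketch of supplying `rudderstack_default` that way; the URI layout (API host, access token in the password slot) is an assumption here, so confirm the expected fields in the provider documentation:

```python
import os

# Airflow looks up connection "rudderstack_default" in the env var
# AIRFLOW_CONN_RUDDERSTACK_DEFAULT (conn ID upper-cased). The field layout
# used below is an assumption, not taken from the provider's docs.
conn_id = "rudderstack_default"
env_var = f"AIRFLOW_CONN_{conn_id.upper()}"
os.environ[env_var] = "https://:<your-access-token>@api.rudderstack.com"

print(env_var)  # AIRFLOW_CONN_RUDDERSTACK_DEFAULT
```

Setting the connection via environment variable keeps the token out of the Airflow metadata database; a connection created in the Airflow UI works just as well.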

RudderstackRETLOperator exposes other configurable parameters as well; the default values are recommended in most cases.

  • request_max_retries: The maximum number of times requests to the RudderStack API should be retried before failing.
  • request_retry_delay: Time (in seconds) to wait between each request retry.
  • request_timeout: Time (in seconds) after which requests to RudderStack are declared timed out.
  • poll_interval: Interval (in seconds) between polls for the status of the triggered job.
  • poll_timeout: Time (in seconds) after which polling for the triggered job is declared timed out.
  • wait_for_completion: Whether the operator should poll and wait until the sync completes. Default value is True.
  • sync_type: Type of sync to trigger: incremental or full. Defaults to None, in which case RudderStack determines the sync type.
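Together, poll_interval and poll_timeout describe an ordinary polling loop. A minimal pure-Python sketch of that semantics; `get_status` is a hypothetical stand-in for the RudderStack status endpoint, not the provider's actual client:

```python
import time

def wait_for_sync(get_status, poll_interval=10, poll_timeout=300):
    """Poll get_status() every poll_interval seconds until it reports a
    terminal state, or raise TimeoutError after poll_timeout seconds.
    get_status is a hypothetical callable; the provider's real client differs."""
    deadline = time.monotonic() + poll_timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("finished", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"sync did not finish within {poll_timeout} seconds")

# Example: a fake status source that reaches a terminal state on the third poll.
states = iter(["running", "running", "finished"])
result = wait_for_sync(lambda: next(states), poll_interval=0, poll_timeout=5)
```

With wait_for_completion=True the operator's task stays running (and occupies a worker slot) for the whole window bounded by poll_timeout.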

RudderstackProfilesOperator

RudderstackProfilesOperator can be used to trigger Profiles runs. A simple DAG for triggering runs of a Profiles project:

# Import paths below are assumed; verify them against your installed provider version.
from datetime import datetime, timedelta

from airflow import DAG
from rudder_airflow_provider.operators.rudderstack import RudderstackProfilesOperator

with DAG(
    "rudderstack-profiles-sample",
    default_args=default_args,
    description="A simple tutorial DAG for Profiles runs.",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["rs-profiles"],
) as dag:
    # profile_id is a template field
    rs_operator = RudderstackProfilesOperator(
        profile_id="<profile_id>",
        task_id="<a unique, meaningful id for the airflow task>",
        connection_id="<rudderstack api airflow connection id>",
    )

Mandatory parameters for RudderstackProfilesOperator:

  • profile_id: The ID of the Profiles project to run.
  • connection_id: The Airflow connection to use for connecting to the RudderStack API. Default value is rudderstack_default.

RudderstackProfilesOperator exposes other configurable parameters as well; the default values are recommended in most cases.

  • request_max_retries: The maximum number of times requests to the RudderStack API should be retried before failing.
  • request_retry_delay: Time (in seconds) to wait between each request retry.
  • request_timeout: Time (in seconds) after which requests to RudderStack are declared timed out.
  • poll_interval: Interval (in seconds) between polls for the status of the triggered job.
  • poll_timeout: Time (in seconds) after which polling for the triggered job is declared timed out.
  • wait_for_completion: Whether the operator should poll and wait until the run completes. Default value is True.
  • parameters: Additional parameters to pass to the profiles run command, as supported by the API endpoint. Default value is None.
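The request_max_retries / request_retry_delay pair describes a plain fixed-delay retry policy. A self-contained sketch of that behavior; `request` here is a stand-in callable, not the provider's HTTP client, and whether the provider counts the first attempt as a "retry" is an assumption:

```python
import time

def call_with_retries(request, request_max_retries=3, request_retry_delay=1):
    # One initial attempt plus up to request_max_retries retries, sleeping
    # request_retry_delay seconds between attempts.
    last_error = None
    for attempt in range(request_max_retries + 1):
        try:
            return request()
        except Exception as exc:
            last_error = exc
            if attempt < request_max_retries:
                time.sleep(request_retry_delay)
    raise last_error

# Example: a fake request that fails twice, then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = call_with_retries(flaky, request_max_retries=3, request_retry_delay=0)
```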

RudderstackETLOperator

RudderstackETLOperator can be used to trigger ETL sync runs. A simple DAG for triggering an ETL sync:

# Import paths below are assumed; verify them against your installed provider version.
from datetime import datetime, timedelta

from airflow import DAG
from rudder_airflow_provider.operators.rudderstack import RudderstackETLOperator

with DAG(
    "rudderstack-etl-sample",
    default_args=default_args,
    description="A simple tutorial DAG for ETL syncs.",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["rs-etl"],
) as dag:
    # etl_source_id is a template field
    rs_operator = RudderstackETLOperator(
        etl_source_id="<etl_source_id>",
        task_id="<a unique, meaningful id for the airflow task>",
        connection_id="<rudderstack api airflow connection id>",
    )

Mandatory parameters for RudderstackETLOperator:

  • etl_source_id: The ID of the ETL source whose sync should be triggered.
  • connection_id: The Airflow connection to use for connecting to the RudderStack API. Default value is rudderstack_default.

RudderstackETLOperator exposes other configurable parameters as well; the default values are recommended in most cases.

  • request_max_retries: The maximum number of times requests to the RudderStack API should be retried before failing.
  • request_retry_delay: Time (in seconds) to wait between each request retry.
  • request_timeout: Time (in seconds) after which requests to RudderStack are declared timed out.
  • poll_interval: Interval (in seconds) between polls for the status of the triggered job.
  • poll_timeout: Time (in seconds) after which polling for the triggered job is declared timed out.
  • wait_for_completion: Whether the operator should poll and wait until the sync completes. Default value is True.
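Taken together, these knobs bound the operator's worst-case runtime: roughly the fully-retried request time plus the polling window. A small sketch of that arithmetic; the formula is a rough upper-bound model for capacity planning, not taken from the provider's source:

```python
def worst_case_seconds(request_max_retries, request_retry_delay,
                       request_timeout, poll_timeout):
    # Pessimistic estimate: every attempt times out, each retry waits the
    # configured delay, and polling runs all the way to poll_timeout.
    attempts = request_max_retries + 1
    request_budget = (attempts * request_timeout
                      + request_max_retries * request_retry_delay)
    return request_budget + poll_timeout

estimate = worst_case_seconds(request_max_retries=3, request_retry_delay=2,
                              request_timeout=30, poll_timeout=600)
# 4 * 30 + 3 * 2 + 600 = 726
```

Estimates like this help pick Airflow task-level execution_timeout values that do not fire before the operator's own timeouts do.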

Contribute

We would love to see you contribute to this project. Get more information on how to contribute here.

License

The RudderStack Airflow Provider is released under the MIT License.

Contact Us

For more information or queries on this feature, you can contact us or start a conversation in our Slack community.

Download files

Download the file for your platform.

Source Distribution

rudderstack_airflow_provider-2.3.0.tar.gz (10.5 kB)


Built Distribution


rudderstack_airflow_provider-2.3.0-py3-none-any.whl (12.0 kB)


File details

Details for the file rudderstack_airflow_provider-2.3.0.tar.gz.

File metadata

File hashes

Hashes for rudderstack_airflow_provider-2.3.0.tar.gz
  • SHA256: cf3b715f54f3d88d0eee6b4bf664301dcddd2d67bf4ee642806dbd251ef2b374
  • MD5: 0ef9d466d7d35c5f9745970c0b853428
  • BLAKE2b-256: f2f4036e2a4bde6640201a0bf60b30d78e47d23c5cad9f5bd3f3670f4adf5c06

See more details on using hashes here.
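Published hashes like the ones above can be checked locally with Python's standard hashlib module. This sketch hashes in-memory bytes for brevity; the chunked pattern in the comment applies to a downloaded sdist or wheel opened in binary mode:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    # For a real downloaded file, hash it in chunks instead:
    #   h = hashlib.sha256()
    #   with open(path, "rb") as f:
    #       for chunk in iter(lambda: f.read(8192), b""):
    #           h.update(chunk)
    #   return h.hexdigest()
    return hashlib.sha256(data).hexdigest()

digest = sha256_hex(b"example payload")
# Compare `digest` against the SHA256 value published on PyPI before installing.
```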

Provenance

The following attestation bundles were made for rudderstack_airflow_provider-2.3.0.tar.gz:

Publisher: release.yaml on rudderlabs/rudder-airflow-provider

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rudderstack_airflow_provider-2.3.0-py3-none-any.whl.

File metadata

File hashes

Hashes for rudderstack_airflow_provider-2.3.0-py3-none-any.whl
  • SHA256: 4332fa1456a10466e2b29517c8e2df11176a4747cc18773a23bc581954513e6d
  • MD5: 4c5cc7f6284f71c9a978c41500626f4e
  • BLAKE2b-256: 94a01a301da35f1d899448a0b80e08aac3600b134a96d889633cee62c137f4db

See more details on using hashes here.

Provenance

The following attestation bundles were made for rudderstack_airflow_provider-2.3.0-py3-none-any.whl:

Publisher: release.yaml on rudderlabs/rudder-airflow-provider

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
