
This project provides the Universal Transfer Operator to seamlessly transfer data from source to destination Datasets via a single operator invocation within DAGs.

Project description

⚠️ Discontinuation of project

This project is no longer actively maintained by Astronomer, but we’d love to see it live on in the community. While Astronomer has paused development and is not accepting new contributions, bug fixes, or releases, the code is still here for you to explore, fork, and adapt under the terms of its license. Please note that it may not work with the latest dependencies or platforms, and it could contain security vulnerabilities. Astronomer cannot offer guarantees or warranties for its use. If you’re interested in adopting or stewarding this project, we’d be happy to chat; reach us at oss@astronomer.io. Thanks for being part of the open-source journey and helping keep great ideas alive!

Universal Transfer Operator

transfers made easy


The UniversalTransferOperator simplifies how users transfer data from a source to a destination using Apache Airflow. It offers a consistent, provider-agnostic interface, so users do not have to work directly with provider-specific operators.

At the moment, it supports transferring data between file locations and databases (in both directions) and cross-database transfers.
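To make the "one operator, many dataset types" idea concrete, here is a toy sketch — deliberately not the library's real API — of how a single transfer entry point can pick a strategy purely from the source and destination dataset types:

```python
# Toy sketch (hypothetical, not the library's actual classes): one "transfer"
# entry point dispatching on dataset types, mirroring how a single
# UniversalTransferOperator call covers file->database, database->file,
# and cross-database transfers.
from dataclasses import dataclass


@dataclass
class File:
    path: str


@dataclass
class Table:
    name: str
    conn_id: str


def transfer(source, destination) -> str:
    """Pick a transfer strategy from the dataset types alone."""
    kinds = (type(source).__name__, type(destination).__name__)
    strategies = {
        ("File", "Table"): "load file into table",
        ("Table", "File"): "export table to file",
        ("Table", "Table"): "cross-database transfer",
        ("File", "File"): "file-to-file copy",
    }
    return strategies[kinds]


print(transfer(File("s3://bucket/data.csv"), Table("orders", "snowflake_default")))
# -> load file into table
```

In the real operator the same idea appears as a single task that takes a source dataset and a destination dataset, with the provider-specific work resolved internally.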

This project is maintained by Astronomer.

Prerequisites

  • Apache Airflow >= 2.4.0.

Install

The apache-airflow-provider-transfers package is available on PyPI. Use the standard Python installation tools.

To install a cloud-agnostic version of the apache-airflow-provider-transfers, run:

pip install apache-airflow-provider-transfers

You can also install dependencies for using the UniversalTransferOperator with popular cloud providers:

pip install apache-airflow-provider-transfers[amazon,google,snowflake]

Quickstart

Users can get started quickly with the following two approaches:

  1. Spinning up a local Airflow infrastructure using the open-source Astro CLI and Docker
  2. Using vanilla Airflow and Python

Run UniversalTransferOperator using astro

Run UniversalTransferOperator using vanilla Airflow and Python

  • Install Airflow and set up a project by following this documentation.

  • Ensure that your Airflow environment is set up correctly by running the following commands:

    export AIRFLOW_HOME=`pwd`
    airflow db init
    
  • Add the following in requirements.txt

    apache-airflow-provider-transfers[all]
    
  • Copy the files example_transfer_and_return_files.py and example_snowflake_transfers.py into the dags directory of your Airflow project:

https://github.com/astronomer/apache-airflow-provider-transfers/blob/04b53d780790eaa3b424458742bc89d6fbec2ccd/example_dags/example_transfer_and_return_files.py#L1-L45

https://github.com/astronomer/apache-airflow-provider-transfers/blob/a80dc84b7f33bb86ae244f79411b240f4f4c7e22/example_dags/example_snowflake_transfers.py#L1-L46

Alternatively, you can download example_transfer_and_return_files.py and example_snowflake_transfers.py.

 curl -O https://raw.githubusercontent.com/astronomer/apache-airflow-provider-transfers/main/example_dags/example_transfer_and_return_files.py

 curl -O https://raw.githubusercontent.com/astronomer/apache-airflow-provider-transfers/main/example_dags/example_snowflake_transfers.py

Example DAGs

Check out the example_dags folder for examples of how the UniversalTransferOperator can be used.

How Universal Transfer Operator Works

Approach

With UniversalTransferOperator, users can perform data transfers using the following transfer modes:

  1. Non-native
  2. Native
  3. Third-party

Non-native transfer

Non-native transfers move the data through the Airflow worker node. Chunking is applied where possible. This method can be suitable for datasets smaller than 2 GB, depending on the source and target. Its performance is highly dependent on the worker's memory, disk, processor, and network configuration.

Internally, the steps involved are:

  • Retrieve the data in chunks from the source dataset to the worker node.
  • Send the data from the worker node to the destination dataset.
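The two steps above can be sketched in plain Python. In-memory file objects stand in for the remote source and destination; the real operator's chunking is provider-specific, but the shape is the same — stream fixed-size chunks through the worker instead of materialising the whole dataset:

```python
# Minimal sketch of the non-native path: copy the source through the worker
# in fixed-size chunks. io.BytesIO objects stand in for remote storage.
import io

CHUNK_SIZE = 64 * 1024  # tune to the worker's memory budget


def transfer_in_chunks(source, destination, chunk_size=CHUNK_SIZE) -> int:
    """Copy source to destination chunk by chunk; return bytes moved."""
    moved = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        moved += len(chunk)
    return moved


src = io.BytesIO(b"x" * 200_000)  # stands in for the remote source object
dst = io.BytesIO()                # stands in for the destination
print(transfer_in_chunks(src, dst))  # -> 200000
```

Because every chunk passes through the worker, throughput is bounded by the worker's own resources — which is exactly the bottleneck the native mode avoids.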

Following is an example of non-native transfers between Google Cloud Storage and SQLite:

https://github.com/astronomer/apache-airflow-provider-transfers/blob/a80dc84b7f33bb86ae244f79411b240f4f4c7e22/example_dags/example_universal_transfer_operator.py#L68-L74

Improving bottlenecks by using native transfer

An alternative to the non-native method is a native transfer. Native transfers rely on mechanisms and tools offered by the data source or data target providers. When moving data from object storage to a Snowflake database, for instance, a native transfer consists of using the built-in COPY INTO command. When loading data from S3 to BigQuery, the Universal Transfer Operator uses the GCP Storage Transfer Service.

The benefit of native transfers is that they will likely perform better for larger datasets (over 2 GB) and do not depend on the Airflow worker node's hardware configuration. With this approach, the Airflow worker nodes act as orchestrators and do not perform the transfer themselves. The speed depends exclusively on the service being used and the bandwidth between the source and destination.

Steps:

  • Request the destination dataset to ingest data from the source dataset.
  • The destination dataset requests the data from the source dataset.

NOTE: The Native method implementation is in progress and will be available in future releases.
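Although the native implementation is still in progress, the object-storage-to-Snowflake case mentioned above boils down to the worker issuing a single COPY INTO statement and letting Snowflake pull the data itself. The helper and stage name below are hypothetical illustrations, not the operator's implementation:

```python
# Sketch of the native idea for object storage -> Snowflake: the Airflow
# worker only builds and submits a COPY INTO command; Snowflake performs the
# actual data movement. build_copy_into and "my_s3_stage" are illustrative.
def build_copy_into(table: str, stage: str, file_format: str = "CSV") -> str:
    """Render a COPY INTO statement that loads files from a named stage."""
    return (
        f"COPY INTO {table} "
        f"FROM @{stage} "
        f"FILE_FORMAT = (TYPE = {file_format})"
    )


print(build_copy_into("orders", "my_s3_stage"))
# -> COPY INTO orders FROM @my_s3_stage FILE_FORMAT = (TYPE = CSV)
```

The worker's only job here is orchestration: submit the statement over a Snowflake connection and wait for it to finish, so transfer speed no longer depends on worker hardware.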

Transfer using a third-party tool

The UniversalTransferOperator can also act as an interface to generic third-party data transfer services, such as Fivetran.

Here is an example of how to use Fivetran for transfers:

https://github.com/astronomer/apache-airflow-provider-transfers/blob/7d5188c4af214d5cdaeb714654e9bdf5b48cb3fb/example_dags/example_dag_fivetran.py#L35-L56

Supported technologies

Documentation

The documentation is a work in progress -- we aim to follow the Diátaxis system.

Changelog

The Universal Transfer Operator follows semantic versioning for releases. Check the changelog for the latest changes.

Release management

See Managing Releases to learn more about our release philosophy and steps.

Contribution guidelines

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Read the Contribution Guideline for a detailed overview of how to contribute.

Contributors and maintainers should abide by the Contributor Code of Conduct.

License

Apache License 2.0

Project details


Download files


Source Distribution

apache_airflow_provider_transfers-0.1.2.tar.gz (62.3 kB)


Built Distribution

apache_airflow_provider_transfers-0.1.2-py3-none-any.whl

File details

Details for the file apache_airflow_provider_transfers-0.1.2.tar.gz.

File metadata

File hashes

Hashes for apache_airflow_provider_transfers-0.1.2.tar.gz

SHA256: 718ccf00bd809e18f33a22d48dcb83f06302de746138aae2b001cf22dc9a0e56
MD5: c9d6cacf8e828a453519d6eab71cfb48
BLAKE2b-256: 9007a1ab906b0928616bc2eea61bcaf1d6b301cf61e7499425dc1396deadd07e


File details

Details for the file apache_airflow_provider_transfers-0.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for apache_airflow_provider_transfers-0.1.2-py3-none-any.whl

SHA256: e454a1078fb66d68891d759f7c13b017f5d4eae49a281cb5785b82c6db474a06
MD5: 3f796d7a467a0ee78ac77ae29cd922fd
BLAKE2b-256: dbeb571fc10d541a1f4b25528d5e3ce39419efbf4828c839bdfcb606c6411d7d

