FastETL custom Apache Airflow provider package.

The FastETL framework: modern, versatile, does almost everything.

This text is also available in Portuguese: 🇧🇷LEIAME.md.


FastETL is a package of Airflow plugins for building data pipelines that cover a number of common scenarios.

Main features:

  • Full or incremental replication of tables in SQL Server, Postgres and MySQL databases
  • Loading data from GSheets and from spreadsheets on Samba/Windows networks
  • Extracting CSV from SQL
  • Cleaning data using custom data patching tasks (e.g. for messy geographical coordinates, mapping canonical values for columns, etc.)
  • Using an Open Source Routing Machine (OSRM) service to calculate route distances
  • Using CKAN or dados.gov.br's API to update dataset metadata
  • Using Frictionless Tabular Data Packages to write data dictionaries in OpenDocument Text format

This framework is maintained by a network of developers from many teams at the Ministry of Management and Innovation in Public Services. It is the cumulative result of using Apache Airflow, a free and open source tool, since 2019.

For government: FastETL is widely used for replication of data queried via Quartzo (DaaS) from Serpro.

Installation in Airflow

FastETL follows the standards for Airflow plugins. To install it, add the apache-airflow-providers-fastetl package to the Python dependencies of your Airflow environment.

Or install it with pip:

pip install apache-airflow-providers-fastetl
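After installing, it can help to confirm that the distribution is visible to the Python interpreter that Airflow runs on. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and is not part of FastETL.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("apache-airflow-providers-fastetl")
    # Print the version string, or a hint when the package is missing.
    print(version or "apache-airflow-providers-fastetl is not installed")
```

Run it with the same interpreter (or inside the same container) that runs your Airflow workers, since that is the environment where the package must be importable.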

To see an example of an Apache Airflow container that uses FastETL, check out the airflow2-docker repository.

Make sure the msodbcsql17 (Microsoft ODBC Driver 17 for SQL Server) and unixodbc-dev libraries are installed on your Apache Airflow workers; without them FastETL may not work correctly.
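One way to check a worker for the driver is to list the ODBC drivers registered with unixODBC. The sketch below assumes that pyodbc is available on the worker, which this README does not state, and that msodbcsql17 registers itself under the name "ODBC Driver 17 for SQL Server". The helper names are hypothetical.

```python
# Name under which msodbcsql17 registers its driver (assumption noted above).
EXPECTED_DRIVER = "ODBC Driver 17 for SQL Server"


def list_odbc_drivers():
    """Return the ODBC driver names pyodbc can see, or [] if pyodbc is unavailable."""
    try:
        import pyodbc  # assumption: pyodbc is installed on the worker
    except ImportError:
        return []
    return list(pyodbc.drivers())


def has_driver(drivers, expected=EXPECTED_DRIVER):
    """Tell whether the expected driver name is among the listed drivers."""
    return expected in drivers


if __name__ == "__main__":
    drivers = list_odbc_drivers()
    if has_driver(drivers):
        print(f"Found {EXPECTED_DRIVER}")
    else:
        print(f"{EXPECTED_DRIVER} not found; drivers seen: {drivers}")
```

An empty list can mean either that pyodbc itself is missing or that no driver is registered, so treat a negative result as a prompt to inspect the worker image rather than as a definitive diagnosis.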

Tests

The test suite uses Docker containers to simulate a complete usage environment, including Airflow and the databases. For that reason, you need to install Docker and docker-compose before running the tests.

For instructions on how to do this, see the official Docker documentation.

To build the containers:

make setup

To run the tests, use:

make setup && make tests

To shut down the environment, use:

make down

Usage examples

The main FastETL feature is the DbToDbOperator operator. It copies data between PostgreSQL and SQL Server (MSSQL) databases. MySQL is also supported as a source.

Here is an example:

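from datetime import datetime

from airflow import DAG
from fastetl.operators.db_to_db_operator import DbToDbOperator

# Placeholder values for illustration only: replace them with the Airflow
# connection ids, schemas and table that exist in your environment.
airflow_source_conn_id = "source_db"
airflow_dest_conn_id = "destination_db"
source_schema = "public"
dest_schema = "dbo"
table_name = "my_table"

default_args = {
    "start_date": datetime(2023, 4, 1),
}

# No schedule: the DAG runs only when triggered manually.
dag = DAG(
    "copy_db_to_db_example",
    default_args=default_args,
    schedule_interval=None,
)

# Copy one table from the source database to the destination database.
t0 = DbToDbOperator(
    task_id="copy_data",
    source={
        "conn_id": airflow_source_conn_id,
        "schema": source_schema,
        "table": table_name,
    },
    destination={
        "conn_id": airflow_dest_conn_id,
        "schema": dest_schema,
        "table": table_name,
    },
    destination_truncate=True,
    copy_table_comments=True,
    chunksize=10000,
    dag=dag,
)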
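The destination_truncate and chunksize arguments suggest a "clear the destination, then copy in batches" pattern. The sketch below illustrates that general idea with the standard-library sqlite3 module so that it runs anywhere. It is not FastETL's implementation, and the function name `copy_table_in_chunks` is hypothetical.

```python
import sqlite3


def copy_table_in_chunks(source, destination, table, chunksize=10000, truncate=True):
    """Copy every row of `table` from source to destination in batches.

    `table` is interpolated into the SQL text, so it must be a trusted name.
    Returns the number of rows copied.
    """
    cursor = source.execute(f"SELECT * FROM {table}")
    placeholders = ", ".join("?" for _ in cursor.description)
    if truncate:
        # SQLite has no TRUNCATE statement; DELETE without WHERE empties the table.
        destination.execute(f"DELETE FROM {table}")
    copied = 0
    while True:
        rows = cursor.fetchmany(chunksize)  # at most `chunksize` rows in memory
        if not rows:
            break
        destination.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})", rows
        )
        copied += len(rows)
    destination.commit()
    return copied


def demo():
    """Copy 25 rows between two in-memory databases, 10 rows at a time."""
    src = sqlite3.connect(":memory:")
    dst = sqlite3.connect(":memory:")
    for conn in (src, dst):
        conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
    src.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [(i, f"name-{i}") for i in range(25)],
    )
    return copy_table_in_chunks(src, dst, "people", chunksize=10)


if __name__ == "__main__":
    print(f"Copied {demo()} rows")
```

Reading with fetchmany keeps memory use bounded by the chunk size rather than by the table size, which is the usual reason for exposing a chunk-size setting in a copy operation.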

More details about the parameters and the workings of DbToDbOperator can be seen in the following files:

How to contribute

To be written in the CONTRIBUTING.md document (issue #4).
