AccelData Torch Airflow SDK.

These details have not been verified by PyPI

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

TORCH-AIRFLOW-SDK

Torch airflow sdk provides support for observability of airflow dags in torch catalog. With the use of torch airflow sdk, user can e2e observability on airflow dag run in torch UI. Every dag is associated with pipeline in torch.
Make sure while configuring airflow, 3 environmental needs to be set up in airflow env. docker container.

TORCH_CATALOG_URL - URL of the torch catalog
TORCH_ACCESS_KEY - API access key generated from torch UI
TORCH_SECRET_KEY - API secret key generated from torch UI

First of all, install below mentioned 2 pypi package to expose ETL in torch.

pip install torch-sdk

Create Pipeline

First of all, to create a pipeline in torch, the user needs to create a pipeline using torch-sdk. To know more about pipeline, check torch-sdk documentation that contains detailed information about torch pipeline usage.

from torch_sdk.models.pipeline import CreatePipeline, PipelineMetadata
from torch_sdk.torch_client import TorchClient

torchClient = TorchClient(url="https://torch.acceldata.local:5443",
                       access_key="OY2VVIN2N6LJ", secret_key="da6bDBimQfXSMsyyhlPVJJfk7Zc2gs")

pipeline = CreatePipeline(
    uid='customer.orders.monthly.agg',
    name='Customer Orders Monthly aggregate',
    description='Pipeline to Aggregate the customer orders over 1 year',
    meta=PipelineMetadata(
        owner='vaishvik', team='torch', codeLocation='...'),
    context={
        'associated_tables': 'pipeline.customer, pipeline.orders, pipeline.customer_orders, pipeline.customer_orders_monthly_agg'}
)

pipeline_res = torchClient.create_pipeline(pipeline=pipeline)

Create DAG

This changed in version 0.0.30
In airflow DAG code, import torch dag instead of airflow dag. All the parameters will be the same as standard apache airflow dag. But there will be 2 additional parameters override_success_callback, override_failure_callback. override_success_callback can be set to True if we do not want the pipeline run to be ended at the end of the successful run of the DAG. Similarly, override_failure_callback can be set to True if we do not want the pipeline run to be ended at the end of the unsuccessful run of the DAG. These can be useful if few steps of the pipeline are being executed outside of Airflow DAG.

from torch_airflow_sdk.dag import DAG
dag = DAG(
   dag_id='pipeline_demo_final',
   schedule_interval='@daily',
   default_args=default_args,
   start_date=datetime(2020, 2, 2),
   catchup=False,
   on_failure_callback= failure_callback,
   on_success_callback= success_callback,
   override_success_callback=False,
   override_failure_callback=False,
)

Create Job and Span using decorator

This was added in version 0.0.30
To create a job and span in the pipeline, the user needs to decorate the python function with a job_span decorator as shown in the below example. Pass some required parameters (job uid, metadata object and input-output asset list) inside the decorator. Make sure, inside a Dataset the type of the object which will have source (data source name inside torch catalog) and asset_uid (asset path from its root) as parameters. span_uid and xcom_to_event_mapper_ids are optional parameters used to create span. If span_uid is not passed a span corresponding to the job will not be created.

from torch_airflow_sdk.decorators.job_span import job_span
from torch_sdk.models.job import JobMetadata, Dataset
@job_span(job_uid='monthly.order.aggregate.job',
   inputs=[Dataset('POSTGRES_LOCAL_DS', 'pipeline.pipeline.customer_orders')],
   outputs=[Dataset('POSTGRES_LOCAL_DS', 'pipeline.pipeline.customer_orders_monthly_agg')],
   metadata=JobMetadata(name = 'Vaishvik_brahmbhatt', team = 'backend', code_location ='https://github.com/acme/reporting/report.scala'),
   span_uid='customer.orders.datagen.span',
   xcom_to_event_mapper_ids = ['run_id', 'event_id']
   )
def monthly_order_aggregate(**context):
    pass

Create Job using decorator

This changed in version 0.0.30
To create a job in the pipeline, the user needs to decorate the python function with a job decorator as shown in the below example. Pass some required parameters (job uid, metadata object and input-output asset list) inside the decorator. Make sure, inside a Dataset the type of the object which will have source (data source name inside torch catalog) and asset_uid (asset path from its root) as parameters.

from torch_airflow_sdk.decorators.job import job
from torch_sdk.models.job import JobMetadata, Dataset
@job(job_uid='monthly.order.aggregate.job',
   inputs=[Dataset('POSTGRES_LOCAL_DS', 'pipeline.pipeline.customer_orders')],
   outputs=[Dataset('POSTGRES_LOCAL_DS', 'pipeline.pipeline.customer_orders_monthly_agg')],
   metadata=JobMetadata(name = 'Vaishvik_brahmbhatt', team = 'backend', code_location ='https://github.com/acme/reporting/report.scala')
   )
def monthly_order_aggregate(**context):
    pass

Create Span Using Decorator

This changed in version 0.0.30
To create a span for a python function, the user can decorate a python function with a span decorator that contains span uid as parameters. To decorate function with span make sure, it has **context parameter inside the function argument. That gives access to the context of the task. Using the context, various span events can be sent inside the function. To get the parent span context, use the key name span_context_parent in xcom pull of the task instance. It’s value will be span context instance which can be used to create child spans and send custom events (As shown in below example.)

from torch_airflow_sdk.decorators.span import span
from torch_sdk.events.generic_event import GenericEvent
@span(span_uid='customer.orders.datagen.span',
      associated_job_uids = ['monthly.order.aggregate.transfer'],  xcom_to_event_mapper_ids = ['run_id', 'event_id'] )
def data_gen(**context):
   datagen_span_context = context['span_context_parent']
   customer_datagen_span = datagen_span_context.create_child_span(
       uid="customer.data.gen", 
      context_data= {'client_time': str(datetime.now()) }
   )
   customer_datagen_span.send_event(
      GenericEvent(
         context_data={
            'client_time': str(datetime.now()), 
            'row_count': len(rows)
         }, 
         event_uid="order.customer.join.result"
      )
   )
   customer_datagen_span.end(
       context_data={'client_time': str(datetime.now()), 'customers_count': len(customer_ids) }
   )

Custom Operators

Torch airflow sdk contains 4 custom operators.

TorchInitializer Operator :

This changed in version 0.0.30
The user needs to add a task with a given operator at the root of your dag. This operator will create a new pipeline. Additionally, this will create new pipeline run and root span for thar run for each dag run of the airflow dag. Those will not be created if create_pipeline is set to False. This can be useful if pipeline/pipeline run has been created outside of Airflow DAG. We can pass root span name for the pipeline using span_name parameter.

from torch_airflow_sdk.operators.torch_initialiser_operator import TorchInitializer

torch_initializer_task = TorchInitializer(
   task_id='torch_pipeline_initializer',
   pipeline_uid='customer.orders.monthly.agg.demo',
   pipeline_name='CUSTOMERS ORDERS MOTHLY AGG',
   create_pipeline=True,
   span_name='customer.orders.monthly.agg.demo.span',
   dag=dag
)

SpanOperator Operator :

This changed in version 0.0.30
SpanOperator Operator will execute any std operator being passed as operator parameter and send span start and end event it. Just wrap the std operator with a span operator. Make sure that the wrapped operator is not added in the DAG. If the operator is wrapped with a span operator, the span operator will take care of that operator task inside its execution. It will have some required parameters ( span_uid : uid of the span, operator : standard operator task that needs to be wrapped with span). Other parameters will be the same as the airflow standard base operator.

WARNING: Do not specify the `dag` parameter in std airflow operator being passed as an argument to SpanOperator as the execution of operator task is taken care of by SpanOperator.

from torch_airflow_sdk.operators.span_operator import SpanOperator

get_order_agg_for_q4 = PostgresOperator(
   task_id="get_monthly_order_aggregate_last_quarter",
   postgres_conn_id='example_db',
   sql="select * from information_schema.attributess",
)

get_order_agg_for_q4 = SpanOperator(
   task_id="get_monthly_order_aggregate_last_quarter",
   span_uid='monthly.order.agg.q4.span',
   operator=get_order_agg_for_q4,
   associated_job_uids = ['monthly.order.aggregate.transfer'],  
   xcom_to_event_mapper_ids = ['run_id', 'event_id'] ,
   dag=dag
)

This changed in version 0.0.31

SyncOperator Operator :

SyncOperator is used to execute a policy by passing rule_type and rule_id. It will return only after the execution ends. Allowed values for rule_type can be used from torch_sdk.constants

from torch_airflow_sdk.operators.sync_operator import SyncOperator
import torch_sdk.constants as const

syncoperator_task = SyncOperator(
    task_id='torch_pipeline_syncop_test',
    rule_type=const.DATA_QUALITY,
    rule_id=46,
    dag=dag
)

In case you need to query the status in another task you need to pull the execution id from xcom by passing the rule name in the {rule_type}_{rule_id}_execution_id. In this example the rule type is const.DATA_QUALITY and rule_id 46 After getting the execution_id you need to call get_rule_result on torch_client by passing rule_type, execution_id.

from torch_sdk.torch_client import TorchClient
from torch_airflow_sdk.initialiser import torch_credentials
import torch_sdk.constants as const

def syncoperator_result(**context):
    xcom_key = f'{const.DATA_QUALITY}_46_execution_id'
    task_instance = context['ti']
    execution_id = task_instance.xcom_pull(key=xcom_key)
    torch_client = TorchClient(**torch_credentials)
    result = torch_client.get_rule_result(rule_type=const.DATA_QUALITY, execution_id=execution_id)

This changed in version 0.0.31

AsyncOperator Operator :

AyncOperator is used to execute a policy by passing rule_type and rule_id. It will return immediately after starting the rule execution. Allowed values for rule_type can be used from torch_sdk.constants

from torch_airflow_sdk.operators.async_operator import AsyncOperator
import torch_sdk.constants as const

syncoperator_task = AsyncOperator(
    task_id='torch_pipeline_asyncoperatordemo',
    rule_type=const.DATA_QUALITY,
    rule_id=46,
    dag=dag
)

After getting the execution_id you need to call get_rule_result on torch_client by passing rule_type, execution_id

from torch_sdk.torch_client import TorchClient
from torch_airflow_sdk.initialiser import torch_credentials
import torch_sdk.constants as const
def asyncoperator_result(**context):
    xcom_key = f'{const.DATA_QUALITY}_46_execution_id'
    task_instance = context['ti']
    # get the rule_name and execution id - then pull them in xcom
    execution_id = task_instance.xcom_pull(key=xcom_key)

    torch_client = TorchClient(**torch_credentials)
    result = torch_client.get_rule_result(rule_type=const.DATA_QUALITY, execution_id=execution_id)

If you want to get the current status call get_rule_status function

from torch_sdk.torch_client import TorchClient
from torch_airflow_sdk.initialiser import torch_credentials
import torch_sdk.constants as const
xcom_key = f'{const.DATA_QUALITY}_46_execution_id'
task_instance = context['ti']
# get the rule_name and execution id - then pull them in xcom
execution_id = task_instance.xcom_pull(key=xcom_key)
torch_client = TorchClient(**torch_credentials)
result = torch_client.get_rule_status(rule_type=const.DATA_QUALITY, execution_id=execution_id)

Version Log

0.1.0 (17/06/2021)

Torch airflow sdk - Wrapper on apache airflow
Torch airflow sdk provides support for observability of airflow dags in torch catalog. With the use of torch airflow sdk, user can e2e observability on airflow dag run in torch UI.
Support for airflow 1.0 and 2.0.

0.1.3 (2/02/2022)

Added support for pipeline creation by passing uid to the initializer operator
Job and span decorator's additional parameters
Minor bug fixes

Project details

These details have not been verified by PyPI

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

0.1.3 yanked

Feb 2, 2022

Reason this release was yanked:

Currently changes are being done in alpha builds.

0.1.2 yanked

Oct 16, 2021

Reason this release was yanked:

Currently changes are being done in alpha builds.

0.1.1 yanked

Jul 8, 2021

Reason this release was yanked:

Currently changes are being done in alpha builds.

0.1.0 yanked

Jun 17, 2021

Reason this release was yanked:

Currently changes are being done in alpha builds.

0.0.41

Sep 28, 2022

0.0.40

Aug 29, 2022

0.0.39

Aug 25, 2022

0.0.38 yanked

Aug 25, 2022

0.0.37

Jul 28, 2022

0.0.36

Jul 27, 2022

0.0.35

Jul 22, 2022

0.0.34

Jul 21, 2022

0.0.33

Jul 19, 2022

0.0.32

Jul 19, 2022

This version

0.0.31

Jul 6, 2022

0.0.30

Jul 5, 2022

0.0.29

Jun 2, 2022

0.0.28

Jun 2, 2022

0.0.27

Jun 2, 2022

0.0.26

Jun 2, 2022

0.0.25

May 31, 2022

0.0.24

Apr 14, 2022

0.0.23

Apr 14, 2022

0.0.22

Feb 2, 2022

0.0.21

Feb 2, 2022

0.0.20

Oct 19, 2021

0.0.19

Oct 19, 2021

0.0.18

Oct 19, 2021

0.0.17

Aug 18, 2021

0.0.16

Aug 18, 2021

0.0.15

Aug 18, 2021

0.0.14

Jun 17, 2021

0.0.13

Jun 3, 2021

0.0.12

Jun 3, 2021

0.0.11

Jun 2, 2021

0.0.10

Jun 2, 2021

0.0.9

Jun 1, 2021

0.0.8

Jun 1, 2021

0.0.7

Jun 1, 2021

0.0.6

Jun 1, 2021

0.0.5

May 27, 2021

0.0.4

May 21, 2021

0.0.3

May 20, 2021

0.0.2

May 20, 2021

0.0.1

May 11, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

torch_airflow_sdk-0.0.31.tar.gz (10.4 MB view hashes)

Uploaded Jul 6, 2022 Source

Hashes for torch_airflow_sdk-0.0.31.tar.gz

Hashes for torch_airflow_sdk-0.0.31.tar.gz
Algorithm	Hash digest
SHA256	`6b7d7cebb6b664ee6c3835c7665018c771fc28006b716d68dad8658cff16be13`
MD5	`d117400b500061576ab46fedfc714fa4`
BLAKE2b-256	`d3134e3a71f68ac0f662ba2540b27e5e9b3ab152400bd3c72104accab14dc7cf`