An action framework for working with real-time DataHub changes.

⚡ DataHub Actions Framework

Welcome to DataHub Actions! The Actions framework makes it easy to respond to real-time changes in your Metadata Graph, so you can integrate DataHub into a broader events-based architecture.

For a detailed introduction, check out the original announcement of the DataHub Actions Framework at the DataHub April 2022 Town Hall. For a more in-depth look at use cases and concepts, check out DataHub Actions Concepts.

Quickstart

To get started right away, check out the DataHub Actions Quickstart Guide.

Prerequisites

The DataHub Actions CLI commands are an extension of the base datahub CLI commands. We recommend first installing the datahub CLI:

python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub --version

Note that the Actions Framework requires acryl-datahub >= v0.8.34.
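If you would rather check the version requirement from a script than read it off `datahub --version`, a minimal sketch follows. It relies only on the standard library's `importlib.metadata`; the helper names (`parse_version`, `meets_minimum`, `installed_datahub_ok`) are hypothetical and not part of the datahub CLI. It compares only the leading numeric components, so pre-release suffixes such as `rc1` are ignored.

```python
from importlib import metadata
import re

MINIMUM = (0, 8, 34)  # taken from the note above: acryl-datahub >= v0.8.34


def parse_version(text):
    """Return the leading dotted integers of a version string as a tuple."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version_text, minimum=MINIMUM):
    """Compare versions numerically, so that 0.10.0 sorts after 0.8.34."""
    return parse_version(version_text) >= minimum


def installed_datahub_ok():
    """Check the installed acryl-datahub distribution, if there is one."""
    try:
        return meets_minimum(metadata.version("acryl-datahub"))
    except metadata.PackageNotFoundError:
        return False  # acryl-datahub is not installed in this environment
```

Comparing tuples of integers rather than raw strings matters here, because a plain string comparison would rank "0.10.0" below "0.8.34".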

Installation

Next, install the acryl-datahub-actions package from PyPI:

python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub-actions
datahub actions version

Configuring an Action

Actions are configured using a YAML file, in much the same way as DataHub ingestion sources. An action configuration file consists of the following:

  1. Action Pipeline Name (Should be unique and static)
  2. Source Configurations
  3. Transform + Filter Configurations
  4. Action Configuration
  5. Pipeline Options (Optional)
  6. DataHub API configs (Optional - required for select actions)

Each component is independently pluggable and configurable.

# 1. Required: Action Pipeline Name
name: <action-pipeline-name>

# 2. Required: Event Source - Where to source events from.
source:
  type: <source-type>
  config:
    # Event Source specific configs (map)

# 3a. Optional: Filter to run on events (map)
filter: 
  event_type: <filtered-event-type>
  event:
    # Filter event fields by exact-match
    <filtered-event-fields>

# 3b. Optional: Custom Transformers to run on events (array)
transform:
  - type: <transformer-type>
    config: 
      # Transformer-specific configs (map)

# 4. Required: Action - What action to take on events. 
action:
  type: <action-type>
  config:
    # Action-specific configs (map)

# 5. Optional: Additional pipeline options (error handling, etc)
options: 
  retry_count: 0 # The number of times to retry an Action with the same event. (If an exception is thrown). 0 by default. 
  failure_mode: "CONTINUE" # What to do when an event fails to be processed. Either 'CONTINUE' to make progress or 'THROW' to stop the pipeline. Either way, the failed event will be logged to a failed_events.log file. 
  failed_events_dir: "/tmp/datahub/actions"  # The directory in which to write a failed_events.log file that tracks events which fail to be processed. Defaults to "/tmp/logs/datahub/actions". 

# 6. Optional: DataHub API configuration
datahub:
  server: "http://localhost:8080" # Location of DataHub API
  # token: <your-access-token> # Required if Metadata Service Auth enabled

Example: Hello World

A simple configuration file for a "Hello World" action, which prints every event it receives, is:

# 1. Action Pipeline Name
name: "hello_world"
# 2. Event Source: Where to source events from.
source:
  type: "kafka"
  config:
    connection:
      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
# 3. Action: What action to take on events. 
action:
  type: "hello_world"
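The connection values above use the shell-style `${NAME:-default}` form: the environment variable wins when it is set and non-empty, and the text after `:-` is used otherwise. The sketch below only illustrates those semantics in plain Python. The helper `expand_env_defaults` is hypothetical, and how the framework itself performs this expansion is not described here.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}; the default may itself contain colons.
_PATTERN = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)


def expand_env_defaults(text, environ=None):
    """Replace ${NAME} and ${NAME:-default} using shell-style rules."""
    env = os.environ if environ is None else environ

    def substitute(match):
        value = env.get(match.group("name"))
        if value:  # set and non-empty: the environment wins
            return value
        default = match.group("default")
        # ":-" falls back when the variable is unset or empty
        return default if default is not None else ""

    return _PATTERN.sub(substitute, text)
```

For example, with no KAFKA_BOOTSTRAP_SERVER in the environment, `expand_env_defaults("${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}", {})` yields `localhost:9092`, so the same file works against a local quickstart and a deployed cluster.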

We can modify this configuration further to filter for specific events, by adding a "filter" block.

# 1. Action Pipeline Name
name: "hello_world"

# 2. Event Source - Where to source events from.
source:
  type: "kafka"
  config:
    connection:
      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}

# 3. Filter - Filter events that reach the Action
filter:
  event_type: "EntityChangeEvent_v1"
  event:
    category: "TAG"
    operation: "ADD"
    modifier: "urn:li:tag:pii"

# 4. Action - What action to take on events. 
action:
  type: "hello_world"
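The filter above can be read as: the event type must equal `event_type`, and every field listed under `event` must equal the corresponding field of the event exactly. A minimal sketch of that reading follows. The function `matches_filter` is hypothetical, events are assumed to be flat dictionaries, and the framework's real filter may handle nested fields or other cases differently.

```python
def matches_filter(event_type, event, filter_config):
    """Return True when an event passes a filter block like the one above."""
    wanted_type = filter_config.get("event_type")
    if wanted_type is not None and wanted_type != event_type:
        return False
    # Every listed field must match exactly; unlisted fields are ignored.
    for field, expected in (filter_config.get("event") or {}).items():
        if event.get(field) != expected:
            return False
    return True


# The filter block from the configuration above, written as a dictionary.
PII_TAG_ADDED = {
    "event_type": "EntityChangeEvent_v1",
    "event": {
        "category": "TAG",
        "operation": "ADD",
        "modifier": "urn:li:tag:pii",
    },
}
```

Under these assumptions, an event that adds a different tag, or removes the pii tag, never reaches the action, while extra fields on a matching event make no difference.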

Running an Action

To run a new Action, use the actions CLI command:

datahub actions -c <config.yml>

Once the Action is running, you will see:

Action Pipeline with name '<action-pipeline-name>' is now running.

Running multiple Actions

You can run multiple Action pipelines with a single command. Provide one config file per pipeline by repeating the "-c" command line argument.

For example,

datahub actions -c <config-1.yaml> -c <config-2.yaml>

Running in debug mode

Simply append the --debug flag to the CLI to run your action in debug mode.

datahub actions -c <config.yaml> --debug
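If you launch pipelines from a script, the two invocations above can be assembled programmatically. The sketch below only builds the argument list. The helper `build_actions_command` is hypothetical, and combining several "-c" arguments with --debug in one command is an assumption based on the separate examples above.

```python
def build_actions_command(config_paths, debug=False):
    """Build the argument list for `datahub actions`, one -c per config."""
    if not config_paths:
        raise ValueError("at least one config file is required")
    command = ["datahub", "actions"]
    for path in config_paths:
        command += ["-c", path]  # restate -c for every pipeline config
    if debug:
        command.append("--debug")
    return command
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with config paths that contain spaces.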

Stopping an Action

Just issue a Control-C as usual. You should see the Actions Pipeline shut down gracefully, with a small summary of processing results.

Actions Pipeline with name '<action-pipeline-name>' has been stopped.

Supported Events

Two event types are currently supported. Read more about them below.

Supported Event Sources

Currently, the only event source that is officially supported is kafka, which polls for events via a Kafka Consumer.

Supported Actions

By default, DataHub supports a set of standard action plugins. These can be found inside the folder src/datahub-actions/plugins.

Some pre-included Actions include

Development

Build and Test

Note that all actions commands are also available through a separate datahub-actions CLI entry point. Feel free to use it during development.

# Build datahub-actions module
./gradlew datahub-actions:build

# Drop into virtual env
cd datahub-actions && source venv/bin/activate 

# Start hello world action 
datahub-actions actions -c ../examples/hello_world.yaml

# Start ingestion executor action
datahub-actions actions -c ../examples/executor.yaml

# Start multiple actions 
datahub-actions actions -c ../examples/executor.yaml -c ../examples/hello_world.yaml

Developing a Transformer

To develop a new Transformer, check out the Developing a Transformer guide.

Developing an Action

To develop a new Action, check out the Developing an Action guide.

Contributing

Contributing guidelines follow those of the main DataHub project. We are accepting contributions for Actions, Transformers, and general framework improvements (tests, error handling, etc).

Resources

Check out the original announcement of the DataHub Actions Framework at the DataHub April 2022 Town Hall.

License

Apache 2.0
