Skip to main content

A CLI to work with DataHub metadata

Project description

DataHub Metadata Ingestion

Python version 3.6+

This module hosts an extensible Python-based metadata ingestion system for DataHub. This supports sending data to DataHub using Kafka or through the REST API. It can be used through our CLI tool, with an orchestrator like Airflow, or as a library.

Getting Started

Prerequisites

Before running any metadata ingestion job, you should make sure that DataHub backend services are all running. If you are trying this out locally, the easiest way to do that is through quickstart Docker images.

Install from PyPI

The folks over at Acryl maintain a PyPI package for DataHub metadata ingestion.

# Requires Python 3.6+
python3 -m pip install --upgrade pip==20.2.4 wheel setuptools
python3 -m pip uninstall datahub acryl-datahub || true  # sanity check - ok if it fails
python3 -m pip install acryl-datahub
datahub version
# If you see "command not found", try running this instead: python3 -m datahub version

If you run into an error, try checking the common setup issues.

Installing Plugins

We use a plugin architecture so that you can install only the dependencies you actually need.

Plugin Name Install Command Provides
file included by default File source and sink
console included by default Console sink
athena pip install 'acryl-datahub[athena]' AWS Athena source
bigquery pip install 'acryl-datahub[bigquery]' BigQuery source
glue pip install 'acryl-datahub[glue]' AWS Glue source
hive pip install 'acryl-datahub[hive]' Hive source
mssql pip install 'acryl-datahub[mssql]' SQL Server source
mysql pip install 'acryl-datahub[mysql]' MySQL source
postgres pip install 'acryl-datahub[postgres]' Postgres source
oracle pip install 'acryl-datahub[oracle]' Oracle source
snowflake pip install 'acryl-datahub[snowflake]' Snowflake source
mongodb pip install 'acryl-datahub[mongodb]' MongoDB source
ldap pip install 'acryl-datahub[ldap]' (extra requirements) LDAP source
kakfa pip install 'acryl-datahub[kafka]' Kafka source
druid pip install 'acryl-datahub[druid]' Druid Source
dbt no additional dependencies DBT source
datahub-rest pip install 'acryl-datahub[datahub-rest]' DataHub sink over REST API
datahub-kafka pip install 'acryl-datahub[datahub-kafka]' DataHub sink over Kafka

These plugins can be mixed and matched as desired. For example:

pip install 'acryl-datahub[bigquery,datahub-rest]'

You can check the active plugins:

datahub check plugins

Basic Usage

pip install 'acryl-datahub[datahub-rest]'  # install the required plugin
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml

Install using Docker

Docker Hub datahub-ingestion docker

If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container. We have prebuilt images available on Docker hub. All plugins will be installed and enabled automatically.

Limitation: the datahub_docker.sh convenience script assumes that the recipe and any input/output files are accessible in the current working directory or its subdirectories. Files outside the current working directory will not be found, and you'll need to invoke the Docker image directly.

./scripts/datahub_docker.sh ingest -c ./examples/recipes/example_to_datahub_rest.yml

Install from source

If you'd like to install from source, see the developer guide.

Usage within Airflow

We have also included a couple sample DAGs that can be used with Airflow.

  • generic_recipe_sample_dag.py - a simple Airflow DAG that picks up a DataHub ingestion recipe configuration and runs it.
  • mysql_sample_dag.py - an Airflow DAG that runs a MySQL metadata ingestion pipeline using an inlined configuration.

Recipes

A recipe is a configuration file that tells our ingestion scripts where to pull data from (source) and where to put it (sink). Here's a simple example that pulls metadata from MSSQL and puts it into datahub.

# A sample recipe that pulls metadata from MSSQL and puts it into DataHub
# using the Rest API.
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

We automatically expand environment variables in the config, similar to variable substitution in GNU bash or in docker-compose files. For details, see https://docs.docker.com/compose/compose-file/compose-file-v2/#variable-substitution.

Running a recipe is quite easy.

datahub ingest -c ./examples/recipes/mssql_to_datahub.yml

A number of recipes are included in the examples/recipes directory.

Sources

Kafka Metadata kafka

Extracts:

  • List of topics - from the Kafka broker
  • Schemas associated with each topic - from the schema registry
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "broker:9092"
      schema_registry_url: http://localhost:8081
      consumer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/index.html#deserializingconsumer

MySQL Metadata mysql

Extracts:

  • List of databases and tables
  • Column types and schema associated with each table
source:
  type: mysql
  config:
    username: root
    password: example
    database: dbname
    host_port: localhost:3306
    table_pattern:
      deny:
        # Note that the deny patterns take precedence over the allow patterns.
        - "performance_schema"
      allow:
        - "schema1.table2"
      # Although the 'table_pattern' enables you to skip everything from certain schemas,
      # having another option to allow/deny on schema level is an optimization for the case when there is a large number
      # of schemas that one wants to skip and you want to avoid the time to needlessly fetch those tables only to filter
      # them out afterwards via the table_pattern.
    schema_pattern:
      deny:
        - "garbage_schema"
      allow:
        - "schema1"

Microsoft SQL Server Metadata mssql

Extracts:

  • List of databases, schema, and tables
  • Column types associated with each table
source:
  type: mssql
  config:
    username: user
    password: pass
    host_port: localhost:1433
    database: DemoDatabase
    table_pattern:
      deny:
        - "^.*\\.sys_.*" # deny all tables that start with sys_
      allow:
        - "schema1.table1"
        - "schema1.table2"
    options:
      # Any options specified here will be passed to SQLAlchemy's create_engine as kwargs.
      # See https://docs.sqlalchemy.org/en/14/core/engines.html for details.
      charset: "utf8"

Hive hive

Extracts:

  • List of databases, schema, and tables
  • Column types associated with each table
source:
  type: hive
  config:
    username: user
    password: pass
    host_port: localhost:10000
    database: DemoDatabase
    # table_pattern/schema_pattern is same as above
    # options is same as above

PostgreSQL postgres

Extracts:

  • List of databases, schema, and tables
  • Column types associated with each table
  • Also supports PostGIS extensions
source:
  type: postgres
  config:
    username: user
    password: pass
    host_port: localhost:5432
    database: DemoDatabase
    # table_pattern/schema_pattern is same as above
    # options is same as above

Snowflake snowflake

Extracts:

  • List of databases, schema, and tables
  • Column types associated with each table
source:
  type: snowflake
  config:
    username: user
    password: pass
    host_port: account_name
    # table_pattern/schema_pattern is same as above
    # options is same as above

Oracle oracle

Extracts:

  • List of databases, schema, and tables
  • Column types associated with each table
source:
  type: oracle
  config:
    # For more details on authentication, see the documentation:
    # https://docs.sqlalchemy.org/en/14/dialects/oracle.html#dialect-oracle-cx_oracle-connect and
    # https://cx-oracle.readthedocs.io/en/latest/user_guide/connection_handling.html#connection-strings.
    username: user
    password: pass
    host_port: localhost:5432
    database: dbname
    # table_pattern/schema_pattern is same as above
    # options is same as above

Google BigQuery bigquery

Extracts:

  • List of databases, schema, and tables
  • Column types associated with each table
source:
  type: bigquery
  config:
    project_id: project # optional - can autodetect from environment
    dataset: dataset_name
    options: # options is same as above
      # See https://github.com/mxmzdlv/pybigquery#authentication for details.
      credentials_path: "/path/to/keyfile.json" # optional
    # table_pattern/schema_pattern is same as above

AWS Athena athena

Extracts:

  • List of databases and tables
  • Column types associated with each table
source:
  type: athena
  config:
    username: aws_access_key_id # Optional. If not specified, credentials are picked up according to boto3 rules.
    # See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
    password: aws_secret_access_key # Optional.
    database: database # Optional, defaults to "default"
    aws_region: aws_region_name # i.e. "eu-west-1"
    s3_staging_dir: s3_location # "s3://<bucket-name>/prefix/"
    # The s3_staging_dir parameter is needed because Athena always writes query results to S3.
    # See https://docs.aws.amazon.com/athena/latest/ug/querying.html
    # However, the athena driver will transparently fetch these results as you would expect from any other sql client.
    work_group: athena_workgroup # "primary"
    # table_pattern/schema_pattern is same as above

AWS Glue glue

Extracts:

  • List of tables
  • Column types associated with each table
  • Table metadata, such as owner, description and parameters
source:
  type: glue
  config:
    aws_region: aws_region_name # i.e. "eu-west-1"
    env: environment used for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP". # Optional, defaults to "PROD".
    database_pattern: # Optional, to filter databases scanned, same as schema_pattern above.
    table_pattern: # Optional, to filter tables scanned, same as table_pattern above.
    aws_access_key_id # Optional. If not specified, credentials are picked up according to boto3 rules.
    # See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
    aws_secret_access_key # Optional.
    aws_session_token # Optional.

Druid druid

Extracts:

  • List of databases, schema, and tables
  • Column types associated with each table

Note It is important to define a explicitly define deny schema pattern for internal druid databases (lookup & sys) if adding a schema pattern otherwise the crawler may crash before processing relevant databases. This deny pattern is defined by default but is overriden by user-submitted configurations

source:
  type: druid
  config:
    # Point to broker address
    host_port: localhost:8082
    schema_pattern:
      deny:
        - "^(lookup|sys).*"
    # options is same as above

MongoDB mongodb

Extracts:

  • List of databases
  • List of collections in each database
source:
  type: "mongodb"
  config:
    # For advanced configurations, see the MongoDB docs.
    # https://pymongo.readthedocs.io/en/stable/examples/authentication.html
    connect_uri: "mongodb://localhost"
    username: admin
    password: password
    authMechanism: "DEFAULT"
    options: {}
    database_pattern: {}
    collection_pattern: {}
    # database_pattern/collection_pattern are similar to schema_pattern/table_pattern from above

LDAP ldap

Extracts:

  • List of people
  • Names, emails, titles, and manager information for each person
source:
  type: "ldap"
  config:
    ldap_server: ldap://localhost
    ldap_user: "cn=admin,dc=example,dc=org"
    ldap_password: "admin"
    base_dn: "dc=example,dc=org"
    filter: "(objectClass=*)" # optional field

File file

Pulls metadata from a previously generated file. Note that the file sink can produce such files, and a number of samples are included in the examples/mce_files directory.

source:
  type: file
  filename: ./path/to/mce/file.json

DBT dbt

Pull metadata from DBT output files:

  • dbt manifest file
    • This file contains model, source and lineage data.
  • dbt catalog file
    • This file contains schema data.
    • DBT does not record schema data for Ephemeral models, as such datahub will show Ephemeral models in the lineage, however there will be no associated schema for Ephemeral models
source:
  type: "dbt"
  config:
    manifest_path: "./path/dbt/manifest_file.json"
    catalog_path: "./path/dbt/catalog_file.json"

Sinks

DataHub Rest datahub-rest

Pushes metadata to DataHub using the GMA rest API. The advantage of the rest-based interface is that any errors can immediately be reported.

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

DataHub Kafka datahub-kafka

Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based interface is that it's asynchronous and can handle higher throughput. This requires the Datahub mce-consumer container to be running.

sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "localhost:9092"
      producer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/index.html#serializingproducer

Console console

Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes.

sink:
  type: "console"

File file

Outputs metadata to a file. This can be used to decouple metadata sourcing from the process of pushing it into DataHub, and is particularly useful for debugging purposes. Note that the file source can read files generated by this sink.

sink:
  type: file
  filename: ./path/to/mce/file.json

Using as a library

In some cases, you might want to construct the MetadataChangeEvents yourself but still use this framework to emit that metadata to DataHub. In this case, take a look at the emitter interfaces, which can easily be imported and called from your own code.

For a basic usage example, see the lineage_emitter.py example.

Developing

See the developing guide.

Project details


Release history Release notifications | RSS feed

This version

0.1.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

acryl-datahub-0.1.0.tar.gz (65.5 kB view hashes)

Uploaded Source

Built Distribution

acryl_datahub-0.1.0-py3-none-any.whl (103.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page