Versatile Data Kit SDK plugin that provides support for the Impala database.

Project description

This plugin allows vdk-core to interface with and execute queries against an Impala database.

Features

  • It provides a recovery mechanism that handles many common failure modes, such as eventual consistency issues in Impala. In one production deployment of VDK, it improved the Impala SLA from 95% (queries sent directly to Impala) to 99% (queries sent to Impala through VDK).
  • It automatically classifies errors based on who is best placed to handle them: the user (job owner) or the platform (infra owner).
  • It provides a default implementation of Kimball templates for generating SCD1 and SCD2 dimension tables and Periodic Snapshot Fact tables.
  • It can collect lineage data, provided a lineage logger has been supplied through the vdk-core configuration (see vdk config-help for more info).
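
The recovery and error classification features listed above can be pictured with a small sketch. This is not the plugin's actual implementation: the function names (classify_error, run_with_retries) and the message fragments used for classification are assumptions made purely for illustration. The idea is that errors the job owner must fix are raised immediately, while errors attributed to the platform are retried a few times.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical message fragments, chosen only for illustration.
# The real plugin may classify errors in a completely different way.
USER_ERROR_MARKERS = ("syntax error", "analysisexception", "authorizationexception")


def classify_error(error: Exception) -> str:
    """Return "user" if the job owner should fix the error, else "platform"."""
    message = str(error).lower()
    if any(marker in message for marker in USER_ERROR_MARKERS):
        return "user"
    return "platform"


def run_with_retries(
    action: Callable[[], T], max_attempts: int = 3, backoff_seconds: float = 0.0
) -> T:
    """Run action, retrying platform errors up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:
            # User errors will not go away on retry, so fail fast.
            if classify_error(error) == "user" or attempt == max_attempts:
                raise
            # Linear backoff before the next attempt.
            time.sleep(backoff_seconds * attempt)
    raise AssertionError("unreachable")
```

In a data job, the wrapped action could be something like lambda: job_input.execute_query(sql), although with vdk-impala installed the plugin already applies its own recovery logic.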

Usage

Run

pip install vdk-impala

After this, data jobs will have access to an Impala database connection, managed by Versatile Data Kit SDK.

If it is the only database plugin installed, vdk uses it automatically. Otherwise, users need to set the environment variable VDK_DB_DEFAULT_TYPE=IMPALA or set the 'db_default_type' option in the data job config file (config.ini).

For example:

    from vdk.api.job_input import IJobInput

    def run(job_input: IJobInput):
        job_input.execute_query("select 'Hi Impala!'")

Lineage

The package can gather lineage data for all successful Impala SQL queries that have actually read or written data. Other plugins can read the lineage data and optionally send it to a separate system. To do so, they need to provide an ILineageLogger implementation and register it in a hook like this:

    @hookimpl
    def vdk_initialize(context: CoreContext) -> None:
        context.state.set(StoreKey[ILineageLogger]("impala-lineage-logger"), MyLogger())
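
As an illustration of what a logger such as MyLogger might look like, here is a minimal sketch that keeps lineage records in memory instead of sending them to another system. It assumes that ILineageLogger requires a single send(lineage_data) method; that signature is an assumption and should be checked against the vdk-core interface. The class name InMemoryLineageLogger is hypothetical, and in a real plugin the class would subclass ILineageLogger rather than stand alone.

```python
from typing import Any, List


class InMemoryLineageLogger:
    """Collects lineage records in a list instead of forwarding them.

    Assumption: the logger interface consists of one send(lineage_data)
    method. Verify against vdk-core before relying on this.
    """

    def __init__(self) -> None:
        self.records: List[Any] = []

    def send(self, lineage_data: Any) -> None:
        # A real logger would forward the record to an external system here.
        self.records.append(lineage_data)
```

A logger like this is mostly useful in tests, where the collected records can be inspected after the job has run.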

Lineage is calculated from the profile of the executed query. The profile is retrieved via the cursor by executing an additional RPC request against the same Impala node that coordinated the query, right after the original query has finished successfully. See https://impala.apache.org/docs/build/html/topics/impala_logging.html for more information on how profiles are stored, and https://impala.apache.org/docs/build/impala-3.1.pdf for more information about the profiles themselves.

If enabled, the query plan is retrieved for every query successfully executed against Impala, excluding keepalive queries such as "Select 1".

Database Loading Templates

Kimball dimensional modeling templates

See the following tutorial for more details. It is based on Trino but the process is equivalent for Impala (only the database configuration requires change).

Data Quality Checks

Most of the processing templates support quality checks, which prevent bad data from reaching production tables. A check is a callback function passed as an optional parameter to the job_input.execute_template() method.

Example:

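    from vdk.api.job_input import IJobInput


    def run(job_input: IJobInput) -> None:

        def check(tmp_table_name: str) -> bool:
            # Implement your data quality check logic here and return True/False.
            result = True  # placeholder: replace with a real check
            return result

        # test_schema, source_view, target_table and staging_schema are
        # placeholders for names defined by your job.
        job_input.execute_template(
            template_name="load/dimension/scd1",
            template_args={
                "source_schema": test_schema,
                "source_view": source_view,
                "target_schema": test_schema,
                "target_table": target_table,
                "check": check,
                # If not provided, the checks are performed in the target_schema.
                "staging_schema": staging_schema,
            },
        )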
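
To make the check callback more concrete, here is a minimal sketch of one possible check that rejects an empty staging table. The helper name make_not_empty_check is hypothetical. The sketch assumes the callback receives the name of the temporary table and that the query function returns a list of rows, as job_input.execute_query is expected to do.

```python
from typing import Callable, List, Sequence


def make_not_empty_check(
    execute_query: Callable[[str], List[Sequence]]
) -> Callable[[str], bool]:
    """Build a check callback that passes only if the table has rows.

    execute_query is any callable that runs SQL and returns a list of rows,
    for example job_input.execute_query (assumed to behave that way).
    """

    def check(tmp_table_name: str) -> bool:
        rows = execute_query(f"SELECT COUNT(*) FROM {tmp_table_name}")
        # Fail the check if the query returned nothing or the count is zero.
        return bool(rows) and rows[0][0] > 0

    return check
```

Inside run(), the callback could then be created with check = make_not_empty_check(job_input.execute_query) and passed as the "check" template argument.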

Configuration

Run vdk config-help and search for options prefixed with "IMPALA_" to see which configuration options are available.

Disclaimer

This plugin is tested against a specific Impala version: the one used by the Impala container in docker-compose.yaml. For more information on the tested Impala version, look up that Docker image.

Testing

Testing this plugin locally requires installing the dependencies listed in vdk-plugins/vdk-impala/requirements.txt.

Run

pip install -r requirements.txt
