This Versatile Data Kit SDK plugin is a Generative Data Pack (GDP) that expands each ingested dataset with the execution ID detected during the Data Job run.

Project description

An installed Generative Data Pack plugin automatically expands the data sent for ingestion.

This GDP plugin detects the execution ID of the running Data Job and decorates your data product with it, so that each data record can be correlated with the particular Data Job execution that ingested it.

Each ingested dataset gets automatically expanded with a Data Job execution ID micro-dimension. For example, take the following payload:

{
  "product_name": "name1",
  "product_description": "description1"
}

After installing vdk-gdp-execution-id, one additional field gets automatically appended to your payloads that are sent for ingestion:

{
  "product_name": "name1",
  "product_description": "description1",
  "vdk_gdp_execution_id": "product-ingestion-data-job-1628151700498"
}

The newly-added dimension name is configurable.
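
To make the expansion concrete, here is a minimal sketch of the idea in plain Python. It is not the plugin's actual implementation: the helper `add_execution_id` and its parameters are hypothetical, and the dimension name is passed in explicitly to reflect that it is configurable.

```python
from typing import Any, Dict, List


def add_execution_id(
    payloads: List[Dict[str, Any]],
    execution_id: str,
    dimension_name: str,
) -> List[Dict[str, Any]]:
    """Return copies of the payloads with one extra execution ID field."""
    # Copy each record so the caller's data is left untouched.
    return [{**record, dimension_name: execution_id} for record in payloads]


# Illustrative values only; the dimension name here is an assumption.
expanded = add_execution_id(
    payloads=[{"product_name": "name1", "product_description": "description1"}],
    execution_id="product-ingestion-data-job-1628151700498",
    dimension_name="vdk_gdp_execution_id",
)
print(expanded[0])
```

The original fields are preserved and exactly one new key is added per record, which is the behaviour described above.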

Usage

Run

pip install vdk-gdp-execution-id

Create a Data Job and add to its requirements.txt file:

# Python jobs can specify extra library dependencies in requirements.txt file.
# See https://pip.readthedocs.io/en/stable/user_guide/#requirements-files
# The file is optional and can be deleted if no extra library dependencies are necessary.
vdk-gdp-execution-id

Reconfigure the ingestion pre-processing sequence to add the plugin name. For example:

export VDK_INGEST_PAYLOAD_PREPROCESS_SEQUENCE="vdk-gdp-execution-id"
# or
export VDK_INGEST_PAYLOAD_PREPROCESS_SEQUENCE="[...,]vdk-gdp-execution-id"

Note: It is recommended to add this plugin last (at the end of the sequence), because earlier plugins may add new data records. For more information on configuration, see projects/vdk-core/src/vdk/internal/core/config.py.
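
The ordering advice can be illustrated with a small, self-contained sketch. It only models a pre-processing sequence as a list of functions applied in order; the step names (`add_default_record`, `tag_with_execution_id`), the execution ID value and the `run_sequence` helper are all hypothetical and are not part of VDK.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Preprocessor = Callable[[List[Record]], List[Record]]


def add_default_record(records: List[Record]) -> List[Record]:
    # Hypothetical earlier plugin that appends a new record to the payload.
    return records + [{"product_name": "extra", "product_description": "added"}]


def tag_with_execution_id(records: List[Record]) -> List[Record]:
    # Stand-in for this plugin: add the execution ID field to every record.
    return [{**r, "vdk_gdp_execution_id": "job-123"} for r in records]


def run_sequence(records: List[Record], sequence: List[Preprocessor]) -> List[Record]:
    # Apply each pre-processing step in the configured order.
    for step in sequence:
        records = step(records)
    return records


payload = [{"product_name": "name1", "product_description": "description1"}]

tagged_last = run_sequence(payload, [add_default_record, tag_with_execution_id])
tagged_first = run_sequence(payload, [tag_with_execution_id, add_default_record])

# Count records that ended up without an execution ID in each ordering.
untagged_last = [r for r in tagged_last if "vdk_gdp_execution_id" not in r]
untagged_first = [r for r in tagged_first if "vdk_gdp_execution_id" not in r]
print(len(untagged_last), len(untagged_first))
```

When the tagging step runs last, every record carries the execution ID; when it runs first, the record added by the later step is left untagged.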

Example ingestion Data Job 10_python_step.py:

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Send a single object (one record) for ingestion.
    job_input.send_object_for_ingestion(
        payload={"product_name": "name1", "product_description": "description1"},
        destination_table="product")
    # Send tabular data (several rows at once) for ingestion.
    job_input.send_tabular_data_for_ingestion(
        rows=[["name2", "description2"], ["name3", "description3"]],
        column_names=["product_name", "product_description"],
        destination_table="product")

If VDK_INGEST_METHOD_DEFAULT points to a relational database, you can then query the ingested dataset by execution ID:

from vdk.api.job_input import IJobInput


# A processing Data Job can then list the distinct execution IDs stored in the `vdk_gdp_execution_id` column
def run(job_input: IJobInput):
    execution_ids = job_input.execute_query("SELECT DISTINCT vdk_gdp_execution_id FROM product")
    print(execution_ids)
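
To show filtering by one specific execution, the sketch below uses Python's built-in sqlite3 module as a stand-in for the relational database. The table layout follows the example above, while the execution ID values (`job-run-1`, `job-run-2`) are made up for illustration; in a real Data Job the query would go through `job_input.execute_query` instead.

```python
import sqlite3

# In-memory database standing in for the real ingestion target.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "product_name TEXT, product_description TEXT, vdk_gdp_execution_id TEXT)"
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [
        ("name1", "description1", "job-run-1"),
        ("name2", "description2", "job-run-1"),
        ("name3", "description3", "job-run-2"),
    ],
)

# Keep only the records ingested by one particular execution.
rows = conn.execute(
    "SELECT product_name FROM product "
    "WHERE vdk_gdp_execution_id = ? ORDER BY product_name",
    ("job-run-1",),
).fetchall()
names = [row[0] for row in rows]
print(names)
conn.close()
```

Filtering on the execution ID column is what makes it possible to trace each record back to the run that produced it.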

Configuration

Run vdk config-help and search for the options prefixed with "GDP_EXECUTION_ID_" to see which configuration options are available.

Testing

Testing this plugin locally requires installing the dependencies listed in vdk-plugins/vdk-gdp-execution-id/requirements.txt.

Run

pip install -r requirements.txt

Download files

Source Distribution

vdk-gdp-execution-id-0.0.848464550.tar.gz (3.7 kB)

Uploaded Apr 25, 2023 Source

Hashes for vdk-gdp-execution-id-0.0.848464550.tar.gz

Algorithm	Hash digest
SHA256	`d9f00b1282663fd6f0bcfad23769e2b7bf1cdc3c0bfc9bb5efd50bac5bab3d7f`
MD5	`262155e283036d9f378b940cb148874a`
BLAKE2b-256	`e2a2412682feb9f6b6f04621b89bbdc60f6674bd1a587051d40aa3cbd8af80b0`