data product processor

The data product processor is a library for dynamically creating and executing Apache Spark Jobs based on a declarative description of a data product.

The declaration is based on YAML and covers input and output data stores as well as data structures. It can be augmented with custom, PySpark-based transformation logic.
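For illustration, such a declaration might look roughly like the sketch below. The field names are hypothetical placeholders, not the actual schema; the authoritative format is the Data product specification referenced under Getting started.

# Illustrative sketch only - field names are hypothetical, not the actual schema.
product:
  id: customer_orders
  version: "1.0.0"
  pipeline:
    tasks:
      - id: curate_orders
        inputs:
          - connection: raw_orders       # an input data store
            model: orders                # a data structure declared elsewhere
        outputs:
          - connection: data_lake        # an output data store
            model: curated_orders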

Installation

Prerequisites

  • Python 3.x
  • Apache Spark 3.x

Install with pip

pip install data-product-processor

Getting started

Declare a basic data product

Please see the Data product specification for an overview of the files required to declare a data product.
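Custom transformation logic, where needed, is written as ordinary PySpark. The snippet below is only a sketch of such a transformation; the function name and the way it is referenced from the declaration are placeholders, not the processor's actual integration contract.

# Illustrative PySpark transformation; the function name and signature are
# placeholders, not the processor's actual contract.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def transform(orders: DataFrame) -> DataFrame:
    # Drop cancelled orders and add a derived total_price column.
    return (
        orders
        .filter(F.col("status") != "cancelled")
        .withColumn("total_price", F.col("quantity") * F.col("unit_price"))
    )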

Process the data product

From the folder in which the previously created files are stored, run the data-product-processor as follows:

data-product-processor \
  --default_data_lake_bucket some-datalake-bucket \
  --aws_profile some-profile \
  --aws_region eu-central-1 \
  --local

This command runs Apache Spark locally (due to the --local switch) and stores the output in an S3 bucket, authenticating with the AWS profile passed via the --aws_profile parameter.

If you want to run the library from a folder other than the one containing the data product declaration, reference the declaration through the additional argument --product_path:

data-product-processor \
  --product_path ../path-to-some-data-product \
  --default_data_lake_bucket some-datalake-bucket \
  --aws_profile some-profile \
  --aws_region eu-central-1 \
  --local

CLI Arguments

data-product-processor --help

  --JOB_ID - the unique id of this Glue/EMR job
  --JOB_RUN_ID - the unique id of this Glue job run
  --JOB_NAME - the name of this Glue job
  --job-bookmark-option - set to job-bookmark-disable if you don't want bookmarking
  --TempDir - temporary results directory
  --product_path - the data product definition folder
  --aws_profile - the AWS profile to be used for connection
  --aws_region - the AWS region to be used
  --local - run Apache Spark locally (local development mode)
  --jars - extra jars to be added to the Spark context
  --additional-python-modules - injected by Glue; currently not in use
  --default_data_lake_bucket - a default bucket location (with s3a:// prefix)

References

Tutorials
