data product processor

The data product processor is a library for dynamically creating and executing Apache Spark Jobs based on a declarative description of a data product.

The declaration is based on YAML and covers input and output data stores as well as data structures. It can be augmented with custom, PySpark-based transformation logic.
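For illustration, such a declaration might look roughly like the sketch below. The field names are hypothetical placeholders, not the actual schema; the authoritative format is the Data product specification referenced under Getting started.

# Illustrative sketch only - field names are hypothetical, not the actual schema.
product:
  id: customer_orders
  version: "1.0.0"
  pipeline:
    tasks:
      - id: curate_orders
        inputs:
          - connection: raw_orders       # an input data store
            model: orders                # a data structure declared elsewhere
        outputs:
          - connection: data_lake        # an output data store
            model: curated_orders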

Installation

Prerequisites

  • Python 3.x
  • Apache Spark 3.x

Install with pip

pip install data-product-processor

Getting started

Declare a basic data product

Please see the Data product specification for an overview of the files required to declare a data product.
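Custom transformation logic, where needed, is written as ordinary PySpark. The snippet below is only a sketch of such a transformation; the function name and the way it is referenced from the declaration are placeholders, not the processor's actual integration contract.

# Illustrative PySpark transformation; the function name and signature are
# placeholders, not the processor's actual contract.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def transform(orders: DataFrame) -> DataFrame:
    # Drop cancelled orders and add a derived total_price column.
    return (
        orders
        .filter(F.col("status") != "cancelled")
        .withColumn("total_price", F.col("quantity") * F.col("unit_price"))
    )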

Process the data product

From the folder in which the previously created files are stored, run the data-product-processor as follows:

data-product-processor \
  --default_data_lake_bucket some-datalake-bucket \
  --aws_profile some-profile \
  --aws_region eu-central-1 \
  --local

This command runs Apache Spark locally (due to the --local switch) and stores the output in an S3 bucket, authenticating with the AWS profile passed via the --aws_profile parameter.

If you want to run the library from a folder other than the one containing the data product declaration, reference the declaration through the additional argument --product_path:

data-product-processor \
  --product_path ../path-to-some-data-product \
  --default_data_lake_bucket some-datalake-bucket \
  --aws_profile some-profile \
  --aws_region eu-central-1 \
  --local

CLI Arguments

data-product-processor --help

  --JOB_ID - the unique id of this Glue/EMR job
  --JOB_RUN_ID - the unique id of this Glue job run
  --JOB_NAME - the name of this Glue job
  --job-bookmark-option - set to job-bookmark-disable if you don't want bookmarking
  --TempDir - temporary results directory
  --product_path - the data product definition folder
  --aws_profile - the AWS profile to be used for connection
  --aws_region - the AWS region to be used
  --local - run Apache Spark locally (local development mode)
  --jars - extra jars to be added to the Spark context
  --additional-python-modules - injected by Glue; currently not in use
  --default_data_lake_bucket - a default bucket location (with s3a:// prefix)

References

Tutorials
