data product processor
The data product processor (dpp) is a library for dynamically creating and executing Apache Spark Jobs based on a declarative description of a data product.
The declaration is based on YAML and covers input and output data stores as well as data structures. It can be augmented with custom, PySpark-based transformation logic.
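As an illustration, a minimal declaration could look like the sketch below. The field names and file layout are assumptions for illustration only; the authoritative schema is given in the data product specification referenced under Getting started.

# product.yml -- hypothetical sketch; all keys below are illustrative
# assumptions, not the authoritative schema.
product:
  id: customers_curated
  version: "1.0.0"
  owner: data-platform-team
  pipeline:
    tasks:
      - id: curate_customers
        logic:
          module: tasks.curate_customers   # optional custom PySpark hook
        inputs:
          - connection: crm_database
            table: public.customers
        outputs:
          - model: customers               # persisted to the data lake bucket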
Installation
Prerequisites
- Python 3.x
- Apache Spark 3.x
Install with pip
pip install data-product-processor
Getting started
Declare a basic data product
Please see the Data product specification for an overview of the files required to declare a data product.
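Where the declaration references custom transformation logic, that logic is plain PySpark. A minimal sketch of such a module is shown below; the function name and signature are assumptions for illustration, not a documented interface of the library.

# Hypothetical custom transformation module -- the execute() name and
# signature are illustrative assumptions, not the library's required hook.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F


def execute(df: DataFrame) -> DataFrame:
    # Example rule: keep active customers and normalise e-mail addresses.
    return (
        df.filter(F.col("status") == "active")
          .withColumn("email", F.lower(F.trim(F.col("email"))))
    )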
Process the data product
From the folder in which the previously created files are stored, run the data-product-processor as follows:
data-product-processor \
--default_data_lake_bucket some-datalake-bucket \
--aws_profile some-profile \
--aws_region eu-central-1 \
--local
This command runs Apache Spark locally (due to the --local switch) and stores the output in an S3 bucket, authenticated via the AWS profile passed with --aws_profile.
If you want to run the library from a different folder than the data product declaration, reference the latter through the additional --product_path argument:
data-product-processor \
--product_path ../path-to-some-data-product \
--default_data_lake_bucket some-datalake-bucket \
--aws_profile some-profile \
--aws_region eu-central-1 \
--local
CLI Arguments
data-product-processor --help
--JOB_ID - the unique id of this Glue/EMR job
--JOB_RUN_ID - the unique id of this Glue job run
--JOB_NAME - the name of this Glue job
--job-bookmark-option - job-bookmark-disable if you don't want bookmarking
--TempDir - temporary results directory
--product_path - the data product definition folder
--aws_profile - the AWS profile to be used for connection
--aws_region - the AWS region to be used
--local - run Spark locally (for local development)
--jars - extra jars to be added to the Spark context
--additional-python-modules - injected by Glue; currently not in use
--default_data_lake_bucket - a default bucket location (with s3a:// prefix)
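When Spark runs locally against S3 (as in the examples above), the Hadoop S3A connector typically has to be on the classpath, which is what the --jars argument is for. A sketch of such an invocation follows; the jar file names, versions, and the comma-separated list format are assumptions and must match your local Spark/Hadoop installation.

data-product-processor \
--product_path ../path-to-some-data-product \
--default_data_lake_bucket s3a://some-datalake-bucket \
--aws_profile some-profile \
--aws_region eu-central-1 \
--jars jars/hadoop-aws-3.2.0.jar,jars/aws-java-sdk-bundle-1.11.375.jar \
--local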