glancefeed-reco-spaces-zappcontent-personalization
Overview
This repository contains the code for building pipelines for Zapp Content Personalization. Currently, the pipeline for the BreakingNews Zapp is built using Spark Structured Streaming. In general, any Zapp pipeline should consist of the following components:
- Consuming data from Kafka
- A High Pass Filter that applies static filters to drop data that is not relevant to the Zapp
- A Low Pass Filter that applies dynamic filters to drop data that is not relevant to the Zapp
- A ranking algorithm to rank the data based on the Zapp's requirements
- Writing the output to Kafka
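The component list above can be sketched end-to-end in plain Python. This is only an illustration of the consume → static filter → dynamic filter → rank → publish flow; the real pipeline expresses these stages as Spark Structured Streaming transformations, and the field names and filter predicates here are hypothetical.

```python
# Minimal sketch of the Zapp pipeline stages (field names are hypothetical).

def high_pass_filter(item: dict) -> bool:
    # Static filter: e.g. keep only items in the Zapp's language.
    return item.get("language") == "en"

def low_pass_filter(item: dict) -> bool:
    # Dynamic filter: e.g. drop items older than a freshness window.
    return item.get("age_minutes", 0) <= 60

def rank(items: list) -> list:
    # Rank by a hypothetical relevance score, highest first.
    return sorted(items, key=lambda it: it.get("score", 0.0), reverse=True)

def run_pipeline(consumed: list) -> list:
    # Consume -> static filter -> dynamic filter -> rank -> publish.
    kept = [it for it in consumed if high_pass_filter(it) and low_pass_filter(it)]
    return rank(kept)
```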
POCs
- Embedded Model in the Pipeline
- RPC Call from the Pipeline
- Message Queue based ML Inference Server
Preparing the pipeline for a new Zapp
- Use this repo as the template
- In the config.ini file update the following fields:
[kafka-config]
group.id = <consumer-group-id>        # Consumer group ID should be in the format consumer-glance-ds*
zapp.widget.id = <zapp-widget-id>     # Zapp Widget ID to be deployed into

[gcp-config]
checkpoint.path = <gcs-checkpoint-path>                      # for checkpointing the streaming data
checkpoint.analytics.path = <gcs-analytics-checkpoint-path>  # for checkpointing the streaming analytics data

[env-config]
env = <DEV/STAGING/PROD>
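Once updated, these values can be read with Python's standard configparser; a minimal sketch (the section and key names follow the config.ini fields above, the sample values are illustrative placeholders):

```python
import configparser

# Illustrative config.ini content mirroring the fields above.
SAMPLE = """
[kafka-config]
group.id = consumer-glance-ds-breakingnews
zapp.widget.id = 12345

[env-config]
env = DEV
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

# Keys containing dots are plain string keys in configparser.
group_id = config["kafka-config"]["group.id"]
env = config["env-config"]["env"]
```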
Running the pipeline
1. Environment setup
Dev environment
In this environment, output will be displayed on the console and not written to the output topic.
- In config.ini, set env to DEV and set the project number to that of the non-prod project
- Create a GCP DataProc cluster on your non-prod project with the spark properties as follows:
--properties spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
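The DEV behaviour described above (output on the console rather than the output topic) can be implemented by switching the streaming sink on the env value. A hedged plain-Python sketch; the function name, sink keys, and topic names are illustrative assumptions, and the real pipeline would wire the result into its writeStream configuration:

```python
def select_sink(env: str) -> dict:
    # DEV prints results to the console; STAGING/PROD write to a
    # Kafka output topic (topic names here are hypothetical placeholders).
    if env == "DEV":
        return {"format": "console"}
    topic = "staging-output-topic" if env == "STAGING" else "prod-output-topic"
    return {"format": "kafka", "topic": topic}
```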
Staging environment
In this environment, output will be written to the staging output topic.
- In config.ini, set env to STAGING and set the project number to that of the non-prod project
- Create a GCP DataProc cluster on your non-prod project with the spark properties as follows:
--properties spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
Production environment
TBD
2. Run the pipeline
The pipeline can be run in two ways:
- Running the pipeline as a spark job
- Running the pipeline as a serverless streaming pipeline
1. Running Zapp Pipeline as a spark job
- First, make the necessary config changes in config.ini, prepare your script item_fan_out.py, and copy the sources to GCS:
gsutil cp -r src gs://pipelines-staging/spark_job/item_fan_out
- Next, submit the pipeline as a Spark job:
gcloud dataproc jobs submit pyspark --cluster <cluster-name> --region <region> --properties spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 gs://pipelines-staging/spark_job/item_fan_out/item_fan_out.py --files gs://pipelines-staging/spark_job/item_fan_out/config.ini,gs://pipelines-staging/spark_job/item_fan_out/utils.py
2. Running Zapp Pipeline as a serverless streaming pipeline
- First, prepare your script streaming_zap_pipeline.py and copy it to GCS:
gsutil cp src/streaming_zap_pipeline.py gs://pipelines-staging/serverless/streaming_zap_pipeline.py
- Next, trigger a serverless job by opening notebooks/launch_pipeline_poc.ipynb
References
- CMS <> DS Contract
- DS <> Planner-Ingestion Contract
- Creating a secret in Google Secret Manager - https://github.tools.inmobi.com/shivjeet-bhosale/secret_manager/blob/main/generate_secrets.ipynb
- Kafka topic creation steps
- Go to https://confluent.cloud/home
- For Dev/staging, create a topic in CONFLUENT-KAFKA-NON-PROD-SG-001 cluster under Glance-Staging environment
- Topic name should be in the format glance-ds-non-prod-<topic-name>
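The naming conventions above can be sanity-checked with a small helper. This is an illustrative sketch, not part of the pipeline, and it assumes the intended formats are consumer-glance-ds* for consumer groups and glance-ds-non-prod-<topic-name> for non-prod topics:

```python
NONPROD_PREFIX = "glance-ds-non-prod-"

def valid_consumer_group(group_id: str) -> bool:
    # Consumer group IDs must follow the consumer-glance-ds* convention.
    return group_id.startswith("consumer-glance-ds")

def valid_nonprod_topic(topic: str) -> bool:
    # Dev/staging topics are prefixed glance-ds-non-prod- and must
    # carry a non-empty topic name after the prefix.
    return topic.startswith(NONPROD_PREFIX) and len(topic) > len(NONPROD_PREFIX)
```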