
A package for the Zapp platform

Project description

glancefeed-reco-spaces-zappcontent-personalization

Overview

This repository contains the code for building pipelines for Zapp Content Personalization. Currently, the pipeline for the BreakingNews Zapp is built using Spark Structured Streaming. In general, any Zapp pipeline should consist of the following components (a minimal end-to-end skeleton is sketched after this list):

  1. Consuming data from Kafka
  2. A high-pass filter to drop data that is not relevant to the Zapp (static filters)
  3. A low-pass filter to drop data that is not relevant to the Zapp (dynamic filters)
  4. A ranking algorithm to rank the data based on the Zapp's requirements
  5. Writing the output to Kafka
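
The skeleton below is a minimal sketch of how these stages can be wired together with Spark Structured Streaming. The broker addresses, topic names, JSON fields, filter conditions, and ranking column are illustrative placeholders, not the actual BreakingNews implementation.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("zapp-content-personalization").getOrCreate()

    # 1. Consume data from Kafka (broker and topic are placeholders)
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<bootstrap-servers>")
        .option("subscribe", "<input-topic>")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
    )

    # Parse the fields the filters need (field names are illustrative)
    parsed = events.select(
        F.get_json_object("value", "$.category").alias("category"),
        F.get_json_object("value", "$.score").cast("double").alias("score"),
        "value",
    )

    # 2. High-pass filter: static conditions known up front
    # 3. Low-pass filter: dynamic, Zapp-specific conditions
    filtered = parsed.filter(F.col("category") == "breaking_news").filter(F.col("score") > 0.5)

    # 4. Ranking: attach a score downstream consumers can order by
    #    (real ranking logic depends on the Zapp's requirements)
    ranked = filtered.withColumn("rank_score", F.col("score"))

    # 5. Write the output back to Kafka
    query = (
        ranked.selectExpr("CAST(value AS STRING) AS value")
        .writeStream.format("kafka")
        .option("kafka.bootstrap.servers", "<bootstrap-servers>")
        .option("topic", "<output-topic>")
        .option("checkpointLocation", "<gcs-checkpoint-path>")
        .start()
    )
    query.awaitTermination()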

POCs

  • Embedded Model in the Pipeline
  • RPC Call from the Pipeline
  • Message Queue based ML Inference Server

Preparing the pipeline for a new Zapp

  1. Use this repo as the template
  2. In the config.ini file, update the following fields (a sketch of how these values can be read follows the snippet):
    [kafka-config]
    group.id = <consumer-group-id> # Consumer group ID should be in the format consumer-glance-ds*
    zapp.widget.id = <zapp-widget-id> # Zapp Widget ID to be deployed into
    
    
    [gcp-config]
    checkpoint.path = <gcs-checkpoint-path> # for checkpointing the streaming data
    checkpoint.analytics.path = <gcs-analytics-checkpoint-path> # for checkpointing the analytics stream
    
    [env-config]
    env = <DEV/STAGING/PROD>
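
A minimal sketch of reading these values with Python's standard-library configparser; the variable names are illustrative and not the pipeline's actual code.

    from configparser import ConfigParser

    # inline_comment_prefixes strips the trailing "# ..." comments shown above
    config = ConfigParser(inline_comment_prefixes=("#",))
    config.read("config.ini")

    group_id = config.get("kafka-config", "group.id")              # consumer-glance-ds*
    zapp_widget_id = config.get("kafka-config", "zapp.widget.id")
    checkpoint_path = config.get("gcp-config", "checkpoint.path")
    analytics_checkpoint_path = config.get("gcp-config", "checkpoint.analytics.path")
    env = config.get("env-config", "env")                          # DEV / STAGING / PROD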
    

Running the pipeline

1. Environment setup

Dev environment

In this environment, output will be displayed on the console and not written to the output topic (see the sketch after the steps below).

  1. Set env in config.ini to DEV and set the project number to that of the non-prod project
  2. Create a GCP Dataproc cluster in your non-prod project with the following Spark properties: --properties spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
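
A sketch of how the env value can switch the sink between the console (DEV) and the output topic. This is assumed wiring, shown only to illustrate the behaviour described above; topic and broker values are placeholders.

    def start_output(df, env, checkpoint_path):
        """Route the streaming DataFrame to the console (DEV) or Kafka (STAGING/PROD)."""
        if env == "DEV":
            # DEV: print micro-batches to the console; nothing reaches the output topic
            writer = df.writeStream.format("console").option("truncate", "false")
        else:
            # STAGING/PROD: write to the environment's output topic
            writer = (
                df.selectExpr("CAST(value AS STRING) AS value")
                .writeStream.format("kafka")
                .option("kafka.bootstrap.servers", "<bootstrap-servers>")
                .option("topic", "<output-topic>")
            )
        return writer.option("checkpointLocation", checkpoint_path).start()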

Staging environment

In this environment, output will be written to the staging output topic.

  1. Set env in config.ini to STAGING and set the project number to that of the non-prod project
  2. Create a GCP Dataproc cluster in your non-prod project with the following Spark properties: --properties spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2

Production environment

TBD

2. Run the pipeline

The pipeline can be run in two ways:

  1. Running the pipeline as a Spark job
  2. Running the pipeline as a serverless streaming pipeline

1. Running the Zapp pipeline as a Spark job

  • First, make the necessary config changes in config.ini, prepare your script item_fan_out.py, and copy the src directory to GCS:
    gsutil cp -r src gs://pipelines-staging/spark_job/item_fan_out
    
  • Next, run the pipeline as a Spark job:
    gcloud dataproc jobs submit pyspark \
        --cluster <cluster-name> --region <region> \
        --properties spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
        gs://pipelines-staging/spark_job/item_fan_out/item_fan_out.py \
        --files gs://pipelines-staging/spark_job/item_fan_out/config.ini,gs://pipelines-staging/spark_job/item_fan_out/utils.py
    

2. Running the Zapp pipeline as a serverless streaming pipeline

  • First, prepare your script streaming_zap_pipeline.py and copy it to GCS:

    gsutil cp src/streaming_zap_pipeline.py gs://pipelines-staging/serverless/streaming_zap_pipeline.py
    
  • Next, trigger a serverless job by opening notebooks/launch_pipeline_poc.ipynb and running it
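
For reference, a hedged sketch of what such a launch could look like using the google-cloud-dataproc client. This is an assumption about the notebook's contents; project, region, and batch IDs are placeholders.

    from google.cloud import dataproc_v1

    region = "<region>"
    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    batch = dataproc_v1.Batch(
        pyspark_batch=dataproc_v1.PySparkBatch(
            main_python_file_uri="gs://pipelines-staging/serverless/streaming_zap_pipeline.py",
            file_uris=["gs://pipelines-staging/serverless/config.ini"],
        ),
        runtime_config=dataproc_v1.RuntimeConfig(
            properties={"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2"}
        ),
    )

    # Returns a long-running operation that tracks the batch; for a streaming
    # workload you would typically record the batch name rather than block on it.
    operation = client.create_batch(
        parent=f"projects/<project-id>/locations/{region}",
        batch=batch,
        batch_id="<batch-id>",
    )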

References

  1. CMS <> DS Contract
  2. DS <> Planner-Ingestion Contract
  3. Creating a secret in Google Secret Manager - https://github.tools.inmobi.com/shivjeet-bhosale/secret_manager/blob/main/generate_secrets.ipynb
  4. Kafka topic creation steps
    • Go to https://confluent.cloud/home
    • For Dev/Staging, create a topic in the CONFLUENT-KAFKA-NON-PROD-SG-001 cluster under the Glance-Staging environment
    • Topic name should be in the format glance-ds-non-prod<topic-name>
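
As an alternative to the Confluent Cloud UI, the topic can also be created programmatically; the sketch below uses the confluent-kafka AdminClient, with cluster credentials and partition/replication settings as placeholder assumptions.

    from confluent_kafka.admin import AdminClient, NewTopic

    admin = AdminClient({
        "bootstrap.servers": "<confluent-bootstrap-servers>",
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": "<confluent-api-key>",
        "sasl.password": "<confluent-api-secret>",
    })

    # Topic name follows the glance-ds-non-prod<topic-name> convention above
    new_topic = NewTopic("glance-ds-non-prod<topic-name>", num_partitions=3, replication_factor=3)
    for topic, future in admin.create_topics([new_topic]).items():
        future.result()  # raises if creation failed
        print(f"Created topic {topic}")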
