
A package for the Zapp platform

Project description

glancefeed-reco-spaces-zappcontent-personalization

Overview

This repository contains the code for building pipelines for Zapp Content Personalization. Currently, the pipeline for the BreakingNews Zapp is built using Spark Structured Streaming. In general, any Zapp pipeline should consist of the following components (a minimal end-to-end skeleton is sketched after this list):

  1. Consuming data from Kafka
  2. A high-pass filter to drop data that is not relevant to the Zapp (static filters)
  3. A low-pass filter to drop data that is not relevant to the Zapp (dynamic filters)
  4. A ranking algorithm to rank the data based on the Zapp's requirements
  5. Writing the output to Kafka
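
The skeleton below is a minimal sketch of how these stages can be wired together with Spark Structured Streaming. The broker addresses, topic names, JSON fields, filter conditions, and ranking column are illustrative placeholders, not the actual BreakingNews implementation.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("zapp-content-personalization").getOrCreate()

    # 1. Consume data from Kafka (broker and topic are placeholders)
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<bootstrap-servers>")
        .option("subscribe", "<input-topic>")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
    )

    # Parse the fields the filters need (field names are illustrative)
    parsed = events.select(
        F.get_json_object("value", "$.category").alias("category"),
        F.get_json_object("value", "$.score").cast("double").alias("score"),
        "value",
    )

    # 2. High-pass filter: static conditions known up front
    # 3. Low-pass filter: dynamic, Zapp-specific conditions
    filtered = parsed.filter(F.col("category") == "breaking_news").filter(F.col("score") > 0.5)

    # 4. Ranking: attach a score downstream consumers can order by
    #    (real ranking logic depends on the Zapp's requirements)
    ranked = filtered.withColumn("rank_score", F.col("score"))

    # 5. Write the output back to Kafka
    query = (
        ranked.selectExpr("CAST(value AS STRING) AS value")
        .writeStream.format("kafka")
        .option("kafka.bootstrap.servers", "<bootstrap-servers>")
        .option("topic", "<output-topic>")
        .option("checkpointLocation", "<gcs-checkpoint-path>")
        .start()
    )
    query.awaitTermination()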

POCs

  • Embedded Model in the Pipeline
  • RPC Call from the Pipeline
  • Message Queue based ML Inference Server

Preparing the pipeline for a new Zapp

  1. Use this repo as the template
  2. In the config.ini file, update the following fields (a sketch of how these values can be read follows the snippet):
    [kafka-config]
    group.id = <consumer-group-id> # Consumer group ID should be in the format consumer-glance-ds*
    zapp.widget.id = <zapp-widget-id> # Zapp Widget ID to be deployed into
    
    
    [gcp-config]
    checkpoint.path = <gcs-checkpoint-path> # for checkpointing the streaming data
    checkpoint.analytics.path = <gcs-analytics-checkpoint-path> # for checkpointing the analytics stream
    
    [env-config]
    env = <DEV/STAGING/PROD>
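
A minimal sketch of reading these values with Python's standard-library configparser; the variable names are illustrative and not the pipeline's actual code.

    from configparser import ConfigParser

    # inline_comment_prefixes strips the trailing "# ..." comments shown above
    config = ConfigParser(inline_comment_prefixes=("#",))
    config.read("config.ini")

    group_id = config.get("kafka-config", "group.id")              # consumer-glance-ds*
    zapp_widget_id = config.get("kafka-config", "zapp.widget.id")
    checkpoint_path = config.get("gcp-config", "checkpoint.path")
    analytics_checkpoint_path = config.get("gcp-config", "checkpoint.analytics.path")
    env = config.get("env-config", "env")                          # DEV / STAGING / PROD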
    

Running the pipeline

1. Environment setup

Dev environment

In this environment, output will be displayed on the console and not written to the output topic (see the sketch after the steps below).

  1. Set env in config.ini to DEV and set the project number to that of the non-prod project
  2. Create a GCP Dataproc cluster in your non-prod project with the following Spark properties: --properties spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
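
A sketch of how the env value can switch the sink between the console (DEV) and the output topic. This is assumed wiring, shown only to illustrate the behaviour described above; topic and broker values are placeholders.

    def start_output(df, env, checkpoint_path):
        """Route the streaming DataFrame to the console (DEV) or Kafka (STAGING/PROD)."""
        if env == "DEV":
            # DEV: print micro-batches to the console; nothing reaches the output topic
            writer = df.writeStream.format("console").option("truncate", "false")
        else:
            # STAGING/PROD: write to the environment's output topic
            writer = (
                df.selectExpr("CAST(value AS STRING) AS value")
                .writeStream.format("kafka")
                .option("kafka.bootstrap.servers", "<bootstrap-servers>")
                .option("topic", "<output-topic>")
            )
        return writer.option("checkpointLocation", checkpoint_path).start()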

Staging environment

In this environment, output will be written to the staging output topic.

  1. Set env in config.ini to STAGING and set the project number to that of the non-prod project
  2. Create a GCP Dataproc cluster in your non-prod project with the following Spark properties: --properties spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2

Production environment

TBD

2. Run the pipeline

The pipeline can be run in two ways:

  1. Running the pipeline as a Spark job
  2. Running the pipeline as a serverless streaming pipeline

1. Running the Zapp pipeline as a Spark job

  • First, make the necessary config changes in config.ini, prepare your script item_fan_out.py, and copy the src directory to GCS:
    gsutil cp -r src gs://pipelines-staging/spark_job/item_fan_out
    
  • Next, run the pipeline as a Spark job:
    gcloud dataproc jobs submit pyspark \
        --cluster <cluster-name> --region <region> \
        --properties spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
        gs://pipelines-staging/spark_job/item_fan_out/item_fan_out.py \
        --files gs://pipelines-staging/spark_job/item_fan_out/config.ini,gs://pipelines-staging/spark_job/item_fan_out/utils.py
    

2. Running the Zapp pipeline as a serverless streaming pipeline

  • First, prepare your script streaming_zap_pipeline.py and copy it to GCS:

    gsutil cp src/streaming_zap_pipeline.py gs://pipelines-staging/serverless/streaming_zap_pipeline.py
    
  • Next, trigger a serverless job by opening notebooks/launch_pipeline_poc.ipynb and running it
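
For reference, a hedged sketch of what such a launch could look like using the google-cloud-dataproc client. This is an assumption about the notebook's contents; project, region, and batch IDs are placeholders.

    from google.cloud import dataproc_v1

    region = "<region>"
    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    batch = dataproc_v1.Batch(
        pyspark_batch=dataproc_v1.PySparkBatch(
            main_python_file_uri="gs://pipelines-staging/serverless/streaming_zap_pipeline.py",
            file_uris=["gs://pipelines-staging/serverless/config.ini"],
        ),
        runtime_config=dataproc_v1.RuntimeConfig(
            properties={"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2"}
        ),
    )

    # Returns a long-running operation that tracks the batch; for a streaming
    # workload you would typically record the batch name rather than block on it.
    operation = client.create_batch(
        parent=f"projects/<project-id>/locations/{region}",
        batch=batch,
        batch_id="<batch-id>",
    )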

References

  1. CMS <> DS Contract
  2. DS <> Planner-Ingestion Contract
  3. Creating a secret in Google Secret Manager - https://github.tools.inmobi.com/shivjeet-bhosale/secret_manager/blob/main/generate_secrets.ipynb
  4. Kafka topic creation steps
    • Go to https://confluent.cloud/home
    • For Dev/Staging, create a topic in the CONFLUENT-KAFKA-NON-PROD-SG-001 cluster under the Glance-Staging environment
    • Topic name should be in the format glance-ds-non-prod<topic-name>
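
As an alternative to the Confluent Cloud UI, the topic can also be created programmatically; the sketch below uses the confluent-kafka AdminClient, with cluster credentials and partition/replication settings as placeholder assumptions.

    from confluent_kafka.admin import AdminClient, NewTopic

    admin = AdminClient({
        "bootstrap.servers": "<confluent-bootstrap-servers>",
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": "<confluent-api-key>",
        "sasl.password": "<confluent-api-secret>",
    })

    # Topic name follows the glance-ds-non-prod<topic-name> convention above
    new_topic = NewTopic("glance-ds-non-prod<topic-name>", num_partitions=3, replication_factor=3)
    for topic, future in admin.create_topics([new_topic]).items():
        future.result()  # raises if creation failed
        print(f"Created topic {topic}")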
