A data processing bundle for Spark-based recommender system operations

Project description

RecDP

INTRODUCTION

RecDP is a Python data processing module designed specifically for recommender systems.

  • Easy to use - wraps frequently used operations in simple APIs.
  • Collaborative pipeline with Spark - provides stability and scalability when handling huge datasets, with Spark as the underlying distributed data processing engine.
  • Optimized performance - 1) adaptive DataFrame plan decision making; 2) Intel-OAP accelerator extensions (SIMD, Cache, Native).
  • Feature-engineering oriented - advanced feature engineering functions (e.g. target encoding).
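
As an illustration of the target encoding mentioned above, here is a minimal pure-Python sketch of the general technique. It is not RecDP's implementation or API: the function name `target_encode`, the `smoothing` parameter, and the sample data are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=0.0):
    """Map each category to the (optionally smoothed) mean of its targets.

    Hypothetical helper for illustration only; not part of pyrecdp.
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if not targets:
        return {}
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # Smoothing pulls rare categories toward the global mean.
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in sums
    }


# Made-up sample: tweet language and whether the tweet was clicked.
languages = ["en", "en", "ja", "ja", "ja", "fr"]
clicked = [1, 0, 1, 1, 0, 0]
encoding = target_encode(languages, clicked)
smoothed = target_encode(languages, clicked, smoothing=1.0)
```

With `smoothing=0.0` each category gets its plain target mean; a positive value makes categories with few rows less extreme, which is often preferred for sparse recommender features.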

Getting Started

Install with pip (requires Spark to be preinstalled)

# the default version works with Spark 3.1
pip install pyrecdp

Install using a Docker image with Spark preinstalled

docker run --network host -w /home/vmagent/app/ -it xuechendi/recdp_spark3.1 /bin/bash
pip install pyrecdp

Examples

More examples

Categorify source data

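from pyrecdp.data_processor import *
from pyrecdp.utils import *
# spark (a SparkSession), path_prefix, cur_folder and df are assumed to be defined already
proc = DataProcessor(spark, path_prefix, cur_folder)
# encode the 'language' column as categorical ids
proc.reset_ops([Categorify(['language'])])
df = proc.transform(df)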
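
To make the idea behind categorification concrete, the following is a small pure-Python sketch, not RecDP code. It assumes ids are assigned by descending frequency and that unseen values map to `-1`; both choices, and the helper names `build_dictionary` and `categorify`, are assumptions for illustration, and RecDP's actual id assignment may differ.

```python
from collections import Counter


def build_dictionary(values):
    """Assign an integer id to each distinct value, most frequent first.

    Hypothetical helper; ties are broken alphabetically for determinism.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: index for index, value in enumerate(ordered)}


def categorify(values, dictionary, default=-1):
    """Replace each value with its id, or `default` if it is unknown."""
    return [dictionary.get(value, default) for value in values]


languages = ["en", "ja", "en", "fr", "en", "ja"]
dictionary = build_dictionary(languages)
encoded = categorify(languages, dictionary)
unseen = categorify(["de"], dictionary)
```

In a Spark pipeline the dictionary would typically be built once and joined back to the source data, which is the kind of step a wrapper such as `Categorify` is meant to hide.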

Sort a list by frequency

import pyspark.sql.functions as f
from pyrecdp.data_processor import *
from pyrecdp.utils import *
proc = DataProcessor(spark, path_prefix, cur_folder)
# group languages by hour of day
df = df.groupby('dt_hour').agg(f.collect_list("language").alias("language_list"))
# sort the languages by how often they appear in this hour
df = df.withColumn("sorted_language", f.expr("sortStringArrayByFrequency(language_list)"))
df = proc.transform(df)
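
For readers unfamiliar with frequency sorting, here is a pure-Python sketch of one plausible reading of the operation: return the distinct values of a list, most frequent first. Whether the `sortStringArrayByFrequency` function used above deduplicates or breaks ties this way is not stated here, so treat the hypothetical `sort_by_frequency` helper below as an assumption rather than a description of RecDP's behavior.

```python
from collections import Counter


def sort_by_frequency(items):
    """Return distinct items ordered by descending count.

    Hypothetical helper; ties are broken alphabetically for determinism.
    """
    counts = Counter(items)
    return sorted(counts, key=lambda item: (-counts[item], item))


# Made-up sample: languages seen within one hour.
language_list = ["en", "ja", "en", "fr", "ja", "en"]
sorted_language = sort_by_frequency(language_list)
```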

Use cases

  • Recsys2021 example url
  • Recsys2020 example url
  • Recsys2020 multiitem-categorify example(support for Analytics Zoo Friesian) url
  • DLRM example url
  • DIEN example url

LICENSE

  • Apache 2.0

Dependencies

  • Spark 3.x
  • Python 3.x

Source Distribution

pyrecdp-0.1.5.tar.gz (241.6 kB)

Uploaded Source
