A data processing bundle for Spark-based recommender system operations
RecDP
INTRODUCTION
RecDP is a Python data processing module designed specifically for recommender systems.
- Easy to use - wraps frequently used operations in simple APIs.
- Collaborative pipeline with Spark - provides stability and scalability when handling huge datasets, with Spark as the underlying distributed data processing engine.
- Optimized performance - 1) adaptive dataframe plan decision making; 2) Intel-OAP accelerator extensions (SIMD, Cache, Native).
- Feature engineering oriented - advanced feature engineering functions (target encoding).
Getting Started
Install with pip (requires Spark to be preinstalled)
# the default version works with Spark 3.1
pip install pyrecdp
Install with a Docker image that has Spark preinstalled
docker run --network host -w /home/vmagent/app/ -it xuechendi/recdp_spark3.1 /bin/bash
pip install pyrecdp
examples
Categorify source data
- Convert the 'language' column from text to unique integer ids
- Code: tests/test_categorify.py
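from pyrecdp.data_processor import *
from pyrecdp.utils import *
# spark (a SparkSession), path_prefix, cur_folder and df (the input DataFrame) must already be defined
proc = DataProcessor(spark, path_prefix, cur_folder)
# replace the string values in 'language' with integer ids
proc.reset_ops([Categorify(['language'])])
df = proc.transform(df)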
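To make the effect of the categorify step concrete, here is a minimal pure-Python sketch of the idea. It is not RecDP's implementation: the helper name `categorify_column` is hypothetical, and assigning ids in order of first appearance is an assumption, since RecDP may order ids differently.

```python
from typing import Dict, List, Tuple


def categorify_column(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Replace each distinct string with a unique integer id.

    Hypothetical helper for illustration only. Ids are assigned in order
    of first appearance, which is an assumption and not RecDP's rule.
    """
    mapping: Dict[str, int] = {}
    encoded: List[int] = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # next unused id
        encoded.append(mapping[value])
    return encoded, mapping


languages = ["en", "fr", "en", "de", "fr", "en"]
encoded, mapping = categorify_column(languages)
print(encoded)
print(mapping)
```

The returned mapping plays the role of the dictionary that a categorify step has to keep, so that the same value receives the same id whenever it is seen again.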
Sort a list by frequency
- When each cell holds a list, reduce it to its unique values ordered by frequency
- Code: tests/test_sortArrayByFrequency.py
import pyspark.sql.functions as f
from pyrecdp.data_processor import *
from pyrecdp.utils import *
# spark, path_prefix, cur_folder and df are assumed to be defined already
proc = DataProcessor(spark, path_prefix, cur_folder)
# group language by hour of day
df = df.groupby('dt_hour').agg(f.collect_list("language").alias("language_list"))
# sort the languages by how often they appear in this hour
df = df.withColumn("sorted_language", f.expr("sortStringArrayByFrequency(language_list)"))
df = proc.transform(df)
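The per-cell result of sorting by frequency can be sketched in plain Python as follows. This only illustrates the described behaviour (unique values, most frequent first) and is not the code behind `sortStringArrayByFrequency`. The helper name `sort_by_frequency` is hypothetical, and breaking ties alphabetically is an assumption made here to keep the output deterministic.

```python
from collections import Counter
from typing import List


def sort_by_frequency(items: List[str]) -> List[str]:
    """Return the unique values of `items`, most frequent first.

    Hypothetical helper for illustration only. Ties are broken
    alphabetically, which is an assumption and not RecDP's rule.
    """
    counts = Counter(items)
    return sorted(counts, key=lambda item: (-counts[item], item))


language_list = ["en", "fr", "en", "de", "fr", "en"]
sorted_languages = sort_by_frequency(language_list)
print(sorted_languages)
```

In the Spark example above, the same operation is applied to the `language_list` array of each `dt_hour` row.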
use cases
- Recsys2021 example url
- Recsys2020 example url
- Recsys2020 multiitem-categorify example(support for Analytics Zoo Friesian) url
- DLRM example url
- DIEN example url
LICENSE
- Apache 2.0
Dependency
- Spark 3.x
- Python 3.x