A data processing bundle for Spark-based recommender system operations
RecDP v2.0
INTRODUCTION
Problem Statement
Data preparation is an essential step in building AI pipelines
- key data preparation capabilities: data connectors, cleaning, sampling, joining, profiling, feature engineering, low-code/no-code UI, lineage, etc.
- exploring the optimal data preparation consumes the majority of data science time
Solution with RecDP v2.0
- Auto pipeline
- only 3 lines of code required
- Pipeline Generator
- Data Profiling:
- Automatic anomaly detection
- Automatic missing-value imputation
- Profiling visualization
- Feature Wrangling:
- feature transformation (datetime, geo_info, text_nlp, url, etc.)
- automatic joining of multiple tables
- feature cross (aggregation transformations: sum, avg, count, etc.)
- export the pipeline as a JSON file that can be imported into other data platforms
- Pipeline Runner:
- spark engine: converts the pipeline to Spark code and runs it
- pandas engine: converts the pipeline to pandas code and runs it
- sql engine: converts the pipeline to SQL
- DataLoader:
- parquet, csv, json, database
- FeatureWriter - ML/DL connector:
- Data Lineage
- Feature Store
- numpy, csv, parquet, dgl / pyG graph
This solution is intended for
citizen data scientists, enterprise users, independent software vendors, and some cloud service providers.
Getting Started
setup with pip
git clone --single-branch --branch RecDP_v2.0 https://github.com/intel-innersource/frameworks.bigdata.AIDK.git
cd frameworks.bigdata.AIDK/RecDP
# install dependencies
apt-get update -y && DEBIAN_FRONTEND=noninteractive apt-get install -y python3 python3-pip python-is-python3 graphviz
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
# install recdp
python setup.py sdist
pip install dist/pyrecdp-1.0.1.tar.gz
sh start-jupyter.sh
# open browser with http://hostname:8888
run
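from pyrecdp.autofe import FeatureWrangler
# train_data: the training dataset, loaded beforehand; label: name of the target column
pipeline = FeatureWrangler(dataset=train_data, label="fare_amount")
# plot the generated feature engineering pipeline
pipeline.plot()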
Quick Example
- nyc taxi fare - geographic, datetime feature engineering
- twitter recsys - text nlp, datetime feature engineering
- outbrain - multiple table joining
- amazon product review - text nlp, datetime, feature-cross
More Examples - complete examples including training
Auto Feature Engineering vs. featuretools
- NYC Taxi fare auto data preparation: an example showing how RecDP_v2.0 automatically generates datetime and geo features on 55M records. Tested with both Spark and pandas (featuretools) as the compute engine; Spark shows a 21x speedup.
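To make "datetime and geo features" concrete, the sketch below derives a few such features by hand for a single taxi trip, using only the Python standard library. It is not RecDP code and does not show what RecDP actually generates; the column names (pickup_datetime, pickup_latitude, etc.), the chosen features, and the helper names are assumptions made for illustration.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def derive_features(trip):
    """Derive datetime and geo features from one trip record (hypothetical schema)."""
    ts = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return {
        "pickup_hour": ts.hour,
        "pickup_day_of_week": ts.weekday(),  # Monday == 0
        "pickup_month": ts.month,
        "trip_distance_km": haversine_km(
            trip["pickup_latitude"], trip["pickup_longitude"],
            trip["dropoff_latitude"], trip["dropoff_longitude"],
        ),
    }


# Made-up sample record, for illustration only.
trip = {
    "pickup_datetime": "2015-01-27 13:08:24",
    "pickup_latitude": 40.7638, "pickup_longitude": -73.9733,
    "dropoff_latitude": 40.7438, "dropoff_longitude": -73.9911,
}
features = derive_features(trip)
print(features)
```

An auto feature engineering tool applies this kind of transformation to every datetime and coordinate column it detects, which is why the compute engine matters at the 55M-record scale.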
load PIPELINE and execute
- twitter pipeline re-load and execute: an example showing how RecDP_v2.0 reloads a pipeline from JSON and executes it, using RecDP as the compute engine.
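The idea of storing a pipeline as JSON and executing it later can be sketched as follows. This is a minimal standard-library illustration, not RecDP's JSON schema or API: the "steps"/"op"/"args" layout, the operation names, and the run_pipeline helper are all hypothetical.

```python
import json


def op_fillna(rows, column, value):
    """Replace missing (None) entries of a column with a constant."""
    return [{**r, column: value if r.get(column) is None else r[column]} for r in rows]


def op_scale(rows, column, factor, output):
    """Write column * factor into a new output column."""
    return [{**r, output: r[column] * factor} for r in rows]


# Hypothetical registry: maps an operation name stored in JSON to its implementation.
OPS = {"fillna": op_fillna, "scale": op_scale}


def run_pipeline(pipeline_json, rows):
    """Reload a pipeline from its JSON text and apply each step in order."""
    for step in json.loads(pipeline_json)["steps"]:
        rows = OPS[step["op"]](rows, **step["args"])
    return rows


# "Export": the pipeline is plain data, so it serializes directly to JSON.
pipeline = {"steps": [
    {"op": "fillna", "args": {"column": "passenger_count", "value": 1}},
    {"op": "scale", "args": {"column": "distance_km", "factor": 0.621371,
                             "output": "distance_mi"}},
]}
pipeline_json = json.dumps(pipeline, indent=2)

# "Reload and execute" on made-up rows.
rows = [
    {"passenger_count": None, "distance_km": 10.0},
    {"passenger_count": 3, "distance_km": 2.0},
]
result = run_pipeline(pipeline_json, rows)
print(result)
```

Because the pipeline is a description rather than code, the same JSON could in principle be mapped to a different registry of operations, which is the design that lets one pipeline target several engines.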
Data Profiler Examples
- NYC Taxi fare Profiler: an example showing how RecDP_v2.0 profiles data, including inferring potential data types and generating data distribution charts.
- twitter Profiler: an example showing how RecDP_v2.0 profiles data, including inferring potential data types and generating data distribution charts.
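As a rough illustration of what "inferring the potential data type" involves, the sketch below guesses a type for each column from its raw string values and counts missing and frequent values. It uses only the standard library and is not RecDP's profiler; the inference rules, the datetime format, and the sample columns (including payment_type) are assumptions.

```python
from collections import Counter
from datetime import datetime

MISSING = ("", None)


def infer_type(values):
    """Guess a column type from raw values: int, float, datetime, or string."""
    present = [v for v in values if v not in MISSING]
    if not present:
        return "unknown"

    def all_parse(parser):
        # True only if every non-missing value parses without error.
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S")):
        return "datetime"
    return "string"


def profile(columns):
    """Build a small per-column report: inferred type, missing count, top values."""
    report = {}
    for name, values in columns.items():
        present = [v for v in values if v not in MISSING]
        report[name] = {
            "inferred_type": infer_type(values),
            "missing": len(values) - len(present),
            "top_values": Counter(present).most_common(3),
        }
    return report


# Made-up sample columns, as they might look when read from a CSV file.
columns = {
    "fare_amount": ["4.5", "16.9", "", "7.7"],
    "passenger_count": ["1", "2", "1", "1"],
    "pickup_datetime": ["2015-01-27 13:08:24", "2015-01-28 09:00:00",
                        "2015-02-01 23:59:59", "2015-02-02 00:00:01"],
    "payment_type": ["card", "cash", "card", "card"],
}
report = profile(columns)
for name, stats in report.items():
    print(name, stats)
```

The value counts collected here are the kind of summary a distribution chart would be drawn from.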
LICENSE
- Apache 2.0
Dependency
- Spark 3.x
- python 3.*