
A data processing bundle for Spark-based recommender system operations

Project description

RecDP v2.0

INTRODUCTION

Problem Statement

Data preparation is an essential step in building AI pipelines.

  • Key data preparation capabilities: data connectors, cleaning, sampling, joining, profiling, feature engineering, low-code/no-code UI, lineage, etc.
  • Exploring the optimal data preparation consumes the majority of data science time.

Solution with RecDP v2.0

  • Auto pipeline
    • only 3 lines of code required
  • Pipeline Generator
    • Data Profiling:
      • automatic anomaly detection
      • automatic missing-value imputation
      • profiling visualization
    • Feature Wrangling:
      • feature transformation (datetime, geo_info, text_nlp, url, etc.)
      • automatic joining of multiple datasets
      • feature cross (aggregation transformations: sum, avg, count, etc.)
    • export the pipeline as a JSON file that can be imported into other data platforms (see the sketch after this list)
  • Pipeline Runner:
    • Spark engine: converts the pipeline to Spark code and runs it
    • pandas engine: converts the pipeline to pandas code and runs it
    • SQL engine: converts the pipeline to SQL
  • DataLoader:
    • parquet, csv, json, database
  • FeatureWriter - ML/DL connector:
    • Data Lineage
    • Feature Store
    • numpy, csv, parquet, DGL/PyG graph

RecDP v2.0 Overview
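
As a rough illustration of the JSON export listed above, the sketch below builds a pipeline on a toy frame and writes it out. The `export` method name and the file name are assumptions for illustration only; check them against the installed pyrecdp version.

import pandas as pd
from pyrecdp.autofe import FeatureWrangler

# Toy stand-in for the NYC taxi data used elsewhere in this README.
train_data = pd.DataFrame({
    "pickup_datetime": ["2015-01-01 08:00:00", "2015-01-01 09:30:00"],
    "passenger_count": [1, 2],
    "fare_amount": [7.5, 12.0],
})

# Auto-generate the feature-engineering pipeline for the target column.
pipeline = FeatureWrangler(dataset=train_data, label="fare_amount")

# Persist the generated pipeline as JSON so another platform can import it
# (the `export` method name is an assumption, not confirmed API).
pipeline.export("fare_pipeline.json")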

This solution is intended for

citizen data scientists, enterprise users, independent software vendors, and some cloud service providers.

Getting Started

setup with pip

git clone --single-branch --branch RecDP_v2.0 https://github.com/intel-innersource/frameworks.bigdata.AIDK.git
cd frameworks.bigdata.AIDK/RecDP
# install dependencies
apt-get update -y &&  DEBIAN_FRONTEND=noninteractive apt-get install -y python3 python3-pip python-is-python3 graphviz
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
# install recdp
python setup.py sdist
pip install dist/pyrecdp-1.0.1.tar.gz
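
Assuming the steps above completed, a quick sanity check from Python (nothing RecDP-specific beyond the package name):

# The install above should make the pyrecdp package importable.
import pyrecdp
print("pyrecdp imported successfully")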

sh start-jupyter.sh
# open browser with http://hostname:8888

run

from pyrecdp.autofe import FeatureWrangler

# train_data: the raw training table (e.g., a pandas DataFrame) including the target column
pipeline = FeatureWrangler(dataset=train_data, label="fare_amount")
pipeline.plot()  # visualize the auto-generated feature-engineering pipeline
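
To materialize the engineered features, the generated pipeline can then be executed. A minimal sketch continuing the example above; the `fit_transform` method and its `engine_type` values ("pandas", "spark") reflect typical RecDP v2.0 usage but should be verified against your installed version.

# Execute the generated pipeline and return the transformed table
# (method name and engine argument are assumptions; verify locally).
transformed_df = pipeline.fit_transform(engine_type="pandas")
print(transformed_df.head())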

NYC taxi demo

Quick Example

More Examples - complete examples, including training

Auto Feature Engineering vs. featuretools

  • NYC Taxi fare auto data preparation: An example showing how RecDP_v2.0 automatically generates datetime and geo features on 55M records. Tested with both Spark and pandas (featuretools) as the compute engine; Spark shows a 21x speedup.

load PIPELINE and execute
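
A sketch of the load-and-execute flow referenced above, assuming a previously exported pipeline JSON; the `import_from_json` loader below is hypothetical and only illustrates the intended round trip.

from pyrecdp.autofe import FeatureWrangler

# Hypothetical loader: re-create the pipeline from an exported JSON file
# (not confirmed RecDP API; adapt to the actual import entry point).
pipeline = FeatureWrangler.import_from_json("fare_pipeline.json")

# Re-run the same pipeline, e.g. on the Spark engine (argument values are assumed).
result_df = pipeline.fit_transform(engine_type="spark")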

Data Profiler Examples

  • NYC Taxi fare Profiler: An example showing RecDP_v2.0 profiling data, including inferring potential data types and generating data distribution charts.

  • Twitter Profiler: An example showing RecDP_v2.0 profiling data, including inferring potential data types and generating data distribution charts (a usage sketch follows below).
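
Programmatically, the profiler follows the same pattern as the wrangler. A minimal sketch, assuming a `FeatureProfiler` class with a `visualize_analyze` method in pyrecdp.autofe; verify both names against the installed release.

import pandas as pd
from pyrecdp.autofe import FeatureProfiler  # assumed class name

# Toy frame standing in for the NYC taxi data.
train_data = pd.DataFrame({
    "pickup_datetime": ["2015-01-01 08:00:00", "2015-01-01 09:30:00"],
    "fare_amount": [7.5, 12.0],
})

profiler = FeatureProfiler(dataset=train_data, label="fare_amount")
profiler.visualize_analyze()  # assumed: renders inferred types and distribution charts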

LICENSE

  • Apache 2.0

Dependency

  • Spark 3.x
  • Python 3.x

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyrecdp-1.0.1b20230307.tar.gz (168.1 kB)

Uploaded Source

Built Distribution

pyrecdp-1.0.1b20230307-py3-none-any.whl (185.5 kB)

Uploaded Python 3

File details

Details for the file pyrecdp-1.0.1b20230307.tar.gz.

File metadata

  • Download URL: pyrecdp-1.0.1b20230307.tar.gz
  • Upload date:
  • Size: 168.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for pyrecdp-1.0.1b20230307.tar.gz
Algorithm Hash digest
SHA256 a754e1eccf20c03bab97a112fe2a41eec5cf395c6938a0f610e2cb42a8ba014c
MD5 6854ac9e9dda2d4050cba81d82fb2821
BLAKE2b-256 f5d4e06eb7ea77b7eab567a083e81ca1fc72328a755710911a96886637f66368


File details

Details for the file pyrecdp-1.0.1b20230307-py3-none-any.whl.

File hashes

Hashes for pyrecdp-1.0.1b20230307-py3-none-any.whl
Algorithm Hash digest
SHA256 f526fd1cd05811db3c40e335426495dc2c9277c7aa4c61787f5780603111f689
MD5 b318eec9feb1e383a2f9252d2e38c232
BLAKE2b-256 e66073cc017677d2f213edde0fcb7460097fd710df9c2c1ada1264b32148ffc0

