A data processing bundle for Spark-based recommender system operations

Project description

RecDP v2.0

INTRODUCTION

Problem Statement

Data preparation is an essential step in building AI pipelines.

Key data preparation capabilities: data connector, cleaning, sampling, joining, profiling, feature engineering, low-code/no-code UI, lineage, etc.
Exploring the optimal data preparation consumes the majority of data science time.

Solution with RecDP v2.0

Auto pipeline:
- only 3 lines of code required
Pipeline Generator:
- Data Profiling:
  - automatic anomaly detection
  - automatic missing-value imputation
  - profiling visualization
- Feature Wrangling:
  - feature transformation (datetime, geo_info, text_nlp, url, etc.)
  - automatic joining of multiple tables
  - feature cross (aggregation transformations: sum, avg, count, etc.)
- export the pipeline as a JSON file that can be imported into other data platforms
Pipeline Runner:
- Spark engine: converts the pipeline to Spark code and runs it
- pandas engine: converts the pipeline to pandas code and runs it
- SQL engine: converts the pipeline to SQL
DataLoader:
- parquet, csv, json, database
FeatureWriter - ML/DL connector:
- Data Lineage
- Feature Store
- numpy, csv, parquet, dgl / pyG graph

This solution is intended for

citizen data scientists, enterprise users, independent software vendors, and some cloud service providers.

Getting Started

Set up with pip

git clone --single-branch --branch RecDP_v2.0 https://github.com/intel-innersource/frameworks.bigdata.AIDK.git
cd frameworks.bigdata.AIDK/RecDP
# install dependencies
apt-get update -y &&  DEBIAN_FRONTEND=noninteractive apt-get install -y python3 python3-pip python-is-python3 graphviz
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre
# install recdp
python setup.py sdist
pip install dist/pyrecdp-1.0.1.tar.gz

sh start-jupyter.sh
# open browser with http://hostname:8888

run

nyc taxi demo

modify pipeline (add user defined function or remove operation)

# Add a user-defined function as a new operation
def filter_with_cond(df):
    # keep only rows whose 'Year' column is 2018 or earlier
    df = df[df['Year'] <= 2018]
    return df

operation = {
    "children": [6],  # existing op(s) that the new op is appended to
    "next": [7],  # op(s) that will be connected after the new op
    "inline_function": filter_with_cond,  # function to run as the new op
}
pipeline.add_operation(operation)

# Remove an existing operation
operation = {
    "idx": 6,  # id of the op to be deleted
    "next": [10],  # op(s) connected to the op being deleted
}
pipeline.delete_operation(operation)
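
The "children" and "next" fields above describe how an operation is wired into the pipeline graph. The sketch below illustrates one plausible reading of that rewiring on a plain dictionary. It is not RecDP's implementation: the helper functions, the "custom_operator" label, and the rule that a new op receives the next free id are all assumptions made for illustration.

```python
# Hypothetical stand-in for the pipeline graph: op id -> {"op": name, "children": [input op ids]}.
# This only illustrates the assumed semantics of "children" and "next"; it is not RecDP code.

def add_operation(graph, operation):
    """Insert a new op after operation["children"] and before operation["next"]."""
    new_id = max(graph) + 1  # assumption: the new op gets the next free id
    graph[new_id] = {
        "op": "custom_operator",  # hypothetical label for a user-defined function
        "children": list(operation["children"]),
        "inline_function": operation["inline_function"],
    }
    # Downstream ops that used to read from the parents now read from the new op.
    for nxt in operation["next"]:
        graph[nxt]["children"] = [
            new_id if child in operation["children"] else child
            for child in graph[nxt]["children"]
        ]
    return new_id


def delete_operation(graph, operation):
    """Remove op operation["idx"] and reconnect operation["next"] to its inputs."""
    removed = graph.pop(operation["idx"])
    for nxt in operation["next"]:
        rewired = []
        for child in graph[nxt]["children"]:
            if child == operation["idx"]:
                rewired.extend(removed["children"] or [])
            else:
                rewired.append(child)
        graph[nxt]["children"] = rewired


def keep_all(df):
    """Placeholder user-defined function that returns its input unchanged."""
    return df


graph = {
    5: {"op": "datetime_feature", "children": [4]},
    6: {"op": "haversine", "children": [5]},
    7: {"op": "drop", "children": [6]},
}

new_id = add_operation(graph, {"children": [6], "next": [7], "inline_function": keep_all})
delete_operation(graph, {"idx": 6, "next": [new_id]})
```

In this toy graph the new op is first placed between ops 6 and 7, then op 6 is removed and the new op is reconnected to op 5.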

export pipeline

pipeline.export(file_path = "exported_pipeline.json")

{
    "0": {
        "children": null,
        "op": "DataFrame",
        "config": "main_table"
    },
    "1": {
        "children": [0],
        "op": "type_infer",
        "config": [
            ["pickup_datetime",["is_string","is_datetime"]],
            ["pickup_longitude",["is_numeric","is_float"]],
            ["pickup_latitude",["is_numeric","is_float"]],
            ["dropoff_longitude",["is_numeric","is_float"]],
            ["dropoff_latitude",["is_numeric","is_float"]],
            ["passenger_count",["is_numeric","is_int64","is_integer","is_categorical"]]
        ]
    },
    "2": {
        "children": [1],
        "op": "tuple",
        "config": {
            "src": ["pickup_latitude","pickup_longitude"],"dst": "pickup_coordinates"
        }
    },
    "3": {
        "children": [2],
        "op": "tuple",
        "config": {
            "src": ["dropoff_latitude","dropoff_longitude"],"dst": "dropoff_coordinates"
        }
    },
    "4": {
        "children": [3],
        "op": "fillna",
        "config": {
            "pickup_longitude": -1,
            "pickup_latitude": -1,
            "dropoff_longitude": -1,
            "dropoff_latitude": -1,
            "passenger_count": -1
        }
    },
    "5": {
        "children": [4],
        "op": "datetime_feature",
        "config": {
            "pickup_datetime": [
                ["pickup_datetime__day",["featuretools.primitives.standard.transform.datetime.day", "Day"]],
                ["pickup_datetime__month",["featuretools.primitives.standard.transform.datetime.month","Month"]],
                ["pickup_datetime__weekday",["featuretools.primitives.standard.transform.datetime.weekday","Weekday"]],
                ["pickup_datetime__year",["featuretools.primitives.standard.transform.datetime.year","Year"]],
                ["pickup_datetime__hour",["featuretools.primitives.standard.transform.datetime.hour","Hour"]]
            ]
        }
    },
    "6": {
        "children": [5],
        "op": "haversine",
        "config": {
            "['pickup_coordinates', 'dropoff_coordinates']": ["haversine_pickup_coordinates_dropoff_coordinates",["featuretools.primitives.standard.transform.latlong.haversine","Haversine"]]
        }
    },
    "7": {
        "children": [6],
        "op": "drop",
        "config": [
            "pickup_datetime",
            "pickup_coordinates",
            "dropoff_coordinates"
        ]
    },
    "8": {
        "children": [7],
        "op": "lightgbm",
        "config": {
            "label": "fare_amount",
            "metrics": "rmse",
            "objective": "regression",
            "model_file": "lightgbm_regression_label.mdl",
            "method": "predict"
        }
    }
}
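
Because the exported pipeline is plain JSON, it can be inspected without RecDP. The following is a minimal sketch, using only the Python standard library, that orders the ops so every op comes after its "children" (inputs). The inline JSON is a shortened excerpt in the same shape as the export above, and `execution_order` is a hypothetical helper, not part of RecDP.

```python
import json

# Shortened excerpt in the same shape as the exported pipeline above.
exported = """
{
  "0": {"children": null, "op": "DataFrame", "config": "main_table"},
  "1": {"children": [0], "op": "type_infer", "config": []},
  "2": {"children": [1], "op": "fillna", "config": {"passenger_count": -1}}
}
"""


def execution_order(pipeline):
    """Return op ids so that every op appears after all of its children (inputs)."""
    ordered, visited = [], set()

    def visit(op_id):
        if op_id in visited:
            return
        visited.add(op_id)
        # JSON object keys are strings, while "children" holds integers.
        for child in pipeline[op_id]["children"] or []:
            visit(str(child))
        ordered.append(op_id)

    for op_id in sorted(pipeline, key=int):
        visit(op_id)
    return ordered


pipeline = json.loads(exported)
order = execution_order(pipeline)
for op_id in order:
    print(op_id, pipeline[op_id]["op"])
```

To inspect a real export, replace `json.loads(exported)` with `json.load` on the opened "exported_pipeline.json" file.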

Quick Example

- nyc taxi fare - geographic, datetime, lgbm regression
- IBM Card Transaction Fraud Detection - one-hot encoding, lgbm, binary classification
- twitter recsys - text nlp, datetime feature engineering
- outbrain - multiple table joining
- amazon product review - text nlp, datetime, feature-cross

More Examples - complete examples, including training

Auto Feature Engineering vs. featuretools

NYC Taxi fare auto data preparation: an example showing how RecDP_v2.0 automatically generates datetime and geo features on 55M records. Tested with both Spark and pandas (featuretools) as the compute engine; Spark shows a 21x speedup.
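
To make the generated features concrete, here is a standard-library sketch of what the datetime and haversine features compute for a single record. It illustrates the underlying formulas and is not how RecDP or featuretools implement them. The kilometre unit, the timestamp format, the function names, and the sample row are assumptions. The output column names follow the pattern seen in the exported pipeline.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; reporting in kilometres is an assumption


def haversine_km(origin, destination):
    """Great-circle distance between two (latitude, longitude) pairs given in degrees."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, destination)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def datetime_features(name, value):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into day, month, weekday, year and hour."""
    ts = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return {
        f"{name}__day": ts.day,
        f"{name}__month": ts.month,
        f"{name}__weekday": ts.weekday(),  # Monday == 0
        f"{name}__year": ts.year,
        f"{name}__hour": ts.hour,
    }


# Made-up sample row, for illustration only.
row = {
    "pickup_datetime": "2015-01-27 13:08:24",
    "pickup_coordinates": (40.7614, -73.9798),
    "dropoff_coordinates": (40.6513, -73.8803),
}

features = datetime_features("pickup_datetime", row["pickup_datetime"])
features["haversine_pickup_coordinates_dropoff_coordinates"] = haversine_km(
    row["pickup_coordinates"], row["dropoff_coordinates"]
)
print(features)
```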

Load pipeline and execute

twitter pipeline re-load and execute: an example showing how RecDP_v2.0 reloads a pipeline from JSON and executes it, using RecDP as the compute engine.

Realtime Inference pipeline

nyc taxi fare - geographic, datetime feature engineering, lgbm

Data Profiler Examples

NYC Taxi fare Profiler: an example showing how RecDP_v2.0 profiles data, including inferring potential data types and generating data distribution charts.
twitter Profiler: an example showing how RecDP_v2.0 profiles data, including inferring potential data types and generating data distribution charts.

LICENSE

Apache 2.0

Dependency

Spark 3.x
Python 3.x

Download files

Source Distribution

pyrecdp-1.0.1b202304251.tar.gz (181.3 kB)

Uploaded Apr 25, 2023 Source

Hashes for pyrecdp-1.0.1b202304251.tar.gz
Algorithm	Hash digest
SHA256	`2136a66d741837471f3294c7760d80045506b3ae8bf030a72a3f639143d66b38`
MD5	`a0586c824588061f5e55f05cb0135710`
BLAKE2b-256	`70dca20860788c34cb1f9e8d4393e3c382e776013a7e1291f16cb8585738871e`