A data processing bundle for Spark-based recommender system operations

Project description

RecDP - a one-stop toolkit for AI data processing

We provide Intel-optimized solutions for:

  • Auto Feature Engineering - Provides an automatic way to generate new features for any tabular dataset containing numerical, categorical, and text features. It takes only 3 lines of code to automatically enrich features based on data analysis, statistics, clustering, and multi-feature interaction.
  • LLM Data Preparation - Provides a parallelized, easy-to-use data pipeline for LLM data processing. It supports multiple data sources such as jsonlines, PDFs, images, and audio/video. Users can perform data extraction, deduplication (near-dedup, ROUGE, exact), splitting, special-character fixing, various types of filtering (length, perplexity, profanity, etc.), and quality analysis (diversity, GPT-3 quality, toxicity, perplexity, etc.). The tool can also save output as jsonlines or parquet files, or insert it into vector stores (FaissStore, ChromaStore, ElasticSearchStore).

How it works

Install this tool through pip:

DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp[all] --pre
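
A quick sanity check after installation (a minimal sketch; it assumes pyspark is pulled in by pyrecdp[all] and that the Java 8 JRE installed above is on the path):

# Verify the installation: pyrecdp should import, and pyspark (assumed to be
# installed as a dependency of pyrecdp[all]) should report its version.
import pyrecdp
import pyspark

print(pyspark.__version__)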

RecDP - Tabular

learn more

  • Auto Feature Engineering Pipeline

Only 3 lines of code are needed to generate new features for your tabular data. Typically around 5x as many features are discovered, with up to a 1.2x accuracy boost.

from pyrecdp.autofe import AutoFE

# Build the auto feature engineering pipeline on the training data,
# using the 'Day' column as the time-series key, then fit and transform.
pipeline = AutoFE(dataset=train_data, label=target_label, time_series='Day')
transformed_train_df = pipeline.fit_transform()
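
The train_data and target_label inputs above are not defined in the snippet; a minimal sketch of how they might be prepared is shown below (the file name and column name are hypothetical, not part of the RecDP API):

import pandas as pd

# Hypothetical inputs for the AutoFE example above: any pandas DataFrame with a
# 'Day' timestamp column and a target column will do.
train_data = pd.read_csv("train.csv")  # hypothetical dataset path
target_label = "label"                 # hypothetical target column name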
  • High performance on terabyte-scale tabular data processing

RecDP - LLM

learn more

  • Low-code, fault-tolerant, auto-scaling parallel pipeline
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import ResumableTextPipeline

# Assemble a resumable pipeline: crawl the given URLs, split the documents,
# filter profanity, remove PII, and write one parquet file per input file.
pipeline = ResumableTextPipeline()
ops = [
    UrlLoader(urls, max_depth=2),   # 'urls' is a list of source URLs to crawl
    DocumentSplit(),
    ProfanityFilter(),
    PIIRemoval(),
    ...
    PerfileParquetWriter("ResumableTextPipeline_output")
]
pipeline.add_operations(ops)
pipeline.execute()
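
Because results are written out per input file, an interrupted run can be re-executed and continue from the files that have not yet produced output, which is the fault tolerance the pipeline name refers to.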

LICENSE

  • Apache 2.0

Dependencies

  • Spark 3.4.*
  • Python 3.*
  • Ray 2.7.*

