A data processing bundle for Spark-based recommender system operations
RecDP - one-stop toolkit for AI data processing
We provide Intel-optimized solutions for:
- Auto Feature Engineering - Automatically generates new features for any tabular dataset that contains numerical, categorical, and text features. It takes only 3 lines of code to enrich features based on data analysis, statistics, clustering, and multi-feature interaction.
- LLM Data Preparation - Provides a parallel, easy-to-use data pipeline for LLM data processing. It supports multiple data sources such as jsonlines, PDFs, images, and audio/video. Users can perform data extraction, deduplication (near dedup, rouge, exact), splitting, special-character fixing, several types of filtering (length, perplexity, profanity, etc.), and quality analysis (diversity, GPT3 quality, toxicity, perplexity, etc.). The tool can also save output as jsonlines or parquet files, or insert it into vector stores (FaissStore, ChromaStore, ElasticSearchStore).
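To make two of the steps named above concrete, here is a minimal sketch of exact deduplication and length filtering over JSON Lines records. It is plain Python and does not use RecDP; the function names, the `text` field, and the length thresholds are assumptions chosen for illustration.

```python
import hashlib
import json


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def exact_dedup(records, field="text"):
    """Keep the first record for each distinct (whitespace-stripped) value of `field`."""
    seen = set()
    kept = []
    for record in records:
        # Store a fixed-size digest instead of the full text.
        digest = hashlib.sha256(record[field].strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept


def length_filter(records, field="text", min_len=10, max_len=1000):
    """Drop records whose text length falls outside [min_len, max_len]."""
    return [r for r in records if min_len <= len(r[field]) <= max_len]
```

Near-duplicate detection and the quality filters listed above need more machinery (for example, similarity hashing or a language model) and are not covered by this sketch.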
How it works
Install the system dependencies (a Java 8 runtime and Graphviz), then install this tool through pip.
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp[all] --pre
RecDP - Tabular
- Auto Feature Engineering Pipeline
It takes only 3 lines of code to generate new features for your tabular data. Usually about 5x as many new features are found, with up to a 1.2x accuracy boost.
from pyrecdp.autofe import AutoFE
pipeline = AutoFE(dataset=train_data, label=target_label, time_series='Day')
transformed_train_df = pipeline.fit_transform()
- High Performance on Terabyte Tabular data processing
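The feature generators that AutoFE applies internally are not described here. As a rough illustration of the kind of derived columns an automated feature engineering step might add, the sketch below computes a count encoding (a simple statistic) and a cross feature (a multi-feature interaction) in plain Python. The helper names and the sample columns are hypothetical and are not part of the RecDP API.

```python
from collections import Counter


def add_count_encoding(rows, column):
    """Add `<column>_count`: how often each row's category occurs in the dataset."""
    counts = Counter(row[column] for row in rows)
    return [{**row, f"{column}_count": counts[row[column]]} for row in rows]


def add_cross_feature(rows, col_a, col_b):
    """Add `<col_a>_x_<col_b>`: two categorical values combined into one feature."""
    return [{**row, f"{col_a}_x_{col_b}": f"{row[col_a]}_{row[col_b]}"} for row in rows]


# Hypothetical sample data with two categorical columns.
rows = [
    {"city": "NYC", "device": "mobile"},
    {"city": "NYC", "device": "desktop"},
    {"city": "SF", "device": "mobile"},
]
enriched = add_cross_feature(add_count_encoding(rows, "city"), "city", "device")
```

Each input row keeps its original columns and gains two new ones, which is the general shape of "enriching" a table with generated features.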
RecDP - LLM
- Low-code Fault-tolerant Auto-scaling Parallel Pipeline
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import ResumableTextPipeline
pipeline = ResumableTextPipeline()
ops = [
    UrlLoader(urls, max_depth=2),
    DocumentSplit(),
    ProfanityFilter(),
    PIIRemoval(),
    ...
    PerfileParquetWriter("ResumableTextPipeline_output")
]
pipeline.add_operations(ops)
pipeline.execute()
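The names "fault-tolerant" and `ResumableTextPipeline` suggest that work completed before a failure is skipped on a rerun. How RecDP implements this is not described here; the following is a minimal sketch of one way to get that behavior, assuming a per-file status log. All names in it (`run_resumable`, `status.log`) are hypothetical.

```python
import os
import tempfile


def run_resumable(files, ops, status_path):
    """Apply `ops` in order to each file's text, skipping files already logged as done."""
    done = set()
    if os.path.exists(status_path):
        with open(status_path, encoding="utf-8") as fh:
            done = {line.strip() for line in fh if line.strip()}

    results = {}
    for name, text in files.items():
        if name in done:
            continue  # finished in a previous run
        for op in ops:
            text = op(text)
        results[name] = text
        # Record completion only after every operation has succeeded.
        with open(status_path, "a", encoding="utf-8") as fh:
            fh.write(name + "\n")
    return results


ops = [str.strip, str.lower]  # stand-ins for real text operations
files = {"a.txt": "  Hello ", "b.txt": " World  "}
status = os.path.join(tempfile.mkdtemp(), "status.log")
first = run_resumable(files, ops, status)   # processes both files
second = run_resumable(files, ops, status)  # nothing left to do
```

Because a file is logged only after its last operation finishes, a crash mid-file causes that file to be reprocessed from the start on the next run, while completed files are left alone.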
LICENSE
- Apache 2.0
Dependencies
- Spark 3.4.*
- Python 3.*
- Ray 2.7.*
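One quick way to check an environment against these pins is to compare installed versions with the wildcard patterns. The sketch below uses only the Python standard library; the distribution names `pyspark` and `ray` are assumed to be the PyPI packages meant by "Spark" and "Ray", and the helper names are hypothetical.

```python
from fnmatch import fnmatch
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list; distribution names are assumptions.
REQUIRED = {"pyspark": "3.4.*", "ray": "2.7.*"}


def matches(installed, pattern):
    """Return True if a version string satisfies a wildcard pin such as '3.4.*'."""
    return fnmatch(installed, pattern)


def check_environment(required=REQUIRED):
    """Map each distribution to 'ok', 'missing', or the mismatching version found."""
    report = {}
    for dist, pattern in required.items():
        try:
            found = version(dist)
        except PackageNotFoundError:
            report[dist] = "missing"
            continue
        report[dist] = "ok" if matches(found, pattern) else found
    return report
```

Calling `check_environment()` returns a small dictionary that can be printed before running a pipeline, which makes version mismatches visible early.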
Source Distribution
pyrecdp-1.2.1b2024012411.tar.gz (288.8 kB)
File details
Details for the file pyrecdp-1.2.1b2024012411.tar.gz.
File metadata
- Download URL: pyrecdp-1.2.1b2024012411.tar.gz
- Upload date:
- Size: 288.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 569fbdc69159c2e6a12fb928bbfc0801e161b8946ecd881bce3409de6b1fa560
MD5 | 4a878327de71d5f170b371e75ed67eae
BLAKE2b-256 | ddaae44ec393a4264c5ffa612ceec252f81dd5e8bc280ecb5b9c39484e937f9e