
Data Preparation Toolkit Library for creation and execution of transformer flows

Project description

Flows Data Processing Library

This is a framework for combining and locally executing DataPrepKit transforms. Large-scale execution of transformers is based on KubeFlow Pipelines and KubeRay on large Kubernetes clusters. The project provides two examples of "super" KFP workflows. One combines the 'exact dedup', 'document identification', and 'fuzzy dedup' transformers. The other demonstrates processing of programming code: it starts by transforming the code into parquet files, then executes the 'exact dedup', 'document identification', 'fuzzy dedup', 'programming language select', 'code quality', and 'malware' transformers, and finishes with the 'tokenization' transformer.

However, developers or data scientists sometimes want to execute a set of transformers locally, either during the development process or because of the size of the processed data sets.

This package demonstrates two options for how this can be done.

Data Processing Flows

A Flow is a Python representation of a workflow definition. It defines a set of steps to be executed and a set of global parameters common to all steps; each step can override its corresponding parameters. To provide a "real" data flow, a Flow automatically connects the input of each step to the output of the previous one. The global parameter set defines only the input and output of the entire Flow, which are applied to the first and last steps, respectively.
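
As a rough illustration of these ideas, here is a minimal, hypothetical sketch of programmatic Flow creation. The names used (Flow, FlowStep, input_folder, output_folder, noop_sleep_sec) are assumptions for illustration only, not the library's actual API; flow_example.py (referenced below) shows the real code.

# Hypothetical sketch; class and parameter names are illustrative assumptions,
# not the library's actual API (see flow_example.py for the real code).
flow = Flow(
    name="pdf_to_noop",
    input_folder="input/pdfs",      # global parameter: Flow-wide input (first step)
    output_folder="output/final",   # global parameter: Flow-wide output (last step)
    steps=[
        FlowStep(name="pdf2parquet"),                         # pure Python transform
        FlowStep(name="doc_id"),                              # Ray local transform
        FlowStep(name="noop", params={"noop_sleep_sec": 1}),  # step-level parameter override
    ],
)
# Each step's input is wired automatically to the previous step's output.
flow.execute()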

Currently, Flow supports pure Python local transformers and Ray local transformers. Different transformer types can be part of the same Flow. Other transformer types will be added later.

Flow creation

We provide two options for Flow creation: programmatically or with flow_loader from a JSON file. The Flow JSON schema is defined in flow_schema.json.

The examples directory demonstrates creation of a simple Flow with 3 steps: transformation of PDF files into parquet files, document identification, and a noop transformation. The pdf and noop transformations are pure Python transformers, while document identification is a Ray local transformation. You can see the JSON Flow definition in flow_example.json and its execution in run_example.py. The file flow_example.py does the same programmatically.
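
For completeness, a minimal sketch of loading and executing that example from its JSON definition might look as follows; the import path and function signature here are assumptions, so consult run_example.py for the exact invocation.

# Hypothetical sketch; the module path and signature are assumptions --
# run_example.py contains the real invocation.
from flows.flow_loader import flow_loader  # assumed import path

flow = flow_loader("flow_example.json")  # parse and validate against flow_schema.json
flow.execute()                           # run all three steps locally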

Flow execution from a Jupyter notebook

The wdu Jupyter notebook is an example of a flow with a single step, the WDU transform. It can be run with the following commands from the flows directory:

make venv   # needed only for the first execution
. ./venv/bin/activate
pip install jupyter
pip install ipykernel

# register the venv as a Jupyter kernel and start the notebook server
python -m ipykernel install --user --name=venv --display-name="Python venv"
jupyter notebook

The notebook itself is wdu.ipynb.

KFP Local Pipelines

KFP v2 added an option to execute components and pipelines locally; see the Execute KFP pipelines locally overview. Depending on the user's knowledge and preferences, this feature can be another option for executing workflows locally.

KFP supports two local runner types, which determine how and where components are executed: DockerRunner and SubprocessRunner. DockerRunner is the recommended option because it executes each task in a separate container: it offers the strongest form of local runtime environment isolation and is most faithful to the remote runtime environment, but it may require a prebuilt Docker image.
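
For reference, local execution in KFP v2 looks roughly like the following, based on the pattern in the KFP documentation; the add component is a toy placeholder, not part of this project.

from kfp import dsl, local

# Choose a local runner; swap in local.DockerRunner() for container-level isolation.
local.init(runner=local.SubprocessRunner(use_venv=True))

@dsl.component
def add(a: int, b: int) -> int:
    return a + b

# Invoking the component now executes it locally and exposes its outputs.
task = add(a=1, b=2)
assert task.output == 3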

The files local_subprocess_wdu_docId_noop_super_pipeline.py and local_docker_wdu_docId_noop_super_pipeline.py demonstrate a KFP local definition of the same workflow using SubprocessRunner and DockerRunner, respectively.

Note: in order to execute the transformation of PDF files into parquet files, you must be connected to the IBM Intranet with the "TUNNELAL" VPN.

Next Steps

  • Extend support to S3 data sources
  • Support Spark transformers
  • Support isolated virtual environments by executing FlowSteps in subprocesses
  • Investigate more KFP local opportunities
