
Data Preparation Toolkit Library for creation and execution of transformer flows

Project description

Flows Data Processing Library

This is a framework for combining and locally executing Data Prep Kit transforms. Large-scale execution of transformers is based on the use of KubeFlow Pipelines and KubeRay on large Kubernetes clusters. The project provides two examples of "super" KFP workflows: one combines the 'exact dedup', 'document identification' and 'fuzzy dedup' transformers. The other demonstrates processing of programming code: it starts by transforming the code into parquet files, then executes the 'exact dedup', 'document identification', 'fuzzy dedup', 'programming language select', 'code quality' and 'malware' transformers, and finishes with the 'tokenization' transformer.

However, developers or data scientists sometimes want to execute a set of transformers locally, for example during development or when the processed data sets are small enough not to require a cluster.

This package demonstrates two options for how this can be done.

Data Processing Flows

Flow is a Python representation of a workflow definition. It defines a set of steps to execute and a set of global parameters common to all steps; each step can override its corresponding parameters. To provide a "real" data flow, Flow automatically connects the input of each step to the output of the previous one. The global parameter set defines only the input and output of the entire Flow, which are assigned to the first and last steps, respectively.
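To illustrate this chaining idea, here is a minimal, self-contained sketch; the Flow and FlowStep classes below are toy stand-ins defined in the snippet itself, not this package's actual API (see flow_example.py in the examples directory for the real programmatic form):

# Conceptual sketch only -- NOT the package API. It shows how the Flow-level
# input feeds the first step, each step's output feeds the next step, and
# the Flow-level output is taken from the last step.
from dataclasses import dataclass, field

@dataclass
class FlowStep:
    name: str
    params: dict = field(default_factory=dict)  # step-level parameter overrides

@dataclass
class Flow:
    input_folder: str                    # global parameter: Flow input
    output_folder: str                   # global parameter: Flow output
    steps: list = field(default_factory=list)

    def execute(self) -> None:
        current_input = self.input_folder
        for i, step in enumerate(self.steps):
            last = i == len(self.steps) - 1
            step_output = self.output_folder if last else f"intermediate/{step.name}"
            print(f"running {step.name}: {current_input} -> {step_output}")
            current_input = step_output  # output of this step becomes input of the next

Flow(
    input_folder="input/pdfs",
    output_folder="output/final",
    steps=[FlowStep("pdf2parquet"), FlowStep("doc_id"), FlowStep("noop")],
).execute()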

Currently, Flow supports pure Python local transformers and Ray local transformers. Different transformer types can be part of the same Flow. Other transformer types will be added later.

Flow creation

We provide two options for Flow creation: programmatically or via flow_loader from a JSON file. The Flow JSON schema is defined in flow_schema.json.

The examples directory demonstrates creation of a simple Flow with three steps: transformation of PDF files into parquet files, document identification, and a noop transformation. The pdf and noop transformations are pure Python transformers, and the document identification is a Ray local transformer. You can see the JSON Flow definition in flow_example.json and its execution in run_example.py. The file flow_example.py does the same programmatically.
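For orientation only, a JSON flow definition of this kind might look roughly as follows; the field names here are guesses for illustration, and flow_schema.json together with flow_example.json remain the authoritative reference:

# Hypothetical shape of a JSON flow definition; field names are illustrative
# assumptions, not taken from flow_schema.json.
import json

flow_definition = {
    "name": "pdf_docid_noop",
    "parameters": {"input_folder": "input/pdfs", "output_folder": "output/final"},
    "steps": [
        {"name": "pdf2parquet", "type": "python"},
        {"name": "doc_id", "type": "ray"},
        {"name": "noop", "type": "python"},
    ],
}
print(json.dumps(flow_definition, indent=2))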

Flow execution from a Jupyter notebook

The wdu Jupyter notebook is an example of a flow with a single WDU transform step. You can run it with the following commands from the flows directory:

make venv   # need to run only during the first execution
. ./venv/bin/activate
pip install jupyter
pip install ipykernel

# register the venv as a Jupyter kernel and start the notebook server
python -m ipykernel install --user --name=venv --display-name="Python venv"
jupyter notebook

The notebook itself is wdu.ipynb.

KFP Local Pipelines

KFP v2 added an option to execute components and pipelines locally; see the KFP documentation "Execute KFP pipelines locally". Depending on the user's knowledge and preferences, this feature can be another way to run the workflows described above locally.

KFP supports two local runner types, which determine how and where components are executed: DockerRunner and SubprocessRunner. DockerRunner is the recommended option because it executes each task in a separate container; it offers the strongest form of local runtime environment isolation and is most faithful to the remote runtime environment, but it may require a prebuilt Docker image.
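As a minimal sketch of this KFP feature (assuming a recent kfp v2 SDK that includes the kfp.local module; the add component below is a toy example, not one of this project's transforms):

# Requires a recent kfp v2 SDK with local execution support.
from kfp import dsl, local

# Pick a runner: SubprocessRunner runs tasks in the local environment,
# DockerRunner runs each task in its own container (needs Docker installed).
local.init(runner=local.SubprocessRunner())
# local.init(runner=local.DockerRunner())

@dsl.component
def add(a: int, b: int) -> int:
    return a + b

# With a local runner initialized, calling the component executes it immediately.
task = add(a=1, b=2)
assert task.output == 3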

The files local_subprocess_wdu_docId_noop_super_pipeline.py and local_docker_wdu_docId_noop_super_pipeline.py demonstrate a KFP local definition of the same workflow using SubprocessRunner and DockerRunner, respectively.

Note: in order to execute the transformation of PDF files into parquet files, you should be connected to the IBM intranet through the "TUNNELAL" VPN.

Next Steps

  • Extend support to S3 data sources
  • Support Spark transformers
  • Support isolated virtual environments by executing FlowSteps in subprocesses
  • Investigate more KFP local opportunities

Download files

Download the file for your platform.

Source Distribution

data_prep_toolkit_flows-0.2.0.tar.gz (2.3 MB)


Built Distribution

data_prep_toolkit_flows-0.2.0-py3-none-any.whl (7.8 kB)


File details

Hashes for data_prep_toolkit_flows-0.2.0.tar.gz:

SHA256: 33bb5606f46972c8f0cb7de0b3eed8e9fcf5b989c7c1cbe9eabcf365fb63c0bd
MD5: 616788052138826d9381b797e5aea02c
BLAKE2b-256: 73e73dcb33197417ddce791d0630839c2e9e96ff11edce405baa6a91903fbce2

File details

Hashes for data_prep_toolkit_flows-0.2.0-py3-none-any.whl:

SHA256: d58fb50a29d62e9a2e6e2d3baf23a2636403db67a6297a7a4ab495c0c567a58a
MD5: 827ce69c61b580f0850db89ad2c493c2
BLAKE2b-256: 91b7eae435e407c100bc991002f48e17649226cbfd346951eb3d6ca09741e608
