Framework for simpler Spark Pipelines
SparkPipelineFramework
SparkPipelineFramework implements a few design patterns that make it easier to create Spark applications. It:
- Separates data transformation logic from pipeline execution code, so you can compose pipelines by simply stringing together transformers (based on the SparkML Pipeline class, but enhanced to work for both ML and non-ML transformations)
- Enables running SQL transformations without writing any code
- Enables versioning of transformations, so different pipelines can use older or newer versions of each transformer and you can upgrade each pipeline at your own pace
- Enables autocomplete of transformations when creating pipelines (in PyCharm)
- Implements many separation-of-concerns features, e.g., logging, performance monitoring, and error reporting
- Supports non-ML, ML, and mixed workloads
- Has an additional library, SparkPipelineFramework.AWS, that makes running Spark pipelines in AWS easier
- Has a sister library, SparkPipelineFramework.Catalog, that implements a data and ML model catalog so you can load and save data by catalog name instead of path and manage different versions of the data
PyPI Package
This code is available as a package to import into your project. https://pypi.org/project/sparkpipelineframework/
Using it in your project
(For an example project that uses SparkPipelineFramework, see https://github.com/imranq2/TestSparkPipelineFramework)
- Add the sparkpipelineframework package to your project's requirements.txt
- Create a folder called library in your project
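For example, a project that follows these steps might be laid out like this (the features/carriers folder and file names are just an illustration of transformations you might add later):

my_project/
    requirements.txt            # includes: sparkpipelineframework
    library/                    # your transformations live under this folder
        features/
            carriers/
                carriers.sql    # example SQL transformation (see below)
    my_pipeline.py              # your FrameworkPipeline subclass (location is up to you)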
To create a new pipeline
- Create a class derived from FrameworkPipeline
- In your __init__ function, set self.transformers to the list of transformers to run for this pipeline. For example:
class MyPipeline(FrameworkPipeline):
    def __init__(self, parameters: AttrDict, progress_logger: ProgressLogger):
        super(MyPipeline, self).__init__(parameters=parameters,
                                         progress_logger=progress_logger)
        self.transformers = flatten([
            [
                FrameworkCsvLoader(
                    view="flights",
                    path_to_csv=parameters["flights_path"]
                )
            ],
            FeaturesCarriers(parameters=parameters).transformers,
        ])
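Because FrameworkPipeline builds on the SparkML Pipeline, a pipeline like the one above can be run with the usual fit/transform calls. Below is a minimal sketch; the SparkSession setup, the empty seed DataFrame, the parameter value, and the ProgressLogger construction are assumptions for illustration rather than the framework's documented API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_pipeline").getOrCreate()
parameters = {"flights_path": "/data/flights.csv"}  # hypothetical input path

# the transformers read and write Spark views, so the seed DataFrame can be empty
seed_df = spark.createDataFrame([], "placeholder string")

pipeline = MyPipeline(parameters=parameters, progress_logger=ProgressLogger())
pipeline.fit(seed_df).transform(seed_df)

spark.table("flights").show()  # view created by FrameworkCsvLoader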
To Add a SQL transformation
- Create a new folder and a .sql file in that folder. The folder can be the library folder or any subfolder you choose under it.
- The name of the file is the name of the view that will be created/updated with the result of your SQL, e.g., carriers.sql means we will create/update a view called carriers with the results of your SQL.
- Add your SQL to the file. This can be any valid Spark SQL and can refer to any view created by the pipeline before this transformer is run. For example:
SELECT carrier, crsarrtime FROM flights
- Run the generate_proxies command as shown in the Generating Proxies section below
- Now go to your pipeline class's __init__ and add to self.transformers. Start typing the folder name and hit Ctrl+Space for PyCharm to autocomplete the proxy class name.
- That's it. Your SQL has been automatically wrapped in a Transformer that does logging, monitors performance, and does error checking.
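For example, if the SQL above lives in a hypothetical library/features/carriers/carriers.sql, the generated proxy class (FeaturesCarriers in the earlier pipeline example) can be added like this:

self.transformers = flatten([
    [
        FrameworkCsvLoader(
            view="flights",
            path_to_csv=parameters["flights_path"]
        )
    ],
    # proxy generated from library/features/carriers/carriers.sql;
    # it creates/updates the "carriers" view when the pipeline runs
    FeaturesCarriers(parameters=parameters).transformers,
])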
To Add a Python transformation
- Create a new folder and a .py file in that folder. The folder can be the library folder or any subfolder you choose under it.
- In the .py file, create a new class derived from Transformer (from SparkML). Implement the _transform() function. For example:
class MyPythonTransformer(Transformer):
    def _transform(self, df: DataFrame) -> DataFrame:
        # read parameters and do your work here; you can either create/update
        # a view or just update the passed-in DataFrame
        return df
- Run the generate_proxies command as shown in the Generating Proxies section below
- Now go to your pipeline class's __init__ and add to self.transformers. Start typing the folder name and hit Ctrl+Space for PyCharm to autocomplete the proxy class name.
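A slightly fuller sketch of such a transformer (the carrier column and the carriers_clean view are illustrative; the imports are from standard pyspark):

from pyspark.ml import Transformer
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, upper

class MyPythonTransformer(Transformer):
    def _transform(self, df: DataFrame) -> DataFrame:
        # derive a column from the incoming DataFrame and publish the result
        # as a view so later transformers (e.g., SQL ones) can reference it
        result_df = df.withColumn("carrier_upper", upper(col("carrier")))
        result_df.createOrReplaceTempView("carriers_clean")
        return df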
To Add a Machine Learning training transformation (called fit or Estimator in SparkML lingo)
- Create a new folder and a .py file in that folder. The folder can be the library folder or any subfolder you choose under it.
- In the .py file, create a new class derived from Estimator (from SparkML). Implement the fit() function.
- Run the generate_proxies command as shown in the Generating Proxies section below
- Now go to your pipeline class's __init__ and add to self.estimators. Start typing the folder name and hit Ctrl+Space for PyCharm to autocomplete the proxy class name.
To Add a Machine Learning prediction transformation
- Create a new folder and a .py file in that folder. The folder can be the library folder or any subfolder you choose under it.
- In the .py file, create a new class derived from Estimator (from SparkML). Implement the _transform() function. Note that this can be the same class you use for training and prediction.
- Run the generate_proxies command as shown in the Generating Proxies section below
- Now go to your pipeline class's __init__ and add to self.transformers. Start typing the folder name and hit Ctrl+Space for PyCharm to autocomplete the proxy class name.
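Putting the last two sections together, one common SparkML pattern is a pair of classes: an Estimator whose fit step learns something from the data, and a Transformer that applies it at prediction time. A minimal sketch follows; note that pyspark's Estimator base class expects _fit() to be overridden (the public fit() method calls it), and the crsarrtime column and the averaging "model" are purely illustrative:

from pyspark.ml import Estimator, Transformer
from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

class MyPredictor(Transformer):
    def __init__(self, average_arrival_time: float):
        super().__init__()
        self.average_arrival_time = average_arrival_time

    def _transform(self, df: DataFrame) -> DataFrame:
        # "predict" by attaching the learned statistic to every row
        return df.withColumn("predicted_arrival_time", lit(self.average_arrival_time))

class MyTrainer(Estimator):
    def _fit(self, df: DataFrame) -> Transformer:
        # "train" by computing a simple statistic from the training data
        average = df.agg({"crsarrtime": "avg"}).collect()[0][0]
        return MyPredictor(average_arrival_time=average)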
Including pipelines in other pipelines
Pipelines are fully composable, so you can include one pipeline as a transformer in another pipeline. For example:
class MyPipeline(FrameworkPipeline):
    def __init__(self, parameters: AttrDict, progress_logger: ProgressLogger):
        super(MyPipeline, self).__init__(parameters=parameters,
                                         progress_logger=progress_logger)
        self.transformers = flatten([
            [
                FrameworkCsvLoader(
                    view="flights",
                    path_to_csv=parameters["flights_path"]
                )
            ],
            PipelineFoo(parameters=parameters).transformers,
            FeaturesCarriers(parameters=parameters).transformers,
        ])
Generating Proxies
- Run the following command to generate proxy classes. These automatically wrap your SQL with a Spark Transformer that can be included in a pipeline with no additional code.
python3 spark_pipeline_framework/proxy_generator/generate_proxies.py
You can also add this to your project Makefile to make it easier to run:
.PHONY: proxies
proxies:
	python3 spark_pipeline_framework/proxy_generator/generate_proxies.py
Contributing
Run make firstime. This will install Java, Scala, Spark, and other packages.