
Build time-saving ML pipelines with built-in autosave and reload

Project description

FastPipeline

Persistent, easy to use, fast to code


Documentation: https://shashank-yadav.github.io/fastpipeline/

Source Code: https://github.com/shashank-yadav/fastpipeline


FastPipeline is a framework for creating general-purpose pipelines in your ML projects. It helps you keep track of your experiments by automatically storing all the intermediate data and source code.

The key features are:

  • Persistence: Automatically stores all the intermediate data and variables during the run.
  • Autoreload: Detects if something has been computed before and reloads it instead of a do-over.
  • Accessible Intermediate Data: The intermediate data is stored as pickle and JSON files, so it can be easily accessed and analyzed.
  • General Purpose: Unlike sklearn pipelines, you don't need to format your data into the required X, y format.
  • Intuitive: Great editor support. Completion everywhere. Less time debugging.
  • Easy: Designed to be easy to use and learn. Less time reading docs.
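The persistence and autoreload features can be pictured with a small standalone sketch. This is not FastPipeline's implementation: the helper name cached_call, the cache file naming, and the use of a JSON-serialized config as the cache key are all assumptions made for illustration.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def cached_call(cache_dir, name, config, compute):
    """Return (result, was_cached), reusing a pickled result when one exists.

    Hypothetical helper: the key is a hash of the step name and its config.
    """
    key_source = json.dumps({"name": name, "config": config}, sort_keys=True)
    key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()[:16]
    path = Path(cache_dir) / f"{name}_{key}.pkl"

    if path.exists():
        # Autoreload: the result was computed before, so load it from disk.
        with path.open("rb") as f:
            return pickle.load(f), True

    # Persistence: compute once and store the result for later runs.
    result = compute(config)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result, False


calls = []  # records every time real work is done


def square_all(config):
    calls.append(config)
    return [x * x for x in config["values"]]


with tempfile.TemporaryDirectory() as tmp:
    first, first_cached = cached_call(tmp, "square", {"values": [1, 2, 3]}, square_all)
    second, second_cached = cached_call(tmp, "square", {"values": [1, 2, 3]}, square_all)
```

In this sketch the second call finds the pickle written by the first one, so square_all runs only once.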

Installation

$ pip install fastpipeline

Example

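Train a classifier on handwritten digits, in the spirit of the (in)famous MNIST dataset. Note that the code below loads scikit-learn's small built-in digits dataset via datasets.load_digits(), not the full MNIST data.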

  • Create a file mnist_pipeline.py
  • Add the necessary imports and create a class DataLoader that extends the BaseNode class from the fastpipeline package. We'll refer to such a class as a Node.
# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
import numpy as np

# Import pipeline and node constructs
from fastpipeline.base_node import BaseNode
from fastpipeline.pipeline import Pipeline

# Node for loading data
class DataLoader(BaseNode):
    def __init__(self):
        super().__init__()

    def run(self, input={}):
        # The digits dataset
        digits = datasets.load_digits()

        # To apply a classifier on this data, we need to flatten each image,
        # turning the data into a (samples, features) matrix:
        n_samples = len(digits.images)
        data = digits.images.reshape((n_samples, -1))
        return {
            'data': data,
            'target': digits.target
        }
  • Create another Node that takes the output of DataLoader as its input and trains an SVM classifier.
# Node for training the classifier
class SVMClassifier(BaseNode):
    def __init__(self, config):
        super().__init__(config)
        gamma = config['gamma']
        # Create a classifier: a support vector classifier
        self.classifier = svm.SVC(gamma=gamma)

    def run(self, input):
        data = input['data']
        target = input['target']

        # Split data into train and test subsets
        X_train, X_test, y_train, y_test = train_test_split(
            data, target, test_size=0.5, shuffle=False)

        # Learn the digits on the first half of the data
        self.classifier.fit(X_train, y_train)

        # Now predict the value of the digit on the second half:
        y_pred = self.classifier.predict(X_test)

        return {
            'acc': np.mean(y_test == y_pred),
            'y_test': y_test,
            'y_pred': y_pred
        }
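To make the split and the accuracy value concrete, here is a plain-Python stand-in for the two steps. It is a sketch, not what scikit-learn or NumPy do internally: split_in_order and accuracy are hypothetical helpers, and the split assumes an even number of samples so that rounding does not matter.

```python
def split_in_order(data, target, test_size=0.5):
    """Split without shuffling: leading rows train, trailing rows test."""
    n_test = int(len(data) * test_size)
    n_train = len(data) - n_test
    return data[:n_train], data[n_train:], target[:n_train], target[n_train:]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy data, made up for illustration only.
data = list(range(10))
target = [x % 2 for x in data]

X_train, X_test, y_train, y_test = split_in_order(data, target)
y_pred = [1, 0, 0, 0, 1]  # pretend predictions with one mistake
acc = accuracy(y_test, y_pred)
```

Because nothing is shuffled, the first half of the rows is always the training set, which keeps the result reproducible between runs.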
  • Now let's instantiate the nodes and create our pipeline
if __name__ == "__main__":
    # Initialize the nodes
    dl_node = DataLoader()
    svm_node = SVMClassifier({'gamma': 0.01})

    # Create the pipeline
    pipeline = Pipeline('mnist', [dl_node, svm_node])

    # Run pipeline and see results
    result = pipeline.run(input={})
    print('Accuracy: %s' % result['acc'])
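The way a pipeline chains nodes can be sketched without the library. The classes and the run_in_sequence function below are hypothetical stand-ins, not FastPipeline's BaseNode or Pipeline, and they leave out the saving and reloading entirely. They only show the data flow: each node's output dict becomes the next node's input.

```python
class Node:
    """Hypothetical stand-in for a pipeline node: run() maps a dict to a dict."""

    def run(self, input):
        raise NotImplementedError


class LoadNumbers(Node):
    def run(self, input):
        return {"values": [1, 2, 3, 4]}


class SumValues(Node):
    def run(self, input):
        return {"total": sum(input["values"])}


def run_in_sequence(nodes, input):
    """Feed each node's output dict to the next node and return the last one."""
    data = input
    for node in nodes:
        data = node.run(data)
    return data


result = run_in_sequence([LoadNumbers(), SumValues()], {})
print(result["total"])
```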
  • Run the pipeline using $ python mnist_pipeline.py. You should see something like:

[Screenshot: log output of the first run]

As expected, the log says that this is the first run, so the outputs of both nodes are computed by calling their run methods. The log also shows where the data is stored.
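Since the intermediate data is stored as pickle and JSON files, it can be inspected with the standard library alone. The sketch below simulates such a directory in a temporary folder. The file names config.json and output.pkl and the stored values are placeholders chosen for illustration; the actual layout is whatever the log of your run reports.

```python
import json
import pickle
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)  # stand-in for the directory printed in the log

    # Simulate what a finished node might leave behind (hypothetical names,
    # placeholder values that are not real results).
    (run_dir / "config.json").write_text(json.dumps({"gamma": 0.01}))
    with (run_dir / "output.pkl").open("wb") as f:
        pickle.dump({"acc": 0.5, "y_pred": [1, 2, 3]}, f)

    # Inspect the stored files.
    config = json.loads((run_dir / "config.json").read_text())
    with (run_dir / "output.pkl").open("rb") as f:
        output = pickle.load(f)

print(config["gamma"], sorted(output))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.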

  • Try running it again with the same command: $ python mnist_pipeline.py. This time you should see something different:

[Screenshot: log output of the second run]

Since all the intermediate outputs have already been computed, the pipeline just reloads the data at each step instead of recomputing it.

  • Let's make a change to the value of config inside __main__:
# svm_node = SVMClassifier({'gamma': 0.01})
svm_node = SVMClassifier({'gamma': 0.05})
  • Run the pipeline again. You'll see something like:

[Screenshot: log output after changing the config]

This time the pipeline reused the result of the first node as-is and recomputed the second node, since we changed its config.

If you change the source code of the SVMClassifier class, the same thing happens: the first node is reloaded and the second node is recomputed.
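One way to picture why a config change or a code change triggers recomputation is a fingerprint built from both. The sketch below is an assumption about the general idea, not FastPipeline's actual mechanism: the fingerprint function and the example source strings are hypothetical. In a real script, the source text of a class could be obtained with inspect.getsource from the standard library.

```python
import hashlib
import json


def fingerprint(config, source_text):
    """Hash a node's config together with the text of its source code."""
    payload = json.dumps(config, sort_keys=True) + "\n" + source_text
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Made-up source snippets standing in for two versions of a node's code.
source_v1 = "def run(self, input): return fit(input, kernel='rbf')"
source_v2 = "def run(self, input): return fit(input, kernel='linear')"

base = fingerprint({"gamma": 0.01}, source_v1)
same = fingerprint({"gamma": 0.01}, source_v1)
new_config = fingerprint({"gamma": 0.05}, source_v1)
new_source = fingerprint({"gamma": 0.01}, source_v2)

print(base == same, base == new_config, base == new_source)
```

If stored results are looked up by such a fingerprint, an unchanged node finds its earlier output, while a node with a new config or new code finds nothing and has to run again.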
