
Fondant - Composable pipelines for foundation model finetuning


Express

Express is a framework that speeds up the creation of KubeFlow pipelines to process big datasets and train Foundation Models on them, such as:

  • Stable Diffusion
  • CLIP
  • Large Language Models (LLMs like GPT-3)

Installation

Express can be installed using pip:

pip install express

Usage

Express is built upon KubeFlow, a cloud-agnostic framework built by Google to orchestrate machine learning workflows on Kubernetes. An important aspect of KubeFlow is its pipelines, each of which consists of a set of components executed one after the other. A pipeline typically transforms data and optionally trains a machine learning model on it. Check out this page if you want to learn more about KubeFlow pipelines and components.
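To make the idea of components running one after the other concrete, here is a minimal sketch in plain Python. It does not use the KubeFlow SDK or Express; the names load_data, clean_data, train_model and run_pipeline are hypothetical and only illustrate how each step consumes the output of the previous one.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for pipeline components: each one takes the
# output of the previous step and returns its own output.
Component = Callable[[Dict], Dict]


def load_data(state: Dict) -> Dict:
    """Load step: produce the initial records."""
    return {**state, "records": [" A cat ", "a DOG", ""]}


def clean_data(state: Dict) -> Dict:
    """Transform step: normalise the records and drop empty ones."""
    cleaned = [r.strip().lower() for r in state["records"] if r.strip()]
    return {**state, "records": cleaned}


def train_model(state: Dict) -> Dict:
    """Optional training step: a vocabulary count serves as a placeholder."""
    vocabulary = {word for r in state["records"] for word in r.split()}
    return {**state, "vocabulary_size": len(vocabulary)}


def run_pipeline(components: List[Component]) -> Dict:
    """Execute the components one after the other, passing the state along."""
    state: Dict = {}
    for component in components:
        state = component(state)
    return state


result = run_pipeline([load_data, clean_data, train_model])
print(result)
```

In a real KubeFlow pipeline each step would run in its own container and exchange data through storage rather than an in-memory dictionary; the sketch only mirrors the ordering.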

Express offers ready-made components and helper functions that serve as boilerplate to speed up the creation of KubeFlow pipelines. To implement your own component, subclass one of the components available in Express and override its methods. In the example below, we subclass PandasTransformComponent and override its transform method.

from typing import Dict, List, Optional

import pandas as pd
import pyarrow as pa
# Assumption: Scanner is the pyarrow.dataset Scanner, as suggested by to_table().
from pyarrow.dataset import Scanner

from express.components.pandas_components import PandasTransformComponent, PandasDataset, PandasDatasetDraft

class MyFirstTransform(PandasTransformComponent):
    @classmethod
    def transform(cls, data: PandasDataset, extra_args: Optional[Dict] = None) -> PandasDatasetDraft:

        # Reading data
        index: List[str] = data.load_index()
        my_data: Scanner = data.load("my_data_source")

        # Transforming data
        table: pa.Table = my_data.to_table()
        df: pd.DataFrame = table.to_pandas()
        # ...
        transformed_table = pa.Table.from_pandas(df)

        # Returning output: keep the index and register the transformed table.
        return data.extend() \
            .with_index(index) \
            .with_data_source("my_transformed_data_source",
                              Scanner.from_batches(transformed_table.to_batches(),
                                                   schema=transformed_table.schema))
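The override pattern used above can be sketched without Express or PyArrow. In the snippet below, BaseTransformComponent and UppercaseCaptions are hypothetical classes written for illustration only; they are not the Express implementation. The base class owns the boilerplate, and the subclass supplies only the transform logic.

```python
from typing import Dict, List, Optional


class BaseTransformComponent:
    """Hypothetical base class; NOT the Express implementation."""

    @classmethod
    def run(cls, rows: List[Dict], extra_args: Optional[Dict] = None) -> List[Dict]:
        # Boilerplate owned by the base class: copy the input so the caller's
        # data is untouched, then delegate to the user-defined transform.
        copied = [dict(row) for row in rows]
        return cls.transform(copied, extra_args)

    @classmethod
    def transform(cls, rows: List[Dict], extra_args: Optional[Dict] = None) -> List[Dict]:
        raise NotImplementedError("Subclasses must override transform().")


class UppercaseCaptions(BaseTransformComponent):
    @classmethod
    def transform(cls, rows: List[Dict], extra_args: Optional[Dict] = None) -> List[Dict]:
        # Only the transformation logic lives in the subclass.
        column = (extra_args or {}).get("column", "caption")
        for row in rows:
            row[column] = row[column].upper()
        return rows


source = [{"id": "0", "caption": "a cat"}, {"id": "1", "caption": "two dogs"}]
output = UppercaseCaptions.run(source)
print(output)
```

Keeping the loading and copying logic in the base class means a new component only has to state what changes in the data, which is the same division of labour the Express example relies on.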

Components zoo

Available components include:

  • Non-distributed Pandas components: express.components.pandas_components.{PandasTransformComponent, PandasLoaderComponent}

Planned components include:

  • Spark-based components and base image.
  • HuggingFace Datasets components.

With KubeFlow, it's possible to share and re-use components across different pipelines. To see an example, check out this sample notebook that showcases how you can save and load a component.

Note that Google's AI Hub also contains components that you can easily re-use.

Pipeline zoo

To do: add ready-made pipelines.

Examples

Example use cases of Express include:

  • collect additional image-text pairs based on a few seed images and fine-tune Stable Diffusion
  • filter an image-text dataset to only include "count" examples and fine-tune CLIP to improve its counting capabilities

Check out the examples folder for some illustrations.
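As a rough illustration of the second use case above, the sketch below filters image-text pairs down to "count" examples. It assumes that a caption qualifies when it contains a digit or a number word; that rule, the NUMBER_WORDS list and the function name filter_count_examples are assumptions made for this example, not part of Express.

```python
import re
from typing import Dict, List

# Assumption: a caption is a "count" example when it mentions a number
# word or a digit. The word list is illustrative, not exhaustive.
NUMBER_WORDS = ["two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
COUNT_PATTERN = re.compile(r"\b(?:\d+|" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def filter_count_examples(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the image-text pairs whose caption matches COUNT_PATTERN."""
    return [pair for pair in pairs if COUNT_PATTERN.search(pair["caption"])]


pairs = [
    {"image": "img_0.jpg", "caption": "Three dogs playing in the park"},
    {"image": "img_1.jpg", "caption": "A cat sleeping on a sofa"},
    {"image": "img_2.jpg", "caption": "4 apples on a table"},
    {"image": "img_3.jpg", "caption": "A tense moment in the game"},
]
count_pairs = filter_count_examples(pairs)
print([pair["image"] for pair in count_pairs])
```

The word boundaries in the pattern keep words such as "tense" from matching "ten". In a pipeline, a filter like this would sit in the transform method of a component, ahead of the fine-tuning step.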

Contributing

We use poetry and pre-commit to enable a smooth developer flow. Run the following commands to set up your development environment:

pip install poetry
poetry install
pre-commit install

