Fondant - Composable pipelines for foundation model finetuning

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Express

Express is a framework that speeds up the creation of KubeFlow pipelines that process big datasets and train Foundation Models on them. Examples of such models include:

Stable Diffusion
CLIP
Large Language Models (LLMs such as GPT-3)

Installation

Express can be installed using pip:

pip install express

Usage

Express is built upon KubeFlow, a cloud-agnostic framework built by Google to orchestrate machine learning workflows on Kubernetes. An important aspect of KubeFlow is its pipelines, which consist of a set of components that are executed one after the other. A pipeline typically transforms data and optionally trains a machine learning model on it. See the KubeFlow documentation if you want to learn more about pipelines and components.
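To make the idea of components running one after the other concrete, here is a minimal sketch in plain Python. It uses neither KubeFlow nor Express: the names Component, run_pipeline, strip_whitespace and drop_empty are hypothetical and only illustrate how each step consumes the output of the previous one.

```python
from typing import Callable, List

# Hypothetical stand-in: a "component" is any function that takes a
# dataset (here a list of captions) and returns a new dataset.
Component = Callable[[List[str]], List[str]]


def run_pipeline(components: List[Component], data: List[str]) -> List[str]:
    """Run the components in order, feeding each one the previous output."""
    for component in components:
        data = component(data)
    return data


def strip_whitespace(captions: List[str]) -> List[str]:
    """Remove leading and trailing whitespace from every caption."""
    return [c.strip() for c in captions]


def drop_empty(captions: List[str]) -> List[str]:
    """Discard captions that are empty strings."""
    return [c for c in captions if c]


result = run_pipeline([strip_whitespace, drop_empty], ["  a cat ", "   ", "two dogs"])
print(result)  # ['a cat', 'two dogs']
```

In a real KubeFlow pipeline each component would run in its own container and exchange data through storage rather than in memory, but the ordering principle is the same.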

Express offers ready-made components and helper functions that serve as boilerplate to speed up the creation of KubeFlow pipelines. To implement your own component, subclass one of the components available in Express and override the relevant method. In the example below, we subclass PandasTransformComponent and override its transform method.

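from typing import Dict, List, Optional

import pandas as pd
import pyarrow as pa
from pyarrow.dataset import Scanner

from express.components.pandas_components import PandasTransformComponent, PandasDataset, PandasDatasetDraft

class MyFirstTransform(PandasTransformComponent):
    @classmethod
    def transform(cls, data: PandasDataset, extra_args: Optional[Dict] = None) -> PandasDatasetDraft:

        # Reading data
        index: List[str] = data.load_index()
        my_data: Scanner = data.load("my_data_source")

        # Transforming data
        table: pa.Table = my_data.to_table()
        df: pd.DataFrame = table.to_pandas()
        # ...
        transformed_table = pa.Table.from_pandas(df)

        # Wrap the transformed table in a Scanner. An explicit schema is
        # passed because the batches are given as a plain list.
        transformed_data = Scanner.from_batches(
            transformed_table.to_batches(), schema=transformed_table.schema
        )

        # Returning output: keep the index and add the new data source.
        return (
            data.extend()
            .with_index(index)
            .with_data_source("my_transformed_data_source", transformed_data)
        )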

Components zoo

Available components include:

Non-distributed Pandas components: express.components.pandas_components.{PandasTransformComponent, PandasLoaderComponent}

Planned components include:

Spark-based components and base image.
HuggingFace Datasets components.

With KubeFlow, it's possible to share and re-use components across different pipelines. To see an example, check out this sample notebook, which showcases how you can save and load a component.

Note that Google's AI Hub also contains components that you can easily re-use.

Pipeline zoo

To do: add ready-made pipelines.

Examples

Example use cases of Express include:

collect additional image-text pairs based on a few seed images and fine-tune Stable Diffusion
filter an image-text dataset to only include "count" examples and fine-tune CLIP to improve its counting capabilities

Check out the examples folder for some illustrations.
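As a rough illustration of the second use case, the sketch below filters image-text pairs down to "count" examples. It is not taken from the examples folder: the function keep_count_examples, the COUNT_PATTERN heuristic and the sample data are all hypothetical, and the rule used here (a caption mentions a digit or a number word from one to ten) is only an assumption about what a "count" example could be.

```python
import re
from typing import Dict, List

# Assumption: an example "counts" something if its caption contains a
# digit or a number word. A real filter would likely be more careful.
COUNT_PATTERN = re.compile(
    r"\b(\d+|one|two|three|four|five|six|seven|eight|nine|ten)\b",
    re.IGNORECASE,
)


def keep_count_examples(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only image-text pairs whose caption mentions a count."""
    return [p for p in pairs if COUNT_PATTERN.search(p["caption"])]


pairs = [
    {"image": "a.jpg", "caption": "Three dogs on a beach"},
    {"image": "b.jpg", "caption": "A sunset over the sea"},
    {"image": "c.jpg", "caption": "2 cups of coffee"},
]
print([p["image"] for p in keep_count_examples(pairs)])  # ['a.jpg', 'c.jpg']
```

In an Express pipeline, logic like this would live inside the transform method of a component, with the filtered rows written back as a new data source.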

Contributing

We use poetry and pre-commit to enable a smooth developer flow. Run the following commands to set up your development environment:

pip install poetry
poetry install
pre-commit install

Project details

Release history

Version	Released	Notes
1.0.0	Jan 31, 2024	yanked: Released as 0.10.0 instead
0.12.1	Apr 22, 2024	
0.12.0	Apr 17, 2024	
0.12.dev0	Apr 8, 2024	pre-release
0.11.2	Apr 4, 2024	
0.11.1	Mar 15, 2024	
0.11.0	Mar 7, 2024	
0.11.dev5	Mar 5, 2024	pre-release
0.11.dev4	Feb 24, 2024	pre-release
0.11.dev3	Feb 21, 2024	pre-release
0.11.dev2	Feb 21, 2024	pre-release
0.11.dev1	Feb 20, 2024	pre-release
0.10.1	Feb 5, 2024	
0.10.0	Jan 31, 2024	
0.10.dev0	Jan 22, 2024	pre-release
0.9.0	Jan 16, 2024	
0.9.dev2	Jan 15, 2024	pre-release
0.9.dev1	Jan 12, 2024	pre-release
0.9.dev0	Jan 11, 2024	pre-release
0.8.0	Dec 13, 2023	
0.8.dev6	Dec 12, 2023	pre-release
0.8.dev5	Dec 12, 2023	pre-release
0.8.dev4	Dec 7, 2023	pre-release
0.8.dev3	Dec 4, 2023	pre-release
0.8.dev2	Nov 30, 2023	pre-release
0.8.dev1	Nov 27, 2023	pre-release
0.8.dev0	Nov 27, 2023	pre-release
0.7.0	Nov 20, 2023	
0.6.2	Oct 20, 2023	
0.6.1	Oct 19, 2023	
0.6.0	Oct 19, 2023	yanked: Packaged older commit, use repackaged 0.6.1 instead.
0.5.0	Sep 25, 2023	
0.4.0	Sep 22, 2023	
0.3.2	Aug 24, 2023	
0.3.1	Aug 21, 2023	
0.3.0	Aug 8, 2023	
0.2.1	Jul 6, 2023	
0.2.0	Jun 28, 2023	
0.2.dev0	Apr 14, 2023	pre-release, yanked: Old development version
0.1.3	Jun 16, 2023	
0.1.2	Jun 1, 2023	
0.1.1	May 31, 2023	
0.1.0	May 23, 2023	
0.1.0.dev6	May 23, 2023	pre-release
0.1.dev0	Apr 14, 2023	pre-release (this version)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fondant-0.1.dev0.tar.gz (20.6 kB)

Uploaded Apr 14, 2023 Source

Built Distributions

fondant-0.1.dev0-py3-none-any.whl (26.0 kB)

Uploaded Apr 14, 2023 Python 3

fondant-0.1.0.dev0-py3-none-any.whl (28.4 kB)

Uploaded May 11, 2023 Python 3

Hashes for fondant-0.1.dev0.tar.gz
Algorithm	Hash digest
SHA256	`0d0d03c426f23ce2afa323ffcafbed1a13f6b58f71fdebf3fbc713ed7e46fd2c`
MD5	`9850bc079e2190fa6dfe180b17562317`
BLAKE2b-256	`dc7898f503aea9827f036ded2dbe6c6794330b97777be390f86c3940d1701452`

Hashes for fondant-0.1.dev0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f99f66904e9dda60a141f9528203388fa12e0d478d047ea00dc2305d864fcad8`
MD5	`cb714dc10893e3a16edcd6f949de3b43`
BLAKE2b-256	`4d96500d913fa12ec9918487db9f3116284a6b0122650b86b794d002229b9a54`

Hashes for fondant-0.1.0.dev0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0ad168feb96d3af010f122da33c00e65b8ba69a2f891066cfa44f8fc2fe71151`
MD5	`65a90d4a855e630f04c5cce21bc9a984`
BLAKE2b-256	`08841874a5879703b5a37578779074f0df533a995d8593615f87d99264404867`
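
The digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using only Python's standard hashlib module. The helper names file_digests and matches_published are hypothetical, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_digests(path: Path, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Compute the three digests listed per file, reading in chunks."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with path.open("rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(path: Path, expected_sha256: str) -> bool:
    """Return True if the file's SHA256 equals the published digest."""
    return file_digests(path)["SHA256"] == expected_sha256.strip().lower()
```

For example, assuming the source distribution was saved in the current directory, matches_published(Path("fondant-0.1.dev0.tar.gz"), "0d0d03c426f23ce2afa323ffcafbed1a13f6b58f71fdebf3fbc713ed7e46fd2c") should return True for an intact download.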