Flower Datasets

Flower Datasets (flwr-datasets) is a library to quickly and easily create datasets for federated learning, federated evaluation, and federated analytics. It was created by the Flower Labs team that also created Flower: A Friendly Federated Learning Framework.

[!TIP] For complete documentation, including API docs, how-to guides, and tutorials, please visit the Flower Datasets Documentation; for full FL examples, see the Flower Examples page.

Installation

For a complete installation guide, visit the Flower Datasets Documentation.

pip install "flwr-datasets[vision]"

Overview

The Flower Datasets library supports:

  • downloading datasets - choose the dataset from Hugging Face's datasets,
  • partitioning datasets - customize the partitioning scheme,
  • creating centralized datasets - leave parts of the dataset unpartitioned (e.g. for centralized evaluation).

Because Hugging Face's datasets library is used under the hood, Flower Datasets integrates with the following popular formats/frameworks:

  • Hugging Face,
  • PyTorch,
  • TensorFlow,
  • NumPy,
  • Pandas,
  • JAX,
  • Arrow.

Create custom partitioning schemes or choose from the implemented partitioning schemes:

  • Partitioner (the abstract base class)
  • IID partitioning IidPartitioner(num_partitions)
  • Dirichlet partitioning DirichletPartitioner(num_partitions, partition_by, alpha)
  • Distribution partitioning DistributionPartitioner(distribution_array, num_partitions, num_unique_labels_per_partition, partition_by, preassigned_num_samples_per_label, rescale)
  • InnerDirichlet partitioning InnerDirichletPartitioner(partition_sizes, partition_by, alpha)
  • Pathological partitioning PathologicalPartitioner(num_partitions, partition_by, num_classes_per_partition, class_assignment_mode)
  • Natural ID partitioning NaturalIdPartitioner(partition_by)
  • Size based partitioning (the abstract base class for partitioners that divide data based on the number of samples) SizePartitioner
  • Linear partitioning LinearPartitioner(num_partitions)
  • Square partitioning SquarePartitioner(num_partitions)
  • Exponential partitioning ExponentialPartitioner(num_partitions)
  • more to come in future releases (contributions are welcome).
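To make the Dirichlet scheme concrete, here is a minimal NumPy sketch of the underlying idea (not the library's implementation): for each class, the fraction of that class's samples going to each partition is drawn from a Dirichlet distribution, so smaller `alpha` yields more heterogeneous partitions:

```python
import numpy as np

def dirichlet_partition(labels, num_partitions, alpha, seed=0):
    """Assign sample indices to partitions, drawing per-class
    partition proportions from a Dirichlet distribution."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    partitions = [[] for _ in range(num_partitions)]
    for cls in np.unique(labels):
        # Shuffle the indices of this class.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Draw the proportion of this class going to each partition.
        props = rng.dirichlet(alpha * np.ones(num_partitions))
        # Convert cumulative proportions into split points.
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for pid, part_idx in enumerate(np.split(idx, cuts)):
            partitions[pid].extend(part_idx.tolist())
    return partitions
```

As `alpha` grows large, the draws approach uniform proportions and the result becomes close to IID partitioning.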

Figure: Comparison of partitioning schemes on CIFAR-10.

This plot was generated using a library function (see the flwr_datasets.visualization package for more).

Usage

Flower Datasets exposes the FederatedDataset abstraction to represent the dataset needed for federated learning/evaluation/analytics. It has two methods that let you load the data: load_partition(partition_id, split) and load_split(split).

Here's a basic quickstart example of how to partition the MNIST dataset:

from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import IidPartitioner

# The train split of the MNIST dataset will be partitioned into 100 partitions
partitioner = IidPartitioner(num_partitions=100)
fds = FederatedDataset(dataset="ylecun/mnist", partitioners={"train": partitioner})

# Load the first of the 100 partitions
partition = fds.load_partition(0)

# Keep the test split unpartitioned, e.g. for centralized evaluation
centralized_data = fds.load_split("test")

For more details, please refer to the specific how-to guides or tutorials. They showcase customization and more advanced features.

Future releases

Here are a few of the things that we will work on in future releases:

  • ✅ Support for more datasets (especially the ones that have user id present).
  • ✅ Creation of custom Partitioners.
  • ✅ More out-of-the-box Partitioners.
  • ✅ Passing Partitioners via FederatedDataset's partitioners argument.
  • ✅ Customization of the dataset splitting before the partitioning.
  • ✅ Simplification of the dataset transformation to the popular frameworks/types.
  • Creation of synthetic data.
  • Support for Vertical FL.



Download files

Download the file for your platform.

Source Distribution

flwr_datasets-0.3.0.tar.gz (45.5 kB)

Uploaded Source

Built Distribution

flwr_datasets-0.3.0-py3-none-any.whl (73.0 kB)

Uploaded Python 3

File details

Details for the file flwr_datasets-0.3.0.tar.gz.

File metadata

  • Download URL: flwr_datasets-0.3.0.tar.gz
  • Size: 45.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.10.13 Darwin/23.5.0

File hashes

Hashes for flwr_datasets-0.3.0.tar.gz:

  • SHA256: 00e4f40e484614c7b5d057d8461433b3118762215b3cfe315c066b6c3d9530d5
  • MD5: 925de03344d378ecc759727ab79e768e
  • BLAKE2b-256: 5a15f8b52fd39d69e740023e00b2f124205f51889ac4fe0ffc8adaf9b8b4c071


File details

Details for the file flwr_datasets-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: flwr_datasets-0.3.0-py3-none-any.whl
  • Size: 73.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.10.13 Darwin/23.5.0

File hashes

Hashes for flwr_datasets-0.3.0-py3-none-any.whl:

  • SHA256: 6ea5c0bd8a1c732307c13732e4aeb57c7cd4ff14e7fd323a6db504aab926387e
  • MD5: eeb2eb18b364214d79a0b91bf490dcf4
  • BLAKE2b-256: d7d922aa23d35c1c62a8f001ce9372577a0dfc3ec162e80307f5572ebfe21fba

