Local-first dataset toolkit for multimodal federated learning artifacts (partition/feature/simulation)

fedops-dataset

fedops-dataset is a local-first dataset toolkit for multimodal federated learning (FedMS2-v8 style).

It helps you:

  • fetch raw multimodal datasets
  • validate dataset roots and expected files
  • generate FL artifacts (partition, feature, simulation)
  • load per-client records in Python for Simulation and Deployment workflows

Python requirement: >=3.8

Who This Is For

  • FL researchers working with multimodal datasets
  • engineers running FedMS2-style experiments repeatedly
  • teams that want reproducible alpha / ps / pm artifact generation

What This Package Covers

  1. Raw data bootstrap:
  • fedops-dataset fetch-raw
  2. Raw path validation:
  • fedops-dataset check-raw-datasets
  3. Artifact generation:
  • fedops-dataset create-v8
  4. Runtime loading API:
  • FedOpsLocalDataset

Supported Datasets

  • crema_d
  • hateful_memes
  • ptb-xl

Default clients:

  • crema_d: 40
  • hateful_memes: 40
  • ptb-xl: 20
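The defaults above can be mirrored as a small lookup table. The constant and helper below are illustrative only (the package may store its defaults differently); they simply encode the numbers listed above.

```python
# Default client counts per dataset, as listed above.
# Illustrative constant -- not the package's internal representation.
DEFAULT_CLIENTS = {
    "crema_d": 40,
    "hateful_memes": 40,
    "ptb-xl": 20,
}

def default_num_clients(dataset: str) -> int:
    # Fail early on unsupported names rather than silently defaulting.
    if dataset not in DEFAULT_CLIENTS:
        raise ValueError(f"unsupported dataset: {dataset}")
    return DEFAULT_CLIENTS[dataset]
```

Use `--num-clients <N>` on `create-v8` to override these defaults.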

Install

pip install fedops-dataset

5-Minute Quickstart

1) Define paths

export REPO_ROOT=/path/to/fed-multimodal
export DATA_ROOT=$REPO_ROOT/fed_multimodal/data
export OUTPUT_DIR=$REPO_ROOT/fed_multimodal/output

2) Fetch raw data

# all supported datasets
fedops-dataset fetch-raw --dataset all --data-root "$DATA_ROOT"

Notes:

  • by default, hateful_memes is fetched via a direct public git download from:
    • https://huggingface.co/datasets/neuralcatcher/hateful_memes

3) Validate raw roots

fedops-dataset check-raw-datasets --data-root "$DATA_ROOT"

4) Generate artifacts (example: hateful_memes)

# dry run first
fedops-dataset create-v8 \
  --dataset hateful_memes \
  --alpha 50 \
  --sample-missing-rate 0.2 \
  --modality-missing-rate 0.2 \
  --repo-root "$REPO_ROOT" \
  --data-root "$DATA_ROOT" \
  --dry-run

# real run
fedops-dataset create-v8 \
  --dataset hateful_memes \
  --alpha 50 \
  --sample-missing-rate 0.2 \
  --modality-missing-rate 0.2 \
  --repo-root "$REPO_ROOT" \
  --data-root "$DATA_ROOT"

5) Load client records in Python

from fedops_dataset import FedOpsLocalDataset

ds = FedOpsLocalDataset(
    dataset="hateful_memes",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    repo_root="/path/to/fed-multimodal",
    data_root="/path/to/fed-multimodal/fed_multimodal/data",
)

print(ds.is_prepared())
client0 = ds.client_records(0, use_simulation=True)
print(len(client0))

Parameter Semantics

  • alpha: partition heterogeneity control
  • sample_missing_rate (ps): sample-level missingness
  • modality_missing_rate (pm): modality-level missingness

Token naming examples used in artifact filenames:

  • alpha=0.1 -> alpha01
  • alpha=5.0 -> alpha50
  • alpha=50 -> alpha50

So 5.0 and 50 intentionally resolve to the same alpha token.
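The token rule can be reproduced with a short helper. This is a sketch that matches the three documented examples, not the package's actual formatting function; its behavior for other values (e.g. alpha=0.25) is an assumption.

```python
def alpha_token(alpha) -> str:
    # Integers are used verbatim; floats are rendered with one decimal
    # place and the dot is dropped (0.1 -> "01", 5.0 -> "50").
    # Sketch only -- the package's real rule may differ for other inputs.
    if isinstance(alpha, int):
        return f"alpha{alpha}"
    return "alpha" + f"{alpha:.1f}".replace(".", "")
```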

CLI Guide

fetch-raw

Use this to prepare raw datasets under your data root.

fedops-dataset fetch-raw --dataset all --data-root "$DATA_ROOT"

Hateful Memes (single recommended command)

fedops-dataset fetch-raw \
  --dataset hateful_memes \
  --data-root "$DATA_ROOT"

This uses the package default public source (neuralcatcher/hateful_memes) via git download.

check-raw-datasets

fedops-dataset check-raw-datasets --data-root "$DATA_ROOT"

Use this before create-v8 to catch path/file issues early.

create-v8

Generates:

  • partition JSON
  • feature directories
  • simulation JSON

fedops-dataset create-v8 \
  --dataset crema_d \
  --alpha 50 \
  --sample-missing-rate 0.2 \
  --modality-missing-rate 0.2 \
  --repo-root "$REPO_ROOT" \
  --data-root "$DATA_ROOT"

Optional controls:

  • --no-partition
  • --no-features
  • --no-simulation
  • --num-clients <N>
  • --force
  • --dry-run

Python API Guide

FedOpsLocalDataset

Direct usage

from fedops_dataset import FedOpsLocalDataset

ds = FedOpsLocalDataset(
    dataset="crema_d",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    repo_root="/path/to/fed-multimodal",
    data_root="/path/to/fed-multimodal/fed_multimodal/data",
)

if not ds.is_prepared():
    ds.prepare(dry_run=False)

partition = ds.load_partition()
simulation = ds.load_simulation()
records = ds.client_records(0, use_simulation=True)

Runtime config usage (Flower style)

from fedops_dataset import FedOpsLocalDataset

run_config = {
    "repo-root": "/path/to/fed-multimodal",
    "data-root": "/path/to/fed-multimodal/fed_multimodal/data",
}

# Simulation mode
node_config = {"partition-id": 0, "num-partitions": 40}

ds = FedOpsLocalDataset.from_runtime_config(
    dataset="hateful_memes",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    run_config=run_config,
    node_config=node_config,
)

mode = ds.node_mode(node_config)  # -> "simulation"
records = ds.client_records_from_node_config(node_config, use_simulation=True)

Simulation vs Deployment

Simulation mode:

  • detect with node_config containing partition-id and num-partitions
  • use partition-id to resolve client records

Deployment mode:

  • if node_config has data-path, it is used as data root
  • each node can point to different local storage

Environment Variables (Optional)

export FEDOPS_REPO_ROOT=/path/to/fed-multimodal
export FEDOPS_OUTPUT_DIR=/path/to/fed-multimodal/fed_multimodal/output
export FEDOPS_DATA_ROOT=/path/to/fed-multimodal/fed_multimodal/data
export HATEFUL_MEMES_ROOT=/path/to/fed-multimodal/fed_multimodal/data/hateful_memes

You can use env vars, CLI args, or runtime config keys. No hardcoded path is required.
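A plausible resolution order is: explicit argument first, then environment variable. The helper below is a sketch under that assumption; the package's actual precedence may differ.

```python
import os

def resolve_data_root(explicit=None):
    # Sketch: an explicit CLI/API argument wins; otherwise fall back
    # to the FEDOPS_DATA_ROOT environment variable.
    if explicit is not None:
        return explicit
    data_root = os.environ.get("FEDOPS_DATA_ROOT")
    if data_root is None:
        raise RuntimeError("set --data-root or FEDOPS_DATA_ROOT")
    return data_root
```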

Troubleshooting

  1. Partition file not found:
  • run create-v8 first
  • verify alpha/ps/pm values match existing artifact names
  2. hateful_memes fetch fails in git mode:
  • ensure git and git-lfs are installed
  • rerun the same command after confirming network access
  3. Raw dataset validation errors:
  • run check-raw-datasets and follow the printed hints
  4. Alpha confusion (5.0 vs 50):
  • both map to the token alpha50
  • this is intentional, for compatibility with existing FedMS2 artifacts

FAQ

  1. Do I always need to pass --hateful-memes-root?
  • No. By default it resolves to <data-root>/hateful_memes.
  2. Can I use this package without Hugging Face uploads?
  • Yes. The local-first workflow is the primary mode.
  3. Is FedOpsDatasetClient still available?
  • Yes. Use it if you also host artifacts in an HF dataset repo.

Maintainer Release

cd fedops_dataset
python -m build
python -m twine check dist/*
python -m twine upload dist/*
