Local-first dataset toolkit for multimodal federated learning artifacts (partition/feature/simulation)

fedops-dataset

fedops-dataset is a local-first dataset toolkit for multimodal federated learning (FedMS2-v8 style).

It helps you:

  • fetch raw multimodal datasets
  • validate dataset roots and expected files
  • generate FL artifacts (partition, feature, simulation)
  • load per-client records in Python for Simulation and Deployment workflows

Python requirement: >=3.8

Who This Is For

  • FL researchers working with multimodal datasets
  • engineers running FedMS2-style experiments repeatedly
  • teams that want reproducible alpha / ps / pm artifact generation

What This Package Covers

  1. Raw data bootstrap:
  • fedops-dataset fetch-raw
  2. Raw path validation:
  • fedops-dataset check-raw-datasets
  3. Artifact generation:
  • fedops-dataset create-v8
  4. Runtime loading API:
  • FedOpsLocalDataset

Supported Datasets

  • crema_d
  • hateful_memes
  • ptb-xl

Default clients:

  • crema_d: 40
  • hateful_memes: 40
  • ptb-xl: 20

Install

pip install fedops-dataset

5-Minute Quickstart

1) Define paths

export REPO_ROOT=/path/to/fed-multimodal
export DATA_ROOT=$REPO_ROOT/fed_multimodal/data
export OUTPUT_DIR=$REPO_ROOT/fed_multimodal/output

2) Fetch raw data

# all supported datasets
fedops-dataset fetch-raw --dataset all --data-root "$DATA_ROOT"

Notes:

  • the default fetch method for hateful_memes is a direct public git clone from:
    • https://huggingface.co/datasets/neuralcatcher/hateful_memes

3) Validate raw roots

fedops-dataset check-raw-datasets --data-root "$DATA_ROOT"

4) Generate artifacts (example: hateful_memes)

# dry run first
fedops-dataset create-v8 \
  --dataset hateful_memes \
  --alpha 50 \
  --sample-missing-rate 0.2 \
  --modality-missing-rate 0.2 \
  --repo-root "$REPO_ROOT" \
  --data-root "$DATA_ROOT" \
  --dry-run

# real run
fedops-dataset create-v8 \
  --dataset hateful_memes \
  --alpha 50 \
  --sample-missing-rate 0.2 \
  --modality-missing-rate 0.2 \
  --repo-root "$REPO_ROOT" \
  --data-root "$DATA_ROOT"

5) Load client records in Python

from fedops_dataset import FedOpsLocalDataset

ds = FedOpsLocalDataset(
    dataset="hateful_memes",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    repo_root="/path/to/fed-multimodal",
    data_root="/path/to/fed-multimodal/fed_multimodal/data",
)

print(ds.is_prepared())
client0 = ds.client_records(0, use_simulation=True)
print(len(client0))

Parameter Semantics

  • alpha: partition heterogeneity control
  • sample_missing_rate (ps): sample-level missingness
  • modality_missing_rate (pm): modality-level missingness

Token naming examples used in artifact filenames:

  • alpha=0.1 -> alpha01
  • alpha=5.0 -> alpha50
  • alpha=50 -> alpha50

So 5.0 and 50 intentionally resolve to the same alpha token.
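One way to reproduce this mapping is to render the value as given and drop the decimal point. This rule is inferred from the examples above, not documented behavior, and the helper name is hypothetical:

```python
def alpha_token(alpha):
    """Derive the alpha token used in artifact filenames.

    Assumed rule (inferred from the examples above): stringify the
    value and strip the decimal point, so 0.1 -> "01", 5.0 -> "50",
    and the integer 50 -> "50".
    """
    return "alpha" + str(alpha).replace(".", "")

print(alpha_token(0.1))  # alpha01
print(alpha_token(5.0))  # alpha50
print(alpha_token(50))   # alpha50
```

This explains why 5.0 and 50 collide on alpha50 while 0.1 stays distinct as alpha01.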

CLI Guide

fetch-raw

Use this to prepare raw datasets under your data root.

fedops-dataset fetch-raw --dataset all --data-root "$DATA_ROOT"

Hateful Memes fetch modes

  1. Default public git mode:
fedops-dataset fetch-raw \
  --dataset hateful_memes \
  --data-root "$DATA_ROOT" \
  --hateful-memes-fetch-method git \
  --hateful-memes-repo-id neuralcatcher/hateful_memes
  2. HF snapshot mode (API-based):
export HF_TOKEN=<optional_token>
fedops-dataset fetch-raw \
  --dataset hateful_memes \
  --data-root "$DATA_ROOT" \
  --hateful-memes-fetch-method hf-snapshot
  3. Archive URL mode:
fedops-dataset fetch-raw \
  --dataset hateful_memes \
  --data-root "$DATA_ROOT" \
  --hateful-memes-fetch-method archive \
  --hateful-memes-archive-url https://<host>/hateful_memes.zip
  4. Manual prepared folder mode:
fedops-dataset fetch-raw \
  --dataset hateful_memes \
  --data-root "$DATA_ROOT" \
  --hateful-memes-source-dir /path/to/hateful_memes_source \
  --hateful-memes-mode symlink

check-raw-datasets

fedops-dataset check-raw-datasets --data-root "$DATA_ROOT"

Use this before create-v8 to catch path/file issues early.

create-v8

Generates:

  • partition JSON
  • feature directories
  • simulation JSON

fedops-dataset create-v8 \
  --dataset crema_d \
  --alpha 50 \
  --sample-missing-rate 0.2 \
  --modality-missing-rate 0.2 \
  --repo-root "$REPO_ROOT" \
  --data-root "$DATA_ROOT"

Optional controls:

  • --no-partition
  • --no-features
  • --no-simulation
  • --num-clients <N>
  • --force
  • --dry-run

Python API Guide

FedOpsLocalDataset

Direct usage

from fedops_dataset import FedOpsLocalDataset

ds = FedOpsLocalDataset(
    dataset="crema_d",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    repo_root="/path/to/fed-multimodal",
    data_root="/path/to/fed-multimodal/fed_multimodal/data",
)

if not ds.is_prepared():
    ds.prepare(dry_run=False)

partition = ds.load_partition()
simulation = ds.load_simulation()
records = ds.client_records(0, use_simulation=True)

Runtime config usage (Flower style)

from fedops_dataset import FedOpsLocalDataset

run_config = {
    "repo-root": "/path/to/fed-multimodal",
    "data-root": "/path/to/fed-multimodal/fed_multimodal/data",
}

# Simulation mode
node_config = {"partition-id": 0, "num-partitions": 40}

ds = FedOpsLocalDataset.from_runtime_config(
    dataset="hateful_memes",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    run_config=run_config,
    node_config=node_config,
)

mode = ds.node_mode(node_config)  # simulation
records = ds.client_records_from_node_config(node_config, use_simulation=True)

Simulation vs Deployment

Simulation mode:

  • detected when node_config contains partition-id and num-partitions
  • the partition-id is used to resolve that client's records

Deployment mode:

  • if node_config includes data-path, that path is used as the data root
  • each node can point to different local storage
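The detection rule above can be sketched as a small helper. This is a hypothetical re-implementation for illustration; the package's actual node_mode method may differ:

```python
def node_mode(node_config):
    """Classify a Flower-style node_config as simulation or deployment.

    Sketch of the rule described above: simulation nodes carry both
    partition-id and num-partitions; anything else is treated as a
    deployment node, which may override the data root via data-path.
    """
    if "partition-id" in node_config and "num-partitions" in node_config:
        return "simulation"
    return "deployment"

print(node_mode({"partition-id": 0, "num-partitions": 40}))  # simulation
print(node_mode({"data-path": "/mnt/node_a/data"}))          # deployment
```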

Environment Variables (Optional)

export FEDOPS_REPO_ROOT=/path/to/fed-multimodal
export FEDOPS_OUTPUT_DIR=/path/to/fed-multimodal/fed_multimodal/output
export FEDOPS_DATA_ROOT=/path/to/fed-multimodal/fed_multimodal/data
export HATEFUL_MEMES_ROOT=/path/to/fed-multimodal/fed_multimodal/data/hateful_memes

You can use env vars, CLI args, or runtime config keys. No hardcoded path is required.
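One plausible resolution order is: explicit argument first, then a runtime config key, then the environment variable. The package's actual precedence is not specified here, so treat this helper as a sketch under that assumption:

```python
import os

def resolve_data_root(explicit=None, run_config=None):
    """Resolve the data root from the available sources.

    Assumed order for this sketch: an explicit argument wins, then a
    Flower-style run_config "data-root" key, then the FEDOPS_DATA_ROOT
    environment variable. The package's actual precedence may differ.
    """
    if explicit:
        return explicit
    if run_config and run_config.get("data-root"):
        return run_config["data-root"]
    return os.environ.get("FEDOPS_DATA_ROOT")

os.environ["FEDOPS_DATA_ROOT"] = "/env/data"
print(resolve_data_root())                                      # /env/data
print(resolve_data_root(run_config={"data-root": "/rc/data"}))  # /rc/data
print(resolve_data_root(explicit="/cli/data"))                  # /cli/data
```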

Troubleshooting

  1. Partition file not found:
  • run create-v8 first
  • verify alpha/ps/pm values match existing artifact names
  2. hateful_memes fetch fails in git mode:
  • ensure git and git-lfs are installed
  • use hf-snapshot mode as a fallback
  3. Raw dataset validation errors:
  • run check-raw-datasets and follow the printed hints
  4. Alpha confusion (5.0 vs 50):
  • both map to the token alpha50
  • this is intentional for compatibility with existing FedMS2 artifacts

FAQ

  1. Do I always need to pass --hateful-memes-root?
  • No. By default it resolves to <data-root>/hateful_memes.
  2. Can I use this package without Hugging Face uploads?
  • Yes. The local-first workflow is the primary mode.
  3. Is FedOpsDatasetClient still available?
  • Yes. Use it if you also host artifacts in an HF dataset repo.

Maintainer Release

cd fedops_dataset
python -m build
python -m twine check dist/*
python -m twine upload dist/*
