fedops-dataset
Local-first dataset toolkit for multimodal federated learning artifacts (partition/feature/simulation).
fedops-dataset is a local-first dataset toolkit for multimodal federated learning (FedMS2-v8 style).
It helps you:
- fetch raw multimodal datasets
- validate dataset roots and expected files
- generate FL artifacts (partition, feature, simulation)
- load per-client records in Python for Simulation and Deployment workflows
Python requirement: >=3.8
Who This Is For
- FL researchers working with multimodal datasets
- engineers running FedMS2-style experiments repeatedly
- teams that want reproducible alpha / ps / pm artifact generation
What This Package Covers
- Raw data bootstrap:
fedops-dataset fetch-raw
- Raw path validation:
fedops-dataset check-raw-datasets
- Artifact generation:
fedops-dataset create-v8
- Runtime loading API:
FedOpsLocalDataset
Supported Datasets
- crema_d
- hateful_memes
- ptb-xl
Default clients:
- crema_d: 40
- hateful_memes: 40
- ptb-xl: 20
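For illustration, these defaults can be captured in a small lookup table. This is a sketch; the package's internal representation may differ.

```python
# Illustrative table of the default client counts listed above.
# The real package may expose these differently.
DEFAULT_NUM_CLIENTS = {
    "crema_d": 40,
    "hateful_memes": 40,
    "ptb-xl": 20,
}

def default_num_clients(dataset: str) -> int:
    """Return the default number of FL clients for a supported dataset."""
    if dataset not in DEFAULT_NUM_CLIENTS:
        raise ValueError(f"unsupported dataset: {dataset!r}")
    return DEFAULT_NUM_CLIENTS[dataset]
```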
Install
pip install fedops-dataset
5-Minute Quickstart
1) Define paths
export REPO_ROOT=/path/to/fed-multimodal
export DATA_ROOT=$REPO_ROOT/fed_multimodal/data
export OUTPUT_DIR=$REPO_ROOT/fed_multimodal/output
2) Fetch raw data
# all supported datasets
fedops-dataset fetch-raw --dataset all --data-root "$DATA_ROOT"
Notes:
- hateful_memes: the default fetch method is a direct public git clone from https://huggingface.co/datasets/neuralcatcher/hateful_memes
3) Validate raw roots
fedops-dataset check-raw-datasets --data-root "$DATA_ROOT"
4) Generate artifacts (example: hateful_memes)
# dry run first
fedops-dataset create-v8 \
--dataset hateful_memes \
--alpha 50 \
--sample-missing-rate 0.2 \
--modality-missing-rate 0.2 \
--repo-root "$REPO_ROOT" \
--data-root "$DATA_ROOT" \
--dry-run
# real run
fedops-dataset create-v8 \
--dataset hateful_memes \
--alpha 50 \
--sample-missing-rate 0.2 \
--modality-missing-rate 0.2 \
--repo-root "$REPO_ROOT" \
--data-root "$DATA_ROOT"
5) Load client records in Python
from fedops_dataset import FedOpsLocalDataset
ds = FedOpsLocalDataset(
    dataset="hateful_memes",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    repo_root="/path/to/fed-multimodal",
    data_root="/path/to/fed-multimodal/fed_multimodal/data",
)
print(ds.is_prepared())
client0 = ds.client_records(0, use_simulation=True)
print(len(client0))
Parameter Semantics
- alpha: partition heterogeneity control
- sample_missing_rate (ps): sample-level missingness
- modality_missing_rate (pm): modality-level missingness
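As a rough sketch of these semantics (not the package's actual simulation code), sample-level and modality-level missingness can be applied like this, where `records` is assumed to be a list of dicts with one entry per modality:

```python
import random

def simulate_missingness(records, ps, pm, modalities=("audio", "text"), seed=0):
    """Drop whole samples with probability ps, then drop individual
    modalities of surviving samples with probability pm.
    Illustrative only; the real artifact generation may differ."""
    rng = random.Random(seed)
    out = []
    for rec in records:
        if rng.random() < ps:  # sample-level missingness (ps)
            continue
        kept = dict(rec)
        for m in modalities:
            if m in kept and rng.random() < pm:  # modality-level missingness (pm)
                kept[m] = None  # modality marked missing
        out.append(kept)
    return out
```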
Token naming examples used in artifact filenames:
- alpha=0.1 -> alpha01
- alpha=5.0 -> alpha50
- alpha=50 -> alpha50
So 5.0 and 50 intentionally resolve to the same alpha token.
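A plausible reconstruction of this naming rule (an assumption; the real normalization lives inside the package) is to render the value and strip the decimal point:

```python
def alpha_token(alpha):
    """Render alpha and strip the decimal point, so that 0.1 -> 'alpha01'
    while 5.0 and 50 both -> 'alpha50'. Hypothetical helper, shown only
    to make the filename convention concrete."""
    return "alpha" + str(alpha).replace(".", "")
```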
CLI Guide
fetch-raw
Use this to prepare raw datasets under your data root.
fedops-dataset fetch-raw --dataset all --data-root "$DATA_ROOT"
Hateful Memes (single recommended command)
fedops-dataset fetch-raw \
--dataset hateful_memes \
--data-root "$DATA_ROOT"
This uses the package default public source (neuralcatcher/hateful_memes) via git download.
check-raw-datasets
fedops-dataset check-raw-datasets --data-root "$DATA_ROOT"
Use this before create-v8 to catch path/file issues early.
create-v8
Generates:
- partition JSON
- feature directories
- simulation JSON
fedops-dataset create-v8 \
--dataset crema_d \
--alpha 50 \
--sample-missing-rate 0.2 \
--modality-missing-rate 0.2 \
--repo-root "$REPO_ROOT" \
--data-root "$DATA_ROOT"
Optional controls:
- --no-partition
- --no-features
- --no-simulation
- --num-clients <N>
- --force
- --dry-run
Python API Guide
FedOpsLocalDataset
Direct usage
from fedops_dataset import FedOpsLocalDataset
ds = FedOpsLocalDataset(
    dataset="crema_d",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    repo_root="/path/to/fed-multimodal",
    data_root="/path/to/fed-multimodal/fed_multimodal/data",
)
if not ds.is_prepared():
    ds.prepare(dry_run=False)
partition = ds.load_partition()
simulation = ds.load_simulation()
records = ds.client_records(0, use_simulation=True)
Runtime config usage (Flower style)
from fedops_dataset import FedOpsLocalDataset
run_config = {
    "repo-root": "/path/to/fed-multimodal",
    "data-root": "/path/to/fed-multimodal/fed_multimodal/data",
}
# Simulation mode
node_config = {"partition-id": 0, "num-partitions": 40}
ds = FedOpsLocalDataset.from_runtime_config(
    dataset="hateful_memes",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    run_config=run_config,
    node_config=node_config,
)
mode = ds.node_mode(node_config) # simulation
records = ds.client_records_from_node_config(node_config, use_simulation=True)
Simulation vs Deployment
Simulation mode:
- detected when node_config contains partition-id and num-partitions
- partition-id is used to resolve client records
Deployment mode:
- if node_config has data-path, it is used as the data root
- each node can point to different local storage
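The detection rule above can be sketched in plain Python; the actual FedOpsLocalDataset.node_mode implementation may differ.

```python
def node_mode(node_config):
    """A node_config carrying partition-id and num-partitions indicates
    simulation; anything else is treated as deployment."""
    if "partition-id" in node_config and "num-partitions" in node_config:
        return "simulation"
    return "deployment"

def resolve_data_root(node_config, default_root):
    """In deployment mode, a per-node data-path overrides the default root."""
    return node_config.get("data-path", default_root)
```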
Environment Variables (Optional)
export FEDOPS_REPO_ROOT=/path/to/fed-multimodal
export FEDOPS_OUTPUT_DIR=/path/to/fed-multimodal/fed_multimodal/output
export FEDOPS_DATA_ROOT=/path/to/fed-multimodal/fed_multimodal/data
export HATEFUL_MEMES_ROOT=/path/to/fed-multimodal/fed_multimodal/data/hateful_memes
You can use env vars, CLI args, or runtime config keys. No hardcoded path is required.
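One reasonable precedence for combining these sources (assumed here, not verified against the package) is: explicit argument first, then environment variable, then default.

```python
import os

def resolve_root(explicit, env_var, default=None):
    """Pick a path: an explicit CLI/constructor value wins, then the
    environment variable, then the default. Precedence order is an
    assumption for illustration."""
    if explicit is not None:
        return explicit
    return os.environ.get(env_var, default)
```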
Troubleshooting
partition file not found:
- run create-v8 first
- verify alpha / ps / pm values match existing artifact names
hateful_memes fetch fails in git mode:
- ensure git and git-lfs are installed
- rerun the same command after confirming network access
Raw dataset validation errors:
- run check-raw-datasets and follow the printed hints
Alpha confusion (5.0 vs 50):
- both map to token alpha50
- this is intentional for compatibility with existing FedMS2 artifacts
FAQ
- Do I need to always pass --hateful-memes-root?
  - No. By default it resolves to <data-root>/hateful_memes.
- Can I use this package without Hugging Face uploads?
  - Yes. The local-first workflow is the primary mode.
- Is FedOpsDatasetClient still available?
  - Yes. Use it if you also host artifacts in an HF dataset repo.
Maintainer Release
cd fedops_dataset
python -m build
python -m twine check dist/*
python -m twine upload dist/*