FMS Acceleration Plugin for Mixture-of-Experts
This library contains plugins to accelerate finetuning with the following optimizations:
- Expert-Parallel MoE with Triton kernels from ScatterMoE, and some extracted from megablocks.
- Megablocks kernels for `gather` and `scatter`.
Plugins
| Plugin | Description | Depends | Loading | Augmentation | Callbacks |
|---|---|---|---|---|---|
| scattermoe | MoE Expert Parallel with Triton Kernels from scattermoe (& megablocks) | ScatterMoE / extracted kernels from megablocks | ✅ | ✅ | |
Adding New Models
Our ScatterMoE implementation is a module swap; to add new models, update the specifications in `scattermoe_constants.py`.
- See the code documentation within to understand how to add new models.
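A module swap needs to know, for each architecture, which submodules hold the router and the expert weights so the plugin can replace the right block. The sketch below is purely illustrative: the key and field names are assumptions, not the actual schema of `scattermoe_constants.py` — consult that file's documentation for the real format.

```python
# Hypothetical sketch of a module-swap spec. All names below are
# illustrative assumptions, NOT the real scattermoe_constants.py schema.

# Map a model class to the attribute paths of its MoE components, so a
# swap routine knows which submodule to replace with the ScatterMoE one.
EXAMPLE_SPEC = {
    "MixtralForCausalLM": {
        "moe_module": "block_sparse_moe",      # attribute name of the MoE block
        "router": "gate",                      # linear layer producing routing logits
        "experts": "experts",                  # container of expert FFNs
        "expert_weights": ["w1", "w2", "w3"],  # per-expert projection names
    }
}

def lookup_spec(model_cls_name: str) -> dict:
    """Return the swap spec for a model class, or raise if unsupported."""
    try:
        return EXAMPLE_SPEC[model_cls_name]
    except KeyError:
        raise ValueError(f"no ScatterMoE spec registered for {model_cls_name}")
```

Adding a new model would then amount to registering one more entry of this shape for its architecture.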
Using ScatterMoE Saved Checkpoints
ScatterMoE checkpoints are saved using `torch.distributed.checkpoint` (DCP), which by default uses `StateDictType.SHARDED_STATE_DICT`:
- `DTensor`s have limited support for full state dicts.
- Sharded state dicts are extremely efficient and require little comms overhead when saving.
We provide a script to recover the original checkpoint:
- currently the script is only tested in the case where DCP has saved the model in a single node.
If the checkpoint is stored in `hf/checkpoint-10`, call the following to have the converted checkpoint written into `output_dir`:

```
python -m fms_acceleration_moe.utils.checkpoint_utils \
    hf/checkpoint-10 output_dir \
    mistralai/Mixtral-8x7B-Instruct-v0.1
```
Code Extracted from Megablocks
Notes on code extraction:
- We have only extracted two `autograd` functions, `GatherOp` and `ScatterOp`,
- and the associated Triton kernels from `backend/kernels.py`, mostly `_padded_copy`.
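For readers unfamiliar with these ops: gather/scatter reorder tokens so that each expert processes a contiguous block, then restore the original order afterwards. The pure-PyTorch reference below shows only the semantics — the extracted megablocks kernels implement a fused, padded version of this in Triton, and none of the function names here come from that code.

```python
# Pure-PyTorch reference for the MoE gather/scatter token permutation.
# Illustrative only; the extracted Triton kernels fuse and pad this.
import torch

def gather_by_expert(tokens: torch.Tensor, expert_ids: torch.Tensor):
    """Sort tokens so each expert's tokens are contiguous.

    Returns the permuted tokens and the permutation indices needed to
    scatter the expert outputs back into the original token order.
    """
    perm = torch.argsort(expert_ids, stable=True)
    return tokens[perm], perm

def scatter_back(expert_out: torch.Tensor, perm: torch.Tensor):
    """Undo the gather permutation, restoring original token order."""
    out = torch.empty_like(expert_out)
    out[perm] = expert_out
    return out

tokens = torch.randn(6, 4)
expert_ids = torch.tensor([2, 0, 1, 0, 2, 1])
gathered, perm = gather_by_expert(tokens, expert_ids)
# identity "expert computation" here, just to show the round trip
restored = scatter_back(gathered, perm)
assert torch.equal(restored, tokens)
```

After the gather, each expert's rows sit in one contiguous slice, so the per-expert matmuls can run as dense blocks.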
Running Benchmarks
Run the below in the top-level directory of this repo:
- The `scattermoe` dep is not included by default, so the `-x` switch installs it.
- Consider disabling the `torch` memory logging to see improved speeds.
```
tox -e run-benches \
    -x testenv:run-benches.setenv+="MEMORY_LOGGING=nvidia" \
    -- \
    "1 2 4" 128 benchmark_outputs scenarios-moe.yaml accelerated-moe-full
```
or run the larger Mixtral-8x7B bench:
```
tox ... \
    8 128 benchmark_outputs scenarios-moe.yaml accelerated-moe-full-mixtral
```
NOTE: if a `FileNotFoundError` on the Triton cache is observed, similar to issues like these:
then tox is somehow causing problems with Triton and multiprocessing (there is some race condition).
The workaround is to first activate the tox env and run the script manually in bash:
```
# if FileNotFoundError in the triton cache is observed
# - then activate the env and run the script manually
source .tox/run-benches/bin/activate
bash scripts/run_benchmarks.sh \
    ....
```
Triton Kernel Dependencies
The Triton kernels in `scattermoe_utils` were copied from kernel-hyperdrive, which is a fork of cute kernels.
Known Issues
These are some known issues not yet resolved:
- We should eventually remove the dependency on the external `kernel-hyperdrive` repository.
- We currently support loading only sharded `safetensors` (non-GGUF) MoE checkpoints. This is a reasonable assumption since MoE checkpoints are typically above the size limit that prevents them from being saved into a single checkpoint file.
- When used together with FSDP, FSDP's `clip_grad_norm` will not compute properly for ScatterMoE; see issue here.
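On the last point, the usual shape of a workaround is to compute the global gradient norm by hand: sum squared local gradient norms, all-reduce the sum across ranks, then take the square root. The sketch below demonstrates only that arithmetic in a single process against PyTorch's own `clip_grad_norm_`; it is not the fix from the linked issue, and handling actual DTensor shards would need the commented all-reduce step.

```python
# Hedged single-process sketch of computing a global grad norm manually,
# the kind of workaround needed when FSDP's clip_grad_norm_ cannot handle
# expert-parallel (DTensor) parameters. Illustrative, not the actual fix.
import torch

def manual_grad_norm(params):
    # sum of squared gradient entries over all (local) parameters
    sq = sum((p.grad.detach() ** 2).sum() for p in params if p.grad is not None)
    # in a real expert-parallel run, the squared sum would be combined
    # across ranks before the sqrt, e.g.:
    # torch.distributed.all_reduce(sq, op=torch.distributed.ReduceOp.SUM)
    return sq.sqrt()

model = torch.nn.Linear(3, 3)
model(torch.randn(2, 3)).sum().backward()

# clip_grad_norm_ returns the total norm it measured; with a huge
# max_norm no gradient is actually rescaled, so the two should agree
expected = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e9)
assert torch.allclose(manual_grad_norm(model.parameters()), expected)
```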
File details
Details for the file `fms_acceleration_moe-0.4.6-py3-none-any.whl`.
- Download URL: fms_acceleration_moe-0.4.6-py3-none-any.whl
- Upload date:
- Size: 51.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8ff199666431a76244dbfc47e05ee3ed31177dcb60163b2c2da4f0b78bf15354 |
| MD5 | 64537f544be61eac2eec26f649e8caf5 |
| BLAKE2b-256 | cc171d0568956b7cfb80b468bbb8e574815dd70cf3662e4eef1dd9bc2abe074c |
Provenance
The following attestation bundles were made for fms_acceleration_moe-0.4.6-py3-none-any.whl:

Publisher: build-and-publish.yml on foundation-model-stack/fms-acceleration
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fms_acceleration_moe-0.4.6-py3-none-any.whl
- Subject digest: 8ff199666431a76244dbfc47e05ee3ed31177dcb60163b2c2da4f0b78bf15354
- Sigstore transparency entry: 872121119
- Sigstore integration time:
- Permalink: foundation-model-stack/fms-acceleration@b93674fe66135ef78f7f2c3d0a69bc65ee53c63e
- Branch / Tag: refs/tags/v0.6.4
- Owner: https://github.com/foundation-model-stack
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: build-and-publish.yml@b93674fe66135ef78f7f2c3d0a69bc65ee53c63e
- Trigger Event: release