An Efficient LLM Fine-Tuning Factory Optimized for MoE PEFT
MoE-PEFT: An Efficient LLM Fine-Tuning Factory for Mixture-of-Experts (MoE) Parameter-Efficient Fine-Tuning.
MoE-PEFT is an open-source LLMOps framework built on m-LoRA. It is designed for high-throughput fine-tuning, evaluation, and inference of Large Language Models (LLMs) using MoE-based PEFT techniques combined with others (such as LoRA and DoRA). Key features of MoE-PEFT include:

- Concurrent fine-tuning, evaluation, and inference of multiple adapters with a shared pre-trained model.
- MoE PEFT optimization, mainly for MixLoRA and other MoE implementations.
- Support for multiple PEFT algorithms and various pre-trained models.
- Seamless integration with the HuggingFace ecosystem.
You can try MoE-PEFT with Google Colab before local installation.
Supported Platforms

OS | Backend | Model Precision | Quantization | Flash Attention
---|---|---|---|---
Linux | CUDA | FP32, FP16, TF32, BF16 | 8-bit and 4-bit | ✓
Windows | CUDA | FP32, FP16, TF32, BF16 | 8-bit and 4-bit | -
macOS | MPS | FP32, FP16, BF16 | ✗ | ✗
All | CPU | FP32, FP16, BF16 | ✗ | ✗
You can use the `MOE_PEFT_BACKEND_TYPE` environment variable to force MoE-PEFT to use a specific backend. For example, if you want MoE-PEFT to run only on the CPU, set `MOE_PEFT_BACKEND_TYPE=CPU` before importing `moe_peft`.
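A minimal sketch of this in Python (assuming the backend variable is read when `moe_peft` is first imported):

```python
import os

# Select the CPU backend. This must happen before `import moe_peft`,
# because the backend is chosen when the package is first imported.
os.environ["MOE_PEFT_BACKEND_TYPE"] = "CPU"

# import moe_peft  # would now run on the CPU backend
```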
Supported Pre-trained Models
Supported | Model | Model Size
---|---|---
✓ | LLaMA 1/2 | 7B/13B/70B
✓ | LLaMA 3/3.1 | 8B/70B
✓ | Yi 1/1.5 | 6B/9B/34B
✓ | TinyLLaMA | 1.1B
✓ | Qwen 1.5/2 | 0.5B ~ 72B
✓ | Gemma | 2B/7B
✓ | Gemma 2 | 9B/27B
✓ | Mistral | 7B
✓ | Phi 1.5/2 | 2.7B
✓ | Phi 3/3.5 | 3.8B/7B/14B
✓ | ChatGLM 1/2/3 | 6B
✓ | GLM 4 | 6B
Supported PEFT Methods
Supported | PEFT Methods | Arguments*
---|---|---
✓ | MoLA | `"routing_strategy": "mola", "num_experts": 8`
✓ | LoRAMoE | `"routing_strategy": "loramoe", "num_experts": 8`
✓ | MixLoRA | `"routing_strategy": "mixlora", "num_experts": 8`
✓ | MixLoRA-Switch | `"routing_strategy": "mixlora-switch", "num_experts": 8`
✓ | MixLoRA-Dynamic | `"routing_strategy": "mixlora-dynamic", "num_experts": 8`
✓ | QLoRA | See Quantize Methods
✓ | LoRA+ | `"loraplus_lr_ratio": 20.0`
✓ | DoRA | `"use_dora": true`
✓ | rsLoRA | `"use_rslora": true`

*: Arguments in the configuration file.
Notes on PEFT support

- MoE-PEFT provides optimized operators for these PEFT methods, which can effectively improve computing performance during training, evaluation, and inference. However, these operators may cause a certain degree of accuracy loss (less than 5%). You can disable the optimized operators by defining the `MOE_PEFT_EVALUATE_MODE` environment variable in advance.
- Auxiliary loss is not currently supported for MoE PEFT methods other than MixLoRA.
- You can check the detailed arguments of MixLoRA in TUDB-Labs/MixLoRA.
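For illustration, a MixLoRA-style adapter entry might look like the sketch below. Only `routing_strategy` and `num_experts` come from the table above; the remaining keys are hypothetical placeholders, not confirmed MoE-PEFT configuration fields:

```python
import json

# Hypothetical adapter entry: "routing_strategy" and "num_experts" are
# documented above; the other keys are illustrative placeholders only.
adapter = {
    "routing_strategy": "mixlora",
    "num_experts": 8,
    "use_dora": False,    # DoRA toggle, per the PEFT methods table
    "use_rslora": False,  # rsLoRA toggle, likewise
}

print(json.dumps(adapter, indent=2))
```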
Supported Attention Methods
Supported | Attention Methods | Name | Arguments*
---|---|---|---
✓ | Scaled Dot Product | "eager" | `--attn_impl eager`
✓ | Flash Attention 2 | "flash_attn" | `--attn_impl flash_attn`
✓ | Sliding Window Attention | - | `--sliding_window`

*: Arguments of `moe_peft.py`
MoE-PEFT only supports scaled dot-product attention ("eager") by default. Flash attention requires additional dependencies, which must be installed manually:

```bash
pip3 install ninja
pip3 install flash-attn==2.5.8 --no-build-isolation
```

If no attention method is specified, flash attention is used when available.
Supported Quantize Methods
Supported | Quantize Methods | Arguments*
---|---|---
✓ | Full Precision (FP32) | by default
✓ | Tensor Float 32 | `--tf32`
✓ | Half Precision (FP16) | `--fp16`
✓ | Brain Float 16 | `--bf16`
✓ | 8-bit Quantization | `--load_8bit`
✓ | 4-bit Quantization | `--load_4bit`

*: Arguments of `moe_peft.py`
MoE-PEFT offers support for various model precisions and quantization methods. By default, MoE-PEFT uses full precision (Float32), but users can opt for half precision (Float16) with `--fp16` or BrainFloat16 with `--bf16`. Enabling half precision halves the model size, and for further reduction, quantization methods can be employed.

Quantization can be activated with `--load_4bit` for 4-bit quantization or `--load_8bit` for 8-bit quantization. However, when only quantization is enabled, MoE-PEFT still uses Float32 for calculations. To achieve memory savings during training, combine quantization with a half-precision mode.

To enable quantization support, please manually install `bitsandbytes`:

```bash
pip3 install bitsandbytes==0.43.1
```

Note that regardless of these settings, LoRA weights are always calculated and stored at full precision; to maintain calculation accuracy, the MoE-PEFT framework mandates full precision wherever accuracy is imperative.

For users with NVIDIA Ampere or newer GPU architectures, the `--tf32` option can be used to accelerate full-precision calculations.
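As a rough back-of-the-envelope illustration (weights only; real usage adds activations, optimizer state, and quantization metadata), the precision options scale a 7B-parameter model's weight memory roughly as follows:

```python
# Approximate weight memory of a 7B-parameter model per precision mode.
# Idealized numbers: activations, optimizer state, and per-block
# quantization overhead are ignored.
params = 7_000_000_000

bytes_per_param = {
    "fp32 (default)": 4.0,
    "fp16 / bf16": 2.0,
    "8-bit (--load_8bit)": 1.0,
    "4-bit (--load_4bit)": 0.5,
}

for mode, nbytes in bytes_per_param.items():
    print(f"{mode:>22}: ~{params * nbytes / 1024**3:.1f} GiB")
```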
Offline Configuration
MoE-PEFT relies on the HuggingFace Hub to download necessary models, datasets, etc. If you cannot access the Internet or need to deploy MoE-PEFT in an offline environment, please refer to the following guide:

1. Use `git-lfs` to manually download models and datasets from the HuggingFace Hub.
2. Set `--data_path` to the local path of the datasets when executing `launch.py gen`.
3. Clone the evaluate code repository locally.
4. Set the environment variable `MOE_PEFT_METRIC_PATH` to the local path of the `metrics` folder of the evaluate code repository.
5. Set `--base_model` to the local path of the models when executing `launch.py run`.

Example of step 4:

```bash
export MOE_PEFT_METRIC_PATH=/path-to-your-git-repo/evaluate/metrics
```
Known issues
- Quantization with Qwen2 has no effect (same as with `transformers`).
- Applying quantization together with DoRA results in higher memory and computation cost (same as with PEFT).
- Sliding window attention combined with the generate cache may produce abnormal output.
Installation
Please refer to MoE-PEFT Install Guide.
Quickstart
You can conveniently utilize MoE-PEFT via `launch.py`. The following example demonstrates a streamlined approach to training a dummy model with MoE-PEFT.

```bash
# Generating configuration
python launch.py gen --template lora --tasks ./tests/dummy_data.json
# Running the training task
python launch.py run --base_model TinyLlama/TinyLlama_v1.1
# Try with gradio web ui
python inference.py \
    --base_model TinyLlama/TinyLlama_v1.1 \
    --template alpaca \
    --lora_weights ./casual_0
```

For further detailed usage information, please refer to the `help` command:

```bash
python launch.py help
```
MoE-PEFT
The `moe_peft.py` script is a starting point for fine-tuning on various datasets. A basic command for fine-tuning a baseline model on the Alpaca Cleaned dataset:

```bash
# Generating configuration
python launch.py gen \
    --template lora \
    --tasks yahma/alpaca-cleaned
python moe_peft.py \
    --base_model meta-llama/Llama-2-7b-hf \
    --config moe_peft.json \
    --bf16
```

You can check the template fine-tuning configurations in the templates folder.

For further detailed usage information, please use the `--help` option:

```bash
python moe_peft.py --help
```
Use Docker
First, ensure that you have installed Docker Engine and the NVIDIA Container Toolkit correctly. After that, you can launch the container using the following typical command:

```bash
docker run --gpus all -it --rm mikecovlee/moe_peft
```

You can check all available tags at mikecovlee/moe_peft/tags.

Please note that this container only provides a proper environment to run MoE-PEFT; the MoE-PEFT source code is not included.
Copyright
This project is licensed under the Apache 2.0 License.