MoE-PEFT: An Efficient LLM Fine-Tuning Factory Optimized for MoE PEFT

MoE-PEFT is an open-source LLMOps framework built on m-LoRA developed by the IDs Lab at Sichuan University. It is designed for high-throughput fine-tuning, evaluation, and inference of Large Language Models (LLMs) using techniques such as LoRA, DoRA, MixLoRA, and others. Key features of MoE-PEFT include:

  • Concurrent fine-tuning of multiple adapters with a shared pre-trained model.

  • Support for multiple PEFT algorithms and various pre-trained models.

  • MoE PEFT optimization, mainly for MixLoRA.

You can try MoE-PEFT with Google Colab before local installation.

Supported Platform

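| OS      | Backend | Model Precision        | Quantization  | Flash Attention |
|---------|---------|------------------------|---------------|-----------------|
| Linux   | CUDA    | FP32, FP16, TF32, BF16 | 8bit and 4bit |                 |
| Windows | CUDA    | FP32, FP16, TF32, BF16 | 8bit and 4bit | -               |
| macOS   | MPS     | FP32, FP16, BF16       |               |                 |
| All     | CPU     | FP32, FP16, BF16       |               |                 |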

You can use the MOE_PEFT_BACKEND_TYPE environment variable to force MoE-PEFT to use a specific backend. For example, if you want MoE-PEFT to run only on CPU, you can set MOE_PEFT_BACKEND_TYPE=CPU before importing moe_peft.
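
The ordering requirement (set the variable first, import second) is easy to get wrong in a script, so here is a minimal Python sketch of it. The helper name `import_moe_peft_with_backend` is hypothetical and not part of MoE-PEFT; the sketch only assumes that the variable is read when `moe_peft` is imported, as described above, and it skips the import when the package is not installed.

```python
import importlib
import importlib.util
import os


def import_moe_peft_with_backend(backend: str = "CPU"):
    """Set MOE_PEFT_BACKEND_TYPE, then import moe_peft if it is installed.

    Hypothetical helper for illustration; not part of the MoE-PEFT API.
    """
    # The variable has to be in the environment before the first import.
    os.environ["MOE_PEFT_BACKEND_TYPE"] = backend
    if importlib.util.find_spec("moe_peft") is None:
        return None  # package is not installed in this environment
    return importlib.import_module("moe_peft")


module = import_moe_peft_with_backend("CPU")
print("backend:", os.environ["MOE_PEFT_BACKEND_TYPE"])
print("moe_peft imported:", module is not None)
```

From a shell, prefixing the command with `MOE_PEFT_BACKEND_TYPE=CPU` should have the same effect for that single invocation.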

Supported Pre-trained Models

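| Model         | Model Size   |
|---------------|--------------|
| LLaMA 1/2     | 7B/13B/70B   |
| LLaMA 3/3.1   | 8B/70B       |
| Yi 1/1.5      | 6B/9B/34B    |
| TinyLLaMA     | 1.1B         |
| Qwen 1.5/2    | 0.5B ~ 72B   |
| Gemma         | 2B/7B        |
| Gemma 2       | 9B/27B       |
| Mistral       | 7B           |
| Phi 1.5/2     | 2.7B         |
| Phi 3         | 3.8B/7B/14B  |
| ChatGLM 1/2/3 | 6B           |
| GLM 4         | 6B           |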

Supported PEFT Methods

| PEFT Methods    | Arguments*                                             |
|-----------------|--------------------------------------------------------|
| QLoRA           | See Quantize Methods                                   |
| LoRA+           | "loraplus_lr_ratio": 20.0                              |
| DoRA            | "use_dora": true                                       |
| rsLoRA          | "use_rslora": true                                     |
| MoLA            | "routing_strategy": "mola", "num_experts": 8           |
| LoRAMoE         | "routing_strategy": "loramoe", "num_experts": 8        |
| MixLoRA         | "routing_strategy": "mixlora", "num_experts": 8        |
| MixLoRA-Dynamic | "routing_strategy": "mixlora-dynamic", "num_experts": 8 |
| MixLoRA-Switch  | "routing_strategy": "mixlora-switch", "num_experts": 8 |

*: Arguments of configuration file
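
To show how these arguments might be merged into an adapter entry of the configuration file, here is a small Python sketch. The argument names and values are copied from the table above; the helper `with_peft_method` and the placeholder fields `"name"` and `"r"` are assumptions for illustration and do not describe the verified configuration schema. Check the generated configuration file for the actual layout.

```python
import json

# Arguments copied from the table above, keyed by method name.
PEFT_ARGUMENTS = {
    "LoRA+": {"loraplus_lr_ratio": 20.0},
    "DoRA": {"use_dora": True},
    "rsLoRA": {"use_rslora": True},
    "MoLA": {"routing_strategy": "mola", "num_experts": 8},
    "LoRAMoE": {"routing_strategy": "loramoe", "num_experts": 8},
    "MixLoRA": {"routing_strategy": "mixlora", "num_experts": 8},
    "MixLoRA-Dynamic": {"routing_strategy": "mixlora-dynamic", "num_experts": 8},
    "MixLoRA-Switch": {"routing_strategy": "mixlora-switch", "num_experts": 8},
}


def with_peft_method(adapter: dict, method: str) -> dict:
    """Return a copy of an adapter config with the method's arguments merged in."""
    if method not in PEFT_ARGUMENTS:
        raise ValueError(f"unknown PEFT method: {method}")
    return {**adapter, **PEFT_ARGUMENTS[method]}


# "name" and "r" are hypothetical placeholder fields, not a verified schema.
adapter = with_peft_method({"name": "demo_adapter", "r": 8}, "MixLoRA")
print(json.dumps(adapter, indent=2))
```

Serializing with `json.dumps` turns Python's `True` into the JSON literal `true` used in the table.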

Notes on PEFT Support

  1. MoE-PEFT provides optimized operators for these PEFT methods, which improve computing performance during training, evaluation, and inference. However, these operators may introduce some accuracy loss (less than 5%). To disable them, define the MOE_PEFT_EVALUATE_MODE environment variable before running MoE-PEFT.
  2. Auxiliary Loss is not currently supported for Mo-LoRA (Mixture of LoRAs) methods other than MixLoRA.
  3. You can check detailed arguments of MixLoRA in TUDB-Labs/MixLoRA.

Supported Attention Methods

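| Attention Methods        | Name         | Arguments*             |
|--------------------------|--------------|------------------------|
| Scaled Dot Product       | "eager"      | --attn_impl eager      |
| Flash Attention 2        | "flash_attn" | --attn_impl flash_attn |
| Sliding Window Attention | -            | --sliding_window       |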

*: Arguments of moe_peft.py

Out of the box, MoE-PEFT supports only scaled dot-product attention (eager). Flash attention requires additional dependencies.

For flash attention, manual installation of the following dependencies is required:

pip3 install ninja
pip3 install flash-attn==2.5.8 --no-build-isolation

If no attention method is specified, flash attention is used when it is available.
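
If you prefer to pass --attn_impl explicitly rather than rely on the default, a launcher script can check whether the flash-attn package is importable. The following is a minimal sketch under that assumption; `pick_attn_impl` is a hypothetical helper, and it does not reproduce how MoE-PEFT itself detects flash attention.

```python
import importlib.util


def pick_attn_impl() -> str:
    """Return "flash_attn" if the flash-attn package is importable, else "eager".

    Hypothetical helper; MoE-PEFT's own detection logic may differ.
    """
    # The flash-attn distribution installs a top-level module named flash_attn.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    return "eager"


# Arguments to append to a moe_peft.py command line.
attn_args = ["--attn_impl", pick_attn_impl()]
print(attn_args)
```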

Supported Quantize Methods

| Quantize Methods      | Arguments*  |
|-----------------------|-------------|
| Full Precision (FP32) | by default  |
| Tensor Float 32       | --tf32      |
| Half Precision (FP16) | --fp16      |
| Brain Float 16        | --bf16      |
| 8bit Quantize         | --load_8bit |
| 4bit Quantize         | --load_4bit |

*: Arguments of moe_peft.py

MoE-PEFT supports several model precision and quantization methods. By default, MoE-PEFT uses full precision (Float32), but users can opt for half precision (Float16) with --fp16 or BrainFloat16 with --bf16. Half precision halves the model size; quantization reduces it further.
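
The size reduction follows directly from the number of bits stored per weight. The sketch below works through the arithmetic for a nominal 7-billion-parameter model (the parameter count is an illustrative assumption). It counts the base model weights only and ignores quantization metadata, activations, gradients, optimizer state, and adapter weights, so real memory use will be higher.

```python
# Bits stored per weight for each format discussed above.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "bf16": 16, "8bit": 8, "4bit": 4}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate storage for the weights alone, in GB (10**9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


# Nominal 7e9 parameters; an illustrative figure, not a measured model size.
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>5}: {weight_memory_gb(7e9, fmt):5.1f} GB")
```

Under these assumptions the estimate is 28.0 GB for FP32, 14.0 GB for FP16 or BF16, 7.0 GB for 8-bit, and 3.5 GB for 4-bit weights.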

Quantization is enabled with --load_4bit for 4-bit quantization or --load_8bit for 8-bit quantization. However, when only quantization is enabled, MoE-PEFT still performs calculations in Float32. To save memory during training, combine quantization with a half-precision mode.

To enable quantization support, please manually install bitsandbytes:

pip3 install bitsandbytes==0.43.1

Note that regardless of these settings, LoRA weights are always calculated and stored in full precision. MoE-PEFT also uses full precision for calculations where accuracy is critical.

On NVIDIA Ampere or newer GPU architectures, the --tf32 option enables TF32 acceleration for full-precision calculations.
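
As a sketch of how the flags in this section combine, the following Python helper assembles a moe_peft.py command line. It uses only flags listed in this README. The helper name `build_finetune_command` and its validation rules (one precision flag and one quantization flag per command) are assumptions made for the example, not documented constraints of moe_peft.py.

```python
import shlex
from typing import List, Optional


def build_finetune_command(
    base_model: str,
    config: str = "moe_peft.json",
    precision: Optional[str] = None,     # "fp16", "bf16", "tf32", or None for FP32
    quantization: Optional[str] = None,  # "4bit", "8bit", or None
) -> List[str]:
    """Assemble an argument list for moe_peft.py (hypothetical helper)."""
    cmd = ["python", "moe_peft.py", "--base_model", base_model, "--config", config]
    if precision is not None:
        if precision not in ("fp16", "bf16", "tf32"):
            raise ValueError(f"unsupported precision: {precision}")
        cmd.append(f"--{precision}")
    if quantization is not None:
        if quantization not in ("4bit", "8bit"):
            raise ValueError(f"unsupported quantization: {quantization}")
        cmd.append(f"--load_{quantization}")
    return cmd


# 4-bit quantization combined with BrainFloat16 to save memory during training.
cmd = build_finetune_command(
    "meta-llama/Llama-2-7b-hf", precision="bf16", quantization="4bit"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains moe_peft.py.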

Offline Configuration

MoE-PEFT relies on HuggingFace Hub to download necessary models, datasets, etc. If you cannot access the Internet or need to deploy MoE-PEFT in an offline environment, please refer to the following guide.

  1. Use git-lfs to manually download models and datasets from HuggingFace Hub.
  2. Set --data_path to the local path of the datasets when executing launch.py gen.
  3. Clone the evaluate code repository locally.
  4. Set the environment variable MOE_PEFT_METRIC_PATH to the local path of the metrics folder in the evaluate code repository.
  5. Set --base_model to the local path of the model when executing launch.py run.

Example of (4): export MOE_PEFT_METRIC_PATH=/path-to-your-git-repo/evaluate/metrics
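
Step (4) can also be done from a Python launcher script, which makes it easy to fail early when the path is wrong. This is a minimal sketch: the helper names are hypothetical, and it assumes the metrics folder sits directly inside the local clone of the evaluate repository, as in the export example above.

```python
import os
from pathlib import Path


def metric_path_for(evaluate_repo: str) -> str:
    """Return the metrics folder inside a local clone of the evaluate repository."""
    metrics = Path(evaluate_repo) / "metrics"
    if not metrics.is_dir():
        raise FileNotFoundError(f"metrics folder not found: {metrics}")
    return str(metrics)


def configure_offline(evaluate_repo: str) -> None:
    """Set MOE_PEFT_METRIC_PATH for this process and any child processes."""
    os.environ["MOE_PEFT_METRIC_PATH"] = metric_path_for(evaluate_repo)
```

Call `configure_offline("/path-to-your-git-repo/evaluate")` before importing or launching MoE-PEFT so that the variable is already set when it is read.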

Known issues

  • Quantization with Qwen2 has no effect (same as with transformers).
  • Applying quantization with DoRA results in higher memory and computation cost (same as with PEFT).
  • Sliding window attention with the generation cache may produce abnormal output.
  • Lack of Long RoPE support.

Installation

Please refer to MoE-PEFT Install Guide.

Quickstart

You can conveniently utilize MoE-PEFT via launch.py. The following example demonstrates a streamlined approach to training a dummy model with MoE-PEFT.

# Generating configuration
python launch.py gen --template lora --tasks ./tests/dummy_data.json

# Running the training task
python launch.py run --base_model TinyLlama/TinyLlama_v1.1

# Try it with the Gradio web UI
python inference.py \
  --base_model TinyLlama/TinyLlama_v1.1 \
  --template alpaca \
  --lora_weights ./casual_0

For further detailed usage information, please refer to the help command:

python launch.py help

MoE-PEFT

The moe_peft.py code is a starting point for finetuning on various datasets.

Basic command for finetuning a baseline model on the Alpaca Cleaned dataset:

# Generating configuration
python launch.py gen \
  --template lora \
  --tasks yahma/alpaca-cleaned

python moe_peft.py \
  --base_model meta-llama/Llama-2-7b-hf \
  --config moe_peft.json \
  --bf16

You can find the template fine-tuning configurations in the templates folder.

For detailed usage information, use the --help option:

python moe_peft.py --help

Use Docker

Firstly, ensure that you have installed Docker Engine and NVIDIA Container Toolkit correctly.

After that, you can launch the container using the following typical command:

docker run --gpus all -it --rm mikecovlee/moe_peft

You can check all available tags from: mikecovlee/moe_peft/tags

Please note that this container only provides a suitable environment for running MoE-PEFT. The MoE-PEFT source code is not included.

Copyright

Copyright © 2023-2024 IDs Lab, Sichuan University

This project is licensed under the Apache 2.0 License.
