
An Efficient Factory to Build Multiple LoRA Adapters


m-LoRA: An Efficient LLM Fine-tuning Framework

m-LoRA (short for Multi-LoRA) is an open-source LLMOps framework developed by the IDs Lab at Sichuan University. It is designed for high-throughput fine-tuning, evaluation, and inference of Large Language Models (LLMs) using techniques such as LoRA, DoRA, MixLoRA, and others. Key features of m-LoRA include:

  • Concurrent fine-tuning of multiple adapters with a shared pre-trained model.

  • Support for multiple PEFT algorithms and various pre-trained models.

  • Exclusive Mo-LoRA (Mixture of LoRAs) optimization for MixLoRA.

You can try m-LoRA with Google Colab before local installation.

Note from the maintainer of this repository

This is an actively developed fork of the official m-LoRA repository, focusing on PEFT algorithms and related improvements. It is maintained by the authors of m-LoRA. Currently, this fork does not support pipeline parallelism and can only use a single compute device, such as a GPU or NPU, for each m-LoRA process. Please note that the functions, interfaces, and performance of this fork differ from those of the original m-LoRA, and compatibility is not guaranteed. For production use, please prefer the original m-LoRA.

Supported Platform

OS       Backend  Model Precision         Quantization   Flash Attention
Linux    CUDA     FP32, FP16, TF32, BF16  8bit and 4bit
Windows  CUDA     FP32, FP16, TF32, BF16  8bit and 4bit  -
macOS    MPS      FP32, FP16, BF16
All      CPU      FP32, FP16, BF16

You can use the MLORA_BACKEND_TYPE environment variable to force m-LoRA to use a specific backend. For example, if you want m-LoRA to run only on CPU, you can set MLORA_BACKEND_TYPE=CPU before importing mlora.
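As a minimal sketch of the ordering described above, the variable can be set from Python itself before the import. Only the variable name and the value CPU come from this document; the try/except guard is an assumption added so the snippet also runs where m-LoRA is not installed.

```python
import os

# Set the backend override first: the text above asks for this to happen
# before `mlora` is imported.
os.environ["MLORA_BACKEND_TYPE"] = "CPU"

try:
    import mlora  # noqa: F401  (succeeds only if m-LoRA is installed)
except ImportError:
    mlora = None  # m-LoRA is not available in this environment
```

Setting the variable in the shell (for example, exporting it before launching Python) should have the same effect, since the process environment is inherited.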

Supported Pre-trained Models

Model            # Parameters
LLaMA 1/2/3      7B/8B/13B/70B
TinyLLaMA        1.1B
Qwen 1.5/2       1.5B/4B/7B/57B/72B
Gemma            2B/7B
Mistral          7B
Phi 2            2.7B
ChatGLM 1/2/3/4  6B

Supported PEFT Methods

PEFT Methods  Arguments*
QLoRA         See Quantize Methods
LoRA+         loraplus_lr_ratio: 20.0
DoRA          use_dora: true
rsLoRA        use_rslora: true
MixLoRA       See MixLoRA

*: Arguments of configuration file
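The table above can be read as a mapping from each method to the key it adds to the configuration file. The sketch below prints those keys as JSON fragments. The three key names and values come from the table; the helper name `peft_options` and the idea of emitting each fragment on its own are illustrative assumptions, not m-LoRA's actual configuration schema (see the files in the template folder for that).

```python
import json

# Key/value pairs taken from the table above. Where they sit inside a real
# configuration file is NOT shown here and should be checked against the templates.
PEFT_ARGUMENTS = {
    "LoRA+": {"loraplus_lr_ratio": 20.0},
    "DoRA": {"use_dora": True},
    "rsLoRA": {"use_rslora": True},
}


def peft_options(method: str) -> dict:
    """Return the configuration keys listed for one PEFT method (hypothetical helper)."""
    if method not in PEFT_ARGUMENTS:
        raise ValueError(f"no configuration argument listed for {method!r}")
    return dict(PEFT_ARGUMENTS[method])


for name in PEFT_ARGUMENTS:
    # json.dumps renders Python's True as JSON's true, matching the table.
    print(name, "->", json.dumps(peft_options(name)))
```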

Supported Attention Methods

Attention Methods   Name          Arguments*
Scaled Dot Product  "eager"       --attn_impl eager
Flash Attention 2   "flash_attn"  --attn_impl flash_attn

*: Arguments of mlora.py

Out of the box, m-LoRA supports only scaled dot-product attention (eager). Flash attention has additional requirements.

For flash attention, manual installation of the following dependencies is required:

pip3 install ninja
pip3 install flash-attn==2.5.8 --no-build-isolation

If no attention method is specified, flash attention is used if available.
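The fallback rule can be sketched as follows. This is an illustration of the documented behavior, not m-LoRA's own code: the function name `choose_attn_impl` is hypothetical, and "available" is assumed here to mean that the flash_attn package can be found by the import system.

```python
import importlib.util
from typing import Optional

VALID_ATTN_IMPLS = ("eager", "flash_attn")  # names from the table above


def choose_attn_impl(requested: Optional[str] = None) -> str:
    """Pick an attention implementation (hypothetical helper)."""
    if requested is not None:
        # An explicit --attn_impl value always wins.
        if requested not in VALID_ATTN_IMPLS:
            raise ValueError(f"unknown attention implementation: {requested!r}")
        return requested
    # Nothing requested: prefer flash attention when the package is installed.
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash_attn" if has_flash else "eager"


print(choose_attn_impl())         # depends on whether flash-attn is installed
print(choose_attn_impl("eager"))  # explicit choice is returned unchanged
```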

Supported Quantize Methods

Quantize Methods       Arguments*
Full Precision (FP32)  by default
Tensor Float 32        --tf32
Half Precision (FP16)  --fp16
Brain Float 16         --bf16
8bit Quantize          --load_8bit
4bit Quantize          --load_4bit

*: Arguments of mlora.py

m-LoRA supports several model precisions and quantization methods. By default, m-LoRA uses full precision (Float32), but users can opt for half precision (Float16) with --fp16 or BrainFloat16 with --bf16. Half precision halves the model size; quantization reduces it further.

Quantization can be activated using --load_4bit for 4-bit quantization or --load_8bit for 8-bit quantization. However, when only quantization is enabled, m-LoRA utilizes Float32 for calculations. To achieve memory savings during training, users can combine quantization and half-precision modes.
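To give a rough sense of the savings, the sketch below estimates the memory needed for the base model weights alone at each setting. It assumes a nominal bytes-per-parameter figure for each mode and ignores activations, optimizer state, LoRA weights, and the small bookkeeping overhead that quantized formats carry, so real usage will be higher. The function name is illustrative.

```python
# Nominal storage cost per parameter for each mode (an approximation).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "8bit": 1.0, "4bit": 0.5}


def weight_memory_gib(num_params: float, mode: str) -> float:
    """Approximate size of the base model weights in GiB."""
    return num_params * BYTES_PER_PARAM[mode] / 2**30


# Example: a model with 7 billion parameters.
for mode in BYTES_PER_PARAM:
    print(f"{mode}: {weight_memory_gib(7e9, mode):.1f} GiB")
```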

To enable quantization support, please manually install bitsandbytes:

pip3 install bitsandbytes==0.43.1

Note that regardless of these settings, LoRA weights are always calculated and stored at full precision. To maintain accuracy, m-LoRA also uses full precision for calculations where accuracy is critical.

For users with NVIDIA Ampere or newer GPU architectures, the --tf32 option can be utilized to enable full-precision calculation acceleration.

Known issues

  • Quantization with Qwen2 has no effect (same with transformers).
  • Applying quantization with DoRA will result in higher memory and computation cost (same with PEFT).

Installation

Please refer to m-LoRA Install Guide.

Quickstart

You can conveniently utilize m-LoRA via launch.py. The following example demonstrates a streamlined approach to training a dummy model with m-LoRA.

# Generating configuration
python launch.py gen --template lora --tasks ./data/dummy_data.json
# Running the training task
python launch.py run --base_model TinyLlama/TinyLlama_v1.1
# Try with gradio web ui
python inference.py \
  --base_model TinyLlama/TinyLlama_v1.1 \
  --template ./template/alpaca.json \
  --lora_weights ./casual_0

For further detailed usage information, please refer to the help command:

python launch.py help

m-LoRA

The mlora.py script is a starting point for fine-tuning on various datasets. The basic command for fine-tuning a baseline model on the Alpaca Cleaned dataset is:

python mlora.py \
  --base_model meta-llama/Llama-2-7b-hf \
  --config ./config/alpaca.json \
  --bf16

You can check the template fine-tuning configurations in the template folder.

For further detailed usage information, please use the --help option:

python mlora.py --help
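When launching many runs, it can help to assemble the command line programmatically. The sketch below is a hypothetical wrapper: the helper name `build_mlora_command` is not part of m-LoRA, and it uses only flags that appear in this document (--base_model, --config, --tf32, --fp16, --bf16, --load_8bit, --load_4bit, --attn_impl). Consult `python mlora.py --help` for the authoritative list.

```python
import shlex
from typing import List, Optional


def build_mlora_command(
    base_model: str,
    config: str,
    precision: Optional[str] = None,     # "tf32", "fp16" or "bf16"
    quantization: Optional[str] = None,  # "8bit" or "4bit"
    attn_impl: Optional[str] = None,     # "eager" or "flash_attn"
) -> List[str]:
    """Assemble an argument list for mlora.py (hypothetical helper)."""
    cmd = ["python", "mlora.py", "--base_model", base_model, "--config", config]
    if precision is not None:
        if precision not in ("tf32", "fp16", "bf16"):
            raise ValueError(f"unknown precision: {precision!r}")
        cmd.append(f"--{precision}")
    if quantization is not None:
        if quantization not in ("8bit", "4bit"):
            raise ValueError(f"unknown quantization: {quantization!r}")
        cmd.append(f"--load_{quantization}")
    if attn_impl is not None:
        cmd += ["--attn_impl", attn_impl]
    return cmd


cmd = build_mlora_command(
    "meta-llama/Llama-2-7b-hf", "./config/alpaca.json", precision="bf16"
)
print(shlex.join(cmd))  # the same command as shown above, on one line
```

The returned list can be passed directly to subprocess.run from the directory that contains mlora.py.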

Use Docker

Firstly, ensure that you have installed Docker Engine and NVIDIA Container Toolkit correctly.

After that, you can launch the container using the following typical command:

docker run --gpus all -it --rm mikecovlee/mlora

You can check all available tags from: mikecovlee/mlora/tags

Please note that this container only provides a suitable environment for running m-LoRA. The m-LoRA source code is not included.

Copyright

Copyright © 2023-2024 IDs Lab, Sichuan University

This project is licensed under the Apache 2.0 License.


