TILEARN for LLM
Tilearn.llm Usage Guide
1. CUDA Kernel (using LLaMA as an example)
Supported GPUs: Ampere, Ada, or Hopper (e.g., A100, A800, H100, H800)
New version
Dependencies: pytorch >= 2.0.0
This version is fully compatible with the Hugging Face interface; no extra model-conversion step is needed.
LLaMA1/LLaMA2 on 16 A800 GPUs with seq=1024: roughly 20% faster training than DeepSpeed ZeRO-2.
How to use the CUDA kernel: launch-script changes
### TIACC CUDA Kernel
### Open: TIACC_TRAINING_CUDA_KERNEL=1
### Close: TIACC_TRAINING_CUDA_KERNEL=0
export TIACC_TRAINING_CUDA_KERNEL=1
How to use the CUDA kernel: code changes
### TIACC
import os

TIACC_TRAINING_CUDA_KERNEL = int(os.getenv('TIACC_TRAINING_CUDA_KERNEL', '0'))
if TIACC_TRAINING_CUDA_KERNEL == 1:
    from tilearn.llm.transformers import LlamaForCausalLM

### The model interface is identical to standard Hugging Face
model = LlamaForCausalLM.from_pretrained(...)
Alternatively, with AutoModelForCausalLM:
### TIACC
TIACC_TRAINING_CUDA_KERNEL = int(os.getenv('TIACC_TRAINING_CUDA_KERNEL', '0'))
if TIACC_TRAINING_CUDA_KERNEL == 1:
    from tilearn.llm.transformers import AutoModelForCausalLM

### The model interface is identical to standard Hugging Face
model = AutoModelForCausalLM.from_pretrained(...)
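The toggle pattern above can be sketched as a small helper. `resolve_model_class` is a hypothetical name, and the fallback to the stock `transformers` module is an assumption (the original snippets only show the enabled branch):

```python
import os

def resolve_model_class(env=None):
    """Return the module to import LlamaForCausalLM from, based on the toggle.

    tilearn's class is a drop-in replacement for the Hugging Face one, so
    nothing else in the training script changes.
    """
    env = os.environ if env is None else env
    flag = int(env.get('TIACC_TRAINING_CUDA_KERNEL', '0'))
    return 'tilearn.llm.transformers' if flag == 1 else 'transformers'

print(resolve_model_class({'TIACC_TRAINING_CUDA_KERNEL': '1'}))  # tilearn.llm.transformers
print(resolve_model_class({}))                                   # transformers
```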
Old version
Dependencies: flash-attention. Install it from https://github.com/Dao-AILab/flash-attention, preferably from source:
### compile from source
git clone --recursive https://github.com/Dao-AILab/flash-attention
cd flash-attention && python setup.py install
### install the layer_norm, fused_dense and rotary kernels
cd csrc/layer_norm && pip install .
cd ../fused_dense_lib && pip install .
cd ../rotary && pip install .
This version is not compatible with the Hugging Face interface. It can directly load both Hugging Face models and original CUDA-kernel models (the model structure saved during training).
Because models saved during training use the original CUDA-kernel structure rather than the Hugging Face one, run the conversion script manually if you need a Hugging Face model.
LLaMA1/LLaMA2 on 16 A800 GPUs with seq=1024: roughly 30% faster training than DeepSpeed ZeRO-2.
How to use the CUDA kernel: launch-script changes
### TIACC CUDA Kernel
### Open: TIACC_TRAINING_CUDA_KERNEL_V0=1
### Close: TIACC_TRAINING_CUDA_KERNEL_V0=0
export TIACC_TRAINING_CUDA_KERNEL_V0=1
# To load a model in Hugging Face format, set llama-hf
export TIACC_TRAINING_MODEL_FORMAT=llama-hf
# To load an original CUDA-kernel model, set llama-origin instead
# export TIACC_TRAINING_MODEL_FORMAT=llama-origin
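Since the code below silently falls back to `llama-origin`, a small guard that validates `TIACC_TRAINING_MODEL_FORMAT` can catch typos early. `get_model_format` is a hypothetical helper for illustration, not part of tilearn:

```python
import os

# The two formats named above; anything else is a misconfiguration
VALID_FORMATS = {'llama-hf', 'llama-origin'}

def get_model_format(env=None):
    env = os.environ if env is None else env
    # same default as the code snippet below
    fmt = env.get('TIACC_TRAINING_MODEL_FORMAT', 'llama-origin')
    if fmt not in VALID_FORMATS:
        raise ValueError(f'unsupported TIACC_TRAINING_MODEL_FORMAT: {fmt!r}')
    return fmt
```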
How to use the CUDA kernel: code changes
### TIACC
import os

TIACC_TRAINING_CUDA_KERNEL_V0 = int(os.getenv('TIACC_TRAINING_CUDA_KERNEL_V0', '0'))
if TIACC_TRAINING_CUDA_KERNEL_V0 == 1:
    from tilearn import llm

### LLaMA model initialization
TIACC_TRAINING_MODEL_FORMAT = os.getenv('TIACC_TRAINING_MODEL_FORMAT', 'llama-origin')
model = llm.models.llama(model_args.model_name_or_path, model_format=TIACC_TRAINING_MODEL_FORMAT)
2. Static Zero
Use case: switching between DeepSpeed ZeRO-1, ZeRO-2, ZeRO-3, offload, int8, and other optimization modes.
Launch-script changes
### TIACC STATIC ZERO
### Open: TIACC_TRAINING_STATIC_ZERO='O2'
### Supported levels: 'O2' / 'O2.5' / 'O3' / 'O3.5' / 'O3_Q8' (in development)
### Close: TIACC_TRAINING_STATIC_ZERO='None'
export TIACC_TRAINING_STATIC_ZERO='None' #'O2'
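A sketch of validating the level string before handing it to the trainer. `static_zero_enabled` and the strict check are assumptions; the level list comes from the launch-script comment above:

```python
import os

# Levels listed in the launch script; 'O3_Q8' is still in development
SUPPORTED_LEVELS = {'O2', 'O2.5', 'O3', 'O3.5', 'O3_Q8'}

def static_zero_enabled(env=None):
    """Return the requested level, or None when the feature is switched off."""
    env = os.environ if env is None else env
    level = env.get('TIACC_TRAINING_STATIC_ZERO', 'None')
    if level == 'None':
        return None
    if level not in SUPPORTED_LEVELS:
        raise ValueError(f'unknown TIACC_TRAINING_STATIC_ZERO level: {level!r}')
    return level
```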
Code changes
import os
from transformers import HfArgumentParser

TIACC_TRAINING_STATIC_ZERO = os.getenv('TIACC_TRAINING_STATIC_ZERO', 'None')
if TIACC_TRAINING_STATIC_ZERO != 'None':
    from tilearn.llm.transformers import TrainingArguments
else:
    from transformers import TrainingArguments

### The interface is identical to standard Hugging Face
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
3. Dynamic Zero
Use case: ZeRO-3 + offload; substantially reduces GPU memory usage, allowing a larger batch size.
Launch-script changes
### TIACC DYNAMIC ZERO
### Open: TIACC_TRAINING_DYNAMIC_ZERO=1 and set TIACC_ZERO_STAGE/TIACC_PLACEMENT/TIACC_SHARD_INIT/TIACC_CPU_INIT
### Close: TIACC_TRAINING_DYNAMIC_ZERO=0
export TIACC_TRAINING_DYNAMIC_ZERO=0
export TIACC_ZERO_STAGE=3     # takes effect when TIACC_TRAINING_DYNAMIC_ZERO=1
export TIACC_PLACEMENT='cpu'  # or 'cuda'; takes effect when TIACC_TRAINING_DYNAMIC_ZERO=1
export TIACC_SHARD_INIT=0     # takes effect when TIACC_TRAINING_DYNAMIC_ZERO=1
export TIACC_CPU_INIT=1       # takes effect when TIACC_TRAINING_DYNAMIC_ZERO=1
if [ "${TIACC_TRAINING_DYNAMIC_ZERO}" = "0" ]; then
#USE_DS="--deepspeed=./ds_config_zero3.json"
USE_DS="--deepspeed=${deepspeed_config_file}"
else
USE_DS=""
fi
torchrun --nnodes 1 --nproc_per_node 8 run_clm.py \
${USE_DS} \
...
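The five environment variables above can be gathered into one plain config object on the Python side. `TiaccZeroConfig` and `load_config` are hypothetical illustrations; tilearn's own `get_config` (used below) presumably does something similar, but its fields are not documented here:

```python
import os
from dataclasses import dataclass

@dataclass
class TiaccZeroConfig:
    dynamic_zero: bool  # TIACC_TRAINING_DYNAMIC_ZERO
    zero_stage: int     # TIACC_ZERO_STAGE
    placement: str      # TIACC_PLACEMENT: 'cpu' or 'cuda'
    shard_init: bool    # TIACC_SHARD_INIT
    cpu_init: bool      # TIACC_CPU_INIT

def load_config(env=None):
    """Parse the TIACC env vars, mirroring the launch-script defaults."""
    env = os.environ if env is None else env
    return TiaccZeroConfig(
        dynamic_zero=env.get('TIACC_TRAINING_DYNAMIC_ZERO', '0') == '1',
        zero_stage=int(env.get('TIACC_ZERO_STAGE', '3')),
        placement=env.get('TIACC_PLACEMENT', 'cpu'),
        shard_init=env.get('TIACC_SHARD_INIT', '0') == '1',
        cpu_init=env.get('TIACC_CPU_INIT', '1') == '1',
    )
```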
Code changes
import os
import torch
from contextlib import nullcontext

TIACC_TRAINING_DYNAMIC_ZERO = int(os.getenv('TIACC_TRAINING_DYNAMIC_ZERO', '0'))
if TIACC_TRAINING_DYNAMIC_ZERO == 1:
    from tilearn.llm.trainer import TrainerTiacc as Trainer
    from tilearn.llm import init as llm_init
    from tilearn.llm import get_config as llm_get_config

### init in the main function
def main():
    if TIACC_TRAINING_DYNAMIC_ZERO == 1:
        llm_config = llm_get_config()
        llm_init_context = llm_init(init_in_cpu=llm_config.cpu_init,
                                    shard_init=llm_config.shard_init,
                                    model_dtype=torch.half)

    ### wrap model construction in the init context
    init_context = llm_init_context if TIACC_TRAINING_DYNAMIC_ZERO == 1 else nullcontext
    with init_context():
        ### The interface is identical to standard Hugging Face
        model = LlamaForCausalLM.from_pretrained(
            model_args.model_name_or_path,
            config=config,
            low_cpu_mem_usage=False,  # not True
            ...
        )

    ### use the trainer
    ### The interface is identical to standard Hugging Face
    trainer = Trainer(
        model=model,
        ...
    )
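The `nullcontext` fallback used above is a standard-library pattern worth isolating: when the toggle is off, the `with` block becomes a no-op. Below is a self-contained demo with a fake init context standing in for tilearn's `llm_init` (which is not reproduced here):

```python
from contextlib import contextmanager, nullcontext

events = []

@contextmanager
def sharded_init():
    # stand-in for the context returned by llm_init(...); in the real code this
    # would, e.g., place parameters on CPU or shard them during construction
    events.append('enter')
    yield
    events.append('exit')

USE_DYNAMIC_ZERO = True
init_context = sharded_init if USE_DYNAMIC_ZERO else nullcontext

with init_context():
    events.append('build model')  # model = LlamaForCausalLM.from_pretrained(...)

print(events)  # ['enter', 'build model', 'exit']
```

With `USE_DYNAMIC_ZERO = False`, `nullcontext()` enters and exits without side effects, so the model is constructed normally.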