A profile visualization tool for bm1690

bigTpuProfile

bigTpuProfile is a performance-visualization tool for accelerator boards (currently only bm1690 is supported).

Quick start

Using bigTpuProfile involves two steps:

  1. Export the profile data
  2. Visualize it with tpu-mlir

Exporting profile data

Operators

  1. bigTpuProfile has three modes:

    1) Mode 0, pmu only: ignores the type of each cmd in an operator and records only timing; this mode has the smallest performance impact

    2) Mode 1, condensed cmd: records the type of each cmd in an operator; small performance impact, but DMA bandwidth statistics are inaccurate (overhead ~4%)

    3) Mode 2, detailed cmd: records detailed information for each cmd; typically used for debugging and for per-DMA bandwidth statistics (overhead 7 ~ 10%)

  2. max_record_num is the maximum number of profile records; make sure it is set larger than the number of records actually produced.

  3. Profile output file naming rule:

    cdm_profile_data_dev{DeviceID}-{CallNum}

    Profiling can run independently on multiple devices (DeviceID in the filename) and can be invoked multiple times on the same device (CallNum).
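As a sketch of the naming rule above, a small parser (a hypothetical helper, not part of bigTpuProfile) can recover DeviceID and CallNum from a dump name:

```python
import re

# Hypothetical helper: parse the dump naming convention
# cdm_profile_data_dev{DeviceID}-{CallNum} into integers.
def parse_profile_name(name):
    m = re.fullmatch(r"cdm_profile_data_dev(\d+)-(\d+)", name)
    if m is None:
        raise ValueError(f"not a profile dump name: {name!r}")
    device_id, call_num = map(int, m.groups())
    return device_id, call_num

print(parse_profile_name("cdm_profile_data_dev0-1"))  # (0, 1)
```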

tpu-train/tgi

torch.ops.my_ops.enable_profile(max_record_num, mode)  # mark the recording start point (records cmd info; mode: 0 pmu only, 1 condensed cmd, 2 detailed cmd)

torch.ops.my_ops.disable_profile()  # mark the recording end point
# tpu-train example
# part 0
torch.ops.my_ops.enable_profile(max_record_num, 0)  # enable profile without cmd info (pure pmu)
_ = a_tpu * b_tpu
torch.ops.my_ops.disable_profile()  # disable profile and dump data (cdm_profile_data_dev0-0)
# part 1
torch.ops.my_ops.enable_profile(max_record_num, 1)  # enable profile with condensed cmd info
_ = a_tpu + b_tpu
torch.ops.my_ops.disable_profile()  # disable profile (cdm_profile_data_dev0-1)
# part 2
torch.ops.my_ops.enable_profile(max_record_num, 2)  # enable profile with detailed cmd info
_ = a_tpu + b_tpu
torch.ops.my_ops.disable_profile()  # disable profile (cdm_profile_data_dev0-2)
# tgi (text-generation-inference) example 
# test_whole_parallel.py

def test_whole_model(batches=1, model_id="llama", model_path='/data', quantize=None, mode="chat"):
    .....
    generated_text = {}
    time_list = []
    for i in range(DECODE_TOKEN_LEN):
        os.environ['TOKEN_IDX'] = str(i)
        if it == 1 and i == DECODE_TOKEN_LEN - 4:           # condition
            torch.ops.my_ops.enable_profile(100960, 1)      # enable profile
        generate_start = time.time_ns()
        generations, next_batch, (forward_ns, decode_ns) = model.generate_token(
            next_batch
        )
        generate_end = time.time_ns()
        time_list.append(generate_end - generate_start)
        for generation in generations:
            if i == 0:
                generated_text[generation.request_id] = generation.tokens.texts[0]
            else:
                generated_text[generation.request_id] += generation.tokens.texts[0]
        logger.info(f"Token {i} {[g.tokens.texts[0] for g in generations]}")
        if it == 1 and i == DECODE_TOKEN_LEN - 1:     # condition
            torch.ops.my_ops.disable_profile()        # disable profile
        if next_batch is None:
            break
    .....
    

tpudnn

// assume the handle type is tpudnnHandle_t
auto pimpl = static_cast<TPUDNNImpl *>(handle);

pimpl->enableProfile(max_record_num, mode);  // mark the recording start point (records cmd info; mode: 0 pmu only, 1 condensed cmd, 2 detailed cmd)
pimpl->disableProfile();  // mark the recording end point
// tpudnn example
....
const int group_num = 1;
const int group_size = pimpl->getCoreNum();
pimpl->enableProfile();    // enable profile
status = pimpl->launchKernel("gelu_forward_multi_core", &api, sizeof(api), group_num, group_size);
pimpl->disableProfile();   // disable profile
pimpl->enableProfile(80);  // enable profile
status = pimpl->launchKernel("gelu_forward_multi_core", &api, sizeof(api), group_num, group_size);
pimpl->disableProfile();   // disable profile
return status;

bmodel

Unlike the operator workflow, bmodel profiling is controlled through environment variables; you only need to make sure the maximum record counts are large enough, and no mode needs to be set:

ENABLE_ALL_PROFILE=1 enables profiling

TPUKERNEL_FIRMWARE_PATH=/home/xxx/libfirmware_core.so sets the firmware .so; required if the bmodel version is too old

CDM_BDC_RECORD_SIZE sets the maximum number of bdc pmu records

CDM_GDMA_RECORD_SIZE sets the maximum number of gdma pmu records

CDM_SDMA_RECORD_SIZE sets the maximum number of sdma pmu records

CDM_CDMA_RECORD_SIZE sets the maximum number of cdma pmu records

# use the default libfirmware_core.so and record counts
ENABLE_ALL_PROFILE=1 tpu-model-rt --bmodel ./xxx.bmodel
 
# specify the firmware .so and the record counts
TPUKERNEL_FIRMWARE_PATH=/home/xxx/libfirmware_core.so ENABLE_ALL_PROFILE=1  CDM_BDC_RECORD_SIZE=362144 CDM_SDMA_RECORD_SIZE=262144 CDM_GDMA_RECORD_SIZE=262144 CDM_CDMA_RECORD_SIZE=10000 tpu-model-rt  --bmodel ./xxx.bmodel
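The same invocation can also be scripted from Python; the sketch below builds the environment for a profiled run (the record sizes are placeholder values, and the firmware path is only needed for old bmodels):

```python
import os
import subprocess

# Build the environment for a profiled tpu-model-rt run. Record sizes
# are placeholders; tune them so they exceed the records produced.
def profile_env(firmware_path=None, bdc=262144, gdma=262144, sdma=262144, cdma=10000):
    env = dict(os.environ)
    env["ENABLE_ALL_PROFILE"] = "1"
    env["CDM_BDC_RECORD_SIZE"] = str(bdc)
    env["CDM_GDMA_RECORD_SIZE"] = str(gdma)
    env["CDM_SDMA_RECORD_SIZE"] = str(sdma)
    env["CDM_CDMA_RECORD_SIZE"] = str(cdma)
    if firmware_path is not None:  # only needed for old bmodels
        env["TPUKERNEL_FIRMWARE_PATH"] = firmware_path
    return env

# subprocess.run(["tpu-model-rt", "--bmodel", "./xxx.bmodel"], env=profile_env())
```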

Visualization

Parse and visualize the dump with bigTpuProfile:

# install bigTpuProfile
tar xvzf ./bigTpuProfile-xxx.tar.gz
pip install ./dist/bigtpuprofile-xxx-manylinux1_x86_64.whl
# run bigTpuProfile -h to see the available options
bigTpuProfile cdm_profile_data_devX-X/ result_out --arch BM1690  # results are written to result_out

Data analysis

Folder structure

/result_out/
├── tiuRegInfo_x
├── tdmaRegInfo_x
├── cdmaRegInfo_x
├── PerfDoc   document-based visualization results
└── PerfWeb   web-based visualization results

doc

PerfAI_output.xlsx contains an overview plus the valid data recorded by each engine (tiu, gdma, sdma, cdma) on each core.

Sheets are named engineType_coreId
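Following the engineType_coreId rule, a sheet name can be split back into its parts (a hypothetical helper, not part of bigTpuProfile):

```python
# Hypothetical helper: split a sheet name such as "gdma_0" into the
# engine type and the core id, per the engineType_coreId rule above.
KNOWN_ENGINES = {"tiu", "gdma", "sdma", "cdma"}

def parse_sheet_name(name):
    engine, _, core = name.rpartition("_")
    if engine not in KNOWN_ENGINES or not core.isdigit():
        raise ValueError(f"unexpected sheet name: {name!r}")
    return engine, int(core)

print(parse_sheet_name("gdma_0"))  # ('gdma', 0)
```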

web

result.html contains an overview, a visualization of each core, and cross-core comparison charts.

In the web view you can zoom into a region with the mouse wheel or the slider. Hovering over an instruction shows its global_idx; look that value up in column C of the matching engineType_coreId sheet of the doc output to find the instruction's details.

The other options are also useful for performance analysis (the TIU uArch Rate feature is not supported yet).

License

bigTpuProfile is licensed under the 2-Clause BSD license, except for third-party components.
