Skip to main content

Everything about modelbest data include data format mbtable, dataset, dataloader, and tools

Project description

项目名称

设计文档: https://www.kdocs.cn/l/cvKBY8SqJoCC
Release Note: https://www.kdocs.cn/l/cvR7iKFqYQqY
快速上手: https://modelbest.feishu.cn/docx/SfK7d4qSEoDKNgxEW38c8Nffnbf
使用样例: example/*.py

Update repo

cd to modelbest_sdk dir

git submodule update --init --recursive
git submodule update --remote --recursive
git submodule foreach --recursive git checkout master
git submodule foreach --recursive git pull origin master

0.3版本升级

  1. 开放Segment的注册机制(同BatchPacker)。不局限于token level的切分,可以自行设计数据结构,将物理数据映射到自定义数据结构中并给DetailedDoc.udd赋值,在采样时返回单条,一个简单的例子:
    • 参考example/custom/custom_udd_example.py (定义了Sentence,将多轮对话Messages拆分为单句,每个单句在采样时返回)
  2. DatasetInfo支持设置proto_type参数,用于指定数据集的格式,默认值为BaseDoc,支持的格式有BaseDoc、Messages、Zip、Raw
    • Messages:sdk默认反序列化完整的Messages对象,目前proto仅更新到Audio。如果需要再Messages proto定义中添加自定义字段,请使用Raw格式
    • Raw:sdk默认不做任何处理,直接返回原始数据bytes,请自行再自定义Segment和BatchPacker中处理
    • Zip:将多个文件夹当做一个数据集读取,支持基础数据打标能力,需要自行控制多个数据集间的行级对应关系
  3. DatasetInfo新增name变量,不设置时默认为path。进而支持了更改DatasetInfoList顺序,包括删除一些不用的dataset信息,不会影响datasetcheckpoint的加载;也支持了同一个数据集通过设置不同name来当成多个数据集来用的能力
  4. 将chunk_size拆分为chunk_size何cache_size,
    • chunk_size用于数据集拆分,明确每个卡的读取范围,建议设置为最接近sqrt(平均数据集行数)的2的幂,默认1024可以无需更改
    • cache_size用于设置单卡单次缓存的数据量,默认500MB,无需更改
  5. 针对超大数据的报警(超过100MB)和跳过(超过500MB)
  6. 移除了cuda_prefetch参数,由默认pin_memory提供快速拷贝到gpu的能力
  7. 特定场景下有用的环境变量:
    • CACHE_BASE_DIR:设置数据集元信息缓存目录,在知乎场景下可以避免初始化时加载缓存过久的问题
    • RETRY_ON_NOT_EXIST:如果文件系统崩溃导致数据集找不到,可以通过设置该变量来重试直到文件系统恢复

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

modelbest_sdk-0.3.1.tar.gz (6.7 MB view details)

Uploaded Source

Built Distribution

modelbest_sdk-0.3.1-py3-none-any.whl (6.8 MB view details)

Uploaded Python 3

File details

Details for the file modelbest_sdk-0.3.1.tar.gz.

File metadata

  • Download URL: modelbest_sdk-0.3.1.tar.gz
  • Upload date:
  • Size: 6.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.19

File hashes

Hashes for modelbest_sdk-0.3.1.tar.gz
Algorithm Hash digest
SHA256 083b97e4dc6f893c9ac8d999bf319b955ac9c65a46593c429b7a037840d05da2
MD5 f85450501e908e1aacb721bc3609e352
BLAKE2b-256 be4fe5af05621de6c05211170614d6d3fbbdb11c6dd8b7ba1fb294b240d7b58b

See more details on using hashes here.

File details

Details for the file modelbest_sdk-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: modelbest_sdk-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 6.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.19

File hashes

Hashes for modelbest_sdk-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2e24a47851d40e49f865c5a058dca1755beae0eb8154f81ac2ab34a436370372
MD5 cbfcd7ef80b594b144cf1690d07498ad
BLAKE2b-256 4950511043ca9e36791492bb9c3c876a8535d7f72f84787e497f87b597e0e376

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page