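A fast strategy search algorithm for distributed training partitioning (3D parallelism) of deep learning models, based on neural networks and heuristic strategies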

Project description

APSS(for Training): Automatically Distributed Deep Learning Parallelism Strategies Search by Self Play

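APSS is a fast strategy search algorithm for partitioning the distributed training of deep learning models (3D parallelism), based on neural networks and heuristic strategies. It first combines heuristic strategies with the training cluster environment to generate candidate strategies, then uses a Deep Pipeline Strategy Network (DPSN) to provide a detailed pipeline partition for each candidate. The network is trained offline with Contrastive Reinforcement Learning by Self-Play (CRLSP), which requires no real data collection and no fine-tuning in later applications. This repository implements APSS with Mindspore.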
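
The candidate-generation step can be pictured with a minimal sketch. It is not the APSS heuristic itself: it only assumes the common 3D-parallelism convention that the data-, tensor-, and pipeline-parallel degrees multiply to the number of devices, and that a pipeline cannot have more stages than the model has layers. The function name `enumerate_3d_candidates` and its parameters are hypothetical and do not come from the `apss` package.

```python
from typing import List, Tuple


def enumerate_3d_candidates(num_devices: int, num_layers: int) -> List[Tuple[int, int, int]]:
    """Return every (dp, tp, pp) with dp * tp * pp == num_devices and pp <= num_layers.

    Hypothetical helper for illustration only; APSS's real heuristics also
    take the cluster environment into account, which is not modeled here.
    """
    candidates = []
    for pp in range(1, num_devices + 1):
        # pp must divide the device count and cannot exceed the layer count.
        if num_devices % pp != 0 or pp > num_layers:
            continue
        rest = num_devices // pp
        for tp in range(1, rest + 1):
            if rest % tp != 0:
                continue
            dp = rest // tp
            candidates.append((dp, tp, pp))
    return candidates


if __name__ == "__main__":
    for dp, tp, pp in enumerate_3d_candidates(num_devices=8, num_layers=8):
        print(f"data={dp} tensor={tp} pipeline={pp}")
```

Each tuple fixes only the three parallel degrees. Deciding which layers go to which pipeline stage is the separate problem that the description above assigns to DPSN.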


Installation

Requirements:

  • Python >= 3.7
  • Mindspore >= 2.1.1

Method 1: With pip

pip install apss

Method 2: From source

git clone https://github.com/Cheny1m/APSS
cd APSS
pip install -e .

Usage and Examples

Run training in one step

python -m apss.training.apss_run --graph_size 8 --num_split 3 --rebuild_data
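  • graph_size and num_split are the number of layers in the problem and the number of pipeline splits to perform, respectively. Together these two command-line arguments define the size of the problem being trained and can be adjusted as needed.
  • rebuild_data controls whether training data is generated from the Data Synthesizer before training starts; enabling it is recommended by default. To resume training from a .ckpt, or to reuse previously generated training data unchanged, simply omit the --rebuild_data argument. The training data can be found in the /data directory.
  • Once training has run, .ckpt files are saved in the /output folder and logs in the /log folder. The training process and its data can be viewed live in a browser through tensorboard_logger.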
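
To give a feel for the problem size that graph_size and num_split describe, here is a brute-force sketch of pipeline partitioning. It rests on two assumptions that the text above does not confirm: num_split is read as the number of cut points between consecutive layers (giving num_split + 1 stages), and a partition is scored by its slowest stage. It is an exhaustive baseline for illustration, not the DPSN model, and every function name in it is hypothetical.

```python
from itertools import combinations
from typing import Iterator, List, Sequence, Tuple


def enumerate_partitions(graph_size: int, num_split: int) -> Iterator[Tuple[int, ...]]:
    """Yield every tuple of cut positions; a cut at c separates layer c-1 from layer c."""
    return combinations(range(1, graph_size), num_split)


def stage_costs(layer_costs: Sequence[float], cuts: Sequence[int]) -> List[float]:
    """Sum the per-layer costs inside each contiguous stage defined by the cuts."""
    bounds = [0, *cuts, len(layer_costs)]
    return [sum(layer_costs[a:b]) for a, b in zip(bounds, bounds[1:])]


def best_partition(layer_costs: Sequence[float], num_split: int) -> Tuple[int, ...]:
    """Return the cuts that minimize the cost of the slowest stage (assumed objective)."""
    return min(
        enumerate_partitions(len(layer_costs), num_split),
        key=lambda cuts: max(stage_costs(layer_costs, cuts)),
    )


if __name__ == "__main__":
    costs = [4, 1, 1, 1, 1, 1, 1, 4]  # made-up per-layer costs for 8 layers
    cuts = best_partition(costs, num_split=3)
    print(cuts, stage_costs(costs, cuts))
```

Under this reading, graph_size 8 with 3 cuts already has C(7, 3) = 35 possible partitions, and the count grows combinatorially with both arguments, which is why exhaustive search stops being practical for larger problems.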

How It Works

The pipeline of APSS.

Download files

Source Distribution

apss-0.3.0.tar.gz (35.2 kB)

Uploaded Source

Built Distribution

apss-0.3.0-py3-none-any.whl (40.6 kB)

Uploaded Python 3
