ByteCheckpoint: A Unified Checkpointing Library for LFMs

Project description

👋 Hi, everyone!
We are the ByteDance Seed team.

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

ByteCheckpoint is a unified, efficient and production-grade checkpointing system for large foundation model development.

ByteCheckpoint is the open-source implementation of our research paper: ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development.

ByteCheckpoint is easy to use and efficient, offering:

✔ Framework-Agnostic API: Provides a unified checkpointing entrypoint, i.e., bytecheckpoint.save and bytecheckpoint.load, to support various parallelism configurations across different frameworks.

✔ Load-time Checkpoint Resharding: Enables seamless checkpoint reloading with arbitrary new parallelism configurations, eliminating the need for manual resharding scripts.

✔ Optimized I/O Performance: Integrates advanced techniques such as asynchronous and parallel I/O, D2H tensor copying with pinned memory, load-balanced checkpointing, and decomposed tensor representation.

✔ Comprehensive Toolset: Provides utilities for checkpoint merging/conversion/modification and metadata/tensor file inspection. Enables flexible checkpoint transfer and management.

📰 News

[2025/04] We officially released ByteCheckpoint! 🔥

[2024/12] ByteCheckpoint was accepted to NSDI 2025.

🚀 Getting started

Installation

Install ByteCheckpoint from source.

git clone https://github.com/ByteDance-Seed/ByteCheckpoint.git
cd ByteCheckpoint
pip install -e .

Install ByteCheckpoint from PyPI.

pip install bytecheckpoint

Basic Usage

This section shows how to use ByteCheckpoint to save, load, and merge checkpoints.

In ByteCheckpoint, a checkpoint consists of three parts (folders):

model: Contains the model checkpoint, consisting of one .metadata checkpoint metadata file and multiple .distcp tensor data files.
optimizer: Contains the optimizer checkpoint, consisting of one .metadata checkpoint metadata file and multiple .distcp tensor data files.
extra_state: Contains user-saved picklable objects, e.g., the dataloader state dictionary and RNG states.
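
To make the layout above concrete, here is a minimal sketch that walks a checkpoint directory and counts the files in each of the three folders. The helper name `summarize_checkpoint` is hypothetical and not part of ByteCheckpoint, and matching files by the `.metadata` and `.distcp` name endings is an assumption based on the description above.

```python
from pathlib import Path


def summarize_checkpoint(ckpt_path):
    """Count metadata and tensor files in each expected checkpoint folder.

    Hypothetical helper: the folder names come from the layout described
    above; matching files by name ending is an assumption.
    """
    summary = {}
    for part in ("model", "optimizer", "extra_state"):
        folder = Path(ckpt_path) / part
        if not folder.is_dir():
            summary[part] = None  # folder missing from this checkpoint
            continue
        files = [p for p in folder.iterdir() if p.is_file()]
        summary[part] = {
            "metadata": sum(p.name.endswith(".metadata") for p in files),
            "distcp": sum(p.name.endswith(".distcp") for p in files),
            "other": sum(
                not p.name.endswith((".metadata", ".distcp")) for p in files
            ),
        }
    return summary
```

A healthy model or optimizer folder should report exactly one metadata file and at least one tensor data file. A `None` entry means the folder is absent.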

Save and Load Checkpoint

Collect the model, optimizer, and extra states (e.g., RNG states, learning rate scheduler state) from your training code.

import torch

# `model` and `optimizer` come from your training code.
checkpoint_state = {
    "model": model,
    "optimizer": optimizer,
    "extra_state": {"torch_rng_state": torch.get_rng_state()},
}

Save them with the ByteCheckpoint save API.

import bytecheckpoint as bcp
bcp.save(ckpt_path, checkpoint_state, framework="fsdp")

Load them with the ByteCheckpoint load API. The model and optimizer are loaded in place. The extra state is loaded into checkpoint_state["extra_state"].

bcp.load(ckpt_path, checkpoint_state, framework="fsdp")
torch.set_rng_state(checkpoint_state["extra_state"]['torch_rng_state'])
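
Since extra_state is described as holding picklable objects, it can be useful to check the dictionary before saving. The sketch below uses only the standard library. The helper `unpicklable_keys` and the example keys (`step`, `sampler_seed`, `callback`) are hypothetical and not part of ByteCheckpoint.

```python
import pickle


def unpicklable_keys(extra_state):
    """Return the keys of extra_state whose values fail a pickle round trip."""
    bad = []
    for key, value in extra_state.items():
        try:
            pickle.loads(pickle.dumps(value))
        except Exception:
            # Any failure here means the value cannot be stored as-is.
            bad.append(key)
    return bad


# Hypothetical extra state: the lambda cannot be pickled.
extra_state = {"step": 10, "sampler_seed": 1234, "callback": lambda x: x}
print(unpicklable_keys(extra_state))  # → ['callback']
```

Running such a check once at startup surfaces a bad entry early instead of at the first checkpoint.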

Training Code Example (FSDP)

A simple single-machine FSDP training demo with ByteCheckpoint is available in demo/fsdp_save_reshard.py.

Start training and save a checkpoint at each step:

# Train on 8 GPUs
torchrun --master_addr=localhost --master_port=6000 --nproc_per_node=8 --nnodes=1 demo/fsdp_save_reshard.py --mode normal

Load the checkpoint and resume training. The checkpoint saved on 8 GPUs is loaded on 4, which exercises load-time resharding:

# Load on 4 GPUs
torchrun --master_addr=localhost --master_port=6000 --nproc_per_node=4 --nnodes=1 demo/fsdp_save_reshard.py --mode resume
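
When a run saves a checkpoint at every step, resuming usually starts by locating the most recent one. The sketch below assumes checkpoints are stored in subdirectories named global_step_<N>, a pattern inferred from the path tmp_checkpoint_dir_fsdp/global_step_0 used elsewhere in this README. The helper `latest_step_dir` is hypothetical and not part of ByteCheckpoint.

```python
import re
from pathlib import Path


def latest_step_dir(root):
    """Return the global_step_<N> subdirectory with the largest N, or None.

    Assumption: checkpoints live in directories named global_step_<N>.
    """
    root = Path(root)
    if not root.is_dir():
        return None
    best = None
    for child in root.iterdir():
        match = re.fullmatch(r"global_step_(\d+)", child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            # Compare numerically so global_step_10 beats global_step_9.
            if best is None or step > best[0]:
                best = (step, child)
    return None if best is None else best[1]
```

The returned path could then be passed as ckpt_path to the load call shown earlier.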

For multi-machine training, we recommend storing checkpoints on a shared file system that supports POSIX semantics, such as NFS.

Merge Model Checkpoint

To merge a model checkpoint, use scripts/merge_bcp.py.

Merge the checkpoint saved by the demo training code into the safetensors format:

python3 scripts/merge_bcp.py --framework fsdp \
--ckpt_path tmp_checkpoint_dir_fsdp/global_step_0 \
--output_path merged_ckpt_fsdp \
--safetensors_format \
--model_only

🔧 Advanced Usage Guide

API Arguments

Enable fast_saving and fast_loading to use asynchronous and parallel I/O techniques.
Enable save_decomposed_model_optimizer and load_decomposed_model_optimizer for FSDP (use_orig_params=True is required) to obtain model/optimizer state dict without additional communication and GPU-CPU synchronization.
Pass the role keyword (e.g., actor, critic) to support checkpointing in multi-role training scenarios, such as PPO training.
Enable strict in the load API to check whether the fully qualified names (FQNs) in a given state_dict exactly match those recorded in the .metadata file.
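
The options above can be collected in one place before calling the save and load APIs. The sketch below only assembles dictionaries. It assumes each option is a keyword argument of bcp.save or bcp.load with the value types shown, which this README implies but does not spell out. The helper `build_checkpoint_kwargs` is hypothetical.

```python
def build_checkpoint_kwargs(fast=True, decomposed=False, role=None, strict=True):
    """Assemble keyword arguments for the save and load calls.

    The argument names come from the list above; treating them all as
    keyword arguments with these value types is an assumption.
    """
    save_kwargs = {"framework": "fsdp", "fast_saving": fast}
    load_kwargs = {"framework": "fsdp", "fast_loading": fast, "strict": strict}
    if decomposed:
        # Per the list above, this requires FSDP with use_orig_params=True.
        save_kwargs["save_decomposed_model_optimizer"] = True
        load_kwargs["load_decomposed_model_optimizer"] = True
    if role is not None:
        # For example "actor" or "critic" in PPO-style multi-role training.
        save_kwargs["role"] = role
        load_kwargs["role"] = role
    return save_kwargs, load_kwargs


save_kwargs, load_kwargs = build_checkpoint_kwargs(role="actor")
# Intended use (not executed here; it needs a distributed training job):
#   bcp.save(ckpt_path, checkpoint_state, **save_kwargs)
#   bcp.load(ckpt_path, checkpoint_state, **load_kwargs)
```

Keeping the save and load options paired this way makes it harder to enable the decomposed representation on one side only.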

Configuration

Enable BYTECHECKPOINT_ENABLE_TREE_TOPO to improve the stability of model/optimizer planning at large scale.
Enable BYTECHECKPOINT_ENABLE_PINNED_MEM_D2H to use the pinned CPU memory pool to accelerate D2H tensor copying.
Adjust BYTECHECKPOINT_STORE_WORKER_COUNT and BYTECHECKPOINT_LOAD_WORKER_COUNT to tune the I/O performance.

Please refer to config.py for more details.
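
As a rough illustration, the settings above could be applied from Python before the library is imported. This sketch assumes they are read from environment variables, that "1" enables a switch, and that worker counts are decimal strings. None of this is confirmed by this README, so check config.py for the actual names, formats, and defaults. The worker count of 8 is an arbitrary example value.

```python
import os

# Assumption: config.py reads these settings from environment variables,
# so they are set before `import bytecheckpoint`. The value formats
# ("1" for on, decimal strings for counts) are also assumptions.
settings = {
    "BYTECHECKPOINT_ENABLE_TREE_TOPO": "1",
    "BYTECHECKPOINT_ENABLE_PINNED_MEM_D2H": "1",
    "BYTECHECKPOINT_STORE_WORKER_COUNT": "8",  # arbitrary example value
    "BYTECHECKPOINT_LOAD_WORKER_COUNT": "8",  # arbitrary example value
}

for name, value in settings.items():
    # setdefault keeps any value already exported in the shell.
    os.environ.setdefault(name, value)
```

Using setdefault means a value exported by the job launcher takes precedence over the defaults in the script.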

🤝 Contribution Guide

Community contributions are welcome. Please check out the Contribution Guidance.

Code Formatting

We use ruff to enforce strict code formatting when reviewing PRs. To reformat your code locally, make sure you have installed the latest version of ruff.

pip install ruff

Then you can format code with:

bash format_code.sh

Testing

Run local tests with:

bash test.sh

📄 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

😊 Citation and Acknowledgement

If you find this project helpful, please give us a star ⭐ and cite our paper:

@article{wan2024bytecheckpoint,
  title={ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development},
  author={Wan, Borui and Han, Mingji and Sheng, Yiyao and Peng, Yanghua and Lin, Haibin and Zhang, Mofan and Lai, Zhichao and Yu, Menghan and Zhang, Junda and Song, Zuquan and Liu, Xin and Wu, Chuan},
  journal={arXiv preprint arXiv:2407.20143},
  year={2024}
}

ByteCheckpoint is inspired by the design of PyTorch Distributed Checkpoint (DCP).

🌱 About ByteDance Seed Team

Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.

Project details

Release history

0.0.2

Jul 10, 2025

This version

0.0.1

Apr 2, 2025

Download files

Source Distribution

bytecheckpoint-0.0.1.tar.gz (12.0 kB)

Uploaded Apr 2, 2025 Source

Built Distribution

bytecheckpoint-0.0.1-py2.py3-none-any.whl (498.2 kB)

Uploaded Apr 2, 2025 Python 2, Python 3

File details

Details for the file bytecheckpoint-0.0.1.tar.gz.

File metadata

Download URL: bytecheckpoint-0.0.1.tar.gz
Upload date: Apr 2, 2025
Size: 12.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for bytecheckpoint-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`94e11fd145789582171d4d313863692dd56f2b87f1f850b6c7ef45bfa2ea6440`
MD5	`7006399b09035df72e20e32beb5e9ce4`
BLAKE2b-256	`82a46dddafc95f5a14dfeead0f1813ecd3a1a9ad33496c1e49d5e756c182cbb1`

File details

Details for the file bytecheckpoint-0.0.1-py2.py3-none-any.whl.

File metadata

Download URL: bytecheckpoint-0.0.1-py2.py3-none-any.whl
Upload date: Apr 2, 2025
Size: 498.2 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for bytecheckpoint-0.0.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`223ccd10ab02ea2aaedcc6b986b5016eb5bbfe6bf17249dbb3887f8a2134284a`
MD5	`5e05b87d9899d8d0b4809ee0b5a9929d`
BLAKE2b-256	`76059b9d376df7ece1d487176c0df77d41e970056a87e9a769da580ed7bca44f`
