Fault-tolerant algorithms for large models based on bit-flip error correction
fault-tolerance
Fault-tolerant algorithms for large and small models based on bit-flip error correction.
Requirements
- Python >= 3.8
- torch
- tqdm
Installation
Install PyTorch first:
https://pytorch.org/get-started/locally/
Then install this package:
pip install fault-tolerance
Usage
1. Error Injection (eject_error.py)
The eject_error.py module is used to inject bit-flip errors into a model.
Main function:
- inject_error_to_model
Parameters:
- model: A model implemented using PyTorch.
- error_rate: Bit-flip error rate used during error injection. The default value is 1e-6.
- seed
- chunk_size
This function injects random bit errors into model parameters to simulate hardware faults.
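To make the fault model concrete, here is a minimal plain-Python sketch of random bit-flip injection on float32 values. It is not the package's implementation. The helper names flip_bit and inject_bit_flips are hypothetical, the sketch uses struct rather than PyTorch tensors, and it assumes that error_rate is an independent per-bit flip probability and that seed makes a run reproducible.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the float32 encoding of value."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    raw ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw))
    return flipped


def inject_bit_flips(values, error_rate=1e-6, seed=None):
    """Return a copy of values with each of the 32 bits flipped with
    probability error_rate (assumed meaning of the parameter)."""
    rng = random.Random(seed)
    corrupted = []
    for v in values:
        (raw,) = struct.unpack("<I", struct.pack("<f", v))
        for bit in range(32):
            if rng.random() < error_rate:
                raw ^= 1 << bit
        corrupted.append(struct.unpack("<f", struct.pack("<I", raw))[0])
    return corrupted


# Flipping the sign bit negates the value; flipping the top exponent bit
# of 1.0 (0x3F800000 -> 0x7F800000) turns it into infinity.
print(flip_bit(1.0, 31))
print(flip_bit(1.0, 30))
```

The second call shows why a single flipped bit can be damaging: one flip in the exponent is enough to turn an ordinary weight into inf.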
2. FRP Protection for Large Models (frp_large_model.py)
This module implements FRP-based protection for large models.
Main functions:
- encode: Encodes model parameters using BCH codes. The encoding result is written in-place to the model parameters, where the 63-bit BCH codeword is stored using int64.
- decode: Recovers the original float32 parameters in-place from the BCH-encoded int64 values stored in param.data.
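The idea of packing a float32 into a BCH codeword that fits in an int64 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's code. It assumes a binary BCH code of length 63 built over GF(2^6) with the primitive polynomial x^6 + x + 1 and a designed correction capability of t = 5, and it covers only systematic encoding and an error-detecting remainder check; the error-correcting decoder is omitted. All function names are hypothetical.

```python
import struct

# GF(2^6) log/antilog tables built from the primitive polynomial x^6 + x + 1.
EXP, LOG = [0] * 126, [0] * 64
_x = 1
for _i in range(63):
    EXP[_i] = EXP[_i + 63] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x40:
        _x ^= 0x43  # reduce modulo x^6 + x + 1


def minimal_poly(i):
    """Return (bitmask, coset) for the minimal polynomial of alpha^i over GF(2)."""
    coset, c = [], i % 63
    while c not in coset:
        coset.append(c)
        c = (2 * c) % 63
    poly = [1]  # GF(64) coefficients, lowest degree first
    for c in coset:  # multiply poly by (x + alpha^c)
        nxt = [0] * (len(poly) + 1)
        for d, coef in enumerate(poly):
            nxt[d + 1] ^= coef
            nxt[d] ^= EXP[LOG[coef] + c] if coef else 0
        poly = nxt
    # The product over a full conjugate class has coefficients 0 or 1.
    return sum(bit << d for d, bit in enumerate(poly)), coset


def clmul(a, b):
    """Multiply two GF(2) polynomials stored as integer bitmasks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(a, g):
    """Remainder of a divided by g, both GF(2) polynomials as bitmasks."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a


def bch_generator(t):
    """Generator polynomial with roots alpha^1 .. alpha^(2t)."""
    g, covered = 1, set()
    for i in range(1, 2 * t + 1):
        if i in covered:
            continue
        m, coset = minimal_poly(i)
        covered.update(coset)
        g = clmul(g, m)
    return g


G = bch_generator(5)            # assumed t = 5
PARITY_BITS = G.bit_length() - 1


def encode_float32(value):
    """Systematic encoding: float32 bits in the high part, parity in the low part."""
    (data,) = struct.unpack("<I", struct.pack("<f", value))
    shifted = data << PARITY_BITS
    return shifted | poly_mod(shifted, G)


def is_valid(codeword):
    """A codeword is valid when it is divisible by the generator polynomial."""
    return poly_mod(codeword, G) == 0


def extract_float32(codeword):
    """Read the float32 payload back out of an uncorrupted codeword."""
    return struct.unpack("<f", struct.pack("<I", codeword >> PARITY_BITS))[0]
```

Under these assumptions the generator polynomial has degree 27, so a 32-bit payload plus parity occupies 59 bits and fits in a non-negative int64. The code parameters actually used by frp_large_model.py are not stated here and may differ.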
3. FRP Protection for Small Models (frp_little_model.py)
This module provides the same FRP-based protection mechanism as frp_large_model.py, but is optimized for smaller models.
4. ZMORP Protection (zmorp_large_model.py and zmorp_little_model.py)
These modules implement ZMORP-based fault-tolerance protection.
Main functions:
- protect_model: Adds fault-tolerance protection to all parameters of the model:
  - zmorp_large_model.py protects float32 parameters
  - zmorp_little_model.py protects float16 parameters
- recover_model: Recovers protected parameters of the model after potential bit-flip errors.
Example
import torch
import fault_tolerance as ft
model = ...
# Apply protection before any faults occur
ft.protect_model(model)
# Inject errors to simulate hardware faults
ft.inject_error_to_model(model, error_rate=1e-6)
# Recover model parameters
ft.recover_model(model)
File details
Details for the file fault_tolerance-0.1.2.tar.gz.
File metadata
- Download URL: fault_tolerance-0.1.2.tar.gz
- Upload date:
- Size: 13.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a6cd02826f46c1b82b27b730e54a05103fb087a044efc0949120746b505019bb |
| MD5 | c25eb2dfa72c003920143b6d9c555ba2 |
| BLAKE2b-256 | 56c3f29a36ce716921ca6c5801afbbe71bd28db9b152e25b352f3a1cac990333 |
File details
Details for the file fault_tolerance-0.1.2-py3-none-any.whl.
File metadata
- Download URL: fault_tolerance-0.1.2-py3-none-any.whl
- Upload date:
- Size: 17.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcfeb0a46dd71935f55c3b7cbfda4273eb592daf3db3f9553adc233ad237f04a |
| MD5 | 2987266735333369757bcc2f0f981b9f |
| BLAKE2b-256 | c6796c4498c30979c44da460ffe4baa1255a604f5e3f9517f383f214c1f5b82e |