Fault-tolerant algorithms for large models based on bit-flip error correction

fault-tolerance

Fault-tolerant algorithms for large and small models based on bit-flip error correction.

Requirements

  • Python >= 3.8
  • torch
  • tqdm

Installation

Install PyTorch first:

https://pytorch.org/get-started/locally/

Then install this package:

pip install fault-tolerance

Usage

1. Error Injection (eject_error.py)

The eject_error.py module injects bit-flip errors into a model.

Main function:

  • inject_error_to_model

Parameters:

  • model: A model implemented using PyTorch.
  • error_rate: Bit-flip error rate used during error injection. The default value is 1e-6.
  • seed
  • chunk_size

This function injects random bit errors into model parameters to simulate hardware faults.
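To make the idea of a per-bit error rate concrete, here is a minimal pure-Python sketch of bit-flip injection on float32 values. It is not the package's implementation: the helper names (float_to_bits, bits_to_float, inject_bit_flips) are hypothetical, and it assumes that error_rate is the independent flip probability of each bit, so a model with N float32 parameters would see about 32 * N * error_rate flips on average.

```python
import random
import struct


def float_to_bits(x):
    """Return the 32-bit IEEE 754 pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits):
    """Reinterpret a 32-bit unsigned pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def inject_bit_flips(values, error_rate=1e-6, seed=None):
    """Flip each bit of each float32 value independently with probability
    error_rate (assumed meaning). Returns (corrupted_values, flip_count)."""
    rng = random.Random(seed)  # a fixed seed makes the injection reproducible
    corrupted, flips = [], 0
    for x in values:
        bits = float_to_bits(x)
        for k in range(32):
            if rng.random() < error_rate:
                bits ^= 1 << k
                flips += 1
        corrupted.append(bits_to_float(bits))
    return corrupted, flips


# A single flip in the exponent field can be catastrophic:
# flipping bit 30 of 1.0 (0x3F800000) yields 0x7F800000.
print(bits_to_float(float_to_bits(1.0) ^ (1 << 30)))  # → inf
```

The last line shows why unprotected weights are fragile: one flipped exponent bit turns 1.0 into infinity.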


2. FRP Protection for Large Models (frp_large_model.py)

This module implements FRP-based protection for large models.

Main functions:

  • encode

    Encodes model parameters using BCH codes. The result is written in place to the model parameters: each 63-bit BCH codeword is stored as an int64 value.

  • decode

    Recovers the original float32 parameters in place from the BCH-encoded int64 values stored in param.data.
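The following sketch illustrates how a float32 value can be wrapped in a 63-bit BCH codeword that fits in a signed int64 (63 bits never touch the sign bit). It rests on assumptions that this description does not confirm: a BCH(63, 36) code correcting t = 5 errors, built over GF(64) with the primitive polynomial x^6 + x + 1, with the float32 bits placed systematically in the upper part of the codeword. All function names are hypothetical. Only encoding and error detection are shown; actual error correction (for example Berlekamp-Massey decoding) is omitted.

```python
import struct


def gf64_tables():
    """Exp/log tables for GF(64) generated by x^6 + x + 1 (assumed)."""
    exp, log = [0] * 126, [0] * 64
    a = 1
    for i in range(63):
        exp[i] = exp[i + 63] = a
        log[a] = i
        a <<= 1
        if a & 0x40:
            a ^= 0x43  # reduce modulo x^6 + x + 1
    return exp, log


def generator_poly(t=5):
    """Generator polynomial of the length-63 BCH code with designed
    distance 2t + 1, returned as an int bitmask (bit d = coeff of x^d)."""
    exp, log = gf64_tables()
    roots = set()
    for i in range(1, 2 * t + 1):
        r = i
        for _ in range(6):  # conjugates alpha^(i * 2^j)
            roots.add(r)
            r = (r * 2) % 63
    g = [1]  # coefficients in GF(64), lowest degree first
    for r in sorted(roots):
        nxt = [0] * (len(g) + 1)
        for d, c in enumerate(g):
            nxt[d + 1] ^= c  # c * x
            if c:
                nxt[d] ^= exp[log[c] + r]  # c * alpha^r
        g = nxt
    assert all(c in (0, 1) for c in g)  # coefficients collapse to GF(2)
    return sum(c << d for d, c in enumerate(g))


def poly_mod(a, g):
    """Remainder of a(x) divided by g(x) over GF(2)."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a


def encode_float32(x, g):
    """Systematic encoding: float32 bits on top, parity bits below."""
    msg = struct.unpack("<I", struct.pack("<f", x))[0]
    shifted = msg << (g.bit_length() - 1)
    return shifted | poly_mod(shifted, g)


def decode_payload(codeword, g):
    """Extract the float32 payload; assumes the codeword is error-free."""
    msg = (codeword >> (g.bit_length() - 1)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", msg))[0]


G = generator_poly(t=5)
cw = encode_float32(0.15625, G)
print(cw < 2**63, poly_mod(cw, G) == 0)  # fits in int64, zero remainder
print(poly_mod(cw ^ (1 << 40), G) != 0)  # a flipped bit is detected
```

A zero remainder means the stored word is a valid codeword; any nonzero remainder signals corruption, which a full BCH decoder would then locate and correct.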


3. FRP Protection for Small Models (frp_little_model.py)

This module provides the same FRP-based protection mechanism as frp_large_model.py, but is optimized for smaller models.


4. ZMORP Protection (zmorp_large_model.py and zmorp_little_model.py)

These modules implement ZMORP-based fault-tolerance protection.

Main functions:

  • protect_model

    Adds fault-tolerance protection to all parameters of the model:

    • zmorp_large_model.py protects float32 parameters
    • zmorp_little_model.py protects float16 parameters
  • recover_model

    Recovers protected parameters of the model after potential bit-flip errors.


Example

import torch
import fault_tolerance as ft

model = ...

# Apply protection before any faults occur
ft.protect_model(model)

# Inject bit-flip errors to simulate hardware faults
ft.inject_error_to_model(model, error_rate=1e-6)

# Recover the protected parameters
ft.recover_model(model)

