cuPyLMA: a Multi-GPU Levenberg-Marquardt Optimizer powered by cuPyNumeric
Project description
cuPyLMA is a scalable deep-learning optimizer based on the Levenberg-Marquardt algorithm. It supports multi-GPU execution via NVIDIA cuPyNumeric, a NumPy-like distributed scientific computing framework.
To exploit the performance of multiple GPUs, cuPyLMA explicitly stores the full Jacobian matrix required by the Levenberg-Marquardt algorithm. This is in contrast to most existing solutions, which represent the Jacobian implicitly via Jacobian-vector products (JVP) and vector-Jacobian products (VJP) and therefore expose little parallelism.
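For illustration, the two approaches can be contrasted with PyTorch's `torch.func` API (a sketch with a hypothetical tiny model; cuPyLMA's internals may differ):

```python
import torch
from torch.func import functional_call, jacrev, jvp

# Hypothetical tiny model, for illustration only.
model = torch.nn.Linear(3, 2)
params = dict(model.named_parameters())
x = torch.randn(5, 3)

def f(p):
    # Residuals as a flat 1-D vector, as Levenberg-Marquardt expects.
    return functional_call(model, p, (x,)).flatten()

# Explicit approach (cuPyLMA's choice): materialize the full Jacobian,
# one dense block per parameter tensor, so its rows can be processed in parallel.
J = jacrev(f)(params)  # e.g. J['weight'] has shape (10, 2, 3)

# Implicit approach: only Jacobian-vector products, one direction at a time.
tangent = {k: torch.ones_like(v) for k, v in params.items()}
_, Jv = jvp(f, (params,), (tangent,))  # a single directional slice, shape (10,)
```

The explicit Jacobian costs more memory but turns the LM solve into dense linear algebra, which is what cuPyNumeric can distribute across GPUs.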
cuPyLMA's design consists of two components, each holding a separate set of GPUs.
- The model component hosts a PyTorch deep learning model, with a data-parallel replica on each of its GPUs, and computes the Jacobian matrix.
- The optimizer component receives the Jacobian matrix from the model component and solves for the optimal parameter update with the Levenberg-Marquardt algorithm via cuPyNumeric.
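Conceptually, the optimizer component's job is the classic damped normal-equations solve of Levenberg-Marquardt: (JᵀJ + λI)δ = −Jᵀr. A minimal sketch with plain NumPy (cuPyNumeric is API-compatible, so `import cupynumeric as np` would run the same code across GPUs; `lm_update` and the fixed damping are illustrative, not cuPyLMA's actual API):

```python
import numpy as np  # with cuPyNumeric: `import cupynumeric as np`

def lm_update(J, r, damping=1e-3):
    """Solve (J^T J + damping * I) delta = -J^T r for the parameter update."""
    JtJ = J.T @ J
    A = JtJ + damping * np.eye(JtJ.shape[0])
    return np.linalg.solve(A, -J.T @ r)

# Toy linear least-squares problem with residual r(theta) = J @ theta - y.
J = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
r = J @ theta - y
theta = theta + lm_update(J, r)  # one LM step toward the least-squares solution
```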
Installation
TODO: upload to pip
Usage
The following code shows how to adapt existing PyTorch training code to use cuPyLMA.
```python
import cuPyLMA
import torch

class MyModel(torch.nn.Module):
    ...  # model implementation

model = MyModel()  # instantiate the deep learning model

# Configure the optimizer
devices = [torch.device('cuda:2'), torch.device('cuda:3')]  # CUDA devices held by the model component
loss_fn = torch.nn.MSELoss()  # loss function
residual_fn = lambda a, b: torch.flatten(a - b)  # residual function: the output must be a 1-D array
lma = cuPyLMA.LMA(
    model, devices,
    loss_fn, residual_fn,
)

# Train one step
x_train, y_train = ...  # training data
# Jacobian slice size: the Jacobian matrix is decomposed into row slices
# to reduce peak memory. A good starting point is
# `<batch size> / <#GPUs in the model component>`; if an out-of-memory
# error occurs, reduce it further.
slice_size = ...
loss, terminated = lma.step(x_train, y_train, slice_size)
```
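As a concrete illustration of the slice-size recommendation above, with a batch of 1024 samples and a model component holding 2 GPUs (both numbers hypothetical):

```python
batch_size = 1024   # hypothetical training batch size
model_gpus = 2      # hypothetical number of GPUs in the model component

# Recommended starting point for the Jacobian slice size.
slice_size = batch_size // model_gpus  # 512 rows per slice

# If lma.step runs out of memory, shrink the slice and retry,
# e.g. slice_size //= 2.
```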
Performance
cuPyLMA automatically selects a Jacobian computation strategy that reduces peak memory usage and improves performance.
Changelog
Release 0.1
- First release
Citation
Under construction ...
File details
Details for the file cupylma-0.1.tar.gz.
File metadata
- Download URL: cupylma-0.1.tar.gz
- Upload date:
- Size: 38.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `358919b52768c01a16a7ac6471c9848fef51fa70f873e032e3fd4fcdba9b96d4` |
| MD5 | `fea6c83eb3bfeee17e6671cf435f4547` |
| BLAKE2b-256 | `232f1ec0222f5366e8901985d188b63e4bd3a8fa5c6c73fc064065458b00da2c` |
File details
Details for the file cupylma-0.1-py3-none-any.whl.
File metadata
- Download URL: cupylma-0.1-py3-none-any.whl
- Upload date:
- Size: 7.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `dcabeabd605ba719b262ff159a41d60352c4b54acf3ff562229ad344ee7d172d` |
| MD5 | `8489438967bcc8a1976f8f2eecfc82ff` |
| BLAKE2b-256 | `5ff9d0610c4d41ad1a9904bb6d08c8894d41d01d90d8600f963d5de1eb798778` |