Utilities to pack and load HuggingFace transformer models quickly

Project description

fasthug

fasthug loads HuggingFace models from disk to GPU about 5x faster than AutoModelForCausalLM.from_pretrained in the benchmarks below. To get started, install:

pip install git+https://github.com/alvinwan/fasthug.git

To load a model with fasthug, use the fasthug.from_pretrained function instead of AutoModelForCausalLM.from_pretrained:

import fasthug
model = fasthug.from_pretrained("facebook/opt-125m").cuda() # 5x faster

5x speedup, 2x less memory loading full-precision models on server-grade GPUs (H100)

The below benchmarks compare these two lines:
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True).cuda()
model = fasthug.from_pretrained(model_id).cuda()
To rerun these benchmarks remotely, use the following command:
modal run utils/app.py::run_model --model-id facebook/opt-1.3b
Model	GPU	HuggingFace (s)	HuggingFace Mem (GiB)	fasthug (s)	fasthug Mem (GiB)	Speedup
facebook/opt-13b	H100	26.20 ± 0.49	49.03	5.83 ± 0.46	24.52	4.5x
facebook/opt-6.7b	H100	10.79 ± 0.23	25.40	2.44 ± 0.01	12.70	4.4x
facebook/opt-2.7b	H100	7.07 ± 0.06	10.24	1.09 ± 0.05	5.12	6.5x
facebook/opt-1.3b	H100	3.10 ± 0.37	5.02	0.61 ± 0.00	2.51	5.1x

If we instead use low_cpu_mem_usage=False, HuggingFace is slower to load overall; only opt-2.7b is roughly unchanged. The fasthug timings are the same as above.

Model	GPU	HuggingFace (s)	HuggingFace Mem (GiB)	fasthug (s)	fasthug Mem (GiB)	Speedup
facebook/opt-13b	H100	45.33 ± 2.75	49.03	5.83 ± 0.46	24.52	7.8x
facebook/opt-6.7b	H100	23.12 ± 3.22	25.40	2.44 ± 0.01	12.70	9.5x
facebook/opt-2.7b	H100	6.85 ± 0.32	10.24	1.09 ± 0.05	5.12	6.3x
facebook/opt-1.3b	H100	4.12 ± 0.14	5.02	0.61 ± 0.00	2.51	6.7x
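
The Speedup column appears to be the mean HuggingFace load time divided by the mean fasthug load time, rounded to one decimal place. That definition is an assumption, not something stated by the benchmark scripts. A quick check against the low_cpu_mem_usage=True timings above:

```python
# Assumed definition: speedup = baseline mean seconds / fasthug mean seconds.
def speedup(baseline_seconds, fasthug_seconds):
    return round(baseline_seconds / fasthug_seconds, 1)


# (HuggingFace mean, fasthug mean) in seconds, copied from the H100 table.
h100_timings = {
    "facebook/opt-13b": (26.20, 5.83),
    "facebook/opt-6.7b": (10.79, 2.44),
    "facebook/opt-2.7b": (7.07, 1.09),
    "facebook/opt-1.3b": (3.10, 0.61),
}

for model_id, (hf, fh) in h100_timings.items():
    print(f"{model_id}: {speedup(hf, fh)}x")
```

Under this definition the four rows come out to 4.5x, 4.4x, 6.5x, and 5.1x, matching the table.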

3x speedup, 2x less memory loading full-precision models on consumer GPUs (T4, 3060, M3).

The below benchmarks compare these two lines on Nvidia T4, Nvidia RTX 3060, and the Apple M3:
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True).cuda()
model = fasthug.from_pretrained(model_id).cuda()
To rerun these benchmarks locally, use the following command:
fhb facebook/opt-1.3b
Model	GPU	HuggingFace (s)	HuggingFace Mem (GiB)	fasthug (s)	fasthug Mem (GiB)	Speedup
facebook/opt-1.3b	T4	9.09 ± 1.43	5.02	6.07 ± 0.39	2.51	1.5x
facebook/opt-350m	T4	2.57 ± 1.04	1.26	1.19 ± 0.49	0.63	2.2x
facebook/opt-125m	T4	1.06 ± 0.04	0.48	0.27 ± 0.00	0.25	3.9x
facebook/opt-1.3b	3060	6.96 ± 0.03	5.02	1.66 ± 0.02	2.51	4.2x
facebook/opt-350m	3060	1.09 ± 0.06	1.26	0.39 ± 0.00	0.63	2.8x
facebook/opt-125m	3060	0.73 ± 0.06	0.48	0.20 ± 0.01	0.25	3.7x
facebook/opt-1.3b	M3	11.9 ± 1.17	-	2.65 ± 0.62	-	4.5x
facebook/opt-350m	M3	1.49 ± 0.22	-	0.49 ± 0.22	-	3.0x
facebook/opt-125m	M3	0.78 ± 0.12	-	0.27 ± 0.02	-	2.9x

fasthug currently only supports the quantization_config keyword argument, which you can use to quantize models on-the-fly.

from transformers.utils.quantization_config import BitsAndBytesConfig
config = BitsAndBytesConfig(load_in_8bit=True)
model = fasthug.from_pretrained(model_id, quantization_config=config)  # 3x faster

3x speedup, 200MB less memory loading and quantizing models on-the-fly

The below benchmarks compare these lines:
cfg8b = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, quantization_config=cfg8b)
model = fasthug.from_pretrained(model_id, quantization_config=cfg8b)
To rerun these benchmarks for 8bit quantization, use the following command.
modal run utils/app.py::run_model --model-id facebook/opt-1.3b --load-in-8bit
Model	GPU	HuggingFace (s)	HuggingFace Mem (GiB)	fasthug (s)	fasthug Mem (GiB)	Speedup
facebook/opt-13b	H100	18.35 ± 0.17	12.9	5.13 ± 0.03	12.7	3.6x
facebook/opt-6.7b	H100	8.07 ± 0.07	6.82	2.30 ± 0.01	6.69	3.5x
facebook/opt-2.7b	H100	3.58 ± 0.10	2.91	1.13 ± 0.01	2.71	3.2x
facebook/opt-1.3b	H100	3.20 ± 0.37	1.56	0.64 ± 0.00	1.39	5.0x

The next benchmarks cover on-the-fly 4-bit quantization, comparing these lines:
cfg4b = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, quantization_config=cfg4b)
model = fasthug.from_pretrained(model_id, quantization_config=cfg4b)
To rerun these benchmarks for 4bit quantization, use the following command.
modal run utils/app.py::run_model --model-id facebook/opt-1.3b --load-in-4bit
Note: Peak memory usage fluctuates wildly for these 4-bit benchmarks, and it is much larger than the peak memory usage in the 8-bit benchmarks. This is definitely a bug; whether it lies in fasthug or in bitsandbytes is not yet clear.

Model	GPU	HuggingFace (s)	HuggingFace Mem (GiB)	fasthug (s)	fasthug Mem (GiB)	Speedup
facebook/opt-13b	H100	15.60 ± 0.21	21.2	4.53 ± 0.14	27.7	3.4x
facebook/opt-6.7b	H100	8.39 ± 0.08	11.0	2.47 ± 0.06	10.9	3.4x
facebook/opt-2.7b	H100	3.58 ± 0.10	5.91	1.13 ± 0.01	7.07	3.2x
facebook/opt-1.3b	H100	3.58 ± 0.70	3.00	0.85 ± 0.18	2.83	4.2x

For the fastest load times, save the quantized model, and load the pre-quantized model with fasthug.

model.save_pretrained("/tmp/quantized")
model = fasthug.from_pretrained("/tmp/quantized") # 8x faster

8x speedup, 150MB less memory loading previously-quantized models

The below benchmarks compare these lines:
# save the quantized checkpoint first
cfg8b = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, quantization_config=cfg8b)
model.save_pretrained('/tmp/quantized')

# compare these two lines
model = AutoModelForCausalLM.from_pretrained('/tmp/quantized', low_cpu_mem_usage=True)
model = fasthug.from_pretrained('/tmp/quantized')
To rerun these benchmarks, use the following command.
modal run utils/app.py::run_model --model-id facebook/opt-1.3b --use-8bit-checkpoint
If you see an error like the following, run the same command again; Modal's container has not yet loaded an updated copy of the on-disk cache.
OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, 
model.ckpt.index or flax_model.msgpack found in directory
Note: For opt-13b, the quantized checkpoint may be corrupted, so that benchmark needs to be rerun.

Model	GPU	HuggingFace (s)	HuggingFace Mem (GiB)	fasthug (s)	fasthug Mem (GiB)	Speedup
facebook/opt-6.7b	H100	14.14 ± 0.53	6.69	1.50 ± 0.01	6.56	9.4x
facebook/opt-2.7b	H100	6.16 ± 0.14	2.71	0.70 ± 0.01	2.66	8.8x
facebook/opt-1.3b	H100	2.40 ± 0.08	1.39	0.49 ± 0.03	1.36	4.9x

Customization

fasthug only supports the quantization_config kwarg, to stay minimal and lightweight. The eventual goal is to support other commonly used arguments for model development.

import fasthug
import torch
from transformers import AutoModelForCausalLM
from transformers.utils.quantization_config import BitsAndBytesConfig

# If you pass args that fasthug doesn't support, pass `skip_unsupported_check=True`
model = fasthug.from_pretrained(
    "facebook/opt-125m",
    torch_dtype=torch.float16,
    skip_unsupported_check=True
)

# For args that fasthug doesn't support, initialize a model 'normally', save, then fasthug
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.float16)
model.save_pretrained('/tmp/half')
model = fasthug.from_pretrained("/tmp/half")

# You can load a 'normally' saved quantized model too, with any extra args you want
cfg8b = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    torch_dtype=torch.float16,
    quantization_config=cfg8b
)
model.save_pretrained('/tmp/quantized')
model = fasthug.from_pretrained("/tmp/quantized")

More example usage

import fasthug
from transformers import AutoModelForCausalLM
from transformers.utils.quantization_config import BitsAndBytesConfig

# Load model on GPU
model = fasthug.from_pretrained("facebook/opt-125m").cuda()

# Load and quantize model in 8 bits per weight
cfg8b = BitsAndBytesConfig(load_in_8bit=True)
model = fasthug.from_pretrained("facebook/opt-125m", quantization_config=cfg8b)

# Load already-quantized 8-bit model
model = fasthug.from_pretrained("facebook/opt-125m", quantization_config=cfg8b)
model.save_pretrained('/tmp/quantized')
model = fasthug.from_pretrained("/tmp/quantized")  # No need to pass in quantization_config again

On-the-fly 4-bit quantization shows wildly fluctuating peak memory usage that is larger than even that of 8-bit quantized models. This is true of both the baseline transformers model and fasthug-loaded models.

# Can do all of the above using 4 bit quantization too.
cfg4b = BitsAndBytesConfig(load_in_4bit=True)
model = fasthug.from_pretrained("facebook/opt-125m", quantization_config=cfg4b)

model = fasthug.from_pretrained("facebook/opt-125m", quantization_config=cfg4b)
model.save_pretrained('/tmp/quantized')
model = fasthug.from_pretrained("/tmp/quantized")

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", quantization_config=cfg4b)
model.save_pretrained('/tmp/quantized')
model = fasthug.from_pretrained("/tmp/quantized")

Development

Benchmarks

To run benchmarks locally, you can use the fhb utility. To run benchmarks remotely, use the Modal launcher script in utils/app.py.

# run benchmarks locally
fhb facebook/opt-125m
fhb facebook/opt-125m --load-in-8bit  # on Nvidia GPUs

# run benchmarks remotely
modal run utils/app.py::run_model --model-id facebook/opt-125m
modal run utils/app.py::run_model --model-id facebook/opt-125m --load-in-8bit
modal run utils/app.py::run_model --model-id facebook/opt-125m --use-8bit-checkpoint

Details on running local benchmarks using fhb

The utility is also available as fhbench or fasthugbench. In short, it compares the loading speed of fasthug vs HuggingFace. For Nvidia GPUs, this script also records peak memory usage.

usage: fhb [-h] [-n NUM_TRIALS] [-d {cpu,cuda,mps,none}] [-w WARMUP]
           [--load-in-8bit] [--load-in-4bit]
           [--quantization-config QUANTIZATION_CONFIG]
           model_id

Benchmark fasthug vs HuggingFace

positional arguments:
  model_id              Model identifier, e.g. facebook/opt-125m

options:
  -h, --help            show this help message and exit
  -n NUM_TRIALS, --num-trials NUM_TRIALS
                        Number of times to run each benchmark
  -d {cpu,cuda,mps,none}, --device {cpu,cuda,mps,none}
                        Device to load the model on (e.g., 'cuda', 'cpu', 'mps' or
                        'none' to automatically select)
  -w WARMUP, --warmup WARMUP
                        Number of warmup runs
  --load-in-8bit        Quantize the model to 8-bit using bitsandbytes
  --load-in-4bit        Quantize the model to 4-bit using bitsandbytes
  --quantization-config QUANTIZATION_CONFIG
                        Path to a quantization config file
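
As a rough illustration of what such a benchmark measures, the sketch below times two loader callables over several trials, after optional warmup runs, and reports mean ± standard deviation plus the speedup. It is not fhb's actual implementation: the names `benchmark` and `compare` are hypothetical, the choice of population standard deviation is an assumption, and the loaders are `time.sleep` stand-ins so the snippet runs without a GPU.

```python
import statistics
import time


def benchmark(load_fn, num_trials=5, warmup=1):
    """Return (mean, std) of wall-clock seconds over num_trials calls."""
    for _ in range(warmup):
        load_fn()  # warmup runs are executed but not timed
    times = []
    for _ in range(num_trials):
        start = time.perf_counter()
        load_fn()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.pstdev(times)


def compare(baseline_fn, candidate_fn, **kwargs):
    """Benchmark both callables and compute baseline mean / candidate mean."""
    base_mean, base_std = benchmark(baseline_fn, **kwargs)
    cand_mean, cand_std = benchmark(candidate_fn, **kwargs)
    return {
        "baseline": (base_mean, base_std),
        "candidate": (cand_mean, cand_std),
        "speedup": base_mean / cand_mean,
    }


# Stand-in loaders; in practice these would wrap the two from_pretrained calls.
result = compare(
    lambda: time.sleep(0.02),
    lambda: time.sleep(0.005),
    num_trials=3,
    warmup=1,
)
print("baseline: {:.3f} ± {:.3f} s".format(*result["baseline"]))
print("candidate: {:.3f} ± {:.3f} s".format(*result["candidate"]))
print("speedup: {:.1f}x".format(result["speedup"]))
```

To time real model loads, each callable would wrap one of the two compared lines, for example `lambda: fasthug.from_pretrained(model_id).cuda()`.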

Details on running remote benchmarks using the Modal script

The command above will spin up a CPU Modal instance to download the weights to a persisted volume, then spin up a GPU Modal instance to benchmark the model loading itself.

For the --use-8bit-checkpoint flag, we similarly load and quantize an 8-bit checkpoint on a CPU job first, then benchmark loading that checkpoint on a GPU job.

Usage: modal run utils/app.py::run_model [OPTIONS]

Options:
  --use-8bit-checkpoint / --no-use-8bit-checkpoint
  --load-in-4bit / --no-load-in-4bit
  --load-in-8bit / --no-load-in-8bit
  --num-trials INTEGER
  --warmup INTEGER
  --device TEXT
  --model-id TEXT                 [required]
  --help                          Show this message and exit.

Tests

Run tests using the following commands:

modal run utils/app.py::run_tests  # remotely
pytest tests  # locally

How it Works

fasthug uses existing PyTorch and safetensors memory mapping to load model weights into CPU.

For full-precision models, the user can later move these weights to GPU.
To quantize models on-the-fly, we simply use bitsandbytes normally to move and quantize weights.
To load pre-quantized models, we move weights to GPU immediately, so that bitsandbytes recognizes that the weights are pre-quantized.

The goal is to make loading models as fast as possible, to shorten the dev cycle for quick experiments.
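
To make the memory-mapping idea concrete, here is a minimal standard-library sketch, not fasthug's code. It writes a tiny file in the safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes) and reads one tensor back through `mmap`, so only the bytes actually touched are pulled from disk. The helper names `write_tiny_safetensors` and `read_tensor_mmap` are hypothetical, and only one-dimensional float32 tensors are handled.

```python
import json
import mmap
import os
import struct
import tempfile


def write_tiny_safetensors(path, tensors):
    """Write {name: list of floats} as float32 data in the safetensors layout."""
    header, chunks, offset = {}, [], 0
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            # Offsets are relative to the start of the data section.
            "data_offsets": [offset, offset + len(raw)],
        }
        chunks.append(raw)
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        f.write(b"".join(chunks))


def read_tensor_mmap(path, name):
    """Memory-map the file and decode only the requested tensor's bytes."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_len,) = struct.unpack("<Q", mm[:8])
        header = json.loads(mm[8 : 8 + header_len])
        start, end = header[name]["data_offsets"]
        data_base = 8 + header_len
        count = (end - start) // 4  # float32 is 4 bytes per element
        return list(struct.unpack_from(f"<{count}f", mm, data_base + start))


path = os.path.join(tempfile.mkdtemp(), "tiny.safetensors")
write_tiny_safetensors(path, {"layer.weight": [1.0, 2.0], "layer.bias": [0.5, 4.0, 8.0]})
print(read_tensor_mmap(path, "layer.bias"))
```

In real use, the safetensors and PyTorch libraries handle the mapping and produce tensors directly; the sketch only shows why mapped reads avoid loading the whole file up front.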

Project details

Release history

0.0.6 (Jul 24, 2025)
0.0.5 (Jul 23, 2025)
0.0.4 (Jul 23, 2025), this version
