
Automatic monkey-patches for quantized model fine-tuning — fixes QLoRA issues with custom architectures


qpatch


Automatic fixes for quantized model fine-tuning.

qpatch monkey-patches known incompatibilities when training 4-bit/8-bit quantized models with LoRA adapters — especially on models with custom CUDA kernels (Mamba, MoE, RWKV, xLSTM).

The Problem

QLoRA training fails on many non-standard architectures with cryptic errors:

AttributeError: 'NoneType' object has no attribute 'get'          # safetensors metadata
RuntimeError: expected mat1 and mat2 to have the same dtype        # uint8 vs float16
RuntimeError: mat1 and mat2 shapes cannot be multiplied            # fused kernels + 4-bit
RuntimeError: index_add_(): self (BFloat16) and source (Float)     # MoE dtype mismatch
RuntimeError: "fused_dropout" not implemented for 'Byte'           # dropout on uint8

The Fix

import qpatch
qpatch.patch_all()

# That's it. Now train normally.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-3-Nano-30B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    trust_remote_code=True,
)
model = get_peft_model(model, LoraConfig(r=32, target_modules=["up_proj", "down_proj"]))
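# `trainer` is your existing training setup (for example a transformers.Trainer
# built around `model`); no qpatch-specific changes are needed there.
trainer.train()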

What It Fixes

Issue | Error | Affected Models
Safetensors metadata | 'NoneType' has no attribute 'get' | Any model with metadata-less safetensors
LoRA dtype cast | unsigned char != c10::Half | Any 4-bit model with LoRA
MoE dtype mismatch | index_add_(): BFloat16 and Float | Nemotron, Mixtral, any MoE
Fused kernel bypass | shapes cannot be multiplied | Mamba, Nemotron-H, hybrid architectures

Install

pip install qpatch

Individual Patches

Apply only what you need:

import qpatch

qpatch.patch_safetensors_metadata()   # Fix None metadata in safetensors
qpatch.patch_lora_dtype_cast()        # Cast uint8 inputs to float16
qpatch.patch_moe_dtype_mismatch()     # Auto-cast MoE index_add_ dtypes
qpatch.patch_fused_kernel_bypass()    # Skip fused kernels for quantized models

GPU Architecture Notes

  • Volta (V100, GV100): use qpatch.patch_all(compute_dtype=torch.float16), since Volta has no native bf16 support
  • Ampere and newer (A100, H100): use qpatch.patch_all(compute_dtype=torch.bfloat16)
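
The choice above can be automated from the GPU's compute capability. The following is a minimal sketch, not part of qpatch: the helper names compute_dtype_name and detect_compute_dtype are hypothetical, and it assumes that bf16 is usable from compute capability 8.0 (Ampere) upward, which matches the notes above.

```python
def compute_dtype_name(capability):
    """Return "bfloat16" for compute capability 8.0+ (Ampere and newer), else "float16".

    `capability` is a (major, minor) tuple, e.g. (7, 0) for Volta.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


def detect_compute_dtype():
    """Query the current CUDA device via PyTorch and return the matching torch dtype."""
    import torch  # imported lazily so compute_dtype_name() works without PyTorch

    name = compute_dtype_name(torch.cuda.get_device_capability())
    return getattr(torch, name)
```

With these helpers, the call would become qpatch.patch_all(compute_dtype=detect_compute_dtype()) on a machine with a CUDA device.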

How It Works

qpatch applies targeted monkey-patches when you call patch_all() or one of the individual patch functions:

  1. patch_safetensors_metadata — Wraps transformers.modeling_utils.load_state_dict to detect None metadata and fall back to safetensors.torch.load_file
  2. patch_lora_dtype_cast — Wraps peft.tuners.lora.bnb.Linear4bit.forward to cast uint8 inputs to the compute dtype
  3. patch_moe_dtype_mismatch — Wraps torch.Tensor.index_add_ to auto-cast source tensors when dtypes mismatch
  4. patch_fused_kernel_bypass — Scans HuggingFace's cached model code and replaces fused-path conditions with if False: to force the quantization-safe slow path
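
To make step 4 concrete, here is a rough sketch of what rewriting a fused-path condition to if False: can look like. It is an illustration under stated assumptions, not qpatch's actual code: the function bypass_fused_paths and the marker name is_fast_path_available are chosen for the example, and the sketch handles only single-line if statements, not elif branches or multi-line conditions.

```python
import re

# Assumed marker: a name that gates the fused path in the model source.
FUSED_MARKERS = ("is_fast_path_available",)

# Matches a single-line `if <condition>:` and captures indentation and condition.
_IF_LINE = re.compile(r"^(\s*)if\s+(.*):\s*$")


def bypass_fused_paths(source, markers=FUSED_MARKERS):
    """Replace `if <condition mentioning a marker>:` with `if False:`.

    Indentation is preserved so the rewritten module still parses, and the
    fused branch simply becomes unreachable.
    """
    out = []
    for line in source.splitlines():
        match = _IF_LINE.match(line)
        if match and any(name in match.group(2) for name in markers):
            out.append(f"{match.group(1)}if False:  # fused path disabled")
        else:
            out.append(line)
    return "\n".join(out)


SAMPLE = """def forward(self, x):
    if is_fast_path_available and x.is_cuda:
        return fused(x)
    return slow(x)"""

patched = bypass_fused_paths(SAMPLE)
```

Running the function on already-rewritten source leaves it unchanged, because the marker no longer appears in the condition.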

All patches are idempotent — calling patch_all() multiple times is safe.
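
One common way to get this kind of idempotence is to tag the wrapper and skip re-wrapping when the tag is present. The sketch below shows the pattern on a stand-in class; Layer, patch_forward, and the _sketch_patched attribute are hypothetical names for illustration and do not reflect qpatch internals.

```python
import functools

_PATCH_FLAG = "_sketch_patched"  # hypothetical marker attribute set on the wrapper


class Layer:
    """Stand-in for a library class whose forward() we want to wrap."""

    def forward(self, x):
        return x * 2


def patch_forward(cls=Layer):
    """Wrap cls.forward exactly once; return True if a patch was applied."""
    original = cls.forward
    if getattr(original, _PATCH_FLAG, False):
        return False  # already wrapped, so do nothing

    @functools.wraps(original)
    def wrapper(self, x):
        # Convert the input before delegating, analogous to a dtype cast.
        return original(self, float(x))

    setattr(wrapper, _PATCH_FLAG, True)
    cls.forward = wrapper
    return True
```

Because the second call sees the flag and returns early, the wrapper is never stacked on top of itself, however many times the patch function runs.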

Tested With

  • Nemotron-3-Nano-30B (Mamba hybrid)
  • Transformers 4.46–5.3
  • PEFT 0.12–0.18
  • bitsandbytes 0.42–0.49
  • PyTorch 2.5–2.10
  • CUDA 12.2–12.8
  • Volta (GV100), Pascal (P100), Ampere (A100)

License

MIT

Citation

@software{bond2026qpatch,
  author = {Bond, Andrew H.},
  title = {qpatch: Automatic fixes for quantized model fine-tuning},
  year = {2026},
  url = {https://github.com/ahb-sjsu/qpatch},
}

