
fraQtl runtime — drop-in loader for fraQtl-compressed Hugging Face checkpoints. Production LLM inference with calibration-aware compression.

Project description

fraQtl

Runtime KV-cache and weight compression for production LLM inference.

Drop-in. No retraining. Calibration-aware.


What it is

fraqtl-runtime is the runtime loader for fraQtl-compressed model artifacts. It enables:

  • Weight compression: load fraQtl-compressed Hugging Face checkpoints (e.g. fraQtl/Qwen3.6-35B-A3B-compressed) via standard transformers with trust_remote_code=True. The wheel ships the compiled loader that decodes the packed weights at load time.
  • Runtime KV-cache compression (separate, in active validation): a llama.cpp-compatible runtime layer that compresses the V cache at runtime — independent of weight format.

Install

pip install fraqtl-runtime

That's the entire setup. No license token required for loading published artifacts.
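
If you want a quick sanity check that the wheel is installed, reading the distribution metadata avoids importing the loader (which, as noted below, you never do directly):

from importlib.metadata import version

# Checks the installed distribution only; it does not import the loader itself.
print(version("fraqtl-runtime"))   # e.g. "0.1.1"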


Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "fraQtl/Qwen3.6-35B-A3B-compressed"

# trust_remote_code=True lets the repo's stub wire in the compiled fraQtl loader.
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True,
    torch_dtype=torch.bfloat16, device_map="auto",
)
tok = AutoTokenizer.from_pretrained(repo)

# Tokenize the prompt, move it to the model's device, and greedy-decode 20 new tokens.
ids = tok("The capital of France is", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))

trust_remote_code=True pulls a small stub from the model repo that imports the compiled loader from this wheel. You never write import fraqtl directly.
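
If you want to confirm that the class came from the repo's remote code rather than from an installed package, you can inspect it after loading; the exact module path and class name depend on what the repo ships:

# The model class lives in transformers' dynamic-module namespace
# (transformers_modules.*), fetched from the repo, not from this wheel.
print(type(model).__module__, type(model).__name__)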


High-level approach

fraQtl combines two ideas:

  1. Calibration-aware eigenbasis rotation — protect the input directions that matter for the deployment task; quantize the rest. The calibration corpus determines which directions are protected (this is FPT — fraQtl Pullback Theorem). See the toy sketch after this list.
  2. Per-row sign correction primitive — additional precision on top of low-bit quantization where it matters most for reasoning.

Both compose with standard quantization machinery (Lloyd-Max centroids, INT3 packing) and standard inference engines (HF transformers, llama.cpp).
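
As a rough intuition for idea (1), here is a toy sketch in plain PyTorch. It is not fraQtl's FPT algorithm, it skips idea (2) and the Lloyd-Max centroids entirely, and every name in it is illustrative: it just rotates one weight matrix into the eigenbasis of a calibration-input covariance, keeps the highest-variance directions in full precision, and rounds everything else to a crude 3-bit grid.

import torch

torch.manual_seed(0)
d_in, d_out, n_calib = 64, 32, 512

W = torch.randn(d_out, d_in)                                       # one linear layer's weight
X = torch.randn(n_calib, d_in) * torch.linspace(0.05, 3.0, d_in)   # anisotropic "calibration" activations

# 1. Eigenbasis of the calibration input covariance: columns of Q with
#    large eigenvalues are the directions that matter for this corpus.
cov = X.T @ X / n_calib
eigvals, Q = torch.linalg.eigh(cov)          # eigenvalues in ascending order

# 2. Rotate the weight into that basis.
W_rot = W @ Q

# 3. Protect the top-k directions; quantize the rest to 3 bits
#    (naive symmetric rounding stands in for Lloyd-Max centroids).
k = 8
protected = W_rot[:, -k:]                    # kept in full precision
rest = W_rot[:, :-k]
scale = rest.abs().max() / 3
rest_q = (rest / scale).round().clamp(-3, 3) * scale

# 4. Reassemble and rotate back: a drop-in replacement weight.
W_hat = torch.cat([rest_q, protected], dim=1) @ Q.T

# Quantization error now lives mostly in low-variance input directions,
# so outputs on calibration-like inputs barely move.
rel_err = (X @ W.T - X @ W_hat.T).norm() / (X @ W.T).norm()
print(f"relative output error: {rel_err:.4f}")

The real method differs in how the basis, the protected subspace, and the bit allocation are chosen; the point of the sketch is only that error placed in low-variance input directions barely shows up in the layer's outputs.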


Status

  • Public weight-compression artifacts on Hugging Face: huggingface.co/fraQtl
  • Runtime KV-cache compression layer: in active validation; public benchmark numbers will be published after the H100 measurement lock and manual review.
  • Methodology paper in preparation.

Links

  • Hugging Face organization: huggingface.co/fraQtl

License

Proprietary. The compressed model weights and loader are free to install and use for research and evaluation. Production / commercial use: contact fraQtl.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distribution


fraqtl_runtime-0.1.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (857.4 kB)

Uploaded for CPython 3.11, manylinux (glibc 2.17+), x86-64.

File details

Details for the file fraqtl_runtime-0.1.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.


File hashes

Hashes for fraqtl_runtime-0.1.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl:

  • SHA256: 212a0e1636e75bd5f417d243805a35e1f66fbfd68ea87a234f4ba426560fec52
  • MD5: d5acd02a3226f8589e74c979a3f66dc3
  • BLAKE2b-256: 5c3c2dd218eaf471fdcb2057cc96b723ef1cc73e9cd55853309a7729cfe064da
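
If you download the wheel by hand, a few lines of Python are enough to check it against the published SHA256 (the digest and file name come from the listing above; adjust the path to wherever your download landed):

import hashlib

expected = "212a0e1636e75bd5f417d243805a35e1f66fbfd68ea87a234f4ba426560fec52"
path = "fraqtl_runtime-0.1.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl"

with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, digest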

