fraQtl runtime — drop-in loader for fraQtl-compressed Hugging Face checkpoints. Production LLM inference with calibration-aware compression.
Project description
fraQtl
Runtime KV-cache and weight compression for production LLM inference.
Drop-in. No retraining. Calibration-aware.
What it is
fraqtl-runtime is the runtime loader for fraQtl-compressed model artifacts. It enables:
- Weight compression: load fraQtl-compressed Hugging Face checkpoints (e.g. fraQtl/Qwen3.6-35B-A3B-compressed) via standard transformers with trust_remote_code=True. The wheel ships the compiled loader that decodes the packed weights at load time.
- Runtime KV-cache compression (separate, in active validation): a llama.cpp-compatible runtime layer that compresses the V cache at runtime, independent of the weight format.
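fraQtl has not published the details of its V-cache scheme, so the snippet below is not its algorithm; it is only a minimal illustration of what "compresses the V cache at runtime" can mean in general, using symmetric per-channel int8 quantization as a stand-in. The function names and the scheme itself are assumptions for illustration.

```python
import torch

def quantize_v(v: torch.Tensor):
    """Illustrative only (not fraQtl's scheme): symmetric per-channel int8
    quantization of a V-cache tensor shaped (batch, heads, seq_len, head_dim)."""
    # One scale per (batch, head, channel), computed over the sequence dimension.
    scale = v.abs().amax(dim=2, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((v / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_v(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(scale.dtype) * scale

# Toy round trip: int8 storage is about half the size of a bf16 cache; error stays small.
v = torch.randn(1, 8, 128, 64)
q, scale = quantize_v(v)
print((dequantize_v(q, scale) - v).abs().mean())
```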
Install
```
pip install fraqtl-runtime
```
That's the entire setup. No license token required for loading published artifacts.
Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "fraQtl/Qwen3.6-35B-A3B-compressed"
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True,
    torch_dtype=torch.bfloat16, device_map="auto",
)
tok = AutoTokenizer.from_pretrained(repo)

ids = tok("The capital of France is", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
```
trust_remote_code=True pulls a small stub from the model repo that imports the compiled loader from this wheel. You never write import fraqtl directly.
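As background on the mechanism (standard Hugging Face behavior, not anything fraQtl-specific): with trust_remote_code=True, transformers resolves the model class from the auto_map entry in the repo's config.json and executes the referenced modeling file from that repo. Assuming the compressed repo follows this standard custom-code layout, you can inspect which remote code would run before opting in:

```python
import json
from huggingface_hub import hf_hub_download

repo = "fraQtl/Qwen3.6-35B-A3B-compressed"

# auto_map names the remote modeling file and class that trust_remote_code will import,
# e.g. {"AutoModelForCausalLM": "modeling_xxx.SomeForCausalLM"}.
cfg_path = hf_hub_download(repo_id=repo, filename="config.json")
with open(cfg_path) as f:
    print(json.load(f).get("auto_map"))
```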
High-level approach
fraQtl combines two ideas:
- Calibration-aware eigenbasis rotation — protect the input directions that matter for the deployment task; quantize the rest. The calibration corpus determines which directions are protected (this is FPT — fraQtl Pullback Theorem).
- Per-row sign correction primitive — additional precision on top of low-bit quantization where it matters most for reasoning.
Both compose with standard quantization machinery (Lloyd-Max centroids, INT3 packing) and standard inference engines (HF transformers, llama.cpp).
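Neither the rotation nor the FPT protection criterion is specified in this README, so the sketch below is only a generic illustration of the calibration-aware idea, not fraQtl's recipe: estimate the input covariance from calibration activations, rotate a weight matrix into that eigenbasis, keep the top (protected) directions in full precision, apply a crude 3-bit quantizer to the rest (standing in for Lloyd-Max centroids and INT3 packing), then rotate back. The keep threshold and the quantizer are illustrative assumptions.

```python
import torch

def calibration_aware_quantize(w: torch.Tensor, calib_x: torch.Tensor,
                               keep: int = 64, bits: int = 3) -> torch.Tensor:
    """Illustrative sketch (not fraQtl's algorithm): quantize a linear weight
    w (out_features, in_features) in the eigenbasis of calibration inputs
    calib_x (n_samples, in_features), protecting the top `keep` directions."""
    # Eigenbasis of the calibration input covariance; columns of `basis` are input
    # directions, reordered so the highest-energy directions come first.
    cov = (calib_x.T @ calib_x) / calib_x.shape[0]
    evals, basis = torch.linalg.eigh(cov)             # eigenvalues in ascending order
    basis = basis[:, evals.argsort(descending=True)]

    w_rot = w @ basis                                 # weight expressed in the eigenbasis

    # Protected directions stay in full precision; the rest get a crude symmetric
    # low-bit quantizer (a stand-in for Lloyd-Max centroids + INT3 packing).
    protected, rest = w_rot[:, :keep], w_rot[:, keep:]
    levels = 2 ** (bits - 1) - 1
    scale = rest.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / levels
    rest_q = torch.clamp((rest / scale).round(), -levels, levels) * scale

    # Rotate back to the original basis (basis is orthogonal, so its transpose inverts it).
    return torch.cat([protected, rest_q], dim=1) @ basis.T

# Toy check: relative output error on calibration-like inputs stays small because the
# directions those inputs actually excite are the protected ones.
torch.manual_seed(0)
w = torch.randn(256, 128)
calib_x = torch.randn(1024, 128) @ torch.randn(128, 128)   # correlated "task" inputs
w_hat = calibration_aware_quantize(w, calib_x)
print((calib_x @ (w - w_hat).T).norm() / (calib_x @ w.T).norm())
```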
Status
- Public weight-compression artifacts on Hugging Face: huggingface.co/fraQtl
- Runtime KV-cache compression layer: in active validation. Public benchmark numbers landing after H100 measurement lock and manual review.
- Methodology paper in preparation.
Links
- Site: fraqtl.ai
- Hugging Face: huggingface.co/fraQtl
- Diagnostic tool (open-source): fraqtl-diagnostic
- Contact: contact@fraqtl.ai
License
Proprietary. The compressed model weights and loader are free to install and use for research and evaluation. Production / commercial use: contact fraQtl.
Download files
File details
Details for the file fraqtl_runtime-0.1.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.
File metadata
- Download URL: fraqtl_runtime-0.1.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
- Upload date:
- Size: 857.4 kB
- Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 212a0e1636e75bd5f417d243805a35e1f66fbfd68ea87a234f4ba426560fec52 |
| MD5 | d5acd02a3226f8589e74c979a3f66dc3 |
| BLAKE2b-256 | 5c3c2dd218eaf471fdcb2057cc96b723ef1cc73e9cd55853309a7729cfe064da |