Geometric LLM compression: factor+int4 with verifiable quality certificates

HyperRetro

HyperTensor, retrofitted into the PyTorch / HuggingFace / vLLM ecosystem.

HyperTensor proper is a standalone runtime. HyperRetro is the integrated sibling project: it takes the same geometric primitives (UGT shared basis, GRC / sink-aware projection, geodesic speculative draft, fused dual-Q8 GEMV) and exposes them as drop-in pieces of the standard inference stack.

hyperretro/
├── kernels/        # PyTorch C++ extension (gemv_dual_q8_0, ...)
├── hf/             # offline HuggingFace compression -> .safetensors
├── vllm/           # speculative-decoding draft adapter
└── bench/          # 3-way benchmark harness (baseline | retro | HyperTensor)

Three retrofits

1. Fused kernels as a PyTorch extension

The CUDA kernel kernel_gemv_dual_q8_0 from runtime/nn/cuda_kernels.cu is wrapped as a JIT-built extension (via torch.utils.cpp_extension), so it can be called from regular PyTorch:

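import torch
import hyperretro

x = torch.randn(4096)  # input vector shared by both products
# Wa, Wb may be float matrices or pre-quantized (scale, codes) tuples.
# The 4096 x 4096 float matrices below are illustrative shapes only.
Wa = torch.randn(4096, 4096)
Wb = torch.randn(4096, 4096)
out_a, out_b = hyperretro.gemv_dual_q8_0(x, Wa, Wb)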
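
For intuition about what such a call computes, here is a minimal NumPy sketch, not the shipped kernel. It assumes the ggml-style Q8_0 layout (blocks of 32 values, one scale per block, int8 codes) and that "dual" means two matrix-vector products sharing the same input x. The helper names quantize_q8_0, dequantize_q8_0 and gemv_dual_q8_0_reference are hypothetical and exist only for this example.

```python
import numpy as np

BLOCK = 32  # assumed Q8_0 block size (ggml convention), not confirmed by HyperRetro


def quantize_q8_0(W):
    """Quantize a float matrix to (scale, codes) with one scale per block of 32."""
    rows, cols = W.shape
    assert cols % BLOCK == 0, "column count must be a multiple of the block size"
    blocks = W.reshape(rows, cols // BLOCK, BLOCK).astype(np.float32)
    scale = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    safe = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero on all-zero blocks
    codes = np.clip(np.rint(blocks / safe), -127, 127).astype(np.int8)
    return scale.astype(np.float32), codes


def dequantize_q8_0(scale, codes):
    """Rebuild a float matrix from per-block scales and int8 codes."""
    rows = codes.shape[0]
    return (scale * codes.astype(np.float32)).reshape(rows, -1)


def gemv_dual_q8_0_reference(x, qa, qb):
    """Unfused reference: two matrix-vector products against the same x."""
    x = np.asarray(x, dtype=np.float32)
    return dequantize_q8_0(*qa) @ x, dequantize_q8_0(*qb) @ x


rng = np.random.default_rng(0)
x = rng.standard_normal(128).astype(np.float32)
Wa = rng.standard_normal((64, 128)).astype(np.float32)
Wb = rng.standard_normal((64, 128)).astype(np.float32)
out_a, out_b = gemv_dual_q8_0_reference(x, quantize_q8_0(Wa), quantize_q8_0(Wb))
```

A fused kernel would read x once and accumulate both outputs in a single pass; this reference only pins down the arithmetic that such a kernel should reproduce up to quantization error.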

Backends are tried in order: cext (the JIT-compiled extension) → torch (pure-PyTorch reference) → numpy (always available). Set HYPERRETRO_FORCE_FALLBACK=1 to force the fallback path.
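
The resolution order can be pictured with a small sketch. This is an illustration, not HyperRetro's actual loader: resolve_backend and its available argument are hypothetical, and it assumes that HYPERRETRO_FORCE_FALLBACK=1 skips only the compiled extension and then continues down the same chain.

```python
import os

BACKEND_ORDER = ("cext", "torch", "numpy")


def resolve_backend(available, env=None):
    """Return the first usable backend name, honoring the force-fallback flag."""
    env = os.environ if env is None else env
    forced = env.get("HYPERRETRO_FORCE_FALLBACK") == "1"
    for name in BACKEND_ORDER:
        if forced and name == "cext":
            continue  # assumption: forcing skips only the compiled extension
        if name in available:
            return name
    raise RuntimeError("no usable backend found")


everything = {"cext", "torch", "numpy"}
print(resolve_backend(everything, env={}))                                  # cext
print(resolve_backend(everything, env={"HYPERRETRO_FORCE_FALLBACK": "1"}))  # torch
print(resolve_backend({"numpy"}, env={}))                                   # numpy
```

Passing the environment and the set of available backends as arguments keeps the logic easy to test without compiling anything.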

2. Offline HuggingFace compression

A single CLI takes a vanilla HF model, runs the GRC projection / sink-aware GRC pipeline (Paper E), and writes the result back out as standard .safetensors shards that load with stock AutoModelForCausalLM.from_pretrained:

pip install -e "hyperretro[hf]"
hyperretro-compress \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --out ./qwen-grc-1024/ \
    --rank 1024 \
    --sink 4

The output directory is fully HuggingFace-native: no HyperTensor runtime is needed at inference time. A hyperretro_report.json is written alongside it, recording the per-layer relative Frobenius error.
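
To make the reported metric concrete, the sketch below computes a relative Frobenius error, ||W - W_hat||_F / ||W||_F, for a low-rank approximation of a random matrix. Truncated SVD is used here purely as a stand-in; it is not the GRC or sink-aware projection from Paper E, and the dictionary keys are invented for the example rather than taken from hyperretro_report.json.

```python
import numpy as np


def frobenius_rel_err(W, W_hat):
    """Relative Frobenius error between a matrix and its approximation."""
    return float(np.linalg.norm(W - W_hat) / np.linalg.norm(W))


def truncated_svd_approx(W, rank):
    """Best rank-`rank` approximation of W (stand-in for the real projection)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))  # toy "layer weight"

# Hypothetical report layout: one error value per tested rank.
report = {
    f"rank_{r}": frobenius_rel_err(W, truncated_svd_approx(W, r))
    for r in (8, 24, 48)
}
for name, err in report.items():
    print(f"{name}: {err:.4f}")
```

The error shrinks as the rank grows and vanishes (up to rounding) at full rank, which is the behavior to expect when comparing --rank settings.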

3. Geodesic speculative draft for vLLM

hyperretro.vllm.GeodesicDraft replaces the random / smaller-model draft proposer in vLLM-style speculative decoding with the geodesic-step draft from Paper C. The adapter is framework-agnostic (propose(h_curr, h_prev) -> (token_ids, confidences)) and includes a register_with_vllm() hook for live deployments.
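
The propose(h_curr, h_prev) -> (token_ids, confidences) contract can be illustrated with a toy proposer. The class below is a sketch under stated assumptions: it uses plain linear extrapolation of the hidden state, not the geodesic step from Paper C, and LinearExtrapolationDraft, its unembed matrix and its k parameter are hypothetical names that are not part of hyperretro.

```python
import numpy as np


class LinearExtrapolationDraft:
    """Toy draft proposer exposing the propose(h_curr, h_prev) interface."""

    def __init__(self, unembed, k=4):
        self.unembed = np.asarray(unembed, dtype=np.float32)  # (vocab, d_model)
        self.k = k  # number of candidate tokens to return

    def propose(self, h_curr, h_prev):
        h_curr = np.asarray(h_curr, dtype=np.float32)
        h_prev = np.asarray(h_prev, dtype=np.float32)
        # Stand-in for the geodesic step: extrapolate one step along h_curr - h_prev.
        h_next = 2.0 * h_curr - h_prev
        logits = self.unembed @ h_next
        logits = logits - logits.max()  # numerically stable softmax
        probs = np.exp(logits)
        probs /= probs.sum()
        token_ids = np.argsort(probs)[::-1][: self.k]  # highest probability first
        return token_ids, probs[token_ids]


rng = np.random.default_rng(0)
draft = LinearExtrapolationDraft(rng.standard_normal((50, 16)), k=4)
token_ids, confidences = draft.propose(rng.standard_normal(16), rng.standard_normal(16))
```

In speculative decoding, the target model would then verify these candidates; the draft only needs to be cheap and right often enough to pay for itself.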

Benchmarks

hyperretro-bench kernel  --rows 4096 --in-dim 4096
hyperretro-bench spec    --d-model 512 --k 64 --vocab 2048 --steps 64
hyperretro-bench compress --model Qwen/Qwen2.5-0.5B --out /tmp/qwen-retro \
                         --rank 256 --eval-text "The quick brown fox..."

Each subcommand emits a JSON report comparing the standard baseline, HyperRetro, and (where applicable) standalone HyperTensor.
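
As a rough picture of what a comparison report of this kind involves, the sketch below times two equivalent matrix-vector implementations and prints a JSON summary. It is not the hyperretro-bench implementation: the field names are invented for illustration and do not reflect the tool's actual schema, and both timed functions are plain NumPy.

```python
import json
import time

import numpy as np


def time_call(fn, repeats=20):
    """Return (mean seconds per call, last result) over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        out = fn()
    return (time.perf_counter() - start) / repeats, out


rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
x = rng.standard_normal(256)

base_s, base_out = time_call(lambda: W @ x)
cand_s, cand_out = time_call(lambda: np.einsum("ij,j->i", W, x))

# Hypothetical report layout: timings plus a correctness check.
report = {
    "shape": {"rows": 256, "in_dim": 256},
    "baseline": {"seconds_per_call": base_s},
    "candidate": {"seconds_per_call": cand_s},
    "max_abs_diff": float(np.max(np.abs(base_out - cand_out))),
}
print(json.dumps(report, indent=2))
```

Recording a numerical difference next to the timings guards against a fast path that is quick only because it computes something else.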

License

MIT for code, CC-BY-4.0 for the accompanying documentation and papers, the same as the parent HyperTensor project.
