Geometric LLM compression: factor+int4 with verifiable quality certificates

HyperRetro

HyperTensor, retrofitted into the PyTorch / HuggingFace / vLLM ecosystem.

HyperTensor proper is a standalone runtime. HyperRetro is the integrated sibling project: it takes the same geometric primitives (UGT shared basis, GRC / sink-aware projection, geodesic speculative draft, fused dual-Q8 GEMV) and exposes them as drop-in pieces of the standard inference stack.

hyperretro/
├── kernels/        # PyTorch C++ extension (gemv_dual_q8_0, ...)
├── hf/             # offline HuggingFace compression -> .safetensors
├── vllm/           # speculative-decoding draft adapter
└── bench/          # 3-way benchmark harness (baseline | retro | HyperTensor)

Three retrofits

1. Fused kernels as a PyTorch extension

The CUDA kernel kernel_gemv_dual_q8_0 from runtime/nn/cuda_kernels.cu is wrapped as an extension built just-in-time with torch.utils.cpp_extension, so it can be called from regular PyTorch:

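import torch
import hyperretro

# Shapes are illustrative; a (rows, in_dim) weight layout is assumed here.
x = torch.randn(4096)
Wa = torch.randn(4096, 4096)
Wb = torch.randn(4096, 4096)

# Wa, Wb may be float matrices or pre-quantized (scale, codes) tuples
out_a, out_b = hyperretro.gemv_dual_q8_0(x, Wa, Wb)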
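
To make the operation concrete, here is a minimal NumPy sketch of what a dual Q8_0 GEMV computes. It assumes Q8_0 follows the common GGML convention (blocks of 32 int8 codes sharing one scale) and a (rows, in_dim) weight layout; neither is confirmed by this README. The helper names quantize_q8_0, gemv_q8_0_ref and gemv_dual_q8_0_ref are hypothetical and not part of the hyperretro API. The real kernel presumably fuses both products into a single pass over x, whereas this reference simply runs them one after the other.

```python
import numpy as np

BLOCK = 32  # assumed Q8_0 block size (GGML convention, not confirmed here)

def quantize_q8_0(W):
    """Quantize a (rows, in_dim) float matrix into per-block (scale, codes)."""
    rows, in_dim = W.shape
    assert in_dim % BLOCK == 0, "in_dim must be a multiple of the block size"
    blocks = W.reshape(rows, in_dim // BLOCK, BLOCK).astype(np.float32)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    # One scale per block; all-zero blocks get scale 1 to avoid dividing by 0.
    scale = np.where(amax > 0, amax / 127.0, 1.0).astype(np.float32)
    codes = np.clip(np.rint(blocks / scale), -127, 127).astype(np.int8)
    return scale, codes

def gemv_q8_0_ref(x, scale, codes):
    """Compute y = dequant(scale, codes) @ x block by block."""
    xb = x.reshape(1, -1, BLOCK).astype(np.float32)
    # Per-block dot product on the codes, then rescale and sum over blocks.
    partial = (codes.astype(np.float32) * xb).sum(axis=-1, keepdims=True)
    return (partial * scale).sum(axis=(1, 2))

def gemv_dual_q8_0_ref(x, Wa_q, Wb_q):
    """Two GEMVs sharing one input vector (unfused reference)."""
    return gemv_q8_0_ref(x, *Wa_q), gemv_q8_0_ref(x, *Wb_q)

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)
Wa = rng.standard_normal((8, 64)).astype(np.float32)
Wb = rng.standard_normal((8, 64)).astype(np.float32)
out_a, out_b = gemv_dual_q8_0_ref(x, quantize_q8_0(Wa), quantize_q8_0(Wb))
```

Comparing out_a against Wa @ x gives a quick sanity check on the quantization error before trusting a faster backend.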

Backends are tried in order: cext (JIT-compiled C++/CUDA extension) → torch (pure-PyTorch reference) → numpy (always available). Set HYPERRETRO_FORCE_FALLBACK=1 to force the fallback.
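
The resolution order above can be sketched as a simple "first importable module wins" loop. This is only an illustration of the pattern, not hyperretro's actual loader: the function name resolve_backend and the module name hyperretro_cext are hypothetical, and the sketch assumes that forcing the fallback means skipping straight to the last candidate.

```python
import importlib
import os

def resolve_backend(candidates=("hyperretro_cext", "torch", "numpy"), env=None):
    """Return the name of the first backend module that imports cleanly."""
    env = os.environ if env is None else env
    if env.get("HYPERRETRO_FORCE_FALLBACK") == "1":
        # Assumption: "force the fallback" means use only the last candidate.
        candidates = candidates[-1:]
    for name in candidates:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # backend unavailable, try the next one
        return name
    raise RuntimeError("no usable backend found")

# Demonstration with stand-in module names so it runs anywhere:
# the first candidate does not exist, so the second one is selected.
chosen = resolve_backend(("no_such_backend_module", "json"), env={})
```

Passing env explicitly keeps the function easy to test without mutating the process environment.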

2. Offline HuggingFace compression

A single CLI takes a vanilla HF model, runs the GRC projection / sink-aware GRC pipeline (Paper E), and writes the result back out as standard .safetensors shards that load with stock AutoModelForCausalLM.from_pretrained:

pip install -e "hyperretro[hf]"
hyperretro-compress \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --out ./qwen-grc-1024/ \
    --rank 1024 \
    --sink 4

The output directory is fully HuggingFace-native: no HyperTensor runtime is needed at inference time. A hyperretro_report.json is written alongside it, recording the per-layer relative Frobenius error.
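
For reference, the relative Frobenius error of an approximation W_hat to a weight matrix W is ||W - W_hat||_F / ||W||_F. The NumPy sketch below computes it for a rank-r approximation. Note that it uses a plain truncated SVD as a stand-in, since the GRC projection itself is not described in this README, and the helper names are hypothetical.

```python
import numpy as np

def frobenius_rel_err(W, W_hat):
    """Relative Frobenius error ||W - W_hat||_F / ||W||_F."""
    return float(np.linalg.norm(W - W_hat) / np.linalg.norm(W))

def truncated_svd(W, rank):
    """Best rank-`rank` approximation of W (stand-in for the GRC projection)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))

# Error shrinks as the retained rank grows and vanishes at full rank.
errors = {r: frobenius_rel_err(W, truncated_svd(W, r)) for r in (8, 16, 32, 48)}
```

A per-layer table of such values is one plausible reading of what the report records; the actual JSON schema is not documented here.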

3. Geodesic speculative draft for vLLM

hyperretro.vllm.GeodesicDraft replaces the random / smaller-model draft proposer in vLLM-style speculative decoding with the geodesic-step draft from Paper C. The adapter is framework-agnostic (propose(h_curr, h_prev) -> (token_ids, confidences)) and includes a register_with_vllm() hook for live deployments.
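
To illustrate the adapter interface only, here is a toy proposer with the same propose(h_curr, h_prev) -> (token_ids, confidences) shape. It is not the Paper C algorithm: it takes a straight-line step along h_curr - h_prev, rescales to the norm of h_curr, and scores tokens through an unembedding matrix. The class name ToyExtrapolationDraft, the unembed argument and the top_k parameter are all hypothetical.

```python
import numpy as np

class ToyExtrapolationDraft:
    """Toy draft proposer; a stand-in, not the geodesic draft from Paper C."""

    def __init__(self, unembed, top_k=4):
        self.unembed = np.asarray(unembed, dtype=np.float64)  # (vocab, d_model)
        self.top_k = top_k

    def propose(self, h_curr, h_prev):
        h_curr = np.asarray(h_curr, dtype=np.float64)
        h_prev = np.asarray(h_prev, dtype=np.float64)
        # Step along the most recent direction of travel in hidden space.
        h_next = h_curr + (h_curr - h_prev)
        # Rescale so the extrapolated state keeps the norm of h_curr.
        norm = np.linalg.norm(h_next)
        if norm > 0:
            h_next = h_next * (np.linalg.norm(h_curr) / norm)
        # Score every token and convert to probabilities (stable softmax).
        logits = self.unembed @ h_next
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Return the top_k most likely tokens, highest confidence first.
        token_ids = np.argsort(-probs)[: self.top_k]
        return token_ids, probs[token_ids]

rng = np.random.default_rng(0)
draft = ToyExtrapolationDraft(rng.standard_normal((2048, 512)), top_k=4)
token_ids, confidences = draft.propose(rng.standard_normal(512),
                                       rng.standard_normal(512))
```

Because the interface only sees two hidden states, a proposer of this shape can be exercised outside any serving framework, which is what makes the adapter easy to unit-test.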

Benchmarks

hyperretro-bench kernel  --rows 4096 --in-dim 4096
hyperretro-bench spec    --d-model 512 --k 64 --vocab 2048 --steps 64
hyperretro-bench compress --model Qwen/Qwen2.5-0.5B --out /tmp/qwen-retro \
                         --rank 256 --eval-text "The quick brown fox..."

Each subcommand emits a JSON report comparing the standard baseline, HyperRetro, and (where applicable) standalone HyperTensor.

License

MIT for code, CC-BY-4.0 for the accompanying documentation and papers, the same as the parent HyperTensor project.

Built Distribution

hyperretro-0.3.3-py3-none-any.whl (5.0 kB)

Uploaded Python 3

File details

Details for the file hyperretro-0.3.3-py3-none-any.whl.

File metadata

  • Download URL: hyperretro-0.3.3-py3-none-any.whl
  • Size: 5.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for hyperretro-0.3.3-py3-none-any.whl

Algorithm     Hash digest
SHA256        3848cc4521b6a8de1088848f44ad4de22b544c96b41fbd833f1273c22c463e4d
MD5           9cb494f8c28608caf93988ffcfaa0e4c
BLAKE2b-256   891aee54968b4ea15f4f82cf05b944a14b51a649a8f0768d2b0a9aede6e2a642
