Project description

GrillyInference

Native fp16 inference engine for Llama-family models — optional grilly extension.

Features

  • Native fp16 inference — runs Llama 3.2 3B at ~6.4 GB VRAM, zero quality loss
  • Paged KV-Cache — 256-token SRAM pages with LRU eviction, 4x context extension (see the sketch after this list)
  • H2O Eviction — exponential decay on old KV heads, 32k context on 12GB
  • VSA Multi-Scale Summaries — hypervector bind/bundle for 128k effective context
  • SmoothQuant INT8 — per-group-64 weight quantization, <1% PPL loss
  • 4-bit Block Quantization — run 100B models on 12GB VRAM with layer offloading
  • Llama 3.2 Instruct — chat template, streaming generation, top-k/top-p sampling
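
The paged KV-cache keeps keys and values in fixed 256-token pages and drops the least-recently-used page once the page budget is spent. Below is a minimal numpy sketch of that idea; the PagedKV class and its names are illustrative assumptions, not GrillyInference's internal API.

from collections import OrderedDict
import numpy as np

PAGE_TOKENS = 256  # tokens per page, per the feature list above

class PagedKV:
    """Illustrative paged KV store with LRU eviction (hypothetical, not the library's class)."""
    def __init__(self, max_pages, n_heads, head_dim):
        self.max_pages = max_pages
        self.page_shape = (PAGE_TOKENS, n_heads, head_dim)
        self.pages = OrderedDict()  # page_id -> (keys, values), ordered by recency

    def append(self, pos, k, v):
        page_id = pos // PAGE_TOKENS
        if page_id not in self.pages:
            self.pages[page_id] = (np.zeros(self.page_shape, np.float16),
                                   np.zeros(self.page_shape, np.float16))
        keys, values = self.pages[page_id]
        keys[pos % PAGE_TOKENS] = k
        values[pos % PAGE_TOKENS] = v
        self.pages.move_to_end(page_id)          # mark page as most recently used
        while len(self.pages) > self.max_pages:  # over budget: evict the LRU page
            self.pages.popitem(last=False)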

Quick Start

pip install grillyinference

from grillyinference import LlamaForCausalLM, TextGenerator
from transformers import AutoTokenizer

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
gen = TextGenerator(model, tokenizer)

# Simple generation
response = gen.generate("What is the meaning of life?", max_tokens=256)
print(response)

# Chat
response = gen.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain transformers in 3 sentences."},
])
print(response)

# Streaming
for token in gen.generate("Once upon a time", stream=True):
    print(token, end="", flush=True)

Context Extension (12GB VRAM)

Context   Decode Speed   PPL Hit   Technique
2k        9 t/s          0%        Baseline
8k        8 t/s          0%        PagedAttention
32k       7 t/s          1.5%      + H2O eviction
128k      6 t/s          3%        + VSA summaries

from grillyinference import KVCache, LlamaConfig

config = LlamaConfig.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
kv = KVCache(config, raw_window=2048, h2o_lambda=0.0002, enable_vsa=True)
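
Outside the raw 2048-token window, H2O-style eviction keeps only the cache entries whose attention mass is still significant, with h2o_lambda controlling how quickly old entries decay. The numpy sketch below shows that scoring rule in rough form; the function and its arguments are illustrative assumptions, not the KVCache internals.

import numpy as np

def h2o_keep_mask(attn_mass, ages, keep, h2o_lambda=0.0002):
    # attn_mass: cumulative attention each cached token has received so far
    # ages: current position minus each token's position
    # keep: number of entries to retain outside the raw window (hypothetical knob)
    scores = attn_mass * np.exp(-h2o_lambda * ages)   # decay old, low-attention entries
    keep_idx = np.argsort(scores)[-keep:]             # heavy hitters survive
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep_idx] = True
    return mask

# Example: a 32k-token cache reduced to its 2048 highest-scoring entries.
mass = np.random.rand(32768)
ages = np.arange(32768)[::-1]
mask = h2o_keep_mask(mass, ages, keep=2048)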

SmoothQuant INT8

from grillyinference.inference.quantize import SmoothQuantCalibrator, SmoothQuantizer

calibrator = SmoothQuantCalibrator(model, tokenizer)
stats = calibrator.calibrate()
quantizer = SmoothQuantizer(group_size=64)
quantized = quantizer.smooth_and_quantize(model._weights, stats)
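
For reference, per-group-64 quantization gives each run of 64 weights its own scale, and the SmoothQuant step first rescales channels so activation outliers migrate into the weights (the paper's s_j = max|X_j|^alpha / max|W_j|^(1-alpha)). The numpy sketch below illustrates both ideas; the helper names are assumptions for illustration, not the quantizer's actual internals.

import numpy as np

def quantize_per_group_int8(w, group_size=64):
    # Symmetric INT8 with one scale per group of 64 weights (assumes the
    # weight count is divisible by group_size; padding is omitted here).
    flat = w.reshape(-1, group_size).astype(np.float32)
    scales = np.maximum(np.abs(flat).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(flat / scales), -127, 127).astype(np.int8)
    return q.reshape(w.shape), scales

def smoothquant_scales(act_absmax, weight_absmax, alpha=0.5):
    # Per-channel smoothing factors: divide activations by s, multiply weights
    # by s, so outliers move into the easier-to-quantize weight matrix.
    return act_absmax ** alpha / np.maximum(weight_absmax, 1e-8) ** (1 - alpha)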

Requirements

  • Python 3.12+
  • grilly >= 0.4.0
  • numpy, safetensors
  • Optional: huggingface_hub, transformers (for from_pretrained)

License

MIT
