HuggingFace Optimum backend for Grilly — Vulkan GPU inference on any GPU

Optimum Grilly

Alpha software. APIs may change. We welcome early adopters and feedback.

optimum-grilly bridges HuggingFace Transformers to Grilly's Vulkan compute backend. Export a supported model once, load it with from_pretrained, and run inference on AMD, NVIDIA, or Intel GPUs. No CUDA is required.

Features

  • Any GPU: AMD, NVIDIA, Intel — anything with Vulkan drivers
  • HuggingFace compatible: Same from_pretrained / generate API you already know
  • Zero PyTorch runtime: Export once, run forever without PyTorch installed
  • Automatic CPU fallback: Works without a GPU (slower, but functional)
  • Supported architectures: LLaMA, Mistral, BERT, GPT-2 (T5 planned)

Installation

# Core package (CPU fallback only)
pip install optimum-grilly

# With Vulkan GPU acceleration (quotes keep shells such as zsh from globbing the brackets)
pip install "optimum-grilly[gpu]"

# With export support (requires PyTorch)
pip install "optimum-grilly[export]"

# Everything
pip install "optimum-grilly[all]"

Requirements

  • Python >= 3.10
  • grilly >= 0.4.5 (for GPU acceleration)
  • Vulkan drivers installed on your system
  • For export: PyTorch >= 2.0
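
To see which of these requirements are met on a given machine, you can probe for the packages without importing them. This is a minimal sketch: check_environment is a hypothetical helper (not part of optimum-grilly), and the import names grilly and torch are assumed from the requirement names above.

```python
import sys
from importlib.util import find_spec


def check_environment(min_python=(3, 10), optional=("grilly", "torch")):
    """Report whether the Python version and optional packages are available.

    Hypothetical helper; module names are assumptions based on the
    requirement list, not something optimum-grilly exposes.
    """
    report = {"python_ok": sys.version_info >= min_python}
    for name in optional:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {ok}")
```

A missing grilly entry would suggest that inference runs on the CPU fallback, and a missing torch entry that export is unavailable.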

Quick Start

1. Export a HuggingFace model

Convert a HuggingFace model to .grilly format (safetensors + config):

from optimum.grilly import export_to_grilly

# Export a causal LM
export_to_grilly(
    "meta-llama/Llama-3.2-1B",
    output_dir="./llama-1b-grilly",
)

# Export a BERT model for feature extraction
export_to_grilly(
    "bert-base-uncased",
    output_dir="./bert-grilly",
    task="feature-extraction",
)

Or from the command line:

optimum-grilly-export --model meta-llama/Llama-3.2-1B --output ./llama-1b-grilly
optimum-grilly-export --model bert-base-uncased --output ./bert-grilly --task feature-extraction

2. Run inference

from optimum.grilly import GrillyModelForCausalLM
from transformers import AutoTokenizer

# Load model and tokenizer
model = GrillyModelForCausalLM.from_pretrained("./llama-1b-grilly")
tokenizer = AutoTokenizer.from_pretrained("./llama-1b-grilly")

# Generate text
input_ids = tokenizer("The meaning of life is", return_tensors="np")["input_ids"]
output_ids = model.generate(input_ids, max_new_tokens=50, temperature=0.8, top_k=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
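
To make the temperature and top_k arguments concrete, here is a NumPy sketch of temperature-scaled top-k sampling for a single step. It illustrates the general technique only; sample_top_k is a hypothetical function, and how generate() implements sampling internally is not shown in this README.

```python
import numpy as np


def sample_top_k(logits, temperature=0.8, top_k=40, rng=None):
    """Sample one token id from a 1-D logits vector (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0:
        # Treat a non-positive temperature as greedy decoding.
        return int(np.argmax(logits))
    k = max(1, min(top_k, logits.shape[0]))
    # Indices of the k largest logits (unordered within the top k).
    top_idx = np.argpartition(logits, -k)[-k:]
    scaled = logits[top_idx] / temperature
    # Numerically stable softmax over the surviving candidates.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))


logits = np.array([0.1, 5.0, 0.2, 4.0])
print(sample_top_k(logits, temperature=0.8, top_k=2))  # always 1 or 3
```

Lower temperatures concentrate probability on the largest logits, and a smaller top_k discards more of the tail before sampling.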

3. Feature extraction (embeddings)

from optimum.grilly import GrillyModelForFeatureExtraction
from optimum.grilly.pipelines import grilly_feature_extraction_pipeline
from transformers import AutoTokenizer

model = GrillyModelForFeatureExtraction.from_pretrained("./bert-grilly")
tokenizer = AutoTokenizer.from_pretrained("./bert-grilly")

# Get sentence embeddings
embedding = grilly_feature_extraction_pipeline(
    model, tokenizer, "Hello world", pooling="mean"
)
print(embedding.shape)  # (1, 768)
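
A common next step is comparing two embeddings. The sketch below computes cosine similarity with NumPy; cosine_similarity is a hypothetical helper, and the random arrays stand in for pipeline outputs of shape (1, 768).

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two embeddings, flattened to 1-D."""
    a = np.asarray(a, dtype=np.float32).reshape(-1)
    b = np.asarray(b, dtype=np.float32).reshape(-1)
    # Guard against division by zero for all-zero vectors.
    denom = max(float(np.linalg.norm(a) * np.linalg.norm(b)), eps)
    return float(np.dot(a, b) / denom)


# Stand-ins for two pipeline outputs of shape (1, 768).
rng = np.random.default_rng(0)
emb_a = rng.standard_normal((1, 768)).astype(np.float32)
emb_b = rng.standard_normal((1, 768)).astype(np.float32)
print(cosine_similarity(emb_a, emb_b))
```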

API Reference

Configuration

from optimum.grilly import GrillyConfig

# From a HuggingFace config dict
config = GrillyConfig.from_hf_config(hf_config_dict)

# Save / load
config.save("./model-dir")
config = GrillyConfig.load("./model-dir")

# Inspect
print(config)  # GrillyConfig(model_type='llama', hidden_size=4096, ...)
print(config.get_layer_map())  # Layer descriptors for weight loading

Models

Class                                   Description
GrillyModel                             Base class: embed + transformer blocks + final norm
GrillyModelForCausalLM                  + LM head + generate() for text generation
GrillyModelForFeatureExtraction         Returns last_hidden_state for embeddings
GrillyModelForSequenceClassification    + classifier head for classification tasks

All models support:

  • from_pretrained(path) — Load from a .grilly directory
  • save_pretrained(path) — Save config + weights
  • forward(input_ids, attention_mask=None) — Run inference

Export

from optimum.grilly import export_to_grilly

export_to_grilly(
    model_name_or_path="meta-llama/Llama-3.2-1B",
    output_dir="./output",
    task="causal-lm",         # "causal-lm", "feature-extraction",
                               # "sequence-classification", "auto"
    dtype="float32",
    include_tokenizer=True,
)

Pipelines

from optimum.grilly.pipelines import (
    grilly_text_generation_pipeline,
    grilly_feature_extraction_pipeline,
)

# Text generation
text = grilly_text_generation_pipeline(model, tokenizer, "Once upon a time")

# Feature extraction with pooling
embedding = grilly_feature_extraction_pipeline(
    model, tokenizer, "Hello", pooling="mean"  # "mean", "cls", "last"
)
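
The three pooling modes reduce per-token hidden states of shape (batch, seq, hidden) to one vector per input. The NumPy sketch below shows one plausible reading of each mode. It is an assumption, not the library's code: the pool function is hypothetical, and whether optimum-grilly applies the attention mask or assumes right padding in the same way is not stated here.

```python
import numpy as np


def pool(last_hidden_state, attention_mask, pooling="mean"):
    """Reduce (batch, seq, hidden) token states to (batch, hidden)."""
    hidden = np.asarray(last_hidden_state, dtype=np.float32)
    mask = np.asarray(attention_mask, dtype=np.float32)
    if pooling == "cls":
        # First token of every sequence.
        return hidden[:, 0]
    if pooling == "last":
        # Last non-padding token, assuming right padding.
        last = np.maximum(mask.sum(axis=1).astype(np.int64) - 1, 0)
        return hidden[np.arange(hidden.shape[0]), last]
    if pooling == "mean":
        # Average over real tokens only; padding positions are masked out.
        summed = (hidden * mask[:, :, None]).sum(axis=1)
        counts = np.clip(mask.sum(axis=1, keepdims=True), 1.0, None)
        return summed / counts
    raise ValueError(f"unknown pooling mode: {pooling!r}")


hidden = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
mask = np.array([[1, 1, 0], [1, 1, 1]])
print(pool(hidden, mask, "mean"))  # [[1. 2.] [8. 9.]]
```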

Architecture

optimum-grilly
├── optimum/grilly/
│   ├── __init__.py          # Lazy imports
│   ├── configuration.py     # GrillyConfig (HF config mapping)
│   ├── modeling.py           # GrillyModel + task subclasses
│   ├── export.py             # HF PyTorch → .grilly converter
│   ├── pipelines.py          # Pipeline helpers
│   ├── utils.py              # safetensors I/O
│   └── version.py
├── tests/
│   ├── test_configuration.py
│   ├── test_modeling.py
│   ├── test_export.py
│   ├── test_pipelines.py
│   └── test_utils.py
└── pyproject.toml

How it works

  1. Export (export.py): Downloads a HuggingFace PyTorch model, extracts all named_parameters() and named_buffers() as float32 numpy arrays, and saves them as safetensors alongside a grilly_config.json that maps the HF architecture to grilly ops.

  2. Load (modeling.py): Reads the safetensors weights and config, builds a graph of _TransformerBlock objects that hold numpy weight arrays. Each block dispatches linear/norm/attention/FFN operations to grilly_core (the C++ Vulkan extension) with automatic CPU numpy fallbacks.

  3. Inference: All computation happens in float32. The Vulkan backend handles GPU upload/download transparently. When grilly_core is not available, all ops fall back to numpy — slower but correct.
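
The dispatch-with-fallback pattern described in steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions: run_op and linear_numpy are hypothetical names, and the real grilly_core API is not documented here, so the backend is passed in as a parameter and set to None to exercise the NumPy path.

```python
import numpy as np


def linear_numpy(x, weight, bias=None):
    """CPU reference: y = x @ W^T + b, with W stored as (out_features, in_features)."""
    y = x @ weight.T
    return y if bias is None else y + bias


def run_op(backend, name, fallback, *args):
    """Call backend.<name>(*args) when it exists, otherwise the NumPy fallback.

    `backend` stands in for a GPU extension module (hypothetical interface);
    pass None when no such module is installed.
    """
    gpu_op = getattr(backend, name, None) if backend is not None else None
    if gpu_op is None:
        return fallback(*args)
    return gpu_op(*args)


# No GPU backend is supplied, so the NumPy path runs.
x = np.ones((2, 4), dtype=np.float32)
w = np.full((3, 4), 0.5, dtype=np.float32)
b = np.zeros(3, dtype=np.float32)
y = run_op(None, "linear", linear_numpy, x, w, b)
print(y.shape)  # (2, 3)
```

Keeping a NumPy reference for every op also gives a baseline to compare GPU results against in tests.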

Supported architectures

Architecture                 Status      Notes
LLaMA / LLaMA 2 / LLaMA 3    Supported   Pre-norm, SwiGLU, RoPE, GQA
Mistral                      Supported   Same as LLaMA (sliding window not yet implemented)
BERT                         Supported   Post-norm, standard FFN
GPT-2                        Supported   Pre-norm, fused QKV, Conv1D weight handling
T5                           Planned     Encoder-decoder not yet implemented
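
As background for the GPT-2 row: HuggingFace's GPT-2 stores projection weights in a Conv1D layout of shape (in_features, out_features), the transpose of the usual Linear layout, and its c_attn layer emits query, key, and value fused along the last axis. The NumPy sketch below shows what handling these two details typically involves. The helper names are hypothetical, and this is not taken from optimum-grilly's export code.

```python
import numpy as np


def conv1d_to_linear(weight):
    """Transpose a GPT-2 Conv1D weight (in, out) into Linear layout (out, in)."""
    return np.ascontiguousarray(weight.T)


def split_fused_qkv(qkv):
    """Split a fused (..., 3 * hidden) projection into q, k, v."""
    q, k, v = np.split(qkv, 3, axis=-1)
    return q, k, v


hidden = 4
rng = np.random.default_rng(0)
x = rng.standard_normal((2, hidden)).astype(np.float32)
# Fused attention weight in Conv1D layout: (hidden, 3 * hidden).
c_attn = rng.standard_normal((hidden, 3 * hidden)).astype(np.float32)

qkv_conv1d = x @ c_attn                      # how Conv1D applies its weight
qkv_linear = x @ conv1d_to_linear(c_attn).T  # same result in Linear layout
q, k, v = split_fused_qkv(qkv_linear)
print(q.shape, k.shape, v.shape)  # (2, 4) (2, 4) (2, 4)
```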

Environment Variables

Variable            Description
VK_GPU_INDEX        Select GPU by index (default: 0)
GRILLY_DEBUG        Set to 1 for debug logging
ALLOW_CPU_VULKAN    Set to 1 to allow llvmpipe CPU fallback
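
The sketch below shows how these variables could be interpreted from a process environment. read_grilly_env is a hypothetical helper for illustration; the library's actual parsing, and the point at which it reads the variables, are assumptions. Setting them before the backend is first imported is the safer choice.

```python
import os


def read_grilly_env(environ=None):
    """Interpret the documented variables from a mapping (defaults to os.environ)."""
    environ = os.environ if environ is None else environ
    return {
        # GPU index defaults to 0, as documented above.
        "gpu_index": int(environ.get("VK_GPU_INDEX", "0")),
        # Both flags are enabled only by the literal value "1".
        "debug": environ.get("GRILLY_DEBUG", "0") == "1",
        "allow_cpu_vulkan": environ.get("ALLOW_CPU_VULKAN", "0") == "1",
    }


# Example: select the second GPU and turn on debug logging.
settings = read_grilly_env({"VK_GPU_INDEX": "1", "GRILLY_DEBUG": "1"})
print(settings)
# {'gpu_index': 1, 'debug': True, 'allow_cpu_vulkan': False}
```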

Known Limitations

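  • No KV-cache: generate() re-runs the forward pass over the whole sequence for every new token, so the number of token positions processed grows quadratically with sequence length (O(n²)). KV-cache support is planned.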
  • Float32 only: No fp16/bf16/int8 quantization yet.
  • No beam search: Only greedy and top-k sampling.
  • No streaming: generate() returns the full sequence.
  • T5 not supported: Encoder-decoder architectures are not yet implemented.
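
To give a feel for the cost of generating without a KV-cache, this small sketch counts how many token positions pass through the model during generation. positions_processed is a hypothetical helper that models only the position count, not attention cost or actual runtime.

```python
def positions_processed(prompt_len, new_tokens, kv_cache=False):
    """Count token positions pushed through the model while generating."""
    if new_tokens <= 0:
        return 0
    if kv_cache:
        # The prompt is processed once, then one position per further step.
        return prompt_len + (new_tokens - 1)
    # Step t (0-based) re-runs the whole sequence of prompt_len + t tokens.
    return sum(prompt_len + t for t in range(new_tokens))


print(positions_processed(5, 50))                 # 1475
print(positions_processed(5, 50, kv_cache=True))  # 54
```

For a 5-token prompt and the max_new_tokens=50 used in the Quick Start, that is 1475 positions without a cache versus 54 with one, which is why long generations slow down noticeably.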

Development

git clone https://github.com/grillcheese-ai/optimum-grilly.git
cd optimum-grilly
pip install -e ".[dev]"
pytest tests/ -v

License

Apache 2.0 — see LICENSE for details.
