HuggingFace Optimum backend for Grilly — Vulkan GPU inference on any GPU
Optimum Grilly
Alpha software. APIs may change. We welcome early adopters and feedback.
optimum-grilly bridges HuggingFace Transformers to Grilly's Vulkan compute backend. Load any supported model with from_pretrained and run inference on AMD, NVIDIA, or Intel GPUs, with no CUDA required.
Features
- Any GPU: AMD, NVIDIA, Intel — anything with Vulkan drivers
- HuggingFace compatible: Same from_pretrained/generate API you already know
- Zero PyTorch runtime: Export once, run forever without PyTorch installed
- Automatic CPU fallback: Works without a GPU (slower, but functional)
- Supported architectures: LLaMA, Mistral, BERT, GPT-2 (T5 planned)
Installation
# Core package (CPU fallback only)
pip install optimum-grilly
# With Vulkan GPU acceleration (quotes keep shells such as zsh from expanding the brackets)
pip install "optimum-grilly[gpu]"
# With export support (requires PyTorch)
pip install "optimum-grilly[export]"
# Everything
pip install "optimum-grilly[all]"
Requirements
- Python >= 3.10
- grilly >= 0.4.5 (for GPU acceleration)
- Vulkan drivers installed on your system
- For export: PyTorch >= 2.0
Quick Start
1. Export a HuggingFace model
Convert a HuggingFace model to .grilly format (safetensors + config):
from optimum.grilly import export_to_grilly
# Export a causal LM
export_to_grilly(
"meta-llama/Llama-3.2-1B",
output_dir="./llama-1b-grilly",
)
# Export a BERT model for feature extraction
export_to_grilly(
"bert-base-uncased",
output_dir="./bert-grilly",
task="feature-extraction",
)
Or from the command line:
optimum-grilly-export --model meta-llama/Llama-3.2-1B --output ./llama-1b-grilly
optimum-grilly-export --model bert-base-uncased --output ./bert-grilly --task feature-extraction
2. Run inference
from optimum.grilly import GrillyModelForCausalLM
from transformers import AutoTokenizer
# Load model and tokenizer
model = GrillyModelForCausalLM.from_pretrained("./llama-1b-grilly")
tokenizer = AutoTokenizer.from_pretrained("./llama-1b-grilly")
# Generate text
input_ids = tokenizer("The meaning of life is", return_tensors="np")["input_ids"]
output_ids = model.generate(input_ids, max_new_tokens=50, temperature=0.8, top_k=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
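To show what the temperature and top_k arguments above control, here is a minimal NumPy sketch of temperature-scaled top-k sampling for a single decoding step. It illustrates the general technique only and is not taken from optimum-grilly's source; the helper name sample_top_k is hypothetical.

```python
import numpy as np


def sample_top_k(logits, temperature=0.8, top_k=40, rng=None):
    """Sample one token id from a 1-D logits vector (hypothetical helper)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    # Treat a non-positive temperature as greedy decoding.
    if temperature <= 0:
        return int(np.argmax(logits))

    # Keep only the k highest-scoring token ids.
    k = max(1, min(top_k, logits.shape[0]))
    top_ids = np.argpartition(logits, -k)[-k:]

    # Temperature-scaled softmax over the surviving candidates.
    scaled = logits[top_ids] / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()

    return int(rng.choice(top_ids, p=probs))
```

Lower temperatures concentrate probability on the best candidates, and top_k=1 reduces to greedy decoding.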
3. Feature extraction (embeddings)
from optimum.grilly import GrillyModelForFeatureExtraction
from optimum.grilly.pipelines import grilly_feature_extraction_pipeline
from transformers import AutoTokenizer
model = GrillyModelForFeatureExtraction.from_pretrained("./bert-grilly")
tokenizer = AutoTokenizer.from_pretrained("./bert-grilly")
# Get sentence embeddings
embedding = grilly_feature_extraction_pipeline(
model, tokenizer, "Hello world", pooling="mean"
)
print(embedding.shape) # (1, 768)
API Reference
Configuration
from optimum.grilly import GrillyConfig
# From a HuggingFace config dict
config = GrillyConfig.from_hf_config(hf_config_dict)
# Save / load
config.save("./model-dir")
config = GrillyConfig.load("./model-dir")
# Inspect
print(config) # GrillyConfig(model_type='llama', hidden_size=4096, ...)
print(config.get_layer_map()) # Layer descriptors for weight loading
Models
| Class | Description |
|---|---|
| GrillyModel | Base class: embed + transformer blocks + final norm |
| GrillyModelForCausalLM | + LM head + generate() for text generation |
| GrillyModelForFeatureExtraction | Returns last_hidden_state for embeddings |
| GrillyModelForSequenceClassification | + classifier head for classification tasks |
All models support:
- from_pretrained(path): Load from a .grilly directory
- save_pretrained(path): Save config + weights
- forward(input_ids, attention_mask=None): Run inference
Export
from optimum.grilly import export_to_grilly
export_to_grilly(
model_name_or_path="meta-llama/Llama-3.2-1B",
output_dir="./output",
task="causal-lm", # "causal-lm", "feature-extraction",
# "sequence-classification", "auto"
dtype="float32",
include_tokenizer=True,
)
Pipelines
from optimum.grilly.pipelines import (
grilly_text_generation_pipeline,
grilly_feature_extraction_pipeline,
)
# Text generation
text = grilly_text_generation_pipeline(model, tokenizer, "Once upon a time")
# Feature extraction with pooling
embedding = grilly_feature_extraction_pipeline(
model, tokenizer, "Hello", pooling="mean" # "mean", "cls", "last"
)
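The three pooling modes can be pictured with a short NumPy sketch that reduces a (batch, seq, hidden) array to one vector per input. This is an assumption about what "mean", "cls", and "last" conventionally mean, not the pipeline's actual code; pool_hidden_states is a hypothetical name, and the "last" branch assumes right-padded inputs.

```python
import numpy as np


def pool_hidden_states(last_hidden_state, attention_mask, pooling="mean"):
    """Reduce (batch, seq, hidden) states to (batch, hidden) embeddings."""
    h = np.asarray(last_hidden_state, dtype=np.float32)
    mask = np.asarray(attention_mask, dtype=np.float32)

    if pooling == "cls":
        # First position, e.g. the [CLS] token of a BERT-style encoder.
        return h[:, 0, :]
    if pooling == "last":
        # Last non-padding position per row (assumes right padding).
        last = mask.sum(axis=1).astype(np.int64) - 1
        return h[np.arange(h.shape[0]), last, :]
    if pooling == "mean":
        # Average over real tokens only; padding gets zero weight.
        weights = mask[:, :, None]
        return (h * weights).sum(axis=1) / np.maximum(weights.sum(axis=1), 1.0)
    raise ValueError(f"unknown pooling mode: {pooling!r}")
```

Masked mean pooling matters for batched inputs: averaging over padding positions would otherwise pull short sentences toward the padding embedding.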
Architecture
optimum-grilly
├── optimum/grilly/
│ ├── __init__.py # Lazy imports
│ ├── configuration.py # GrillyConfig (HF config mapping)
│ ├── modeling.py # GrillyModel + task subclasses
│ ├── export.py # HF PyTorch → .grilly converter
│ ├── pipelines.py # Pipeline helpers
│ ├── utils.py # safetensors I/O
│ └── version.py
├── tests/
│ ├── test_configuration.py
│ ├── test_modeling.py
│ ├── test_export.py
│ ├── test_pipelines.py
│ └── test_utils.py
└── pyproject.toml
How it works
- Export (export.py): Downloads a HuggingFace PyTorch model, extracts all named_parameters() and named_buffers() as float32 numpy arrays, and saves them as safetensors alongside a grilly_config.json that maps the HF architecture to grilly ops.
- Load (modeling.py): Reads the safetensors weights and config, then builds a graph of _TransformerBlock objects that hold numpy weight arrays. Each block dispatches linear/norm/attention/FFN operations to grilly_core (the C++ Vulkan extension) with automatic CPU numpy fallbacks.
- Inference: All computation happens in float32. The Vulkan backend handles GPU upload/download transparently. When grilly_core is not available, all ops fall back to numpy, which is slower but correct.
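The GPU-or-numpy dispatch described above can be sketched roughly as follows. This is an illustration of the fallback pattern under stated assumptions, not the code in modeling.py: it assumes the extension is importable under the name grilly_core, and gpu_op is a hypothetical stand-in for a Vulkan kernel wrapper, since this README does not list the real grilly_core function names or signatures.

```python
import importlib.util

import numpy as np

# Assumption: the Vulkan extension is importable as `grilly_core`.
HAS_GRILLY_CORE = importlib.util.find_spec("grilly_core") is not None


def linear_numpy(x, weight, bias=None):
    """CPU reference: y = x @ W^T + b, with W stored as (out, in)."""
    y = np.asarray(x, dtype=np.float32) @ np.asarray(weight, dtype=np.float32).T
    if bias is not None:
        y = y + np.asarray(bias, dtype=np.float32)
    return y


def linear(x, weight, bias=None, gpu_op=None):
    """Run gpu_op when one is supplied, otherwise fall back to NumPy.

    gpu_op is a hypothetical callable taking (x, weight, bias).
    """
    if gpu_op is not None:
        try:
            return gpu_op(x, weight, bias)
        except Exception:
            pass  # e.g. no usable device: fall through to the CPU path
    return linear_numpy(x, weight, bias)
```

Keeping a plain NumPy reference for every op also gives a convenient baseline for checking GPU results numerically.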
Supported architectures
| Architecture | Status | Notes |
|---|---|---|
| LLaMA / LLaMA 2 / LLaMA 3 | Supported | Pre-norm, SwiGLU, RoPE, GQA |
| Mistral | Supported | Same as LLaMA (sliding window not yet implemented) |
| BERT | Supported | Post-norm, standard FFN |
| GPT-2 | Supported | Pre-norm, fused QKV, Conv1D weight handling |
| T5 | Planned | Encoder-decoder not yet implemented |
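The "fused QKV, Conv1D weight handling" note for GPT-2 refers to a layout quirk: HuggingFace GPT-2 stores the attention projection c_attn as a single Conv1D weight of shape (hidden, 3 * hidden) and computes x @ W + b, whereas Linear-style layers store (out, in). A minimal sketch of splitting and transposing such a weight is shown below; how optimum-grilly does this internally is not described here, and split_gpt2_qkv is a hypothetical name.

```python
import numpy as np


def split_gpt2_qkv(c_attn_weight, c_attn_bias):
    """Split a fused GPT-2 c_attn projection into q, k, v in (out, in) layout."""
    hidden = c_attn_weight.shape[0]
    if c_attn_weight.shape != (hidden, 3 * hidden):
        raise ValueError("expected a Conv1D weight of shape (hidden, 3 * hidden)")

    # Conv1D stores (in, out), so the fused output columns are [q | k | v].
    w_q, w_k, w_v = np.split(c_attn_weight, 3, axis=1)
    b_q, b_k, b_v = np.split(c_attn_bias, 3)

    # Transpose each slice to the (out, in) convention used by Linear layers.
    return [
        (np.ascontiguousarray(w.T), b)
        for w, b in ((w_q, b_q), (w_k, b_k), (w_v, b_v))
    ]
```

With the split weights, x @ w_q.T + b_q reproduces the first hidden columns of the fused projection x @ W + b.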
Environment Variables
| Variable | Description |
|---|---|
| VK_GPU_INDEX | Select GPU by index (default: 0) |
| GRILLY_DEBUG | Set to 1 for debug logging |
| ALLOW_CPU_VULKAN | Set to 1 to allow llvmpipe CPU fallback |
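As a rough illustration of how these variables might be interpreted, the sketch below parses them from a mapping using the defaults listed in the table. read_gpu_settings is a hypothetical helper, not part of optimum-grilly, and treating any value other than "1" as disabled is an assumption.

```python
import os


def read_gpu_settings(env=None):
    """Parse the documented variables from a mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # Default GPU index is 0, as stated in the table.
        "gpu_index": int(env.get("VK_GPU_INDEX", "0")),
        # Assumption: only the literal value "1" switches these on.
        "debug": env.get("GRILLY_DEBUG") == "1",
        "allow_cpu_vulkan": env.get("ALLOW_CPU_VULKAN") == "1",
    }


# Example: pick the second GPU and turn on debug logging.
print(read_gpu_settings({"VK_GPU_INDEX": "1", "GRILLY_DEBUG": "1"}))
```

When setting these from Python through os.environ, doing so before the first import of the backend is the cautious choice, since this README does not say when they are read.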
Known Limitations
- No KV-cache: generate() recomputes the full forward pass per token (O(n²)). KV-cache support is planned.
- Float32 only: No fp16/bf16/int8 quantization yet.
- No beam search: Only greedy and top-k sampling.
- No streaming: generate() returns the full sequence.
- T5 not supported: Encoder-decoder architectures are not yet implemented.
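To make the O(n²) cost of generating without a KV-cache concrete, the following back-of-the-envelope sketch counts how many token positions pass through the model during generation. It is a simplified cost model (it counts positions only and ignores the per-position attention cost), and positions_processed is a hypothetical helper.

```python
def positions_processed(prompt_len, new_tokens, kv_cache=False):
    """Count token positions run through the model during generation."""
    if new_tokens <= 0:
        return 0
    if kv_cache:
        # The prompt is processed once, then one position per later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step t re-runs the full sequence of prompt_len + t tokens.
    return sum(prompt_len + t for t in range(new_tokens))


# A 5-token prompt with max_new_tokens=50, as in the Quick Start example.
print(positions_processed(5, 50))                 # 1475
print(positions_processed(5, 50, kv_cache=True))  # 54
```

The gap widens quadratically with sequence length, so short generations are the practical target until KV-cache support lands.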
Development
git clone https://github.com/grillcheese-ai/optimum-grilly.git
cd optimum-grilly
pip install -e ".[dev]"
pytest tests/ -v
License
Apache 2.0 — see LICENSE for details.
Links
- Grilly — The GPU framework
- HuggingFace Optimum — HF's optimization toolkit
- GrillCheese AI — Research lab