Maintainers

grillcheese

Optimum Grilly

HuggingFace Optimum backend for Grilly — Vulkan GPU inference on any GPU

Alpha software. APIs may change. We welcome early adopters and feedback.

optimum-grilly bridges HuggingFace Transformers to Grilly's Vulkan compute backend. Export a supported model once, load it with from_pretrained, and run inference on AMD, NVIDIA, or Intel GPUs with no CUDA required.

Features

Any GPU: AMD, NVIDIA, Intel — anything with Vulkan drivers
HuggingFace compatible: Same from_pretrained / generate API you already know
Zero PyTorch runtime: Export once, then run inference without PyTorch installed
Automatic CPU fallback: Works without a GPU (slower, but functional)
Supported architectures: LLaMA, Mistral, BERT, GPT-2 (T5 planned)

Installation

# Core package (CPU fallback only)
pip install optimum-grilly

# With Vulkan GPU acceleration (quotes stop shells such as zsh from expanding the brackets)
pip install "optimum-grilly[gpu]"

# With export support (requires PyTorch)
pip install "optimum-grilly[export]"

# Everything
pip install "optimum-grilly[all]"
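
To see which optional extras are usable in the current environment, a check along these lines may help. It is a sketch that uses only the standard library. The import names `grilly_core` (the C++ Vulkan extension this README mentions) and `torch` are assumed to be what the gpu and export extras provide, and `available_extras` is a hypothetical helper, not part of optimum-grilly.

```python
from importlib.util import find_spec


def available_extras(modules=None):
    """Report which optional top-level modules can be imported.

    The default mapping is an assumption: `grilly_core` for the gpu extra
    and `torch` for the export extra.
    """
    if modules is None:
        modules = {"gpu": "grilly_core", "export": "torch"}
    # find_spec returns None when a top-level module is not installed,
    # and it does not import the module itself.
    return {extra: find_spec(name) is not None for extra, name in modules.items()}


for extra, present in available_extras().items():
    print(f"{extra}: {'available' if present else 'missing'}")
```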

Requirements

Python >= 3.10
grilly >= 0.4.5 (for GPU acceleration)
Vulkan drivers installed on your system (not needed for the CPU fallback)
For export: PyTorch >= 2.0

Quick Start

1. Export a HuggingFace model

Convert a HuggingFace model to .grilly format (safetensors + config):

from optimum.grilly import export_to_grilly

# Export a causal LM
export_to_grilly(
    "meta-llama/Llama-3.2-1B",
    output_dir="./llama-1b-grilly",
)

# Export a BERT model for feature extraction
export_to_grilly(
    "bert-base-uncased",
    output_dir="./bert-grilly",
    task="feature-extraction",
)

Or from the command line:

optimum-grilly-export --model meta-llama/Llama-3.2-1B --output ./llama-1b-grilly
optimum-grilly-export --model bert-base-uncased --output ./bert-grilly --task feature-extraction

2. Run inference

from optimum.grilly import GrillyModelForCausalLM
from transformers import AutoTokenizer

# Load model and tokenizer
model = GrillyModelForCausalLM.from_pretrained("./llama-1b-grilly")
tokenizer = AutoTokenizer.from_pretrained("./llama-1b-grilly")

# Generate text
input_ids = tokenizer("The meaning of life is", return_tensors="np")["input_ids"]
output_ids = model.generate(input_ids, max_new_tokens=50, temperature=0.8, top_k=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
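
For readers unfamiliar with the temperature and top_k arguments, the sampling step they control can be sketched in plain numpy. This is a generic illustration of top-k sampling with temperature, not the code generate() runs; `sample_top_k` is a hypothetical helper whose argument names simply mirror the call above.

```python
import numpy as np


def sample_top_k(logits, temperature=0.8, top_k=40, rng=None):
    """Sample one token id from a 1-D array of next-token logits.

    Keeps the `top_k` highest logits, divides them by `temperature`,
    and draws from the resulting softmax distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64)
    k = max(1, min(top_k, logits.shape[-1]))
    # Indices of the k largest logits (their relative order does not matter).
    top_idx = np.argpartition(logits, -k)[-k:]
    scaled = logits[top_idx] / temperature
    # Subtract the max before exponentiating for numerical stability.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))
```

Lower temperatures concentrate probability on the highest-scoring tokens, and top_k=1 reduces to greedy decoding.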

3. Feature extraction (embeddings)

from optimum.grilly import GrillyModelForFeatureExtraction
from optimum.grilly.pipelines import grilly_feature_extraction_pipeline
from transformers import AutoTokenizer

model = GrillyModelForFeatureExtraction.from_pretrained("./bert-grilly")
tokenizer = AutoTokenizer.from_pretrained("./bert-grilly")

# Get sentence embeddings
embedding = grilly_feature_extraction_pipeline(
    model, tokenizer, "Hello world", pooling="mean"
)
print(embedding.shape)  # (1, 768)

API Reference

Configuration

from optimum.grilly import GrillyConfig

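# From a HuggingFace config dict
# (hf_config_dict is a plain dict, e.g. from transformers.AutoConfig.from_pretrained(name).to_dict())
config = GrillyConfig.from_hf_config(hf_config_dict)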

# Save / load
config.save("./model-dir")
config = GrillyConfig.load("./model-dir")

# Inspect
print(config)  # GrillyConfig(model_type='llama', hidden_size=4096, ...)
print(config.get_layer_map())  # Layer descriptors for weight loading

Models

Class	Description
`GrillyModel`	Base class — embed + transformer blocks + final norm
`GrillyModelForCausalLM`	+ LM head + `generate()` for text generation
`GrillyModelForFeatureExtraction`	Returns `last_hidden_state` for embeddings
`GrillyModelForSequenceClassification`	+ classifier head for classification tasks

All models support:

from_pretrained(path) — Load from a .grilly directory
save_pretrained(path) — Save config + weights
forward(input_ids, attention_mask=None) — Run inference

Export

from optimum.grilly import export_to_grilly

export_to_grilly(
    model_name_or_path="meta-llama/Llama-3.2-1B",
    output_dir="./output",
    task="causal-lm",         # "causal-lm", "feature-extraction",
                               # "sequence-classification", "auto"
    dtype="float32",
    include_tokenizer=True,
)

Pipelines

from optimum.grilly.pipelines import (
    grilly_text_generation_pipeline,
    grilly_feature_extraction_pipeline,
)

# Text generation (model is a GrillyModelForCausalLM, loaded as in Quick Start)
text = grilly_text_generation_pipeline(model, tokenizer, "Once upon a time")

# Feature extraction with pooling (model is a GrillyModelForFeatureExtraction)
embedding = grilly_feature_extraction_pipeline(
    model, tokenizer, "Hello", pooling="mean"  # "mean", "cls", "last"
)
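
The three pooling modes can be pictured as different reductions of last_hidden_state from shape (batch, seq, hidden) to (batch, hidden). The sketch below shows one common way to define them in numpy. How the pipeline handles padding is an assumption here (right-padded inputs, mask-weighted mean), and `pool` is a hypothetical helper.

```python
import numpy as np


def pool(last_hidden_state, attention_mask, pooling="mean"):
    """Reduce (batch, seq, hidden) token states to (batch, hidden)."""
    hidden = np.asarray(last_hidden_state, dtype=np.float32)
    mask = np.asarray(attention_mask, dtype=np.float32)
    if pooling == "cls":
        # First token of each sequence ([CLS] for BERT-style models).
        return hidden[:, 0, :]
    if pooling == "last":
        # Last non-padding token, assuming right-padded inputs.
        last = np.maximum(mask.sum(axis=1).astype(np.int64) - 1, 0)
        return hidden[np.arange(hidden.shape[0]), last, :]
    if pooling == "mean":
        # Average over real tokens only; padding positions get weight 0.
        weights = mask[:, :, None]
        return (hidden * weights).sum(axis=1) / np.maximum(weights.sum(axis=1), 1.0)
    raise ValueError(f"unknown pooling mode: {pooling!r}")
```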

Architecture

optimum-grilly
├── optimum/grilly/
│   ├── __init__.py          # Lazy imports
│   ├── configuration.py     # GrillyConfig (HF config mapping)
│   ├── modeling.py           # GrillyModel + task subclasses
│   ├── export.py             # HF PyTorch → .grilly converter
│   ├── pipelines.py          # Pipeline helpers
│   ├── utils.py              # safetensors I/O
│   └── version.py
├── tests/
│   ├── test_configuration.py
│   ├── test_modeling.py
│   ├── test_export.py
│   ├── test_pipelines.py
│   └── test_utils.py
└── pyproject.toml

How it works

Export (export.py): Downloads a HuggingFace PyTorch model, extracts all named_parameters() and named_buffers() as float32 numpy arrays, saves them as safetensors alongside a grilly_config.json that maps the HF architecture to grilly ops.
Load (modeling.py): Reads the safetensors weights and config, builds a graph of _TransformerBlock objects that hold numpy weight arrays. Each block dispatches linear/norm/attention/FFN operations to grilly_core (the C++ Vulkan extension) with automatic CPU numpy fallbacks.
Inference: All computation happens in float32. The Vulkan backend handles GPU upload/download transparently. When grilly_core is not available, all ops fall back to numpy — slower but correct.
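
The GPU-with-numpy-fallback dispatch described above can be sketched as follows. This is an illustration of the pattern only: `gpu_op` is a hypothetical callable standing in for whatever grilly_core exposes, and the sketch never calls grilly_core itself.

```python
from importlib.util import find_spec

import numpy as np

# True when the C++ Vulkan extension named in the text is importable.
HAVE_GRILLY_CORE = find_spec("grilly_core") is not None


def linear(x, weight, bias=None, gpu_op=None):
    """Compute x @ weight.T + bias, preferring a GPU op when one is given.

    `weight` uses the (out_features, in_features) layout.
    """
    if gpu_op is not None:
        try:
            return gpu_op(x, weight, bias)
        except RuntimeError:
            pass  # e.g. no usable device: fall through to the numpy path
    out = np.asarray(x, dtype=np.float32) @ np.asarray(weight, dtype=np.float32).T
    if bias is not None:
        out = out + np.asarray(bias, dtype=np.float32)
    return out


def backend_name():
    """Report which path this process could use by default."""
    return "vulkan" if HAVE_GRILLY_CORE else "numpy"
```

Keeping the numpy path as the last branch means every op has a reference result to compare GPU output against.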

Supported architectures

Architecture	Status	Notes
LLaMA / LLaMA 2 / LLaMA 3	Supported	Pre-norm, SwiGLU, RoPE, GQA
Mistral	Supported	Same as LLaMA (sliding window not yet implemented)
BERT	Supported	Post-norm, standard FFN
GPT-2	Supported	Pre-norm, fused QKV, Conv1D weight handling
T5	Planned	Encoder-decoder not yet implemented
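
As an illustration of the GPT-2 row, the sketch below splits a fused QKV projection stored in Conv1D layout into three standard linear layers. It relies on two facts about HuggingFace's GPT-2 checkpoints: the Conv1D weight has shape (in_features, out_features), and the fused columns are ordered query, key, value. `split_fused_qkv` is a hypothetical helper, and whether the export step splits the weights or keeps them fused is not stated in this README.

```python
import numpy as np


def split_fused_qkv(c_attn_weight, c_attn_bias):
    """Split a GPT-2 style fused QKV projection into three linear layers.

    Each returned weight is transposed into the (out_features, in_features)
    layout that a standard linear layer uses.
    """
    hidden = c_attn_weight.shape[0]
    if c_attn_weight.shape != (hidden, 3 * hidden):
        raise ValueError(f"expected ({hidden}, {3 * hidden}), got {c_attn_weight.shape}")
    # Columns are ordered [Q | K | V]; split along the output axis.
    weights = np.split(c_attn_weight, 3, axis=1)
    biases = np.split(c_attn_bias, 3)
    return {
        name: (np.ascontiguousarray(w.T), b)
        for name, w, b in zip(("q", "k", "v"), weights, biases)
    }
```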

Environment Variables

Variable	Description
`VK_GPU_INDEX`	Select GPU by index (default: 0)
`GRILLY_DEBUG`	Set to `1` for debug logging
`ALLOW_CPU_VULKAN`	Set to `1` to allow llvmpipe CPU fallback
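
These variables can also be set from Python. The sketch below assumes they are read when the backend is imported or when the device is initialised, so it should run before any grilly import. `configure_grilly_env` is a hypothetical helper, not part of the package.

```python
import os


def configure_grilly_env(gpu_index=0, debug=False, allow_cpu_vulkan=False, env=None):
    """Write the three documented variables into `env` (os.environ by default)."""
    env = os.environ if env is None else env
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    env["VK_GPU_INDEX"] = str(gpu_index)
    # The table documents "1" as the enabling value; this sketch unsets the
    # variable to disable a flag.
    for name, enabled in (("GRILLY_DEBUG", debug), ("ALLOW_CPU_VULKAN", allow_cpu_vulkan)):
        if enabled:
            env[name] = "1"
        else:
            env.pop(name, None)
    return env
```

For example, configure_grilly_env(gpu_index=1, debug=True) selects the second GPU and turns on debug logging for the current process.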

Known Limitations

No KV-cache: generate() recomputes the full forward pass for every new token, so the total number of token positions processed grows as O(n²) in sequence length. KV-cache support is planned.
Float32 only: No fp16/bf16/int8 quantization yet.
No beam search: Only greedy and top-k sampling.
No streaming: generate() returns the full sequence.
T5 not supported: Encoder-decoder architectures are not yet implemented.
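
To get a feel for the cost of the missing KV-cache, the following arithmetic sketch counts how many token positions pass through the model during generation. The cached variant describes a generic KV-cache (prompt processed once, then one new token per step), not a feature of this package, and `positions_processed` is a hypothetical helper.

```python
def positions_processed(prompt_len, new_tokens, kv_cache=False):
    """Count token positions pushed through the model during generation.

    Without a cache, each step re-runs the whole sequence seen so far.
    With a cache, the prompt is processed once and every later step
    processes only the newest token.
    """
    if prompt_len < 1 or new_tokens < 1:
        raise ValueError("prompt_len and new_tokens must be at least 1")
    if kv_cache:
        return prompt_len + (new_tokens - 1)
    # Step i (0-based) sees prompt_len + i tokens.
    return sum(prompt_len + i for i in range(new_tokens))


print(positions_processed(5, 50))                 # 1475
print(positions_processed(5, 50, kv_cache=True))  # 54
```

The gap widens quadratically with max_new_tokens, which is why short generations are the practical use case for now.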

Development

git clone https://github.com/grillcheese-ai/optimum-grilly.git
cd optimum-grilly
pip install -e ".[dev]"
pytest tests/ -v

License

Apache 2.0 — see LICENSE for details.

Release history

0.3.1: Mar 30, 2026
0.2.1: Mar 15, 2026
0.1.0: Feb 28, 2026 (this version)

Download files

Source Distribution

optimum_grilly-0.1.0.tar.gz (28.7 kB), uploaded Feb 28, 2026, Source

Built Distribution

optimum_grilly-0.1.0-py3-none-any.whl (23.4 kB), uploaded Feb 28, 2026, Python 3

File details

Details for the file optimum_grilly-0.1.0.tar.gz.

File metadata

Download URL: optimum_grilly-0.1.0.tar.gz
Upload date: Feb 28, 2026
Size: 28.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for optimum_grilly-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0ebb1961c13b4241331b0b46d1294ffa0c0407ffaa191ddf11d0c9de6b977c63`
MD5	`4f9f10921bdb0e4724d15d6b92a45259`
BLAKE2b-256	`510450c3d8e8af1f1570f4e8d99353d0e8121d622d9725f14c3137de7ce62f92`
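
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; `sha256_matches` is a hypothetical helper and the local path is whatever location the file was saved to.

```python
import hashlib
import hmac


def sha256_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True when the file's SHA256 digest equals `expected_hex`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected_hex.strip().lower())
```

Usage: call sha256_matches("optimum_grilly-0.1.0.tar.gz", "<SHA256 from the table>") after downloading the source distribution.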

Provenance

The following attestation bundles were made for optimum_grilly-0.1.0.tar.gz:

Publisher: publish.yml on Grillcheese-AI/optimum-grilly

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: optimum_grilly-0.1.0.tar.gz
- Subject digest: 0ebb1961c13b4241331b0b46d1294ffa0c0407ffaa191ddf11d0c9de6b977c63
- Sigstore transparency entry: 1004797554
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: Grillcheese-AI/optimum-grilly@ead7560d6d13304cf4544f382e7bbd3e53d02bf3
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/Grillcheese-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ead7560d6d13304cf4544f382e7bbd3e53d02bf3
- Trigger Event: release

File details

Details for the file optimum_grilly-0.1.0-py3-none-any.whl.

File metadata

Download URL: optimum_grilly-0.1.0-py3-none-any.whl
Upload date: Feb 28, 2026
Size: 23.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for optimum_grilly-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0f525b9ef027c2c17610f1a3a75d8bf77b5ddb179dd9b6c797b97bfcdd218160`
MD5	`a0837c3a6214483a5fcc4026dfe391cb`
BLAKE2b-256	`ff5966d702369583318c857fcdff77ed4dd87b2cf860ad117b444e5b3305236b`

Provenance

The following attestation bundles were made for optimum_grilly-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Grillcheese-AI/optimum-grilly

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: optimum_grilly-0.1.0-py3-none-any.whl
- Subject digest: 0f525b9ef027c2c17610f1a3a75d8bf77b5ddb179dd9b6c797b97bfcdd218160
- Sigstore transparency entry: 1004797556
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: Grillcheese-AI/optimum-grilly@ead7560d6d13304cf4544f382e7bbd3e53d02bf3
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/Grillcheese-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ead7560d6d13304cf4544f382e7bbd3e53d02bf3
- Trigger Event: release
