# Quicksilver

High-Performance GGUF Inference Engine

Quicksilver is a standalone, production-ready library for GGUF quantized model inference. It achieves 84 tok/s on CPU (95% faster than llama.cpp on the same model) through optimized C++ kernels using AVX2 SIMD and OpenMP parallelization.
## Performance
| Model | Size | Quicksilver | llama.cpp | Speedup |
|---|---|---|---|---|
| SmolLM2-135M | 101MB | 84 tok/s | 43 tok/s | +95% |
| Qwen2.5-1.5B | 1.0GB | 12 tok/s | ~12 tok/s | comparable |
Benchmarked with `OMP_NUM_THREADS=8` on an Intel CPU.
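The speedup column is the relative throughput gain over llama.cpp. As a quick illustration (`speedup_pct` is a helper written for this example, not part of the library):

```python
def speedup_pct(ours_tok_s: float, baseline_tok_s: float) -> float:
    """Relative throughput gain in percent: (ours / baseline - 1) * 100."""
    return (ours_tok_s / baseline_tok_s - 1.0) * 100.0

# SmolLM2-135M row from the table above
print(f"{speedup_pct(84, 43):+.0f}%")  # → +95%
```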
## Backends
| Backend | Platform | Status |
|---|---|---|
| CPU | All (AVX2/NEON) | ✅ Production |
| CUDA | NVIDIA GPUs | ✅ Ready |
| Metal | Apple Silicon | ✅ Ready |
| CANN | Huawei Ascend NPU | ✅ Ready |
## Features
- Native GGUF Parsing - Zero external dependencies for model loading
- 9 Quantization Types - Q4_0, Q5_0, Q8_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, F16
- AVX2 SIMD - Vectorized Q4_K/Q6_K kernels with FMA intrinsics
- OpenMP Parallelization - Multi-threaded GEMV operations
- Fused Operations - Minimal Python overhead with C++ transformer forward pass
- Auto Backend Selection - Automatically uses best available (CUDA > Metal > CPU)
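The auto backend selection above can be sketched as a simple priority probe. This is a hypothetical illustration only; the module names `quicksilver_cuda` and `quicksilver_metal` are assumptions, not the library's actual internals:

```python
import importlib.util

# Hypothetical extension-module names, used only for illustration.
_BACKEND_PROBES = [
    ("quicksilver_cuda", "cuda"),    # NVIDIA GPUs
    ("quicksilver_metal", "metal"),  # Apple Silicon
]

def select_backend() -> str:
    """Return the best available backend in priority order CUDA > Metal > CPU."""
    for module_name, backend in _BACKEND_PROBES:
        if importlib.util.find_spec(module_name) is not None:
            return backend
    return "cpu"  # the CPU kernels are always built
```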
## Installation

### CPU (All Platforms)
```bash
cd quicksilver
pip install -e .

# Build C++ kernels (requires a C++ compiler + OpenMP)
cd csrc && python setup_quantized.py build_ext --inplace

# Optimal performance
export OMP_NUM_THREADS=8
```
### CUDA (NVIDIA GPUs)

```bash
# Requires: CUDA toolkit, nvcc compiler
cd quicksilver/csrc
python setup_cuda.py build_ext --inplace
```
### Metal (macOS)

```bash
# Requires: Xcode (not just the Command Line Tools)
python quicksilver/backends/metal/compile_shaders.py
```
CANN (Huawei Ascend NPU)
# Requires: CANN toolkit from https://www.hiascend.com/cann
# Install torch_npu first
pip install torch-npu
# Build CANN kernels
cd quicksilver/backends/cann
python setup_cann.py build_ext --inplace
## Quick Start

### Python API
```python
from quicksilver import QuantizedEngine

# Load model
engine = QuantizedEngine("model.gguf")

# Generate text
tokens = engine.generate("Hello, world!", max_tokens=50)
print(engine.decode(tokens))
```
### CLI
```bash
# Generate text
quicksilver generate -m model.gguf -p "Once upon a time"

# Run benchmark
quicksilver benchmark -m model.gguf -n 100

# Show model info
quicksilver info -m model.gguf
```
## Architecture
```text
quicksilver/
├── core/                    # GGUF parsing and quantization
│   ├── parser.py            # Native GGUF file parser
│   ├── tensor.py            # Quantized tensor operations
│   └── quantization.py      # Quantization type definitions
├── csrc/                    # C++ kernels
│   └── fused_quantized.cpp  # Optimized transformer kernels
├── quantized_engine.py      # Main inference engine
└── cli.py                   # Command-line interface
```
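Native GGUF parsing starts from the file's fixed header: the `GGUF` magic, a uint32 version, then uint64 tensor and metadata key/value counts, all little-endian per the GGUF specification. A minimal sketch of reading it (`read_gguf_header` is an illustrative helper, not the library's actual API):

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed GGUF header: b'GGUF' magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count (little-endian)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensor_count": n_tensors, "kv_count": n_kv}
```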
## Supported Quantization Types
| Type | Bits/weight | Block Size | Status |
|---|---|---|---|
| Q4_0 | 4.0 | 32 | ✅ Native GEMV |
| Q5_0 | 5.0 | 32 | ✅ Native GEMV |
| Q8_0 | 8.0 | 32 | ✅ Native GEMV |
| Q2_K | 2.5 | 256 | ✅ Native GEMV |
| Q3_K | 3.4 | 256 | ✅ Native GEMV |
| Q4_K | 4.5 | 256 | ✅ AVX2 SIMD |
| Q5_K | 5.5 | 256 | ✅ Native GEMV |
| Q6_K | 6.5 | 256 | ✅ AVX2 SIMD |
| F16 | 16 | 1 | ✅ Native GEMV |
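To make the block layout concrete, here is an illustrative dequantizer for a single Q4_0 block using GGML's standard layout: a float16 scale followed by 16 bytes packing 32 4-bit weights, each decoded as `scale * (nibble - 8)`. This is a readability sketch, not the library's optimized kernel:

```python
import numpy as np

def dequantize_q4_0(block: bytes) -> np.ndarray:
    """Dequantize one 18-byte Q4_0 block into 32 float32 weights."""
    assert len(block) == 18  # 2-byte f16 scale + 16 packed bytes
    scale = float(np.frombuffer(block, dtype=np.float16, count=1)[0])
    packed = np.frombuffer(block, dtype=np.uint8, offset=2)
    lo = (packed & 0x0F).astype(np.int8) - 8  # weights 0..15 (low nibbles)
    hi = (packed >> 4).astype(np.int8) - 8    # weights 16..31 (high nibbles)
    return scale * np.concatenate([lo, hi]).astype(np.float32)
```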
## Roadmap
- [x] CPU inference with quantized kernels
- [x] Beat llama.cpp performance (84 vs 43 tok/s = +95%)
- [x] Support 9 quantization types (covers 95%+ of GGUF models)
- [x] Metal GPU backend shaders
- [x] CUDA GPU backend kernels
- [ ] OpenAI-compatible API server
- [ ] Streaming generation
- [ ] Batch inference
## Development
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format and lint code
black quicksilver/
ruff check quicksilver/
```
## License
Apache 2.0
## File details

### quicksilver_inference-0.1.0.tar.gz

- Size: 51.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | 88915085988410c6e3e2f86e0a6e07eda492b59e105efe0a1a8236c11ac36a73 |
| MD5 | 8ce2b95b2522a13284fd8cd1478cfc01 |
| BLAKE2b-256 | 9c2efb974ecb1bdd78d4c98d8f01d4c679dcf4bcef2acd4d539ffb9dba6a79b9 |
### quicksilver_inference-0.1.0-py3-none-any.whl

- Size: 61.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | 09f7bc97647b8ccfb36e75f3a20a98f176ee5be4ec2f87b1bbd05031030d26f0 |
| MD5 | de3a240c70ee5410c6b47004663183d5 |
| BLAKE2b-256 | f4657d2b36f8740cd6067cea9ded3d1eb53e99142a1755c7d4c1ad246864c361 |