
Quicksilver

High-Performance GGUF Inference Engine

Quicksilver is a standalone, production-ready library for GGUF quantized model inference. On SmolLM2-135M it reaches 84 tok/s on CPU (95% faster than llama.cpp's 43 tok/s) through optimized C++ kernels with AVX2 SIMD and OpenMP parallelization.

Performance

| Model        | Size  | Quicksilver | llama.cpp  | Speedup    |
|--------------|-------|-------------|------------|------------|
| SmolLM2-135M | 101MB | 84 tok/s    | 43 tok/s   | +95%       |
| Qwen2.5-1.5B | 1.0GB | 12 tok/s    | ~12 tok/s  | comparable |

Benchmarked with OMP_NUM_THREADS=8 on an Intel CPU.

Backends

| Backend | Platform          | Status        |
|---------|-------------------|---------------|
| CPU     | All (AVX2/NEON)   | ✅ Production |
| CUDA    | NVIDIA GPUs       | ✅ Ready      |
| Metal   | Apple Silicon     | ✅ Ready      |
| CANN    | Huawei Ascend NPU | ✅ Ready      |

Features

  • Native GGUF Parsing - Zero external dependencies for model loading
  • 9 Quantization Types - Q4_0, Q5_0, Q8_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, F16
  • AVX2 SIMD - Vectorized Q4_K/Q6_K kernels with FMA intrinsics
  • OpenMP Parallelization - Multi-threaded GEMV operations
  • Fused Operations - Minimal Python overhead with C++ transformer forward pass
  • Auto Backend Selection - Automatically uses best available (CUDA > Metal > CPU)
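The backend priority described above (CUDA > Metal > CPU) amounts to an ordered lookup over whatever backends are available at runtime. The helper below is an illustrative sketch of that policy only; `select_backend` and its argument are hypothetical and not part of Quicksilver's API:

```python
# Priority-based backend selection sketch (CUDA > Metal > CPU).
# `select_backend` is a hypothetical helper, not Quicksilver's actual API.

PRIORITY = ("cuda", "metal", "cpu")

def select_backend(available):
    """Return the highest-priority backend present in `available`."""
    for backend in PRIORITY:
        if backend in available:
            return backend
    raise RuntimeError("no usable backend found")
```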

Installation

CPU (All Platforms)

cd quicksilver
pip install -e .

# Build C++ kernels (requires C++ compiler + OpenMP)
cd csrc && python setup_quantized.py build_ext --inplace

# Optimal performance
export OMP_NUM_THREADS=8

CUDA (NVIDIA GPUs)

# Requires: CUDA toolkit, nvcc compiler
cd quicksilver/csrc
python setup_cuda.py build_ext --inplace

Metal (macOS)

# Requires: Xcode (not just Command Line Tools)
python quicksilver/backends/metal/compile_shaders.py

CANN (Huawei Ascend NPU)

# Requires: CANN toolkit from https://www.hiascend.com/cann
# Install torch_npu first
pip install torch-npu

# Build CANN kernels
cd quicksilver/backends/cann
python setup_cann.py build_ext --inplace

Quick Start

Python API

from quicksilver import QuantizedEngine

# Load model
engine = QuantizedEngine("model.gguf")

# Generate text
tokens = engine.generate("Hello, world!", max_tokens=50)
print(engine.decode(tokens))

CLI

# Generate text
quicksilver generate -m model.gguf -p "Once upon a time"

# Run benchmark
quicksilver benchmark -m model.gguf -n 100

# Show model info
quicksilver info -m model.gguf

Architecture

quicksilver/
├── core/           # GGUF parsing and quantization
│   ├── parser.py   # Native GGUF file parser
│   ├── tensor.py   # Quantized tensor operations
│   └── quantization.py  # Quantization type definitions
├── csrc/           # C++ kernels
│   └── fused_quantized.cpp  # Optimized transformer kernels
├── quantized_engine.py  # Main inference engine
└── cli.py          # Command-line interface
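For a rough sense of what the native GGUF parsing in core/parser.py involves, the sketch below reads just the fixed GGUF file header with the standard library. The field layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata KV count) follows the public GGUF specification; the function itself is illustrative, not Quicksilver's parser:

```python
import struct

# Minimal GGUF header reader (illustrative sketch, not Quicksilver's parser.py).
# Per the GGUF spec, a file begins with:
#   4-byte magic b"GGUF", uint32 version, uint64 tensor_count, uint64 metadata_kv_count
def read_gguf_header(data: bytes) -> dict:
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}
```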

Supported Quantization Types

| Type | Bits | Block Size | Status        |
|------|------|------------|---------------|
| Q4_0 | 4.0  | 32         | ✅ Native GEMV |
| Q5_0 | 5.0  | 32         | ✅ Native GEMV |
| Q8_0 | 8.0  | 32         | ✅ Native GEMV |
| Q2_K | 2.5  | 256        | ✅ Native GEMV |
| Q3_K | 3.4  | 256        | ✅ Native GEMV |
| Q4_K | 4.5  | 256        | ✅ AVX2 SIMD   |
| Q5_K | 5.5  | 256        | ✅ Native GEMV |
| Q6_K | 6.5  | 256        | ✅ AVX2 SIMD   |
| F16  | 16   | 1          | ✅ Native GEMV |
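For intuition on how these block formats work, here is a pure-Python dequantizer for a single Q4_0 block, following the common ggml layout: an fp16 scale d followed by 16 bytes of packed 4-bit values, each decoded as d * (q - 8), with the low nibbles holding weights 0-15 and the high nibbles weights 16-31. This is an illustrative sketch, not Quicksilver's C++ kernel:

```python
import struct

# Dequantize one Q4_0 block: 2-byte fp16 scale + 16 bytes of packed nibbles
# = 18 bytes for 32 weights. Each 4-bit value q decodes to d * (q - 8).
def dequantize_q4_0(block: bytes) -> list:
    assert len(block) == 18
    (d,) = struct.unpack_from("<e", block, 0)    # fp16 scale factor
    qs = block[2:]
    lo = [d * ((b & 0x0F) - 8) for b in qs]      # weights 0..15 (low nibbles)
    hi = [d * ((b >> 4) - 8) for b in qs]        # weights 16..31 (high nibbles)
    return lo + hi
```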

Roadmap

Completed:

  • CPU inference with quantized kernels
  • Beat llama.cpp performance (84 vs 43 tok/s = +95%)
  • Support for 9 quantization types (covers 95%+ of GGUF models)
  • Metal GPU backend shaders
  • CUDA GPU backend kernels

Planned:

  • OpenAI-compatible API server
  • Streaming generation
  • Batch inference

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black quicksilver/
ruff check quicksilver/

License

Apache 2.0
