# Quicksilver

High-Performance GGUF Inference Engine

Quicksilver is a standalone, production-ready library for GGUF quantized model inference. It achieves 84 tok/s on CPU (95% faster than llama.cpp on the same model) through optimized C++ kernels using AVX2 SIMD and OpenMP parallelization.
## Performance
| Model | Size | Quicksilver | llama.cpp | Speedup |
|---|---|---|---|---|
| SmolLM2-135M | 101MB | 84 tok/s | 43 tok/s | +95% |
| Qwen2.5-1.5B | 1.0GB | 12 tok/s | ~12 tok/s | comparable |
Benchmarked with `OMP_NUM_THREADS=8` on an Intel CPU.
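The speedup column is the relative throughput gain over llama.cpp. As a quick illustration (`speedup_pct` is a helper written for this example, not part of the library):

```python
def speedup_pct(ours_tok_s: float, baseline_tok_s: float) -> float:
    """Relative throughput gain in percent: (ours / baseline - 1) * 100."""
    return (ours_tok_s / baseline_tok_s - 1.0) * 100.0

# SmolLM2-135M row from the table above
print(f"{speedup_pct(84, 43):+.0f}%")  # → +95%
```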
## Backends
| Backend | Platform | Status |
|---|---|---|
| CPU | All (AVX2/NEON) | ✅ Production |
| CUDA | NVIDIA GPUs | ✅ Ready |
| Metal | Apple Silicon | ✅ Ready |
| CANN | Huawei Ascend NPU | ✅ Ready |
## Features
- Native GGUF Parsing - Zero external dependencies for model loading
- 9 Quantization Types - Q4_0, Q5_0, Q8_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, F16
- AVX2 SIMD - Vectorized Q4_K/Q6_K kernels with FMA intrinsics
- OpenMP Parallelization - Multi-threaded GEMV operations
- Fused Operations - Minimal Python overhead with C++ transformer forward pass
- Auto Backend Selection - Automatically uses best available (CUDA > Metal > CPU)
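The auto backend selection above can be sketched as a simple priority probe. This is a hypothetical illustration only; the module names `quicksilver_cuda` and `quicksilver_metal` are assumptions, not the library's actual internals:

```python
import importlib.util

# Hypothetical extension-module names, used only for illustration.
_BACKEND_PROBES = [
    ("quicksilver_cuda", "cuda"),    # NVIDIA GPUs
    ("quicksilver_metal", "metal"),  # Apple Silicon
]

def select_backend() -> str:
    """Return the best available backend in priority order CUDA > Metal > CPU."""
    for module_name, backend in _BACKEND_PROBES:
        if importlib.util.find_spec(module_name) is not None:
            return backend
    return "cpu"  # the CPU kernels are always built
```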
## Installation

### CPU (All Platforms)
```bash
cd quicksilver
pip install -e .

# Build C++ kernels (requires a C++ compiler + OpenMP)
cd csrc && python setup_quantized.py build_ext --inplace

# Optimal performance
export OMP_NUM_THREADS=8
```
### CUDA (NVIDIA GPUs)

```bash
# Requires: CUDA toolkit, nvcc compiler
cd quicksilver/csrc
python setup_cuda.py build_ext --inplace
```
### Metal (macOS)

```bash
# Requires: Xcode (not just the Command Line Tools)
python quicksilver/backends/metal/compile_shaders.py
```
CANN (Huawei Ascend NPU)
# Requires: CANN toolkit from https://www.hiascend.com/cann
# Install torch_npu first
pip install torch-npu
# Build CANN kernels
cd quicksilver/backends/cann
python setup_cann.py build_ext --inplace
## Quick Start

### Python API
```python
from quicksilver import QuantizedEngine

# Load model
engine = QuantizedEngine("model.gguf")

# Generate text
tokens = engine.generate("Hello, world!", max_tokens=50)
print(engine.decode(tokens))
```
### CLI
```bash
# Generate text
quicksilver generate -m model.gguf -p "Once upon a time"

# Run benchmark
quicksilver benchmark -m model.gguf -n 100

# Show model info
quicksilver info -m model.gguf
```
## Architecture
```text
quicksilver/
├── core/                    # GGUF parsing and quantization
│   ├── parser.py            # Native GGUF file parser
│   ├── tensor.py            # Quantized tensor operations
│   └── quantization.py      # Quantization type definitions
├── csrc/                    # C++ kernels
│   └── fused_quantized.cpp  # Optimized transformer kernels
├── quantized_engine.py      # Main inference engine
└── cli.py                   # Command-line interface
```
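Native GGUF parsing starts from the file's fixed header: the `GGUF` magic, a uint32 version, then uint64 tensor and metadata key/value counts, all little-endian per the GGUF specification. A minimal sketch of reading it (`read_gguf_header` is an illustrative helper, not the library's actual API):

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed GGUF header: b'GGUF' magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count (little-endian)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensor_count": n_tensors, "kv_count": n_kv}
```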
## Supported Quantization Types
| Type | Bits/weight | Block Size | Status |
|---|---|---|---|
| Q4_0 | 4.0 | 32 | ✅ Native GEMV |
| Q5_0 | 5.0 | 32 | ✅ Native GEMV |
| Q8_0 | 8.0 | 32 | ✅ Native GEMV |
| Q2_K | 2.5 | 256 | ✅ Native GEMV |
| Q3_K | 3.4 | 256 | ✅ Native GEMV |
| Q4_K | 4.5 | 256 | ✅ AVX2 SIMD |
| Q5_K | 5.5 | 256 | ✅ Native GEMV |
| Q6_K | 6.5 | 256 | ✅ AVX2 SIMD |
| F16 | 16 | 1 | ✅ Native GEMV |
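To make the block layout concrete, here is an illustrative dequantizer for a single Q4_0 block using GGML's standard layout: a float16 scale followed by 16 bytes packing 32 4-bit weights, each decoded as `scale * (nibble - 8)`. This is a readability sketch, not the library's optimized kernel:

```python
import numpy as np

def dequantize_q4_0(block: bytes) -> np.ndarray:
    """Dequantize one 18-byte Q4_0 block into 32 float32 weights."""
    assert len(block) == 18  # 2-byte f16 scale + 16 packed bytes
    scale = float(np.frombuffer(block, dtype=np.float16, count=1)[0])
    packed = np.frombuffer(block, dtype=np.uint8, offset=2)
    lo = (packed & 0x0F).astype(np.int8) - 8  # weights 0..15 (low nibbles)
    hi = (packed >> 4).astype(np.int8) - 8    # weights 16..31 (high nibbles)
    return scale * np.concatenate([lo, hi]).astype(np.float32)
```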
## Roadmap
- [x] CPU inference with quantized kernels
- [x] Beat llama.cpp performance (84 vs 43 tok/s = +95%)
- [x] Support 9 quantization types (covers 95%+ of GGUF models)
- [x] Metal GPU backend shaders
- [x] CUDA GPU backend kernels
- [ ] OpenAI-compatible API server
- [ ] Streaming generation
- [ ] Batch inference
## Development
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format and lint code
black quicksilver/
ruff check quicksilver/
```
## License
Apache 2.0
## File details

### quicksilver_inference-0.1.0.tar.gz

- Size: 51.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | 88915085988410c6e3e2f86e0a6e07eda492b59e105efe0a1a8236c11ac36a73 |
| MD5 | 8ce2b95b2522a13284fd8cd1478cfc01 |
| BLAKE2b-256 | 9c2efb974ecb1bdd78d4c98d8f01d4c679dcf4bcef2acd4d539ffb9dba6a79b9 |
### quicksilver_inference-0.1.0-py3-none-any.whl

- Size: 61.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | 09f7bc97647b8ccfb36e75f3a20a98f176ee5be4ec2f87b1bbd05031030d26f0 |
| MD5 | de3a240c70ee5410c6b47004663183d5 |
| BLAKE2b-256 | f4657d2b36f8740cd6067cea9ded3d1eb53e99142a1755c7d4c1ad246864c361 |